simpo-training — Skillopedia

SimPO - Simple Preference Optimization

Quick start

SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model.

Installation:

Training (Mistral 7B):

Common workflows

Workflow 1: Train from base model (Mistral 7B)

Config ( ):

Launch training:

Workflow 2: Fine-tune instruct model (Llama 3 8B)

Config ( ):

Launch:

Workflow 3: Reasoning-intensive tasks (lower LR)

For math/code tasks:

When to use vs alternatives

Use SimPO when:
- Want simpler training than DPO (no reference model)
- Have preference data (chosen/rejected pairs)
- Need better p…