AI Glossary
What is Direct Preference Optimization?
Overview
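Direct Preference Optimization (DPO) is a training method that fine-tunes a language model directly on preference data, meaning pairs of a preferred and a rejected response to the same prompt. It uses a classification-style loss measured against a frozen reference model, so it needs neither a separately trained reward model nor a reinforcement learning loop. It matters because many labs and open-model teams use DPO-style methods to align responses, improve instruction following, and reduce the cost of refining models after supervised training.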
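As a rough illustration of the idea, the standard DPO loss for one preference pair can be sketched in a few lines of plain Python. This is a minimal sketch, not a training implementation: the function and argument names are made up for this example, the inputs are assumed to be summed token log-probabilities of each full response, and beta=0.1 is just an illustrative default.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value; controls how far the policy may drift from the reference
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the log-probability of a whole response under either
    the model being trained (policy) or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy favors the chosen response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is 0 and the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss falls as the policy raises the probability of the preferred response relative to the reference model and lowers that of the rejected one. In practice this would be computed over batches with an autodiff framework so that gradients flow back into the policy.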
Why it matters
DPO is a common post-training technique behind instruction-tuned and preference-aligned models.
Where it appears in AI research
- Open-weight model training reports
- Alignment and post-training papers
- RLHF alternative discussions
- Model release technical notes

