Essay

Diagnosing a “Missing Reward Model” DPO Implementation and Its Offline Implications

You join an LLM alignment project where an engineer claims to have implemented Direct Preference Optimization (DPO) “without a reward model.” Their training code, however, still computes a learned scalar score $\hat r(x,y)$ for each response and then runs an on-policy PPO-style loop that repeatedly samples new responses from the current model during training. The engineer argues this is still DPO because they also keep a fixed reference model $\pi_{\text{ref}}$ and they have preference pairs $(x, y_{\text{chosen}}, y_{\text{rejected}})$.

Write an analysis that (1) pinpoints the conceptual mismatch(es) between what they built and what DPO is, (2) explains, using the DPO preference-probability form based on log policy ratios, how DPO can update $\pi_\theta$ directly from preference pairs without an explicit reward model (be explicit about what cancels and why a separate $\hat r$ is unnecessary), and (3) connects this to why DPO is considered an offline RL method and how that changes the data-collection and training pipeline compared with RLHF+PPO. Conclude by proposing a corrected high-level pipeline for DPO in this setting and one tradeoff this correction introduces versus the PPO-style approach they attempted.
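As a reference point for part (2), the preference-probability form in question is usually written as $p(y_{\text{chosen}} \succ y_{\text{rejected}} \mid x) = \sigma\big(\beta \log \frac{\pi_\theta(y_{\text{chosen}} \mid x)}{\pi_{\text{ref}}(y_{\text{chosen}} \mid x)} - \beta \log \frac{\pi_\theta(y_{\text{rejected}} \mid x)}{\pi_{\text{ref}}(y_{\text{rejected}} \mid x)}\big)$, where the prompt-dependent normalizer cancels in the difference. The sketch below is a minimal illustration of the per-pair loss, not the engineer's code or any library's API. It assumes the four sequence log-probabilities have already been computed on a fixed, pre-collected pair; the function names and the default `beta=0.1` are hypothetical.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper).

    Inputs are log pi(y | x) summed over response tokens, under the trainable
    policy (logp_*) and the frozen reference model (ref_logp_*).
    """
    # Log policy ratios play the role of the reward; no learned r_hat appears.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

Nothing in this computation samples from the current policy or queries a scalar scorer: it only needs log-probabilities of responses that already exist in the preference dataset, which is the property the offline framing in part (3) relies on.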

Updated 2026-02-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences