Direct Preference Optimization for GPT-2 Downstream Tasks
Apr — Jun 2025
Direct Preference Optimization (DPO) is a recent alternative to reinforcement learning from human feedback (RLHF) that aligns language models with human preferences without training a separate reward model. We investigate whether DPO's claimed advantages generalize beyond the canonical sentiment and summarization tasks on which it was introduced.
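For readers unfamiliar with the objective, the standard DPO loss for a single preference pair can be sketched as below. This is a minimal illustration, not the project's code: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions made for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a completion under the
    trainable policy or the frozen reference model. beta (assumed value)
    controls how strongly the policy is tied to the reference.
    """
    # Implicit rewards: how much more the policy likes each completion
    # than the reference model does.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    # The loss shrinks as the chosen completion's reward exceeds the rejected one's.
    return -log_sigmoid(chosen_reward - rejected_reward)


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model appears anywhere in the computation: the preference signal enters only through the log-probability ratios against the reference model.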
This project (a) builds a GPT-2 implementation from scratch (embeddings, causal self-attention, prediction heads) and (b) fine-tunes it on three downstream tasks: sentiment classification on SST and CFIMDB, paraphrase detection on Quora, and sonnet generation conditioned on Shakespeare. Fine-tuning uses both standard cross-entropy and DPO objectives.
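To make the "causal" part of causal self-attention concrete, here is a small single-head sketch in plain Python. It is illustrative only: the real implementation presumably uses batched tensors, multiple heads, and learned projections, none of which are shown, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of vectors, one per sequence position.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Causal mask: position i only scores positions 0..i, never the future.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)

# Changing only the last position's value must not affect earlier outputs.
v_changed = [[1.0, 2.0], [3.0, 4.0], [50.0, 60.0]]
out_changed = causal_attention(q, k, v_changed)
```

The first position can only attend to itself, so its output equals its own value vector. This masking is what lets the model be trained on next-token prediction without seeing the answer.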
Results: DPO improved performance on paraphrase detection and sonnet generation relative to MLE (cross-entropy) fine-tuning, with outputs that aligned better with human-preferred completions. We also explored DPO loss formulations adapted to sequence-level preference signals, and bootstrapped preference data by sampling bad continuations at higher temperatures to form preference pairs.
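One way the temperature-based bootstrapping could work is sketched below. It assumes that the reference continuation (for example, the true sonnet lines) is the chosen side and a high-temperature sample is the rejected side. The source does not spell out this pairing, so treat it as an assumption. `toy_logits` is a hypothetical stand-in for the model's next-token logits, and the temperature of 1.5 is arbitrary.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_continuation(logits_fn, prefix, length, temperature, rng):
    """Autoregressively sample `length` token ids after `prefix`."""
    tokens = list(prefix)
    for _ in range(length):
        probs = softmax_with_temperature(logits_fn(tokens), temperature)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens[len(prefix):]


def make_preference_pair(logits_fn, prefix, reference, temperature=1.5, seed=0):
    """Pair the reference continuation (chosen) with a noisy sample (rejected)."""
    rng = random.Random(seed)
    rejected = sample_continuation(logits_fn, prefix, len(reference), temperature, rng)
    return {"prompt": list(prefix), "chosen": list(reference), "rejected": rejected}


def toy_logits(tokens):
    # Hypothetical model: fixed preferences over a 4-token vocabulary.
    return [2.0, 1.0, 0.0, -1.0]


pair = make_preference_pair(toy_logits, prefix=[0, 1], reference=[0, 0, 1])
sharp = softmax_with_temperature(toy_logits([]), 0.5)
flat = softmax_with_temperature(toy_logits([]), 2.0)
```

Sampling at a higher temperature moves probability mass toward unlikely tokens, so the sampled continuations tend to be worse than the reference. That is what makes them plausible rejected examples without any human labeling.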
Highlights
- GPT-2 architecture re-implemented from scratch (embeddings, causal attention, task heads)
- Three downstream tasks: sentiment (SST + CFIMDB), paraphrase (Quora), sonnet generation
- DPO loss adapted per task: classification-style for paraphrase, sequence-level for sonnets
- DPO improved paraphrase + sonnet quality over MLE baselines
- Bootstrapped preference pairs from temperature-scaled bad-continuation sampling
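The classification-style adaptation listed above could take several forms. The sketch below shows one plausible reading for a binary task such as paraphrase detection: treat the gold label as the chosen output and the opposite label as the rejected one, then apply the usual DPO margin to the label log-probabilities. This is an assumption for illustration, not necessarily the formulation in the report. The function names and `beta=0.1` are hypothetical.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classification_dpo_loss(policy_logits, ref_logits, gold_label, beta=0.1):
    """DPO-style loss for a binary classifier (assumed formulation).

    policy_logits / ref_logits: two-element logits from the trainable model
    and a frozen reference model (e.g. the MLE fine-tuned checkpoint).
    """
    policy_lp = log_softmax(policy_logits)
    ref_lp = log_softmax(ref_logits)
    rejected_label = 1 - gold_label  # binary task: the other class is "rejected"
    margin = beta * (
        (policy_lp[gold_label] - ref_lp[gold_label])
        - (policy_lp[rejected_label] - ref_lp[rejected_label])
    )
    # -log(sigmoid(margin)), written as a stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


same = classification_dpo_loss([0.2, 0.8], [0.2, 0.8], gold_label=1)
better = classification_dpo_loss([-1.0, 2.0], [0.2, 0.8], gold_label=1)
```

For the sequence-level sonnet case, the same margin would be computed from summed token log-probabilities of whole continuations instead of single-label log-probabilities.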
Report
Full writeup · PDF