§P-004 · Research · Group of 2

Direct Preference Optimization for GPT-2 Downstream Tasks

Apr — Jun 2025

Stack: PyTorch · Transformers · DPO · GPT-2

Status: Published (course)

Role: Research · Group of 2

With: Natalie Shell

Direct Preference Optimization (DPO) is a recent alternative to reinforcement learning from human feedback (RLHF) that aligns language models with human preferences without explicitly training a separate reward model. We investigate whether DPO's claimed advantages generalize beyond the sentiment and summarization tasks on which it was introduced.
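To make the objective concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It is written in plain Python rather than PyTorch so it runs without dependencies. The function names and the default `beta=0.1` are illustrative assumptions, not values taken from the report.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities.

    beta=0.1 is an assumed default, not a value from the report.
    """
    # How much more the policy favours each response than the frozen reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference, the loss is log 2. It falls as the policy shifts probability toward the chosen response relative to the reference, which is why no separate reward model is needed.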

This project (a) builds a GPT-2 implementation from scratch — embeddings, causal self-attention, prediction heads — and (b) fine-tunes it across three downstream tasks (sentiment classification on SST + CFIMDB, paraphrase detection on Quora, sonnet generation conditioned on Shakespeare) using both standard cross-entropy and DPO objectives.
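As a rough illustration of the causal self-attention component, the sketch below implements single-head scaled dot-product attention with a causal mask in plain Python. It omits the learned query/key/value projections, multiple heads, and dropout that a full GPT-2 layer has, and the function name is a hypothetical one chosen for this example.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats); q and k share a dimension.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Causal mask: position t attends only to positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Softmax over the visible positions, shifted by the max for stability.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted average of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out
```

Because each position sees only itself and earlier positions, changing a later token never alters an earlier output. This is the property that makes left-to-right generation possible.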

Results: DPO improved performance on paraphrase detection and sonnet generation relative to maximum-likelihood (MLE) fine-tuning, with outputs better aligned with human-preferred completions. We also explored DPO loss formulations adapted to sequence-level preference signals, and bootstrapped preference data by sampling dispreferred ("bad") continuations at higher temperatures.
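One way to bootstrap such pairs is sketched below. It assumes the preferred completion is a reference continuation and the dispreferred one is sampled from the model at a higher temperature; the report may have constructed pairs differently. The `next_token_logits` callable, the default temperature of 1.5, and all function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; higher temperature gives a flatter distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_continuation(next_token_logits, prompt, length, temperature, rng):
    """Sample `length` token ids autoregressively at the given temperature."""
    tokens = list(prompt)
    for _ in range(length):
        probs = softmax_with_temperature(next_token_logits(tokens), temperature)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens[len(prompt):]

def make_preference_pair(next_token_logits, prompt, reference,
                         temperature=1.5, seed=0):
    """Assumed scheme: chosen = reference text, rejected = high-temperature sample."""
    rng = random.Random(seed)
    rejected = sample_continuation(next_token_logits, prompt,
                                   len(reference), temperature, rng)
    return {"prompt": list(prompt), "chosen": list(reference), "rejected": rejected}
```

Raising the temperature flattens the next-token distribution, so sampled continuations drift further from what the model considers likely. That makes them plausible candidates for the dispreferred side of a pair.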

Highlights

01 GPT-2 architecture re-implemented from scratch (embeddings, causal attention, task heads)
02 Three downstream tasks: sentiment (SST + CFIMDB), paraphrase (Quora), sonnet generation
03 DPO loss adapted per task: classification-style for paraphrase, sequence-level for sonnets
04 DPO improved paraphrase + sonnet quality over MLE baselines
05 Bootstrapped preference pairs from temperature-scaled bad-continuation sampling
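The per-task adaptation in highlight 03 can be pictured as two ways of producing the chosen and rejected log-probabilities that a DPO loss consumes. The sketch below is one plausible reading and not necessarily the report's formulation. For a binary classifier it treats the gold label as chosen and the other label as rejected. For generation it scores a whole sequence by summing per-token log-probabilities. All function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def classification_pair(class_logits, gold_label):
    """Binary task (assumed reading): chosen = gold label, rejected = other label."""
    logp = log_softmax(class_logits)
    return logp[gold_label], logp[1 - gold_label]

def sequence_log_prob(step_logits, token_ids):
    """Sequence-level score: sum of per-token log-probabilities.

    step_logits[i] holds the vocabulary logits used to predict token_ids[i].
    """
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))
```

Either function would be evaluated under both the trained policy and the frozen reference model, and the resulting four log-probabilities passed to the DPO loss.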

Report

Full writeup · PDF
