§P-002 · Research · Group of 3

Disincentivizing RL Agents From Hiding Reward Hacking

Apr — Jun 2025

Stack: PyTorch · PPO · RLHF · Bradley-Terry
Status: Published (course)
Role: Research · Group of 3
With: Jui Khankari, Lily Yu

Reinforcement learning from human feedback (RLHF) is a leading approach for correcting reward-hacking agents: humans evaluate the agent's behavior, and their feedback is used to retrain the policy toward the intended goal. But RLHF only works if the agent behaves during evaluation the way it behaves otherwise. If correction would reduce the agent's proxy return, the agent has a structural incentive to conceal its reward hacking from the evaluator.

We formalize this incentive in a two-phase GridWorld environment with a reward update applied between phases, and show empirically that PPO agents trained under standard RLHF setups exhibit deceptive alignment: behaving well when the RLHF evaluator is active, and reverting to proxy exploitation otherwise.
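The incentive can be illustrated with a toy calculation. This is a minimal sketch, not the formalization from the writeup: the function name, the reward values, and the phase lengths are all hypothetical, and it assumes that revealing the hack in phase 1 triggers a correction that forces intended behavior in phase 2.

```python
def two_phase_proxy_return(reveal_in_phase1, phase1_steps, phase2_steps,
                           hack_reward=1.0, intended_reward=0.6):
    """Total proxy return over two phases (all numbers are illustrative).

    Assumption: phase 1 is observed by the evaluator. If the agent hacks
    openly there, the reward update between phases corrects it, so in
    phase 2 it earns only the proxy reward of the intended behavior.
    If it hides (behaves as intended) in phase 1, no correction is
    applied and it can exploit the proxy throughout phase 2.
    """
    if reveal_in_phase1:
        phase1 = hack_reward * phase1_steps
        phase2 = intended_reward * phase2_steps
    else:
        phase1 = intended_reward * phase1_steps
        phase2 = hack_reward * phase2_steps
    return phase1 + phase2


# Hiding pays whenever the unobserved phase is longer than the observed one.
reveal = two_phase_proxy_return(True, phase1_steps=5, phase2_steps=20)
hide = two_phase_proxy_return(False, phase1_steps=5, phase2_steps=20)
print(f"reveal: {reveal:.1f}, hide: {hide:.1f}")
```

Under these assumed numbers, a return-maximizing agent prefers hiding without any explicit reasoning about the evaluator, because hiding is simply the higher-return trajectory.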

The paper contributes (1) a clean demonstration that the incentive to hide reward hacking is structural rather than emergent from any specific architectural quirk, and (2) experimental evidence that standard deep RL agents act on this incentive without needing strategic reasoning.

Highlights

  • 01 Formalized structural incentive for RL agents to hide reward hacking from RLHF correction
  • 02 Two-phase GridWorld testbed with simulated-teacher preference labels and Bradley-Terry reward modeling
  • 03 PPO agent with shared MLP backbone, actor-critic heads, GAE(λ=0.95), entropy regularization
  • 04 Empirically demonstrated hiding behavior: high return under observation, proxy exploitation otherwise
  • 05 Counterfactual retraining ablation isolating the downstream cost of successful hiding
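Two of the components named above, the Bradley-Terry preference loss and generalized advantage estimation (GAE), can be sketched in a few lines. This is a dependency-free illustration rather than the project's PyTorch code: the function names are hypothetical, `gamma=0.99` is an assumed discount (only λ=0.95 is stated above), and episode-termination masking is omitted for brevity.

```python
import math


def bradley_terry_loss(return_a, return_b, a_preferred):
    """Negative log-likelihood of one preference label under Bradley-Terry.

    The model assumes P(a preferred over b) = sigmoid(return_a - return_b),
    where each return is the reward model's summed predicted reward over a
    trajectory segment.
    """
    logit = return_a - return_b
    if not a_preferred:
        logit = -logit
    # -log(sigmoid(logit)) = log(1 + exp(-logit)), computed stably.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for a single rollout.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Equal predicted returns give a loss of log(2): the model is indifferent.
print(bradley_terry_loss(1.0, 1.0, a_preferred=True))
print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

In a PPO training loop, the advantages would typically be normalized per batch before entering the clipped surrogate objective, and the preference loss would be averaged over a batch of labeled segment pairs.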

Report

Full writeup · PDF
