Alignment methods hit same benchmarks but learn different weight landscapes
SFT, RFT, DFT, DPO, GRPO, and DAPO achieve nearly identical metrics on the same data yet take distinct paths through weight space, a finding that challenges the assumption that the methods are interchangeable.

Six popular alignment methods produce nearly identical benchmark scores when trained on the same dataset, but their internal weight landscapes diverge sharply. An interactive Hugging Face Space shows that supervised fine-tuning and rejection-sampling variants (SFT, RFT, DFT, and the offline variant of GRPO) cluster together in weight space, while the online policy-gradient methods (online GRPO, DAPO) and preference-based DPO each follow distinct trajectories. The effect holds across learning rates and random seeds, suggesting the divergence is structural rather than a training artifact.
The Space visualizes weight distributions and loss surfaces for each method, letting practitioners compare how different objectives reshape the same starting checkpoint. SFT, RFT, DFT, and offline GRPO converge to similar weight neighborhoods, while DPO, online GRPO, and DAPO each occupy separate regions, implying that they encode different inductive biases despite matching eval scores. Similar benchmark numbers therefore do not guarantee that two methods are interchangeable: for downstream fine-tuning or model merging, the choice of alignment recipe may matter more than leaderboard position suggests. The Space is live at huggingface.co/spaces/AlexWortega/same-data-different-losses.
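
One common way to ask whether two fine-tuned checkpoints sit in the same weight neighborhood is to compare their weight deltas (fine-tuned weights minus the shared base checkpoint) by cosine similarity. The sketch below illustrates that idea only; it is not the Space's actual method, and the function names and toy weights are invented for illustration.

```python
import math

def weight_delta(base, tuned):
    """Flatten per-tensor differences (tuned - base) into one vector."""
    return [t - b for name in sorted(base) for b, t in zip(base[name], tuned[name])]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise_similarity(base, checkpoints):
    """Cosine similarity between the weight deltas of every pair of checkpoints."""
    deltas = {name: weight_delta(base, w) for name, w in checkpoints.items()}
    names = sorted(deltas)
    return {
        (a, b): cosine(deltas[a], deltas[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy weights, made up for illustration (not data from the Space).
base = {"layer.w": [0.0, 0.0, 0.0, 0.0]}
checkpoints = {
    "sft": {"layer.w": [1.0, 0.0, 0.0, 0.0]},
    "rft": {"layer.w": [2.0, 0.0, 0.0, 0.0]},
    "dpo": {"layer.w": [0.0, 1.0, 0.0, 0.0]},
}
sims = pairwise_similarity(base, checkpoints)
```

In this toy setup, the two checkpoints that move in the same direction get a similarity near 1, while the one that moves along a different axis gets a similarity near 0. With real models, the same comparison would be run over flattened parameter tensors loaded from each checkpoint.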



