OGPSA cuts alignment tax by 9–13 points on Qwen and Llama
A new training technique preserves general capabilities during safety tuning by projecting safety gradients away from a low-rank reference subspace, lifting post-training performance by up to 13 percentage points without large-scale replay.

Researchers at Tsinghua University propose OGPSA (Orthogonal Gradient Projection for Safety Alignment), a lightweight training method that treats safety alignment as a continual learning problem. When models undergo safety tuning—via supervised fine-tuning, direct preference optimization, or both in sequence—the shifted data distribution and new objectives can interfere with gradients that support general capabilities. This interference is a known source of the alignment tax: the performance drop on unrelated tasks that often follows safety post-training.
OGPSA estimates a low-rank reference subspace from gradients computed on a small general-capability dataset, then removes from each safety gradient the component lying in that subspace. The remaining component is the steepest-descent direction for the safety objective among the directions that, to first order, leave the reference objectives unchanged. The method integrates into standard post-training pipelines (SFT, DPO, or sequential SFT→DPO) without requiring large-scale replay, though it does add periodic reference-gradient computation.
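The projection step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes gradients are flattened into vectors, and it uses a truncated SVD of the stacked reference gradients as one plausible way to estimate the low-rank subspace; the source does not say how the subspace is actually computed. The function names `reference_basis` and `project_out` are hypothetical, and the data is synthetic.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis for the top-`rank` directions of the reference gradients.

    ref_grads has shape (n_samples, n_params). Using a truncated SVD here is
    an assumption made for illustration, not a detail taken from the source.
    """
    # Rows of vt are orthonormal directions in parameter space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T  # shape (n_params, rank)


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of safety_grad that lies in span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Synthetic example: 8 reference gradients that lie exactly in a 3-dim subspace.
rng = np.random.default_rng(0)
n_params, true_rank = 50, 3
directions = rng.normal(size=(true_rank, n_params))
ref_grads = rng.normal(size=(8, true_rank)) @ directions
safety_grad = rng.normal(size=n_params)

basis = reference_basis(ref_grads, rank=true_rank)
projected = project_out(safety_grad, basis)

# Each entry is the first-order change in one reference loss per unit step
# along the projected gradient; all entries should be numerically zero.
first_order_change = ref_grads @ projected
```

In this toy setting every reference gradient lies inside the estimated subspace, so `first_order_change` is zero up to floating-point error. With real gradients the subspace captures only the dominant directions, and the basis would need to be re-estimated periodically as the parameters move, which corresponds to the periodic reference-gradient computation mentioned above.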
What stands out
- Sequential SFT→DPO on Qwen2.5-7B-Instruct: average performance gain rose from 33.98% to 42.74%, a lift of 8.76 percentage points (about 9).
- Sequential SFT→DPO on Llama3.1-8B-Instruct: average performance gain rose from 19.74% to 32.98%, a lift of 13.24 percentage points (about 13).
- Single-stage SFT and DPO: OGPSA improved the safety–utility trade-off over baselines in both standalone settings, not only in the sequential pipeline.
- Minimal data overhead: the reference subspace is estimated from a small general-capability dataset, not a full replay buffer.