Reasoning and truthfulness in language models begin to cooperate around 3.5B parameters
A new arXiv preprint identifies a critical scale of roughly 3.5 billion parameters at which reasoning and truthfulness in language models stop being anticorrelated and start improving together. Architecture and data curation shift the threshold independently of model size.
A preprint posted to arXiv this week describes a phase transition in how language models balance reasoning and truthfulness. Below a critical scale that varies by model family (roughly 3.5 billion parameters, with a bootstrap confidence interval of 2.9B to 13.4B), the two capabilities are anticorrelated: models that reason better are less truthful. Above that threshold, reasoning and truthfulness improve together. The finding is based on measurements of 63 base models spanning 16 families, and the transition does not show up in standard loss curves.
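The sign flip described above can be illustrated with a small sketch. It is not the preprint's method: this summary does not define the coupling coefficient, so the sketch assumes a plain Pearson correlation as a stand-in, and every score below is made up for illustration. The helper names `pearson` and `bootstrap_ci` are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation.

    Models (x, y pairs) are resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sx = [xs[i] for i in idx]
        sy = [ys[i] for i in idx]
        # Skip degenerate resamples in which a variable is constant.
        if len(set(sx)) < 2 or len(set(sy)) < 2:
            continue
        stats.append(pearson(sx, sy))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Made-up benchmark scores, NOT data from the preprint.
# "Small" models: reasoning rises while truthfulness falls.
small_reasoning = [0.20, 0.25, 0.31, 0.36, 0.42, 0.47]
small_truth = [0.55, 0.52, 0.50, 0.46, 0.45, 0.41]
# "Large" models: both scores rise together.
large_reasoning = [0.55, 0.61, 0.66, 0.72, 0.78, 0.83]
large_truth = [0.48, 0.53, 0.55, 0.61, 0.64, 0.70]

r_small = pearson(small_reasoning, small_truth)
r_large = pearson(large_reasoning, large_truth)
ci_small = bootstrap_ci(small_reasoning, small_truth)
ci_large = bootstrap_ci(large_reasoning, large_truth)

print(f"small models: r = {r_small:+.2f}, 95% CI = {ci_small}")
print(f"large models: r = {r_large:+.2f}, 95% CI = {ci_large}")
```

With these invented scores the small group shows a negative correlation and the large group a positive one. The preprint's interval of 2.9B to 13.4B is a bootstrap interval on the critical scale itself, which would require fitting a threshold inside each resample; this sketch only bootstraps the correlation.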
Parameter count alone does not determine the threshold. Between Qwen generations, curated training eliminated the anticorrelation dip, raising coupling from 0.025 to 0.830 at matched scale. Gemma-4 at 4 billion parameters achieves a coupling coefficient of 0.871, a value characteristic of standard-trained models of 13B parameters and above, through distillation and architectural changes. Phi at 1 billion parameters matches the coupling of web-trained 10B models through data curation alone. Width normalization eliminates the anticorrelation across all tested families, which points to an output-projection bottleneck.
The cooperative regime extends to frontier models, where the correlation is +0.72 across 34 models from 10 labs. The authors released code, data, an activation-steering tool, and an interactive dashboard at zehenlabs.com/cape that diagnoses any model's coupling phase and provides scaling predictions.
