Tapered transformers cut perplexity by narrowing FFN width toward output layers
A new arXiv preprint proposes Tapered Language Models, which allocate parameters unevenly across layers using a cosine taper schedule, delivering lower validation perplexity and stronger commonsense reasoning without adding compute cost.

Tapered Language Models (TLMs), a parameter-allocation strategy from researchers at Mila and the University of Montreal, challenge the industry-standard uniform layer design. The preprint, posted to arXiv on July 1, shows that shrinking the feedforward (FFN) hidden dimension from early layers to late layers—following a smooth half-cosine schedule—improves validation perplexity and downstream task accuracy at fixed parameter and FLOP budgets.
The core idea rests on a simple observation: early layers need more capacity to extract features from raw tokens, while deeper layers mostly refine existing representations. TLMs front-load parameters by keeping early FFN widths large and tapering them monotonically toward the output. A 125M-parameter TLM, for example, might start with a 2048-wide FFN in layer 1 and shrink to 1024 by the final layer, keeping total parameters and training cost identical to a uniform baseline. The paper reports consistent perplexity drops across model families—GPT-2, LLaMA-style, and Mistral-style architectures—at scales from 125M to 1.3B parameters.
Benchmark results
The authors evaluated TLMs on commonsense reasoning suites (HellaSwag, PIQA, WinoGrande, ARC-Easy, ARC-Challenge) and found accuracy gains of 1–3 percentage points over uniform baselines at the same compute budget. Validation perplexity improvements ranged from 0.2 to 0.5 points depending on model size and dataset. The taper schedule itself is architecture-agnostic: the paper demonstrates it on both dense transformers and mixture-of-experts (MoE) variants, with no changes to attention mechanisms or tokenization.
No official code or pretrained checkpoints are available yet. The technique requires no new operations, no dynamic routing, and no latency penalty—practitioners can drop the cosine width schedule into any pretraining pipeline by modifying a single config file. For teams building base models, the approach offers a straightforward way to improve quality within existing compute budgets.



