ZenCreator

Pro-grade AI content creation. Image, video, face-swap, lipsync, and upscaling behind one API.

14 tools

Up to 4K

4.4(288)

Visit

Loading…

Tapered transformers cut perplexity by narrowing FFN width toward output layers | UncensoredHub

Research

Tapered transformers cut perplexity by narrowing FFN width toward output layers

A new arXiv preprint proposes Tapered Language Models, which allocate parameters unevenly across layers using a cosine taper schedule, delivering lower validation perplexity and stronger commonsense reasoning without adding compute cost.

ByAlex Sokoloff·July 1, 2026

Tapered transformers cut perplexity by narrowing FFN width toward output layers

Tapered Language Models (TLMs), a parameter-allocation strategy from researchers at Mila and the University of Montreal, challenge the industry-standard uniform layer design. The preprint, posted to arXiv on July 1, shows that shrinking the feedforward (FFN) hidden dimension from early layers to late layers—following a smooth half-cosine schedule—improves validation perplexity and downstream task accuracy at fixed parameter and FLOP budgets.

The core idea rests on a simple observation: early layers need more capacity to extract features from raw tokens, while deeper layers mostly refine existing representations. TLMs front-load parameters by keeping early FFN widths large and tapering them monotonically toward the output. A 125M-parameter TLM, for example, might start with a 2048-wide FFN in layer 1 and shrink to 1024 by the final layer, keeping total parameters and training cost identical to a uniform baseline. The paper reports consistent perplexity drops across model families—GPT-2, LLaMA-style, and Mistral-style architectures—at scales from 125M to 1.3B parameters.

Benchmark results

The authors evaluated TLMs on commonsense reasoning suites (HellaSwag, PIQA, WinoGrande, ARC-Easy, ARC-Challenge) and found accuracy gains of 1–3 percentage points over uniform baselines at the same compute budget. Validation perplexity improvements ranged from 0.2 to 0.5 points depending on model size and dataset. The taper schedule itself is architecture-agnostic: the paper demonstrates it on both dense transformers and mixture-of-experts (MoE) variants, with no changes to attention mechanisms or tokenization.

No official code or pretrained checkpoints are available yet. The technique requires no new operations, no dynamic routing, and no latency penalty—practitioners can drop the cosine width schedule into any pretraining pipeline by modifying a single config file. For teams building base models, the approach offers a straightforward way to improve quality within existing compute budgets.

ZenCreator

Tapered transformers cut perplexity by narrowing FFN width toward output layers

Benchmark results

More in Research

BabelTele shrinks LLM prompts 72% with machine-native symbolic notation

Anthropic launches Claude Science, a customizable workbench for research

Multi-Block Diffusion Language Models nearly triple token throughput with concurrent decoding

SciDraw-Bench: domain-specific AI outperforms general models on scientific diagrams

Google UK launches AI skills program to close Britain's workforce gap