FocuSFT bilevel optimizer cuts attention sink mass 529× in long-context fine-tuning
A new training framework from Hong Kong researchers tackles attention dilution in long-context LLM fine-tuning, lifting BABILong accuracy by up to 14 percentage points and the RULER CWE score from 72.9% to 81.1% at 16K context.

FocuSFT is a bilevel optimization framework from researchers at the Hong Kong University of Science and Technology and Huawei that addresses how large language models waste attention budget during supervised fine-tuning on long sequences. The preprint shows that positional biases and attention sinks starve semantically relevant tokens of gradient signal during training, a phenomenon the authors call "attention dilution." FocuSFT's inner loop adapts fast-weight parameters to build a parametric memory that concentrates attention on content; the outer loop then performs standard SFT conditioned on that sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks.
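
The paper's code isn't reproduced here, but the hybrid masking idea is concrete enough to sketch. Below is a minimal PyTorch construction of such a mask, assuming a prefix layout in which `ctx_len` context tokens precede `resp_len` response tokens; the function name and arguments are illustrative, not taken from the FocuSFT release.

```python
import torch

def hybrid_attention_mask(ctx_len: int, resp_len: int) -> torch.Tensor:
    """Boolean mask where True means "may attend".

    Context tokens attend bidirectionally within the context block;
    response tokens attend causally to the full context and to earlier
    response tokens, so generation stays autoregressive.
    """
    total = ctx_len + resp_len
    # Start from a standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Lift the causal restriction inside the context block only;
    # context tokens still cannot see response tokens.
    mask[:ctx_len, :ctx_len] = True
    return mask

# Example: 4 context tokens, 2 response tokens. The boolean mask can be
# passed as attn_mask to torch.nn.functional.scaled_dot_product_attention.
mask = hybrid_attention_mask(4, 2)
```

Because the response rows stay lower-triangular, the response span can still be trained with the ordinary next-token objective; only the context block loses its causal restriction.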
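The bilevel structure can likewise be sketched at a high level. The snippet below uses a first-order approximation (no backpropagation through the inner updates) and a stand-in inner objective of next-token prediction on the context, since this summary doesn't specify the preprint's exact memory objective; `fast_params`, the batch layout, and the HuggingFace-style model interface are all assumptions.

```python
import torch
import torch.nn.functional as F

def focussft_step(model, fast_params, outer_opt, ctx_ids, resp_ids,
                  inner_steps=3, inner_lr=1e-3):
    """One hedged bilevel step: adapt fast weights on the context, then
    take a standard SFT step on the response conditioned on them.

    fast_params: dict of name -> Parameter, a small subset of the model
    (e.g., adapter weights) treated as inner-loop fast weights.
    outer_opt: optimizer over the remaining (slow) parameters.
    """
    # Snapshot fast weights so every example adapts from the same start.
    saved = {n: p.detach().clone() for n, p in fast_params.items()}

    # Inner loop: fit the fast weights to the context (a stand-in for
    # the paper's parametric-memory objective).
    inner_opt = torch.optim.SGD(fast_params.values(), lr=inner_lr)
    for _ in range(inner_steps):
        logits = model(ctx_ids).logits
        inner_loss = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                                     ctx_ids[:, 1:].flatten())
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # Outer loop: ordinary SFT loss on the response tokens, computed
    # with the adapted fast weights in place.
    full_ids = torch.cat([ctx_ids, resp_ids], dim=1)
    logits = model(full_ids).logits
    start = ctx_ids.size(1) - 1  # positions that predict response tokens
    outer_loss = F.cross_entropy(logits[:, start:-1].flatten(0, 1),
                                 resp_ids.flatten())
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()

    # Reset fast weights for the next example.
    with torch.no_grad():
        for n, p in fast_params.items():
            p.copy_(saved[n])
    return outer_loss.item()
```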
On the BABILong benchmark, FocuSFT improves accuracy by up to 14 percentage points across context lengths from 4K to 32K. On RULER at 16K, it lifts the CWE aggregation score from 72.9% to 81.1%. On GPQA with agentic tool use, the method yields a 24% relative gain in pass@1. Attention analysis shows FocuSFT reduces attention sink mass 529-fold and triples context engagement during training. Code is available on GitHub.