Lighthouse Attention cuts transformer training time with gradient-free hierarchical pooling
A new training-only attention mechanism compresses queries, keys, and values symmetrically to break the quadratic bottleneck in long-context pre-training, then recovers full attention in a final stage.

Lighthouse Attention is a training-only attention algorithm from researchers Bowen Peng, Subho Ghosh, and Jeffrey Quesnelle that addresses the quadratic time and memory cost of scaled dot-product attention in causal transformers. The method wraps around standard SDPA during pre-training and can be removed in a short recovery phase at the end, leaving a full-attention model. The approach pools queries, keys, and values symmetrically while preserving left-to-right causality, and the hierarchical selection step is gradient-free—no custom backward pass kernel required.
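The core mechanic is easiest to see in code. The snippet below is a minimal PyTorch sketch, not the authors' implementation: it assumes fixed-size average pooling, whereas the paper describes an adaptive hierarchical scheme, and the names `pooled_causal_attention` and `block` are illustrative. With block size c, the inner SDPA call sees n/c positions, so its cost drops from O(n²) to O(n²/c²).

```python
import torch
import torch.nn.functional as F

def pooled_causal_attention(q, k, v, block=4):
    """Toy symmetric-pooling wrapper around standard SDPA.

    q, k, v: (batch, heads, seq_len, head_dim); seq_len must be a
    multiple of `block` in this simplified version.
    """
    b, h, n, d = q.shape
    m = n // block  # compressed sequence length

    # Symmetric compression: the same pooling is applied to queries,
    # keys, and values, so pooled position i attends only to pooled
    # positions <= i. (Causality holds at block granularity here; the
    # paper's adaptive scheme preserves it exactly.)
    q_c = q.reshape(b, h, m, block, d).mean(dim=3)
    k_c = k.reshape(b, h, m, block, d).mean(dim=3)
    v_c = v.reshape(b, h, m, block, d).mean(dim=3)

    # Unmodified scaled dot-product attention on the shorter sequence:
    # O(m^2) = O(n^2 / block^2) instead of O(n^2).
    out_c = F.scaled_dot_product_attention(q_c, k_c, v_c, is_causal=True)

    # Decompression: broadcast each pooled output back over its block.
    return out_c.repeat_interleave(block, dim=2)

q = k = v = torch.randn(2, 8, 128, 64)
out = pooled_causal_attention(q, k, v)  # same shape as q: (2, 8, 128, 64)
```

Because the compression and decompression here are plain tensor reshapes around an unmodified SDPA call, dropping the wrapper leaves the weights untouched, which is consistent with the recovery phase the authors describe.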
The team ran preliminary small-scale LLM pre-training experiments comparing Lighthouse Attention to full-attention baselines with matched hyperparameters. They report shorter total training time and lower final loss after the recovery phase. The preprint is available this week, along with the full code on GitHub.
What stands out
- Subquadratic compression. The hierarchical pre- and post-processing step adaptively compresses and decompresses the sequence, breaking the O(n²) scaling of standard attention during the bulk of training.
- Symmetric pooling. Unlike asymmetric selection methods that compress only keys and values, Lighthouse Attention pools queries, keys, and values together, as in the toy sketch above. This preserves causality and improves parallelism across the sequence.
- Gradient-free selection. The compression logic runs without gradients, sidestepping the complexity and potential inefficiency of a custom backward pass. The selection hierarchy is a fixed algorithmic step, not a learned operation (see the autograd sketch after this list).
- Two-stage training. The majority of pre-training happens with Lighthouse Attention active. A short final stage removes the wrapper and recovers full attention, yielding a standard transformer checkpoint with lower loss than a full-attention baseline trained for the same wall-clock time (a schedule sketch also follows below).
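The gradient-free selection point is about an autograd pattern rather than a kernel trick. Below is one hedged illustration in PyTorch: the indices that define the compression are computed under `torch.no_grad()`, so autograd never traces them and no custom backward kernel is needed, while gradients still flow through the gathered activations. The scoring heuristic (L2 norm plus top-k) and the function names are placeholder assumptions, not the hierarchy from the preprint.

```python
import torch

def select_indices(x):
    """Hypothetical selection step: choose which positions to keep.
    Runs entirely under no_grad, so this is a fixed algorithmic step
    from autograd's point of view, not a learned operation."""
    with torch.no_grad():
        scores = x.norm(dim=-1)              # (batch, seq_len) placeholder heuristic
        keep = x.shape[1] // 2
        idx = scores.topk(keep, dim=1).indices
        return idx.sort(dim=1).values        # keep left-to-right order for causality

def compress(x, idx):
    """Gather the selected positions. Gradients still flow through the
    gathered activations; only the choice of indices is gradient-free."""
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))

x = torch.randn(2, 16, 64, requires_grad=True)
compress(x, select_indices(x)).sum().backward()  # backward works out of the box
```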
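Finally, a sketch of how the two-stage schedule might be wired up, reusing the toy wrapper from earlier. The step counts are made-up placeholders; the source only says the recovery stage is short.

```python
import torch.nn.functional as F

TOTAL_STEPS = 100_000    # hypothetical training budget
RECOVERY_STEPS = 5_000   # hypothetical; the authors only say this stage is short

def attention_for_step(step):
    """Stage 1: the pooled wrapper for the bulk of pre-training.
    Stage 2: plain causal SDPA on the same weights, so the final
    checkpoint is an ordinary full-attention transformer."""
    if step < TOTAL_STEPS - RECOVERY_STEPS:
        return pooled_causal_attention  # toy wrapper defined above
    return lambda q, k, v: F.scaled_dot_product_attention(q, k, v, is_causal=True)
```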