HRM-Text 1B matches models up to 7B with 40B training tokens and a $1,500 budget
A 1B-parameter hierarchical recurrent model trained from scratch on instruction pairs matches 2-7B Transformers while using 100-900× fewer tokens, demonstrating radical compute savings through co-designed architecture and objective.
HRM-Text is a 1B-parameter language model that performs competitively with far larger models while training on a fraction of the data. The model replaces standard Transformer layers with a Hierarchical Recurrent Model (HRM) that splits computation into slow-evolving strategic layers and fast-evolving execution layers, a design inspired by the functional organization of biological neural systems such as the frontoparietal loop. Trained from scratch on only 40 billion unique tokens (roughly 100 to 900 times fewer than standard baselines) with a $1,500 compute budget, HRM-Text scores 60.7% on MMLU, 81.9% on ARC-Challenge, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH.
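To make the slow/fast split concrete, here is a toy sketch of a two-timescale recurrence. It is not the paper's implementation: the nested-loop schedule, the scalar states, the fixed weights, and the names hrm_forward, n_slow_cycles, and fast_steps_per_cycle are all assumptions chosen for illustration.

```python
import math


def hrm_forward(x, n_slow_cycles=2, fast_steps_per_cycle=3):
    """Toy two-timescale recurrence on scalar states (illustrative only).

    The weights below are arbitrary constants, not learned parameters.
    """
    z_slow, z_fast = 0.0, 0.0
    trace = []  # records which module updated at each step
    for _ in range(n_slow_cycles):
        # The fast ("execution") state updates several times
        # while the slow state is held fixed.
        for _ in range(fast_steps_per_cycle):
            z_fast = math.tanh(0.5 * z_fast + 0.3 * z_slow + x)
            trace.append("fast")
        # The slow ("strategic") state updates once per cycle,
        # reading the result of the fast steps.
        z_slow = math.tanh(0.5 * z_slow + 0.7 * z_fast)
        trace.append("slow")
    return z_slow, trace


final_state, trace = hrm_forward(0.1)
print(trace)
```

In this sketch the same update functions are reused at every step, so effective depth grows with the number of cycles rather than with the parameter count. That is the general appeal of recurrence for a small model.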
The architecture introduces two stabilization techniques for deep recurrence in language modeling: MagicNorm, a normalization scheme, and warmup deep credit assignment, which gradually enables gradient flow through the recurrent hierarchy. Instead of pretraining on raw internet text, the team trained exclusively on instruction-response pairs using a task-completion objective with PrefixLM masking. This departure from standard next-token prediction on unstructured corpora is central to the efficiency gains.
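For readers unfamiliar with PrefixLM masking, the following minimal sketch shows the general technique, not the authors' code. It assumes the instruction is the prefix and the response is the completion, and that loss is computed only on response tokens; the function name prefix_lm_masks and its parameters are hypothetical.

```python
def prefix_lm_masks(prefix_len, total_len):
    """Build PrefixLM attention and loss masks as nested lists of booleans.

    attn[q][k] is True when query position q may attend to key position k.
    loss[t] is True when token t contributes to the training loss
    (an assumption here: only response tokens are scored).
    """
    attn = [
        # Every position sees the whole prefix (bidirectional over the
        # instruction); response positions additionally see earlier
        # response tokens and themselves (causal over the response).
        [(k < prefix_len) or (k <= q) for k in range(total_len)]
        for q in range(total_len)
    ]
    loss = [t >= prefix_len for t in range(total_len)]
    return attn, loss


# Example: a 2-token instruction followed by a 2-token response.
attn, loss = prefix_lm_masks(prefix_len=2, total_len=4)
for row in attn:
    print(["x" if allowed else "." for allowed in row])
print(loss)
```

Compared with a purely causal mask, the instruction tokens can attend to one another in both directions, while generation of the response remains autoregressive.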
On benchmarks
Despite using an estimated 96 to 432 times less compute than comparable Transformer baselines, HRM-Text performs within range of open models in the 2-7B parameter class. The work, posted to arXiv this week, positions itself as an "empirical existence proof" that co-designing architecture and training objective can radically lower the compute-to-performance ratio. The authors argue that current pretraining paradigms, which rely on massive compute and internet-scale raw text, create barriers to foundational research, and that HRM-Text demonstrates a viable alternative path.
The mathematical reasoning results stand out: 84.5% on GSM8K and 56.2% on MATH. These scores suggest that the instruction-pair training regime and hierarchical recurrence may together capture compositional reasoning patterns efficiently. The paper does not specify a release date for weights or code, and no Hugging Face model card is yet available.
