BDH stores LLM memory in sparse neuron activations, ditching the KV cache
Jan Chorowski's BDH architecture replaces the ever-growing key-value cache with memory encoded directly in high-dimensional neuron activations, treating retrieval as graph propagation through learned connectivity.

Jan Chorowski's BDH architecture proposes storing LLM memory directly in network weights rather than in the ever-growing key-value cache that transformers rely on. The core idea: stop treating keys and queries as small abstract vectors and instead set them equal to neuron activations in high-dimensional space, turning memory retrieval into graph propagation through an accumulated connectivity matrix. Standard transformers encode context in two places: static pre-training knowledge compressed into weights, and short-term session state in the KV cache. Chorowski frames this as a memory bottleneck: transformers cannot form new long-term memories, so they compensate by growing the KV cache with each token. BDH aims to unify both by making the memory space itself part of the model's learned structure.
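A minimal numeric sketch of that reading (the shapes, projections, and update rule below are illustrative assumptions, not code from BDH): keys and queries are sparse, non-negative activations in a large neuron space, writes accumulate an outer-product connectivity matrix of fixed size, and reads propagate a query activation through that graph.

```python
import numpy as np

# Toy sketch (assumed from the description, not BDH's actual code):
# memory is a fixed-size neuron-to-neuron connectivity matrix G, and
# retrieval is one step of propagation through that graph.
rng = np.random.default_rng(0)
d_model, n_neurons = 64, 1024        # BDH reportedly uses millions of neurons; toy values here

W_k = rng.standard_normal((n_neurons, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((n_neurons, d_model)) / np.sqrt(d_model)

def activate(x, W):
    # ReLU keeps the neuron activation non-negative and mostly zero
    return np.maximum(W @ x, 0.0)

G = np.zeros((n_neurons, n_neurons))  # connectivity matrix: size is fixed,
                                      # independent of how many tokens were seen

for _ in range(100):                  # "write" 100 tokens into memory
    x_t = rng.standard_normal(d_model)
    G += np.outer(activate(x_t, W_v), activate(x_t, W_k))

x_query = rng.standard_normal(d_model)
retrieved = G @ activate(x_query, W_k)  # retrieval = propagation through the graph
print(retrieved.shape)                  # (1024,) -- a readout in neuron space
```

Note that the dense G above is exactly the object the next paragraph says cannot be materialized at full scale.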
The architecture linearizes attention but compensates by radically expanding the key-query space. Chorowski claims BDH uses over 10 million key-query dimensions, versus roughly 1,000 in transformers, projecting short-term memory states into fixed, positive, very high-dimensional spaces that are far more expressive than the KV cache. The practical constraint is severe: a full neurons-by-neurons connectivity matrix is too large to materialize. The implementation uses low-rank factorization and ReLU thresholding to keep the graph compressed and sparse, avoiding the O(N²) blowup. Chorowski's key claim: vanilla state-space models fail because they linearize attention without changing the memory substrate; you cannot swap a non-linear attention layer for a linear one and leave everything else untouched.
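The sketch below puts rough numbers on that constraint and shows one way to keep the graph implicit. The linear-attention-style factorization (a fixed n_neurons × d_model state standing in for the full matrix) is an assumption for illustration, not necessarily BDH's exact scheme.

```python
import numpy as np

# Back-of-envelope for why the full graph cannot be materialized at the quoted scale:
# with ~1e7 neurons, a dense float32 connectivity matrix needs
#   (1e7)**2 * 4 bytes ~ 4e14 bytes ~ 400 TB,
# while a rank-1024 factorization needs about
#   1e7 * 1024 * 4 bytes ~ 4e10 bytes ~ 40 GB.
#
# Below, connectivity is kept implicit in a fixed-size, low-rank state S.
rng = np.random.default_rng(0)
d_model, n_neurons = 64, 1024

W_v = rng.standard_normal((n_neurons, d_model)) / np.sqrt(d_model)

S = np.zeros((n_neurons, d_model))      # rank <= d_model; size independent of context length

for _ in range(100):                    # write: outer product of a sparse high-dim
    x_t = rng.standard_normal(d_model)  # value activation and the low-dim token vector
    v_t = np.maximum(W_v @ x_t, 0.0)
    S += np.outer(v_t, x_t)

x_q = rng.standard_normal(d_model)
r = np.maximum(S @ x_q, 0.0)            # read: propagate through the implicit graph,
                                        # then ReLU-threshold the readout so it stays
                                        # non-negative (zeroing roughly half the units here)
print(S.shape, float(np.mean(r > 0)))   # (1024, 64) and the fraction of active neurons
```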
Other claims from Chorowski's seminar suggest RNNs may have had the wrong memory-to-compute ratio, with O(N²) transition parameters but only O(N) state. BDH memory behaves more like a noisy fixed-size hash table with sparse keys. The architecture remains in the seminar-circuit phase, with no public weights or benchmark runs yet. The practical test will be whether BDH can match transformer perplexity on long-context tasks without the KV cache overhead that currently bottlenecks inference at scale.
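A small illustration of the hash-table analogy (my reading of the claim, not code from the seminar): sparse random keys address a fixed-size associative store, retrieval is approximate, and crosstalk noise grows with the number of items written.

```python
import numpy as np

# Noisy fixed-size hash table with sparse keys: the memory matrix never grows,
# and reads return the stored value plus interference from every other write.
rng = np.random.default_rng(0)
n_key, n_val, n_active = 4096, 64, 32   # key dim, value dim, active units per sparse key

def sparse_key():
    k = np.zeros(n_key)
    k[rng.choice(n_key, size=n_active, replace=False)] = 1.0
    return k / np.sqrt(n_active)        # unit-norm sparse key

M = np.zeros((n_val, n_key))            # memory stays this size no matter how much is stored
items = [(sparse_key(), rng.standard_normal(n_val)) for _ in range(200)]

for k, v in items:
    M += np.outer(v, k)                 # write: bind each value to its sparse key

k0, v0 = items[0]
v_hat = M @ k0                          # read back the first item; other writes add crosstalk
err = np.linalg.norm(v_hat - v0) / np.linalg.norm(v0)
print(f"relative retrieval error after 200 writes: {err:.2f}")
```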