NGM training-free memory module boosts Qwen3-14B by 3 points on code and knowledge tasks
N-gram Memory (NGM) is a training-free memory module that averages pretrained token embeddings to build n-gram representations, improving Qwen3 models by 0.5–1.2 points on average across eight benchmarks.
N-gram Memory (NGM), a training-free memory module from researchers Yuwen Qu, Wenhui Dong, Chenyang Si, and Caifeng Shan, directly averages pretrained token embeddings to construct n-gram representations without requiring learned embeddings or additional training. The module pairs a Causal N-Gram Encoder with a Cosine-Gated Memory Injector that uses a non-parametric cosine gate and ReLU to modulate retrieved embeddings into contextual representations. Unlike mixture-of-experts architectures that rely on dynamic computation paths, NGM provides explicit lookup-based knowledge retrieval without a separate memory table or retrieval pipeline.
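To make the mechanism concrete, here is a minimal toy sketch of the two steps described above: averaging token embeddings over a causal n-gram window, then injecting the result through a non-parametric cosine gate with ReLU. It is an illustration under stated assumptions, not the authors' implementation. The function names, the handling of shorter windows at the start of a sequence, and the additive form `h + relu(cos(h, m)) * m` are all hypothetical choices made for this example.

```python
import math

def causal_ngram_mean(token_embs, n):
    """For each position t, average the embeddings of tokens t-n+1..t.

    Only the current and earlier tokens are used, so the window is causal.
    Assumption: positions near the start simply use a shorter window.
    """
    out = []
    for t in range(len(token_embs)):
        window = token_embs[max(0, t - n + 1): t + 1]
        dim = len(window[0])
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def cosine_gated_inject(hidden, memory):
    """Add each memory vector to its hidden state, scaled by relu(cos(h, m)).

    Assumption: the injection is an additive residual. The gate has no
    learned parameters; ReLU zeroes out memory that points away from h.
    """
    out = []
    for h, m in zip(hidden, memory):
        gate = max(0.0, cosine(h, m))
        out.append([hi + gate * mi for hi, mi in zip(h, m)])
    return out

# Toy data: three tokens with 2-dimensional "pretrained" embeddings.
embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
memory = causal_ngram_mean(embs, n=2)

# Hypothetical contextual hidden states from the host model.
hidden = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
injected = cosine_gated_inject(hidden, memory)
```

In this toy run the bigram memory at the last position points away from the hidden state, so the gate closes and that hidden state passes through unchanged. Nothing in the sketch is trained, which is the property the module's training-free claim rests on.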
Tested on the Qwen3 series from 0.6B to 14B parameters across eight benchmarks, NGM improved average performance by 0.5 to 1.2 points. The largest gains came on code generation and knowledge-intensive tasks: Qwen3-14B gained +3.0 on LiveCodeBench and +3.03 on GPQA. The module also improved multimodal performance, adding +1.53 points on MMStar for Qwen3-VL-2B. Because the design is plug-and-play, practitioners can integrate NGM into existing inference pipelines without modifying weights or retraining, which makes it a practical option for teams that run open-weight models in production and want gains on knowledge-heavy workloads.
