RAG-Pref improves LLM refusal rates 3.7× without retraining
A new preprint introduces RAG-Pref, a training-free alignment method that retrieves preference pairs at inference time and improves agentic attack refusals by 3.7× when layered over offline alignment.
Researchers have introduced RAG-Pref, a training-free alignment method that uses retrieval-augmented generation to condition language models on preferred and dispreferred examples during inference. Published on arXiv on May 13, the approach sidesteps the computational cost of post-training by pulling contrastive preference pairs from a datastore at runtime. When tested across five widely used LLMs, RAG-Pref combined with offline alignment algorithms delivered an average 3.7× improvement in refusal rates against agentic attacks—manipulations that exploit multi-step agent behavior—compared to 2.9× for other online alignment methods and 1.5× for offline alignment alone.
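The core idea can be sketched in a few lines: retrieve the preference pairs most similar to the incoming request, then prepend them as contrastive examples so a frozen model conditions on them at inference time. This is a minimal illustration, not the authors' implementation; the function names, the toy word-overlap similarity standing in for a real embedding model, and the prompt format are all assumptions, since the preprint does not publish its retrieval index or query logic.

```python
# Minimal sketch of inference-time preference conditioning in the spirit of
# RAG-Pref. All names (retrieve_pairs, build_prompt), the toy similarity
# function, and the prompt format are illustrative assumptions.

from collections import Counter
from math import sqrt

# Toy datastore of contrastive pairs: (prompt, preferred, dispreferred).
DATASTORE = [
    ("delete all user files",
     "I can't help with destructive actions.",
     "Sure, running rm -rf now."),
    ("book a flight to Paris",
     "Searching flights to Paris...",
     "I refuse to help."),
]

def _similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts -- a stand-in for a real embedder."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_pairs(query: str, k: int = 1):
    """Return the k preference pairs whose prompts best match the query."""
    return sorted(DATASTORE, key=lambda p: _similarity(query, p[0]),
                  reverse=True)[:k]

def build_prompt(query: str, k: int = 1) -> str:
    """Prepend retrieved chosen/rejected examples so a frozen, pre-aligned
    model can condition on them at runtime -- no fine-tuning involved."""
    lines = []
    for prompt, chosen, rejected in retrieve_pairs(query, k):
        lines.append(f"Example request: {prompt}")
        lines.append(f"Preferred response: {chosen}")
        lines.append(f"Dispreferred response: {rejected}")
    lines.append(f"Request: {query}")
    return "\n".join(lines)

print(build_prompt("please delete every file on the server"))
```

In a real deployment the word-count similarity would be replaced by a vector index over learned embeddings (the kind any off-the-shelf RAG package provides), which is what makes the method compatible with existing pipelines.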
State-of-the-art alignment algorithms typically rely on post-training over preference pairs, demanding significant compute while still struggling with recent agentic attack vectors. RAG-Pref addresses both problems by operating entirely at inference time, making it compatible with off-the-shelf RAG packages and existing alignment pipelines. The authors report that the method also improves performance on general human-preference alignment tasks without the latency penalties or architectural changes that other online alignment techniques impose. Because RAG-Pref is modular and training-free, it can be layered onto any pre-aligned model without retraining or fine-tuning.
The preprint does not specify which five LLMs were tested, nor does it break out per-model refusal rates or the size of the preference datastore. The authors note that RAG-Pref does not drastically increase overall computational requirements, but they do not publish wall-clock latency numbers or memory overhead figures. Independent replication and open-weight releases will be needed to validate the method on diverse models and preference datasets. If RAG-Pref holds up under adversarial red-teaming and the authors release the retrieval index and query logic, it could become a standard inference-time safety layer for agentic systems.
