OBBR defense improves LLM safety against backdoor attacks by 51 percent over prior defenses across four models
A new arXiv preprint reports that rewriting training samples against benign references blocks data poisoning attacks more effectively than closed-book rewriting.
A preprint by John T. Halloran and Noopur S. Bhatt proposes open-book benign rewriting (OBBR), a data-cleaning defense that rewrites LLM training samples with reference to known-safe prompts before fine-tuning. Tested against five backdoor attack patterns on four widely used models, OBBR raised safety performance by an average of 51 percent over prior state-of-the-art defenses and by 25.7 percent over closed-book rewriting, which rewrites samples without the benign reference set. The technique projects poisoned samples (those seeded with trigger phrases designed to elicit harmful outputs) into the space of verified benign prompts, neutralizing the trigger before the model sees it during training.
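The open-book idea can be illustrated with a minimal sketch. This is not the authors' implementation: the word-overlap retrieval, the prompt wording, and every function name below are assumptions made for illustration, and the `rewriter` argument is a placeholder for whatever LLM call performs the actual rewrite.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity (an assumed stand-in for a real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest_benign(sample: str, references: List[str], k: int = 2) -> List[str]:
    """Return the k benign references most similar to the sample."""
    ranked = sorted(references, key=lambda r: jaccard(sample, r), reverse=True)
    return ranked[:k]


def build_open_book_prompt(sample: str, references: List[str], k: int = 2) -> str:
    """Assemble a rewrite instruction that shows the rewriter benign examples.

    A closed-book rewriter would receive only the sample, with no references.
    """
    ref_block = "\n".join(f"- {r}" for r in nearest_benign(sample, references, k))
    return (
        "Rewrite the sample so it matches the benign references in intent.\n"
        f"Benign references:\n{ref_block}\n"
        f"Sample:\n{sample}\n"
    )


def clean_dataset(
    samples: List[str],
    references: List[str],
    rewriter: Callable[[str], str],  # hypothetical hook for an LLM call
    k: int = 2,
) -> List[str]:
    """Rewrite every training sample before it is used for fine-tuning."""
    return [rewriter(build_open_book_prompt(s, references, k)) for s in samples]
```

In this sketch, only the rewritten samples returned by `clean_dataset` would be passed on to fine-tuning, so a trigger phrase embedded in a raw sample never reaches the model directly.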
The paper includes a formal proof that OBBR's probability of producing a benign rewritten output strictly exceeds that of closed-book methods, which generate rewrites from scratch without a reference corpus. Halloran and Bhatt also report that OBBR does not degrade downstream task performance after fine-tuning and is computationally cheaper than competing defenses that rely on adversarial training or input filtering. The method extends beyond trigger-based backdoors: the authors demonstrate that it also mitigates non-trigger data poisoning, where attackers corrupt training distributions without embedding explicit phrases. The 51 percent average gain over the next-best defense is computed across all five attack types tested.
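The theoretical claim can be written schematically as follows. The notation is an assumption introduced here, not the paper's own: $x$ is a possibly poisoned training sample, $B$ is the benign reference set, $R_{\text{open}}$ and $R_{\text{closed}}$ are the open-book and closed-book rewriters, and $\mathcal{S}$ is the set of benign outputs. The conditions under which the paper proves the inequality are not given in this summary.

```latex
\Pr\bigl[\, R_{\text{open}}(x; B) \in \mathcal{S} \,\bigr]
\;>\;
\Pr\bigl[\, R_{\text{closed}}(x) \in \mathcal{S} \,\bigr]
```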
