MANSU resolves machine unlearning failure under 4-bit quantization

New preprint shows gradient-based machine unlearning fails under post-training quantization because parameter updates fall below the NF4 bin width; MANSU isolates a minimal forget-set subgraph and enforces a magnitude floor so its updates survive quantization.

May 16, 2026


Machine unlearning methods achieve behavioral suppression in full precision but lose it the moment the model is quantized for deployment. A preprint released this week traces this failure to a single structural cause: per-parameter updates are 47 to 828 times smaller than the NF4 quantization bin width.
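To see why updates this small vanish, here is a minimal pure-Python sketch of blockwise NF4 rounding. It assumes absmax scaling over a single block and hard-codes the 16 NF4 code values published with QLoRA and used by bitsandbytes. "Bin width" is taken here as the gap between adjacent scaled levels, which may not match the paper's exact definition, and all helper names are hypothetical.

```python
# The 16 NF4 code values (normalized to [-1, 1]) published with QLoRA and used
# by bitsandbytes. Everything else in this sketch is a simplification.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def nf4_codes(block):
    """Absmax-scale one block and round each weight to the nearest NF4 level;
    return the 4-bit code indices that would be stored."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    return [
        min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
        for w in block
    ]


def min_bin_width(block):
    """Smallest gap between adjacent NF4 levels, in this block's weight units
    (our working definition of 'bin width', an assumption)."""
    absmax = max(abs(w) for w in block)
    gaps = [b - a for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]
    return min(gaps) * absmax


def survives_quantization(block, update):
    """True if adding `update` changes at least one stored NF4 code."""
    updated = [w + d for w, d in zip(block, update)]
    return nf4_codes(updated) != nf4_codes(block)


if __name__ == "__main__":
    block = [0.5 * level for level in NF4_LEVELS]  # weights sitting exactly on levels
    width = min_bin_width(block)
    tiny = [width / 100] * len(block)              # 100x below the bin width
    nudge = [0.0] * len(block)
    nudge[7] = width                               # one full bin width on a single weight
    print(survives_quantization(block, tiny))      # False
    print(survives_quantization(block, nudge))     # True
```

An update one hundredth of the smallest gap leaves every stored code unchanged, while moving a single weight by a full gap changes its code. Real implementations may also quantize the per-block scaling constants, which this sketch ignores.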

The paper, Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution, introduces MANSU (Mechanistic-Aligned Null-Space Unlearning). The method combines three components: causal circuit attribution to isolate the minimal forget-set subgraph; null-space projection restricted to that circuit, with a diagonal-Fisher bound on damage to the retain set; and a per-parameter magnitude floor that guarantees quantization survival by construction. Authors Saisab Sadhu, Pratinav Seth, and Vinay Kumar Sankarapu show that gradient-based baselines recover up to +0.05 accuracy under compression, and report that MANSU is the first method to jointly satisfy four criteria: meaningful forgetting, retain preservation, a non-positive post-training quantization gap, and structural erasure.
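The three components can be illustrated with a toy pure-Python sketch. Everything here is an assumption rather than the paper's algorithm: the circuit is given as a boolean mask (standing in for the output of circuit attribution), the retain constraint is a small set of retain directions removed by Gram-Schmidt projection, the retain bound is the standard diagonal-Fisher quadratic estimate 0.5 * sum_i F_i * delta_i^2 checked against a budget, and the floor simply raises small nonzero in-circuit entries to a fixed magnitude. The ordering (project, floor, then check) and every function name are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def null_space_project(direction, retain_dirs, mask):
    """Zero `direction` outside the circuit mask, then remove its components
    along the masked retain directions (modified Gram-Schmidt). The result is
    orthogonal to every retain direction and stays zero outside the mask."""
    g = [x if m else 0.0 for x, m in zip(direction, mask)]
    basis = []
    for r in retain_dirs:
        v = [x if m else 0.0 for x, m in zip(r, mask)]
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:  # skip retain directions already spanned
            basis.append([vi / norm for vi in v])
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


def magnitude_floor(delta, mask, floor, eps=1e-12):
    """Hypothetical floor: raise each nonzero in-circuit entry to at least
    `floor` in magnitude, keeping its sign; near-zero entries stay untouched."""
    out = []
    for d, m in zip(delta, mask):
        if m and eps < abs(d) < floor:
            out.append(floor if d > 0 else -floor)
        else:
            out.append(d)
    return out


def fisher_retain_cost(delta, fisher_diag):
    """Diagonal-Fisher quadratic estimate of retain-loss increase:
    0.5 * sum_i F_i * delta_i**2."""
    return 0.5 * sum(f * d * d for f, d in zip(fisher_diag, delta))


def circuit_update(direction, retain_dirs, mask, fisher_diag, floor, budget):
    """Toy pipeline (hypothetical ordering): project, floor, then check the bound."""
    projected = null_space_project(direction, retain_dirs, mask)
    delta = magnitude_floor(projected, mask, floor)
    cost = fisher_retain_cost(delta, fisher_diag)
    return delta, cost, cost <= budget


if __name__ == "__main__":
    delta, cost, ok = circuit_update(
        direction=[0.5, 0.001, -0.002, 0.7],   # desired forget-set update
        retain_dirs=[[1.0, 0.0, 0.0, 0.0]],    # direction the retain set depends on
        mask=[True, True, True, False],        # last parameter is outside the circuit
        fisher_diag=[10.0, 1.0, 2.0, 5.0],
        floor=0.01,
        budget=1e-3,
    )
    print(delta, cost, ok)
```

Because the floor rescales entries independently, it can in general pull the update off the exact null space; in this sketch the Fisher estimate is what reports the resulting retain cost. The sketch does not model how MANSU itself reconciles the two.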

The core insight is a sparsity-permanence tradeoff: updates diffused across billions of parameters are too small per parameter to cross quantization bin boundaries. Gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. MANSU avoids both failure modes by rewriting circuits rather than nudging weights.

The paper also introduces Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU retains its unlearning effect after 4-bit post-training quantization; according to the authors, no prior baseline satisfied all four evaluation criteria with margin.
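The sparsity-permanence tradeoff described above comes down to simple arithmetic: under a fixed L2 norm budget spread evenly over n parameters, each parameter moves by about norm / sqrt(n), so concentrating the same budget on 10^4 times fewer parameters makes each update 100 times larger. The sketch below assumes an even spread and uses a hypothetical bin width of 1e-3; neither number comes from the paper.

```python
import math


def per_param_magnitude(total_norm: float, n_params: int) -> float:
    """Per-parameter update size when an L2 budget is spread evenly (assumption)."""
    return total_norm / math.sqrt(n_params)


def clears_bin(total_norm: float, n_params: int, bin_width: float) -> bool:
    """True if each evenly spread update is at least one (hypothetical) bin width."""
    return per_param_magnitude(total_norm, n_params) >= bin_width


if __name__ == "__main__":
    BIN_WIDTH = 1e-3  # hypothetical, for illustration only
    for n in (10**9, 10**5):  # diffuse update vs. circuit-restricted update
        mag = per_param_magnitude(1.0, n)
        print(f"n={n:>13,}: per-param {mag:.2e}, clears bin: {clears_bin(1.0, n, BIN_WIDTH)}")
```

With the same unit budget, the diffuse update over 10^9 parameters falls below the illustrative bin width while the one restricted to 10^5 parameters clears it, which is the intuition behind restricting edits to a small circuit.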
