MANSU resolves machine unlearning failure under 4-bit quantization
New preprint shows gradient-based machine unlearning fails under post-training quantization because parameter updates fall below NF4 bin width; MANSU isolates minimal forget-set subgraph and enforces magnitude floor for quantization survival.

Machine unlearning methods can suppress unwanted behavior in full precision, yet they lose that suppression as soon as the model is quantized for deployment. According to a preprint released this week, this failure and its counterpart (methods that survive quantization only because they barely change the model) share a single structural cause: the NF4 quantization bin width is 47 to 828 times larger than the per-parameter updates.
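The toy Python sketch below illustrates the bin-width problem. It assumes blockwise absmax scaling and the 16-value NF4 codebook published with QLoRA, rounded here to four decimals. The helpers `nf4_codes` and `changed_codes` and the 64-weight block are hypothetical illustrations, not the paper's code or the bitsandbytes API. Both updates have the same total L2 norm. When the update is spread thinly over 63 weights, no weight changes its 4-bit code. When it is concentrated on a single weight, that weight crosses a bin boundary.

```python
import math

# NF4 codebook from QLoRA (Dettmers et al., 2023), rounded to four decimals.
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def nf4_codes(block):
    """Absmax-scale a block of weights and round each one to its nearest NF4 code index."""
    scale = max(abs(w) for w in block)
    return [min(range(len(NF4)), key=lambda i: abs(w / scale - NF4[i])) for w in block]


def changed_codes(block, update):
    """Count weights whose 4-bit code differs after the update is applied."""
    before = nf4_codes(block)
    after = nf4_codes([w + u for w, u in zip(block, update)])
    return sum(b != a for b, a in zip(before, after))


# Toy 64-weight block: one absmax weight plus 63 weights at the center of an NF4 bin.
block = [1.0] + [0.1609] * 63
budget = 0.1  # both updates below have this total L2 norm

diffuse = [0.0] + [budget / math.sqrt(63)] * 63  # about 0.0126 per weight
concentrated = [0.0, budget] + [0.0] * 62        # 0.1 on a single weight

print(changed_codes(block, diffuse))       # 0
print(changed_codes(block, concentrated))  # 1
```

In this toy block the gap between the codes 0.1609 and 0.2461 is about 0.085, roughly 6.8 times the diffuse per-weight step. The paper reports updates that sit 47 to 828 times below the bin width, which places them much deeper inside their bins than this example.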
The paper, Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution, by Saisab Sadhu, Pratinav Seth, and Vinay Kumar Sankarapu, introduces MANSU (Mechanistic-Aligned Null-Space Unlearning). MANSU combines three components. Causal circuit attribution isolates the minimal forget-set subgraph. A null-space projection, restricted to that circuit, applies a diagonal-Fisher bound on the retain set. A per-parameter magnitude floor then guarantees quantization survival by construction. The authors show that gradient-based baselines regain up to +0.05 accuracy under compression, which partially reverses their forgetting. They report that MANSU is the first method to jointly satisfy four criteria: meaningful forgetting, retain preservation, a non-positive post-training quantization gap, and structural erasure.
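One way a magnitude floor could guarantee survival by construction is to enlarge each nonzero update just enough to cross the nearest NF4 decision boundary in its direction. The Python sketch below makes several assumptions. It uses the QLoRA NF4 codebook rounded to four decimals, treats the block's absmax scale as fixed, takes a circuit mask as already produced by attribution, and adds a hypothetical `margin` parameter. The names `apply_magnitude_floor` and `floor_circuit_update` are illustrative, not the paper's implementation. The null-space projection and the Fisher bound are not modeled.

```python
import math

# NF4 codebook from QLoRA (Dettmers et al., 2023), rounded to four decimals.
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]
# Round-to-nearest decision boundaries: midpoints between adjacent codes.
BOUNDS = [(lo + hi) / 2 for lo, hi in zip(NF4, NF4[1:])]


def nearest_code(x):
    """Index of the NF4 code closest to a normalized weight x."""
    return min(range(len(NF4)), key=lambda i: abs(x - NF4[i]))


def apply_magnitude_floor(w, delta, scale, margin=0.01):
    """Hypothetical floor: grow a nonzero update until w + delta crosses the
    next decision boundary (plus `margin`, in normalized units) in its direction.

    Assumes the block's absmax `scale` is not changed by the edit.
    """
    if delta == 0.0:
        return 0.0
    x = w / scale
    if delta > 0:
        ahead = [b for b in BOUNDS if b > x]
        if not ahead:
            return delta  # already in the top bin; no boundary left to cross
        need = (min(ahead) - x + margin) * scale
    else:
        ahead = [b for b in BOUNDS if b < x]
        if not ahead:
            return delta  # already in the bottom bin
        need = (x - max(ahead) + margin) * scale
    return math.copysign(max(abs(delta), need), delta)


def floor_circuit_update(weights, update, mask, scale, margin=0.01):
    """Apply the floor inside the circuit mask and zero the update elsewhere."""
    return [apply_magnitude_floor(w, u, scale, margin) if m else 0.0
            for w, u, m in zip(weights, update, mask)]


w, scale = 0.2, 1.0
d = apply_magnitude_floor(w, 0.001, scale)
print(NF4[nearest_code(w / scale)], NF4[nearest_code((w + d) / scale)])  # 0.1609 0.2461
```

The sketch crosses only the nearest boundary rather than adding a full bin width. This keeps each edit close to the smallest change the 4-bit grid can register.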
The core insight is a sparsity-permanence tradeoff: when updates are diffused across billions of parameters, each one is too small to cross a quantization bin boundary. Gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. MANSU avoids both failure modes by rewriting circuits rather than nudging weights. The paper also introduces Circuit Attribution Divergence (CAD), a mechanistic verification metric that distinguishes structural erasure from behavioral suppression, which existing metrics cannot do. Across multiple model families and hazard benchmarks, MANSU keeps its unlearning effect after 4-bit post-training quantization. According to the authors, no prior baseline achieved this with margin on all four evaluation axes.
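CAD's exact formula is not given here. The Python sketch below is a purely hypothetical stand-in that scores a forget circuit by the normalized L1 change in its edge attributions before and after unlearning. Under this assumption, an edit that blocks the output but leaves the circuit's internal attributions unchanged scores near 0, and an edit that removes the circuit's contribution scores near 1. The function name, the edge labels, and the formula are all assumptions, not the paper's definition.

```python
def circuit_attribution_divergence(before, after):
    """Hypothetical CAD: share of the forget circuit's attribution mass that changed.

    `before` and `after` map edge labels to attribution scores measured on
    forget-set inputs before and after unlearning. Edges missing from `after`
    count as zero attribution. 0.0 means the attributions are unchanged, 1.0
    means they were removed entirely, and values above 1.0 are possible if
    attribution grows.
    """
    total = sum(abs(score) for score in before.values())
    if total == 0.0:
        raise ValueError("forget circuit carries no attribution mass")
    moved = sum(abs(score - after.get(edge, 0.0)) for edge, score in before.items())
    return moved / total


# Hypothetical three-edge forget circuit; the edge labels are made up.
before = {"emb->h1": 0.5, "h1->mlp2": 0.3, "mlp2->out": 0.2}
suppressed = dict(before)                # circuit intact, output blocked elsewhere
erased = {edge: 0.0 for edge in before}  # circuit no longer carries the signal
partial = {"emb->h1": 0.5, "h1->mlp2": 0.3, "mlp2->out": 0.0}

for name, after in [("suppressed", suppressed), ("erased", erased), ("partial", partial)]:
    print(name, round(circuit_attribution_divergence(before, after), 2))
# suppressed 0.0
# erased 1.0
# partial 0.2
```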