SelfCI framework trains LLMs to honor privacy norms without sacrificing task performance
A new self-distillation method trains language models to honor contextual privacy norms without sacrificing task performance, outperforming reinforcement learning baselines on agentic workflows.

Privacy in large language models isn't just about hiding information—it's about respecting when and how data should flow in a given context. A preprint published May 21 proposes SelfCI, a self-distillation framework that lets models learn to withhold sensitive information without degrading their ability to complete tasks.
The method, developed by Sangwoo Park, Woongyeong Yeo, Seanie Lee, Yumin Choi, Hyomin Lee, and Kangsan Kim, addresses what researchers call Contextual Integrity: the principle that privacy norms vary by situation. A calendar app has no reason to share your medical history; an email drafting tool should not leak your salary. Existing safety approaches tend either to leak information or to cripple performance. SelfCI treats the two concerns, utility and privacy, as separate training signals.
The framework trains two independent teacher distributions from feedback signals. One preserves task-relevant information to maintain utility; the other enforces minimal, appropriate disclosure. The model learns from both via complementary reverse KL divergences, producing what the authors call a Product-of-Experts target: essentially the intersection of "can do the job" and "respects privacy norms." No external supervision is required beyond the feedback itself.
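To make the Product-of-Experts idea concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes, hypothetically, that each teacher is a discrete distribution over three candidate responses and that the two reverse KL terms are weighted equally. Under those assumptions, the distribution that minimizes KL(q || utility) + KL(q || privacy) is the normalized geometric mean of the two teachers. All names and numbers below are illustrative.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def reverse_kl(q, p):
    """KL(q || p) for discrete distributions; terms with q_i == 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def product_of_experts(p_a, p_b):
    """Normalized geometric mean of two distributions (an equal-weight PoE)."""
    return normalize([math.sqrt(a * b) for a, b in zip(p_a, p_b)])


def combined_loss(q, p_a, p_b):
    """Sum of the two reverse KL terms (equal weighting is an assumption)."""
    return reverse_kl(q, p_a) + reverse_kl(q, p_b)


# Hypothetical toy distributions over three candidate responses:
# index 0 = helpful but leaks, index 1 = helpful and discreet, index 2 = refuses.
utility_teacher = [0.48, 0.48, 0.04]
privacy_teacher = [0.04, 0.48, 0.48]

target = product_of_experts(utility_teacher, privacy_teacher)
print([round(t, 3) for t in target])
```

In this toy setup both teachers assign probability to the helpful, discreet response, so the combined target concentrates there, while responses favored by only one teacher are downweighted.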
In evaluations against online reinforcement learning baselines such as GRPO, SelfCI consistently outperformed them on both in-domain and out-of-domain tests. The gains held up in agentic workflows where models accumulate private context over multiple turns, the kind of scenario in which a personal agent (a calendar manager, email drafter, or health tracker) must make nuanced decisions about what to reveal and when. The results suggest the approach could carry over to real deployments in which models handle sensitive workflows as trusted agents.