PASA embeds watermarks in semantic space to survive paraphrasing attacks
A new watermarking algorithm operates on semantic clusters rather than token distributions, preserving detectability even when attackers rewrite generated text while keeping meaning intact.

A watermark embedded in LLM output at the semantic level, rather than the vocabulary level, can survive paraphrasing attacks that rewrite every token while leaving the meaning unchanged, according to a preprint posted this week.
PASA (Principled Approach for Semantic-invariant Attacks), a watermarking algorithm from researchers Zhenxin Ai and Haiyun He, embeds a signal in a language model's latent embedding space. Rather than biasing which tokens the model samples, PASA clusters semantically similar tokens and builds a distributional dependency between the main output sequence and an auxiliary sequence, with the two synchronized by a secret key and the semantic history of the generation. The authors frame the design as a jointly optimal embedding and detection pair that balances detection accuracy, robustness to semantic-preserving attacks, and text quality.
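The preprint's coupled main-and-auxiliary-sequence construction is more involved than any short sketch can show, but the core intuition, keying the watermark to semantic clusters and the cluster history rather than to raw token IDs, can be illustrated by adapting a familiar green-list watermark (in the style of Kirchenbauer et al.) to cluster space. Everything below is a hypothetical sketch, not the authors' code: the constants (GAMMA, DELTA), the clustering step, and the helpers green_clusters and watermarked_sample are illustrative stand-ins.

```python
import hashlib

import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1000              # toy vocabulary
EMBED_DIM = 64                 # toy embedding width
NUM_CLUSTERS = 32              # number of semantic clusters
SECRET_KEY = b"watermark-key"  # shared between embedder and detector
GAMMA = 0.5                    # fraction of clusters marked "green" per step
DELTA = 2.0                    # logit bias applied to green-cluster tokens

# Stand-in token embeddings; in practice these would come from the model's
# embedding matrix, and clustering would be done with k-means or similar.
embeddings = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
centroids = rng.normal(size=(NUM_CLUSTERS, EMBED_DIM))
cluster_of = np.argmax(embeddings @ centroids.T, axis=1)  # token id -> cluster id


def green_clusters(semantic_history):
    """Keyed pseudorandom subset of clusters, derived from the secret key
    and the cluster IDs of recent tokens (the 'semantic history')."""
    digest = hashlib.sha256(SECRET_KEY + bytes(semantic_history)).digest()
    seed = int.from_bytes(digest[:8], "big")
    perm = np.random.default_rng(seed).permutation(NUM_CLUSTERS)
    return set(perm[: int(GAMMA * NUM_CLUSTERS)].tolist())


def watermarked_sample(logits, semantic_history):
    """Sample one token, biasing toward tokens whose *cluster* is green.

    Because the bias is keyed to clusters rather than token IDs, swapping a
    token for a same-cluster synonym does not disturb the signal."""
    green = green_clusters(semantic_history)
    biased = logits.astype(float)
    biased[np.isin(cluster_of, list(green))] += DELTA
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))


# Toy generation loop: track the last few cluster IDs as the semantic history.
tokens, history = [], [0, 0, 0, 0]
for _ in range(50):
    logits = rng.normal(size=VOCAB_SIZE)  # stand-in for model logits
    t = watermarked_sample(logits, history)
    tokens.append(t)
    history = history[1:] + [int(cluster_of[t])]
```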
Evaluations across multiple LLMs show that PASA remains detectable under strong paraphrasing attacks (rewrites that change the surface form but preserve the meaning), while standard vocabulary-space watermarks degrade or vanish. The preprint includes ablation studies validating hyperparameter choices and reports that text quality, measured by perplexity and human-evaluation proxies, stays high. Because the watermark rides on semantic clusters rather than individual tokens, it tolerates token-level edits that would break token-distribution watermarks.
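Detection works with the same secret key: recompute the green clusters from the cluster history at each position, count how often tokens land in green clusters, and run a z-test against the null rate. Continuing the hypothetical sketch above (reusing cluster_of, green_clusters, GAMMA, and rng; again an illustration, not the paper's detector), along with a toy same-cluster "paraphrase" that shows why the score survives token swaps:

```python
def detect_z(tokens, history_len=4):
    """Z-score for watermark presence: observed rate of green-cluster tokens
    versus the GAMMA expected under the null (unwatermarked text)."""
    hits, n = 0, 0
    for i in range(history_len, len(tokens)):
        history = [int(cluster_of[t]) for t in tokens[i - history_len : i]]
        if int(cluster_of[tokens[i]]) in green_clusters(history):
            hits += 1
        n += 1
    return (hits - GAMMA * n) / np.sqrt(n * GAMMA * (1 - GAMMA))


def paraphrase_same_cluster(tokens):
    """Replace each token with another token from the same cluster: token
    identities change, but cluster IDs (and hence the score) do not."""
    out = []
    for t in tokens:
        candidates = np.flatnonzero(cluster_of == cluster_of[t])
        out.append(int(rng.choice(candidates)))
    return out


z_watermarked = detect_z(tokens)
z_paraphrased = detect_z(paraphrase_same_cluster(tokens))
# Both z-scores stay high; a token-ID green list would collapse under the same edit.
```

In this toy setup the score depends only on cluster IDs, so the same-cluster paraphrase leaves it exactly unchanged; a real paraphrase that mostly stays within the learned semantic clusters would degrade the signal only gradually.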