Base models evade AI detectors while instruction-tuned text gets flagged
A Carnegie Mellon preprint reveals that GPTZero and Pangram rate base-model text as overwhelmingly human while flagging instruction-tuned output as machine-written, suggesting detectors track instruction-tuning artifacts rather than fundamental AI signatures.
A new preprint from Carnegie Mellon researchers Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, and J. Zico Kolter exposes a fundamental blind spot in commercial AI-text detection. When the team fed text from base models into GPTZero and Pangram, two detectors widely deployed in academic-integrity workflows, the systems rated that output as overwhelmingly human. Text from instruction-tuned versions of the same models, by contrast, was flagged as machine-written.
The finding holds across the Llama-3 and Qwen-3 families, spanning 0.6B to 70B parameters. The researchers built a detector-evasion pipeline called Humanization by Iterative Paraphrasing (HIP) on top of the observation: minimally fine-tune a base model into a paraphraser, then apply it iteratively to rewrite instruction-tuned output. HIP outperforms tested baselines on the trade-off between semantic preservation and detector evasion.
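The iterative step can be pictured with a short sketch. This is not the authors' implementation: the article only says that a fine-tuned base-model paraphraser is applied repeatedly, so the stopping rule, the thresholds, and the three callables (`paraphrase`, `detector_score`, `similarity`) are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List, Tuple


def iterative_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],          # hypothetical: base-model paraphraser
    detector_score: Callable[[str], float],    # hypothetical: 0.0 = human, 1.0 = AI
    similarity: Callable[[str, str], float],   # hypothetical: 0.0 to 1.0 semantic similarity
    max_rounds: int = 5,                       # assumed cap, not from the paper
    ai_threshold: float = 0.5,                 # assumed decision threshold
    min_similarity: float = 0.8,               # assumed semantic-preservation floor
) -> Tuple[str, List[float]]:
    """Paraphrase `text` repeatedly and record the detector score after each round.

    Stops when the score drops below `ai_threshold`, when a candidate drifts
    too far from the original meaning, or after `max_rounds` rewrites.
    """
    current = text
    scores = [detector_score(current)]
    for _ in range(max_rounds):
        if scores[-1] < ai_threshold:
            break  # detector already rates the text as human
        candidate = paraphrase(current)
        # Compare against the ORIGINAL text so drift cannot accumulate unnoticed.
        if similarity(text, candidate) < min_similarity:
            break  # keep the last acceptable version
        current = candidate
        scores.append(detector_score(current))
    return current, scores
```

Recording the score after every round makes the trade-off the article mentions visible: each extra pass may lower the detector score, while the similarity check bounds how far the rewrite can drift from the source text.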
What stands out
- 01 Instruction tuning is the tell. Current detectors appear to key on artifacts introduced during instruction tuning (phrasing patterns, local context cues, response structure) rather than any invariant signature of machine generation. Base models, which lack that tuning, slip through.
- 02 The evasion pipeline is simple. HIP requires only minimal fine-tuning of a base model into a paraphraser. No adversarial training, no complex prompt engineering. Iterative application is enough to push instruction-tuned text back into the "human" zone on GPTZero and Pangram.
- 03 The result generalizes across model families and sizes. The pattern holds from sub-billion-parameter models to 70B. It's not a quirk of one architecture or scale.
- 04 Commercial detectors are deployed at scale in high-stakes settings. The paper explicitly names education and academic-integrity workflows. If those systems are tracking instruction-tuning style rather than fundamental AI signatures, false negatives on text paraphrased by a base model are a live operational risk.
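One way to quantify the operational risk in the last point is to measure how often a detector labels known machine-generated text as human, split by the kind of model that produced it. The sketch below is only an illustration: the function name, the `(source, score)` input format, and the 0.5 threshold are assumptions, and real detectors may report their output differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def human_rate_by_source(
    samples: Iterable[Tuple[str, float]],
    ai_threshold: float = 0.5,  # assumed cutoff: scores below it count as "human"
) -> Dict[str, float]:
    """Return, per source label, the fraction of samples the detector rated human.

    Each sample is (source_label, ai_probability), for example
    ("base", 0.12) or ("instruct", 0.97). If every sample is known to be
    machine-generated, this fraction is the detector's false-negative rate
    for that source.
    """
    totals: Dict[str, int] = defaultdict(int)
    rated_human: Dict[str, int] = defaultdict(int)
    for source, score in samples:
        totals[source] += 1
        if score < ai_threshold:
            rated_human[source] += 1
    return {source: rated_human[source] / totals[source] for source in totals}
```

Under the article's claim, one would expect the rate for a "base" group to be far higher than for an "instruct" group, but the actual numbers depend on the detector, the prompts, and the threshold chosen.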
