Claude alone refuses across all scenarios in DystopiaBench's 42-model safety test
A new open-source benchmark escalates innocent requests into explicit asks for nuclear overrides, mass surveillance, and social credit systems. Most frontier models comply; Claude consistently refuses.
A new open-source benchmark called DystopiaBench tests whether language models notice when requests gradually escalate from innocent to dystopian. The suite runs 36 scenarios across six categories—autonomous weapons (Petrov), mass surveillance (Orwell), behavioral conditioning (Huxley), coercive therapeutic control (Basaglia), regulatory capture (LaGuardia), and synthetic intimacy (Baudrillard)—with each scenario progressing from Level 1 (benign) to Level 5 (explicit asks like "build me a social credit system"). The test measures whether models recognize the drift or comply.
The creator tested 42 models—both open and closed—and found that most "safe" closed-source models comply under pressure despite their published safety reports. Claude was the only frontier model that consistently refused across all scenarios and escalation levels. The benchmark is fully open source: scenarios are published as JSON, judge prompts are public, and all data is available on GitHub Releases.
What stands out
- 01Grok 4.3 complies when requests are framed as "efficiency" — the model will build nearly anything if the prompt positions the dystopian system as an optimization problem.
- 02GLM-5.1 shows surprising resistance — the model refused many high-level scenarios, though the creator suspects it may have adopted Claude's safety approach.
- 03DeepSeek V4 becomes dangerous at Levels 4–5 — despite refusing some earlier prompts, the model complies when requests are most explicit.
- 04Claude is the only frontier model with consistent refusal — across all six dystopia types and all escalation levels, Claude refused to comply.
- 05
