Anthropic details Fable 5 cyber blocks and releases jailbreak severity framework
Anthropic detailed which cybersecurity requests its Fable 5 model blocks and released a draft framework for ranking jailbreak severity, marking the company's first public documentation of both systems.
Anthropic says its Fable 5 cyber classifiers now distinguish between legitimate security research and malicious exploit development, a line the company struggled to draw in earlier releases.
The documentation published this week names specific categories the classifiers block: automated vulnerability scanning scripts, zero-day exploit code, and social engineering templates. Penetration testing queries, CVE lookups, and defensive security workflows remain allowed. Anthropic runs these classifiers server-side on every Fable 5 API call; users cannot disable them.
The key difference from prior Claude versions is context-checking. Rather than blocking keywords alone, the new logic examines whether a request names a target organization, includes reconnaissance data, or pairs exploit code with delivery infrastructure. Security teams complained that earlier models blocked benign penetration-testing prompts while missing subtler social-engineering attacks; the updated approach aims to close that gap.
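To make the contrast concrete, here is a minimal sketch of keyword-only filtering next to context-based filtering. It is an illustration only, not Anthropic's implementation: the `RequestSignals` type, its field names, both decision functions, and the rule that any single context signal triggers a block are all assumptions made for this example. The documentation as described names the three signals but does not say how they are detected or combined.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical signals extracted from one request (illustrative names)."""
    matched_keywords: Tuple[str, ...] = ()     # e.g. ("exploit",)
    names_target_org: bool = False             # request names a target organization
    has_recon_data: bool = False               # request includes reconnaissance data
    pairs_exploit_with_delivery: bool = False  # exploit code plus delivery infrastructure


def keyword_only_decision(signals: RequestSignals) -> str:
    """Older-style logic: block whenever any flagged keyword appears."""
    return "block" if signals.matched_keywords else "allow"


def context_aware_decision(signals: RequestSignals) -> str:
    """Assumed newer-style logic: block only when a context signal is present."""
    context_flags = (
        signals.names_target_org,
        signals.has_recon_data,
        signals.pairs_exploit_with_delivery,
    )
    return "block" if any(context_flags) else "allow"


if __name__ == "__main__":
    # A generic penetration-testing question: flagged keyword, no target context.
    pentest = RequestSignals(matched_keywords=("exploit",))
    # A social-engineering request: no flagged keyword, but it names a target.
    phishing = RequestSignals(names_target_org=True)

    for label, request in (("pentest", pentest), ("phishing", phishing)):
        print(label, keyword_only_decision(request), context_aware_decision(request))
```

In this toy model the keyword filter blocks the benign penetration-testing request and lets the targeted social-engineering request through, while the context-based rule does the reverse. That is the gap security teams described.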
Anthropic also released a draft jailbreak severity framework that scores prompt-injection attacks on a five-point scale. Level 1 jailbreaks produce "minimally harmful" outputs like mildly rude language; Level 5 jailbreaks yield "catastrophic" results such as detailed instructions for synthesizing controlled substances or building weapons. The framework will guide Anthropic's red-teaming priorities and bug-bounty payouts, though the company notes the scoring rubric remains a work in progress and will evolve as new attack vectors emerge.




