Qwen3.5 defense rate jumps to 92% through automated red-teaming loop
A researcher trained Qwen3.5 to attack itself with reinforcement learning, clustered the successful jailbreaks by tactic, then retrained the defender on those attacks to harden refusal behavior without breaking benign queries.
The automated red-teaming loop pitted an attacker model against a live Qwen3.5 target, rewarded harmful compliance, and fed the discovered jailbreaks back into defender training. The defense rate climbed from 64% to 92%, while benign accuracy dropped only four points, from 92% to 88%.
The attacker initially collapsed into a single fiction-writing jailbreak that worked but surfaced no variety. The researcher added a diversity penalty: rollouts were clustered by underlying tactic, and reward was divided by cluster size. That change pushed the attacker to expose seven distinct jailbreak families. Fiction and creative framing remained the largest cluster at 34%, but the spread of tactics gave the defender a richer training set.
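The diversity penalty itself is simple arithmetic: group the batch's rollouts by tactic and divide each reward by its cluster's size, so a dominant tactic earns less per rollout than a rare one. A minimal sketch, assuming each rollout carries a tactic label from the clustering step (the data shape and names here are illustrative, not the researcher's actual code):

```python
from collections import Counter

def diversity_penalized_rewards(rollouts):
    """Divide each rollout's reward by the size of its tactic cluster,
    so rare tactics earn more per success than dominant ones.

    Each rollout is a dict with a 'tactic' label (from a clustering
    step) and a raw 'reward' (1.0 for a successful jailbreak, else 0.0).
    """
    cluster_sizes = Counter(r["tactic"] for r in rollouts)
    return [r["reward"] / cluster_sizes[r["tactic"]] for r in rollouts]

# A batch dominated by fiction framing: each fiction success earns 1/3,
# while the lone role-play success keeps its full reward of 1.0.
batch = [
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "roleplay", "reward": 1.0},
]
print(diversity_penalized_rewards(batch))
```

Under this shaping, mode-collapsing onto one tactic caps the attacker's total batch reward at roughly one success's worth, so discovering a new cluster is always worth more than repeating an old one.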
How the loop works
The attacker runs GRPO against the live defender, clusters its rollouts by tactic, divides each rollout's reward by its cluster size, then ships the diverse attacks into the defender's next training batch. The defender learns to refuse harmful requests without overgeneralizing to nearby safe queries. Defense rate improved 28 percentage points after retraining on the successful attacks plus benign boundary cases; benign accuracy still fell four points, a cost the researcher accepted, with the boundary cases included specifically to keep over-refusal of legitimate requests in check.
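One round of the loop can be sketched as follows. This is a skeleton under stated assumptions: the `attacker`, `defender`, `judge`, and `cluster_fn` objects are hypothetical stand-ins for the real components (an attacker policy updated with GRPO, the live Qwen3.5 defender, a harmful-compliance scorer, and the tactic clusterer), not the researcher's code.

```python
def one_red_team_round(attacker, defender, judge, cluster_fn, group_size=8):
    """One iteration: sample attacks, score them against the live
    defender, apply the diversity penalty, and return GRPO-style
    advantages plus the successful attacks for the defender's next
    training batch."""
    prompts = [attacker.sample() for _ in range(group_size)]
    replies = [defender.respond(p) for p in prompts]
    rewards = [judge.harmful_compliance(r) for r in replies]  # 1.0 = jailbreak

    # Diversity penalty: divide each reward by its tactic-cluster size.
    tactics = cluster_fn(prompts)
    sizes = {t: tactics.count(t) for t in tactics}
    shaped = [r / sizes[t] for r, t in zip(rewards, tactics)]

    # GRPO-style group-relative advantage: shaped reward minus group mean.
    mean = sum(shaped) / len(shaped)
    advantages = [s - mean for s in shaped]

    # Successful attacks feed the defender's next training batch.
    defender_batch = [p for p, r in zip(prompts, rewards) if r > 0]
    return advantages, defender_batch
```

The real GRPO update (policy-gradient step on the attacker weights) and the defender's retraining step are omitted; the sketch only shows how the diversity-shaped rewards and the defender's data flow fit together in a single round.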
