Qwen3.5 defense rate jumps to 92% through automated red-teaming loop
A researcher trained Qwen3.5 to attack itself with reinforcement learning, clustered the successful jailbreaks by tactic, then retrained the defender on those attacks to harden refusal behavior without breaking benign queries.
The automated red-teaming loop pitted an attacker model against a live Qwen3.5 target, rewarded harmful compliance, and fed the discovered jailbreaks back into defender training. The defense rate climbed from 64% to 92%, while benign accuracy dropped only four points, from 92% to 88%.
The attacker initially collapsed into a single fiction-writing jailbreak that worked but surfaced no variety. The researcher added a diversity penalty: rollouts were clustered by underlying tactic, and reward was divided by cluster size. That change pushed the attacker to expose seven distinct jailbreak families. Fiction and creative framing remained the largest cluster at 34%, but the spread of tactics gave the defender a richer training set.
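The diversity penalty itself is simple arithmetic: group the batch's rollouts by tactic and divide each reward by its cluster's size, so a dominant tactic earns less per rollout than a rare one. A minimal sketch, assuming each rollout carries a tactic label from the clustering step (the data shape and names here are illustrative, not the researcher's actual code):

```python
from collections import Counter

def diversity_penalized_rewards(rollouts):
    """Divide each rollout's reward by the size of its tactic cluster,
    so rare tactics earn more per success than dominant ones.

    Each rollout is a dict with a 'tactic' label (from a clustering
    step) and a raw 'reward' (1.0 for a successful jailbreak, else 0.0).
    """
    cluster_sizes = Counter(r["tactic"] for r in rollouts)
    return [r["reward"] / cluster_sizes[r["tactic"]] for r in rollouts]

# A batch dominated by fiction framing: each fiction success earns 1/3,
# while the lone role-play success keeps its full reward of 1.0.
batch = [
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "fiction", "reward": 1.0},
    {"tactic": "roleplay", "reward": 1.0},
]
print(diversity_penalized_rewards(batch))
```

Under this shaping, mode-collapsing onto one tactic caps the attacker's total batch reward at roughly one success's worth, so discovering a new cluster is always worth more than repeating an old one.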
How the loop works
The attacker runs GRPO against the live defender, clusters its rollouts by tactic, divides each rollout's reward by its cluster size, then ships the diverse attacks into the defender's next training batch. The defender learns to refuse harmful requests without overgeneralizing to nearby safe queries. Defense rate improved 28 percentage points after retraining on the successful attacks plus benign boundary cases; benign accuracy still fell four points, a cost the researcher accepted, with the boundary cases included specifically to keep over-refusal of legitimate requests in check.
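One round of the loop can be sketched as follows. This is a skeleton under stated assumptions: the `attacker`, `defender`, `judge`, and `cluster_fn` objects are hypothetical stand-ins for the real components (an attacker policy updated with GRPO, the live Qwen3.5 defender, a harmful-compliance scorer, and the tactic clusterer), not the researcher's code.

```python
def one_red_team_round(attacker, defender, judge, cluster_fn, group_size=8):
    """One iteration: sample attacks, score them against the live
    defender, apply the diversity penalty, and return GRPO-style
    advantages plus the successful attacks for the defender's next
    training batch."""
    prompts = [attacker.sample() for _ in range(group_size)]
    replies = [defender.respond(p) for p in prompts]
    rewards = [judge.harmful_compliance(r) for r in replies]  # 1.0 = jailbreak

    # Diversity penalty: divide each reward by its tactic-cluster size.
    tactics = cluster_fn(prompts)
    sizes = {t: tactics.count(t) for t in tactics}
    shaped = [r / sizes[t] for r, t in zip(rewards, tactics)]

    # GRPO-style group-relative advantage: shaped reward minus group mean.
    mean = sum(shaped) / len(shaped)
    advantages = [s - mean for s in shaped]

    # Successful attacks feed the defender's next training batch.
    defender_batch = [p for p, r in zip(prompts, rewards) if r > 0]
    return advantages, defender_batch
```

The real GRPO update (policy-gradient step on the attacker weights) and the defender's retraining step are omitted; the sketch only shows how the diversity-shaped rewards and the defender's data flow fit together in a single round.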
