Learn-by-Wire Guard cuts Qwen2.5-7B perplexity 18.7% and stays trainable under aggressive learning rates
A new governance layer that sits above AdamW observes training telemetry and applies bounded control, delivering 10.74 final perplexity on WikiText-103 where baseline AdamW reaches 13.21, and it remains trainable at learning rates of 1e-3 and 3e-3, where the baseline degrades sharply.

Language-model training runs become increasingly fragile as learning rates grow more aggressive and models scale up, wasting compute on degraded checkpoints and diverged runs. A new preprint argues that a governance layer above the optimizer, rather than a replacement for the optimizer itself, can stabilize training under stress while preserving fixed objectives.
Learn-by-Wire Guard (LBW-Guard), introduced by Anis Radianis in a preprint posted May 21, observes training telemetry in real time and applies bounded control to AdamW execution without changing the update rule. The approach was evaluated in a stress-and-robustness suite centered on Qwen2.5 models and WikiText-103. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7 percent improvement, and cuts end-to-end training time from 392.54 seconds to 357.02 seconds, a 1.10× speedup.
Under stronger learning-rate stress, baseline AdamW degrades to a final perplexity of 659.76 at LR=1e-3 and 1,885.24 at LR=3e-3, while LBW-Guard remains trainable at 10.33 and 11.57, respectively. Gradient-clipping baselines do not reproduce the effect.
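The headline percentages follow directly from the reported figures. The short Python check below recomputes them; the function and variable names are illustrative, and the only inputs are the numbers quoted above.

```python
# Figures quoted in the article for the Qwen2.5-7B reference setting.
baseline_ppl = 13.21       # AdamW final perplexity
guard_ppl = 10.74          # LBW-Guard final perplexity
baseline_seconds = 392.54  # AdamW end-to-end training time
guard_seconds = 357.02     # LBW-Guard end-to-end training time


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old wall-clock time to the new one."""
    return before_seconds / after_seconds


ppl_reduction_pct = 100 * relative_reduction(baseline_ppl, guard_ppl)
time_speedup = speedup(baseline_seconds, guard_seconds)

print(f"perplexity reduction: {ppl_reduction_pct:.1f}%")  # prints 18.7%
print(f"training speedup: {time_speedup:.2f}x")           # prints 1.10x
```

Both results match the 18.7 percent and 1.10× figures in the text.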
The evaluation also includes model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, plus a no-LoRA TinyLlama-1B full-parameter sanity check. The results support the conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer, distinct from both optimizer replacement and local gradient suppression.
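To make the idea of a governance plane concrete, here is a minimal Python sketch of a controller that watches loss telemetry and returns a bounded learning-rate scale, leaving the optimizer's update rule untouched. It is not the LBW-Guard algorithm: the article does not describe which telemetry signals the preprint uses or what its control law is. The class name `BoundedLRGovernor`, the choice of loss as the only signal, the choice of the learning rate as the controlled quantity, and every threshold are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoundedLRGovernor:
    """Hypothetical governor: maps loss telemetry to an LR scale in [min_scale, 1.0]."""

    min_scale: float = 0.1    # assumed lower bound on the LR scale
    spike_ratio: float = 1.5  # assumed: loss above 1.5x its running mean counts as a spike
    backoff: float = 0.5      # assumed multiplicative cut applied on a spike
    recovery: float = 1.1     # assumed multiplicative recovery on a calm step
    ema_beta: float = 0.9     # smoothing factor for the running loss average
    scale: float = 1.0
    ema: Optional[float] = None

    def observe(self, loss: float) -> float:
        """Record one loss value and return the LR scale for the next step."""
        if self.ema is None:
            self.ema = loss  # first observation only seeds the running average
            return self.scale
        is_nan = loss != loss  # NaN is the only float not equal to itself
        if is_nan or loss > self.spike_ratio * self.ema:
            self.scale *= self.backoff  # back off; keep spikes out of the average
        else:
            self.scale *= self.recovery
            self.ema = self.ema_beta * self.ema + (1 - self.ema_beta) * loss
        # The control is bounded: the scale never leaves [min_scale, 1.0].
        self.scale = min(1.0, max(self.min_scale, self.scale))
        return self.scale


governor = BoundedLRGovernor()
base_lr = 1e-3
scales = []
for loss in [2.0, 1.9, 1.8, 5.0, 6.0, 1.9, 1.8]:  # made-up loss trace with two spikes
    scales.append(governor.observe(loss))
    lr = base_lr * scales[-1]  # value a training loop would hand to the optimizer
```

On this made-up trace the scale drops to 0.5 and then 0.25 on the two spikes and recovers gradually afterwards. In a PyTorch loop, one way to apply `lr` would be to write it into each entry of `optimizer.param_groups` before calling `optimizer.step()`. The contrast with gradient clipping is that clipping rescales the gradient inside a single step, while a controller of this kind adjusts how the optimizer is run across steps.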