Mega-ASR cuts speech recognition error by over 30% on compound distortions via acoustic simulation
A new ASR framework trains on 2M synthetic compound-distortion samples, achieving 21.49% WER on NOIZEUS Sta-0 versus 29.34% for prior state-of-the-art.

Mega-ASR addresses what its authors call the "acoustic robustness bottleneck": the tendency of speech recognition models to lose grounding and hallucinate under severe real-world distortion. The framework combines a 2-million-sample synthetic dataset called Voices-in-the-Wild-2M with a two-stage training regime: Acoustic-to-Semantic Progressive Supervised Fine-Tuning, followed by Dual-Granularity WER-Gated Policy Optimization. The dataset covers seven classic acoustic phenomena and 54 physically plausible compound scenarios, which simulate distortions such as reverberation, noise, and codec artifacts occurring together, as they do in the wild.
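To make "compound scenario" concrete, here is a minimal sketch of stacking three distortions on one signal. It is not the Voices-in-the-Wild-2M pipeline, which the paper describes but has not released. The function names, the toy echo-based reverb, the white-noise model, and the mu-law step standing in for a codec are all assumptions chosen for illustration.

```python
import math
import random

def add_reverb(signal, decay=0.6, delay=3, taps=4):
    """Toy reverb: the direct path plus a few exponentially decaying echoes."""
    out = list(signal) + [0.0] * (delay * taps)
    for k in range(1, taps + 1):
        gain = decay ** k
        for i, x in enumerate(signal):
            out[i + k * delay] += gain * x
    return out

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled so the signal-to-noise ratio is snr_db."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(signal, noise)]

def mu_law_codec(signal, mu=255, levels=256):
    """Crude codec stand-in: mu-law compress, quantize to `levels`, expand."""
    peak = max(abs(x) for x in signal) or 1.0
    log_mu = math.log1p(mu)
    out = []
    for x in signal:
        x /= peak
        y = math.copysign(math.log1p(mu * abs(x)) / log_mu, x)
        q = round((y + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
        out.append(math.copysign(math.expm1(abs(q) * log_mu) / mu, q) * peak)
    return out

def compound_distort(signal, snr_db=0.0, seed=0):
    """Apply reverb, then noise, then the codec stand-in, in that order."""
    rng = random.Random(seed)
    return mu_law_codec(add_noise(add_reverb(signal), snr_db, rng))

# A short sine wave stands in for clean speech (hypothetical input).
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
distorted = compound_distort(clean, snr_db=0.0, seed=0)
```

The order of operations matters in a sketch like this: noise is added after reverberation so that the target SNR is measured against the reverberant signal, and the codec step comes last because it acts on whatever reaches the microphone.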
Mega-ASR achieves a 45.69% word error rate (WER) on the VOiCES R4-B-F benchmark versus 54.01% for the previous best system, and 21.49% WER on NOIZEUS Sta-0 versus 29.34%. Those gaps correspond to relative reductions of roughly 15% and 27%. The headline figure comes from the authors' own compound-distortion test set, where the framework delivers over 30% relative WER reduction against both open- and closed-source baselines, including Whisper-large-v3 and proprietary cloud ASR APIs. The paper frames the improvement as a shift from single-distortion training to compositional simulation at scale.
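The difference between absolute and relative WER reduction is easy to blur, so the arithmetic for the two public benchmarks is worked through here. The WER values are the ones reported above; the helper name `relative_wer_reduction` is introduced only for this example.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline's word error rate that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer

# (previous best WER, Mega-ASR WER), in percent, as reported in the article.
reported = {
    "VOiCES R4-B-F": (54.01, 45.69),
    "NOIZEUS Sta-0": (29.34, 21.49),
}

for name, (baseline, mega_asr) in reported.items():
    absolute = baseline - mega_asr
    relative = relative_wer_reduction(baseline, mega_asr)
    print(f"{name}: {absolute:.2f} points absolute, {relative:.1%} relative")

# VOiCES R4-B-F: 8.32 points absolute, 15.4% relative
# NOIZEUS Sta-0: 7.85 points absolute, 26.8% relative
```

Neither public benchmark reaches the 30% mark on its own, which is consistent with that figure being reported for the authors' compound-distortion test set.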
On benchmarks
The VOiCES R4-B-F subset represents far-field, reverberant speech in a large room. NOIZEUS is a standard noisy-speech corpus, and Sta-0 is its station-noise condition at 0 dB SNR, where the noise is as strong as the speech. Prior models often omit words or insert phantom tokens when multiple distortions stack, a failure mode the progressive training schedule is designed to mitigate. The 30% relative gain on compound scenarios suggests that training on physically plausible combinations of distortions, rather than on isolated noise or reverb, improves robustness to real-world acoustic complexity.
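Omitted words and phantom tokens show up in WER as deletions and insertions respectively. The sketch below computes WER with a standard word-level edit-distance alignment and reports the breakdown by error type. It is a generic illustration, not the paper's scoring code; the function name and the example sentences are made up, and real evaluations usually normalize text (case, punctuation) first.

```python
def wer_breakdown(reference, hypothesis):
    """Word error rate plus counts of substitutions, deletions, and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no edit
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)      # reference word dropped
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)     # extra hypothesis word
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])

    edits, subs, dels, ins = cost[-1][-1]
    return {"wer": edits / len(ref), "substitutions": subs,
            "deletions": dels, "insertions": ins}

# Hypothetical transcript pair: one dropped word and one phantom word.
result = wer_breakdown(
    "turn the lights off in the kitchen",
    "turn lights off in the kitchen please",
)
print(result)
```

Because insertions count against the reference length, a model that hallucinates heavily can score a WER above 100%, which is one reason the breakdown is more informative than the single number when diagnosing behavior under stacked distortions.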
The preprint does not specify whether weights will be released or under what license. The Voices-in-the-Wild-2M dataset construction pipeline is described in the paper but not yet publicly available. The work positions itself as a scalable paradigm for "ASR-in-the-wild," with the compound-data approach intended to generalize beyond the seven acoustic phenomena explicitly modeled.