RTX 4090 hits 240× real-time on OmniVoice TTS; 21 GPUs benchmarked
A practitioner rented 21 consumer GPUs on vast.ai to benchmark OmniVoice, a small text-to-speech model with 5GB VRAM peak, finding the RTX 4090 leads at 240× real-time generation speed.

A practitioner benchmarked 21 GPUs, mostly consumer cards, running OmniVoice, a small text-to-speech model that peaks at 5 GB of VRAM, to measure inference speed across several generations of Nvidia hardware plus a handful of AMD and workstation cards. Each GPU, rented on vast.ai, generated a short paragraph with voice cloning enabled, and results were averaged across three runs.
The RTX 4090 topped the chart at 240× real-time—synthesizing audio 240 times faster than playback speed. The RTX 3090 Ti and 3090 landed at 120× and 100× respectively. Mid-range cards like the RTX 4070 Ti Super hit 150×, while the RTX 4060 Ti managed 80×. Older hardware like the GTX 1080 Ti clocked 30×, and the GTX 1660 Super came in at 20×. The RTX A6000 workstation card delivered 110×. AMD's RX 7900 XTX reached 70×, and the RX 6800 XT hit 50×.
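The "× real-time" figures above are ratios of audio duration to wall-clock generation time. The article does not publish the benchmark script, so the following is only a minimal sketch of how such a figure could be measured. The `synthesize` callable is hypothetical (it stands in for one OmniVoice generation and returns the length of the produced audio in seconds), and averaging the per-run ratios is an assumption about how the three runs were combined.

```python
import time
from statistics import mean
from typing import Callable


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def benchmark(synthesize: Callable[[], float], runs: int = 3) -> float:
    """Average the real-time factor over several runs.

    `synthesize` is a hypothetical callable: it runs one generation,
    blocks until the audio is complete, and returns the audio duration
    in seconds. Averaging the per-run factors is an assumption.
    """
    factors = []
    for _ in range(runs):
        start = time.perf_counter()
        audio_seconds = synthesize()
        wall_seconds = time.perf_counter() - start
        factors.append(realtime_factor(audio_seconds, wall_seconds))
    return mean(factors)
```

For GPU inference the callable has to wait for the device to finish before returning; otherwise the timer stops early and the factor is overstated.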
OmniVoice's 5 GB footprint sits comfortably below the 8 GB of VRAM found on most modern consumer cards, so it runs on a wide range of hardware. Voice cloning adds computational overhead compared to single-speaker synthesis, so the numbers reflect a realistic production scenario. The RTX 4090 delivers 2.4× the throughput of an RTX 3090 on this task (240× versus 100×), a meaningful gap for batch processing or real-time applications. For users running TTS locally, even mid-tier 40-series hardware delivers strong performance: at 150× real-time, the RTX 4070 Ti Super synthesizes a 10-second audio clip in about 67 milliseconds (10 s ÷ 150), well within interactive latency budgets.
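The arithmetic behind those comparisons is easy to reproduce. The sketch below uses only the speed factors reported in the article; the function names are illustrative, and it assumes the measured factor holds regardless of clip length, which the benchmark did not test.

```python
# Speed factors as reported in the article (multiples of real-time).
REPORTED = {
    "RTX 4090": 240,
    "RTX 4070 Ti Super": 150,
    "RTX 3090": 100,
}


def synthesis_time_ms(audio_seconds: float, speed_factor: float) -> float:
    """Wall-clock time in milliseconds to generate `audio_seconds` of audio."""
    return audio_seconds / speed_factor * 1000.0


def relative_throughput(factor_a: float, factor_b: float) -> float:
    """How many times faster card A is than card B."""
    return factor_a / factor_b


if __name__ == "__main__":
    for gpu, factor in REPORTED.items():
        ms = synthesis_time_ms(10, factor)
        print(f"{gpu}: {ms:.0f} ms for a 10-second clip")
    ratio = relative_throughput(REPORTED["RTX 4090"], REPORTED["RTX 3090"])
    print(f"RTX 4090 vs RTX 3090: {ratio:.1f}x")
```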