Kokoro 82M outpaces Supertonic 3 on quality; Supertonic wins speed on CPU

A CPU-only benchmark compares Supertonic 3 TTS (flow-matching, 2–5 inference steps) against Kokoro 82M on AMD EPYC hardware. Supertonic 2-step reaches 6.1x realtime but produces slurred audio; Supertonic 5-step and Kokoro trade speed for naturalness.

May 17, 2026


"Supertonic is faster, but Kokoro still sounds better," according to a detailed CPU benchmark comparing the two text-to-speech models on AMD EPYC 7763 hardware (4 vCPUs, 16 GB RAM, no GPU). The test ran 120 timed passes across six text lengths from 12 to 1,712 characters, and the results split the two models into distinct use cases: Supertonic for speed, Kokoro for quality.

Supertonic 3, a flow-matching TTS that gets faster as its inference step count is reduced, achieved a mean RTF (real-time factor: synthesis time divided by audio duration, so lower is faster) of 0.313 at its default 5 steps, synthesizing audio about 3.2 times faster than playback. Dialing down to 2-step "speed mode" lowered RTF to 0.165 (6.1x realtime), but the audio degraded noticeably: words slurred, prosody flattened, and the tester deemed it unsuitable for production. Kokoro 82M PyTorch clocked 0.469 RTF (2.1x realtime), with its ONNX variant slightly slower at 0.509.
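The "x realtime" figures follow from RTF by taking its reciprocal. A minimal sketch of that conversion, using only the mean RTF values reported above (the function and dictionary names are illustrative, not from the benchmark):

```python
def realtime_multiple(rtf: float) -> float:
    """Convert a real-time factor into a speed multiple relative to playback."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


# Mean RTF values as reported in the benchmark summary.
reported_rtf = {
    "Supertonic 3 (5-step)": 0.313,
    "Supertonic 3 (2-step)": 0.165,
    "Kokoro 82M (PyTorch)": 0.469,
    "Kokoro 82M (ONNX)": 0.509,
}

for name, rtf in reported_rtf.items():
    print(f"{name}: RTF {rtf:.3f} -> {realtime_multiple(rtf):.1f}x realtime")
```

An RTF below 1.0 means the model produces audio faster than it plays back; all four configurations clear that bar on this hardware.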

On a 196-character passage (roughly 13 seconds of audio), Supertonic 5-step finished in 3.67 seconds of wall-clock time versus 5.62 seconds for Kokoro PyTorch. Steady-state throughput favored Supertonic at about 55 characters per second, compared with 33–36 for Kokoro. One anomaly emerged: Kokoro's ONNX build ran slower than its PyTorch build on this AMD chip, likely because of higher fixed overhead on short texts. The result is worth retesting on Intel hardware.
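For readers who want to reproduce this kind of measurement, the sketch below shows one way to derive RTF and characters per second from a timed synthesis call. It is an assumption-laden outline, not the tester's harness: `Timing`, `time_synthesis`, and the `synthesize` callable are hypothetical names, and the 13-second audio length is the article's rounded figure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Timing:
    chars: int            # length of the input text
    wall_seconds: float   # wall-clock synthesis time
    audio_seconds: float  # duration of the generated audio

    @property
    def rtf(self) -> float:
        # Real-time factor: synthesis time divided by audio duration.
        return self.wall_seconds / self.audio_seconds

    @property
    def chars_per_second(self) -> float:
        return self.chars / self.wall_seconds


def time_synthesis(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical: text -> mono samples
    text: str,
    sample_rate: int,
) -> Timing:
    """Time a single synthesis call and record the audio length it produced."""
    start = time.perf_counter()
    samples = synthesize(text)
    wall = time.perf_counter() - start
    return Timing(len(text), wall, len(samples) / sample_rate)


# Reported figures for the 196-character passage (audio length is approximate).
supertonic_5 = Timing(chars=196, wall_seconds=3.67, audio_seconds=13.0)
kokoro_pt = Timing(chars=196, wall_seconds=5.62, audio_seconds=13.0)

for name, t in [("Supertonic 5-step", supertonic_5), ("Kokoro PyTorch", kokoro_pt)]:
    print(f"{name}: RTF {t.rtf:.2f}, {t.chars_per_second:.1f} chars/s")
```

In a real run, `synthesize` would wrap the model's own inference call, and each text length would be timed over several passes after a warm-up so that one-time model loading does not skew the mean.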

Kokoro ranks first on the TTS Arena leaderboard and produced the most natural speech in the comparison. Supertonic's fixed per-call overhead weighs heavily on tiny inputs (RTF 0.30 on 12 characters, falling to 0.13 on medium-length text), while Kokoro's RTF remains flat across text lengths. The practical ranking: choose Kokoro when output quality is non-negotiable, Supertonic 5-step for chatbot or assistant workflows where sub-4-second latency matters, and Supertonic 2-step only for prototyping.
