Stable Audio 3 Medium generates six-minute audio with open weights
Stability AI shipped three Stable Audio 3 models this week—two small checkpoints for music and sound effects, plus a medium model that generates up to six minutes twenty seconds of audio on NVIDIA GPUs.
Stability AI released Stable Audio 3, a text-to-audio model family for music and sound effects generation, with open weights on HuggingFace under the Stability AI Community License. The Medium checkpoint synthesizes audio up to six minutes twenty seconds long and runs in seconds on NVIDIA GPUs, while the two Small variants—one for music, one for sound effects—generate up to two minutes and are optimized for CPU inference. All three models are free for personal and creative use, with no royalty claims on outputs.
The release includes a GitHub repo for inference and LoRA fine-tuning, plus two arXiv papers: one on the Stable Audio 3 architecture and another on the SAME autoencoder the models are built on. The Medium model is available as stable-audio-3-medium on HuggingFace; the Small checkpoints are stable-audio-3-small-music and stable-audio-3-small-sfx. Stability AI also launched a web demo at stableaudio.com/generate, letting users test prompts before pulling the checkpoints locally.
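For readers deciding which checkpoint to pull, the stated limits can be written down as a small lookup. The Python sketch below is illustrative only: the checkpoint names and duration ceilings come from the release as described here, while the `Checkpoint` class, the `pick_checkpoint` helper, and the rule "use Medium when a GPU is present, otherwise a Small model" are hypothetical and are not part of Stability AI's tooling. The HuggingFace organization prefix is not given in the announcement, so the names are left bare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str          # checkpoint name as given in the release
    max_seconds: int   # longest clip the release says it can generate
    target: str        # hardware the release says it is aimed at


# Limits as stated in the announcement: 6 min 20 s for Medium, 2 min for Small.
MEDIUM = Checkpoint("stable-audio-3-medium", 6 * 60 + 20, "gpu")
SMALL_MUSIC = Checkpoint("stable-audio-3-small-music", 2 * 60, "cpu")
SMALL_SFX = Checkpoint("stable-audio-3-small-sfx", 2 * 60, "cpu")


def pick_checkpoint(seconds: float, content: str, has_gpu: bool) -> Checkpoint:
    """Hypothetical selection rule, not an official recommendation."""
    if content not in ("music", "sfx"):
        raise ValueError("content must be 'music' or 'sfx'")
    if seconds <= 0:
        raise ValueError("seconds must be positive")

    # Assumption: with a GPU, prefer Medium; without one, use the matching Small model.
    if has_gpu:
        choice = MEDIUM
    else:
        choice = SMALL_MUSIC if content == "music" else SMALL_SFX

    if seconds > choice.max_seconds:
        raise ValueError(
            f"{choice.name} is limited to {choice.max_seconds} s; requested {seconds} s"
        )
    return choice


if __name__ == "__main__":
    print(pick_checkpoint(300, "music", has_gpu=True).name)
    print(pick_checkpoint(90, "sfx", has_gpu=False).name)
```

A five-minute music request on a GPU machine resolves to the Medium checkpoint, while a 90-second sound effect on a CPU-only laptop resolves to the Small SFX checkpoint. Requests beyond a model's stated ceiling raise an error rather than silently truncating.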
The Medium model's six-minute-twenty-second ceiling is the longest single-shot audio generation window in an open-weight text-to-audio release to date. The Small models' CPU optimization makes local experimentation practical on laptops and edge devices, a capability that has been standard in open-weight image models but rare in audio. Stability AI's licensing posture, which claims no royalties or ownership on model outputs, aligns with the company's broader open-weight strategy across image, video, and now audio.
