Stable Audio 3 generates minutes of music in under 2 seconds on an H200 GPU
Stability AI released Stable Audio 3, a family of latent diffusion models (small, medium, large) that generate and edit variable-length audio in under two seconds on an H200 GPU, with small and medium weights now available for local use.
This week Stability AI released Stable Audio 3, a family of latent diffusion models designed for fast, variable-length audio generation and editing. The small and medium weights are now available as open-source releases; the large model remains proprietary. All three sizes can produce several minutes of audio in under two seconds on an H200 GPU, or within a few seconds on a MacBook Pro M4.
The models run diffusion in the latent space of a new semantic-acoustic autoencoder, which compresses audio into a compact representation while preserving fidelity and encouraging semantic structure. Because generation is variable-length, a short sound does not incur the cost of synthesizing a full-length clip. Inpainting support allows targeted editing and continuation of short recordings without full regeneration.
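To make the variable-length saving concrete, here is a minimal sketch of the arithmetic involved. The sample rate (44.1 kHz) and the autoencoder downsampling ratio (2,048 samples per latent frame) are illustrative assumptions, not published Stable Audio 3 figures, and `latent_frames` is a hypothetical helper name.

```python
import math


def latent_frames(duration_s: float,
                  sample_rate: int = 44_100,   # assumed, not from the release
                  downsample: int = 2_048) -> int:  # assumed compression ratio
    """Return how many latent frames are needed to cover `duration_s` seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return math.ceil(duration_s * sample_rate / downsample)


# A 2-second sound effect versus a 3-minute track.
short_clip = latent_frames(2.0)
full_track = latent_frames(180.0)

print(short_clip, full_track)
```

Under these assumed numbers the short clip needs far fewer latent frames than the full track, so a model that sizes its latent sequence to the requested duration does correspondingly less work per diffusion step than one that always synthesizes a fixed-length window.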
The team trained on licensed and Creative Commons data and applied adversarial post-training to reduce the number of diffusion steps while improving fidelity and prompt adherence. The small and medium models run on consumer-grade hardware; weights and inference code are available on Hugging Face.
What stands out
- Variable-length generation. The models produce audio of any duration without forcing full-length synthesis for short sounds, cutting compute costs for practical workflows.
- Inpainting for editing. Users can edit specific regions of an audio clip or extend short recordings, rather than regenerating from scratch.
- Semantic-acoustic autoencoder. The novel autoencoder projects audio into a latent space that preserves fidelity while encouraging semantic structure, making diffusion-based generation more efficient.
- Adversarial post-training. The team applied adversarial training to reduce inference steps while improving quality and prompt alignment, a departure from standard diffusion fine-tuning.
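The inpainting idea can be sketched with a generic mask-based blend over latent frames. This illustrates the general technique only and is not necessarily how Stable Audio 3 implements it; `region_mask` and `inpaint_blend` are hypothetical helpers, and each latent frame is reduced to a single float for readability (real latents are vectors).

```python
def region_mask(n_frames: int, start: int, end: int) -> list[bool]:
    """True = keep the original frame; False = regenerate frames in [start, end)."""
    return [not (start <= i < end) for i in range(n_frames)]


def inpaint_blend(known: list[float],
                  generated: list[float],
                  keep_mask: list[bool]) -> list[float]:
    """Keep known frames where the mask is True, generated frames elsewhere."""
    if not (len(known) == len(generated) == len(keep_mask)):
        raise ValueError("all sequences must have the same length")
    return [k if keep else g for k, g, keep in zip(known, generated, keep_mask)]


# Edit frames 2 and 3 of a six-frame clip, leaving the rest untouched.
known = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
generated = [9.0] * 6          # stand-in for model output
mask = region_mask(6, 2, 4)
result = inpaint_blend(known, generated, mask)

print(result)  # [0.1, 0.2, 9.0, 9.0, 0.5, 0.6]
```

Continuation is the same operation with the regenerated region placed at the end: pad the known latents to the target length and call `region_mask(n_frames, known_length, n_frames)`. In a typical diffusion setup a blend like this is applied during denoising, or the mask is passed to the model as conditioning, so that the generated region stays consistent with the preserved audio.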
