Scenema Audio decouples voice identity from emotional performance in diffusion TTS
Scenema.ai released open weights and inference code for a diffusion-based text-to-speech model that decouples speaker identity from emotional delivery, allowing any voice to perform emotions it was never recorded in.
Emotional delivery and voice identity are independent variables in speech synthesis. That is the principle behind Scenema Audio, a new diffusion-based TTS system that puts it into practice.
Scenema.ai released model weights and inference code this week for Scenema Audio, which takes two inputs: a text prompt describing the emotional performance ("rage," "grief," "a child's wonder") and an optional reference audio clip that fixes the voice identity. The reference provides the "who"; the prompt provides the "how." Because the two signals are conditioned independently, any voice can perform any emotion, even one that voice has never been recorded in.

The team reports using Scenema Audio in production over Gemini 3.1 Flash TTS for emotional scenes, citing a more natural, less robotic quality than autoregressive TTS delivers, a trade-off they find worth the extra complexity.
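To make the decoupling concrete, here is a minimal sketch of what a two-input interface like this looks like. Every name below (`SynthesisRequest`, `synthesize`, the field names) is hypothetical and does not come from Scenema's actual inference code; the stub only illustrates that identity and emotion are independent conditioning signals.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the two-input interface described above.
# None of these names are from Scenema's released inference code.

@dataclass
class SynthesisRequest:
    text: str                              # the words to speak
    emotion_prompt: str                    # the "how": free-text emotional direction
    reference_audio: Optional[str] = None  # the "who": path to a voice sample

def synthesize(req: SynthesisRequest) -> dict:
    """Stub standing in for the diffusion sampler. Identity and emotion
    enter as separate conditioning signals, so any (voice, emotion) pair
    is valid even if that pairing never appears in the training data."""
    voice = req.reference_audio or "default-voice"
    return {"voice": voice, "emotion": req.emotion_prompt, "text": req.text}

# The same reference voice can be paired with any emotion prompt:
grief = synthesize(SynthesisRequest("Hello.", "quiet grief", "alice.wav"))
rage = synthesize(SynthesisRequest("Hello.", "rage", "alice.wav"))
```

The point of the shape, not the stub body: because the voice reference and the emotion prompt are separate arguments, sweeping one while holding the other fixed is trivial, which is exactly what "any voice can perform any emotion" means at the API level.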
