Scenema Audio lets any voice perform any emotion via diffusion TTS
Scenema.ai released open weights and inference code for a diffusion-based text-to-speech model that decouples emotional performance from voice identity, letting any voice perform any emotion even if never recorded in that state.
Emotional performance and voice identity are independent dimensions in speech synthesis — that's the design principle behind Scenema Audio, a diffusion-based text-to-speech model released this week by Scenema.ai as open weights and inference code.
Scenema Audio is a zero-shot voice cloning system built on a diffusion model rather than an autoregressive one. Users provide a text prompt describing the emotional delivery (rage, grief, excitement, a child's wonder) and optionally supply reference audio for voice identity. The reference defines who speaks; the prompt defines how they speak. Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
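To make the decoupling concrete, here is a toy numpy sketch of the general pattern: a diffusion sampling loop whose denoiser is conditioned on two separate signals, a speaker embedding (who) and an emotion embedding (how). All names, dimensions, and the stand-in linear "denoiser" are illustrative assumptions for exposition; none of this reflects Scenema's actual architecture or inference code.

```python
import numpy as np

def toy_denoiser(x, t, cond, W):
    # Toy stand-in for the learned noise-prediction network: in a real
    # system this would be a neural net; here it is a random linear map
    # over (current sample, timestep, conditioning).
    inp = np.concatenate([x, [t], cond])
    return W @ inp

def sample(speaker_emb, emotion_emb, steps=50, dim=16, seed=0):
    """DDPM-style ancestral sampling with decoupled conditioning.

    speaker_emb: hypothetical embedding of the reference audio (voice identity)
    emotion_emb: hypothetical embedding of the emotion text prompt (delivery)
    Returns a toy 'audio feature' vector of length `dim`.
    """
    rng = np.random.default_rng(seed)
    # The two conditioning signals stay independent until they are fed
    # to the denoiser, so either one can be swapped freely.
    cond = np.concatenate([speaker_emb, emotion_emb])
    W = rng.normal(0.0, 0.01, size=(dim, dim + 1 + cond.size))
    betas = np.linspace(1e-4, 0.02, steps)       # standard linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(dim)                 # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, t / steps, cond, W)
        # DDPM posterior mean: remove the predicted noise for this step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(dim)
    return x
```

The point of the sketch is the shape of the interface: because `speaker_emb` and `emotion_emb` enter only as concatenated conditioning, pairing a voice with an emotion it was never recorded in is just a new combination of inputs, not a new model.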
