Scenema Audio lets any voice perform any emotion via diffusion TTS

Scenema.ai released open weights and inference code for a diffusion-based text-to-speech model that decouples emotional performance from voice identity, letting any voice perform any emotion even if never recorded in that state.

May 14, 2026

Scenema.ai this week released Scenema Audio, a diffusion-based text-to-speech model, as open weights and inference code. Its design principle: emotional performance and voice identity are independent dimensions in speech synthesis.

Scenema Audio is a zero-shot voice cloning system built on diffusion rather than autoregressive TTS. Users provide a text prompt describing the emotional delivery (rage, grief, excitement, a child's wonder) and optionally supply reference audio for voice identity. The reference defines who speaks; the prompt defines how they speak. Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
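
As a rough illustration of that split, the two inputs can be modeled as separate fields of a request. The sketch below is not Scenema Audio's actual interface: the `PerformanceRequest` class, its field names, and the file paths are all hypothetical, chosen only to show that voice and delivery can vary independently of each other.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class PerformanceRequest:
    """Hypothetical request shape; not the real Scenema Audio API."""
    script: str                     # the words to be spoken
    delivery_prompt: str            # HOW it is spoken (emotion, style)
    reference_audio: Optional[str]  # WHO speaks; None means no reference supplied


def build_grid(script, voices, deliveries):
    """Pair every voice with every delivery, since the two are independent."""
    return [
        PerformanceRequest(script, delivery, voice)
        for voice, delivery in product(voices, deliveries)
    ]


# Placeholder paths and prompts, for illustration only.
voices = ["narrator.wav", "child.wav", None]
deliveries = ["shaking with rage", "quiet grief", "breathless excitement"]
requests = build_grid("I told you not to open that door.", voices, deliveries)
print(len(requests))  # 3 voices x 3 deliveries = 9 requests
```

Because neither field constrains the other, every voice in the list is paired with every delivery, including ones that voice was never recorded performing.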

The model runs in 8 diffusion steps, down from 50 in the base version. The team reports that denoising cost is a small fraction of total generation time, with the real bottleneck elsewhere in the pipeline. Some seeds produce repetition or gibberish, so the model is designed for a post-editing workflow: users generate multiple takes and pick the best one, the same way practitioners work with other generative models.
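
The multi-take workflow can be partly automated with a pre-filter before listening. The following is a minimal sketch, not something the release describes: it assumes each take has already been transcribed by a speech recognizer of your choice, and it ranks takes by how closely the transcript matches the intended script. The function names and sample transcripts are invented for illustration, and the final pick is still best made by ear.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def score_take(script, transcript):
    """Similarity in [0, 1] between the intended script and a take's transcript.

    Repeated phrases and gibberish both add words that are absent from the
    script, which lowers the ratio.
    """
    return SequenceMatcher(None, words(script), words(transcript)).ratio()


def pick_best(script, takes):
    """takes maps seed -> transcript; return the seed with the highest score."""
    return max(takes, key=lambda seed: score_take(script, takes[seed]))


script = "I never said she stole my money."
# Made-up transcripts standing in for ASR output of three generated takes.
takes = {
    1: "I never said she stole my money money money my money.",  # repetition
    2: "I never said she stole my money.",                       # clean
    3: "I nevah sed shi stoll ma munny blarg.",                  # gibberish
}
best_seed = pick_best(script, takes)
print(best_seed)  # prints 2
```

A text-match score only catches takes that drift from the script. It says nothing about whether the emotional delivery landed, which is why it works as a filter rather than a replacement for review.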

Scenema.ai uses the model internally for audio-first video generation. The workflow: generate the voice performance, then feed it into an audio-to-video pipeline (LTX 2.3, Wan 2.6, Seedance 2.0) to generate video that matches the speech. The team notes that diffusion-generated speech sounds more natural and less robotic than autoregressive TTS, even compared to Gemini 3.1 Flash TTS, which is already more controllable than most closed TTS systems.

Prompting matters. Generic voice descriptions produce generic output. Specific, theatrical descriptions with action tags produce performances. The model is sensitive to prompt structure in the same way LTX 2.3 is for video generation.
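
To make the contrast concrete, here is a small sketch of a helper that assembles a delivery description. The release summary does not document a prompt format, so everything here is an assumption: the `build_delivery_prompt` function, its parameters, and the bracketed notation for action tags are hypothetical stand-ins for whatever structure the model actually expects.

```python
def build_delivery_prompt(character, emotion, actions=(), pacing=None):
    """Assemble a delivery description from its parts.

    The bracketed tag format is an assumption for illustration, not a
    documented Scenema Audio syntax.
    """
    parts = [f"{character}, {emotion}"]
    if pacing:
        parts.append(pacing)
    prompt = "; ".join(parts) + "."
    tags = " ".join(f"[{action}]" for action in actions)
    return f"{prompt} {tags}".strip()


# A generic description: little for the model to perform.
generic = build_delivery_prompt("A man", "angry")

# A specific, theatrical description with action tags.
specific = build_delivery_prompt(
    "A retired sea captain in his sixties",
    "barely contained rage that cracks into grief",
    actions=["slams fist on table", "voice breaks", "long exhale"],
    pacing="slow and clipped at first, then rushing",
)
print(generic)
print(specific)
```

The first prompt names only an emotion. The second gives a character, an emotional arc, pacing, and physical actions, which is the kind of detail the article says separates a performance from generic output.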