Scenema Audio is a zero-shot voice cloning system built on diffusion rather than autoregressive TTS. Users provide a text prompt describing the emotional delivery (rage, grief, excitement, a child's wonder) and optionally supply reference audio for voice identity. The reference defines who speaks; the prompt defines how they speak. Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
The model runs in 8 diffusion steps, down from 50 in the base version. The team reports that denoising cost is a small fraction of total generation time, with the real bottleneck elsewhere in the pipeline. Known issues include repetition and gibberish on some seeds. The model is designed for a post-editing workflow in which users generate multiple takes and pick the best output, the same way practitioners work with other generative models.
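The multi-take workflow can be sketched as a simple ranking loop. This is a minimal illustration, not Scenema's tooling: `fake_generate_transcript` is a hypothetical stand-in for "generate audio with this seed, then transcribe it", and the n-gram repetition score is an assumed heuristic for flagging the looping failures mentioned above.

```python
from collections import Counter
from typing import Callable, List, Tuple


def repetition_score(transcript: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = transcript.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def rank_takes(
    generate_transcript: Callable[[int], str],
    seeds: List[int],
    n: int = 3,
) -> List[Tuple[float, int]]:
    """Return (score, seed) pairs sorted from least to most repetitive."""
    scored = [(repetition_score(generate_transcript(s), n), s) for s in seeds]
    return sorted(scored)


def fake_generate_transcript(seed: int) -> str:
    """Hypothetical stub: a real pipeline would synthesize audio and run ASR."""
    takes = {
        0: "I told you I told you I told you never to come back",
        1: "I told you never to come back here",
        2: "back back back back back back",
    }
    return takes[seed]


ranking = rank_takes(fake_generate_transcript, seeds=[0, 1, 2])
best_score, best_seed = ranking[0]
print(f"best seed: {best_seed} (repetition score {best_score:.2f})")
```

A heuristic like this only narrows the candidates. Judging the emotional delivery still requires listening to the surviving takes.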
Scenema.ai uses the model internally for audio-first video generation. The workflow is to generate the voice performance first, then feed it into an audio-to-video pipeline (LTX 2.3, Wan 2.6, Seedance 2.0) that generates video matching the speech. The team notes that diffusion-generated speech sounds more natural and less robotic than autoregressive TTS output, even when compared with Gemini 3.1 Flash TTS, which is already more controllable than most closed TTS systems.
Prompting matters. Generic voice descriptions produce generic output. Specific, theatrical descriptions with action tags produce performances. The model is sensitive to prompt structure in the same way LTX 2.3 is for video generation.
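To make the generic-versus-specific contrast concrete, here is a hypothetical sketch of assembling a delivery prompt. The field layout and the bracketed action tags are assumptions made for illustration only; the source does not document Scenema Audio's actual prompt syntax, and `build_prompt` is not part of any real API.

```python
from typing import List


def build_prompt(voice: str, delivery: str, actions: List[str], line: str) -> str:
    """Assemble a delivery prompt (hypothetical format, not the model's spec)."""
    # Assumed convention: action tags are wrapped in square brackets.
    tags = " ".join(f"[{a}]" for a in actions)
    parts = [f"{voice}, {delivery}.", tags, f'"{line}"']
    # Skip empty parts so a prompt with no action tags has no stray spaces.
    return " ".join(p for p in parts if p)


# A generic description: little for the model to act on.
generic = build_prompt("A man", "angry", [], "Get out.")

# A specific, theatrical description with action tags.
specific = build_prompt(
    "A gravel-voiced man in his sixties",
    "rage held just under the surface, each word bitten off",
    ["slams fist on table", "sharp inhale"],
    "Get out.",
)

print(generic)
print(specific)
```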