Scenema Audio brings stage-direction control to open TTS with 8-step distillation
ScenemaAI released Scenema Audio, a zero-shot voice cloning model that distills LTX 2.3 into 8 sampling steps, runs 1.5× real-time on RTX 4090, and accepts inline `<action>` tags for emotional control.
ScenemaAI released Scenema Audio this week, a text-to-speech model that distills the LTX 2.3 audio architecture into an 8-step sampler. The model accepts inline `<action>` stage-direction tags to control emotional delivery, runs at 1.5× real-time on an RTX 4090, and fits in 16GB VRAM. It supports 13 languages and outputs 48kHz stereo audio.
The model uses Gemma 3 12B for text encoding and handles zero-shot voice cloning without per-speaker fine-tuning. According to the HuggingFace model card, it also generates matching environmental sounds alongside speech (footsteps, doors, background noise), so the output fits the narrative context rather than being an isolated voice.
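To make the throughput and format figures above concrete, here is a small back-of-the-envelope sketch. It assumes that "1.5× real-time" means 1.5 seconds of audio are produced per second of wall-clock time, which is the usual reading but is not spelled out in the announcement. The function names are illustrative and not part of any Scenema API.

```python
# Assumption: "1.5x real-time" = 1.5 seconds of audio generated per
# second of wall-clock time on the reference GPU (RTX 4090).
REAL_TIME_FACTOR = 1.5
SAMPLE_RATE_HZ = 48_000  # 48kHz output, as quoted in the article
CHANNELS = 2             # stereo


def estimated_generation_seconds(audio_seconds: float,
                                 rtf: float = REAL_TIME_FACTOR) -> float:
    """Estimate wall-clock seconds needed to synthesize a clip."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds / rtf


def sample_values(audio_seconds: float) -> int:
    """Count individual sample values in a 48kHz stereo clip."""
    return round(audio_seconds * SAMPLE_RATE_HZ * CHANNELS)


if __name__ == "__main__":
    for seconds in (6, 30, 60):
        wall = estimated_generation_seconds(seconds)
        print(f"{seconds:>3}s of audio: ~{wall:.1f}s to generate, "
              f"{sample_values(seconds):,} sample values")
```

Under that assumption, a 30-second line of dialogue would take roughly 20 seconds to generate, so the model keeps ahead of playback once generation has started.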
What stands out
- Stage-direction control via inline tags. Users specify emotional tone directly in the prompt with inline `<action>` tags. No retraining or prompt engineering required.
- 8-step distillation for real-time speed. The model cuts LTX 2.3's sampling to 8 steps, reaching 1.5× real-time on a single RTX 4090, which is fast enough for interactive applications.
- 16GB VRAM footprint. The full pipeline, including the Gemma 3 12B text encoder, runs on a single consumer GPU.
- Environment-sound synthesis. The model generates ambient audio that matches speech context, not just isolated voice.
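As a rough illustration of what inline stage directions look like from the caller's side, the sketch below separates angle-bracket tags from the spoken text of a prompt. The announcement does not list the tag vocabulary, so the tag names used here (`whispering`, `startled`) are made-up placeholders, and `split_directions` is a hypothetical helper rather than part of the model's tooling.

```python
import re

# Assumption: any <...> span in the prompt is a stage direction.
# The real tag grammar accepted by the model may differ.
TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_directions(prompt: str) -> tuple[list[str], str]:
    """Return (stage directions in order, spoken text without tags)."""
    directions = [name.strip() for name in TAG_PATTERN.findall(prompt)]
    # Replace each tag with a space, then collapse runs of whitespace.
    spoken = " ".join(TAG_PATTERN.sub(" ", prompt).split())
    return directions, spoken


if __name__ == "__main__":
    # Placeholder tags for illustration only.
    prompt = "<whispering> I think someone is at the door. <startled> Who's there?"
    directions, spoken = split_directions(prompt)
    print(directions)
    print(spoken)
```

A helper like this can be handy for counting the words that will actually be spoken, for example to estimate clip length before sending the full tagged prompt to the model.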
