Scenema Audio generates layered soundscapes and effects from text
Scenema Audio, a new open-weight model built on LTX-2.3, creates sound effects and ambient tracks from text descriptions. The tool requires roughly 40GB of VRAM for local use and ships with voice cloning capabilities.
Scenema Audio is an open-weight audio generation model from ScenemaAI that produces sound effects and ambient audio tracks from text prompts. Released this week on HuggingFace and GitHub, the model generates everything from discrete sound effects to layered background atmospheres — city streets, forest ambience, office noise — and can combine multiple sounds into a single composition. The model is built on LTX-2.3, the same foundation used in recent text-to-speech and voice cloning tools.
The model ships with voice cloning and voice design capabilities alongside its sound-effect generation. Users describe a sound or scene in natural language, and Scenema Audio renders the corresponding audio. A HuggingFace Space demo is live for browser-based testing, though local deployment requires approximately 40GB of VRAM — putting it out of reach for most consumer hardware.
The GitHub repository includes setup instructions and sample workflows, and the HuggingFace model card lists technical specs and example outputs. No formal benchmark comparisons or licensing details have been published yet. The 40GB VRAM floor may limit adoption until lighter checkpoints appear; community interest will likely drive quantized or GGUF conversions in the coming weeks.
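To see why quantized checkpoints matter here, it helps to run the back-of-the-envelope arithmetic. The sketch below is illustrative only: ScenemaAI has not published a parameter count, so the 20-billion-parameter figure is an assumption chosen because 20B weights at 16-bit precision occupy about 40GB, matching the stated VRAM floor. Activation memory and KV caches would add on top of this.

```python
def vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Estimate the VRAM needed to hold model weights, in GB.

    params_billion: parameter count in billions (assumed, not published).
    bits_per_weight: storage precision (16 = fp16/bf16, 8/4 = quantized).
    Returns weights-only footprint; activations add further overhead.
    """
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return bytes_total / 1e9


# Hypothetical 20B-parameter model at common precisions:
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{vram_gb(20, bits):.0f} GB")
```

Under these assumptions, an 8-bit quantization would roughly halve the footprint to ~20GB (within reach of a single high-end consumer GPU), and a 4-bit GGUF-style conversion would bring it near ~10GB, which is why lighter checkpoints tend to follow quickly after releases like this.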
