Langswap open-sources video dubbing pipeline combining Whisper, Gemma, and OmniVoice
Langswap released its full production video translation pipeline on GitHub, combining speech separation, Whisper ASR, Gemma-4-E2B translation, and OmniVoice synthesis to automate multilingual dubbing.

Langswap released its full production video translation pipeline on GitHub this week, opening the code that powers its dubbing service at langswap.app. The system automates multilingual video translation by separating speech from background audio, translating dialogue with length-matching constraints, and synthesizing new speech that preserves the original speaker's characteristics.
The workflow splits incoming audio into speech and background sound, leaving music and ambient noise untouched. Whisper handles speech recognition, with voice activity detection refining segment boundaries and assigning speaker labels. Translation runs through Gemma-4-E2B with a vowel-count check: if the translated text diverges too far in length from the source, the model retries once or twice to tighten the match. This constraint helps keep the dubbed audio from falling out of sync with lip movements.
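The vowel-count check described above can be sketched roughly as follows. This is an illustrative sketch, not Langswap's code: the function names, the 25% tolerance, the Latin-only vowel set, and the `translate` callable (standing in for a call to the translation model) are all assumptions made for the example.

```python
from typing import Callable

# Assumption: a Latin-script vowel set; a real multilingual pipeline
# would need per-language handling.
VOWELS = set("aeiouyAEIOUY")


def count_vowels(text: str) -> int:
    """Count vowels as a rough proxy for syllables, and so for spoken length."""
    return sum(1 for ch in text if ch in VOWELS)


def length_ok(source: str, translated: str, tolerance: float = 0.25) -> bool:
    """Return True if the vowel counts differ by at most `tolerance` (relative)."""
    src = count_vowels(source)
    if src == 0:
        return True  # nothing to compare against
    return abs(count_vowels(translated) - src) / src <= tolerance


def translate_with_length_check(
    source: str,
    translate: Callable[[str, int], str],  # hypothetical: (text, attempt) -> translation
    max_retries: int = 2,
    tolerance: float = 0.25,
) -> str:
    """Translate, retrying up to `max_retries` times if the length diverges."""
    target = count_vowels(source)
    best = translate(source, 0)
    for attempt in range(1, max_retries + 1):
        if length_ok(source, best, tolerance):
            break
        candidate = translate(source, attempt)
        # Keep whichever candidate sits closer to the source's vowel count.
        if abs(count_vowels(candidate) - target) < abs(count_vowels(best) - target):
            best = candidate
    return best
```

Passing the attempt number to `translate` lets a caller tighten the prompt on each retry, for example by asking for a shorter or longer rendering. If no retry lands within tolerance, the sketch falls back to the closest candidate rather than failing.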
Pipeline components
OmniVoice generates the new speech, using the original audio segment as a reference for prosody and speaker characteristics. The system then swaps the new audio track into the video and adds a watermark identifying it as a Langswap translation. The codebase includes modules for speech-to-text management, ASR with VAD, translation via llama.cpp, OmniVoice TTS, and FFmpeg video assembly.
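The audio swap in the final assembly step might look something like the sketch below, which builds an FFmpeg command that keeps the original video stream and replaces its audio with the dubbed track. The function names and file paths are hypothetical, and the actual Langswap assembly module may work differently. The watermark step is left out, since burning one in would require re-encoding the video rather than copying it.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_mux_command(video: PathLike, dubbed_audio: PathLike, output: PathLike) -> List[str]:
    """Build an ffmpeg command that swaps the audio track of `video`."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(video),          # input 0: original video
        "-i", str(dubbed_audio),   # input 1: dubbed speech mixed with background
        "-map", "0:v:0",           # take the video stream from input 0
        "-map", "1:a:0",           # take the audio stream from input 1
        "-c:v", "copy",            # copy video without re-encoding
        "-c:a", "aac",             # encode the new audio as AAC
        "-shortest",               # stop at the end of the shorter stream
        str(output),
    ]


def mux(video: PathLike, dubbed_audio: PathLike, output: PathLike) -> None:
    """Run the command; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_mux_command(video, dubbed_audio, output), check=True)
```

Separating command construction from execution keeps the argument list easy to inspect and test without FFmpeg installed.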
The maintainer notes that earlier iterations carried more proprietary logic around text-to-speech systems and length control, but the current release strips most of that complexity to simplify contributions and lower the barrier to forking or extending the pipeline. The repository is live at github.com/langswap-app/langswap.






