OmniNFT LoRA aligns audio and video in LTX-2 using reinforcement learning
OmniNFT is an open-weight LoRA that uses reinforcement learning to align audio and video generation in LTX-2, improving synchronization through modality-wise routing and gradient surgery.
OmniNFT is an open-weight LoRA from zghhui that trains LTX-2 to generate synchronized audio and video using reinforcement learning. The framework tackles the alignment problem between the audio and video modalities: keeping sound and motion in sync without sacrificing quality in either stream.
The approach uses three techniques to keep the two modalities coherent. Modality-wise advantage routing distributes reward signals separately to the audio and video branches during training. Layer-wise gradient surgery isolates gradients between the audio and video layers while preserving cross-modal interaction. Region-wise loss reweighting concentrates the training loss on the spatial and temporal regions where audio-visual sync matters most, typically motion boundaries and sound onsets.
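As a rough illustration of how the three techniques fit together, here is a minimal pure-Python sketch. It is not taken from the OmniNFT repository. The function names, the group-normalized form of the advantage, the `audio.` / `video.` / `cross.` parameter prefixes, and the `boost` weight are all assumptions made for this example, and scalars stand in for tensors.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    # Normalize rewards within one group of samples (zero mean, unit scale).
    # Assumption: a group-relative advantage; the real method may differ.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_advantages(audio_rewards, video_rewards):
    # Assumption: every sample carries one reward per modality.
    # Each branch gets an advantage computed only from its own rewards,
    # so a sample with good video but poor audio is not penalized on video.
    return {
        "audio": group_advantages(audio_rewards),
        "video": group_advantages(video_rewards),
    }


def gradient_surgery(audio_grads, video_grads):
    # Hypothetical naming scheme: parameters start with "audio.", "video."
    # or "cross.". Both dicts are assumed to share the same keys.
    # Modality-specific layers keep only the gradient of their own loss;
    # cross-modal layers receive the sum of both gradients.
    merged = {}
    for name in audio_grads:
        if name.startswith("audio."):
            merged[name] = audio_grads[name]
        elif name.startswith("video."):
            merged[name] = video_grads[name]
        else:
            merged[name] = audio_grads[name] + video_grads[name]
    return merged


def reweighted_loss(region_losses, sync_mask, boost=2.0):
    # Weighted mean of per-region losses: regions flagged as sync-critical
    # (for example a sound onset) count `boost` times, all others once.
    weights = [boost if flagged else 1.0 for flagged in sync_mask]
    weighted = sum(w * l for w, l in zip(weights, region_losses))
    return weighted / sum(weights)
```

For instance, `route_advantages([0.2, 0.8], [0.9, 0.1])` gives the first sample a negative audio advantage and a positive video advantage, so the two branches are pushed in different directions for the same sample. In a real training loop the scalars would be tensors and the masks would come from whatever onset or motion detector the pipeline uses.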
Compatibility and availability
The LoRA builds on LTX-2, the earlier version of Lightricks' video diffusion model. The team has not yet released a version compatible with LTX-2.3, the current checkpoint. The GitHub repository includes training code and inference scripts. The Hugging Face card lists the LoRA weights and example outputs showing improved lip-sync and ambient sound alignment compared to the base LTX-2 model.
The weights run locally. Practitioners working on text-to-audio-video pipelines can fine-tune the LoRA on custom datasets or adapt the gradient surgery method to other multimodal diffusion models. The code is MIT-licensed and available on GitHub and Hugging Face.
