Tencent CogOmniControl turns sparse prompts into video via vision-language parsing
CogOmniControl from Tencent pairs a vision-language model with a diffusion transformer to turn mixed text-image prompts into video, trained with reinforcement learning to handle sparse or abstract input.

CogOmniControl is a multimodal video-generation system from Tencent that accepts mixed text-and-image prompts and outputs video aligned to those conditions. The system pairs CogVLM, a vision-language model that interprets user intent from sparse or abstract input, with CogOmniDiT, a diffusion transformer that executes the generation step. The two components are trained together using reinforcement learning so that the generated video stays consistent with every condition supplied, whether text, image, or both.
The project was announced on GitHub this week and is listed as forthcoming, with no weights or inference code available yet. The system functions as a control layer for video synthesis: it parses creative intent from incomplete prompts (text alone, image alone, or both) and translates that intent into structured generation instructions.
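Because no code has been released, the two-stage flow described above can only be illustrated with stand-ins. The sketch below is a minimal, hypothetical outline of that flow, assuming a parser stage that turns a sparse prompt into a structured plan and a generator stage that consumes the plan. Every name in it (`Prompt`, `GenerationPlan`, `parse_intent`, `generate_video`) is invented for illustration and is not Tencent's API; both stages are stubs that run no model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prompt:
    """A user request: text alone, image alone, or both (hypothetical type)."""
    text: Optional[str] = None
    image_path: Optional[str] = None  # assumption: a path stands in for image data

    def modalities(self) -> List[str]:
        """List which conditions were supplied; reject an empty prompt."""
        present = []
        if self.text:
            present.append("text")
        if self.image_path:
            present.append("image")
        if not present:
            raise ValueError("a prompt needs text, an image, or both")
        return present


@dataclass
class GenerationPlan:
    """Hypothetical structured instructions handed from parser to generator."""
    conditions: List[str]
    description: str
    reference_image: Optional[str] = None


def parse_intent(prompt: Prompt) -> GenerationPlan:
    """Stand-in for the vision-language stage.

    A real system would run a model to fill in missing detail; this stub
    only records which conditions were given and falls back to a fixed
    description when no text is present.
    """
    conditions = prompt.modalities()
    description = prompt.text or "motion inferred from the reference image"
    return GenerationPlan(conditions, description, prompt.image_path)


def generate_video(plan: GenerationPlan, num_frames: int = 4) -> List[str]:
    """Stand-in for the diffusion stage: returns labelled placeholder frames."""
    tag = "+".join(plan.conditions)
    return [f"frame {i} [{tag}]: {plan.description}" for i in range(num_frames)]


if __name__ == "__main__":
    plan = parse_intent(Prompt(text="a fox running", image_path="ref.png"))
    for frame in generate_video(plan, num_frames=2):
        print(frame)
```

The point of the intermediate `GenerationPlan` is that the generator never sees the raw prompt: whichever modalities the user supplies, the parser normalizes them into one structure, which is one plausible reading of what a "control layer" means here.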
What stands out
- 01 Sparse-prompt handling. CogVLM extracts detailed generation instructions from minimal or vague user input, filling in gaps that would otherwise require manual prompt engineering.
- 02 Multimodal conditioning. The system accepts text, image, or combined prompts in a single request, then aligns video output to all provided conditions simultaneously.
- 03 Reinforcement learning for alignment. CogOmniDiT uses RL during training to coordinate multiple control signals, aiming to reduce drift between text semantics and visual reference in the final video.
- 04 No public weights yet. The GitHub page is a placeholder, with no model card, no inference script, and no release timeline. Practitioners will need to wait for Tencent to ship the artifacts.
- 05 Tencent lineage. The project builds on Tencent's earlier CogVLM work, extending vision-language understanding into the video-generation control loop rather than treating it as a separate preprocessing step.