ByteDance Bernini unifies video generation and editing from text, image, and video inputs in one model
ByteDance released Bernini, a multimodal video generator and editor that handles text-to-video, subject-to-video, video-to-video, and reference-guided video editing in a single DiT-based architecture.
ByteDance released Bernini, a video generation and editing model that runs four modes in one architecture: text-to-video (T2V), subject-to-video (S2V), video-to-video (V2V), and reference-guided video-to-video (RV2V). The model combines a multimodal large language model (MLLM) with a diffusion transformer (DiT) backbone. Weights are available on HuggingFace, and a ComfyUI custom node for fp8-scaled inference appeared within hours.
Bernini's RV2V mode reportedly preserves face identity 20 percent better than the nearest competitor, though the baseline and metric remain unspecified. The four-mode design lets users start from text, a still image, or existing video, then steer the output with a reference frame. That is useful for keeping a character consistent across shots or applying a specific look to generated footage. The S2V mode animates a still object or character, while V2V handles style transfer and motion editing on existing clips.
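To make the four-mode taxonomy concrete, here is a minimal sketch of how the inputs a user supplies could map onto T2V, S2V, V2V, and RV2V. This is not Bernini's API: the function `select_mode` and its parameter names are hypothetical, and the rule that a reference frame requires a source video is an assumption drawn from the mode descriptions above.

```python
from typing import Optional


def select_mode(
    prompt: str,
    subject_image: Optional[str] = None,    # hypothetical: path to a still image
    source_video: Optional[str] = None,     # hypothetical: path to an existing clip
    reference_image: Optional[str] = None,  # hypothetical: path to a reference frame
) -> str:
    """Return the mode name implied by the supplied inputs (illustrative only)."""
    if not prompt:
        raise ValueError("this sketch assumes every mode takes a text prompt")
    if reference_image is not None and source_video is None:
        # Assumption: a reference frame only steers edits of an existing video.
        raise ValueError("a reference frame needs a source video in this sketch")
    if source_video is not None:
        # Existing footage: reference-guided edit if a reference is given.
        return "RV2V" if reference_image is not None else "V2V"
    if subject_image is not None:
        # A still subject to animate.
        return "S2V"
    # Text only.
    return "T2V"


if __name__ == "__main__":
    print(select_mode("a fox running through snow"))
    print(select_mode("make her wave", subject_image="portrait.png"))
    print(select_mode("turn it into watercolor", source_video="clip.mp4"))
    print(select_mode("swap in this face", source_video="clip.mp4",
                      reference_image="face.png"))
```

A real pipeline may accept combinations this sketch rejects, such as a subject image together with a source video. The point is only that one checkpoint serves all four input patterns.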
The MLLM integration lets Bernini parse complex text prompts and spatial instructions before passing them to the diffusion backbone. Other multimodal video models use the same pattern, but few ship all four generation modes in a single checkpoint. A custom node repository on HuggingFace hosts fp8-scaled weights aimed at consumer GPUs. The quantization should make Bernini usable on cards with 24 GB of VRAM, though ByteDance has not published official hardware requirements or inference benchmarks. The model card does not yet include sample outputs, training data details, or a technical paper, and license terms are unspecified.
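The VRAM point comes down to simple arithmetic: fp8 stores each weight in one byte, while bf16 or fp16 uses two, so the weights take about half the memory. The sketch below shows that calculation. The parameter counts are hypothetical placeholders, since Bernini's size is not stated. The figures cover weights only and ignore activations, the text encoder, the VAE, and the small overhead of any fp8 scale factors.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Hypothetical model sizes; the source does not give Bernini's parameter count.
    for billions in (5, 14, 30):
        n = billions * 1e9
        bf16 = weight_memory_gib(n, 2)  # 2 bytes per parameter
        fp8 = weight_memory_gib(n, 1)   # 1 byte per parameter
        print(f"{billions}B params: bf16 ~{bf16:.1f} GiB, fp8 ~{fp8:.1f} GiB")
```

Under these assumptions, a model in the low tens of billions of parameters would not fit its bf16 weights in 24 GB, but its fp8 weights would. Whether that applies to Bernini depends on numbers ByteDance has not published.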




