Jina Embeddings v5 Omni adds multimodal support by training only 0.35% of weights
Jina AI's frozen-encoder composition method extends its text embedding models to multimodal inputs by training only connecting layers, leaving 99.65% of weights unchanged.

Jina Embeddings v5 Omni is a pair of multimodal embedding models from Jina AI that encode text, image, audio, and video into a unified semantic space. The models extend the existing Jina Embeddings v5 Text suite by adding pretrained encoders for non-text media and training only the connecting components, which account for 0.35% of total weights, while the backbone text model and the new media encoders stay frozen. Text inputs produce embeddings identical to the original v5 Text models, so practitioners can drop in the multimodal version without breaking existing text pipelines.
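To make the arrangement concrete, here is a minimal PyTorch sketch of the composition pattern: a pretrained media encoder and the text backbone are frozen, and only a small adapter that projects encoder features into the backbone's input space receives gradients. All module names and shapes here are illustrative stand-ins, not Jina's actual architecture.

```python
import torch
import torch.nn as nn

# Frozen text backbone: stands in for the v5 Text language model.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
    num_layers=4,
)
# Frozen media encoder: stands in for a pretrained image/audio/video encoder.
media_encoder = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
# Adapter: the only trained component, projecting media features into the
# backbone's input space.
adapter = nn.Linear(768, 1024)

for module in (backbone, media_encoder):
    for p in module.parameters():
        p.requires_grad = False  # composition, not fine-tuning

# Only adapter weights reach the optimizer.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(
    p.numel() for m in (backbone, media_encoder) for p in m.parameters()
)
# A small fraction here; the real models report 0.35%.
print(f"trainable fraction: {trainable / total:.4%}")

# Forward pass: frozen features -> trained projection -> frozen backbone -> pooled embedding.
features = media_encoder(torch.randn(2, 16, 768))
embedding = backbone(adapter(features)).mean(dim=1)  # shape (2, 1024)
```

Because gradients flow only through the adapter, training cost and memory stay close to that of the adapter alone, which is what makes the 0.35% figure meaningful in practice.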
The approach, called frozen-encoder model composition, adapts non-text encoders to feed a language model that generates embeddings for all input types. Because only the adapter layers are trained, the method sidesteps full-parameter retraining and its computational cost, and because the backbone is frozen, the text model's behavior is preserved exactly, a property that matters for production systems already tuned around Jina's v5 Text embeddings.

For open-source practitioners running embedding workloads on local hardware, extending a known text model to multimodal inputs without retraining the entire stack or losing text-embedding consistency is a practical win. Evaluations show performance competitive with larger state-of-the-art multimodal embedding models, and the two models in the v5 Omni suite correspond to the two sizes of the original Jina Embeddings v5 Text release.
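Before routing production traffic through the multimodal model, a quick sanity check is to confirm that text embeddings really do match. The sketch below assumes the models are published on Hugging Face and load through sentence-transformers, as earlier Jina releases do; the model IDs are hypothetical placeholders, not confirmed release names.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["frozen-encoder model composition", "multimodal retrieval"]

# Hypothetical model IDs, used for illustration only.
text_model = SentenceTransformer("jinaai/jina-embeddings-v5-text", trust_remote_code=True)
omni_model = SentenceTransformer("jinaai/jina-embeddings-v5-omni", trust_remote_code=True)

a = text_model.encode(texts, normalize_embeddings=True)
b = omni_model.encode(texts, normalize_embeddings=True)

# If the text path is truly unchanged, the difference should be zero
# up to floating-point noise.
print(np.max(np.abs(a - b)))
```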