jina-embeddings-v5-omni Released! A Lightweight Omni-Modal Embedding Model

jina-embeddings-v5-omni has been officially released, extending our v5-text embedding model to images, audio, and video. The text side remains unchanged: the text vectors produced by v5-omni are byte-for-byte identical to those of v5-text, so no existing indexes need to be rebuilt.

jina-embeddings-v5-omni-small achieves an average score of 53.93 across four modalities, nearly matching LCO-7B (54.43) with 1/5.7 of the parameters.
jina-embeddings-v5-omni-nano, with 0.95B parameters, still delivers competitive performance on document retrieval.

Links:

https://huggingface.co/collections/jinaai/jina-embeddings-v5-omni

https://modelscope.cn/organization/jinaai

https://arxiv.org/abs/2605.08384

https://jina.ai/embeddings/

Open-source omni embedding models (covering text, image, audio, and video) on the Pareto frontier.

v5-omni-small (1.57B) has less than 1/5 the parameters of LCO-7B (8.93B), yet its average score nearly matches it (53.93 vs. 54.43); v5-omni-nano (0.95B) is even smaller, but scores 8.9 points higher than LanguageBind (1.14B). Baselines for comparison include LanguageBind, Omni-Embed-Nemotron-3B, LCO-Embedding-Omni-3B, and LCO-Embedding-Omni-7B.
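As a quick sanity check on the size comparison, the ratio can be recomputed from the parameter counts quoted above. This is only illustrative arithmetic; the dictionary and the `size_ratio` helper are names made up for this sketch.

```python
# Total parameter counts in billions, as quoted in the paragraph above.
params_b = {
    "v5-omni-small": 1.57,
    "v5-omni-nano": 0.95,
    "LanguageBind": 1.14,
    "LCO-Embedding-Omni-7B": 8.93,
}


def size_ratio(larger: str, smaller: str) -> float:
    """Return how many times more parameters `larger` has than `smaller`."""
    return params_b[larger] / params_b[smaller]


ratio = size_ratio("LCO-Embedding-Omni-7B", "v5-omni-small")
print(f"LCO-7B has {ratio:.1f}x the parameters of v5-omni-small")
```

The same helper works for any other pair in the table, for example `size_ratio("LanguageBind", "v5-omni-nano")`.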

Performance across different modalities.

Broken down by modality, we evaluated on MMTEB (text), MIEB (image), MMEB-Video (video), and MAEB (audio).

v5-omni-small scores 67.0 on text, leading all Omni models. This score is inherited directly from v5-text-small without any loss.
On image tasks, it scores 56.05, with a clustering task score of 84.57, which is the highest on the board.
On audio, it scores 51.46, nearly tying with LCO-7B (52.37), and its audio classification score of 55.89 is also top-ranked.
The weakness is in video: 41.20 vs. LCO-7B's 47.41. This is the largest gap relative to end-to-end trained models, since temporal reasoning relies more heavily on end-to-end training.
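The four per-modality scores above can be averaged to recover the headline number. The sketch below uses `Decimal` so the rounding is exact; it assumes the headline average is an unweighted mean rounded half-up to two decimals, which is an assumption of this example rather than a documented procedure.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-modality scores for v5-omni-small, as quoted above.
scores = {
    "text": Decimal("67.0"),
    "image": Decimal("56.05"),
    "audio": Decimal("51.46"),
    "video": Decimal("41.20"),
}

# Unweighted mean over the four modalities (exact value: 53.9275).
raw_average = sum(scores.values()) / len(scores)

# Round half-up to two decimals (assumed rounding rule).
average = raw_average.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(average)
```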

Further breaking down the four modalities into 13 task categories, the stars on the chart mark tasks where v5-omni-small outperforms the strongest open-source baselines (which generally have 3-9x more parameters).

Four leading tasks: Image Classification (68.55 vs. 64.30), Image Clustering (84.57 vs. 83.24), Multilingual Image Retrieval (65.88 vs. 61.99), and Audio Classification (55.89 vs. 53.39).
Main gaps: Video Retrieval (27.82 vs. 58.73) and Composed Retrieval/VQA (44.23 vs. 53.40). This is consistent with the modality-level results: video remains an area for improvement.
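To make the leads and gaps easier to compare, the margins can be computed directly from the score pairs listed above. This is plain arithmetic on the quoted numbers; the `results` list and `margins` dictionary are names chosen for this sketch.

```python
# (task, v5-omni-small score, strongest open-source baseline score), as quoted above.
results = [
    ("Image Classification", 68.55, 64.30),
    ("Image Clustering", 84.57, 83.24),
    ("Multilingual Image Retrieval", 65.88, 61.99),
    ("Audio Classification", 55.89, 53.39),
    ("Video Retrieval", 27.82, 58.73),
    ("Composed Retrieval/VQA", 44.23, 53.40),
]

# Positive margin: v5-omni-small leads. Negative margin: the baseline leads.
margins = {task: round(ours - baseline, 2) for task, ours, baseline in results}

for task, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:<30}{margin:+.2f}")
```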

Looking specifically at document retrieval (ViDoRe-in-MIEB), v5-omni-small, activating only 0.92B text + image parameters, achieves a score of 79.08, outperforming LCO-3B (78.24, with 4.07B activated parameters).

v5-omni-nano is even more extreme: with 0.31B activated parameters, it scores 70.05, nearly doubling LanguageBind's score of 37.33. Nemotron-3B currently leads at 85.64, but has 5.1 times the parameters of v5-omni-small.

Model Architecture

The approach of v5-omni is to keep the text-side base, v5-text, and the newly added visual and audio encoders entirely frozen. Only a small, trainable projection layer is inserted in the middle. Its job is to align representations from different modalities into the semantic space of v5-text. The three towers are structured as follows:

Vision: The base is a Qwen3.5 vision encoder (modified from SigLIP2), using 2x2 spatial merging to reduce the token count to 1/4 of the original. The entire encoder is frozen; only the final `fc_vision_2` layer is replaced with a randomly initialized projection layer. This layer's purpose is to align visual features to the input dimension of v5-text, and it's the only part of the vision tower involved in training.
Audio: The base is a Qwen2.5-Omni encoder (modified from Whisper-large-v3), similarly frozen in its entirety. A single randomly initialized fc_audio layer projects the 1280-dimensional output to v5-text's input dimension.
Video: No new encoder is introduced. Instead, video is treated as a sequence of visual frames fed to the vision tower, with an optional audio segment extracted from the video if needed.
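To illustrate what 2x2 spatial merging does to the token count, here is a minimal pure-Python sketch. It assumes the merge concatenates the four feature vectors in each 2x2 block, which is one common way to do it; the actual merge operation inside the vision encoder is not specified in this post, and `merge_2x2` is a hypothetical helper.

```python
def merge_2x2(grid):
    """Merge each 2x2 block of patch features into one token by concatenation.

    `grid` is a list of rows, each row a list of feature vectors (lists of floats).
    Both the height and the width of the grid must be even.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("grid height and width must both be even")
    merged = []
    for r in range(0, height, 2):
        for c in range(0, width, 2):
            # Concatenate the four neighbouring patch vectors into one longer vector.
            merged.append(
                grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
            )
    return merged


# Toy 4x6 patch grid with 3-dimensional features: 24 patches in total.
grid = [[[float(r), float(c), 1.0] for c in range(6)] for r in range(4)]
tokens = merge_2x2(grid)
print(len(tokens), len(tokens[0]))
```

Under this assumption, the sequence length drops to a quarter while each token becomes four times wider; a projection layer then maps the result to the text model's input dimension.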

On the task side, v5-omni directly inherits the four task-specific LoRA adapters from v5-text (retrieval, text matching, classification, clustering). Each task variant trains its own projection layer weights separately.

This "freeze + project" architecture offers a direct benefit: complete modularity. For text only, load just the text weights (memory footprint is exactly the same as v5-text). For text-image tasks, attach the image tower. Audio and video towers are mounted on demand. Only when running in full omni-mode are all towers active together.

The only components actually undergoing training are the tiny projection layers in the middle, which account for 0.35% of the total weights. The vision, audio, and text towers are all frozen. Task-specific LoRA adapters handle retrieval, classification, clustering, and text matching respectively.

Feature	`jina-embeddings-v5-omni-small`	`jina-embeddings-v5-omni-nano`
Base Text Model	`jina-embeddings-v5-text-small` (Qwen3-0.6B)	`jina-embeddings-v5-text-nano` (EuroBERT-210m)
Total Parameters	~1.57B	~0.95B
Supported Modalities	Text, Image, Audio, Video, PDF	Text, Image, Audio, Video, PDF
Embedding Dimension	1024	768
Matryoshka Dimensions	32, 64, 128, 256, 512, 768, 1024	32, 64, 128, 256, 512, 768
Context Length	32768 tokens	8192 tokens
Vision Encoder	Qwen3.5-2B ViT (SigLIP2)	SigLIP2 Base
Audio Encoder	Whisper-large-v3	Whisper-large-v3
Task Adapters	4 (Retrieval, Text Matching, Classification, Clustering)	4 (Retrieval, Text Matching, Classification, Clustering)
Text Compatibility	Byte-for-byte identical with `v5-text-small`	Byte-for-byte identical with `v5-text-nano`
Trainable Parameters	~18M projection layers (0.35%)	~7M projection layers (0.35%)
Pooling Method	Last-token	Last-token
Model License	CC BY-NC 4.0	CC BY-NC 4.0

Quick Start

Elasticsearch (Elastic Inference Service)

If you're already using jina-embeddings-v5-text in Elasticsearch, your existing text indexes are out-of-the-box compatible with v5-omni. The Omni model produces a vector for text input that is byte-for-byte identical to v5-text: same input, same vector. No re-embedding, no index rebuilding required. To also search images, audio, and video, simply create a new v5-omni index and write your multimodal content into it.

Create a semantic_text index using v5-omni as the inference endpoint. EIS will automatically select the corresponding LoRA adapter during indexing and retrieval:

PUT multimodal-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-omni-small"
      }
    }
  }
}

Write text, images (base64 data URI), audio, and video into the same field and the same index:

// Ingest text
POST multimodal-semantic-index/_doc
{
  "content": "'Kraft Dinner' is what Canadians call macaroni and cheese when prepared from a kit."
}

// Ingest image (base64)
POST multimodal-semantic-index/_doc
{
  "content": "data:image/png;base64,iVBORw0KGgoAAAAN..."
}
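If your images live on disk, you first need to turn them into base64 data URIs like the one above. A minimal sketch using only the Python standard library follows; `to_data_uri` and `file_to_data_uri` are helper names invented for this example, not part of any Elasticsearch or Jina client.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI with the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Read a file and encode it, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f"cannot guess MIME type for {path!r}")
    return to_data_uri(Path(path).read_bytes(), mime_type)


# The 8-byte PNG file signature encodes to the familiar "iVBORw0KGgo" prefix.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "image/png")
print(uri)
```

The string returned by `file_to_data_uri("photo.png")` can be used as the value of the `content` field in the ingest request.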

Search across all modalities using a single text query:

GET multimodal-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Was bedeutet 'Kraft Dinner' für Kanadier?"
    }
  }
}

Jina Embedding API

curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.query",
    "dimensions": 1024,
    "input": ["What does this image show?"],
    "images": ["data:image/png;base64,..."]
  }'

Please visit jina.ai/embeddings to get an API Key.
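For Python users, the same request can be assembled with the standard library. The sketch below simply mirrors the fields of the curl call above and makes no further claims about the API; `build_embedding_request` is a hypothetical helper, and actually sending the request requires a real API key and network access.

```python
import json
import urllib.request

API_URL = "https://api.jina.ai/v1/embeddings"


def build_embedding_request(api_key, texts, images, dimensions=1024):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "jina-embeddings-v5-omni-small",
        "task": "retrieval.query",
        "dimensions": dimensions,
        "input": texts,
        "images": images,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_embedding_request(
    "YOUR_API_KEY",
    ["What does this image show?"],
    ["data:image/png;base64,..."],
)

# To send it (needs a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```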

Hugging Face

import torch
from PIL import Image
from sentence_transformers import SentenceTransformer

# Load the retrieval variant in bfloat16 to reduce memory use
model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
)

# Text embedding (identical to v5-text)
text_emb = model.encode(
    "What is knowledge distillation?", prompt_name="query"
)

# Image embedding
img = Image.open("photo.jpg")
img_emb = model.encode(img)

# Cross-modal similarity between the text query and the image
similarity = model.similarity(text_emb, img_emb)

Training Methodology

We call this architecture frozen-encoder model composition. The approach takes a sufficiently strong text embedding model as a base, attaches pre-trained vision and audio encoders to it, and leaves only a small, trainable projection layer in between. Everything else is frozen.

Only 0.35% of the joint model's weights are trained. This yields three key benefits:

Text performance is untouched: the same input produces the same vector, down to the byte.
Fast, memory-efficient training: training only the projection layers is 1.8-3.9x faster and uses 42%-64% less memory than full training.
Modularity: individual towers can be loaded independently.

The chart above compares the time cost for projection-layer training vs. full training on 4x H100 GPUs, batch size 256, for 15K steps. The speedup is most dramatic on the audio side: small achieves a 3.2x speedup (154 min vs. 497 min), and nano achieves a 3.9x speedup (112 min vs. 441 min). Memory savings of 42%-64% are realized because frozen encoders do not need to store gradients and optimizer states.
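The audio-side speedups follow directly from the timings quoted above. A small sketch of the arithmetic, where the `timings` dictionary and the `speedup` helper are names chosen for this example:

```python
# Audio-side training time in minutes: (projection-only, full training), as quoted above.
timings = {
    "small": (154, 497),
    "nano": (112, 441),
}


def speedup(model: str) -> float:
    """Full-training time divided by projection-only training time."""
    projection_minutes, full_minutes = timings[model]
    return full_minutes / projection_minutes


for name in timings:
    print(f"{name}: {speedup(name):.1f}x faster")
```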

v5-omni fully inherits the Matryoshka representation dimension support from v5-text. Image and audio vectors suffer virtually no loss under dimension truncation, while video vectors show more noticeable decay at smaller dimensions.
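In practice, using a smaller Matryoshka dimension usually means keeping the leading components of the full vector and re-normalizing. The sketch below shows that common recipe on a toy 8-dimensional vector; it is an assumption that v5-omni embeddings are meant to be truncated this way, and `truncate_and_normalize` is a hypothetical helper, not a library function.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy stand-in for a full embedding (real ones have 768 or 1024 dimensions).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.2, 0.05, 0.0]
small = truncate_and_normalize(full, 4)
print(small)
```

Queries and documents should be truncated to the same dimension before comparing them, since vectors of different lengths cannot be compared with a dot product.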

Radar summary of v5-omni performance vs. strongest baselines.

Summarizing the four modalities onto a single radar chart, v5-omni results are plotted against the strongest baselines. v5-omni-small (1.57B) closely matches or surpasses baselines on text, image, and audio. Video is the only area with a clear dip on the radar chart, marking a key focus for our next version.

Conclusion

This is Jina's first foray into omni-modal embedding models. We wanted to approach the problem from a different angle: must a multimodal embedding model be trained end-to-end?

The answer from v5-omni is: not necessarily.

By freezing the text base and training just 0.35% of the weights, v5-omni catches up to models with 5-7x its parameters on text, image, and audio. Our key insight is that composition trumps retraining. The truly hard part is training a strong enough text encoder first. Once that is done, attaching vision and audio via lightweight projection layers comes at almost zero cost.

But the most notable aspect of this release is not just the benchmarks. It's the direct, practical benefit for production users from this frozen-base design: Your existing v5-text indexes require absolutely no changes.

If you are already using v5-text, just switch your inference endpoint to v5-omni. The same query yields the same vector, byte-for-byte. You gain image, audio, and video retrieval capability without re-embedding a single piece of data. This is our view on upgrading to multimodal retrieval: it should be an in-place upgrade, not a migration project.

jina-embeddings-v5-omni-small is the strongest open-source Omni embedding model under 2 billion parameters. jina-embeddings-v5-omni-nano maintains competitive omni-modal retrieval capability at the sub-1-billion parameter scale.

Both models are now available on Hugging Face and the Jina Search Foundation API, and can also be used directly through Elasticsearch's native inference endpoints.