Elastic acquired Jina AI in late 2025, and jina-embeddings-v5-omni is now available on the Elastic Inference Service in both small and nano variants. The model handles text, images, audio, and video in a single shared embedding space, so you can query across all media types with one index and one query.
One index for everything you can't search today
You know this situation: something exists somewhere (a PDF attachment, a meeting recording, or one of 120 files all named “weekly stakeholder presentation”), but your search engine can only work with text and can’t find it.
Today, building multimodal search means accepting one of two compromises. The first is using a separate embedding model and index per modality, then somehow ranking and merging results at query time. The second is a single large multimodal model, but those tend to run to 7 billion parameters or more, are slow and expensive, and the frontier ones are closed-weight, so you cannot run them locally or inspect what is inside.
jina-embeddings-v5-omni takes a different path: a compact model family that maps all four modalities into the same vector space, so a text query can directly retrieve a relevant video frame, audio clip, or scanned document, with no cross-index merging needed.
Ranked results for the text query "cat" across 28 scene embeddings from the Breakfast at Tiffany's trailer. The cat scene ranks first.
To demonstrate video search, the Elastic team took the 1961 Breakfast at Tiffany's trailer (158 seconds), split it into 28 scenes using pyscenedetect, and embedded each scene with jina-embeddings-v5-omni-small into a single Elasticsearch index. Querying with the word "cat" returned the cat scene as the top result. Querying "kiss" returned only kiss scenes. Both results came from plain-text queries, with no video-specific pipeline.
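The ranking step in that demo amounts to comparing one query vector against every scene vector. Below is a minimal sketch of that comparison in plain Python. The three-dimensional vectors and the scene IDs are made-up placeholders rather than real model output, and in practice Elasticsearch performs this search for you.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_scenes(query_vec, scene_vecs):
    """Return (scene_id, score) pairs sorted by descending similarity."""
    scored = [(scene_id, cosine(query_vec, vec)) for scene_id, vec in scene_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings standing in for real scene embeddings.
scene_vecs = {
    "scene_03_taxi": [0.9, 0.1, 0.0],
    "scene_12_cat": [0.1, 0.9, 0.2],
    "scene_21_kiss": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the text query "cat".
query_cat = [0.0, 1.0, 0.1]

ranking = rank_scenes(query_cat, scene_vecs)
for scene_id, score in ranking:
    print(f"{scene_id}: {score:.3f}")
```

Because text and video share one vector space, nothing in this ranking code needs to know which modality produced which vector.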
The same principle extends across every modality:
- Audio → image: Speaking "meow" into the model produces an embedding that retrieves cat images from the dataset, since both audio and images share the same vector space.
- Image → document: Uploading a photo of an invoice finds matching invoices in a document collection, without any OCR or text extraction step.
- Multimodal query: A sketch of a car combined with the text "white" retrieves images of white cars, with both modalities folded into a single query vector.
- Text → music genre: A text description of a genre returns matching audio clips, useful for cataloguing media libraries.
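The article does not say how the sketch and the text are folded into one query vector. If you only have one embedding per input, one common client-side approximation is a weighted mean of the unit-normalized vectors. The sketch below shows that approximation. It is an assumption for illustration, not the model's own fusion method, and the vectors are placeholders.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def fuse(vectors, weights=None):
    """Weighted mean of unit vectors, re-normalized to unit length."""
    if weights is None:
        weights = [1.0] * len(vectors)
    units = [normalize(v) for v in vectors]
    dim = len(units[0])
    mixed = [sum(w * u[i] for w, u in zip(weights, units)) for i in range(dim)]
    return normalize(mixed)

# Hypothetical placeholder embeddings, not real model output.
sketch_vec = [1.0, 0.0, 0.0]  # stands in for the car-sketch embedding
text_vec = [0.0, 2.0, 0.0]    # stands in for the embedding of "white"

query_vec = fuse([sketch_vec, text_vec])
print(query_vec)
```

Normalizing before averaging keeps one input from dominating just because its raw vector is longer. The optional weights let you lean toward the image or the text.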
On the Charades-STA benchmark for moment retrieval inside video, v5-omni-small scores 55.57. ByteDance's Seed 1.6, a closed-weight model, scores 29.3. The paper notes that moment retrieval (finding the right segment inside a longer video) is where the omni model particularly shines.
Benchmarks: best open-weight model under 5B parameters

Charades-STA (video moment retrieval). v5-omni-small scores 55.57 with under 2B parameters; the next best models use 7–9B.
v5-omni-small was tested across four standard benchmarks: MMTEB for text, MIEB for images, MMEB for video, and MAEB for audio. Its average score across all four is 53.93, the highest of any open-weight model under 5 billion parameters.
On visual document retrieval (ViDoRe benchmark), v5-omni-small, using under 1 billion active parameters, scores better than a leading 3-billion-parameter model and close to a 7-billion-parameter model nearly eight times its size. For text-only queries, it inherits the full jina-embeddings-v5-text baseline, which already leads its size class on MMTEB, making it the strongest text performer of any comparable omni model.
Elasticsearch integration: backwards-compatible and storage-efficient
Because the text backbone in v5-omni is completely unchanged from v5-text, the model produces bit-identical text embeddings. If you already have a text index built on jina-embeddings-v5-text, you can add images, audio, and video to it without rebuilding the index or re-embedding any existing documents.
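As a rough illustration of what a shared index can look like, the sketch below builds an Elasticsearch mapping and a kNN search body as plain Python dictionaries. The field names (`media_type`, `source_uri`, `embedding`) and the dimension count of 1024 are assumptions chosen for the example. Check the model documentation for the real output size, and keep your existing field names if you are extending a v5-text index.

```python
# Assumed embedding size for illustration only; use the model's documented value.
ASSUMED_DIMS = 1024

# One dense_vector field holds embeddings for every modality.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "media_type": {"type": "keyword"},   # e.g. "text", "image", "audio", "video"
            "source_uri": {"type": "keyword"},   # hypothetical pointer to the original file
            "embedding": {
                "type": "dense_vector",
                "dims": ASSUMED_DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

def knn_query(query_vector, k=10, num_candidates=100, media_type=None):
    """Build a kNN search body, optionally restricted to one media type."""
    if len(query_vector) != ASSUMED_DIMS:
        raise ValueError(f"expected {ASSUMED_DIMS} dimensions, got {len(query_vector)}")
    knn = {
        "field": "embedding",
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }
    if media_type is not None:
        knn["filter"] = {"term": {"media_type": media_type}}
    return {"knn": knn, "_source": ["media_type", "source_uri"]}

# Search across all modalities, then only across video scenes.
body_all = knn_query([0.0] * ASSUMED_DIMS)
body_video = knn_query([0.0] * ASSUMED_DIMS, k=5, media_type="video")
```

The same query body works for any modality because the query vector, wherever it came from, is compared against a single vector field. The optional filter is only there for cases where you want to narrow results to one media type.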
v5-omni also inherits both of Elasticsearch's major storage optimizations:
- Better Binary Quantization (BBQ): Binarizes vectors to achieve 93% storage reduction with less than 3% accuracy loss. See the BBQ documentation for configuration details.
- Matryoshka representation learning: Embeddings can be truncated to as few as 32 dimensions. Truncation sensitivity varies by modality; video is more sensitive than text or images, so check the trade-off charts before picking a dimension budget.
Truncating to 256 dimensions and applying binary quantization together cut the index footprint substantially while retaining most retrieval quality.
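To make the combined effect concrete, here is a minimal sketch of Matryoshka-style truncation followed by plain sign binarization. It is a simplification, not Elastic's BBQ implementation, which does more than keep sign bits and is configured inside Elasticsearch rather than in client code. The full dimension count of 1024 and the synthetic input vector are assumptions for illustration.

```python
import math

def truncate(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def binarize(vec):
    """Pack one sign bit per dimension (1 if the component is positive) into bytes."""
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

FULL_DIMS = 1024  # assumed full embedding size, for illustration only
vec = [math.sin(i * 0.37) for i in range(FULL_DIMS)]  # synthetic stand-in vector

small = truncate(vec, 256)
code = binarize(small)

float32_bytes = FULL_DIMS * 4  # size of the untruncated float32 vector
print(f"float32 full vector: {float32_bytes} bytes")
print(f"256-dim sign code:   {len(code)} bytes")
```

Under these assumptions the raw vector payload shrinks from 4096 bytes to 32 bytes per document. A real index also stores graph structures and any correction data the quantizer needs, so the end-to-end saving is smaller than this raw ratio suggests.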
On the Elastic Inference Service, inference endpoints and Kibana connectors for both jina-embeddings-v5-omni-small and jina-embeddings-v5-omni-nano are created automatically, with no manual configuration required. The Elastic documentation covers local deployment via Hugging Face as well. Both models are also available on the Jina API and Hugging Face (CC-BY-NC-4.0).
The full technical write-up, including architecture details and benchmark breakdowns, is on the Elasticsearch Labs blog and the GELATO paper on arXiv. The original video walkthrough is on YouTube.