Gemini Embedding 2: Google's First Natively Multimodal Embedding Model
Google launched Gemini Embedding 2 (gemini-embedding-2-preview), the first natively multimodal embedding model in the Gemini API. It maps text, images, video, audio, and PDF documents into a single unified embedding space, supports up to 8,192 input tokens for text, and uses Matryoshka Representation Learning to produce flexible output dimensions from 768 to 3,072.
A Unified Embedding Space Across All Modalities
Semantic embedding, transforming content into numerical vectors that capture meaning, has historically been siloed by modality. Text embeddings live in one space, image embeddings in another, and comparing across them requires complex multi-step pipelines. Gemini Embedding 2 (gemini-embedding-2-preview) eliminates that boundary: all five supported input types (text, images, video, audio, and PDF documents) are mapped into a single, shared vector space.
This means a query expressed in plain text can retrieve semantically similar images, a video clip can be matched to a related document, and a voice recording can surface related written content, all through the same embedding lookup, without any modality-specific adapters or cross-encoder steps. The model captures semantic intent across more than 100 languages, enabling multilingual and cross-lingual retrieval out of the box.
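Because every modality lands in the same space, cross-modal retrieval reduces to a single nearest-neighbor lookup over one index. A toy sketch of that lookup, using 4-dimensional stand-in vectors (real embeddings would be full 3,072-dim outputs of gemini-embedding-2-preview; the file names and vector values here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for real embeddings; in practice every vector below would
# come from the same model call, regardless of input modality.
index = {
    "sunset_photo.jpg":  [0.9, 0.1, 0.0, 0.1],   # image embedding
    "beach_clip.mp4":    [0.7, 0.3, 0.1, 0.0],   # video embedding
    "meeting_notes.pdf": [0.1, 0.9, 0.3, 0.0],   # document embedding
}

query_vec = [0.85, 0.15, 0.05, 0.05]  # embedding of a text query (illustrative)

# One similarity ranking covers every modality; no adapters or cross-encoders.
ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
print(ranked[0][0])  # → sunset_photo.jpg
```

Production deployments would replace the dictionary with one of the vector databases listed later in the article, but the key point is unchanged: a single index serves all modalities.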
Technical Specifications
Supported input modalities and per-request limits:
- Text: up to 8,192 input tokens; interleaved multimodal input (e.g., image + text in a single request) is supported
- Images: up to 6 per request in PNG or JPEG format
- Video: up to 120 seconds in MP4 or MOV format
- Audio: native ingestion; no text transcription required
- Documents: PDFs up to 6 pages
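The per-request limits above can be checked client-side before issuing a call. A minimal sketch, where `validate_request` and the `LIMITS` table are hypothetical helpers built from the figures listed above, not part of any Google SDK:

```python
# Limit constants taken from the documented per-request caps.
LIMITS = {
    "max_images": 6,
    "max_video_seconds": 120,
    "max_pdf_pages": 6,
    "max_text_tokens": 8192,
}

def validate_request(num_images=0, video_seconds=0, pdf_pages=0, text_tokens=0):
    """Return a list of limit violations for a planned embedding request."""
    errors = []
    if num_images > LIMITS["max_images"]:
        errors.append(f"too many images: {num_images} > {LIMITS['max_images']}")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append(f"video too long: {video_seconds}s > {LIMITS['max_video_seconds']}s")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append(f"PDF too long: {pdf_pages} pages > {LIMITS['max_pdf_pages']}")
    if text_tokens > LIMITS["max_text_tokens"]:
        errors.append(f"text too long: {text_tokens} tokens > {LIMITS['max_text_tokens']}")
    return errors

print(validate_request(num_images=8, video_seconds=90))
# → ['too many images: 8 > 6']
```

Note that a single request may interleave modalities (for example image + text), so a real pre-flight check would validate all fields of the combined payload at once, as this helper does.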
Output dimensions use Matryoshka Representation Learning (MRL), which allows the embedding vector to be truncated to smaller sizes without retraining. Recommended configurations are 3,072 (default), 1,536, and 768 dimensions. Smaller dimensions trade marginal quality for significantly lower storage costs and faster similarity search, a practical lever for large-scale deployments.
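MRL trains the model so that a leading prefix of the full vector is itself a usable embedding, which makes the truncation step trivial on the client side. A minimal sketch, using a synthetic vector in place of a real API response; re-normalizing after truncation keeps cosine similarity equivalent to a plain dot product:

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` MRL components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Synthetic stand-in for a full 3,072-dim embedding returned by the model.
full = [math.sin(0.1 * i) for i in range(3072)]

small = truncate_and_normalize(full, 768)
print(len(small))  # → 768
```

Storing 768 float32 components instead of 3,072 cuts vector storage by 4x, which is where the cost and search-speed savings mentioned above come from.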
Primary Use Cases
Google positions Gemini Embedding 2 as best suited for cross-modal semantic search, multimodal Retrieval-Augmented Generation (RAG), document retrieval across mixed-media knowledge bases, and recommendation systems operating over heterogeneous content libraries. The ability to ingest audio natively, without prior transcription, is particularly notable: acoustic nuances that transcription would flatten are preserved in the embedding.
Availability and Ecosystem Integration
Gemini Embedding 2 is available in public preview via both the Gemini API and Google Cloud Vertex AI. The model is already integrated with major vector database and ML orchestration platforms including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Google Cloud Vector Search.