Gemini Embedding 2: Google's First Natively Multimodal Embedding Model
Google launched Gemini Embedding 2 (gemini-embedding-2-preview), the first natively multimodal embedding model in the Gemini API. It maps text, images, video, audio, and PDF documents into a single unified embedding space, supports up to 8,192 input tokens for text, and uses Matryoshka Representation Learning to produce flexible output dimensions from 768 to 3,072.
A Unified Embedding Space Across All Modalities
Semantic embedding, transforming content into numerical vectors that capture meaning, has historically been siloed by modality. Text embeddings live in one space, image embeddings in another, and comparing across them requires complex multi-step pipelines. Gemini Embedding 2 (gemini-embedding-2-preview) eliminates that boundary: all five supported input types (text, images, video, audio, and PDF documents) are mapped into a single, shared vector space.
This means a query expressed in plain text can retrieve semantically similar images, a video clip can be matched to a related document, and a voice recording can surface related written content, all through the same embedding lookup, without any modality-specific adapters or cross-encoder steps. The model captures semantic intent across more than 100 languages, enabling multilingual and cross-lingual retrieval out of the box.
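Because every modality lands in the same space, cross-modal retrieval reduces to a single nearest-neighbor lookup over one index. A toy sketch of that lookup, using 4-dimensional stand-in vectors (real embeddings would be full 3,072-dim outputs of gemini-embedding-2-preview; the file names and vector values here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy stand-ins for real embeddings; in practice every vector below would
# come from the same model call, regardless of input modality.
index = {
    "sunset_photo.jpg":  [0.9, 0.1, 0.0, 0.1],   # image embedding
    "beach_clip.mp4":    [0.7, 0.3, 0.1, 0.0],   # video embedding
    "meeting_notes.pdf": [0.1, 0.9, 0.3, 0.0],   # document embedding
}

query_vec = [0.85, 0.15, 0.05, 0.05]  # embedding of a text query (illustrative)

# One similarity ranking covers every modality; no adapters or cross-encoders.
ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
print(ranked[0][0])  # → sunset_photo.jpg
```

Production deployments would replace the dictionary with one of the vector databases listed later in the article, but the key point is unchanged: a single index serves all modalities.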
Technical Specifications
Supported input modalities and per-request limits:
- Text: up to 8,192 input tokens; interleaved multimodal input (e.g., image + text in a single request) is supported
- Images: up to 6 per request in PNG or JPEG format
- Video: up to 120 seconds in MP4 or MOV format
- Audio: native ingestion; no text transcription required
- Documents: PDFs up to 6 pages
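The per-request limits above can be checked client-side before issuing a call. A minimal sketch, where `validate_request` and the `LIMITS` table are hypothetical helpers built from the figures listed above, not part of any Google SDK:

```python
# Limit constants taken from the documented per-request caps.
LIMITS = {
    "max_images": 6,
    "max_video_seconds": 120,
    "max_pdf_pages": 6,
    "max_text_tokens": 8192,
}

def validate_request(num_images=0, video_seconds=0, pdf_pages=0, text_tokens=0):
    """Return a list of limit violations for a planned embedding request."""
    errors = []
    if num_images > LIMITS["max_images"]:
        errors.append(f"too many images: {num_images} > {LIMITS['max_images']}")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append(f"video too long: {video_seconds}s > {LIMITS['max_video_seconds']}s")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append(f"PDF too long: {pdf_pages} pages > {LIMITS['max_pdf_pages']}")
    if text_tokens > LIMITS["max_text_tokens"]:
        errors.append(f"text too long: {text_tokens} tokens > {LIMITS['max_text_tokens']}")
    return errors

print(validate_request(num_images=8, video_seconds=90))
# → ['too many images: 8 > 6']
```

Note that a single request may interleave modalities (for example image + text), so a real pre-flight check would validate all fields of the combined payload at once, as this helper does.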
Output dimensions use Matryoshka Representation Learning (MRL), which allows the embedding vector to be truncated to smaller sizes without retraining. Recommended configurations are 3,072 (default), 1,536, and 768 dimensions. Smaller dimensions trade marginal quality for significantly lower storage costs and faster similarity search, a practical lever for large-scale deployments.
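MRL trains the model so that a leading prefix of the full vector is itself a usable embedding, which makes the truncation step trivial on the client side. A minimal sketch, using a synthetic vector in place of a real API response; re-normalizing after truncation keeps cosine similarity equivalent to a plain dot product:

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` MRL components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Synthetic stand-in for a full 3,072-dim embedding returned by the model.
full = [math.sin(0.1 * i) for i in range(3072)]

small = truncate_and_normalize(full, 768)
print(len(small))  # → 768
```

Storing 768 float32 components instead of 3,072 cuts vector storage by 4x, which is where the cost and search-speed savings mentioned above come from.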
Primary Use Cases
Google positions Gemini Embedding 2 as best suited for cross-modal semantic search, multimodal Retrieval-Augmented Generation (RAG), document retrieval across mixed-media knowledge bases, and recommendation systems operating over heterogeneous content libraries. The ability to ingest audio natively, without prior transcription, is particularly notable: acoustic nuances that transcription would flatten are preserved in the embedding.
Availability and Ecosystem Integration
Gemini Embedding 2 is available in public preview via both the Gemini API and Google Cloud Vertex AI. The model is already integrated with major vector database and ML orchestration platforms including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Google Cloud Vector Search.