Gemini 3.1 Flash TTS: Google's Expressive, Prompt-Steerable Speech Model
Google launched Gemini 3.1 Flash TTS on April 15, 2026, making it available to developers through the Gemini API under the model ID gemini-3.1-flash-tts-preview. The model introduces 200+ inline audio tags for granular voice direction and handles native multi-speaker dialogue in a single API call. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211, placing it second overall.
A New Approach to AI Speech
Google's text-to-speech models have historically operated on a simple contract: provide text, receive audio. Gemini 3.1 Flash TTS changes that contract. The model (gemini-3.1-flash-tts-preview) treats speech generation as a directed performance rather than mechanical text conversion, so the prompt can describe who is speaking, where, and how, in addition to what is said.
The model responds to a structured prompt format organized into three optional layers:
- Audio Profile — the character's identity, archetype, and voice persona
- Scene — environmental context, mood, and conversational setting
- Director's Notes — specific instructions on pacing, accent, emotional delivery, and style
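As a rough illustration of the three layers, the sketch below assembles them into one prompt string. The article names the layers but does not specify how they are written out, so the "Name: text" section layout, the "Transcript" label, and the helper name build_directed_prompt are all assumptions made for this example.

```python
def build_directed_prompt(transcript, audio_profile=None, scene=None,
                          directors_notes=None):
    """Join up to three optional direction layers with the text to be spoken.

    The section labels mirror the layer names in the article; the exact
    layout the model expects is NOT documented here and is assumed.
    """
    layers = [
        ("Audio Profile", audio_profile),
        ("Scene", scene),
        ("Director's Notes", directors_notes),
    ]
    # Every layer is optional, so skip any that is missing or blank.
    sections = [
        f"{name}: {text.strip()}"
        for name, text in layers
        if text and text.strip()
    ]
    # The spoken text always comes last, after whatever direction was given.
    sections.append(f"Transcript: {transcript.strip()}")
    return "\n\n".join(sections)


prompt = build_directed_prompt(
    transcript="Welcome back to the late show.",
    audio_profile="A warm, unhurried radio host in her fifties.",
    directors_notes="Slow pacing, slight smile in the voice.",
)
print(prompt)
```

Because no Scene is passed, the resulting prompt contains only the Audio Profile, Director's Notes, and Transcript sections.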
Audio Tags: 200+ Inline Voice Controls
Gemini 3.1 Flash TTS ships with over 200 inline modifiers that developers embed directly in the prompt text. Tags such as [whispers], [excitedly], [very slowly], and [laughs] allow mid-sentence style changes without requiring separate API calls or audio stitching. Accent control supports regional variants including American "Valley," American "Southern," British "Received Pronunciation," "Brixton," "Transatlantic," and more.
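To make the inline-tag idea concrete, here is a small sketch that lists the tags used in a script and recovers the plain spoken text, which can be handy for captions or for reviewing direction before sending a request. The tag grammar (lowercase words in square brackets) is inferred from the four examples above and is an assumption, as are the helper names.

```python
import re

# Assumed tag grammar: lowercase words and spaces inside square brackets,
# e.g. [whispers] or [very slowly]. Inferred from the article's examples.
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def extract_tags(script):
    """Return the tag names in the order they appear in the script."""
    return TAG_PATTERN.findall(script)


def strip_tags(script):
    """Return the script with tags removed and whitespace collapsed."""
    without_tags = TAG_PATTERN.sub("", script)
    return re.sub(r"\s+", " ", without_tags).strip()


script = (
    "[excitedly] We did it! [whispers] But keep it quiet, "
    "[very slowly] for now. [laughs]"
)
print(extract_tags(script))
print(strip_tags(script))
```

The tagged script itself is what would be sent as prompt text; the stripped version is only a by-product for display.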
Native Multi-Speaker Dialogue
The model handles multi-speaker dialogue natively in a single API call, maintaining natural conversational rhythm without requiring developers to stitch audio clips together. This capability makes it well-suited for podcast generators, interactive storytelling, customer service bots, language tutors, and dialogue-driven educational tools.
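The single-call idea can be sketched as one request body that carries the whole conversation plus a voice per speaker. The payload below is modeled on the multi-speaker speech configuration that the Gemini API has used for earlier TTS models; whether this model accepts exactly the same fields is an assumption, and the "Speaker: line" transcript layout, the helper name, and the voice names are illustrative placeholders. Nothing is sent over the network.

```python
import json

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID as given in the article


def build_multi_speaker_request(turns, voices):
    """Build one request body covering every turn of a dialogue.

    turns  -- list of (speaker, line) tuples in speaking order
    voices -- dict mapping each speaker name to a voice name (placeholders)
    """
    missing = {speaker for speaker, _ in turns} - set(voices)
    if missing:
        raise ValueError(f"No voice assigned for: {sorted(missing)}")

    # Assumed layout: one "Speaker: line" row per turn in a single text part.
    transcript = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    speaker_configs = [
        {
            "speaker": speaker,
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
        }
        for speaker, voice in voices.items()
    ]
    return {
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": speaker_configs
                }
            },
        },
    }


turns = [
    ("Host", "[excitedly] Welcome to the show!"),
    ("Guest", "Thanks, glad to be here."),
    ("Host", "Let's get started."),
]
request_body = build_multi_speaker_request(
    turns, {"Host": "Kore", "Guest": "Puck"}
)
print(MODEL_ID)
print(json.dumps(request_body, indent=2))
```

All three turns travel in a single text part, so the dialogue is generated together rather than as separate clips to be stitched afterwards.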
Language Support and Performance
Gemini 3.1 Flash TTS supports over 70 languages with granular accent and style controls. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211, ranking second overall among all evaluated text-to-speech systems.
Developer Access and Pricing
The model is available in paid preview via the Gemini API, Google AI Studio, Vertex AI, and Google Vids. Pricing is set at $0.50 per 1M input tokens and $[…].00 per 1M output tokens. All audio output carries a SynthID watermark, consistent with Google's responsible AI practices across the Gemini model family.
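Per-million-token pricing converts to a request cost with simple proportional arithmetic. The sketch below applies it to the input side only, reading the input rate as $0.50 per 1M tokens; the 250,000-token workload and the helper name are hypothetical, and the output rate is left as a parameter to be supplied by the caller.

```python
def token_cost_usd(tokens, usd_per_million_tokens):
    """Cost in USD for a token count at a given per-1M-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


INPUT_USD_PER_MILLION = 0.50  # input rate, read from the pricing above

# Hypothetical workload: 250,000 input tokens of script text.
input_cost = token_cost_usd(250_000, INPUT_USD_PER_MILLION)
print(f"Input cost: ${input_cost:.3f}")  # prints "Input cost: $0.125"
```

The same function gives the output-side cost once the output rate and the number of generated audio tokens are known.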