Gemini 3.1 Flash TTS: Google's Expressive, Prompt-Steerable Speech Model

Gemini CLI, Apr 15, 2026

Google launched Gemini 3.1 Flash TTS on April 15, 2026, making it available to developers through the Gemini API under the model ID gemini-3.1-flash-tts-preview. The model introduces 200+ inline audio tags for granular voice direction and handles native multi-speaker dialogue in a single API call. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211, placing it second overall.

A New Approach to AI Speech

Google's text-to-speech models have historically operated on a simple contract: provide text, receive audio. Gemini 3.1 Flash TTS changes that contract fundamentally. Launched on April 15, 2026 via the Gemini API, the model (gemini-3.1-flash-tts-preview) treats speech generation as a directed performance rather than mechanical text conversion.

The model responds to a structured prompt format organized into three optional layers:

Audio Profile — the character's identity, archetype, and voice persona
Scene — environmental context, mood, and conversational setting
Director's Notes — specific instructions on pacing, accent, emotional delivery, and style
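As a rough illustration of the three optional layers, the sketch below assembles a prompt string from whichever layers are supplied. The function name `build_prompt`, the "Label: text" layout, and the "Transcript" label are assumptions made for this example; the article does not specify an exact prompt syntax.

```python
def build_prompt(transcript, audio_profile=None, scene=None, directors_notes=None):
    """Assemble a TTS prompt from up to three optional layers plus the text to speak.

    The section labels mirror the layer names in the article; the exact
    layout is a hypothetical convention, not a documented format.
    """
    layers = [
        ("Audio Profile", audio_profile),
        ("Scene", scene),
        ("Director's Notes", directors_notes),
    ]
    # Keep only the layers the caller actually provided.
    sections = [
        f"{title}: {body.strip()}"
        for title, body in layers
        if body and body.strip()
    ]
    sections.append(f"Transcript: {transcript.strip()}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_prompt(
        "[excitedly] We are live!",
        audio_profile="A seasoned radio host with a warm, low voice.",
        directors_notes="Brisk pacing, light Transatlantic accent.",
    ))
```

Because every layer is optional, a caller can start with the transcript alone and add direction incrementally.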

Audio Tags: 200+ Inline Voice Controls

Gemini 3.1 Flash TTS ships with over 200 inline modifiers that developers embed directly in the prompt text. Tags such as [whispers], [excitedly], [very slowly], and [laughs] allow mid-sentence style changes without requiring separate API calls or audio stitching. Accent control supports regional variants including American "Valley," American "Southern," British "Received Pronunciation," "Brixton," "Transatlantic," and more.
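Since the tags live inside the script text itself, it can help to inspect them before sending a request. The sketch below uses only the Python standard library; it assumes tags are square-bracketed phrases like the examples above, and the helper names `extract_tags` and `strip_tags` are invented for illustration.

```python
import re

# Assumption: an audio tag is any bracketed phrase with no nested brackets,
# matching the examples in the article such as [whispers] or [very slowly].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(script):
    """Return the audio tags found in a script, in order of appearance."""
    return [tag.strip() for tag in TAG_PATTERN.findall(script)]


def strip_tags(script):
    """Return the script with tags removed, e.g. for captions or logs."""
    text = TAG_PATTERN.sub(" ", script)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space a deleted tag can leave before punctuation.
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


if __name__ == "__main__":
    line = "[whispers] Don't wake them. [very slowly] Step back [laughs]."
    print(extract_tags(line))
    print(strip_tags(line))
```

Keeping a tag-free copy of the script is one way to produce a readable transcript alongside the generated audio.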

Native Multi-Speaker Dialogue

The model handles multi-speaker dialogue natively in a single API call, maintaining natural conversational rhythm without requiring developers to stitch audio clips together. This capability makes it well-suited for podcast generators, interactive storytelling, customer service bots, language tutors, and dialogue-driven educational tools.
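A single-call dialogue request might look like the sketch below. The transcript helper is plain Python. The request itself follows the multi-speaker pattern documented in the `google-genai` Python SDK for earlier Gemini TTS previews; that the same request shape applies to `gemini-3.1-flash-tts-preview` is an assumption, as are the "Speaker: line" transcript layout and any voice names passed in.

```python
from dataclasses import dataclass

MODEL_ID = "gemini-3.1-flash-tts-preview"  # model ID as given in the article


@dataclass
class Turn:
    speaker: str
    text: str


def format_dialogue(turns):
    """Render turns as 'Speaker: line' rows (an assumed transcript layout)."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)


def synthesize_dialogue(turns, voices, api_key):
    """Send the whole dialogue in one request and return raw audio bytes.

    `voices` maps each speaker name to a prebuilt voice name. Requires the
    `google-genai` package; it is imported here so that the pure helpers
    above stay usable without it.
    """
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    speaker_configs = [
        types.SpeakerVoiceConfig(
            speaker=speaker,
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name=voice)
            ),
        )
        for speaker, voice in voices.items()
    ]
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=format_dialogue(turns),
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                    speaker_voice_configs=speaker_configs
                )
            ),
        ),
    )
    # Assumption: the audio arrives as inline data on the first part.
    return response.candidates[0].content.parts[0].inline_data.data


if __name__ == "__main__":
    turns = [Turn("Host", "Welcome back."), Turn("Guest", "[laughs] Glad to be here.")]
    print(format_dialogue(turns))
```

Building the transcript separately from the request keeps the dialogue easy to review before any audio is generated.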

Language Support and Performance

Gemini 3.1 Flash TTS supports over 70 languages with granular accent and style controls. On the Artificial Analysis TTS leaderboard, the model achieved an Elo score of 1,211, ranking second overall among all evaluated text-to-speech systems.

Developer Access and Pricing

The model is available in paid preview via the Gemini API, Google AI Studio, Vertex AI, and Google Vids. Pricing is set at $0.50 per 1M input tokens and $[…].00 per 1M output tokens. All audio output carries a SynthID watermark, consistent with Google's responsible AI practices across the Gemini model family.
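Because pricing is quoted per 1M tokens, a cost estimate is simple arithmetic on token counts. The sketch below takes both rates as parameters rather than hard-coding them; the function name and the rates in the usage example are placeholders for illustration, not the model's actual prices.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_rate_per_1m, output_rate_per_1m):
    """Estimate request cost in USD from token counts and per-1M-token rates."""
    per_million = 1_000_000
    input_cost = (input_tokens / per_million) * input_rate_per_1m
    output_cost = (output_tokens / per_million) * output_rate_per_1m
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder rates chosen only to make the arithmetic easy to follow.
    placeholder_input_rate = 1.00
    placeholder_output_rate = 2.00
    cost = estimate_cost_usd(2_000_000, 500_000, placeholder_input_rate, placeholder_output_rate)
    print(f"Estimated cost: ${cost:.2f}")
```

With these placeholder rates, 2M input tokens contribute 2.00 and 0.5M output tokens contribute 1.00, for a total of 3.00.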
