Gemini API: Streaming Support for Text-to-Speech Generation
The Gemini API now officially supports streaming for the gemini-3.1-flash-tts-preview model, enabling developers to receive audio output incrementally as it is generated rather than waiting for a complete batch response. By setting stream: true in the Interactions API or using streamGenerateContent, applications can begin processing and playing back audio chunks as soon as they arrive, reducing time-to-first-audio from tens of seconds to a few seconds. This change is particularly significant for voice application builders, audio narration tools, and agentic pipelines that deliver speech output to users in real time.
Key Takeaways
- Streaming TTS is now officially supported for gemini-3.1-flash-tts-preview, reversing the prior documentation that said "TTS does not support streaming" and stabilizing what was previously an unofficial capability.
- Time-to-first-audio drops dramatically: instead of waiting for a full batch cycle (potentially 60-90 seconds for long narrations), applications can begin playback within the first few seconds of generation.
- Two delivery paths are available: stream: true in the Interactions API, or streamGenerateContent?alt=sse. Both return base64-encoded PCM chunks via Server-Sent Events.
- This is distinct from the Gemini Live API: Live handles bidirectional real-time voice conversation; TTS streaming handles unidirectional synthesis where expressive prompt-steerability and 70+ language coverage matter more than sub-100ms latency.
- Gemini CLI voice mode and Managed Agents benefit directly, as pipelines no longer need to buffer complete audio responses before routing output to speakers or downstream processing steps.
- A developer forum thread from June 2 surfaced truncation bugs when streaming beyond ~60 seconds of audio via the unofficial endpoint. The June 17 official announcement signals that these issues have been resolved and the feature is production-ready.
Sources & Mentions
- Streaming Gemini 3.1's expressive new TTS model in Java (Medium)
- Gemini 3.1 Flash TTS (Simon Willison's Blog)
- Google Launches Gemini 3.1 Flash TTS in Preview (Winbuzzer)
- Gemini 3.1: Real-World Voice Recognition with Flash Live (Dev.to)
- streamGenerateContent truncates audio past ~60s, community thread (Google AI Developers Forum)
Gemini API Now Supports Streaming for Text-to-Speech
The Gemini API's text-to-speech capabilities received an important developer-facing upgrade on June 17, 2026: the gemini-3.1-flash-tts-preview model now officially supports streaming output via streamGenerateContent and the Interactions API's stream: true option. The change addresses one of the most common friction points for developers building real-time voice applications on Gemini TTS.
What Changed
Previously, gemini-3.1-flash-tts-preview operated exclusively in batch mode: the model would generate the complete audio before returning any output to the caller. For longer narrations or interactive scenarios, this meant developers had to wait for the full audio to be produced before they could play or process even the first syllable, a delay that could stretch to 60-90 seconds for multi-minute audio.
With streaming now officially supported, the model pushes audio chunks incrementally using Server-Sent Events (SSE) over the standard streamGenerateContent endpoint. Developers can pipe those chunks directly to an audio output buffer or player as they arrive, dramatically reducing time-to-first-audio. The official documentation, which previously stated "TTS does not support streaming," has been updated to reflect this new capability.
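When piping chunks to a player, one practical detail is sample alignment: a 16-bit PCM sink expects whole two-byte samples. The sketch below assumes, without confirmation from the API documentation, that a decoded chunk could end mid-sample, and carries any odd trailing byte into the next chunk. The function name align_pcm_chunks is hypothetical.

```python
from typing import Iterable, Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM, the output format described in this article


def align_pcm_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield PCM data in whole 16-bit samples.

    Assumption: chunk boundaries are not guaranteed to fall on sample
    boundaries, so an odd trailing byte is held back and prepended to
    the next chunk.
    """
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        usable = len(data) - (len(data) % BYTES_PER_SAMPLE)
        if usable:
            yield data[:usable]
        leftover = data[usable:]
    # A single dangling byte at end of stream is not a full sample and is dropped.
```

If the service always emits sample-aligned chunks, this wrapper simply passes data through unchanged, so it is cheap to leave in place.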
How to Use It
The Interactions API is the recommended path for streaming TTS. Set stream=True (Python) or stream: true (REST) in the request alongside response_format: { type: "audio" }. The streamGenerateContent REST endpoint also accepts streaming requests by appending ?alt=sse. Each SSE chunk delivers a base64-encoded PCM audio payload (24 kHz / 16-bit mono) that can be decoded and played back immediately.
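As a rough illustration of the ?alt=sse path, the following sketch turns an iterable of SSE text lines into raw PCM bytes. It makes no network call. The JSON layout it reads (candidates, content, parts, inlineData.data) is an assumption based on the usual generateContent REST response shape and is not confirmed by this article; the function name iter_pcm_from_sse is hypothetical.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_pcm_from_sse(lines: Iterable[str]) -> Iterator[bytes]:
    """Decode base64 audio payloads from SSE 'data:' lines.

    Assumption: each event is a JSON object whose audio lives at
    candidates[].content.parts[].inlineData.data.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        event = json.loads(payload)
        for candidate in event.get("candidates", []):
            parts = candidate.get("content", {}).get("parts", [])
            for part in parts:
                inline = part.get("inlineData")
                if inline and "data" in inline:
                    yield base64.b64decode(inline["data"])
```

In a real client the lines would come from a streaming HTTP response, for example the line iterator of whichever HTTP library is in use, and each yielded chunk could be handed straight to an audio sink.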
Why It Matters for Voice Application Developers
Lower perceived latency is the headline benefit. Streaming TTS allows applications to start audio playback within the first few seconds of generation rather than waiting for a batch cycle to complete in full. For user-facing experiences such as AI assistants, reading apps, podcast generators, and voice agents, this difference is perceptible and directly improves the user experience.
Smoother integration with agent pipelines is a secondary benefit. Gemini CLI voice mode and Google's Managed Agents can now pipe TTS output to audio sinks continuously rather than buffering complete responses, enabling more natural real-time assistant interactions.
How Streaming TTS Differs from the Live API
It is worth drawing a clear distinction: TTS streaming is not the same as the Gemini Live API. The Live API is designed for bidirectional real-time audio: voice conversations where the model listens and speaks simultaneously, optimized for sub-100ms turnaround. Streaming TTS is appropriate for unidirectional synthesis scenarios: a developer provides text, the model generates speech, and that speech is delivered incrementally. The TTS model's advantages over the Live API for these use cases are its superior prompt-steerability (200+ inline audio tags like [whispers], [laughs], [excited]), broader multi-language support (70+ languages), and greater per-voice customization.
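Because the inline audio tags are embedded in the prompt text itself, an application may want to list them or produce a tag-free version for on-screen captions. The sketch below treats any bracketed token as a tag; it does not validate against the model's actual tag vocabulary, which is not listed here. Both function names are hypothetical.

```python
import re
from typing import List

# Matches any non-nested [bracketed] token, e.g. [whispers] or [laughs].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_audio_tags(prompt: str) -> List[str]:
    """Return bracketed tag names in order of appearance, lowercased."""
    return [tag.strip().lower() for tag in TAG_PATTERN.findall(prompt)]


def strip_audio_tags(prompt: str) -> str:
    """Remove bracketed tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", prompt)
    return re.sub(r"\s+", " ", without_tags).strip()
```

For example, a prompt such as "[whispers] Don't wake him. [laughs] Too late!" contains two tags, and the stripped version is suitable as a caption while the original is sent for synthesis.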
Model Capabilities Recap
The gemini-3.1-flash-tts-preview model, launched in April 2026, retains all its existing capabilities: 200+ inline audio tags for mid-sentence style control, 30 voice options across 70+ languages with automatic language detection, SynthID watermarking on all generated audio, up to two speakers with independent voice and style configuration, and 24 kHz / 16-bit mono PCM output. Streaming is now an officially supported delivery method alongside batch generation, giving developers the choice of latency versus simplicity depending on their application's needs.
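The output format stated above (24 kHz, 16-bit, mono) means one second of audio occupies 24,000 × 2 × 1 = 48,000 bytes, so 60 seconds is about 2.88 MB of raw PCM. The sketch below uses only Python's standard library to compute duration from a byte count and to wrap accumulated PCM in a WAV container. The function names are hypothetical, and the little-endian sample order that WAV requires is assumed rather than stated by the source.

```python
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24 kHz, per the article
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of raw PCM of the given size, in seconds."""
    return num_bytes / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def pcm_to_wav_bytes(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container (assumes little-endian samples)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Tracking the running byte count during a stream and converting it with pcm_duration_seconds is one simple way to notice when the received audio is shorter than expected.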