Gemini 3.1 Flash-Lite: Google's Fastest Model Reaches General Availability

Google released gemini-3.1-flash-lite as a generally available model on May 7, 2026, graduating it from the preview it entered on March 3. The model delivers 2.5x faster time-to-first-token and 45% higher throughput than 2.5 Flash-Lite, with a 1M token context window, up to 66K output tokens, and built-in thinking levels for configurable reasoning depth. Pricing is set at $0.25 per million input tokens and $1.50 per million output tokens. Developers using the companion gemini-3.1-flash-lite-preview model ID must migrate before its May 25, 2026 shutdown.


Gemini 3.1 Flash-Lite Reaches General Availability

Google officially promoted gemini-3.1-flash-lite to general availability on May 7, 2026. The model had been available in preview since March 3, 2026; the GA release signals production readiness with stability guarantees, full enterprise support via Vertex AI, and a fixed deprecation timeline for the preview endpoint.

Performance and Capabilities

Gemini 3.1 Flash-Lite is the fastest and most cost-efficient model in Google's Gemini 3 series, designed for high-volume, latency-sensitive workloads where throughput and cost-per-call are the primary constraints. Compared to the prior-generation 2.5 Flash-Lite:

  • 2.5x faster time to first token (TTFT)
  • 45% higher output throughput
  • GPQA Diamond: 86.9% β€” competitive with significantly larger models
  • 1M token context window with up to 66K output tokens per request

The model is fully multimodal, supporting text, images, audio, and function calling. Thinking levels are built in, allowing developers to configure reasoning depth per request: minimal mode for near-instant classifier and routing responses, and higher levels for complex tasks such as UI generation, simulation, and multi-step agentic pipelines.
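The per-request reasoning depth described above can be sketched as a small routing helper. This is an illustrative sketch only: the `thinking_level` field name, the `"minimal"`/`"high"` labels, and the task categories are assumptions for demonstration, not a confirmed request schema, and no SDK is called here.

```python
# Hypothetical routing of requests by reasoning depth. Field names and
# level labels are illustrative assumptions, not a confirmed API schema.

FAST_TASKS = {"classification", "routing", "tool_call"}

def request_params(model: str, task: str) -> dict:
    """Build request parameters with a thinking level matched to the task."""
    level = "minimal" if task in FAST_TASKS else "high"
    return {"model": model, "thinking_level": level}

# A latency-sensitive classifier call stays at the lowest reasoning depth:
params = request_params("gemini-3.1-flash-lite", "routing")
```

The point of the pattern is that one model ID serves both the sub-second classifier path and the deeper agentic path, with only the reasoning depth varying per call.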

Pricing

Gemini 3.1 Flash-Lite is priced at $0.25 per million input tokens and $1.50 per million output tokens. While this represents a substantial performance upgrade over 2.5 Flash-Lite, it is also a price increase over the previous generation's $0.10/$0.40 rates. Community benchmarks have noted that the model's improved instruction-following efficiency, consuming fewer tokens per task, can offset the higher per-token rate in many workloads. Use cases with heavy reasoning settings, however, may see sharply increased costs due to extended thinking-token consumption.
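The per-token arithmetic above is easy to check with a small helper. The rates come from the article; the token counts in the workload comparison are illustrative assumptions.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Request cost in USD, with rates quoted in dollars per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# One million input tokens at the 3.1 Flash-Lite input rate costs $0.25:
assert cost_usd(1_000_000, 0, 0.25, 1.50) == 0.25

# The same hypothetical workload under new and old per-token pricing:
new_price = cost_usd(800_000, 200_000, 0.25, 1.50)  # $0.50
old_price = cost_usd(800_000, 200_000, 0.10, 0.40)  # ~$0.16
```

Whether the upgrade is net cheaper for a given workload depends on how much the newer model reduces tokens consumed per task, which the helper makes straightforward to model.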

Real-World Deployments

Google highlighted several production deployments at GA launch:

  • Gladly β€” handling millions of customer-service calls weekly across SMS, WhatsApp, and Instagram, with p95 latency of ~1.8 seconds for full replies and sub-second p95 for classifiers and tool calls; ~60% lower costs versus comparable thinking-tier models
  • OffDeal β€” powering real-time data lookups and email triage for AI agents in investment banking workflows
  • Ramp β€” deployed for high-volume, latency-sensitive financial features
  • JetBrains β€” integrated for code completion and agentic developer tooling in IntelliJ-based IDEs
  • krea.ai β€” multimodal safety checks and prompt enhancement at creative scale

Migration: Preview Endpoint Shutdown

The gemini-3.1-flash-lite-preview model ID is being deprecated and will be fully shut down on May 25, 2026. The deprecation timeline:

  • May 7, 2026: gemini-3.1-flash-lite GA release
  • May 11, 2026: gemini-3.1-flash-lite-preview begins deprecation
  • May 25, 2026: gemini-3.1-flash-lite-preview fully shut down

Developers must update all references from gemini-3.1-flash-lite-preview to gemini-3.1-flash-lite before May 25 to avoid service interruption.
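A one-shot rewrite of the preview ID across a codebase can be sketched with the standard library. The file glob and directory layout are assumptions about a given project; run against version-controlled files so changes are reviewable.

```python
# Sketch: replace the deprecated preview model ID with the GA ID in a
# source tree. The "*.py" glob is an illustrative assumption.
from pathlib import Path

OLD_ID = "gemini-3.1-flash-lite-preview"
NEW_ID = "gemini-3.1-flash-lite"

def migrate(root: str, pattern: str = "*.py") -> int:
    """Rewrite the preview model ID in place; return the number of files changed."""
    changed = 0
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        if OLD_ID in text:
            path.write_text(text.replace(OLD_ID, NEW_ID), encoding="utf-8")
            changed += 1
    return changed
```

Because the GA ID is a prefix of the preview ID, a plain string replacement is safe here: every occurrence of the longer preview string collapses to the GA string.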

Accessing the Model in Gemini CLI

Gemini CLI users can access the GA model immediately β€” no CLI update required. To use it, pass --model gemini-3.1-flash-lite at the command line, or set the model field in ~/.gemini/settings.json. The model is also available directly via the Gemini API in Google AI Studio and via Vertex AI.
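For a persistent default, the model field described above can be set in the CLI settings file. This is a minimal sketch; the exact settings.json schema may differ across Gemini CLI versions, so check your installed version's configuration docs.

```json
{
  "model": "gemini-3.1-flash-lite"
}
```

With this in ~/.gemini/settings.json, the --model flag is only needed to override the default for a single invocation.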