Gemini API: Flex and Priority Inference Tiers
Google launched two new inference service tiers for the Gemini API on April 1, 2026: Flex (50% cost reduction, best-effort latency of 1–15 minutes, ideal for background workloads) and Priority (75–100% premium, low and predictable latency, SLO-grade performance for production applications). Both tiers are available via the standard GenerateContent and Interactions APIs across 8 supported models, with selection controlled by a service_tier parameter. Priority access is restricted to Tier 2 and Tier 3 accounts.
Google expanded the Gemini API's pricing and performance model on April 1, 2026 with the introduction of two new service tiers, Flex and Priority, giving developers explicit control over the cost-latency tradeoff for their AI workloads.
Flex: Cost-Optimized for Background Workloads
The Flex tier is designed for tasks where cost efficiency matters more than response speed. It offers:
- 50% cost reduction compared to standard inference pricing
- Best-effort latency of approximately 1 to 15 minutes; not suitable for real-time applications
- Sheddable requests: under high system load, Flex requests may be deprioritized or dropped in favor of higher-priority traffic
- Activated by setting "service_tier": "flex" in the API request
Flex is ideal for offline evaluation pipelines, background research agents, nightly batch jobs, and any workload where the developer would rather save cost than guarantee a fast response. Given the budget-conscious positioning, Flex is accessible to all account tiers.
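For concreteness, here is a minimal sketch of what opting into Flex might look like. The endpoint URL, model name, and the exact placement of the service_tier field in the request body are assumptions for illustration; only the parameter name and value come from the announcement.

```python
import json

# Assumed endpoint and model name for illustration only.
GEMINI_ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.0-flash:generateContent"
)

def build_flex_request(prompt: str) -> dict:
    """Build a GenerateContent body for a cost-optimized, best-effort call."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        # 50% cheaper than standard; latency may reach ~15 minutes,
        # and the request may be shed under heavy load.
        "service_tier": "flex",
    }

body = build_flex_request("Summarize last night's batch evaluation results.")
print(json.dumps(body, indent=2))
```

Because Flex responses can take minutes, a caller would typically submit such requests from a queue worker or nightly job rather than a user-facing request path.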
Priority: Predictable Performance for Production
The Priority tier targets latency-sensitive production applications that require consistent, low-latency responses. It provides:
- 75–100% pricing premium over standard inference
- Low, predictable latency with non-sheddable requests; the API guarantees processing even under load
- Automatic downgrade behavior: if the account's Priority tier limit is exceeded, requests automatically fall back to Standard tier rather than failing
- Activated by setting "service_tier": "priority" in the request
- Restricted to Tier 2 and Tier 3 accounts only
Priority tier is appropriate for customer-facing applications, real-time coding assistants, fraud detection systems, and any production integration with latency SLO requirements.
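The tier rules above (Flex for latency-tolerant work, Priority gated behind Tier 2+, Standard as the default) can be sketched as a small client-side selection helper. This function and its thresholds are hypothetical; the eligibility rules it encodes come from the announcement.

```python
def choose_service_tier(account_tier: int, latency_budget_s: float) -> str:
    """Pick a service_tier value from account access level and latency budget."""
    if latency_budget_s >= 15 * 60:
        return "flex"       # background work tolerates best-effort latency
    if account_tier >= 2:
        return "priority"   # low, predictable latency; non-sheddable
    return "standard"       # Priority requires a Tier 2 or Tier 3 account

print(choose_service_tier(1, 30))    # → standard (tight budget, no Priority access)
print(choose_service_tier(3, 30))    # → priority
print(choose_service_tier(3, 3600))  # → flex (an hour to spare; save 50%)
```

Note that the server-side downgrade behavior makes this helper forgiving: even if a Priority request exceeds the account's Priority limit, it falls back to Standard rather than failing.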
Supported Models and Integration
Both tiers are available on 8 Gemini models and integrate with the existing GenerateContent and Interactions APIs; no SDK upgrade or new endpoint is required. Developers add a single service_tier parameter to their existing requests:
"service_tier": "flex" // cost-optimized, best-effort
"service_tier": "priority" // premium, low-latency, non-sheddable
"service_tier": "standard" // default behavior (unchanged)
The Flex tier is open to all paying accounts. The Priority tier requires a Tier 2 or Tier 3 account in good standing. Account tier status is visible in Google AI Studio under the billing dashboard.
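To make the pricing tradeoff concrete, the announced multipliers (Flex at -50%, Priority at +75% to +100%) can be applied to any standard-tier cost. The function below is an illustrative calculator, not an official pricing tool.

```python
def estimated_cost(standard_cost: float, tier: str) -> tuple[float, float]:
    """Return a (low, high) cost estimate for a request under a given tier,
    using the announced multipliers: Flex -50%, Priority +75% to +100%."""
    multipliers = {
        "flex": (0.5, 0.5),
        "standard": (1.0, 1.0),
        "priority": (1.75, 2.0),
    }
    lo, hi = multipliers[tier]
    return standard_cost * lo, standard_cost * hi

print(estimated_cost(10.0, "flex"))      # → (5.0, 5.0)
print(estimated_cost(10.0, "priority"))  # → (17.5, 20.0)
```

In other words, a workload costing $10 on the Standard tier would run for about $5 on Flex, or $17.50 to $20 on Priority.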