Gemini API: Flex and Priority Inference Tiers


Google launched two new inference service tiers for the Gemini API on April 1, 2026: Flex (50% cost reduction, best-effort latency of 1–15 minutes, ideal for background workloads) and Priority (75–100% premium, low and predictable latency, SLO-grade performance for production applications). Both tiers are available via the standard GenerateContent and Interactions APIs across 8 supported models, with selection controlled by a service_tier parameter. Priority access is restricted to Tier 2 and Tier 3 accounts.


Gemini API Introduces Flex and Priority Inference Tiers

Google expanded the Gemini API's pricing and performance model on April 1, 2026, with the introduction of two new service tiers, Flex and Priority, giving developers explicit control over the cost-latency tradeoff for their AI workloads.

Flex: Cost-Optimized for Background Workloads

The Flex tier is designed for tasks where cost efficiency matters more than response speed. It offers:

  • 50% cost reduction compared to standard inference pricing
  • Best-effort latency of approximately 1 to 15 minutes; not suitable for real-time applications
  • Sheddable requests: under high system load, Flex requests may be deprioritized or dropped in favor of higher-priority traffic
  • Activated by setting "service_tier": "flex" in the API request

Flex is ideal for offline evaluation pipelines, background research agents, nightly batch jobs, and any workload where the developer would rather save cost than guarantee a fast response. Given the budget-conscious positioning, Flex is accessible to all account tiers.
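As a sketch of what a Flex request might look like: the endpoint path follows the public generateContent REST API, the service_tier field is taken from this announcement, and the model name and API-key handling are illustrative assumptions, not confirmed details.

```python
import json
import os
import urllib.request

# Assumed endpoint and model name, following the public REST API shape:
# https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.5-flash:generateContent")


def build_flex_request(prompt: str) -> dict:
    """Build a generateContent payload selecting the Flex tier.

    The service_tier field is the tier selector described in the
    announcement; the rest is the standard request body shape.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "flex",  # cost-optimized, best-effort (1-15 min)
    }


def send(payload: dict) -> bytes:
    """POST the payload with an API key from the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ["GEMINI_API_KEY"],
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because Flex requests can take minutes and may be shed under load, a background job would typically enqueue these calls and retry dropped requests rather than block a user-facing thread on them.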

Priority: Predictable Performance for Production

The Priority tier targets latency-sensitive production applications that require consistent, low-latency responses. It provides:

  • 75–100% pricing premium over standard inference
  • Low, predictable latency with non-sheddable requests: the API guarantees processing even under load
  • Automatic downgrade behavior: if the account's Priority tier limit is exceeded, requests automatically fall back to Standard tier rather than failing
  • Activated by setting "service_tier": "priority" in the request
  • Restricted to Tier 2 and Tier 3 accounts only

Priority tier is appropriate for customer-facing applications, real-time coding assistants, fraud detection systems, and any production integration with latency SLO requirements.
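A minimal client-side sketch for Priority traffic, under the assumptions above: the downgrade to Standard happens server-side when the account's Priority limit is exceeded, so the client only observes it as higher latency. The SLO timer here is an illustrative client-side check, not part of the API; the function and parameter names are hypothetical.

```python
import time


def build_priority_request(prompt: str) -> dict:
    """Build a generateContent payload selecting the Priority tier."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "priority",  # low-latency, non-sheddable
    }


def call_with_slo(send_fn, payload: dict, slo_seconds: float = 2.0):
    """Send a Priority request and flag client-observed SLO breaches.

    A breach may indicate a server-side downgrade to Standard after the
    Priority limit was exceeded, or ordinary tail latency; the API does
    not fail the request in either case.
    """
    start = time.monotonic()
    response = send_fn(payload)
    elapsed = time.monotonic() - start
    return response, elapsed > slo_seconds
```

In production this check would feed a latency metric or alert rather than a return flag, so sustained downgrades surface before users notice them.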

Supported Models and Integration

Both tiers are available on 8 Gemini models and integrate with the existing GenerateContent and Interactions APIs; no SDK upgrade or new endpoint is required. Developers add a single service_tier parameter to their existing requests:

"service_tier": "flex"      // cost-optimized, best-effort
"service_tier": "priority"  // premium, low-latency, non-sheddable
"service_tier": "standard"  // default behavior (unchanged)

The Flex tier is open to all paying accounts. The Priority tier requires a Tier 2 or Tier 3 account in good standing. Account tier status is visible in Google AI Studio under the billing dashboard.
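Putting the tier rules together, a dispatcher might pick a tier per workload and gate Priority on account standing. This is a sketch under the announcement's rules; the workload labels and numeric account-tier encoding are illustrative assumptions.

```python
def choose_service_tier(workload: str, account_tier: int) -> str:
    """Map a workload class to a Gemini API service tier.

    Priority requires a Tier 2 or Tier 3 account per the announcement;
    everything else defaults to Standard. Workload labels are
    hypothetical categories, not API values.
    """
    if workload in ("batch", "eval", "background"):
        return "flex"  # 50% cheaper, best-effort latency
    if workload == "realtime":
        if account_tier >= 2:
            return "priority"  # predictable latency, non-sheddable
        return "standard"  # Priority is gated; fall back explicitly
    return "standard"  # default behavior, unchanged
```

Note the explicit fallback for ineligible accounts: since only requests that reach the Priority tier get the automatic server-side downgrade, the client must choose Standard itself when the account is below Tier 2.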

