Gemini API: Flex and Priority Inference Tiers


Google launched two new inference service tiers for the Gemini API on April 1, 2026: Flex (50% cost reduction, best-effort latency of 1–15 minutes, ideal for background workloads) and Priority (75–100% premium, low and predictable latency, SLO-grade performance for production applications). Both tiers are available via the standard GenerateContent and Interactions APIs across 8 supported models, with selection controlled by a service_tier parameter. Priority access is restricted to Tier 2 and Tier 3 accounts.


Gemini API Introduces Flex and Priority Inference Tiers

Google expanded the Gemini API's pricing and performance model on April 1, 2026, with the introduction of two new service tiers, Flex and Priority, giving developers explicit control over the cost-latency tradeoff for their AI workloads.

Flex: Cost-Optimized for Background Workloads

The Flex tier is designed for tasks where cost efficiency matters more than response speed. It offers:

  • 50% cost reduction compared to standard inference pricing
  • Best-effort latency of approximately 1 to 15 minutes; not suitable for real-time applications
  • Sheddable requests: under high system load, Flex requests may be deprioritized or dropped in favor of higher-priority traffic
  • Activated by setting "service_tier": "flex" in the API request

Flex is ideal for offline evaluation pipelines, background research agents, nightly batch jobs, and any workload where the developer would rather save cost than guarantee a fast response. Given the budget-conscious positioning, Flex is accessible to all account tiers.
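As a sketch of what a Flex request might look like: the endpoint path follows the public generateContent REST API, the service_tier field is taken from this announcement, and the model name and API-key handling are illustrative assumptions, not confirmed details.

```python
import json
import os
import urllib.request

# Assumed endpoint and model name, following the public REST API shape:
# https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent
API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-2.5-flash:generateContent")


def build_flex_request(prompt: str) -> dict:
    """Build a generateContent payload selecting the Flex tier.

    The service_tier field is the tier selector described in the
    announcement; the rest is the standard request body shape.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "flex",  # cost-optimized, best-effort (1-15 min)
    }


def send(payload: dict) -> bytes:
    """POST the payload with an API key from the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ["GEMINI_API_KEY"],
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because Flex requests can take minutes and may be shed under load, a background job would typically enqueue these calls and retry dropped requests rather than block a user-facing thread on them.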

Priority: Predictable Performance for Production

The Priority tier targets latency-sensitive production applications that require consistent, low-latency responses. It provides:

  • 75–100% pricing premium over standard inference
  • Low, predictable latency with non-sheddable requests: the API guarantees processing even under load
  • Automatic downgrade behavior: if the account's Priority tier limit is exceeded, requests automatically fall back to Standard tier rather than failing
  • Activated by setting "service_tier": "priority" in the request
  • Restricted to Tier 2 and Tier 3 accounts only

Priority tier is appropriate for customer-facing applications, real-time coding assistants, fraud detection systems, and any production integration with latency SLO requirements.
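A minimal client-side sketch for Priority traffic, under the assumptions above: the downgrade to Standard happens server-side when the account's Priority limit is exceeded, so the client only observes it as higher latency. The SLO timer here is an illustrative client-side check, not part of the API; the function and parameter names are hypothetical.

```python
import time


def build_priority_request(prompt: str) -> dict:
    """Build a generateContent payload selecting the Priority tier."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": "priority",  # low-latency, non-sheddable
    }


def call_with_slo(send_fn, payload: dict, slo_seconds: float = 2.0):
    """Send a Priority request and flag client-observed SLO breaches.

    A breach may indicate a server-side downgrade to Standard after the
    Priority limit was exceeded, or ordinary tail latency; the API does
    not fail the request in either case.
    """
    start = time.monotonic()
    response = send_fn(payload)
    elapsed = time.monotonic() - start
    return response, elapsed > slo_seconds
```

In production this check would feed a latency metric or alert rather than a return flag, so sustained downgrades surface before users notice them.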

Supported Models and Integration

Both tiers are available on 8 Gemini models and integrate with the existing GenerateContent and Interactions APIs; no SDK upgrade or new endpoint is required. Developers add a single service_tier parameter to their existing requests:

"service_tier": "flex"      // cost-optimized, best-effort
"service_tier": "priority"  // premium, low-latency, non-sheddable
"service_tier": "standard"  // default behavior (unchanged)

The Flex tier is open to all paying accounts. The Priority tier requires a Tier 2 or Tier 3 account in good standing. Account tier status is visible in Google AI Studio under the billing dashboard.
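Putting the tier rules together, a dispatcher might pick a tier per workload and gate Priority on account standing. This is a sketch under the announcement's rules; the workload labels and numeric account-tier encoding are illustrative assumptions.

```python
def choose_service_tier(workload: str, account_tier: int) -> str:
    """Map a workload class to a Gemini API service tier.

    Priority requires a Tier 2 or Tier 3 account per the announcement;
    everything else defaults to Standard. Workload labels are
    hypothetical categories, not API values.
    """
    if workload in ("batch", "eval", "background"):
        return "flex"  # 50% cheaper, best-effort latency
    if workload == "realtime":
        if account_tier >= 2:
            return "priority"  # predictable latency, non-sheddable
        return "standard"  # Priority is gated; fall back explicitly
    return "standard"  # default behavior, unchanged
```

Note the explicit fallback for ineligible accounts: since only requests that reach the Priority tier get the automatic server-side downgrade, the client must choose Standard itself when the account is below Tier 2.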

