Gemma 4: Google's Open-Weight Models Now Available via Gemini API
Google DeepMind released Gemma 4 on April 2, 2026: a family of four open-weight models built on the same research as Gemini 3, now available under the Apache 2.0 license. The two server-scale variants, gemma-4-26b-a4b-it (26B Mixture-of-Experts with 4B active parameters per token) and gemma-4-31b-it (31B Dense), are immediately accessible via the Gemini API and Google AI Studio. The 31B model ranks #3 on the Arena AI text leaderboard among all open-weight models, while the 26B MoE secures #6 despite activating only a fraction of its weights per token. All Gemma 4 models are natively multimodal, supporting text, image, and video input, with 256K context windows on the server-scale variants.
Key Takeaways
- Gemma 4 arrives via the Gemini API with two server-scale models (gemma-4-26b-a4b-it and gemma-4-31b-it) immediately callable through the standard generateContent endpoint alongside existing Gemini 3 models.
- Apache 2.0 licensing removes the prior commercial restrictions of the Gemma Terms of Service, making Gemma 4 the most commercially permissive open-weight model family Google has shipped to date.
- The 26B Mixture-of-Experts variant activates only 4B parameters per token, delivering near-frontier performance at roughly half the inference cost of the 31B Dense model, a strong fit for cost-sensitive agentic workloads.
- All four model variants are natively multimodal, processing text, image, and video; the edge models also handle audio for on-device speech recognition.
- 256K context window on the server-scale models enables long-document analysis and extended multi-turn agent sessions directly via the Gemini API without chunking.
- Community response was immediate and substantial: the Hacker News thread peaked at 1,300+ points and 390 comments, with developers reporting the 26B MoE running at ~150 tokens/second on 24GB VRAM hardware via quantized GGUF builds.
What Is Gemma 4?
Google DeepMind released Gemma 4 on April 2, 2026: a family of four open-weight models that bring Gemini 3-class reasoning capabilities to developers who want to run, fine-tune, or deploy models without cloud dependency. Unlike previous Gemma generations, Gemma 4 ships under the Apache 2.0 license, which removes the custom Gemma Terms of Service that previously restricted certain commercial use cases.
The family spans four distinct model sizes designed for different deployment scenarios:
- E2B: 2 billion effective parameters (4.41 GB), targeting ultra-constrained mobile and edge devices
- E4B: 4 billion effective parameters (6.33 GB), for mid-range devices and browser-side inference
- 26B A4B: 26 billion parameters in a Mixture-of-Experts architecture with only 4 billion activated per token (17.99 GB), offering strong capability with efficient inference
- 31B: 31 billion Dense parameters (19.89 GB), the highest-capability variant and the one directly available via the Gemini API
Gemini API Access
Two models are directly callable via the Gemini API and Google AI Studio: gemma-4-26b-a4b-it and gemma-4-31b-it. Developers can access them immediately through the standard generateContent endpoint. The smaller E2B and E4B variants are available through Google AI Edge Gallery and for download via Hugging Face, Kaggle, and Ollama for local deployment.
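As a rough sketch of what a call looks like, the snippet below assembles a request body for the standard generateContent endpoint against gemma-4-31b-it. The endpoint path and field names follow the existing Gemini API REST convention and should be checked against the current API reference; the prompt is illustrative.

```python
import json

# Model name comes from the release; the endpoint path is the standard
# Gemini API v1beta convention (verify against the current API docs).
MODEL = "gemma-4-31b-it"
ENDPOINT = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent"

def build_request(prompt: str) -> dict:
    """Assemble a minimal generateContent request body."""
    return {
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]}
        ]
    }

body = build_request("Summarize the Gemma 4 model lineup in one sentence.")
print(json.dumps(body, indent=2))
```

The same body works for gemma-4-26b-a4b-it by swapping the model name in the endpoint path; an API key (for example via a GEMINI_API_KEY environment variable) is sent per the usual Gemini API authentication scheme.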
Multimodal Capabilities and Context Windows
All Gemma 4 variants are natively multimodal. The models process:
- Text: support for over 140 languages
- Images: including variable aspect ratio and resolution
- Video: supported across all model sizes
- Audio: natively supported on the E2B and E4B edge models for on-device speech recognition
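For image input, a multimodal request mixes text and inline media parts in the same contents array. The sketch below follows the Gemini API's inline-media REST convention; field names should be verified against the current API reference, and the image bytes here are placeholders.

```python
import base64
import json

def build_multimodal_request(prompt: str, image_bytes: bytes,
                             mime_type: str = "image/png") -> dict:
    """Build a generateContent body with one text part and one inline image."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": prompt},
                # Inline media is base64-encoded with an explicit MIME type.
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }]
    }

req = build_multimodal_request("Describe this chart.", b"placeholder-image-bytes")
print(json.dumps(req)[:80])
```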
Context windows scale with model size: the edge models (E2B, E4B) offer a 128K token context, while the server-scale models (26B A4B, 31B) support 256K tokens, sufficient for long-document analysis and extended multi-turn agent sessions.
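A quick pre-flight check can estimate whether a document fits the 256K window before sending it. The ~4 characters-per-token ratio below is a common English-text heuristic, not an exact tokenizer count; use the model's real tokenizer for precise budgeting.

```python
CONTEXT_TOKENS = 256_000   # server-scale Gemma 4 window
CHARS_PER_TOKEN = 4        # rough heuristic for English text

def fits_in_context(text: str, reserved_for_output: int = 8_000) -> bool:
    """Estimate whether `text` plus an output budget fits the context window."""
    est_tokens = len(text) // CHARS_PER_TOKEN
    return est_tokens + reserved_for_output <= CONTEXT_TOKENS

# 500K chars is roughly 125K estimated tokens, well inside the window.
print(fits_in_context("word " * 100_000))  # → True
```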
Performance and Benchmarks
The 31B Dense model currently ranks #3 among all open-weight models on the Arena AI text leaderboard, while the 26B MoE variant ranks #6, outperforming models up to 20 times its parameter count on several benchmarks. Google attributes this efficiency to the Mixture-of-Experts architecture: by activating only 4B parameters per token during inference, the 26B model achieves near-frontier performance at a fraction of the compute cost.
The edge models inherit Per-Layer Embeddings (PLE) from Gemini Nano, a technique that gives each decoder layer its own embedding table for every token, enabling stronger representation without increasing active parameter counts. On Android, Gemma 4 E2B is reported to be up to 4x faster than previous Gemma versions while consuming up to 60% less battery.
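The per-layer embedding idea can be illustrated with a toy lookup (this is a conceptual sketch, not Google's implementation, and all sizes are arbitrary): rather than one embedding table shared by every decoder layer, each layer indexes its own table for every token, adding representational capacity without enlarging the weights active in the attention/MLP blocks.

```python
import numpy as np

VOCAB, LAYERS, PLE_DIM = 1000, 4, 8  # arbitrary toy sizes

rng = np.random.default_rng(0)
# One small embedding table per decoder layer: shape (layers, vocab, dim).
per_layer_tables = rng.standard_normal((LAYERS, VOCAB, PLE_DIM))

def per_layer_embed(token_ids: np.ndarray) -> np.ndarray:
    """Look up layer-specific embeddings; returns shape (layers, seq_len, dim)."""
    return per_layer_tables[:, token_ids, :]

tokens = np.array([5, 42, 7])
embs = per_layer_embed(tokens)
print(embs.shape)  # → (4, 3, 8)
```

The key property the toy shows is that the lookup cost stays a cheap table read per layer, while each layer receives a distinct view of the same token.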
Developer Tooling and Agentic Use Cases
Gemma 4 is designed with agentic workflows in mind. All variants support function calling and structured JSON output for integration with tool-use pipelines. The models handle multi-step reasoning and planning tasks, making them well-suited for autonomous coding agents, offline document analysis, and multi-turn customer-facing applications where data sovereignty matters.
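To make the tool-use support concrete, the sketch below attaches a function declaration to a request in the Gemini API's existing function-calling format (OpenAPI-style parameter schemas). The get_weather tool and its fields are hypothetical examples, not part of the release.

```python
import json

# Hypothetical tool declaration; the schema shape mirrors the Gemini
# API's function-calling convention (verify against the current docs).
weather_tool = {
    "function_declarations": [{
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }]
}

# A tools-enabled generateContent body attaches the declaration like this:
request_body = {
    "contents": [{"role": "user", "parts": [{"text": "Weather in Oslo?"}]}],
    "tools": [weather_tool],
}
print(json.dumps(request_body)[:60])
```

When the model decides to call the tool, the response carries a structured function-call part that the client executes and feeds back, which is the loop agentic pipelines build on.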
The 26B and 31B models are most relevant for Gemini CLI developers building server-side agentic applications, while the edge models are targeted at Android developers building on-device features via the AICore Developer Preview.
Open-Source Ecosystem Reaction
The Hacker News discussion for the release attracted over 1,300 points and 390 comments within hours. Community members highlighted quantized GGUF variants from Unsloth as practical options for fitting the 26B model on 24GB VRAM at roughly 150 tokens per second. Simon Willison's testing through LM Studio confirmed the smaller models performed well locally, though initial reports flagged the 31B Dense variant for producing looping outputs in some environments; LM Studio developers acknowledged the bug and promised prompt fixes.