
Gemma 4: Google's Open-Source AI Just Became a Real Alternative to Cloud

1458 words · 7 mins
Open Source AI · Local AI · Mobile AI · LLM · Reviews · Gemma 4 · Google · On-Device AI · Mixture of Experts · Apache 2.0 · Edge AI · Open Weight LLM

TL;DR

Gemma 4 dropped on April 2nd, and Google actually did something meaningful: it abandoned the restrictive custom license that made Gemma 3 a liability for serious commercial projects and switched to Apache 2.0. Combined with the technical improvements — a 26B Mixture of Experts model that activates only 3.8 billion of its parameters per inference, a 31B dense model debuting at #3 on the open-weight Arena leaderboard, and genuinely capable edge variants that fit on a phone — this is the most credible open-weight alternative to cloud AI we’ve seen from a major vendor. If you’ve been avoiding Gemma because of the license, now’s the time to revisit it. If you’ve been watching from a distance because the benchmarks weren’t there yet, it’s time to revisit too.

For those who want to experiment without managing infrastructure, NanoGPT offers hosted access to Gemma 4 models alongside their existing catalog. Full disclosure: I’ve been using NanoGPT for a few months now and find their setup more straightforward than juggling API keys across five different providers.


The License Was Always the Problem

Let me be direct about why Gemma 3 never quite landed the way Google probably hoped. The custom Gemma license was a mess. It had a prohibited-use policy that Google could update unilaterally. It required developers to enforce Google’s terms across any project built with Gemma. And buried in the details was language that many developers reasonably interpreted as transferring usage rights to any AI model created with synthetic data produced by Gemma.

Read that last part again. You’re building a product with Gemma. Google can apparently decide that any model trained on outputs from your product is now subject to their license terms. That’s not how open source works.

The result was predictable. Developers who cared about legal clarity picked Llama, Mistral, or the various DeepSeek variants. Gemma 3 had good benchmarks, but the license uncertainty made it a liability for anything that might scale or matter.

Apache 2.0 is the opposite of that. It’s the license underneath most of the software your engineering team already uses. It can’t be unilaterally changed. It doesn’t claim rights to downstream work. Commercial use is unrestricted. Google can’t wake up one morning and decide the terms are different.

That’s the real headline with Gemma 4, even if it’s not the flashy one.


Four Models, Four Different Problems

Google released Gemma 4 as a family of four models, each targeting a different deployment scenario. That’s worth understanding because the use case for the 31B dense model is completely different from the E2B variant.

26B Mixture of Experts (MoE)

The 26B MoE is the interesting technical piece. It has 26 billion total parameters, but activates only 3.8 billion during inference. If you haven’t followed the MoE architecture trend, the mechanism is that a routing layer sends each token to a small subset of expert subnetworks, so most of the computation is skipped for each token — and the practical implication is that it runs much faster than a dense model of the same size.

In practice, this means the 26B MoE delivers quality competitive with models two to three times its parameter count while using a fraction of the compute. Google says it was specifically optimized for high tokens-per-second on local hardware.
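To make the routing idea concrete, here is a toy sketch of a mixture-of-experts layer in NumPy. Everything about it — the sizes, the top-2 gating, the expert count — is illustrative, not Gemma 4’s actual architecture, which Google has not published at this level of detail:

```python
import numpy as np

# Toy MoE layer: many expert weight matrices, but only top_k run per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router = rng.standard_normal((d_model, n_experts))              # gating weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """Run one token vector through only its top-k experts."""
    logits = x @ router                     # score every expert
    chosen = np.argsort(logits)[-top_k:]    # keep the top-k expert ids
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                # softmax over the chosen experts
    # Only top_k of the n_experts matrices are multiplied per token —
    # that skipped work is where the speedup comes from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

y = moe_forward(rng.standard_normal(d_model))
```

With top_k = 2 of 8 experts, each token touches a quarter of the expert weights — the same principle that lets Gemma 4’s 26B MoE activate only 3.8B parameters per inference.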

The 26B MoE runs unquantized in bfloat16 on a single 80GB Nvidia H100. That’s a $20,000 GPU, so “local” here means “in your data center,” not “on your laptop.” But quantize it to 4-bit and it fits on consumer hardware like an RTX 4090. That’s a meaningful improvement in accessibility.
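The memory claims check out with back-of-envelope arithmetic. This counts raw weight storage only — real deployments also need activation memory and KV cache, so treat these numbers as lower bounds:

```python
# VRAM for raw weights at different precisions (weights only, no KV cache).
params = 26e9                   # 26B total parameters
bf16_gb = params * 2 / 1e9      # bfloat16 = 2 bytes per parameter
int4_gb = params * 0.5 / 1e9    # 4-bit = 0.5 bytes per parameter
print(f"bf16: {bf16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
# bf16: 52 GB, 4-bit: 13 GB
```

52 GB of weights fits comfortably on an 80GB H100; 13 GB fits inside an RTX 4090’s 24GB with room left for the KV cache.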

Context window is 256k tokens. For reference, that’s enough to hold about 200,000 words — three average-length novels, or a substantial codebase with room to spare.
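The word estimate follows from the usual rule of thumb that one token is roughly 0.75 English words (the exact ratio varies by tokenizer and text):

```python
# Rough token-to-word conversion; 0.75 words/token is a common heuristic.
tokens = 256_000
words = int(tokens * 0.75)
print(words)  # 192000 -- in the ballpark of "about 200,000 words"
```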

31B Dense

The 31B dense model is where Google went for raw quality. And they got it: the 31B debuts at #3 on the open-weight Arena leaderboard, behind only GLM-5 and Kimi 2.5. Both of those are substantially larger models, which is a meaningful signal.

Dense models like this are more predictable than MoE architectures — every parameter participates in every inference — which makes them better candidates for fine-tuning. If you need a model specifically optimized for your domain, your code style, or your output format, the 31B is your starting point.

The tradeoff is speed. The 31B is more about quality than throughput, and it shows. Google expects developers to fine-tune it rather than use it raw.

It also ships with the same 256k context window.

E2B and E4B: Actually On-Device

The effective 2B (E2B) and effective 4B (E4B) models are where things get interesting for a different reason. These are designed to run on mobile hardware — Pixel phones, Raspberry Pi, Jetson Nano — with “near-zero latency” as Google’s explicit target.

These aren’t shrunk-down versions of the big models that happen to run slower. They’re architected for the constraints of mobile NPUs: low memory bandwidth, limited battery, thermal limits. Google worked directly with Qualcomm and MediaTek on optimization.

Context window drops to 128k tokens on these models, which is still generous for on-device use. They also ship with speech recognition built in, which matters for voice interfaces that need to stay offline.

And here’s something notable: Google confirmed that the next generation of Gemini Nano — the on-device AI that handles scam call detection, call summaries, and note summarization on Pixel phones — will be based on these Gemma 4 E2B and E4B models. This is the first time Google has publicly confirmed that connection.

Agentic Capabilities: Native Function Calling

Every model in the Gemma 4 family supports native function calling, structured JSON output, and built-in instructions for common tools and APIs. This matters because agentic workflows — where an AI model breaks a task into steps, calls tools or APIs between steps, and iterates on results — have become the dominant pattern for production AI applications.

Previous Gemma models required some awkward engineering to get reliable function calling. Gemma 4 has it built in. That reduces the middleware code you need to wrap around the model, which makes self-hosted Gemma 4 more competitive with API-based agents.
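The reduction in middleware boils down to a loop like the one below. The model reply here is hand-written for illustration; in a real deployment it would come back from your inference server (Ollama, vLLM, or similar), with the model emitting the structured JSON directly:

```python
import json

# Minimal tool-dispatch loop that native function calling enables.
# The "model_reply" string stands in for actual model output.
tools = {"get_weather": lambda city: f"18C and clear in {city}"}

model_reply = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_reply)                      # structured JSON output
result = tools[call["tool"]](**call["arguments"])   # dispatch to the tool
print(result)  # 18C and clear in Berlin
```

With a model that emits unreliable or free-form tool calls, the parsing step above turns into retry logic, regex salvage, and validation layers — that is the middleware Gemma 4’s native support removes.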

Code generation is also a focus. Google explicitly calls out improved code quality, which aligns with where the market has shifted: AI coding assistance has become one of the highest-value production use cases, and open-weight models that can run code generation locally are in demand for IP-sensitive enterprise environments.


Vision, Languages, and Everything Else

Gemma 4 ships with improved vision capabilities across the board. Tasks like OCR and chart understanding — historically weak points for open-weight models — are more reliable. If you’re building document processing, data extraction, or anything involving visual input, this matters.

The language support is similarly broad: 140+ languages out of the box. That’s not unusual for a modern model, but it’s worth noting for teams building in multilingual markets.

All of this is built on the same underlying technology as Gemini 3, Google’s closed cloud model family. Gemma has always been derived from the Gemini research, but the Gemma 4 release is the closest alignment yet between the open and closed families.


Where to Run It

Google is offering Gemma 4 across a range of platforms:

  • AI Studio and AI Edge Gallery for browser-based experimentation
  • Hugging Face for model weights and community fine-tunes
  • Kaggle for direct integration with Google’s ML ecosystem
  • Ollama for macOS, Linux, and Windows local inference
  • Google Cloud for hosted deployment at scale

For teams that don’t want to manage their own inference infrastructure, NanoGPT provides hosted API access to Gemma 4 models through their standard setup — no Google Cloud account required, no new API interface to learn.


What This Means for the Market

The open-weight model space has gotten genuinely competitive. A year ago, “open source AI” meant either Llama derivatives with questionable licensing, or Mistral models that were capable but small. Now you’ve got Google shipping a top-3 Arena model under Apache 2.0, with a mobile variant that fits on a phone.

The licensing shift matters most for enterprises that have been waiting for legal clarity before committing. The technical improvements matter most for developers who evaluated Gemma 3 and found the benchmarks not quite competitive enough for their use case. Both of those barriers are substantially lower with Gemma 4.

Whether Google can maintain that technical trajectory — and whether they’ll resist the temptation to add restrictive terms the next time they feel threatened by a competitor — remains to be seen. But right now, Gemma 4 is the best argument I’ve seen for building with open-weight models instead of locking yourself into a single cloud provider.


Article published April 5, 2026. Gemma 4 models referenced: 26B MoE, 31B Dense, E2B, E4B. Testing primarily on 26B MoE and 31B Dense via Hugging Face and NanoGPT.
