How to Run Gemma 4 Locally: Google's Most Capable Open Model on Your Hardware
Google DeepMind released Gemma 4 on April 2, 2026 — and it’s a genuine inflection point for open models. The 31B dense model ranks #3 among all open models on Arena AI, the E2B (2.3B effective parameters) outperforms Gemma 3 27B on several benchmarks despite being 12x smaller, and the whole family ships under Apache 2.0 — no more licensing ambiguity for commercial use.
Here’s how to actually run it, from your phone to your workstation, in under five minutes.
The Model Family at a Glance
| Model | Total Params | Active Params | Architecture | VRAM (4-bit) | Best For |
|---|---|---|---|---|---|
| E2B | ~2.3B | all | Dense + PLE | ~2GB | Mobile / edge devices |
| E4B | ~4.5B | all | Dense + PLE | ~3.6GB | Laptops / tablets |
| 26B A4B | 25.2B | 3.8B | MoE | ~16GB | Consumer GPUs |
| 31B | 30.7B | all | Dense | ~18GB | Workstations |
The 26B MoE is the sweet spot for most people. Despite having 25.2B total parameters, only 3.8B activate per token — so it runs like a 4B model while thinking like a 26B one.
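A quick back-of-the-envelope check of the table's VRAM column, assuming 4-bit weights (0.5 bytes per parameter) plus a fixed allowance for KV cache, activations, and runtime overhead — the overhead figure here is an illustrative assumption, not a spec:

```python
def vram_4bit_gb(total_params_b: float, overhead_gb: float = 3.0) -> float:
    """Rough 4-bit VRAM estimate: 0.5 bytes per parameter plus a fixed
    allowance for KV cache, activations, and runtime overhead (assumed)."""
    weights_gb = total_params_b * 0.5  # 4 bits = 0.5 bytes per param
    return weights_gb + overhead_gb

print(vram_4bit_gb(25.2))  # just under 16 GB — the 26B A4B row
print(vram_4bit_gb(30.7))  # just over 18 GB — the 31B dense row
```

Note that MoE sparsity saves compute, not memory: all 25.2B weights of the 26B A4B must sit in VRAM even though only 3.8B activate per token.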
Option 1: On Your Phone (Android)
The fastest path to running Gemma 4 with zero setup:
- Download Google AI Edge Gallery from the Play Store
- Select Gemma 4 E2B or E4B
- It downloads and runs entirely on-device
No account needed. No API key. No internet after the initial download. The E2B model runs in under 1.5GB of RAM thanks to Google’s LiteRT-LM runtime with 2-bit and 4-bit quantization.
For iOS, there’s no consumer app yet — the current path is the MediaPipe LLM Inference SDK for developers building their own apps.
Option 2: On Your Laptop (Ollama or LM Studio)
Ollama (Fastest)
```shell
# E4B — recommended for most laptops (16GB RAM)
ollama pull gemma4:e4b
ollama run gemma4:e4b

# 26B MoE — needs 16GB+ VRAM
ollama pull gemma4:26b-a4b
ollama run gemma4:26b-a4b
```
Requires Ollama 0.20 or newer.
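Once a model is running, Ollama also exposes a local HTTP API on port 11434, which is handy for scripting. A minimal sketch, assuming the 26B tag pulled above:

```python
import json
import urllib.request

# Ollama serves a local HTTP API once `ollama run` (or `ollama serve`)
# is active. The model tag matches the pull command above.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4:26b-a4b") -> dict:
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:  # requires a running Ollama
        return json.loads(resp.read())["response"]

# With Ollama running:
#   print(generate("Explain mixture-of-experts in one sentence."))
```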
LM Studio
Search “gemma4” in the model browser. E4B and 26B A4B are available as pre-quantized GGUF files. One-click download and run.
Apple Silicon (MLX)
If you’re on a Mac and want lower memory usage at the cost of some throughput:
```shell
pip install mlx-lm
mlx_lm.generate --model unsloth/gemma-4-e4b-it-mlx --prompt "Hello"
```
Unsloth’s MLX builds use ~40% less memory than Ollama — worth it on memory-constrained MacBooks.
Option 3: For Developers (API + Production)
Google AI Studio
For quick prototyping with the full 31B model:
- Open Google AI Studio
- Select Gemma 4 31B
- Use the function-calling API for agentic workflows
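The agentic loop those steps enable boils down to: declare tools, let the model emit a structured call, execute it locally, and feed the result back. A minimal sketch — the schema shape follows the Gemini-style function-declaration convention, and `get_weather` plus the simulated call are illustrative assumptions:

```python
# Tool schema in the Gemini-style function_declarations convention.
# The tool name and parameters are illustrative, not from the Gemma docs.
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(call: dict) -> str:
    """Route a model-emitted function call to a local implementation."""
    handlers = {"get_weather": lambda args: f"Sunny in {args['city']}"}
    return handlers[call["name"]](call["args"])

# Simulated model output: in a real loop this JSON comes from the model,
# and the return value is fed back as the next turn.
result = dispatch({"name": "get_weather", "args": {"city": "Oslo"}})
print(result)  # Sunny in Oslo
```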
Production Deployment
- Vertex AI — managed deployment with autoscaling and SLA guarantees
- Cloud Run — serverless containers, lower operational overhead
- GKE + vLLM — high-throughput serving for teams already on Kubernetes
- OpenRouter — API access at $0.13/M input, $0.40/M output tokens
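OpenRouter's endpoint is OpenAI-compatible, so calling it is just a payload-and-headers exercise. A sketch — the model slug `google/gemma-4-31b` is an assumption; check OpenRouter's model list for the exact identifier:

```python
import json
import os
import urllib.request

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
# The model slug below is an assumption — verify it against the model list.
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(prompt: str, model: str = "google/gemma-4-31b") -> dict:
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```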
What Makes Gemma 4 Architecturally Interesting
Mixture-of-Experts (26B A4B)
Gemma’s MoE implementation is unusual. Instead of replacing MLP blocks with sparse experts (like DeepSeek or Qwen), Gemma adds MoE blocks as separate layers alongside standard MLP blocks and sums their outputs. This trades some efficiency for architectural simplicity — and the results speak for themselves.
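In toy form, the summed-parallel idea looks like this — dimensions, router, and top-k choice are all illustrative, not Gemma 4's actual configuration:

```python
import numpy as np

# Toy sketch of a summed-parallel MoE layer: a standard dense MLP and a
# sparsely-routed expert block both see the same input, and their outputs
# are added rather than one replacing the other.
rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2  # illustrative sizes

W_dense = rng.normal(size=(d, d))              # always-on MLP path
W_experts = rng.normal(size=(n_experts, d, d)) # one weight matrix per expert
W_router = rng.normal(size=(d, n_experts))     # routing scores

def layer(x: np.ndarray) -> np.ndarray:
    dense_out = x @ W_dense
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    moe_out = sum(g * (x @ W_experts[i]) for g, i in zip(gates, top))
    return dense_out + moe_out  # summed, not substituted

y = layer(rng.normal(size=d))
print(y.shape)  # (8,)
```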
Per-Layer Embeddings (E2B and E4B)
The edge models use PLE instead of MoE. Standard transformers give each token a single embedding vector at input. PLE adds a parallel lower-dimensional conditioning pathway — each decoder layer gets token-specific information only when relevant, rather than frontloading everything into one embedding. This is what enables E2B to be genuinely useful at 2.3B parameters.
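A toy sketch of the PLE idea, with illustrative sizes rather than Gemma 4's real dimensions: each layer has its own small embedding table, and the lookup is projected up and added to the hidden state at that layer:

```python
import numpy as np

# Toy per-layer-embedding sketch: besides the usual input embedding, each
# layer looks up a small token-specific vector and projects it into the
# hidden state. All sizes here are illustrative assumptions.
rng = np.random.default_rng(0)
vocab, d_model, d_ple, n_layers = 100, 16, 4, 3

tok_embed = rng.normal(size=(vocab, d_model))           # standard input embedding
ple_tables = rng.normal(size=(n_layers, vocab, d_ple))  # small table per layer
ple_proj = rng.normal(size=(n_layers, d_ple, d_model))  # up-projection per layer

def forward(token_id: int) -> np.ndarray:
    h = tok_embed[token_id]
    for layer in range(n_layers):
        # token-specific conditioning injected at this layer only
        h = h + ple_tables[layer, token_id] @ ple_proj[layer]
        # ... attention / MLP for this layer would go here ...
    return h

print(forward(7).shape)  # (16,)
```

The memory win is that the per-layer tables are low-dimensional and can stay off the accelerator, streamed in per token — which is how a 2.3B-effective-parameter model fits the footprint of a much smaller one.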
Multimodal Across the Board
All four models support image and video input. The edge models (E2B, E4B) additionally support native audio input via a USM-style conformer encoder — speech-to-text and speech-to-translated-text on-device.
| Feature | E2B / E4B | 26B / 31B |
|---|---|---|
| Context | 128K tokens | 256K tokens |
| Images | ✅ | ✅ |
| Video | ✅ (60s @ 1fps) | ✅ |
| Audio | ✅ | ❌ |
| Function calling | ✅ | ✅ |
Benchmarks That Matter
The 31B dense model with reasoning:
- AIME 2026: 89.2%
- GPQA Diamond: 85.7%
- LiveCodeBench v6: 80.0%
Perhaps more telling: on Arena AI’s human preference rankings, the 31B places higher than its raw benchmark comparison with Qwen 3.5 27B would suggest — humans prefer its outputs even when measured accuracy is similar.
Fine-Tuning Caveats
QLoRA fine-tuning had some rough edges at launch — HuggingFace Transformers initially didn’t recognize the gemma4 architecture, PEFT couldn’t handle the new Gemma4ClippableLinear layer type, and a new mm_token_type_ids field is required during training. These issues are being resolved upstream, but check the HuggingFace repo status before attempting fine-tuning.
The Bottom Line
Gemma 4 under Apache 2.0 is the most commercially viable open model family available. The 26B MoE gives you near-frontier performance on a consumer GPU. The E2B runs on a phone. And the 31B dense competes with models 20x its size on human preference rankings.
For anyone building products with open models, Gemma 4 is now a first-class option alongside Qwen 3.5 and Llama.
Models: Hugging Face | Docs: deepmind.google/models/gemma | AI Studio: aistudio.google.com