How to Run Gemma 4 Locally: Google's Most Capable Open Model on Your Hardware

By Prahlad Menon · 5 min read

Google DeepMind released Gemma 4 on April 2, 2026 — and it’s a genuine inflection point for open models. The 31B dense model ranks #3 among all open models on Arena AI, the E2B (2.3B effective parameters) outperforms Gemma 3 27B on several benchmarks despite being 12x smaller, and the whole family ships under Apache 2.0 — no more licensing ambiguity for commercial use.

Here’s how to actually run it, from your phone to your workstation, in under five minutes.

The Model Family at a Glance

| Model | Total Params | Active Params | Architecture | VRAM (4-bit) | Best For |
|---|---|---|---|---|---|
| E2B | ~2.3B | all | Dense + PLE | ~2GB | Mobile / edge devices |
| E4B | ~4.5B | all | Dense + PLE | ~3.6GB | Laptops / tablets |
| 26B A4B | 25.2B | 3.8B | MoE | ~16GB | Consumer GPUs |
| 31B | 30.7B | all | Dense | ~18GB | Workstations |

The 26B MoE is the sweet spot for most people. Despite its 25.2B total parameters, only 3.8B activate per token, so per-token compute is roughly that of a 4B model, although all 25.2B weights still have to fit in memory (hence the ~16GB VRAM figure).
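As a sanity check on the table's VRAM column, here's a back-of-envelope sizing sketch. The 3GB overhead figure for KV cache and activations is my own assumption, not an official number:

```python
# Back-of-envelope sizing for the 26B A4B MoE, using the numbers in the table above.
def vram_gb_4bit(total_params_b, overhead_gb=3.0):
    """Weights at 4 bits (0.5 bytes) per parameter, plus an assumed
    ~3GB of KV-cache/activation overhead."""
    return total_params_b * 0.5 + overhead_gb

def active_fraction(active_b, total_b):
    """Share of weights touched per token."""
    return active_b / total_b

print(round(vram_gb_4bit(25.2), 1))          # 15.6 -> matches the ~16GB column
print(round(active_fraction(3.8, 25.2), 2))  # 0.15 -> ~15% of weights per token
```

The same arithmetic explains the 31B row: 30.7B × 0.5 bytes ≈ 15.4GB of weights, landing near the quoted ~18GB once overhead is included.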

Option 1: On Your Phone (Android)

The fastest path to running Gemma 4 with zero setup:

  1. Download Google AI Edge Gallery from the Play Store
  2. Select Gemma 4 E2B or E4B
  3. It downloads and runs entirely on-device

No account needed. No API key. No internet after the initial download. The E2B model runs in under 1.5GB of RAM thanks to Google's LiteRT-LM runtime with 2-bit and 4-bit quantization.

For iOS, there’s no consumer app yet — the current path is the MediaPipe LLM Inference SDK for developers building their own apps.

Option 2: On Your Laptop (Ollama or LM Studio)

Ollama (Fastest)

# E4B — recommended for most laptops (16GB RAM)
ollama pull gemma4:e4b
ollama run gemma4:e4b

# 26B MoE — needs 16GB+ VRAM
ollama pull gemma4:26b-a4b
ollama run gemma4:26b-a4b

Requires Ollama 0.20 or newer.
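Once the model is pulled, you can also talk to Ollama programmatically over its local REST API. A minimal standard-library sketch; the /api/generate endpoint and payload shape are Ollama's standard API, and the prompt is just illustrative:

```python
import json
import urllib.request

def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Assemble a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("gemma4:e4b", "Explain MoE routing in one sentence.")

# With `ollama serve` running locally, send it like this:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["response"])
```

The same endpoint works for the 26B A4B tag; only the "model" field changes.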

LM Studio

Search “gemma4” in the model browser. E4B and 26B A4B are available as pre-quantized GGUF files. One-click download and run.

Apple Silicon (MLX)

If you’re on a Mac and want lower memory usage at the cost of some throughput:

pip install mlx-lm
mlx_lm.generate --model unsloth/gemma-4-e4b-it-mlx --prompt "Hello"

Unsloth’s MLX builds use ~40% less memory than Ollama — worth it on memory-constrained MacBooks.

Option 3: For Developers (API + Production)

Google AI Studio

For quick prototyping with the full 31B model:

  1. Open Google AI Studio
  2. Select Gemma 4 31B
  3. Use the function-calling API for agentic workflows
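Function-calling APIs in the Gemini/AI Studio ecosystem take tool declarations in an OpenAPI-subset JSON schema. Here's a hedged sketch of one such declaration; get_weather and its parameters are made-up examples for illustration, not part of any Gemma API:

```python
# A tool declaration in the OpenAPI-subset schema used for function calling.
# "get_weather" is a hypothetical tool invented for this example.
def weather_tool_declaration():
    return {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    }

decl = weather_tool_declaration()
```

The model decides when to emit a call matching this schema; your code executes it and feeds the result back as the next turn.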

Production Deployment

  • Vertex AI — managed deployment with autoscaling and SLA guarantees
  • Cloud Run — serverless containers, lower operational overhead
  • GKE + vLLM — high-throughput serving for teams already on Kubernetes
  • OpenRouter — API access at $0.13/M input, $0.40/M output tokens
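At those OpenRouter rates, monthly spend is easy to estimate. A quick sketch; the traffic numbers are illustrative assumptions, only the per-token rates come from the listing above:

```python
# Rough monthly cost at the listed Gemma 4 rates (USD per 1M tokens).
INPUT_RATE, OUTPUT_RATE = 0.13, 0.40

def monthly_cost(requests_per_day, in_tokens, out_tokens, days=30):
    total_in = requests_per_day * in_tokens * days
    total_out = requests_per_day * out_tokens * days
    return (total_in * INPUT_RATE + total_out * OUTPUT_RATE) / 1e6

# Hypothetical load: 10k requests/day, 1,500 prompt + 400 completion tokens each.
print(round(monthly_cost(10_000, 1_500, 400), 2))  # 106.5
```

Roughly $100/month for a moderately busy app, which is the point of an open model at these prices: self-hosting on GKE + vLLM only wins once volume climbs well past this.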

What Makes Gemma 4 Architecturally Interesting

Mixture-of-Experts (26B A4B)

Gemma’s MoE implementation is unusual. Instead of replacing the MLP blocks with sparse experts (as DeepSeek and Qwen do), Gemma adds MoE blocks as separate layers alongside the standard MLP blocks and sums their outputs. This trades some parameter efficiency for architectural simplicity, and the benchmark results bear the trade-off out.
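To make the additive design concrete, here's a toy NumPy sketch of such a block. The dimensions, tanh activations, and top-k softmax router are illustrative simplifications of generic MoE practice, not Gemma's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

W_mlp = rng.normal(size=(d, d))                # standard dense MLP weights
W_experts = rng.normal(size=(n_experts, d, d)) # one weight matrix per expert
W_router = rng.normal(size=(d, n_experts))     # router picks experts per token

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gemma_style_block(h):
    """Dense MLP path and sparse MoE path run side by side; outputs are summed."""
    mlp_out = np.tanh(h @ W_mlp)                  # dense path: always active
    logits = h @ W_router
    idx = np.argsort(logits)[-top_k:]             # sparse path: top-k experts only
    gates = softmax(logits[idx])
    moe_out = sum(g * np.tanh(h @ W_experts[i]) for g, i in zip(gates, idx))
    return mlp_out + moe_out                      # summed, not substituted

h = rng.normal(size=d)
out = gemma_style_block(h)
```

The key line is the final sum: in the replace-the-MLP design used elsewhere, `mlp_out` simply wouldn't exist.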

Per-Layer Embeddings (E2B and E4B)

The edge models use PLE instead of MoE. Standard transformers give each token a single embedding vector at input. PLE adds a parallel lower-dimensional conditioning pathway — each decoder layer gets token-specific information only when relevant, rather than frontloading everything into one embedding. This is what enables E2B to be genuinely useful at 2.3B parameters.
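A toy sketch of the PLE idea; all shapes, the per-layer lookup tables, and the projection back to model width are illustrative assumptions, not Gemma's actual design:

```python
import numpy as np

vocab, n_layers, d_model, d_ple = 100, 4, 32, 8
rng = np.random.default_rng(1)

tok_embed = rng.normal(size=(vocab, d_model))          # standard input embedding
ple_embed = rng.normal(size=(n_layers, vocab, d_ple))  # small per-layer tables
proj = rng.normal(size=(n_layers, d_ple, d_model))     # lift d_ple up to d_model

def forward(token_id):
    """One token through the stack: each layer adds its own low-dimensional,
    token-specific conditioning instead of relying on a single input embedding."""
    h = tok_embed[token_id]
    for layer in range(n_layers):
        h = h + ple_embed[layer, token_id] @ proj[layer]
    return h

out = forward(3)
```

Because the per-layer tables live in a low dimension (d_ple << d_model), they add token-specific capacity at every layer for a fraction of the parameter cost of widening the main embedding.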

Multimodal Across the Board

All four models support image and video input. The edge models (E2B, E4B) additionally support native audio input via a USM-style conformer encoder — speech-to-text and speech-to-translated-text on-device.

| Feature | E2B / E4B | 26B / 31B |
|---|---|---|
| Context | 128K tokens | 256K tokens |
| Images | ✅ | ✅ |
| Video | ✅ (60s @ 1fps) | ✅ |
| Audio | ✅ | ❌ |
| Function calling | ✅ | ✅ |

Benchmarks That Matter

The 31B dense model with reasoning:

  • AIME 2026: 89.2%
  • GPQA Diamond: 85.7%
  • LiveCodeBench v6: 80.0%

Perhaps more telling: on Arena AI’s human preference rankings, the 31B scores higher against Qwen 3.5 27B than their raw benchmark gap would suggest. Humans prefer its outputs even when measured accuracy is similar.

Fine-Tuning Caveats

QLoRA fine-tuning had some rough edges at launch — HuggingFace Transformers initially didn’t recognize the gemma4 architecture, PEFT couldn’t handle the new Gemma4ClippableLinear layer type, and a new mm_token_type_ids field is required during training. These issues are being resolved upstream, but check the HuggingFace repo status before attempting fine-tuning.

The Bottom Line

Gemma 4 under Apache 2.0 is the most commercially viable open model family available. The 26B MoE gives you near-frontier performance on a consumer GPU. The E2B runs on a phone. And the 31B dense competes with models 20x its size on human preference rankings.

For anyone building products with open models, Gemma 4 is now a first-class option alongside Qwen 3.5 and Llama.

Models: Hugging Face | Docs: deepmind.google/models/gemma | AI Studio: aistudio.google.com