How to Run Gemma 4 Locally: Google's Most Capable Open Model on Your Hardware
Google DeepMind released Gemma 4 on April 2, 2026 — and it’s a genuine inflection point for open models. The 31B dense model ranks #3 among all open models on Arena AI, the E2B (2.3B effective parameters) outperforms Gemma 3 27B on several benchmarks despite being 12x smaller, and the whole family ships under Apache 2.0 — no more licensing ambiguity for commercial use.
Here’s how to actually run it, from your phone to your workstation, in under five minutes.
The Model Family at a Glance
| Model | Total Params | Active Params | Architecture | VRAM (4-bit) | Best For |
|---|---|---|---|---|---|
| E2B | ~2.3B | all | Dense + PLE | ~2GB | Mobile / edge devices |
| E4B | ~4.5B | all | Dense + PLE | ~3.6GB | Laptops / tablets |
| 26B A4B | 25.2B | 3.8B | MoE | ~16GB | Consumer GPUs |
| 31B | 30.7B | all | Dense | ~18GB | Workstations |
The 26B MoE is the sweet spot for most people. Despite having 25.2B total parameters, only 3.8B activate per token — so it runs like a 4B model while thinking like a 26B one.
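A quick back-of-the-envelope check of the table's VRAM column, assuming 4-bit weights (0.5 bytes per parameter) plus a fixed allowance for KV cache, activations, and runtime overhead — the overhead figure here is an illustrative assumption, not a spec:

```python
def vram_4bit_gb(total_params_b: float, overhead_gb: float = 3.0) -> float:
    """Rough 4-bit VRAM estimate: 0.5 bytes per parameter plus a fixed
    allowance for KV cache, activations, and runtime overhead (assumed)."""
    weights_gb = total_params_b * 0.5  # 4 bits = 0.5 bytes per param
    return weights_gb + overhead_gb

print(vram_4bit_gb(25.2))  # just under 16 GB — the 26B A4B row
print(vram_4bit_gb(30.7))  # just over 18 GB — the 31B dense row
```

Note that MoE sparsity saves compute, not memory: all 25.2B weights of the 26B A4B must sit in VRAM even though only 3.8B activate per token.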
Option 1: On Your Phone (Android)
The fastest path to running Gemma 4 with zero setup:
- Download Google AI Edge Gallery from the Play Store
- Select Gemma 4 E2B or E4B
- It downloads and runs entirely on-device
No account needed. No API key. No internet after the initial download. The E2B model runs in under 1.5GB of RAM thanks to Google’s LiteRT-LM runtime with 2-bit and 4-bit quantization.
For iOS, there’s no consumer app yet — the current path is the MediaPipe LLM Inference SDK for developers building their own apps.
Option 2: On Your Laptop (Ollama or LM Studio)
Ollama (Fastest)
```shell
# E4B — recommended for most laptops (16GB RAM)
ollama pull gemma4:e4b
ollama run gemma4:e4b

# 26B MoE — needs 16GB+ VRAM
ollama pull gemma4:26b-a4b
ollama run gemma4:26b-a4b
```
Requires Ollama 0.20 or newer.
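Once a model is running, Ollama also exposes a local HTTP API on port 11434, which is handy for scripting. A minimal sketch, assuming the 26B tag pulled above:

```python
import json
import urllib.request

# Ollama serves a local HTTP API once `ollama run` (or `ollama serve`)
# is active. The model tag matches the pull command above.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma4:26b-a4b") -> dict:
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str) -> str:
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:  # requires a running Ollama
        return json.loads(resp.read())["response"]

# With Ollama running:
#   print(generate("Explain mixture-of-experts in one sentence."))
```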
LM Studio
Search “gemma4” in the model browser. E4B and 26B A4B are available as pre-quantized GGUF files. One-click download and run.
Apple Silicon (MLX)
If you’re on a Mac and want lower memory usage at the cost of some throughput:
```shell
pip install mlx-lm
mlx_lm.generate --model unsloth/gemma-4-e4b-it-mlx --prompt "Hello"
```
Unsloth’s MLX builds use ~40% less memory than Ollama — worth it on memory-constrained MacBooks.
Option 3: For Developers (API + Production)
Google AI Studio
For quick prototyping with the full 31B model:
- Open Google AI Studio
- Select Gemma 4 31B
- Use the function-calling API for agentic workflows
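The agentic loop those steps enable boils down to: declare tools, let the model emit a structured call, execute it locally, and feed the result back. A minimal sketch — the schema shape follows the Gemini-style function-declaration convention, and `get_weather` plus the simulated call are illustrative assumptions:

```python
# Tool schema in the Gemini-style function_declarations convention.
# The tool name and parameters are illustrative, not from the Gemma docs.
weather_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(call: dict) -> str:
    """Route a model-emitted function call to a local implementation."""
    handlers = {"get_weather": lambda args: f"Sunny in {args['city']}"}
    return handlers[call["name"]](call["args"])

# Simulated model output: in a real loop this JSON comes from the model,
# and the return value is fed back as the next turn.
result = dispatch({"name": "get_weather", "args": {"city": "Oslo"}})
print(result)  # Sunny in Oslo
```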
Production Deployment
- Vertex AI — managed deployment with autoscaling and SLA guarantees
- Cloud Run — serverless containers, lower operational overhead
- GKE + vLLM — high-throughput serving for teams already on Kubernetes
- OpenRouter — API access at $0.13/M input, $0.40/M output tokens
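OpenRouter's endpoint is OpenAI-compatible, so calling it is just a payload-and-headers exercise. A sketch — the model slug `google/gemma-4-31b` is an assumption; check OpenRouter's model list for the exact identifier:

```python
import json
import os
import urllib.request

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
# The model slug below is an assumption — verify it against the model list.
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(prompt: str, model: str = "google/gemma-4-31b") -> dict:
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```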
What Makes Gemma 4 Architecturally Interesting
Mixture-of-Experts (26B A4B)
Gemma’s MoE implementation is unusual. Instead of replacing MLP blocks with sparse experts (like DeepSeek or Qwen), Gemma adds MoE blocks as separate layers alongside standard MLP blocks and sums their outputs. This trades some efficiency for architectural simplicity — and the results speak for themselves.
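In toy form, the summed-parallel idea looks like this — dimensions, router, and top-k choice are all illustrative, not Gemma 4's actual configuration:

```python
import numpy as np

# Toy sketch of a summed-parallel MoE layer: a standard dense MLP and a
# sparsely-routed expert block both see the same input, and their outputs
# are added rather than one replacing the other.
rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2  # illustrative sizes

W_dense = rng.normal(size=(d, d))              # always-on MLP path
W_experts = rng.normal(size=(n_experts, d, d)) # one weight matrix per expert
W_router = rng.normal(size=(d, n_experts))     # routing scores

def layer(x: np.ndarray) -> np.ndarray:
    dense_out = x @ W_dense
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over top-k
    moe_out = sum(g * (x @ W_experts[i]) for g, i in zip(gates, top))
    return dense_out + moe_out  # summed, not substituted

y = layer(rng.normal(size=d))
print(y.shape)  # (8,)
```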
Per-Layer Embeddings (E2B and E4B)
The edge models use PLE instead of MoE. Standard transformers give each token a single embedding vector at input. PLE adds a parallel lower-dimensional conditioning pathway — each decoder layer gets token-specific information only when relevant, rather than frontloading everything into one embedding. This is what enables E2B to be genuinely useful at 2.3B parameters.
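A toy sketch of the PLE idea, with illustrative sizes rather than Gemma 4's real dimensions: each layer has its own small embedding table, and the lookup is projected up and added to the hidden state at that layer:

```python
import numpy as np

# Toy per-layer-embedding sketch: besides the usual input embedding, each
# layer looks up a small token-specific vector and projects it into the
# hidden state. All sizes here are illustrative assumptions.
rng = np.random.default_rng(0)
vocab, d_model, d_ple, n_layers = 100, 16, 4, 3

tok_embed = rng.normal(size=(vocab, d_model))           # standard input embedding
ple_tables = rng.normal(size=(n_layers, vocab, d_ple))  # small table per layer
ple_proj = rng.normal(size=(n_layers, d_ple, d_model))  # up-projection per layer

def forward(token_id: int) -> np.ndarray:
    h = tok_embed[token_id]
    for layer in range(n_layers):
        # token-specific conditioning injected at this layer only
        h = h + ple_tables[layer, token_id] @ ple_proj[layer]
        # ... attention / MLP for this layer would go here ...
    return h

print(forward(7).shape)  # (16,)
```

The memory win is that the per-layer tables are low-dimensional and can stay off the accelerator, streamed in per token — which is how a 2.3B-effective-parameter model fits the footprint of a much smaller one.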
Multimodal Across the Board
All four models support image and video input. The edge models (E2B, E4B) additionally support native audio input via a USM-style conformer encoder — speech-to-text and speech-to-translated-text on-device.
| Feature | E2B / E4B | 26B / 31B |
|---|---|---|
| Context | 128K tokens | 256K tokens |
| Images | ✅ | ✅ |
| Video | ✅ (60s @ 1fps) | ✅ |
| Audio | ✅ | ❌ |
| Function calling | ✅ | ✅ |
Benchmarks That Matter
The 31B dense model with reasoning:
- AIME 2026: 89.2%
- GPQA Diamond: 85.7%
- LiveCodeBench v6: 80.0%
Perhaps more telling: on Arena AI’s human preference rankings, the 31B places higher than its raw benchmark comparison with Qwen 3.5 27B would suggest — humans prefer its outputs even when measured accuracy is similar.
Fine-Tuning Caveats
QLoRA fine-tuning had some rough edges at launch — HuggingFace Transformers initially didn’t recognize the gemma4 architecture, PEFT couldn’t handle the new Gemma4ClippableLinear layer type, and a new mm_token_type_ids field is required during training. These issues are being resolved upstream, but check the HuggingFace repo status before attempting fine-tuning.
The Bottom Line
Gemma 4 under Apache 2.0 is the most commercially viable open model family available. The 26B MoE gives you near-frontier performance on a consumer GPU. The E2B runs on a phone. And the 31B dense competes with models 20x its size on human preference rankings.
For anyone building products with open models, Gemma 4 is now a first-class option alongside Qwen 3.5 and Llama.
Models: Hugging Face | Docs: deepmind.google/models/gemma | AI Studio: aistudio.google.com