Train Gemma 4 with Reinforcement Learning (GRPO) for Free on Google Colab
Google released Gemma 4 in four sizes — E2B, E4B, 26B-A4B (MoE), and 31B Dense — with native support for vision, text, and audio across 140 languages. The smallest variant, E2B, trains on just 8GB of VRAM. That means you can fine-tune it on a free Google Colab T4 GPU.
This guide walks through training Gemma 4 E2B with GRPO reinforcement learning using Unsloth, which cuts VRAM usage by 60% and speeds up training by 1.5x.
Why GRPO for Gemma 4?
Group Relative Policy Optimization (GRPO) is a reinforcement learning technique that teaches models to reason better. Where supervised fine-tuning requires you to provide correct answers, GRPO learns from relative comparisons:
- Generate multiple candidate responses to the same prompt
- Score each response with a reward function (correctness, format, etc.)
- Update the model policy based on relative performance within the group
This approach is particularly effective for tasks requiring structured reasoning — math, logic puzzles, code generation, and multi-step problem solving. The model learns how to think, not just what to answer.
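The "group relative" idea can be sketched in a few lines. This is a simplified illustration of how rewards within one group of candidates are normalized into advantages, not the full GRPO loss:

```python
# Simplified illustration of GRPO's group-relative scoring:
# each response's reward is compared to its group's statistics.

def group_relative_advantages(rewards):
    """Normalize rewards against the group mean and std (simplified)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Four candidate responses to one prompt, scored by a reward function:
advantages = group_relative_advantages([1.0, -2.0, 1.0, 0.5])
# Responses scoring above the group mean get positive advantage
# and are reinforced; below-average responses are discouraged.
```

Because advantages are computed within each group, GRPO needs no separate value model — the other candidates for the same prompt serve as the baseline.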
What You Need
- A Google account (free Colab tier works)
- A Hugging Face token (for model access)
- About 30-60 minutes of training time
Quick Start with Unsloth’s Notebook
The fastest path is Unsloth’s ready-made Colab notebook, which trains Gemma 4 E2B on Sudoku puzzles using GRPO.
The notebook handles installation, model loading, dataset preparation, reward function definition, and training — all within the free T4 GPU’s 16GB VRAM limit.
Why Fine-Tune Instead of RAG or Prompting?
Before reaching for RL fine-tuning, it’s worth asking: do you actually need it?
| Approach | Best For | Tradeoff |
|---|---|---|
| Prompt engineering | Format control, persona, simple instructions | Zero cost, but limited by context window and model’s existing knowledge |
| RAG (knowledge base) | Factual recall, domain documents, up-to-date info | Adds knowledge without changing the model, but doesn’t improve reasoning |
| SFT (supervised fine-tuning) | Teaching specific output formats, domain language, style | Needs labeled examples; teaches what to output |
| RL (GRPO/RLHF) | Improving how the model thinks — reasoning, strategy, judgment | No labeled answers needed, but requires a reward signal |
RAG gives the model better information. RL gives the model better reasoning.
If your problem is “the model doesn’t know about my company’s products” — use RAG. If your problem is “the model knows the rules of Sudoku but can’t reliably solve puzzles” — that’s a reasoning gap, and RL is the right tool.
The Sudoku example is illustrative: the model already “knows” Sudoku rules from pretraining. What it lacks is the ability to consistently generate working algorithmic strategies. GRPO trains that capability by letting the model practice thousands of attempts and rewarding what works.
When RL fine-tuning makes sense:
- Tasks with verifiable correctness (math, code, logic, games)
- Improving chain-of-thought reasoning quality
- Teaching the model to avoid specific failure modes
- When you can write a reward function but can’t easily write 10,000 gold-standard examples
When it doesn’t:
- You just need the model to access specific facts → RAG
- You want a particular output format → SFT or prompting
- Your task has no clear “right answer” to reward against
The beauty of GRPO is that it sits on top of everything else. You can RAG for knowledge, SFT for format, and then RL for reasoning — they’re complementary layers, not competitors.
The Sudoku Example: End-to-End Walkthrough
The Unsloth notebook teaches Gemma 4 to write Sudoku-solving code — not just solve puzzles directly, but generate Python strategy functions that solve them. This is a compelling RL task because it requires genuine algorithmic reasoning.
Step 1: The Sudoku Environment
First, a SudokuGame class provides the training environment. It generates puzzles at configurable difficulty (number of empty cells), validates moves, and tracks game state:
```python
game = SudokuGame(difficulty=30, seed=42)
print(game.pretty())  # shows the 9x9 grid with 0s for empty cells
```
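The notebook ships its own implementation of this class. As a rough, hypothetical sketch of the interface (the class and method names here are illustrative, not the notebook's exact code), the core of such an environment is move validation against Sudoku's row, column, and box constraints:

```python
# Hypothetical minimal sketch of a Sudoku environment --
# the notebook's actual SudokuGame class differs in details
# (it also generates puzzles and tracks full game state).

class MiniSudoku:
    def __init__(self, board):
        self.board = [row[:] for row in board]  # 9x9 grid, 0 = empty

    def is_valid_move(self, row, col, number):
        """Valid if the cell is empty and `number` does not already
        appear in the cell's row, column, or 3x3 box."""
        if self.board[row][col] != 0:
            return False
        if number in self.board[row]:
            return False
        if number in (self.board[r][col] for r in range(9)):
            return False
        br, bc = 3 * (row // 3), 3 * (col // 3)
        box = [self.board[r][c]
               for r in range(br, br + 3) for c in range(bc, bc + 3)]
        return number not in box

    def apply(self, row, col, number):
        self.board[row][col] = number
```

Everything the reward functions need later — “was that move legal?”, “is the puzzle solved?” — reduces to checks like these.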
Step 2: The Prompt (Your “Dataset”)
Here’s the clever part — the dataset is just one prompt repeated 1,000 times:
```python
from datasets import Dataset

prompt = """
Create a Sudoku solving strategy using only native Python built-in functions
without any import statements.
You are given two lists of lists (9x9 grids):
- board: current state (0 means empty)
- initial: starting puzzle (0 means was empty, numbers are fixed)
Return a tuple (row, col, number) for the next move.
"""

dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": prompt.strip()}], "answer": 0}
] * 1000)
```
There are no pre-written solutions. The model generates candidate strategy functions, and the reward functions judge them. That’s the entire RL loop — no labeled data needed.
Step 3: Three Reward Functions
GRPO needs reward functions to score each generated response. The notebook uses three:
1. function_works — Does the generated code parse and execute as valid Python? Returns -2.0 for broken code, +1.0 for working functions.
2. no_cheating — Does the function import external libraries? The prompt says “no imports,” so this penalizes reward hacking. Uses check_python_modules to inspect the AST.
3. strategy_succeeds — The big one. It actually runs the generated strategy against random Sudoku puzzles and rewards based on how many valid moves it makes and whether it solves the puzzle. Partial credit for valid moves, full credit for completion.
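The first two rewards are straightforward to approximate. Here is a hedged sketch of what they might look like — the notebook's actual implementations, and helpers like `check_python_modules`, may differ — using the standard-library `ast` module to check syntax and scan for import statements:

```python
# Simplified sketches of the first two reward functions.
# TRL's GRPOTrainer calls each with the batch of completions and
# expects one float score per completion.
import ast

def function_works(completions, **kwargs):
    """+1.0 if the response parses as valid Python, -2.0 otherwise.
    (Simplified: parse check only; the notebook also executes the code.)"""
    scores = []
    for completion in completions:
        code = completion[0]["content"]
        try:
            ast.parse(code)
            scores.append(1.0)
        except SyntaxError:
            scores.append(-2.0)
    return scores

def no_cheating(completions, **kwargs):
    """Penalize any import statement, since the prompt forbids them."""
    scores = []
    for completion in completions:
        code = completion[0]["content"]
        try:
            tree = ast.parse(code)
        except SyntaxError:
            scores.append(0.0)  # already penalized by function_works
            continue
        has_import = any(isinstance(n, (ast.Import, ast.ImportFrom))
                         for n in ast.walk(tree))
        scores.append(-2.0 if has_import else 0.0)
    return scores
```

Inspecting the AST rather than grepping for the string `import` matters: it catches `from x import y` and ignores the word “import” inside strings or comments.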
```python
import numpy as np

def strategy_succeeds(completions, **kwargs):
    """Reward valid moves even if the strategy eventually fails."""
    scores = []
    seed = np.random.randint(10000)
    difficulty = 40
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function(response)
        # Execute the strategy against a real puzzle and
        # score based on valid moves and completion
        ...
```
Code execution is sandboxed with a 10-second time limit to prevent infinite loops.
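One common way to enforce such a limit on Linux (which is what Colab runs) is a SIGALRM-based timeout. This is a hedged sketch — `run_with_timeout` is a hypothetical helper, and the notebook's sandboxing may work differently:

```python
# Hypothetical timeout wrapper for running untrusted strategy code.
# Unix-only and main-thread-only, which is fine for a Colab notebook.
import signal

class StrategyTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise StrategyTimeout("strategy exceeded time limit")

def run_with_timeout(fn, args=(), seconds=10):
    """Run fn(*args); abort and return None after `seconds`."""
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)
    try:
        return fn(*args)
    except StrategyTimeout:
        return None  # treat a timeout as a failed attempt
    finally:
        signal.alarm(0)  # always cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

A timeout alone is not a security sandbox — it only stops infinite loops, which is the failure mode that matters here since the model's own generated code is being scored, not arbitrary user input.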
Step 4: Train with GRPO
```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    num_generations=4,
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[function_works, no_cheating, strategy_succeeds],
)
trainer.train()
```
Step 5: Export to GGUF
After training, export for local inference with Ollama or llama.cpp:
```python
model.save_pretrained_gguf("gemma4-grpo", tokenizer, quantization_method="q4_k_m")
```
Adapting This to Your Own Tasks
The Sudoku example is a template. To train on your own RL task, you swap three things:
- The prompt — describe what you want the model to produce
- The reward functions — define what “good” looks like (correctness, formatting, safety, etc.)
- The evaluation environment — if your task needs execution, build a sandbox
No labeled dataset required. That’s the power of RL — the reward signal is the supervision.
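Concretely, a custom reward function is just a Python callable: it receives the batch of completions and returns one float per completion. A minimal template — the specific checks here (keyword presence, a length cap) are arbitrary examples to swap for your own criteria:

```python
def my_reward(completions, **kwargs):
    """Return one score per completion.
    Example criteria: reward responses containing a required
    keyword, and discourage overly long answers."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]
        score = 0.0
        if "def " in text:       # task-specific correctness check
            score += 1.0
        if len(text) > 4000:     # discourage rambling
            score -= 0.5
        scores.append(score)
    return scores
```

Pass it alongside (or instead of) the Sudoku rewards via `reward_funcs=[my_reward]` in the `GRPOTrainer` call. Rewards can be partial and additive — the group-relative normalization means only their relative ordering within each batch of generations matters.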
Bugs Unsloth Fixed
Training Gemma 4 isn’t plug-and-play without Unsloth. They fixed several upstream issues:
- KV cache sharing — Gemma 4’s sliding window attention shares KV caches across layers, which caused silent failures in standard implementations
- Gradient accumulation — loss calculations were incorrect with gradient accumulation steps > 1
- Audio fp16 overflow — audio input processing caused numerical overflow in half precision
These fixes are baked into Unsloth, so you don’t need to patch anything yourself.
Model Size Guide
| Model | VRAM (4-bit) | Free Colab? |
|---|---|---|
| E2B | ~8 GB | ✅ T4 works |
| E4B | ~10-17 GB | ⚠️ Tight on T4 |
| 26B-A4B (MoE) | ~20 GB | ❌ Needs A100 |
| 31B Dense | ~24 GB | ❌ Needs A100 |
What’s Next
Once you have a GRPO-trained Gemma 4 model, you can:
- Run locally with Ollama using the GGUF export
- Train on your own tasks by swapping the dataset and reward functions
- Scale up to E4B or the MoE variant on Colab Pro (A100 GPU)
- Add vision/audio inputs — Gemma 4 is natively multimodal
The combination of a free GPU, efficient tooling, and a capable base model makes this one of the most accessible ways to experiment with RL-based LLM training today.
📓 GitHub repo with notebook: menonpg/gemma4-rl-colab