Train Gemma 4 with Reinforcement Learning (GRPO) for Free on Google Colab

By Prahlad Menon · 3 min read

Google released Gemma 4 in four sizes — E2B, E4B, 26B-A4B (MoE), and 31B Dense — with native support for vision, text, and audio across 140 languages. The smallest variant, E2B, trains on just 8GB of VRAM. That means you can fine-tune it on a free Google Colab T4 GPU.

This guide walks through training Gemma 4 E2B with GRPO reinforcement learning using Unsloth, which cuts VRAM usage by 60% and speeds up training by 1.5x.

Why GRPO for Gemma 4?

Group Relative Policy Optimization (GRPO) is a reinforcement learning technique that teaches models to reason better. Where supervised fine-tuning requires you to provide correct answers, GRPO works in three steps:

  1. Generate multiple candidate responses to the same prompt
  2. Score each response with a reward function (correctness, format, etc.)
  3. Update the model policy based on relative performance within the group

This approach is particularly effective for tasks requiring structured reasoning — math, logic puzzles, code generation, and multi-step problem solving. The model learns how to think, not just what to answer.
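
To make "relative performance within the group" concrete, here is a minimal sketch of how group-relative scores can be computed. The function name and reward values are illustrative, not taken from the notebook:

import statistics

def group_relative_advantages(rewards):
    """Illustrative only: score each response relative to its own group.

    GRPO normalizes rewards within a group of candidates for the same prompt,
    so a response is reinforced only if it did better than its siblings.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# Four candidate answers to one prompt, scored by a reward function
rewards = [1.0, -2.0, 1.0, 2.0]
print(group_relative_advantages(rewards))  # positive = better than the group average

Responses scoring above the group mean get positive advantages and are reinforced; those below are discouraged.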

What You Need

  • A Google account (free Colab tier works)
  • A Hugging Face token (for model access)
  • About 30-60 minutes of training time

Quick Start with Unsloth’s Notebook

The fastest path is Unsloth’s ready-made Colab notebook that trains Gemma 4 E2B on Sudoku puzzles using GRPO:

👉 Open the Colab Notebook

The notebook handles installation, model loading, dataset preparation, reward function definition, and training — all within the free T4 GPU’s 16GB VRAM limit.

Why Fine-Tune Instead of RAG or Prompting?

Before reaching for RL fine-tuning, it’s worth asking: do you actually need it?

| Approach | Best For | Tradeoff |
| --- | --- | --- |
| Prompt engineering | Format control, persona, simple instructions | Zero cost, but limited by context window and the model's existing knowledge |
| RAG (knowledge base) | Factual recall, domain documents, up-to-date info | Adds knowledge without changing the model, but doesn't improve reasoning |
| SFT (supervised fine-tuning) | Teaching specific output formats, domain language, style | Needs labeled examples; teaches what to output |
| RL (GRPO/RLHF) | Improving how the model thinks: reasoning, strategy, judgment | No labeled answers needed, but requires a reward signal |

RAG gives the model better information. RL gives the model better reasoning.

If your problem is “the model doesn’t know about my company’s products” — use RAG. If your problem is “the model knows the rules of Sudoku but can’t reliably solve puzzles” — that’s a reasoning gap, and RL is the right tool.

The Sudoku example is illustrative: the model already “knows” the rules of Sudoku from pretraining. What it lacks is the ability to consistently generate working algorithmic strategies. GRPO trains that capability by having the model make thousands of attempts and learn from what works.

When RL fine-tuning makes sense:

  • Tasks with verifiable correctness (math, code, logic, games)
  • Improving chain-of-thought reasoning quality
  • Teaching the model to avoid specific failure modes
  • When you can write a reward function but can’t easily write 10,000 gold-standard examples

When it doesn’t:

  • You just need the model to access specific facts → RAG
  • You want a particular output format → SFT or prompting
  • Your task has no clear “right answer” to reward against

The beauty of GRPO is that it sits on top of everything else. You can RAG for knowledge, SFT for format, and then RL for reasoning — they’re complementary layers, not competitors.

The Sudoku Example: End-to-End Walkthrough

The Unsloth notebook teaches Gemma 4 to write Sudoku-solving code — not just solve puzzles directly, but generate Python strategy functions that solve them. This is a compelling RL task because it requires genuine algorithmic reasoning.

Step 1: The Sudoku Environment

First, a SudokuGame class provides the training environment. It generates puzzles at configurable difficulty (number of empty cells), validates moves, and tracks game state:

game = SudokuGame(difficulty=30, seed=42)
print(game.pretty())  # Shows the 9x9 grid with 0s for empty cells
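
The class itself isn't shown here, but "validates moves" boils down to the standard Sudoku rules. A simplified stand-in (not the notebook's implementation) looks like this:

def is_valid_move(board, initial, row, col, number):
    """Simplified illustration of Sudoku move validation (not the notebook's code)."""
    if initial[row][col] != 0:          # can't overwrite a given clue
        return False
    if number in board[row]:            # row conflict
        return False
    if number in (board[r][col] for r in range(9)):  # column conflict
        return False
    box_r, box_c = 3 * (row // 3), 3 * (col // 3)    # top-left cell of the 3x3 box
    for r in range(box_r, box_r + 3):
        for c in range(box_c, box_c + 3):
            if board[r][c] == number:    # box conflict
                return False
    return True

The strategy_succeeds reward described below relies on this kind of check when it counts valid moves.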

Step 2: The Prompt (Your “Dataset”)

Here’s the clever part — the dataset is just one prompt repeated 1,000 times:

from datasets import Dataset

prompt = """
Create a Sudoku solving strategy using only native Python built-in functions 
without any import statements.
You are given two lists of lists (9x9 grids):
- board: current state (0 means empty)
- initial: starting puzzle (0 means was empty, numbers are fixed)

Return a tuple (row, col, number) for the next move.
"""

dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": prompt.strip()}], "answer": 0}
] * 1000)

There are no pre-written solutions. The model generates candidate strategy functions, and the reward functions judge them. That’s the entire RL loop — no labeled data needed.

Step 3: Three Reward Functions

GRPO needs reward functions to score each generated response. The notebook uses three:

1. function_works — Does the generated code parse and execute as valid Python? Returns -2.0 for broken code, +1.0 for working functions.

2. no_cheating — Does the function import external libraries? The prompt says “no imports,” so this penalizes reward hacking. Uses check_python_modules to inspect the AST (a simplified sketch of this and the previous check appears below).

3. strategy_succeeds — The big one. It actually runs the generated strategy against random Sudoku puzzles and rewards based on how many valid moves it makes and whether it solves the puzzle. Partial credit for valid moves, full credit for completion.

import numpy as np

def strategy_succeeds(completions, **kwargs):
    """Reward valid moves even if the strategy eventually fails."""
    scores = []
    seed = np.random.randint(10000)
    difficulty = 40
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function(response)
        # Execute the strategy against a real puzzle
        # Score based on valid moves and completion
        ...

Code execution is sandboxed with a 10-second time limit to prevent infinite loops.
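
The first two reward functions can be sketched in a few lines with Python's ast module. The snippet below is a simplified stand-in: the notebook's extract_function and check_python_modules helpers and its exact scoring may differ, and function_works additionally executes the code, which is omitted here.

import ast

def code_parses(code: str) -> float:
    # Shaped like function_works: penalize code that isn't valid Python
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return -2.0

def uses_no_imports(code: str) -> float:
    # Shaped like no_cheating: penalize any import statement (score values illustrative)
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0  # broken code is already penalized by the first check
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return -1.0
    return 1.0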

Step 4: Train with GRPO

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    num_generations=4,
    max_completion_length=2048,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    reward_funcs=[function_works, no_cheating, strategy_succeeds],
)
trainer.train()
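
Once training finishes, a quick smoke test is to sample a response to the training prompt with the standard Hugging Face generation API. This assumes the model, tokenizer, and prompt variables from the earlier cells; it's a sanity check, not a required step in the notebook:

import torch

messages = [{"role": "user", "content": prompt.strip()}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens (skip the prompt)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))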

Step 5: Export to GGUF

After training, export for local inference with Ollama or llama.cpp:

model.save_pretrained_gguf("gemma4-grpo", tokenizer, quantization_method="q4_k_m")

Adapting This to Your Own Tasks

The Sudoku example is a template. To train on your own RL task, you swap three things:

  1. The prompt — describe what you want the model to produce
  2. The reward functions — define what “good” looks like (correctness, formatting, safety, etc.)
  3. The evaluation environment — if your task needs execution, build a sandbox

No labeled dataset required. That’s the power of RL — the reward signal is the supervision.
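
As a concrete illustration of step 2, suppose your task required answers in a strict <reasoning>...</reasoning><answer>...</answer> format. A hypothetical reward function for that (the tag names and score values are made up for this example) is just a regex check over each completion:

import re

ANSWER_FORMAT = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
)

def follows_format(completions, **kwargs):
    """Hypothetical reward: +1.0 if the response matches the required format, else -1.0."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        scores.append(1.0 if ANSWER_FORMAT.search(response) else -1.0)
    return scores

Pass it to GRPOTrainer via reward_funcs, alongside whatever correctness reward your task can verify.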

Bugs Unsloth Fixed

Training Gemma 4 isn’t plug-and-play without Unsloth. They fixed several upstream issues:

  • KV cache sharing — Gemma 4’s sliding window attention shares KV caches across layers, which caused silent failures in standard implementations
  • Gradient accumulation — loss calculations were incorrect with gradient accumulation steps > 1
  • Audio fp16 overflow — audio input processing caused numerical overflow in half precision

These fixes are baked into Unsloth, so you don’t need to patch anything yourself.

Model Size Guide

| Model | VRAM (4-bit) | Free Colab? |
| --- | --- | --- |
| E2B | ~8 GB | ✅ T4 works |
| E4B | ~10-17 GB | ⚠️ Tight on T4 |
| 26B-A4B (MoE) | ~20 GB | ❌ Needs A100 |
| 31B Dense | ~24 GB | ❌ Needs A100 |

What’s Next

Once you have a GRPO-trained Gemma 4 model, you can:

  • Run locally with Ollama using the GGUF export
  • Train on your own tasks by swapping the dataset and reward functions
  • Scale up to E4B or the MoE variant on Colab Pro (A100 GPU)
  • Add vision/audio inputs — Gemma 4 is natively multimodal

The combination of a free GPU, efficient tooling, and a capable base model makes this one of the most accessible ways to experiment with RL-based LLM training today.

📓 GitHub repo with notebook: menonpg/gemma4-rl-colab