Magika: Google's AI File Detection Tool That Protects Gmail and Drive Is Now Open Source

By Prahlad Menon

File extensions lie. A piece of malware renamed to invoice.pdf is still malware. A script disguised as a .jpg is still a script. For years, Google quietly solved this problem at massive scale — now they’ve open-sourced the solution.

Magika is Google’s AI-powered file type detection tool, and it’s genuinely impressive: a custom deep learning model that weighs just a few megabytes, identifies 200+ content types with ~99% accuracy, and runs in about 5ms per file on a single CPU. It’s been protecting Gmail, Drive, and Safe Browsing internally for years, processing hundreds of billions of files every week. As of 2024, it’s open source and available to everyone.

Why File Extension Trust Is a Security Problem

The classic approach to identifying file types is to look at the extension: .pdf means PDF, .exe means executable. This works until it doesn’t — and in security contexts, it fails constantly.

Attackers rename executables, embed scripts in unexpected formats, and craft files that look benign to simple extension checks but execute malicious code when opened. The more principled approach is to look at the actual content of the file — the bytes, the structure, the magic numbers — and infer what it actually is.

This is what file(1) on Unix has done for decades using handcrafted rules. Magika does it with a neural network trained on 100 million real-world samples, which is why it outperforms the rule-based approach, especially on text-based formats where the differences are subtle.
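To make the contrast concrete, here is a minimal sketch of the handcrafted, rule-based style of detection. The signature table and the sniff_magic helper are illustrative assumptions, not how file(1) or Magika is implemented; real rule sets are far larger.

```python
# Illustrative only: a tiny hand-written signature table in the spirit of
# rule-based detectors. Real tools such as file(1) use far richer rules.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x7fELF", "elf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
]


def sniff_magic(data: bytes) -> str:
    """Return a label based on the leading bytes, or 'unknown' if no rule matches."""
    for prefix, label in MAGIC_SIGNATURES:
        if data.startswith(prefix):
            return label
    return "unknown"


# An ELF header is recognised whatever the file happens to be called.
print(sniff_magic(b"\x7fELF\x02\x01\x01" + b"\x00" * 9))
# Text formats carry no magic number, so this simple rule set gives up.
print(sniff_magic(b"function log(msg) {console.log(msg);}"))
```

The second call is the interesting one: source code and other text formats have no fixed header, which is where a learned model has room to do better than prefix rules.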

The Model Is Tiny and Fast — By Design

One of the most interesting engineering choices in Magika is the model size: a few megabytes. Not a few gigabytes — megabytes. This is intentional.

Google needed something that could run at scale across Gmail and Drive — hundreds of billions of files a week — without adding significant latency or infrastructure cost. The model was custom-designed to be highly optimized for this specific task rather than being a general-purpose LLM with file detection bolted on.

The result: inference time of ~5ms per file after the model is loaded, running on a single CPU. The model only reads a limited subset of each file’s content, which means inference time is near-constant regardless of file size. Whether you scan a 10KB config file or a 10GB video, the inference latency is roughly the same.
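The near-constant cost follows from reading a bounded number of bytes. The sketch below shows the general idea with a hypothetical sample_file helper that reads one block from the start and one from the end of a file. The 1024-byte block size and the choice of which regions to read are assumptions for illustration; Magika's actual sampling is defined by its model configuration.

```python
import io


def sample_file(stream, block_size: int = 1024) -> tuple[bytes, bytes]:
    """Read at most `block_size` bytes from the start and from the end of a
    seekable binary stream, without touching anything in between."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()

    # Head: the first block (or the whole file if it is smaller).
    stream.seek(0)
    head = stream.read(min(block_size, size))

    # Tail: the last block (overlaps the head for small files).
    tail_start = max(size - block_size, 0)
    stream.seek(tail_start)
    tail = stream.read(size - tail_start)
    return head, tail


# The amount of data read depends on block_size, not on the file size.
big = io.BytesIO(b"A" * 5000 + b"Z" * 5000)
head, tail = sample_file(big)
print(len(head), len(tail))
```

Because only two seeks and two bounded reads happen, the work done before inference is the same for a small file and a very large one.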

It also ships with a per-content-type threshold system. If the model isn’t confident enough about a prediction, it falls back to a generic label ("Generic text document" or "Unknown binary data") rather than guessing wrong. Confidence modes are tunable: high-confidence, medium-confidence, and best-guess depending on your tolerance for false positives vs. coverage.
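The fallback logic described above can be sketched in a few lines. Everything here is hypothetical: the Mode enum, the threshold values, and the resolve_label function are stand-ins to show the shape of the decision, not Magika's internal code or its real thresholds.

```python
from enum import Enum


class Mode(Enum):
    HIGH_CONFIDENCE = "high-confidence"
    MEDIUM_CONFIDENCE = "medium-confidence"
    BEST_GUESS = "best-guess"


# Hypothetical numbers: each content type can demand a different score
# before its label is trusted in high-confidence mode.
HIGH_THRESHOLDS = {"javascript": 0.90, "elf": 0.75}
DEFAULT_HIGH_THRESHOLD = 0.85
MEDIUM_THRESHOLD = 0.50


def resolve_label(predicted: str, score: float, is_text: bool,
                  mode: Mode = Mode.HIGH_CONFIDENCE) -> str:
    """Return the predicted label if it clears the threshold for the chosen
    mode, otherwise fall back to a generic label."""
    if mode is Mode.BEST_GUESS:
        return predicted  # always trust the top prediction
    if mode is Mode.MEDIUM_CONFIDENCE:
        threshold = MEDIUM_THRESHOLD
    else:
        threshold = HIGH_THRESHOLDS.get(predicted, DEFAULT_HIGH_THRESHOLD)
    if score >= threshold:
        return predicted
    return "Generic text document" if is_text else "Unknown binary data"


print(resolve_label("javascript", 0.60, is_text=True))
print(resolve_label("javascript", 0.60, is_text=True, mode=Mode.MEDIUM_CONFIDENCE))
```

The same prediction yields a generic label in the strict mode and a specific one in the looser mode, which is the trade-off between false positives and coverage.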

Real-World Deployment

Magika isn’t a research demo. It’s already integrated into:

  • Gmail — routing attachments to the right security and content policy scanners
  • Google Drive — file type identification at upload and share time
  • Safe Browsing — flagging disguised threats
  • VirusTotal — file type metadata on submissions
  • abuse.ch MalwareBazaar — content type tagging on malware samples

The open-source release is the same model and tooling, not a stripped-down version.

Using It

There are several ways to install it:

# CLI (Rust binary — recommended)
pipx install magika          # via Python package
brew install magika          # macOS / Linux
cargo install --locked magika-cli  # via Rust

# Python API
pip install magika

# JavaScript/TypeScript
npm install magika

CLI usage is what you’d expect:

# Identify a single file
magika suspicious_file.pdf

# Recursive directory scan
magika -r ./uploads/

# JSON output for pipelines
magika --json ./file.bin

# Pipe from stdin
cat unknown_file | magika -

# Show confidence scores
magika -s ./files/*

Example output on a mixed directory:

asm/code.asm:     Assembly (code)
batch/simple.bat: DOS batch file (code)
c/code.c:         C source (code)
docx/doc.docx:    Microsoft Word 2007+ document (document)
eml/sample.eml:   RFC 822 mail (text)

The Python API is clean for embedding in your own pipelines:

from magika import Magika

m = Magika()

# From bytes (useful for streams, downloads)
res = m.identify_bytes(b'function log(msg) {console.log(msg);}')
print(res.output.label)   # → javascript
print(res.score)          # → confidence score between 0 and 1 (in recent releases the score lives on the result, not on res.output)

# From path
res = m.identify_path('./upload/suspicious.pdf')
print(res.output.label)   # → might be 'elf' or 'python' if it's not actually a PDF
print(res.output.mime_type)

# From stream
with open('./file', 'rb') as f:
    res = m.identify_stream(f)

Each result includes the label, MIME type, description, group, possible extensions, and confidence score — structured output that’s easy to use in downstream security logic.

Where to Use It

File upload validation — don’t trust the Content-Type header or the extension a user provides. Run Magika on the bytes and route accordingly. A user uploading photo.jpg that Magika identifies as elf (Linux executable) is a red flag.
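A minimal sketch of that upload check follows. The check_upload and make_magika_detector names are hypothetical, and the detector is passed in as a plain callable so the comparison logic can be exercised without the model. The Magika wiring assumes a recent release where res.output exposes label and extensions; treat that part as an assumption to verify against the version you install.

```python
from pathlib import PurePath
from typing import Callable, Iterable, Tuple

# A detector takes raw bytes and returns (label, plausible extensions).
Detector = Callable[[bytes], Tuple[str, Iterable[str]]]


def declared_extension(filename: str) -> str:
    """Lower-cased extension of the user-supplied name, without the dot."""
    return PurePath(filename).suffix.lower().lstrip(".")


def check_upload(filename: str, data: bytes, detect: Detector) -> dict:
    """Compare what the filename claims with what the content looks like."""
    label, extensions = detect(data)
    allowed = {e.lower().lstrip(".") for e in extensions}
    ext = declared_extension(filename)
    return {
        "filename": filename,
        "declared_extension": ext,
        "detected_label": label,
        "mismatch": ext not in allowed,
    }


def make_magika_detector() -> Detector:
    """Build a detector backed by Magika (requires `pip install magika`)."""
    from magika import Magika  # imported lazily so the rest runs without it

    m = Magika()  # load the model once and reuse it

    def detect(data: bytes) -> Tuple[str, Iterable[str]]:
        res = m.identify_bytes(data)
        # Assumption: res.output carries `label` and `extensions`.
        return str(res.output.label), list(res.output.extensions)

    return detect


# Usage with a stand-in detector that pretends the content is an ELF binary:
report = check_upload("photo.jpg", b"\x7fELF", lambda d: ("elf", []))
print(report["mismatch"])
```

In a real handler you would call make_magika_detector() once at startup and reject or quarantine any upload whose report has mismatch set.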

Malware analysis pipelines — integrate into triage workflows. Knowing what a file actually is before sending it to specialized scanners lets you route more efficiently and catch misclassifications early.

Security scanning CI — add to your CI pipeline to detect accidentally committed binaries, suspicious encoded payloads, or files that don’t match their declared type.
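One way such a CI check could look is sketched below. The find_blocked helper and the BLOCKED_LABELS set are hypothetical, and the label names in that set are assumptions about Magika's label vocabulary that should be checked against its content type list. The detector is injected so the directory walk can be tested with a stub.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# Assumed label names for native executables; verify against Magika's list.
BLOCKED_LABELS = frozenset({"elf", "pebin", "macho"})


def find_blocked(root, detect_path: Callable[[Path], str],
                 blocked=BLOCKED_LABELS) -> List[Tuple[Path, str]]:
    """Walk `root` and return (path, label) for every file whose detected
    content type is in the blocked set, regardless of its extension."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            label = detect_path(path)
            if label in blocked:
                hits.append((path, label))
    return hits


def magika_detect_path_factory() -> Callable[[Path], str]:
    """Return a path detector backed by Magika (requires `pip install magika`)."""
    from magika import Magika  # lazy import keeps the walk logic standalone

    m = Magika()
    return lambda p: str(m.identify_path(p).output.label)
```

A CI job could call find_blocked(".", magika_detect_path_factory()), print each hit, and exit non-zero if the list is not empty.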

Log/artifact analysis — when ingesting unknown files from external sources (threat intel feeds, customer uploads, log archives), use Magika to characterize the corpus before processing.

SIEM enrichment — tag file-related events with accurate content type metadata rather than relying on extension-derived metadata that attackers can trivially spoof.

What It Doesn’t Do

Magika identifies file types — it doesn’t analyze content for malicious behavior. It tells you “this is a PE executable” not “this executable contains ransomware.” For that, you need a scanner. What Magika does is make sure the right scanner sees the right file type, which matters because specialized scanners are tuned for specific formats.
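The division of labor can be sketched as a dispatch table: detection picks the scanner, the scanner does the analysis. The scanner functions below are placeholders that only return a description, and the group names used as keys are assumptions based on the groups Magika reports (such as "document" and "code" in the example output earlier).

```python
from typing import Callable, Dict


# Placeholder scanners: real ones would inspect `data` and return findings.
def scan_executable(data: bytes) -> str:
    return "sent to executable sandbox"


def scan_document(data: bytes) -> str:
    return "sent to document/macro scanner"


def scan_script(data: bytes) -> str:
    return "sent to script analyzer"


def quarantine(data: bytes) -> str:
    return "held for manual review"


# Hypothetical mapping from a detected content group to a scanner.
ROUTES_BY_GROUP: Dict[str, Callable[[bytes], str]] = {
    "executable": scan_executable,
    "document": scan_document,
    "code": scan_script,
}


def route(group: str, data: bytes) -> str:
    """Pick a scanner from the detected group; unknown groups are quarantined."""
    handler = ROUTES_BY_GROUP.get(group, quarantine)
    return handler(data)


print(route("document", b"..."))
print(route("something-unexpected", b"..."))
```

Defaulting to quarantine for unmapped groups keeps the failure mode conservative: a file nobody planned for is held rather than silently skipped.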

It also doesn’t handle every edge case. The per-content-type threshold system means ambiguous files may come back as generic labels. That’s a deliberate design choice — better to say “I don’t know” than to guess wrong in a security context.

Why the Open-Source Release Matters

Google solved a real, hard problem at a scale that most organizations never face. But the underlying threat — file type spoofing as an attack vector — is universal. Whether you’re running a small SaaS with file uploads or a large enterprise SOC, the same attack works.

Before Magika, the options were libmagic / file(1) (rule-based, good but not great on text formats), commercial tools, or rolling your own heuristics. Magika is now a serious free alternative with production provenance that most commercial tools can’t claim.

The accompanying research paper was published at IEEE/ACM ICSE 2025, so the methodology is peer-reviewed, which is unusual for a security tool of this type.


Magika → github.com/google/magika
Web demo (runs locally in-browser) → securityresearch.google/magika/demo
Apache 2.0 · Google Security Research