Privacy Parser: The Reverse of OpenAI's Privacy Filter
When OpenAI released their Privacy Filter — a 1.5B parameter model designed to mask Personally Identifiable Information (PII) in text — the security community noticed something immediately. The same model that learns where PII lives in order to mask it could, with minimal re-plumbing, be used to extract that PII instead.
That’s exactly what Privacy Parser does. Same checkpoint, same weights, opposite purpose. Instead of replacing your name with [REDACTED], it hands you back a structured object with the exact text, label, and character offsets. Apache-2.0 licensed, installable via uv, and ready to run on CPU.
How Does Privacy Parser Extract PII From Text?
The architecture uses a standard sequence labeling approach:
- OpenAI Privacy Filter 1.5B (~3GB checkpoint) processes input text and produces BIOES (Begin/Inside/Outside/End/Single) logits — the standard tagging scheme used in named entity recognition.
- Viterbi decoding finds the most likely label sequence across the full input.
- Logits become character-level spans which are then merged into contiguous entities.
- A regex backstop catches patterns the model misses — emails, phone numbers, URLs — anything with a predictable structure.
- The final output is a list of structured spans, each with a label, the matched text, and exact start/end character offsets.
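The decoding step in the pipeline above can be sketched as follows. This is an illustrative reconstruction of BIOES-tag-to-span conversion, not Privacy Parser's actual internals; the tag layout and merge logic are assumptions.

```python
# Illustrative sketch: converting per-character BIOES tags into entity spans.
# Tag names and the span dict shape are assumptions, not the tool's real code.

def bioes_to_spans(text, tags):
    """tags[i] is one of 'O', 'B-LABEL', 'I-LABEL', 'E-LABEL', 'S-LABEL'
    for character i of text; returns contiguous labeled spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "S"):          # Begin or Single opens a span
            start = i
        if prefix in ("E", "S") and start is not None:  # End or Single closes it
            spans.append({"label": label, "text": text[start:i + 1],
                          "start": start, "end": i + 1})
            start = None
    return spans

text = "Email me at a@b.com"
tags = ["O"] * 12 + ["B-private_email"] + ["I-private_email"] * 5 + ["E-private_email"]
print(bioes_to_spans(text, tags))
# → [{'label': 'private_email', 'text': 'a@b.com', 'start': 12, 'end': 19}]
```

In the real pipeline the tags come from Viterbi decoding over the model's logits rather than being supplied directly, but the span-merging step works the same way.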
The tool recognizes eight label categories: private_person, private_email, private_phone, private_address, private_url, private_date, account_number, and secret. That last one is a catch-all for API keys, passwords, and other credentials.
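A minimal sketch of what one structured span might look like, with the eight categories as a closed type. The field names here are assumptions based on the description above, not the package's documented schema.

```python
from dataclasses import dataclass
from typing import Literal

# The eight label categories named in the article; field names are assumptions.
Label = Literal["private_person", "private_email", "private_phone",
                "private_address", "private_url", "private_date",
                "account_number", "secret"]

@dataclass
class PIISpan:
    label: Label   # one of the eight categories
    text: str      # the exact matched text
    start: int     # character offset, inclusive
    end: int       # character offset, exclusive

span = PIISpan("private_email", "a@b.com", 12, 19)
print(span.label, span.start, span.end)
# → private_email 12 19
```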
How Accurate Is Privacy Parser Compared to Commercial PII Detection?
Privacy Parser ships with three extraction modes, each with different speed and accuracy trade-offs:
Regex-only runs in microseconds and achieves an F1 score (the harmonic mean of precision and recall) of 1.0 on known patterns. If your data contains standard-format emails, US phone numbers, or URLs, regex catches them perfectly. The limitation: it can't find names or mailing addresses, which have no predictable pattern.
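A regex backstop for structured PII can be sketched like this. The patterns below are deliberately simplified illustrations, not Privacy Parser's actual expressions, which would need to handle far more edge cases.

```python
import re

# Simplified patterns for structured PII. These are illustrative only;
# production regexes for emails/phones/URLs are considerably more involved.
PATTERNS = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "private_phone": re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
    "private_url":   re.compile(r"https?://\S+"),
}

def regex_spans(text):
    """Run every pattern over the text and return labeled spans with offsets."""
    spans = []
    for label, pat in PATTERNS.items():
        for m in pat.finditer(text):
            spans.append({"label": label, "text": m.group(),
                          "start": m.start(), "end": m.end()})
    return sorted(spans, key=lambda s: s["start"])

print(regex_spans("Call 555-867-5309 or mail jenny@example.com"))
```

On anything these patterns cover, precision and recall are both 1.0 by construction, which is exactly why the hybrid mode below leans on regex where it applies.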
Model-only uses the full 1.5B checkpoint (~500ms on CPU) and achieves F1 = 0.733. Since F1 balances precision and recall, a score of 0.733 means the model both misses some entities and occasionally gets boundaries or labels wrong; respectable for a general-purpose task, but not something to rely on alone. The model excels at contextual items like names and addresses where regex falls flat.
Hybrid mode combines both (~600ms on CPU) and hits F1 = 0.929. For context, commercial PII detection APIs typically score in the 0.90–0.95 range, so hybrid mode is competitive — and it runs entirely on your local machine without sending sensitive data to a third party. For most production use cases, this is the mode you want.
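One plausible way hybrid mode could combine the two sources is to take the union of the span sets and, on overlap, trust the regex hit, since structured patterns score F1 = 1.0 on what they match. This is a sketch under those assumptions, not the tool's actual merge policy.

```python
# Hypothetical hybrid merge: union model and regex spans, preferring
# regex spans on overlap. Assumed policy, not Privacy Parser's real one.

def merge_spans(model_spans, regex_spans):
    merged = list(regex_spans)  # regex hits are kept unconditionally
    for span in model_spans:
        # Keep a model span only if no regex span covers the same characters.
        overlaps = any(span["start"] < r["end"] and r["start"] < span["end"]
                       for r in regex_spans)
        if not overlaps:
            merged.append(span)
    return sorted(merged, key=lambda s: s["start"])

model = [{"label": "private_person", "text": "Jenny", "start": 0, "end": 5},
         {"label": "private_email", "text": "jenny@example", "start": 10, "end": 23}]
regex = [{"label": "private_email", "text": "jenny@example.com", "start": 10, "end": 27}]
print(merge_spans(model, regex))
```

Here the model's truncated email span loses to the exact regex match, while its contextual hit on the name "Jenny" survives, which is the complementarity the F1 numbers above reflect.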
Is Privacy Parser a Security Tool or an Attack Tool?
This is where Privacy Parser gets interesting — and uncomfortable.
The defensive use case is clear. Before shipping a dataset to a vendor, fine-tuning a model, or publishing research data, you run Privacy Parser to audit what PII exists in your corpus. Compliance teams can verify that redaction pipelines actually caught everything. It’s the equivalent of a penetration test for your data hygiene.
The offensive use case is equally obvious. If you have access to leaked data — breached databases, scraped documents, exposed logs — Privacy Parser will systematically extract every name, email, phone number, address, and credential it can find.
This duality isn’t unique. Nmap scans networks for defenders and attackers alike. Metasploit exists for penetration testers, but nothing stops a malicious actor from using it. The authors acknowledge this directly by releasing under Apache-2.0 with no usage restrictions — the capability already exists in the model weights OpenAI published. The tool itself is legal; how you use the extracted data determines legality under data protection laws like GDPR and CCPA.
Can Privacy Parser Run Locally Without a GPU?
Yes — and that’s one of its biggest advantages. All three extraction modes run on CPU. Hybrid mode takes about 600ms per extraction on a typical machine, which is fast enough for batch processing thousands of documents without any GPU infrastructure.
This matters for compliance workflows where you can’t send sensitive data to a cloud API. Privacy Parser keeps everything on your machine.
Who Should Use Privacy Parser?
Compliance teams finally get a tool that fills the gap between pure regex (fast but incomplete) and cloud APIs (accurate but you’re sending sensitive data to a third party). Privacy Parser runs locally and gives structured output you can pipe directly into remediation workflows.
ML engineers should run it on training data before fine-tuning. Flag anything it finds, and make a conscious decision about what stays and what gets scrubbed. This should be a standard pre-processing step for any dataset that might contain user data.
Security researchers analyzing breach data get structured extraction with character offsets, meaning you can quickly assess the scope of exposed information without manually reading through dumps.
How Do You Install and Use Privacy Parser?
Installation is a single command:
```shell
uv pip install privacy-parser
```
The first run downloads the ~3GB model checkpoint. After that, extraction is a single function call returning labeled spans with offsets across the eight PII categories described above.
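Since the package's exact API isn't shown here, the sketch below is hypothetical: the `extract` function, its `mode` parameter, and the span attributes are all assumptions about what "a single function call returning labeled spans" might look like, not documented usage.

```python
# Hypothetical usage sketch. The import, function name, `mode` parameter,
# and span fields are assumptions, not the package's documented API.
from privacy_parser import extract  # hypothetical import

spans = extract("Contact Jane Doe at jane@example.com", mode="hybrid")
for s in spans:
    # Each span would carry a label, the exact text, and character offsets.
    print(s.label, repr(s.text), s.start, s.end)
```

Check the project's README for the real entry point and parameter names before wiring this into a pipeline.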
Links
- GitHub: chiefautism/privacy-parser
- OpenAI Privacy Filter (base model): OpenAI Blog
- PyPI Package: privacy-parser on PyPI