Hyper-Extract: From Unstructured Text to Knowledge Graphs, Hypergraphs, and Beyond

By Prahlad Menon

There’s a quiet revolution happening in how we structure knowledge from text. For years, knowledge graphs have been the gold standard — triples of (subject, predicate, object) neatly capturing who did what to whom. But real-world knowledge isn’t binary. A drug interaction involves a drug, a patient condition, a dosage, a contraindication, and a temporal window. A legal precedent connects a court, multiple parties, a statute, a jurisdiction, and a date. Force these into triples and you’re left picking which context to drop.

Enter hypergraphs — and a new generation of tools that make building them from raw text as simple as running a CLI command.

The Problem with Triples

Traditional knowledge graphs model facts as binary edges: (Aspirin) --treats--> (Headache). Clean. Simple. And incomplete.

What about the dosage? The patient population? The contraindication with blood thinners? The temporal constraint that it should be taken after meals? In a standard knowledge graph, you’d need to decompose this single medical fact into a constellation of auxiliary nodes and edges — reification — creating an explosion of synthetic structure that obscures the original semantics.

Hypergraphs solve this by allowing a single hyperedge to connect n entities simultaneously. That one drug interaction fact becomes a single hyperedge linking {Aspirin, Headache, 500mg, Post-meal, Contraindicated-with-Warfarin}. No information loss. No structural gymnastics.
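To make the contrast concrete, here is a minimal sketch, in plain Python rather than Hyper-Extract's actual API, of how a single n-ary fact can live in one hyperedge instead of a web of reified triples:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperedge:
    """One atomic n-ary fact: a relation over an arbitrary set of entities."""
    relation: str
    entities: frozenset

# The drug-interaction fact from the text as a single hyperedge.
fact = Hyperedge(
    relation="drug_interaction",
    entities=frozenset({
        "Aspirin", "Headache", "500mg",
        "Post-meal", "Contraindicated-with-Warfarin",
    }),
)

# Any participant is reachable directly, with no synthetic reification nodes.
print("Aspirin" in fact.entities)  # True
```

The key design point is that the edge itself, not an auxiliary "statement node," is the unit of meaning.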

This isn’t just a theoretical nicety. The HyperGraphRAG paper (NeurIPS 2025) demonstrated that hypergraph-structured knowledge representation outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality — tested across medicine, agriculture, computer science, and law.

Hyper-Extract: One Command, Eight Output Types

Hyper-Extract by Yifan Feng is an open-source framework (Apache 2.0, 350+ stars) that makes this entire pipeline accessible. It transforms unstructured documents into structured knowledge using LLMs, with support for eight distinct output types:

  • AutoModel — Pydantic-typed structured objects
  • AutoList — Ordered collections
  • AutoSet — Deduplicated collections
  • AutoGraph — Standard knowledge graphs
  • AutoHypergraph — Hypergraphs with n-ary relations
  • AutoTemporalGraph — Time-aware knowledge graphs
  • AutoSpatialGraph — Location-aware knowledge graphs
  • AutoSpatioTemporalGraph — Combined space-time graphs

The spatio-temporal types are particularly interesting. A biographical text about a historical figure doesn’t just contain who knew whom — it contains when and where those interactions happened. Hyper-Extract captures all of this natively.

Getting Started

Installation and extraction take three lines:

uv tool install hyperextract
he config init -k YOUR_OPENAI_API_KEY
he parse document.md -t general/biography_graph -o ./output/ -l en

That -t general/biography_graph flag points to one of 80+ declarative YAML templates spanning six domains: Finance, Legal, Medical, Traditional Chinese Medicine, Industry, and General. Each template defines the entity types, relation types, extraction guidelines, and display format — no code required.
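A template of this kind might look roughly like the following. This is a hypothetical sketch: the field names are illustrative only and do not reflect Hyper-Extract's actual template schema.

```yaml
# Hypothetical template sketch; field names are illustrative, not the real schema.
name: biography_graph
domain: general
entity_types:
  - Person
  - Place
  - Organization
relation_types:
  - born_in
  - worked_at
  - collaborated_with
guidelines: |
  Extract only relationships stated explicitly in the text.
display:
  format: graph
```

The appeal of this style is that adding a new domain means writing a file like this, not writing extraction code.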

Incremental Evolution

One of Hyper-Extract’s most practical features is incremental knowledge expansion. Extract from an initial document, then feed new documents to expand the graph:

he parse initial_report.md -t medical/diagnosis_graph -o ./patient_kg/
he feed ./patient_kg/ followup_notes.md
he feed ./patient_kg/ lab_results.md
he show ./patient_kg/

The knowledge graph grows over time, merging entities, resolving duplicates, and expanding relationships. This is how knowledge actually accumulates in practice — incrementally, from multiple sources, over time.
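Conceptually, the merge step behind a feed can be sketched like this. This is my illustration of the general idea, not Hyper-Extract's actual logic: normalize entity names for deduplication, then union the new hyperedges into the accumulated knowledge base.

```python
# Illustrative sketch of incremental merging; not the tool's real algorithm.
def normalize(name: str) -> str:
    """Crude entity normalization: case- and whitespace-insensitive."""
    return " ".join(name.lower().split())

def merge(kg: dict, new_edges: list) -> dict:
    """Fold a batch of hyperedges (frozensets of entity names) into kg."""
    for edge in new_edges:
        canonical = frozenset(normalize(e) for e in edge)
        kg.setdefault("edges", set()).add(canonical)
        kg.setdefault("entities", set()).update(canonical)
    return kg

kg = {}
merge(kg, [frozenset({"Metformin", "1000mg", "BID"})])
merge(kg, [frozenset({"metformin", "1000mg", "BID"})])  # duplicate resolves
print(len(kg["edges"]))  # 1
```

Real systems replace the toy `normalize` with LLM-assisted entity resolution, but the accumulate-and-dedupe loop is the same shape.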

Under the Hood: 10+ Extraction Engines

Hyper-Extract isn’t tied to a single extraction algorithm. It ships with more than ten engine integrations, among them:

  • GraphRAG — Microsoft’s community-summarization approach
  • LightRAG — Lightweight graph-based retrieval
  • Hyper-RAG — Hypergraph-driven retrieval (from the same author)
  • HypergraphRAG — N-ary fact extraction via hyperedges
  • KG-Gen — Direct knowledge graph generation
  • iText2KG — Iterative text-to-graph extraction
  • Cog-RAG — Cognitive retrieval-augmented generation

You choose the method that fits your domain and quality requirements. The templates abstract over the engine selection, so switching between them is a configuration change, not a code rewrite.

The Research Ecosystem Behind It

Hyper-Extract doesn’t exist in isolation. It’s the practical toolkit sitting atop a rapidly maturing research ecosystem. Here are the key papers driving this space:

Hyper-RAG: Cutting Hallucinations with Hypergraphs

Hyper-RAG (also by Yifan Feng) tackles one of the biggest problems in production LLM systems: hallucinations. By structuring retrieved knowledge as hypergraphs that capture both pairwise and beyond-pairwise correlations, Hyper-RAG improved accuracy by an average of 12.3% over direct LLM use on a neurology dataset, and outperformed GraphRAG and LightRAG by 6.3% and 6.0% respectively.

Crucially, Hyper-RAG maintained stable performance as query complexity increased — while existing methods degraded. Its lightweight variant, Hyper-RAG-Lite, achieved 2× retrieval speed with a 3.3% performance boost over LightRAG.

HyperGraphRAG: N-ary Facts for Real-World Knowledge (NeurIPS 2025)

HyperGraphRAG formalizes the intuition that binary edges are insufficient. Each hyperedge in their system encodes an n-ary relational fact — a single atomic statement that can involve two, three, or more entities. The paper demonstrated consistent improvements over both standard RAG and graph-based RAG across four domains: medicine, agriculture, computer science, and law.

The NeurIPS 2025 acceptance signals that the community is taking hypergraph representations seriously as the next step beyond GraphRAG.

Hyper-KGGen: Learning to Extract Better

Hyper-KGGen (February 2026) addresses a subtler problem: generic extractors struggle with domain-specific jargon and conventions. Their solution is a skill-driven framework where the extractor actively learns domain expertise through an adaptive skill acquisition module.

A “coarse-to-fine” mechanism decomposes documents systematically, ensuring coverage from simple binary links to complex hyperedges. A stability-based feedback loop identifies where extraction is unstable and distills corrections into a Global Skill Library. The result: significantly better extraction quality than static few-shot approaches, especially across diverse domains.

The GraphRAG Reality Check

It’s worth noting that the graph-based RAG landscape isn’t uniformly positive. The GraphRAG-Bench paper (ICLR 2026) conducted a comprehensive analysis and found that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. The overhead of graph construction doesn’t always pay off, particularly for straightforward factoid queries where simple chunk retrieval is sufficient.

This finding makes Hyper-Extract’s multi-engine approach especially valuable. Rather than betting everything on graph-based extraction, you can choose the right tool for the job — standard graphs where relationships matter, hypergraphs where n-ary facts dominate, spatio-temporal graphs where context is critical, or simple structured extraction where that’s all you need.

Why Healthcare Needs Hypergraphs

The medical domain is where hypergraphs shine brightest, and it’s no coincidence that several of these papers use clinical data for evaluation.

Consider a typical clinical knowledge fragment:

Patient presents with Type 2 diabetes (diagnosed 2019), currently on Metformin 1000mg BID and Lisinopril 10mg daily. Recent HbA1c of 8.2% suggests inadequate glycemic control. Consider adding Empagliflozin 10mg, noting cardiovascular benefit in patients with established atherosclerotic disease.

A standard knowledge graph would need to create dozens of triples to represent this. A hypergraph captures it naturally:

  • Hyperedge 1: {Patient, T2D, 2019, Diagnosis} — the diagnostic fact with temporal context
  • Hyperedge 2: {Metformin, 1000mg, BID, Current} — the medication with dosage and frequency
  • Hyperedge 3: {HbA1c, 8.2%, Inadequate-control, Current} — the lab result with interpretation
  • Hyperedge 4: {Empagliflozin, 10mg, CV-benefit, ASCVD, Consideration} — the recommendation with its qualifying condition

Each hyperedge is an atomic fact. No information is lost to reification. And when you query “What medications should be considered for a diabetic patient with cardiovascular disease?”, the retrieval system can walk the hyperedge structure to find exactly the right context — including the conditions under which the recommendation applies.
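That retrieval walk can be sketched as follows. This is a toy illustration of hyperedge expansion, not the actual retrieval algorithm from any of the papers: start from the query's entities, collect every hyperedge that overlaps them, then expand the frontier through shared entities.

```python
# Toy hyperedge retrieval; not the real Hyper-RAG/HyperGraphRAG method.
edges = [
    frozenset({"Patient", "T2D", "2019", "Diagnosis"}),
    frozenset({"Metformin", "1000mg", "BID", "Current"}),
    frozenset({"HbA1c", "8.2%", "Inadequate-control", "Current"}),
    frozenset({"Empagliflozin", "10mg", "CV-benefit", "ASCVD", "Consideration"}),
]

def retrieve(query_entities: set, edges: list, hops: int = 1) -> set:
    """Return hyperedges reachable from the query within `hops` expansions."""
    frontier = set(query_entities)
    hits = set()
    for _ in range(hops + 1):
        new_hits = {e for e in edges if e & frontier} - hits
        hits |= new_hits
        for e in new_hits:          # grow the frontier through shared entities
            frontier |= e
    return hits

# "Diabetic patient with cardiovascular disease" maps to these entities:
results = retrieve({"T2D", "ASCVD"}, edges)
print(len(results))  # 2: the diagnosis and the qualified recommendation
```

Note how the ASCVD condition travels with the Empagliflozin recommendation inside one hyperedge, so the qualifier can never be retrieved without its context.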

This is why Hyper-RAG showed its strongest results on medical datasets, and why HyperGraphRAG’s medicine evaluation was particularly compelling.

Where This Is Heading

The trajectory is clear: we’re moving from “extract triples from text” to “extract the full semantic complexity from text.” Hyper-Extract sits at the practical end of this spectrum — a tool you can install today and use to build knowledge structures that were research prototypes a year ago.

A few things to watch:

  • Template ecosystems — Hyper-Extract’s 80+ YAML templates hint at a future where domain-specific extraction is truly plug-and-play. Expect community-contributed templates to grow rapidly.
  • Incremental knowledge bases — The he feed pattern of continuously expanding knowledge from new documents is closer to how humans actually build understanding. This will become standard.
  • Engine selection — As more benchmark results emerge (like the GraphRAG-Bench findings), automatic engine selection based on query type and domain will become important. Hyper-Extract’s multi-engine architecture is well-positioned for this.
  • Spatio-temporal reasoning — The AutoSpatialGraph and AutoSpatioTemporalGraph types are ahead of most extraction tools. As LLMs get better at temporal and spatial reasoning, these will unlock use cases in logistics, historical analysis, and epidemiology.

If you’re building RAG systems, knowledge bases, or any application that needs structured understanding of unstructured text, Hyper-Extract is worth a serious look. The combination of declarative templates, multiple extraction engines, hypergraph support, and incremental evolution makes it one of the most complete knowledge extraction frameworks available today.

uv tool install hyperextract
he parse your_documents/ -t medical/diagnosis_hypergraph -o ./knowledge/ -l en
he show ./knowledge/

Three commands. Full knowledge hypergraph. No excuses.