OpenSeeker: The First Open-Source Search Agent That Beats Frontier Models

By Prahlad Menon

The most capable search agents today (Perplexity Deep Research, OpenAI Deep Research, Gemini Deep Research) are closed black boxes. You can use them, but you can’t inspect them, reproduce them, or build on them. Their training data is proprietary, their fine-tuning approaches are unpublished, and their benchmark scores are unverifiable.

OpenSeeker changes that.

Published in March 2026 by a purely academic team at SJTU, OpenSeeker is the first open-source search agent to achieve state-of-the-art performance on frontier search benchmarks while releasing everything: training data, model weights, and evaluation code.

What It Achieves

OpenSeeker-v1 was fine-tuned from Qwen3-30B-A3B-Thinking-2507 using just 11.7K training examples. The results:

Benchmark            Score
BrowseComp-ZH        48.4
BrowseComp           29.5
xbench-DeepSearch    74.0
WideSearch           59.4

BrowseComp is OpenAI’s hardest search benchmark: complex, multi-step questions that require synthesizing information across many web pages. Scoring 29.5 on BrowseComp puts OpenSeeker in range of proprietary systems that have had years and massive datasets to train on.

The 11.7K figure is the key. This isn’t a “throw compute at it” result. It’s a data quality and training recipe story, and they’ve open-sourced both.

How It Works

OpenSeeker is a tool-using agent built on a thinking LLM. The architecture is clean:

OpenSeeker/
├── src/
│   ├── llm_tool_openseeker.py   # Core agent loop
│   └── tools/
│       ├── search.py            # Web search
│       └── visit.py             # Page reading
├── eval/
│   ├── generate_answer.py       # Run agent on benchmark
│   └── eval.py                  # Score results
└── run_openseeker.sh            # Deploy model server

The agent loop is iterative: search → read → reason → search again if needed → answer. The Qwen3 thinking model reasons through each step before deciding what tool to call, which is what drives the high scores on complex multi-hop questions.
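As a rough illustration of what that loop looks like in code, here is a minimal sketch of an iterative tool-calling agent driven through an OpenAI-compatible endpoint. It is not the repo’s actual implementation (that lives in src/llm_tool_openseeker.py); the tool schemas, helper functions, port, and model name below are assumptions.

# Minimal sketch of an iterative search-agent loop.
# Illustrative only: tool schemas, helpers, port, and model name are assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOOLS = [
    {"type": "function", "function": {
        "name": "search",
        "description": "Web search; returns a list of results.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "visit",
        "description": "Fetch and read a single web page.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

def web_search(query: str) -> str:
    """Placeholder: plug in your search backend here."""
    raise NotImplementedError

def visit_page(url: str) -> str:
    """Placeholder: fetch the page and extract readable text."""
    raise NotImplementedError

def run_agent(question: str, max_turns: int = 20) -> str | None:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="OpenSeeker-v1-30B-SFT", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # no tool requested: this is the answer
            return msg.content
        for call in msg.tool_calls:     # execute each requested tool
            args = json.loads(call.function.arguments)
            if call.function.name == "search":
                result = web_search(args["query"])
            else:
                result = visit_page(args["url"])
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})
    return None  # ran out of turns without a final answer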

Running It

# Clone and install
git clone https://github.com/rui-ye/OpenSeeker.git
cd OpenSeeker
conda create --name openseeker python=3.10
conda activate openseeker
pip install -r requirements.txt

# Download model (requires git-xet for HuggingFace large files)
brew install git-xet
git xet install
git clone https://huggingface.co/OpenSeeker/OpenSeeker-v1-30B-SFT

# Configure and start model server
# Edit run_openseeker.sh: set MODEL_PATH to your model directory
bash run_openseeker.sh

# Set API endpoints
source setup_env.sh

# Evaluate
python eval/generate_answer.py \
  --dataset_path /path/to/dataset.jsonl \
  --out_dir ./output

python eval/eval.py \
  --data_path ./output/result_tool200.jsonl \
  --max_workers 20

You need a machine capable of running a 30B model: an A100 or equivalent. The model server is vLLM-based, so standard serving infrastructure applies.
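Before kicking off the full evaluation, a quick request against the OpenAI-compatible API that vLLM exposes confirms the server is up. The port and model name below are assumptions; use whatever run_openseeker.sh and setup_env.sh actually configure.

# Smoke test for the model server (port and model name are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])   # should list the served model

resp = client.chat.completions.create(
    model="OpenSeeker-v1-30B-SFT",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)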

Why Open-Sourcing Training Data Matters

The model weights alone aren’t the breakthrough. The training data is.

Fine-tuning a search agent is hard because you need examples of good search behavior: knowing when to search again, how to synthesize conflicting results, how to visit pages selectively. This data is expensive to generate, and almost no one publishes it.
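To make that concrete, here is a purely illustrative sketch of what one such trajectory might contain. The field names and structure are hypothetical, not the published schema; check the released dataset for the real format.

# Hypothetical shape of a single SFT search trajectory (NOT the actual schema).
example = {
    "question": "Which paper first reported ... ?",
    "trajectory": [
        {"role": "assistant", "think": "I should start with a broad search.",
         "tool": "search", "args": {"query": "..."}},
        {"role": "tool", "content": "top search results ..."},
        {"role": "assistant", "think": "Result 3 looks authoritative; read it.",
         "tool": "visit", "args": {"url": "https://example.org/..."}},
        {"role": "tool", "content": "extracted page text ..."},
        {"role": "assistant", "content": "Final answer: ..."},
    ],
}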

OpenSeeker publishes 11.7K of these examples, with a higher-quality batch available on request. That dataset is now available for anyone to fine-tune their own models, run ablations, or build on.

This is what academic AI research is supposed to look like: reproducible, auditable, and genuinely useful to the community.

The Research

The paper is on arXiv at 2603.15594:

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
SJTU, 2026

What to Build With It

A few directions worth exploring:

Domain-specific search agents. The architecture is domain-agnostic. Fine-tune on medical literature searches, legal case research, financial filings; the 11.7K training format gives you a template for curating domain-specific search trajectories.

Smaller distilled versions. 30B is large. The training data could be used to distill a 7B or 14B search agent that trades some benchmark points for deployability.

Search + memory. Pair OpenSeeker with a persistent memory layer (soul.py) so the agent remembers what it searched for in prior sessions and avoids redundant queries. The result is a research agent that compounds knowledge over time rather than starting fresh each session.

RAG replacement. For complex multi-hop questions, OpenSeeker’s iterative search approach may outperform static RAG pipelines. Instead of embedding a fixed corpus, let the agent search live.

Resources