OpenSeeker: The First Open-Source Search Agent That Beats Frontier Models
The most capable search agents today (Perplexity Deep Research, OpenAI deep research, Gemini Deep Research) are closed black boxes. You can use them, but you can't inspect them, reproduce them, or build on them. Their training data is proprietary, their fine-tuning approaches are unpublished, and their benchmark scores are unverifiable.
OpenSeeker changes that.
Published in March 2026 by a purely academic team at SJTU, OpenSeeker is the first open-source search agent to achieve state-of-the-art performance on frontier search benchmarks while simultaneously releasing everything: training data, model weights, evaluation code.
What It Achieves
OpenSeeker-v1 was fine-tuned from Qwen3-30B-A3B-Thinking-2507 using just 11.7K training examples. The results:
| Benchmark | Score |
|---|---|
| BrowseComp-ZH | 48.4 |
| BrowseComp | 29.5 |
| xbench-DeepSearch | 74.0 |
| WideSearch | 59.4 |
BrowseComp is OpenAI's hardest search benchmark: complex, multi-step questions that require synthesizing information across many web pages. Scoring 29.5 on BrowseComp puts OpenSeeker in range of proprietary systems that have had years and massive datasets to train on.
The 11.7K figure is the key. This isn't a "throw compute at it" result. It's a data quality and training recipe story, and they've open-sourced both.
How It Works
OpenSeeker is a tool-using agent built on a thinking LLM. The architecture is clean:
OpenSeeker/
├── src/
│   ├── llm_tool_openseeker.py    # Core agent loop
│   └── tools/
│       ├── search.py             # Web search
│       └── visit.py              # Page reading
├── eval/
│   ├── generate_answer.py        # Run agent on benchmark
│   └── eval.py                   # Score results
└── run_openseeker.sh             # Deploy model server
The agent loop is iterative: search → read → reason → search again if needed → answer. The Qwen3 thinking model reasons through each step before deciding what tool to call, which is what drives the high scores on complex multi-hop questions.
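That iterative loop can be sketched in a few lines. Everything below is illustrative: the action format, tool names, and stopping rule are assumptions for exposition, not OpenSeeker's actual implementation (see `src/llm_tool_openseeker.py` for that).

```python
def run_agent(question, llm, search, visit, max_steps=8):
    """Iterate tool calls until the model emits a final answer.

    `llm` is assumed to return a dict like {"tool": ..., "arg": ...};
    `search` and `visit` are the two tools from src/tools/.
    """
    context = [f"Question: {question}"]
    for _ in range(max_steps):
        # The thinking model reasons over everything gathered so far,
        # then picks the next action.
        action = llm(context)
        if action["tool"] == "answer":
            return action["arg"]
        elif action["tool"] == "search":
            context.append(f"Search results: {search(action['arg'])}")
        elif action["tool"] == "visit":
            context.append(f"Page content: {visit(action['arg'])}")
    return "No answer within step budget."
```

The step budget matters in practice: without it, a model that keeps deciding to search again never terminates on hard multi-hop questions.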
Running It
# Clone and install
git clone https://github.com/rui-ye/OpenSeeker.git
cd OpenSeeker
conda create --name openseeker python=3.10
conda activate openseeker
pip install -r requirements.txt
# Download model (requires git-xet for HuggingFace large files)
brew install git-xet
git xet install
git clone https://huggingface.co/OpenSeeker/OpenSeeker-v1-30B-SFT
# Configure and start model server
# Edit run_openseeker.sh: set MODEL_PATH to your model directory
bash run_openseeker.sh
# Set API endpoints
source setup_env.sh
# Evaluate
python eval/generate_answer.py \
--dataset_path /path/to/dataset.jsonl \
--out_dir ./output
python eval/eval.py \
--data_path ./output/result_tool200.jsonl \
--max_workers 20
You need a machine capable of running a 30B model: an A100 or equivalent. The model server is vLLM-based, so standard serving infrastructure applies.
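Because the server is vLLM-based, it speaks the standard OpenAI-compatible `/v1/chat/completions` API once `run_openseeker.sh` is up. A minimal stdlib client, assuming the default port 8000 and the model name below (both may differ in your `run_openseeker.sh`):

```python
import json
import urllib.request

def build_request(prompt, model="OpenSeeker-v1-30B-SFT",
                  url="http://localhost:8000/v1/chat/completions"):
    """Build an OpenAI-style chat request for the local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt):
    """Send the request and return the model's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

This is just the raw serving layer; the agent loop and tools in `src/` sit on top of it.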
Why Open-Sourcing Training Data Matters
The model weights alone arenβt the breakthrough. The training data is.
Fine-tuning a search agent is hard because you need examples of good search behavior: knowing when to search again, how to synthesize conflicting results, how to visit pages selectively. This is expensive to generate, and almost no one publishes it.
OpenSeeker publishes 11.7K such examples, with a higher-quality batch available on request. Anyone can now use that dataset to fine-tune their own models, run ablations, or build on it.
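Before fine-tuning on the released data, it is worth inspecting what the records actually contain. The schema of the trajectories is not assumed here; this sketch just loads a JSONL file (the format the eval scripts also use) and reports which keys appear and how often:

```python
import json
from collections import Counter

def summarize_jsonl(path):
    """Count records and key frequencies in a JSONL dataset."""
    key_counts = Counter()
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            key_counts.update(record.keys())
            n += 1
    return n, key_counts
```

Running this on the training file tells you immediately whether fields are uniform across all 11.7K examples, which is the first thing to check before writing a fine-tuning data loader.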
This is what academic AI research is supposed to look like: reproducible, auditable, and genuinely useful to the community.
The Research
The paper is on arXiv at 2603.15594:
OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data
Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen
SJTU, 2026
What to Build With It
A few directions worth exploring:
Domain-specific search agents. The architecture is domain-agnostic. Fine-tune on medical literature searches, legal case research, or financial filings; the 11.7K training format gives you a template for curating domain-specific search trajectories.
Smaller distilled versions. 30B is large. The training data could be used to distill a 7B or 14B search agent that trades some benchmark points for deployability.
Search + memory. Pair OpenSeeker with a persistent memory layer (soul.py) so the agent remembers what it searched for in prior sessions and avoids redundant queries. Research agents that compound knowledge over time rather than starting fresh each session.
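A toy version of that memory layer is just a persistent query cache consulted before the live search tool. The normalization and storage scheme below are illustrative choices, not part of OpenSeeker or soul.py:

```python
import json
import os

class SearchMemory:
    """Persist search results across sessions to skip redundant queries."""

    def __init__(self, path="search_memory.json"):
        self.path = path
        self.cache = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.cache = json.load(f)

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially rephrased
        # queries hit the same cache entry.
        return " ".join(query.lower().split())

    def lookup(self, query):
        return self.cache.get(self._key(query))

    def store(self, query, results):
        self.cache[self._key(query)] = results
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.cache, f)
```

In the agent loop, the search tool would first call `lookup` and only hit the live web on a miss; a real version would want semantic rather than string-normalized matching.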
RAG replacement. For complex multi-hop questions, OpenSeeker's iterative search approach may outperform static RAG pipelines. Instead of embedding a fixed corpus, let the agent search live.