AutoResearchClaw: We Ran a Fully Autonomous Research Pipeline – Here's What Actually Happened
TL;DR: AutoResearchClaw is a 23-stage autonomous research pipeline that takes a topic and outputs a conference-ready paper with real literature, generated experiments, and LaTeX formatting. We ran it on "file-based vs. vector-based memory architectures for LLM agents" and got working code, multi-agent analysis, and self-healing refinement loops. It's impressive but raw; here's exactly what happened, what it produced, and how to set it up yourself.
Most "AI research" tools are glorified summarizers. They scrape papers and regurgitate abstracts. AutoResearchClaw is different: it actually does research. It generates hypotheses through multi-agent debate, writes and executes experiment code, analyzes results statistically, and writes a paper with real citations.
We spent a day wrestling with it. Here's the unvarnished truth.
What Is AutoResearchClaw?
AutoResearchClaw is an open-source Python pipeline that automates the entire research workflow:
- Topic → Problem decomposition – Breaks your idea into research questions
- Literature discovery – Searches arXiv and Semantic Scholar for real papers
- Hypothesis generation – Multi-agent debate produces testable predictions
- Experiment design – Plans methodology with hardware awareness
- Code generation – Writes actual Python experiments
- Sandbox execution – Runs experiments with NaN/Inf detection and self-healing
- Result analysis – Multi-perspective statistical analysis
- Decision loops – Autonomously decides to PROCEED, REFINE, or PIVOT
- Paper writing – Generates Markdown and LaTeX with real `\cite{}` references
The key differentiator: it doesn't just plan research; it executes experiments and captures real metrics.
How Do You Install AutoResearchClaw?
Installation is straightforward:
```bash
# Clone the repository
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw

# Create virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install
pip install -e .

# Copy config template
cp config.researchclaw.example.yaml config.arc.yaml
```
Edit config.arc.yaml with your LLM settings:
```yaml
project:
  name: "my-research"
  mode: "full-auto"

research:
  topic: "Your research topic here"

llm:
  provider: "openai"
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"
  primary_model: "gpt-4o"
  fallback_models: ["gpt-4o-mini"]

experiment:
  mode: "sandbox"
  time_budget_sec: 300

sandbox:
  python_path: ".venv/bin/python3"
```
Then run:
```bash
export OPENAI_API_KEY="sk-..."
researchclaw run --config config.arc.yaml --topic "Your research idea" --auto-approve
```
What Did We Actually Research?
We gave it a topic close to home:
"Comparative analysis of file-based vs. vector-based memory architectures for LLM agents"
This is literally what `soul.py` explores, so we knew enough to evaluate the output quality.
What Artifacts Did It Produce?
After running, we got this folder structure:
```
artifacts/rc-20260316-200601-7be555/
├── stage-01/          # Topic initialization
├── stage-02/          # Problem decomposition
├── stage-03/          # Search strategy
├── stage-04/          # Literature (5,153 lines of real BibTeX!)
│   ├── candidates.jsonl
│   ├── references.bib
│   └── search_meta.json
├── stage-08/          # Hypotheses from multi-agent debate
│   ├── hypotheses.md
│   ├── novelty_report.json
│   └── perspectives/
├── stage-10/          # Generated experiment code
│   ├── experiment/
│   │   ├── main.py
│   │   ├── models.py
│   │   └── experiment_harness.py
│   └── experiment_spec.md
├── stage-13/          # Iterative refinement (self-healing)
├── stage-14/          # Multi-agent analysis
│   ├── analysis.md
│   ├── results_table.tex
│   └── perspectives/
├── stage-15/          # Research decision (triggered REFINE)
├── evolution/         # Self-learning lessons
│   └── lessons.jsonl
└── deliverables/      # LaTeX template ready
    └── neurips_2025.sty
```
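To get a quick feel for what a run produced, a few lines of Python can walk an artifacts tree like the one above and count files per stage. This is a hypothetical helper of our own, not part of AutoResearchClaw; adjust the root path to your own run ID:

```python
import os

def summarize_run(root):
    """Count files (recursively) under each top-level stage folder of a run."""
    summary = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            # Walk the subtree so nested output (e.g. experiment/) is included
            summary[entry] = sum(len(files) for _, _, files in os.walk(path))
    return summary

# Example usage (assumes the run folder exists):
# for stage, count in summarize_run("artifacts/rc-20260316-200601-7be555").items():
#     print(f"{stage}: {count} files")
```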
The Hypotheses It Generated
From `stage-08/hypotheses.md` – these came from a multi-perspective debate:
Hypothesis 1: Quantum-Inspired Memory Compression
Apply quantum-inspired entropic compression to file-based memory. Predict 30% improvement in retrieval speed and 20% reduction in energy consumption.
Hypothesis 2: Neuroplasticity-Inspired Dynamic Switching
Allow LLM agents to dynamically switch between file-based and vector-based architectures based on task demands. Predict 20% improvement in task completion time.
Hypothesis 3: File-Based Security Advantage
File-based architectures offer more secure data handling in privacy-sensitive applications. Predict 30% lower vulnerability incidence.
These are… actually interesting? The quantum-inspired angle is speculative, but the security hypothesis aligns with real concerns about embedding inversion attacks.
The Experiment Code It Wrote
From `stage-10/experiment/main.py`:
```python
import numpy as np

REGISTERED_CONDITIONS = [
    'Quantum_Compression_File_Based_System',
    'Neuroplastic_Dynamic_Memory_Switch',
    'Compression_Without_Quantum_Inspiration',  # ablation
    'Static_Memory_Architecture',               # baseline
]

def run_condition(condition_name, seed):
    np.random.seed(seed)
    if condition_name == 'Quantum_Compression_File_Based_System':
        model = QuantumCompressionMemory(
            compression_factor=0.7,
            retrieval_speed=1.0,
        )
    elif condition_name == 'Neuroplastic_Dynamic_Memory_Switch':
        model = NeuroplasticMemorySwitcher(
            decision_accuracy=0.9,
            switch_threshold=0.8,
        )
    # ... ablation conditions

    compression_ratio, dynamic_adjustment = model.run_training()
    latency = (50 / model.retrieval_speed) * compression_ratio * dynamic_adjustment
    energy_consumption = (200 / compression_ratio) * dynamic_adjustment
    return {'latency': latency, 'energy_consumption': energy_consumption}
```
It created a proper experimental design with baselines and ablations. The actual models are simplified simulations, but the structure is correct.
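Since `run_condition` returns a metrics dict per seed, rolling the five seed trials up into summary statistics is straightforward. A minimal sketch of our own (not from the generated code), assuming every result dict shares the same numeric keys:

```python
import statistics

def aggregate_metrics(results):
    """Reduce a list of per-seed metric dicts to mean and stdev per metric."""
    aggregated = {}
    for key in results[0]:
        values = [r[key] for r in results]
        aggregated[key] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return aggregated

# Example usage against the harness above:
# results = [run_condition('Static_Memory_Architecture', seed) for seed in range(5)]
# print(aggregate_metrics(results))
```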
The Analysis It Produced
From `stage-14/analysis.md`:
```markdown
## Metrics Summary
- Latency Reduction: Quantum-inspired compression achieved 25.2 ms vs 35.0 ms baseline
- Energy Consumption: Reduced to 253.97 J from 285.71 J
- Sample Size: 5 seeds

## Consensus Findings
1. Quantum-inspired techniques reduced both latency and energy consumption
2. Results consistent across different seed trials

## Contested Points (Multi-Agent Critique)
1. Small sample size questions robustness
2. Absence of statistical significance testing
3. Identical outputs in ablation studies flag experimental design flaw

## Recommendation: REFINE
Address methodological gaps before proceeding to paper writing.
```
The multi-agent analysis caught real issues: the ablation results were suspicious, and it correctly flagged the need for more rigorous statistics. This self-critique is the most impressive part.
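The missing significance testing the critique flagged is easy to bolt on yourself. Here is a minimal sketch of our own (not part of AutoResearchClaw) of a two-sided permutation test on per-seed metrics, standard library only:

```python
import random
import statistics

def permutation_test(a, b, n_permutations=10000, seed=0):
    """Approximate two-sided p-value for the difference of means of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    return count / n_permutations
```

For example, `permutation_test(treatment_latencies, baseline_latencies)` returns an approximate p-value; with only 5 seeds per condition, expect low statistical power.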
The Self-Learning Lessons
From `evolution/lessons.jsonl`:
```json
{
  "stage_name": "research_decision",
  "category": "pipeline",
  "severity": "warning",
  "description": "Research decision was REFINE: warranted due to critical issues in methodology..."
}
{
  "stage_name": "experiment_run",
  "category": "experiment",
  "severity": "warning",
  "description": "Runtime warning: Mean of empty slice..."
}
```
It captures lessons from each run. Future runs on similar topics will (theoretically) avoid these pitfalls.
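Because `lessons.jsonl` is plain JSON Lines, you can audit what a run "learned" with a few lines of stdlib Python. A sketch of our own, assuming the `severity` field shown above:

```python
import json
from collections import Counter

def tally_lessons(path):
    """Count lessons by severity in a JSON Lines file (one record per line)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get("severity", "unknown")] += 1
    return dict(counts)

# Example usage:
# print(tally_lessons("artifacts/<run-id>/evolution/lessons.jsonl"))
```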
What Went Wrong?
Plenty.
Trial 1: Wrong Python Path
The sandbox couldn't find Python. Fix: explicitly set `python_path: ".venv/bin/python3"` in the config.
Trial 2: Experiment Timeout
The 300-second time budget wasn't enough. The pipeline hit the limit and couldn't capture metrics.
Trial 3: REFINE Loop
The pipeline decided to REFINE (go back and improve experiments) based on its own quality analysis. This is actually good: it caught methodological issues. But it also means the run took longer and cost more API calls.
Trial 4: Paper Blocked
The paper writing stage refused to proceed because experiment metrics were incomplete. It printed:
```
Paper Draft Blocked
Reason: Experiment stage produced no metrics (status: failed/timeout)
Action Required: Fix experiment execution or increase time_budget_sec
```
Honest, at least.
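When a run fails partway, the per-stage `stage_health.json` files are the first place to look. A hedged sketch of our own for scanning them; we assume a top-level `status` field here, and the actual schema may differ:

```python
import glob
import json
import os

def find_failed_stages(run_dir):
    """Return (stage_folder, status) for every stage whose health file is not OK."""
    failed = []
    pattern = os.path.join(run_dir, "stage-*", "stage_health.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            health = json.load(f)
        if health.get("status") not in ("ok", "success"):
            failed.append((os.path.basename(os.path.dirname(path)), health.get("status")))
    return failed

# Example usage:
# print(find_failed_stages("artifacts/<run-id>"))
```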
What Is AutoResearchClaw Good At?
- Literature discovery – It found 50+ real papers from arXiv/Semantic Scholar
- Hypothesis structure – The multi-agent debate produces testable, falsifiable hypotheses
- Code scaffolding – The generated experiments have proper baselines and ablations
- Self-critique – The analysis stage catches real methodological issues
- Self-healing – When code fails, it attempts repairs
What Is AutoResearchClaw NOT?
- A paper mill – The output needs human review and editing
- Cheap – A full run costs $5-15+ in API calls
- Fast – Expect 30 min to 2 hours depending on complexity
- Reliable – Many runs fail partway through
- A shortcut – You still need domain expertise to evaluate output quality
How Does AutoResearchClaw Compare to Alternatives?
| Feature | AutoResearchClaw | AI Scientist (Sakana) | Manual Research |
|---|---|---|---|
| Literature Search | ✅ Real APIs | ✅ Real APIs | ❌ Manual |
| Hypothesis Gen | ✅ Multi-agent debate | ✅ LLM-based | ❌ Human insight |
| Code Execution | ✅ Sandboxed | ✅ Sandboxed | ❌ Manual |
| Self-Healing | ✅ Auto-repair | ⚠️ Limited | ❌ No |
| Self-Critique | ✅ Multi-agent | ⚠️ Single pass | ❌ Peer review |
| Cost | $5-15/run | Similar | Time cost |
| Open Source | ✅ MIT | ⚠️ Partial | N/A |
When Should You Use AutoResearchClaw?
Good use cases:
- Exploring a new research direction quickly
- Generating baseline experiments for a topic
- Literature review with structured output
- Teaching research methodology (watch it work)
Bad use cases:
- Producing final papers for submission
- Topics requiring proprietary data
- Time-sensitive research (runs are slow)
- Anything requiring reproducibility guarantees
How Do You Configure AutoResearchClaw for Your Needs?
Use a Different LLM (Claude via ACP)
```yaml
llm:
  provider: "acp"
  acp:
    agent: "claude"  # Uses Claude Code CLI
    cwd: "."
```
Increase Experiment Time Budget
```yaml
experiment:
  time_budget_sec: 600  # 10 minutes instead of 5
  max_iterations: 15    # More refinement loops
```
Enable Human-in-the-Loop Gates
```yaml
security:
  hitl_required_stages: [5, 9, 20]  # Pause for approval
  allow_publish_without_approval: false
```
Target a Different Conference
```yaml
export:
  target_conference: "iclr_2026"  # or neurips_2025, icml_2026
```
What Did We Learn?
- **Autonomous research is real, but raw.** AutoResearchClaw does things no other tool does: it actually runs experiments. But expect failures.
- **The self-critique is the killer feature.** The multi-agent analysis caught methodological issues we would have missed in a quick manual review.
- **Budget more time and money than you expect.** REFINE loops are common and expensive.
- **The output is a starting point, not a finish line.** Treat it as a research assistant that did the grunt work, not as a paper factory.
- **Domain expertise still matters.** We could evaluate the output because we know the memory architecture space. On an unfamiliar topic, the quality would be harder to assess.
Frequently Asked Questions
What is AutoResearchClaw?
AutoResearchClaw is an open-source 23-stage pipeline that automates research from topic to paper, including literature review, hypothesis generation, experiment execution, and LaTeX output.
How much does it cost to run?
A full run with GPT-4o costs $5-15 in API calls. Complex topics with multiple REFINE loops can cost more.
Does it actually run experiments?
Yes. It generates Python code and executes it in a sandbox with metric capture, NaN detection, and self-healing code repair.
Can I use local models?
Yes. Set `base_url` to your local endpoint (e.g., Ollama at `http://localhost:11434/v1`) and configure the model name accordingly.
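For example, a local setup pointing at Ollama's OpenAI-compatible endpoint might look like this sketch (the model name is illustrative; use whatever you have pulled locally, and note Ollama ignores the API key but the config may still require one to be set):

```yaml
llm:
  provider: "openai"                 # Ollama exposes an OpenAI-compatible API
  base_url: "http://localhost:11434/v1"
  api_key_env: "OPENAI_API_KEY"      # set a dummy value; Ollama does not check it
  primary_model: "llama3.1"          # illustrative; any locally pulled model
```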
How long does a run take?
20 minutes to 2+ hours depending on topic complexity and refinement loops.
What if a run fails?
Check the stage_health.json files in each stage folder. They contain error details. Common fixes: increase time budget, fix Python path, or simplify the topic.
Is the output publishable?
Not directly. The output requires human review, additional experiments, and substantial editing. It's a first draft, not a final paper.
What license is it under?
MIT license. Free for commercial and academic use.
Links: