AutoResearchClaw: We Ran a Fully Autonomous Research Pipeline β€” Here's What Actually Happened

By Prahlad Menon

TL;DR: AutoResearchClaw is a 23-stage autonomous research pipeline that takes a topic and outputs a conference-formatted paper with real literature, generated experiments, and LaTeX formatting. We ran it on β€œfile-based vs. vector-based memory architectures for LLM agents” and got working code, multi-agent analysis, and self-healing refinement loops. It’s impressive but raw β€” here’s exactly what happened, what it produced, and how to set it up yourself.

Most β€œAI research” tools are glorified summarizers. They scrape papers and regurgitate abstracts. AutoResearchClaw is different β€” it actually does research. It generates hypotheses through multi-agent debate, writes and executes experiment code, analyzes results statistically, and writes a paper with real citations.

We spent a day wrestling with it. Here’s the unvarnished truth.

What Is AutoResearchClaw?

AutoResearchClaw is an open-source Python pipeline that automates the entire research workflow:

  1. Topic β†’ Problem decomposition β€” Breaks your idea into research questions
  2. Literature discovery β€” Searches arXiv and Semantic Scholar for real papers
  3. Hypothesis generation β€” Multi-agent debate produces testable predictions
  4. Experiment design β€” Plans methodology with hardware awareness
  5. Code generation β€” Writes actual Python experiments
  6. Sandbox execution β€” Runs experiments with NaN/Inf detection and self-healing
  7. Result analysis β€” Multi-perspective statistical analysis
  8. Decision loops β€” Autonomously decides to PROCEED, REFINE, or PIVOT
  9. Paper writing β€” Generates Markdown and LaTeX with real \cite{} references

The key differentiator: it doesn’t just plan research β€” it executes experiments and captures real metrics.
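To make the PROCEED/REFINE/PIVOT loop concrete, here is a toy sketch of how such a decision rule might look. This is our own illustration, not AutoResearchClaw's actual code; the field names and the refine budget are invented for the example:

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"
    REFINE = "refine"
    PIVOT = "pivot"

def decide(quality_report: dict, refine_count: int, max_refines: int = 3) -> Decision:
    """Toy decision rule: pivot on fatal flaws, refine on fixable issues
    until the refine budget is spent, otherwise proceed to paper writing."""
    if quality_report.get("fatal_design_flaw"):
        return Decision.PIVOT
    if quality_report.get("quality_issues") and refine_count < max_refines:
        return Decision.REFINE
    return Decision.PROCEED
```

The interesting design question is the budget: without a cap on REFINE iterations, a pipeline like this can loop (and bill you) indefinitely.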

How Do You Install AutoResearchClaw?

Installation is straightforward:

# Clone the repository
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw

# Create virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install
pip install -e .

# Copy config template
cp config.researchclaw.example.yaml config.arc.yaml

Edit config.arc.yaml with your LLM settings:

project:
  name: "my-research"
  mode: "full-auto"

research:
  topic: "Your research topic here"

llm:
  provider: "openai"
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"
  primary_model: "gpt-4o"
  fallback_models: ["gpt-4o-mini"]

experiment:
  mode: "sandbox"
  time_budget_sec: 300
  sandbox:
    python_path: ".venv/bin/python3"

Then run:

export OPENAI_API_KEY="sk-..."
researchclaw run --config config.arc.yaml --topic "Your research idea" --auto-approve

What Did We Actually Research?

We gave it a topic close to home:

β€œComparative analysis of file-based vs. vector-based memory architectures for LLM agents”

This is literally what soul.py explores β€” so we knew enough to evaluate the output quality.

What Artifacts Did It Produce?

After running, we got this folder structure:

artifacts/rc-20260316-200601-7be555/
β”œβ”€β”€ stage-01/          # Topic initialization
β”œβ”€β”€ stage-02/          # Problem decomposition
β”œβ”€β”€ stage-03/          # Search strategy
β”œβ”€β”€ stage-04/          # Literature (5,153 lines of real BibTeX!)
β”‚   β”œβ”€β”€ candidates.jsonl
β”‚   β”œβ”€β”€ references.bib
β”‚   └── search_meta.json
β”œβ”€β”€ stage-08/          # Hypotheses from multi-agent debate
β”‚   β”œβ”€β”€ hypotheses.md
β”‚   β”œβ”€β”€ novelty_report.json
β”‚   └── perspectives/
β”œβ”€β”€ stage-10/          # Generated experiment code
β”‚   β”œβ”€β”€ experiment/
β”‚   β”‚   β”œβ”€β”€ main.py
β”‚   β”‚   β”œβ”€β”€ models.py
β”‚   β”‚   └── experiment_harness.py
β”‚   └── experiment_spec.md
β”œβ”€β”€ stage-13/          # Iterative refinement (self-healing)
β”œβ”€β”€ stage-14/          # Multi-agent analysis
β”‚   β”œβ”€β”€ analysis.md
β”‚   β”œβ”€β”€ results_table.tex
β”‚   └── perspectives/
β”œβ”€β”€ stage-15/          # Research decision (triggered REFINE)
β”œβ”€β”€ evolution/         # Self-learning lessons
β”‚   └── lessons.jsonl
└── deliverables/      # LaTeX template ready
    └── neurips_2025.sty

The Hypotheses It Generated

From stage-08/hypotheses.md β€” these came from a multi-perspective debate:

Hypothesis 1: Quantum-Inspired Memory Compression

Apply quantum-inspired entropic compression to file-based memory. Predict 30% improvement in retrieval speed and 20% reduction in energy consumption.

Hypothesis 2: Neuroplasticity-Inspired Dynamic Switching

Allow LLM agents to dynamically switch between file-based and vector-based architectures based on task demands. Predict 20% improvement in task completion time.

Hypothesis 3: File-Based Security Advantage

File-based architectures offer more secure data handling in privacy-sensitive applications. Predict 30% lower vulnerability incidence.

These are… actually interesting? The quantum-inspired angle is speculative, but the security hypothesis aligns with real concerns about embedding inversion attacks.

The Experiment Code It Wrote

From stage-10/experiment/main.py:

REGISTERED_CONDITIONS = [
    'Quantum_Compression_File_Based_System',
    'Neuroplastic_Dynamic_Memory_Switch', 
    'Compression_Without_Quantum_Inspiration',  # ablation
    'Static_Memory_Architecture'                 # baseline
]

def run_condition(condition_name, seed):
    np.random.seed(seed)
    
    if condition_name == 'Quantum_Compression_File_Based_System':
        model = QuantumCompressionMemory(
            compression_factor=0.7,
            retrieval_speed=1.0
        )
    elif condition_name == 'Neuroplastic_Dynamic_Memory_Switch':
        model = NeuroplasticMemorySwitcher(
            decision_accuracy=0.9,
            switch_threshold=0.8
        )
    # ... ablation conditions
    
    compression_ratio, dynamic_adjustment = model.run_training()
    latency = (50 / model.retrieval_speed) * compression_ratio * dynamic_adjustment
    energy_consumption = (200 / compression_ratio) * dynamic_adjustment
    
    return {'latency': latency, 'energy_consumption': energy_consumption}

It created a proper experimental design with baselines and ablations. The actual models are simplified simulations, but the structure is correct.
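The per-seed dicts that run_condition returns then need to be aggregated across seeds. A hypothetical sketch of that step using only the standard library (not taken from the generated code; the sample values are illustrative):

```python
from statistics import mean, stdev

def summarize(results):
    """Collapse a list of per-seed metric dicts into {metric: (mean, std)}."""
    return {
        key: (mean(r[key] for r in results), stdev(r[key] for r in results))
        for key in results[0]
    }

# Illustrative per-seed outputs in the shape run_condition returns
runs = [
    {"latency": 25.1, "energy_consumption": 254.2},
    {"latency": 25.4, "energy_consumption": 253.6},
    {"latency": 25.0, "energy_consumption": 254.1},
]
summary = summarize(runs)
```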

The Analysis It Produced

From stage-14/analysis.md:

## Metrics Summary
- Latency Reduction: Quantum-inspired compression achieved 25.2 ms vs 35.0 ms baseline
- Energy Consumption: Reduced to 253.97 J from 285.71 J
- Sample Size: 5 seeds

## Consensus Findings
1. Quantum-inspired techniques reduced both latency and energy consumption
2. Results consistent across different seed trials

## Contested Points (Multi-Agent Critique)
1. Small sample size questions robustness
2. Absence of statistical significance testing
3. Identical outputs in ablation studies flag experimental design flaw

## Recommendation: REFINE
Address methodological gaps before proceeding to paper writing.

The multi-agent analysis caught real issues β€” the ablation results were suspicious, and it correctly flagged the need for more rigorous statistics. This self-critique is the most impressive part.
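The missing significance testing the critique flagged is cheap to bolt on. A stdlib-only sketch of Welch's t statistic, our addition rather than anything the pipeline produced (with n = 5 seeds per condition, you would compare |t| against a t-distribution with the returned degrees of freedom):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples.

    A large |t| relative to a t-distribution with df degrees of freedom
    suggests the group means differ beyond seed-to-seed noise.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```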

The Self-Learning Lessons

From evolution/lessons.jsonl:

{
  "stage_name": "research_decision",
  "category": "pipeline", 
  "severity": "warning",
  "description": "Research decision was REFINE: warranted due to critical issues in methodology..."
}
{
  "stage_name": "experiment_run",
  "category": "experiment",
  "severity": "warning", 
  "description": "Runtime warning: Mean of empty slice..."
}

It captures lessons from each run. Future runs on similar topics will (theoretically) avoid these pitfalls.
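If you want to mine the lessons yourself, the file is plain JSONL: one JSON object per line. A small parser, assuming only the severity and category fields shown in the excerpt above:

```python
import json

def parse_lessons(text):
    """Parse JSONL lesson records, one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def warnings_by_category(lessons):
    """Group warning-severity lessons by their category field."""
    grouped = {}
    for lesson in lessons:
        if lesson.get("severity") == "warning":
            grouped.setdefault(lesson.get("category", "unknown"), []).append(lesson)
    return grouped
```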

What Went Wrong?

Plenty.

Trial 1: Wrong Python Path

The sandbox couldn’t find Python. Fix: explicitly set python_path: ".venv/bin/python3" in config.

Trial 2: Experiment Timeout

The 300-second time budget wasn’t enough. The pipeline hit the limit and couldn’t capture metrics.

Trial 3: REFINE Loop

The pipeline decided to REFINE (go back and improve experiments) based on its own quality analysis. This is actually good β€” it caught methodological issues. But it also means the run took longer and cost more API calls.

Trial 4: Paper Blocked

The paper writing stage refused to proceed because experiment metrics were incomplete. It printed:

Paper Draft Blocked
Reason: Experiment stage produced no metrics (status: failed/timeout)
Action Required: Fix experiment execution or increase time_budget_sec

Honest, at least.

What Is AutoResearchClaw Good At?

  1. Literature discovery β€” It found 50+ real papers from arXiv/Semantic Scholar
  2. Hypothesis structure β€” The multi-agent debate produces testable, falsifiable hypotheses
  3. Code scaffolding β€” The generated experiments have proper baselines and ablations
  4. Self-critique β€” The analysis stage catches real methodological issues
  5. Self-healing β€” When code fails, it attempts repairs

What Is AutoResearchClaw NOT?

  1. A paper mill β€” The output needs human review and editing
  2. Cheap β€” A full run costs $5-15+ in API calls
  3. Fast β€” Expect 30 min to 2 hours depending on complexity
  4. Reliable β€” Many runs fail partway through
  5. A shortcut β€” You still need domain expertise to evaluate output quality

How Does AutoResearchClaw Compare to Alternatives?

| Feature | AutoResearchClaw | AI Scientist (Sakana) | Manual Research |
|---|---|---|---|
| Literature Search | βœ… Real APIs | βœ… Real APIs | βœ… Manual |
| Hypothesis Gen | βœ… Multi-agent debate | βœ… LLM-based | βœ… Human insight |
| Code Execution | βœ… Sandboxed | βœ… Sandboxed | βœ… Manual |
| Self-Healing | βœ… Auto-repair | ⚠️ Limited | ❌ No |
| Self-Critique | βœ… Multi-agent | ⚠️ Single pass | βœ… Peer review |
| Cost | $5-15/run | Similar | Time cost |
| Open Source | βœ… MIT | ⚠️ Partial | N/A |

When Should You Use AutoResearchClaw?

Good use cases:

  • Exploring a new research direction quickly
  • Generating baseline experiments for a topic
  • Literature review with structured output
  • Teaching research methodology (watch it work)

Bad use cases:

  • Producing final papers for submission
  • Topics requiring proprietary data
  • Time-sensitive research (runs are slow)
  • Anything requiring reproducibility guarantees

How Do You Configure AutoResearchClaw for Your Needs?

Use a Different LLM (Claude via ACP)

llm:
  provider: "acp"
  acp:
    agent: "claude"  # Uses Claude Code CLI
    cwd: "."

Increase Experiment Time Budget

experiment:
  time_budget_sec: 600  # 10 minutes instead of 5
  max_iterations: 15    # More refinement loops

Enable Human-in-the-Loop Gates

security:
  hitl_required_stages: [5, 9, 20]  # Pause for approval
  allow_publish_without_approval: false

Target a Different Conference

export:
  target_conference: "iclr_2026"  # or neurips_2025, icml_2026

What Did We Learn?

  1. Autonomous research is real, but raw. AutoResearchClaw does things no other tool does β€” it actually runs experiments. But expect failures.

  2. The self-critique is the killer feature. The multi-agent analysis caught methodological issues we would have missed in a quick manual review.

  3. Budget more time and money than you expect. REFINE loops are common and expensive.

  4. The output is a starting point, not a finish line. Treat it as a research assistant that did the grunt work, not as a paper factory.

  5. Domain expertise still matters. We could evaluate the output because we know the memory architecture space. On an unfamiliar topic, the quality would be harder to assess.

Frequently Asked Questions

What is AutoResearchClaw?

AutoResearchClaw is an open-source 23-stage pipeline that automates research from topic to paper, including literature review, hypothesis generation, experiment execution, and LaTeX output.

How much does it cost to run?

A full run with GPT-4o costs $5-15 in API calls. Complex topics with multiple REFINE loops can cost more.

Does it actually run experiments?

Yes. It generates Python code and executes it in a sandbox with metric capture, NaN detection, and self-healing code repair.

Can I use local models?

Yes. Set base_url to your local endpoint (e.g., Ollama at http://localhost:11434/v1) and configure the model name accordingly.
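For example, a hypothetical Ollama setup might look like the following (the model name is illustrative; use whatever your local instance actually serves, and note that Ollama typically ignores the API key even though the client expects one):

```yaml
llm:
  provider: "openai"
  base_url: "http://localhost:11434/v1"
  api_key_env: "OPENAI_API_KEY"   # set to any non-empty value for local use
  primary_model: "llama3.1"
```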

How long does a run take?

30 minutes to 2+ hours depending on topic complexity and refinement loops.

What if a run fails?

Check the stage_health.json files in each stage folder. They contain error details. Common fixes: increase time budget, fix Python path, or simplify the topic.
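That check is easy to script. A sketch that scans a run folder for unhealthy stages; the "status" and "error" field names are assumptions on our part, so adjust them to match what your stage_health.json files actually contain:

```python
import json
from pathlib import Path

def failed_stages(run_dir):
    """Return (stage folder, error detail) for every stage whose
    stage_health.json reports something other than an "ok" status."""
    failures = []
    for health in sorted(Path(run_dir).glob("stage-*/stage_health.json")):
        info = json.loads(health.read_text())
        if info.get("status") != "ok":
            failures.append((health.parent.name, info.get("error", "unknown")))
    return failures
```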

Is the output publishable?

Not directly. The output requires human review, additional experiments, and substantial editing. It’s a first draft, not a final paper.

What license is it under?

MIT license. Free for commercial and academic use.
