MedOpenClaw: When Giving AI More Tools Makes It Worse
There’s a finding buried in a new paper from TUM, Imperial College, CMU, Oxford, and NUS that deserves more attention than it got from the LinkedIn post announcing it:
State-of-the-art VLMs like Gemini 3.1 Pro and GPT-5.4 can navigate a medical viewer to answer basic study-level questions — but their performance paradoxically degrades when given access to professional clinical support tools.
More tools. Worse results. In a clinical context, that’s not a curiosity — it’s a safety signal.
The Problem with How We Evaluate Medical AI
Every major medical VLM benchmark to date shares a structural flaw: the model is handed one or two pre-selected, diagnostically relevant 2D images and asked a localized question. Did you see the nodule? What’s the Gleason grade? Is this a hemorrhage?
That’s not how radiology works.
A real radiologist opens a full 3D study — hundreds of DICOM slices across multiple sequences, sometimes fused CT/PET volumes — and navigates. They scroll, adjust windowing, switch between sequences, compare views, take measurements. Many findings are invisible in any single slice. They emerge across adjacent slices, across modalities, after the right region has been localized.
Benchmarks that skip this step aren’t measuring clinical capability. They’re measuring pattern matching on curated inputs — a much easier problem.
MedOpenClaw is an attempt to close that gap.
What MedOpenClaw Actually Is
The paper introduces two components:
MedOpenClaw — an auditable runtime that sits between a VLM and a standard medical viewer (specifically 3D Slicer). The model gets a bounded set of viewer actions: scroll slices, adjust windowing, switch sequences, fuse modalities, take measurements. Every action is logged. The full reasoning trajectory — where it looked, what it adjusted, what evidence it gathered — is recorded and replayable.
That auditability piece matters clinically. A model that returns an answer is useful. A model that returns an answer and a complete record of how it arrived there is defensible — to a radiologist, a regulator, a malpractice attorney.
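The logging pattern described above can be sketched in a few lines. This is a minimal illustration, not MedOpenClaw’s actual API: the `Action` vocabulary, `Step`, and `Trajectory` names are all hypothetical, standing in for whatever schema the runtime really uses to record viewer calls.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from enum import Enum

# Hypothetical action vocabulary; the real runtime's bounded action set
# (scroll, windowing, sequence switch, fusion, measurement) may differ.
class Action(Enum):
    SCROLL = "scroll"
    SET_WINDOW = "set_window"
    SWITCH_SEQUENCE = "switch_sequence"
    FUSE = "fuse"
    MEASURE = "measure"

@dataclass
class Step:
    action: str
    params: dict
    timestamp: float

@dataclass
class Trajectory:
    """Append-only log of every viewer action, serializable for replay."""
    steps: list = field(default_factory=list)

    def record(self, action: Action, **params) -> None:
        # Log the action before it is executed, so the trace is complete
        # even if a downstream viewer call fails.
        self.steps.append(Step(action.value, params, time.time()))

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self.steps], indent=2)

# Usage: the runtime records each viewer call as the model issues it.
traj = Trajectory()
traj.record(Action.SWITCH_SEQUENCE, sequence="FLAIR")
traj.record(Action.SCROLL, delta=12)
traj.record(Action.MEASURE, start=[41, 88, 112], end=[41, 95, 118])
```

The key property is that the log is a complete, ordered record: given the same study, anyone can step through it and see exactly what evidence the model gathered.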
MedFlowBench — a benchmark built on top of this runtime, covering multi-sequence brain MRI and lung CT/PET studies. Three tracks:
- Viewer-only: navigate the volume, no tools beyond basic viewing
- Tool-use: access professional support tools (measurements, fusion, windowing presets)
- Open-method: no constraints
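The three tracks amount to progressively looser constraints on the model’s action space. A minimal sketch of that gating, with illustrative tool names rather than the benchmark’s actual identifiers:

```python
# Basic viewing actions vs. professional support tools -- names are
# illustrative assumptions, not MedFlowBench's real action identifiers.
VIEWER_ACTIONS = {"scroll", "set_window", "switch_sequence"}
PRO_TOOLS = {"measure", "fuse", "windowing_preset"}

# None means "no constraint" (the open-method track).
TRACKS = {
    "viewer_only": VIEWER_ACTIONS,
    "tool_use": VIEWER_ACTIONS | PRO_TOOLS,
    "open_method": None,
}

def is_allowed(track: str, action: str) -> bool:
    """Return whether a given action is permitted under a given track."""
    allowed = TRACKS[track]
    return True if allowed is None else action in allowed
```

Framed this way, the paper’s headline result is that moving a model from the `viewer_only` set to the `tool_use` set, a strict superset of capabilities, makes it perform worse.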
The Finding That Should Give Everyone Pause
The benchmark results are where it gets interesting.
Top frontier models — Gemini 3.1 Pro and GPT-5.4 — can navigate the viewer well enough to handle basic study-level tasks. Give them a volume and a question, and they’ll scroll to the right place and find an answer. That’s genuinely impressive progress from where medical AI was two years ago.
But when those same models are given access to the professional tools — the measurement tools, windowing presets, fusion controls that a real radiologist uses daily — performance drops.
The paper attributes this to a lack of precise spatial grounding. The models don’t have a stable enough internal representation of 3D space to drive tools that require exact coordinates. They know roughly where to look, but not precisely enough to place a measurement caliper accurately or to fuse volumes at the correct registration point.
This is a meaningful distinction. Navigating is a coarse task — scroll toward the right region. Tool use is a fine-grained task — click precisely here, measure exactly this distance. Current VLMs have the former but not the latter.
The implication for clinical deployment: giving a medical AI model access to professional tools before it has the spatial grounding to use them reliably doesn’t just waste the tools — it actively degrades performance compared to keeping the interface simple.
Why Auditability Is the Right Frame
The paper positions MedOpenClaw as an “auditable runtime” rather than just a benchmark or a model — and that framing is deliberate.
Clinical AI adoption has stalled partly because of the black-box problem. A model says “this looks like a GBM.” The radiologist asks: based on what? Which sequences? Did you look at the FLAIR suppression? The model has no answer. The radiologist either trusts it or overrides it, with no way to verify the reasoning.
A fully logged trajectory changes that interaction. The radiologist can replay exactly what the model did — which slices it scrolled to, which sequences it prioritized, where it paused. Disagreement becomes a teachable moment rather than a trust failure.
This is the same principle driving the shift toward chain-of-thought reasoning in general LLMs, applied to spatial medical reasoning. The output matters less than the process being visible.
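Replaying a logged trajectory is conceptually simple: step through the record in order and re-dispatch each action to the viewer. The sketch below assumes a log format of its own invention, not MedOpenClaw’s actual schema, and a generic viewer object standing in for 3D Slicer.

```python
# Assumed log format: a list of {"action": name, "params": kwargs} dicts.
trajectory = [
    {"action": "switch_sequence", "params": {"sequence": "FLAIR"}},
    {"action": "scroll", "params": {"delta": 12}},
    {"action": "measure",
     "params": {"start": [41, 88, 112], "end": [41, 95, 118]}},
]

def replay(trajectory, viewer):
    """Re-dispatch each logged step to the viewer, in recorded order."""
    for step in trajectory:
        getattr(viewer, step["action"])(**step["params"])

# A stand-in viewer that prints each dispatched call, so a reviewer can
# watch the model's path through the study step by step.
class PrintViewer:
    def __getattr__(self, name):
        return lambda **kw: print(name, kw)

replay(trajectory, PrintViewer())
```

The same `replay` loop could drive the real viewer to reproduce the model’s session, or a checker that flags suspicious patterns (e.g. a diagnosis issued without ever visiting the relevant sequence).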
The Broader Context
This work sits at the intersection of two trends worth watching:
Agentic medical AI — models that don’t just answer questions but actively investigate, the way a physician does. MedOpenClaw is one of the cleaner implementations of this idea in a clinical imaging context. The Virtual CFI work in aviation uses the same pattern: an agent that autonomously calls tools (METAR, TAF, NOTAMs) rather than waiting to be fed data.
Harness quality as a bottleneck — the finding that tool access hurts performance is a specific instance of a more general problem: the harness around a model shapes its behavior as much as the model itself. The Meta-Harness paper argues this explicitly. MedOpenClaw’s results provide empirical confirmation in a high-stakes domain.
The authors — Weixiang Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Chengzhi Shen, Min Xu (CMU), Yueming Jin (NUS), Benedikt Wiestler, Daniel Rueckert, and Jiazhen Pan from TUM, Imperial, Oxford, LMU Munich, and the Munich Center for Machine Learning — have released both the runtime and the benchmark. It’s reproducible, which is rarer than it should be in medical AI.
Paper: arXiv:2603.24649