INSID3: SOTA Image Segmentation With Zero Training, One Example, and a Frozen Backbone

By Prahlad Menon · 3 min read

Accepted at CVPR 2026, INSID3 just set a new bar for in-context image segmentation — and the headline is almost hard to believe: it beats specialized trained models using a single frozen backbone, no fine-tuning, no segmentation decoder, and no extra supervision whatsoever.

+7.5% mIoU over previous SOTA. 3x fewer parameters. Zero training.

Here’s what’s going on.

The Problem: In-Context Segmentation

In-context segmentation (ICS) is the task of segmenting an arbitrary concept in a target image given just one annotated example — the “context.” Show it one picture of a dog with a mask, and it segments dogs in new images. Show it a medical lesion, and it finds that lesion type elsewhere.

The catch: “arbitrary concept” means you need to handle objects, parts, textures, and personalized instances without retraining per category. This is genuinely hard.

Previous approaches fell into one of two traps:

  1. Fine-tune a foundation model → improves in-domain performance, kills generalization
  2. Combine multiple frozen models → preserves generalization, but architecturally complex and locked to fixed segmentation granularities

INSID3 asks a simpler question: can a single self-supervised backbone do both semantic matching and segmentation, without any supervision at all?

The answer, with DINOv3, is yes.

Why DINOv3 Features Are Enough

DINO-family models (and DINOv3 specifically) produce dense, spatially structured features as a byproduct of self-supervised training. Each patch in the image gets a feature vector, and semantically similar patches end up near each other in that space — not because you trained for segmentation, but because the self-supervised objective forced the model to understand spatial structure.

The key insight from the INSID3 paper: scaled-up DINOv3 features have strong enough semantic correspondence that you don’t need additional supervision to match and segment. The structure is already there. You just need to use it correctly.
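To make “the structure is already there” concrete, here is a toy sketch of what semantic correspondence in feature space looks like. The vectors below are synthetic stand-ins, not real DINOv3 output — the point is only that cosine similarity between L2-normalized patch features separates same-concept patches from different-concept ones:

```python
import numpy as np

# Toy patch features: in DINO-family models, patches of the same concept
# land close together in feature space. Here "dog" patches share one
# direction, "grass" patches another. (Synthetic vectors, not real DINOv3.)
rng = np.random.default_rng(0)
dog = np.array([1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(3, 3))
grass = np.array([0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=(3, 3))
patches = np.vstack([dog, grass])

# Cosine similarity = dot product of L2-normalized features
normed = patches / np.linalg.norm(patches, axis=1, keepdims=True)
sim = normed @ normed.T

same = sim[0, 1]    # dog patch vs. another dog patch
cross = sim[0, 3]   # dog patch vs. a grass patch
print(same > cross) # True: same-concept patches match far more strongly
```

With real DINOv3 features the same computation works, just in a much higher-dimensional space and with one patch vector per 16x16 image patch.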

There’s one catch the authors found: DINOv3 has a positional bias. Patch features encode not just what is in a patch, but where it is in the image. If you naively match features between a reference and target image, you’ll get spurious matches driven by position rather than semantics. INSID3 identifies and removes this bias — and that’s a significant part of why it works.
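The paper's exact debiasing procedure isn't reproduced here, but the simplest version of the idea is easy to illustrate: if every image's features at patch position p share a positional component, you can estimate that component as the per-position mean over many images and subtract it. A minimal sketch, with synthetic features:

```python
import numpy as np

def remove_positional_bias(feats):
    """Illustration only -- not the paper's actual method.
    feats: (n_images, n_patches, dim) patch features from a frozen backbone.
    Estimate the positional component as the per-position mean across
    images, then subtract it from every image's features."""
    pos_component = feats.mean(axis=0, keepdims=True)  # (1, n_patches, dim)
    return feats - pos_component

# Synthetic features: a strong shared positional signal plus a weaker
# per-image semantic signal
rng = np.random.default_rng(0)
pos = 3.0 * rng.normal(size=(1, 16, 8))   # same for every image
sem = rng.normal(size=(4, 16, 8))         # varies per image
feats = pos + sem

debiased = remove_positional_bias(feats)
# After subtraction, the shared positional component is gone on average
print(np.abs(debiased.mean(axis=0)).max())  # close to 0
```

After debiasing, matches between reference and target patches are driven by what the patch contains rather than where it sits in the frame.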

What INSID3 Actually Does

The pipeline is conceptually clean:

  1. Encode the reference image and target image through frozen DINOv3
  2. Remove positional bias from the features (their novel contribution)
  3. Match reference patches to target patches using semantic similarity
  4. Propagate the reference mask to the target based on those matches
  5. Refine at varying granularities (object, part, or instance level)

No learned decoder. No auxiliary segmentation model. No per-dataset fine-tuning. The backbone weights never change.
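Steps 3–4 can be sketched in a few lines. The helper below is hypothetical (the paper's actual matching and refinement are more involved), but it captures the core move: label each target patch by comparing its best cosine match against the reference's foreground patches versus its background patches:

```python
import numpy as np

def propagate_mask(ref_feats, ref_mask, tgt_feats, threshold=0.5):
    """Hypothetical sketch of steps 3-4, not the paper's exact method.
    ref_feats: (P, D) reference patch features, ref_mask: (P,) bool,
    tgt_feats: (Q, D) target patch features from the same frozen backbone."""
    # L2-normalize so the dot product is cosine similarity
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ ref.T                          # (Q, P) similarities
    fg_score = sim[:, ref_mask].max(axis=1)    # best foreground match
    bg_score = sim[:, ~ref_mask].max(axis=1)   # best background match
    return (fg_score > bg_score) & (fg_score > threshold)

# Synthetic demo: "object" patches cluster around one direction,
# "background" patches around another (stand-ins for DINOv3 features)
rng = np.random.default_rng(1)
obj, bg = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
ref_feats = np.vstack([obj + 0.05 * rng.normal(size=(4, 4)),
                       bg + 0.05 * rng.normal(size=(4, 4))])
ref_mask = np.array([True] * 4 + [False] * 4)
tgt_feats = np.vstack([obj + 0.05 * rng.normal(size=(3, 4)),
                       bg + 0.05 * rng.normal(size=(3, 4))])

pred = propagate_mask(ref_feats, ref_mask, tgt_feats)
print(pred)  # object patches flagged True, background patches False
```

Everything here is pure feature geometry — which is exactly why no decoder needs to be trained.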

Results

Tested across one-shot semantic, part, and personalized segmentation benchmarks:

| Task | Compared against | INSID3 result |
| --- | --- | --- |
| One-shot semantic segmentation (COCO) | prior SOTA | +7.5% mIoU |
| Part segmentation | trained models | outperforms, with 3x fewer parameters |
| Personalized segmentation | fine-tuned models | outperforms, training-free |

And it generalizes across domains that weren’t in any training data: natural images, medical images, underwater scenes, aerial imagery. That last point matters a lot for applied work — most segmentation models crater outside their training distribution.

Running INSID3

Setup:

git clone https://github.com/visinf/INSID3
cd INSID3
conda create --name insid3 python=3.10 -y
conda activate insid3
pip install -r requirements.txt

Download DINOv3 weights (Large model by default):

mkdir -p pretrain
# Download from https://github.com/facebookresearch/dinov3
# Place here: pretrain/dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth

Minimal usage — segment anything with one example:

from models import build_insid3

# Load model (frozen DINOv3 — no GPU-heavy training needed)
model = build_insid3()

# Give it one example: reference image + its mask
model.set_reference("ref_image.png", "ref_mask.png")

# Point it at a new image
model.set_target("target_image.png")

# Segment
pred_mask = model.segment()  # (1024, 1024) bool array

That’s it. Three lines of setup, one prediction call.

Run evaluation on standard benchmarks:

# COCO one-shot semantic segmentation
python inference.py --dataset coco --exp-name insid3-coco

# Part segmentation
python inference.py --dataset pascal_part --exp-name insid3-parts

# Medical (ISIC skin lesions, lung X-ray)
python inference.py --dataset isic --exp-name insid3-medical
python inference.py --dataset lung --exp-name insid3-lung

# Adjust backbone size
python inference.py --dataset coco --model_size base  # smaller, faster
python inference.py --dataset coco --model_size large  # default, best results

Supported datasets: coco, lvis, pascal_part, paco_part, isaid, isic, lung, suim, permis

Why This Matters Beyond the Benchmark Numbers

For medical AI specifically: most segmentation tools require annotated training data per anatomy, per pathology, per imaging modality. INSID3’s one-shot, training-free approach means you can segment a new pathology type with a single annotated example. The paper explicitly validates on ISIC (skin lesions) and lung X-rays. This is a meaningful unlock for rare disease or low-resource clinical settings where you simply can’t gather thousands of labeled examples.

For RAG and multimodal pipelines: as vision-language models get more capable, “show me one example of X and find it everywhere” becomes a natural interface. INSID3 is a clean, open-source building block for that pattern — no fine-tuning required means it can slot into existing pipelines without expensive retraining cycles.

For the field: the positional bias insight is the kind of thing that, once identified, is obvious in retrospect. DINO features encode location. Of course that contaminates semantic matching. Removing it cleanly — without any learned correction — is the technical contribution that makes everything else work.

The Minimalist Lesson

INSID3 is the latest in a line of results reminding us that more supervision isn’t always the answer. The features from a well-trained self-supervised backbone already contain the structure you need. The research question is whether you can unlock it without adding complexity.

Here, the answer is a clean yes. One backbone. Frozen. Training-free. SOTA.
