DART: Turn SAM3 Into a Real-Time Object Detector — No Training Required
TL;DR: DART converts SAM3 into a real-time object detector without any training. Describe what you want to find (“poles”, “insulators”, “damaged equipment”), and DART detects them at 15+ FPS. Install with `pip install -e .` after cloning the repo. TensorRT engines available for maximum speed.
You have SAM3 for segmentation. But what if you need detection — bounding boxes, class labels, confidence scores — in real-time? Training a custom detector means collecting data, labeling thousands of images, and waiting hours for training.
DART skips all of that. It’s a training-free framework that repurposes SAM3’s architecture for multi-class detection. Tell it what to look for, and it finds it.
What Problem Does DART Solve?
DART solves the “I need a detector but don’t have labeled data” problem. Traditional object detection requires:
- Thousands of labeled images per class
- Hours of training time
- Retraining when you add new classes
DART eliminates all three. It’s open-vocabulary — you specify classes at runtime via text prompts. Need to detect utility poles today and solar panels tomorrow? Just change the text.
How Does DART Work Under the Hood?
DART repurposes SAM3’s architecture for detection:
- Text encoder generates embeddings for your target classes
- Image backbone (ViT-H/14) extracts features from the input
- Encoder-decoder produces class-aware attention maps
- Detection head converts attention maps to bounding boxes + scores
The key insight: SAM3 already understands what objects look like (trained on 4M concepts). DART just adds the detection output layer without modifying the core weights.
```
Input Image → ViT-H Backbone → SAM3 Encoder-Decoder → Detection Head
                                        ↑
                                Text Embeddings
                    ("pole", "insulator", "transformer")
```
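To make the last step concrete, here is a toy, self-contained sketch of the idea behind a detection head: threshold a class-aware attention map and take the bounding box of each connected above-threshold region, scored by its peak activation. This flood-fill is an illustration only, not DART's actual learned head.

```python
from collections import deque

def attention_to_boxes(attn, threshold=0.5):
    """Toy detection head: threshold a 2D attention map (list of lists)
    and return one (x0, y0, x1, y1) box plus a score per connected region.
    Illustrative only; DART's real head is part of the SAM3 pipeline."""
    h, w = len(attn), len(attn[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if attn[sy][sx] < threshold or seen[sy][sx]:
                continue
            # BFS flood fill over the above-threshold region.
            q = deque([(sy, sx)])
            seen[sy][sx] = True
            x0 = x1 = sx
            y0 = y1 = sy
            peak = attn[sy][sx]
            while q:
                y, x = q.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                peak = max(peak, attn[y][x])
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and attn[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        q.append((ny, nx))
            # Box in (x0, y0, x1, y1) pixel convention; score = peak activation.
            boxes.append(((x0, y0, x1 + 1, y1 + 1), peak))
    return boxes
```

The real head also has to handle per-class maps, overlapping instances, and non-maximum suppression; the sketch just shows why an attention map that "lights up" on an object is already most of the way to a box.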
How Do I Install DART?
DART requires Python 3.11+, PyTorch 2.7+, and a CUDA GPU.
```bash
# 1. Create environment
conda create -n dart python=3.11 -y
conda activate dart

# 2. Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# 3. Clone and install DART
git clone https://github.com/mkturkcan/DART.git
cd DART
pip install -e .

# 4. (Optional) Install TensorRT for maximum speed
pip install tensorrt
```
The SAM3 checkpoint (~1.7 GB) auto-downloads on first run.
How Do I Run Detection?
Basic detection on a single image:
```bash
python demo_multiclass.py \
    --image photo.jpg \
    --classes person car bicycle dog \
    --fast --detection-only
```
For video with tracking:
```bash
python demo_video.py \
    --video input.mp4 \
    --classes person car bicycle \
    --checkpoint sam3.pt \
    --track \
    -o output.mp4
```
Python API for integration:
```python
from sam3.model.sam3_multiclass_fast import Sam3MultiClassPredictorFast

# Load model
predictor = Sam3MultiClassPredictorFast.from_pretrained("sam3.pt", device="cuda")

# Set target classes (open vocabulary!)
predictor.set_classes(["utility pole", "insulator", "transformer", "damaged wire"])

# Run detection
state = predictor.set_image(image)  # PIL Image
results = predictor.predict(state, confidence_threshold=0.3)

# Results: boxes, scores, class_ids, class_names
for box, score, name in zip(results['boxes'], results['scores'], results['class_names']):
    print(f"{name}: {score:.2f} at {box}")
```
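In practice you often want different confidence thresholds per class (e.g. a lower bar for a rare but critical class). Assuming the results dict of parallel lists shown above, a small post-filter is easy to write; `filter_detections` is our helper, not part of the DART API:

```python
def filter_detections(results, thresholds, default=0.3):
    """Keep detections whose score clears a per-class threshold.

    Assumes a results dict of parallel lists, as returned above:
    {'boxes': [...], 'scores': [...], 'class_names': [...]}.
    """
    kept = {'boxes': [], 'scores': [], 'class_names': []}
    for box, score, name in zip(results['boxes'],
                                results['scores'],
                                results['class_names']):
        if score >= thresholds.get(name, default):
            kept['boxes'].append(box)
            kept['scores'].append(score)
            kept['class_names'].append(name)
    return kept

# Example: accept "damaged wire" at a lower confidence than everything else.
# filtered = filter_detections(results, {'damaged wire': 0.15}, default=0.3)
```

Running the predictor with a low `confidence_threshold` and filtering afterwards keeps a single inference pass while letting each class trade precision for recall independently.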
What Speed Can I Expect?
DART offers multiple backbone options for different speed/accuracy tradeoffs:
| Model | Params | COCO AP | FPS (RTX 4080) | Use Case |
|---|---|---|---|---|
| DART (ViT-H) | 439M | 55.8 | 15.8 | Maximum accuracy |
| DART-Pruned-16 | 220M | 53.6 | 37.6 | Balanced |
| DART-RepViT | 8.2M | 38.7 | 55.8 | Real-time edge |
| DART-TinyViT | 21M | 30.1 | 57.8 | Embedded systems |
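FPS figures translate directly into a per-frame latency budget (1000 ms divided by FPS), which is often the more useful number when you are fitting detection into a larger pipeline. A quick sanity check on the table above, using its RTX 4080 figures:

```python
def frame_budget_ms(fps):
    """Per-frame latency budget in milliseconds for a given throughput."""
    return 1000.0 / fps

def meets_realtime(model_fps, target_fps=15.0):
    """True if the model's measured throughput covers the target frame rate."""
    return model_fps >= target_fps

# FPS figures quoted in the table above (RTX 4080).
models = {
    'DART (ViT-H)':   15.8,   # ~63 ms per frame
    'DART-Pruned-16': 37.6,   # ~27 ms per frame
    'DART-RepViT':    55.8,   # ~18 ms per frame
    'DART-TinyViT':   57.8,   # ~17 ms per frame
}
```

So even the full ViT-H model leaves no headroom at 30 FPS video: at ~63 ms per frame you would process roughly every other frame, while the pruned and student variants fit a 30 FPS budget (33 ms) comfortably.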
For maximum speed, build TensorRT engines:
```bash
# Build backbone engine (one-time, ~5 min)
python scripts/export_hf_backbone.py --image sample.jpg --imgsz 1008

# Build encoder-decoder engine
python -m sam3.trt.export_enc_dec --checkpoint sam3.pt \
    --output enc_dec.onnx --max-classes 4 --imgsz 1008
python -m sam3.trt.build_engine --onnx enc_dec.onnx \
    --output enc_dec_fp16.engine --fp16

# Run with TRT
python demo_multiclass.py --image photo.jpg \
    --classes person car bicycle dog \
    --trt hf_backbone_1008_fp16.engine \
    --trt-enc-dec enc_dec_fp16.engine \
    --fast --detection-only
```
How Does DART Compare to Alternatives?
| Feature | DART | YOLOv8 | Grounding DINO | OWL-ViT |
|---|---|---|---|---|
| Open vocabulary | ✅ | ❌ | ✅ | ✅ |
| Training-free | ✅ | ❌ | ❌ | ❌ |
| Real-time (>15 FPS) | ✅ | ✅ | ⚠️ | ❌ |
| COCO AP | 55.8 | ~50 | 52.5 | 34.7 |
| Video tracking | ✅ Built-in | ❌ Separate | ❌ | ❌ |
When to use DART:
- You need to detect custom classes without training
- You want real-time performance with open-vocabulary flexibility
- Video applications with tracking requirements
When to use YOLO:
- You have labeled data and fixed classes
- Maximum FPS is critical (YOLO hits 100+ FPS)
What Are Practical Use Cases?
Industrial Inspection:
```python
predictor.set_classes([
    "rust", "corrosion", "crack", "deformation",
    "missing bolt", "damaged insulation"
])
```
Utility Asset Detection:
```python
predictor.set_classes([
    "utility pole", "transmission tower", "insulator",
    "transformer", "power line", "cross arm"
])
```
Traffic Monitoring:
```python
predictor.set_classes([
    "vehicle", "pedestrian", "cyclist", "traffic sign",
    "construction cone", "road damage"
])
```
Retail/Warehouse:
```python
predictor.set_classes([
    "pallet", "forklift", "safety vest", "hard hat",
    "spill", "blocked exit"
])
```
How Do I Use Student Backbones for Faster Inference?
Pre-trained student backbones are available on HuggingFace:
```python
from sam3.distillation.sam3_student import build_sam3_student_model
import torch

# Load student model (8.2M params vs 439M for full model)
model = build_sam3_student_model(
    backbone_config="repvit_m2_3",  # Fast, lightweight
    teacher_checkpoint="sam3.pt",
    device="cuda",
)

# Load distilled weights
ckpt = torch.load("distilled/repvit_m2_3_distilled.pt", map_location="cuda")
model.backbone.student_backbone.load_state_dict(ckpt["student_state_dict"])
model.eval()
```
Available student backbones:
- `repvit_m2_3`: Best speed/accuracy balance (38.7 AP, 55.8 FPS)
- `tiny_vit_21m`: Slightly faster, lower accuracy (30.1 AP, 57.8 FPS)
- `efficientvit_l2`: Maximum speed (21.7 AP, 62.5 FPS)
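Choosing among these is a simple constrained pick: set your FPS and AP floors and take the most accurate backbone that clears both. A small sketch using the figures quoted above (the numbers and backbone names come from this post; the selector itself is our helper, not part of DART):

```python
# (AP, FPS) pairs quoted above for the distilled students on an RTX 4080.
STUDENTS = {
    'repvit_m2_3':     (38.7, 55.8),
    'tiny_vit_21m':    (30.1, 57.8),
    'efficientvit_l2': (21.7, 62.5),
}

def pick_student(min_fps=0.0, min_ap=0.0):
    """Return the highest-AP student backbone meeting both floors, or None."""
    candidates = [(ap, fps, name)
                  for name, (ap, fps) in STUDENTS.items()
                  if fps >= min_fps and ap >= min_ap]
    if not candidates:
        return None
    return max(candidates)[2]  # max AP wins; ties broken by FPS
```

For example, a 60 FPS floor forces `efficientvit_l2`, while a 50 FPS floor with at least 35 AP lands on `repvit_m2_3`. If no student clears your AP floor, fall back to the pruned or full ViT-H models from the speed table.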
Frequently Asked Questions
What is DART?
DART (Detect Anything in Real Time) converts SAM3 into a real-time multi-class object detector. It’s training-free and open-vocabulary — specify any class via text prompts.
How do I install DART?
Clone the repo, create a conda environment with Python 3.11+, install PyTorch with CUDA, then run `pip install -e .` from the DART directory.
Does DART require training?
No. The core framework works out-of-the-box with SAM3 weights. Optional student backbones are pre-trained and available on HuggingFace.
What hardware does DART need?
Any NVIDIA GPU with CUDA support. Tested on RTX 4080 (15.8 FPS at 1008px). Use smaller resolutions or student backbones for lower-end GPUs.
Can DART detect custom objects?
Yes. DART is open-vocabulary — add any class name as a text prompt. No retraining or fine-tuning needed.
How does DART compare to YOLO?
YOLO is faster (100+ FPS) but limited to fixed classes. DART is open-vocabulary with comparable accuracy (55.8 vs ~50 AP). Use DART when you need flexibility; use YOLO when you have labeled data and need maximum speed.
Does DART support video?
Yes. Use `demo_video.py` with `--track` for ByteTrack integration. Inter-frame pipelining reduces latency.
Can I run DART on edge devices?
Yes, with student backbones. DART-RepViT (8.2M params) achieves 55.8 FPS and fits on Jetson devices.
What license is DART released under?
Apache 2.0 — commercial use allowed with attribution.
How do I improve accuracy?
Use the full ViT-H backbone instead of student models, increase resolution (`--imgsz 1008`), and lower the confidence threshold for more detections.
Links: