DART: Turn SAM3 Into a Real-Time Object Detector — No Training Required
TL;DR: DART converts SAM3 into a real-time object detector without any training. Describe what you want to find (“poles”, “insulators”, “damaged equipment”), and DART detects them at 15+ FPS. Install with `pip install -e .` after cloning the repo. TensorRT engines available for maximum speed.
You have SAM3 for segmentation. But what if you need detection — bounding boxes, class labels, confidence scores — in real-time? Training a custom detector means collecting data, labeling thousands of images, and waiting hours for training.
DART skips all of that. It’s a training-free framework that repurposes SAM3’s architecture for multi-class detection. Tell it what to look for, and it finds it.
What Problem Does DART Solve?
DART solves the “I need a detector but don’t have labeled data” problem. Traditional object detection requires:
- Thousands of labeled images per class
- Hours of training time
- Retraining when you add new classes
DART eliminates all three. It’s open-vocabulary — you specify classes at runtime via text prompts. Need to detect utility poles today and solar panels tomorrow? Just change the text.
How Does DART Work Under the Hood?
DART repurposes SAM3’s architecture for detection:
- Text encoder generates embeddings for your target classes
- Image backbone (ViT-H/14) extracts features from the input
- Encoder-decoder produces class-aware attention maps
- Detection head converts attention maps to bounding boxes + scores
The key insight: SAM3 already understands what objects look like (trained on 4M concepts). DART just adds the detection output layer without modifying the core weights.
```
Input Image → ViT-H Backbone → SAM3 Encoder-Decoder → Detection Head
                                        ↑
                                Text Embeddings
                    ("pole", "insulator", "transformer")
```
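To make the last step concrete, here is a toy, self-contained sketch of the idea behind a detection head: threshold a class-aware attention map and take the bounding box of each connected above-threshold region, scored by its peak activation. This flood-fill is an illustration only, not DART's actual learned head.

```python
from collections import deque

def attention_to_boxes(attn, threshold=0.5):
    """Toy detection head: threshold a 2D attention map (list of lists)
    and return one (x0, y0, x1, y1) box plus a score per connected region.
    Illustrative only; DART's real head is part of the SAM3 pipeline."""
    h, w = len(attn), len(attn[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if attn[sy][sx] < threshold or seen[sy][sx]:
                continue
            # BFS flood fill over the above-threshold region.
            q = deque([(sy, sx)])
            seen[sy][sx] = True
            x0 = x1 = sx
            y0 = y1 = sy
            peak = attn[sy][sx]
            while q:
                y, x = q.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                peak = max(peak, attn[y][x])
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and attn[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        q.append((ny, nx))
            # Box in (x0, y0, x1, y1) pixel convention; score = peak activation.
            boxes.append(((x0, y0, x1 + 1, y1 + 1), peak))
    return boxes
```

The real head also has to handle per-class maps, overlapping instances, and non-maximum suppression; the sketch just shows why an attention map that "lights up" on an object is already most of the way to a box.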
How Do I Install DART?
DART requires Python 3.11+, PyTorch 2.7+, and a CUDA GPU.
```bash
# 1. Create environment
conda create -n dart python=3.11 -y
conda activate dart

# 2. Install PyTorch with CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

# 3. Clone and install DART
git clone https://github.com/mkturkcan/DART.git
cd DART
pip install -e .

# 4. (Optional) Install TensorRT for maximum speed
pip install tensorrt
```
The SAM3 checkpoint (~1.7 GB) auto-downloads on first run.
How Do I Run Detection?
Basic detection on a single image:
```bash
python demo_multiclass.py \
    --image photo.jpg \
    --classes person car bicycle dog \
    --fast --detection-only
```
For video with tracking:
```bash
python demo_video.py \
    --video input.mp4 \
    --classes person car bicycle \
    --checkpoint sam3.pt \
    --track \
    -o output.mp4
```
Python API for integration:
```python
from sam3.model.sam3_multiclass_fast import Sam3MultiClassPredictorFast

# Load model
predictor = Sam3MultiClassPredictorFast.from_pretrained("sam3.pt", device="cuda")

# Set target classes (open vocabulary!)
predictor.set_classes(["utility pole", "insulator", "transformer", "damaged wire"])

# Run detection
state = predictor.set_image(image)  # PIL Image
results = predictor.predict(state, confidence_threshold=0.3)

# Results: boxes, scores, class_ids, class_names
for box, score, name in zip(results['boxes'], results['scores'], results['class_names']):
    print(f"{name}: {score:.2f} at {box}")
```
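In practice you often want different confidence thresholds per class (e.g. a lower bar for a rare but critical class). Assuming the results dict of parallel lists shown above, a small post-filter is easy to write; `filter_detections` is our helper, not part of the DART API:

```python
def filter_detections(results, thresholds, default=0.3):
    """Keep detections whose score clears a per-class threshold.

    Assumes a results dict of parallel lists, as returned above:
    {'boxes': [...], 'scores': [...], 'class_names': [...]}.
    """
    kept = {'boxes': [], 'scores': [], 'class_names': []}
    for box, score, name in zip(results['boxes'],
                                results['scores'],
                                results['class_names']):
        if score >= thresholds.get(name, default):
            kept['boxes'].append(box)
            kept['scores'].append(score)
            kept['class_names'].append(name)
    return kept

# Example: accept "damaged wire" at a lower confidence than everything else.
# filtered = filter_detections(results, {'damaged wire': 0.15}, default=0.3)
```

Running the predictor with a low `confidence_threshold` and filtering afterwards keeps a single inference pass while letting each class trade precision for recall independently.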
What Speed Can I Expect?
DART offers multiple backbone options for different speed/accuracy tradeoffs:
| Model | Params | COCO AP | FPS (RTX 4080) | Use Case |
|---|---|---|---|---|
| DART (ViT-H) | 439M | 55.8 | 15.8 | Maximum accuracy |
| DART-Pruned-16 | 220M | 53.6 | 37.6 | Balanced |
| DART-RepViT | 8.2M | 38.7 | 55.8 | Real-time edge |
| DART-TinyViT | 21M | 30.1 | 57.8 | Embedded systems |
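FPS figures translate directly into a per-frame latency budget (1000 ms divided by FPS), which is often the more useful number when you are fitting detection into a larger pipeline. A quick sanity check on the table above, using its RTX 4080 figures:

```python
def frame_budget_ms(fps):
    """Per-frame latency budget in milliseconds for a given throughput."""
    return 1000.0 / fps

def meets_realtime(model_fps, target_fps=15.0):
    """True if the model's measured throughput covers the target frame rate."""
    return model_fps >= target_fps

# FPS figures quoted in the table above (RTX 4080).
models = {
    'DART (ViT-H)':   15.8,   # ~63 ms per frame
    'DART-Pruned-16': 37.6,   # ~27 ms per frame
    'DART-RepViT':    55.8,   # ~18 ms per frame
    'DART-TinyViT':   57.8,   # ~17 ms per frame
}
```

So even the full ViT-H model leaves no headroom at 30 FPS video: at ~63 ms per frame you would process roughly every other frame, while the pruned and student variants fit a 30 FPS budget (33 ms) comfortably.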
For maximum speed, build TensorRT engines:
```bash
# Build backbone engine (one-time, ~5 min)
python scripts/export_hf_backbone.py --image sample.jpg --imgsz 1008

# Build encoder-decoder engine
python -m sam3.trt.export_enc_dec --checkpoint sam3.pt \
    --output enc_dec.onnx --max-classes 4 --imgsz 1008
python -m sam3.trt.build_engine --onnx enc_dec.onnx \
    --output enc_dec_fp16.engine --fp16

# Run with TRT
python demo_multiclass.py --image photo.jpg \
    --classes person car bicycle dog \
    --trt hf_backbone_1008_fp16.engine \
    --trt-enc-dec enc_dec_fp16.engine \
    --fast --detection-only
```
How Does DART Compare to Alternatives?
| Feature | DART | YOLOv8 | Grounding DINO | OWL-ViT |
|---|---|---|---|---|
| Open vocabulary | ✅ | ❌ | ✅ | ✅ |
| Training-free | ✅ | ❌ | ❌ | ❌ |
| Real-time (>15 FPS) | ✅ | ✅ | ⚠️ | ❌ |
| COCO AP | 55.8 | ~50 | 52.5 | 34.7 |
| Video tracking | ✅ Built-in | ❌ Separate | ❌ | ❌ |
When to use DART:
- You need to detect custom classes without training
- You want real-time performance with open-vocabulary flexibility
- Video applications with tracking requirements
When to use YOLO:
- You have labeled data and fixed classes
- Maximum FPS is critical (YOLO hits 100+ FPS)
What Are Practical Use Cases?
Industrial Inspection:
```python
predictor.set_classes([
    "rust", "corrosion", "crack", "deformation",
    "missing bolt", "damaged insulation"
])
```
Utility Asset Detection:
```python
predictor.set_classes([
    "utility pole", "transmission tower", "insulator",
    "transformer", "power line", "cross arm"
])
```
Traffic Monitoring:
```python
predictor.set_classes([
    "vehicle", "pedestrian", "cyclist", "traffic sign",
    "construction cone", "road damage"
])
```
Retail/Warehouse:
```python
predictor.set_classes([
    "pallet", "forklift", "safety vest", "hard hat",
    "spill", "blocked exit"
])
```
How Do I Use Student Backbones for Faster Inference?
Pre-trained student backbones are available on HuggingFace:
```python
from sam3.distillation.sam3_student import build_sam3_student_model
import torch

# Load student model (8.2M params vs 439M for full model)
model = build_sam3_student_model(
    backbone_config="repvit_m2_3",  # Fast, lightweight
    teacher_checkpoint="sam3.pt",
    device="cuda",
)

# Load distilled weights
ckpt = torch.load("distilled/repvit_m2_3_distilled.pt", map_location="cuda")
model.backbone.student_backbone.load_state_dict(ckpt["student_state_dict"])
model.eval()
```
Available student backbones:
- `repvit_m2_3`: Best speed/accuracy balance (38.7 AP, 55.8 FPS)
- `tiny_vit_21m`: Slightly faster, lower accuracy (30.1 AP, 57.8 FPS)
- `efficientvit_l2`: Maximum speed (21.7 AP, 62.5 FPS)
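Choosing among these is a simple constrained pick: set your FPS and AP floors and take the most accurate backbone that clears both. A small sketch using the figures quoted above (the numbers and backbone names come from this post; the selector itself is our helper, not part of DART):

```python
# (AP, FPS) pairs quoted above for the distilled students on an RTX 4080.
STUDENTS = {
    'repvit_m2_3':     (38.7, 55.8),
    'tiny_vit_21m':    (30.1, 57.8),
    'efficientvit_l2': (21.7, 62.5),
}

def pick_student(min_fps=0.0, min_ap=0.0):
    """Return the highest-AP student backbone meeting both floors, or None."""
    candidates = [(ap, fps, name)
                  for name, (ap, fps) in STUDENTS.items()
                  if fps >= min_fps and ap >= min_ap]
    if not candidates:
        return None
    return max(candidates)[2]  # max AP wins; ties broken by FPS
```

For example, a 60 FPS floor forces `efficientvit_l2`, while a 50 FPS floor with at least 35 AP lands on `repvit_m2_3`. If no student clears your AP floor, fall back to the pruned or full ViT-H models from the speed table.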
Frequently Asked Questions
What is DART?
DART (Detect Anything in Real Time) converts SAM3 into a real-time multi-class object detector. It’s training-free and open-vocabulary — specify any class via text prompts.
How do I install DART?
Clone the repo, create a conda environment with Python 3.11+, install PyTorch with CUDA, then run `pip install -e .` from the DART directory.
Does DART require training?
No. The core framework works out-of-the-box with SAM3 weights. Optional student backbones are pre-trained and available on HuggingFace.
What hardware does DART need?
Any NVIDIA GPU with CUDA support. Tested on RTX 4080 (15.8 FPS at 1008px). Use smaller resolutions or student backbones for lower-end GPUs.
Can DART detect custom objects?
Yes. DART is open-vocabulary — add any class name as a text prompt. No retraining or fine-tuning needed.
How does DART compare to YOLO?
YOLO is faster (100+ FPS) but limited to fixed classes. DART is open-vocabulary with comparable accuracy (55.8 vs ~50 AP). Use DART when you need flexibility; use YOLO when you have labeled data and need maximum speed.
Does DART support video?
Yes. Use `demo_video.py` with `--track` for ByteTrack integration. Inter-frame pipelining reduces latency.
Can I run DART on edge devices?
Yes, with student backbones. DART-RepViT (8.2M params) achieves 55.8 FPS and fits on Jetson devices.
What license is DART released under?
Apache 2.0 — commercial use allowed with attribution.
How do I improve accuracy?
Use the full ViT-H backbone instead of student models, increase resolution (`--imgsz 1008`), and lower the confidence threshold for more detections.
Links: