LocateAnything: NVIDIA's 10x Faster Object Detection with Parallel Box Decoding
Vision-language models have a dirty secret: they’re slow at localization. When you ask GPT-4o or Qwen3-VL to draw a bounding box, they generate each coordinate one token at a time — x1, then y1, then x2, then y2. It’s sequential, inefficient, and doesn’t match how boxes actually work geometrically.
NVIDIA’s new LocateAnything model fixes this with Parallel Box Decoding (PBD). Instead of emitting coordinates token-by-token, it predicts the entire bounding box in one atomic step. The result? 10x faster throughput than Qwen3-VL and 2.5x faster than the next-best alternative — while also improving localization accuracy.
The Problem with Autoregressive Box Generation
Current VLMs treat object detection as a text generation problem. Ask them to locate a cat, and they’ll output something like:
<box>0.25</box><box>0.30</box><box>0.75</box><box>0.80</box>
Four coordinates, generated one after another. This creates two problems:
- Speed: Sequential generation is slow. Each coordinate requires at least one full forward pass.
- Geometric incoherence: The model learns each coordinate independently, even though x1, y1, x2, y2 form a coupled geometric structure.
In dense scenes with 100+ objects, this becomes a serious bottleneck. LocateAnything achieves 12.7 boxes per second on an H100 GPU in hybrid mode, vs. 1.1 BPS for Qwen3-VL.
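To put those throughput figures in perspective, the arithmetic below converts boxes per second into decode time for a dense scene. It is a back-of-the-envelope sketch: it assumes throughput stays constant as the number of boxes grows, which real workloads may not honor, and the function name is purely illustrative.

```python
def decode_seconds(n_boxes: int, boxes_per_second: float) -> float:
    """Time to emit n_boxes at a constant throughput (an idealization)."""
    if boxes_per_second <= 0:
        raise ValueError("boxes_per_second must be positive")
    return n_boxes / boxes_per_second


# BPS figures quoted in the text for an H100
LOCATE_ANYTHING_BPS = 12.7
QWEN3_VL_BPS = 1.1

for name, bps in [("LocateAnything", LOCATE_ANYTHING_BPS), ("Qwen3-VL", QWEN3_VL_BPS)]:
    print(f"{name}: {decode_seconds(100, bps):.1f} s for 100 boxes")
```

Under that assumption, a 100-object scene takes roughly 8 seconds of box decoding at 12.7 BPS versus about 91 seconds at 1.1 BPS.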
What’s In The Box: Model Architecture
LocateAnything is a 7.67 GB download consisting of three components fused together:
| Component | Source | Size | License |
|---|---|---|---|
| Vision encoder | MoonViT-SO-400M | ~800 MB | MIT |
| Language model | Qwen2.5-3B-Instruct | ~6 GB | Qwen Research |
| MLP projector | Trained by NVIDIA | ~100 MB | NVIDIA non-commercial |
Files on HuggingFace:
model-00001-of-00002.safetensors 4.96 GB
model-00002-of-00002.safetensors 2.70 GB
vocab.json + merges.txt ~5 MB
Total: 7.67 GB disk, loads to ~8 GB VRAM in BF16, or ~16 GB in FP32.
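The BF16 and FP32 figures follow from bytes per parameter. Here is a rough sketch, assuming the 7.67 GB checkpoint stores its weights in BF16 (2 bytes each) and ignoring activations, KV cache, and framework overhead. The helper name is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(checkpoint_gb: float, stored_dtype: str, target_dtype: str) -> float:
    """Rescale a checkpoint size by the bytes-per-parameter ratio of two dtypes."""
    return checkpoint_gb * BYTES_PER_PARAM[target_dtype] / BYTES_PER_PARAM[stored_dtype]


# Assumption: the published safetensors files are stored in BF16
print(f"BF16 weights: {weight_gb(7.67, 'bf16', 'bf16'):.2f} GB")
print(f"FP32 weights: {weight_gb(7.67, 'bf16', 'fp32'):.2f} GB")
```

The weights alone come to about 7.7 GB in BF16 and 15.3 GB in FP32, which lines up with the ~8 GB and ~16 GB figures once runtime overhead is added.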
There’s only one model variant right now (nvidia/LocateAnything-3B). The community has made quantized versions for Apple Silicon:
- mlx-community/LocateAnything-3B-8bit: 8-bit, ~4.5 GB
- mlx-community/LocateAnything-3B-4bit: mixed 4/8-bit, ~3.2 GB
How Parallel Box Decoding Works
LocateAnything treats each bounding box as an atomic unit of fixed length (6 tokens: semantic label + 4 coordinates + padding). Instead of predicting tokens one at a time, it predicts the entire block in parallel.
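An idealized step count makes the difference concrete. The sketch below assumes one forward pass per token in the autoregressive case and one forward pass per 6-token block in the parallel case. It ignores prompt processing and any fallback to slow decoding, and the function names are illustrative rather than part of the model's API.

```python
TOKENS_PER_BOX = 6  # label + 4 coordinates + padding, per the description above


def sequential_steps(n_boxes: int, tokens_per_box: int = TOKENS_PER_BOX) -> int:
    """Forward passes when every token is decoded one at a time."""
    return n_boxes * tokens_per_box


def parallel_steps(n_boxes: int) -> int:
    """Forward passes when each box block is predicted in a single step."""
    return n_boxes


n = 100
print(f"sequential: {sequential_steps(n)} passes, parallel: {parallel_steps(n)} passes")
```

For 100 boxes this gives 600 passes versus 100, a 6x reduction in decoding steps under these simplifying assumptions. Measured throughput depends on much more than step count, so treat this as intuition rather than a prediction.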
Three inference modes:
- Fast Mode (MTP): Full parallel decoding — maximum throughput
- Slow Mode (NTP): Autoregressive fallback — maximum stability
- Hybrid Mode (default): Uses fast mode and falls back to slow mode when format errors or spatial ambiguity are detected
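The hybrid strategy can be pictured as a try-then-fall-back loop. The sketch below is not NVIDIA's implementation: the decoders are stubs, the validity check only covers malformed boxes (it does not model spatial ambiguity), and every name in it is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]


def is_valid(box: Optional[Box]) -> bool:
    """A box is well-formed if it has four coordinates with x1 < x2 and y1 < y2."""
    if box is None or len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


def hybrid_decode(
    n_boxes: int,
    fast: Callable[[int], Optional[Box]],
    slow: Callable[[int], Box],
) -> Tuple[List[Box], int]:
    """Try the fast decoder for each box; use the slow one when the result is invalid.

    Returns the decoded boxes and the number of fallbacks taken.
    """
    boxes: List[Box] = []
    fallbacks = 0
    for i in range(n_boxes):
        box = fast(i)
        if not is_valid(box):
            box = slow(i)
            fallbacks += 1
        boxes.append(box)
    return boxes, fallbacks


# Stub decoders standing in for the real model (hypothetical)
def fake_fast(i: int) -> Optional[Box]:
    return None if i == 1 else (0.1, 0.1, 0.5, 0.5)


def fake_slow(i: int) -> Box:
    return (0.2, 0.2, 0.6, 0.6)


boxes, fallbacks = hybrid_decode(3, fake_fast, fake_slow)
print(boxes, fallbacks)
```

The point of the design is that the slow path is paid only for the boxes that need it, so throughput stays close to fast mode when most outputs are well-formed.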
Quick Start: Running LocateAnything Locally
The model is available on Hugging Face. Here’s the minimal setup:
pip install transformers accelerate pillow
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model_id = "nvidia/LocateAnything-3B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

image = Image.open("your_image.jpg")
prompt = "Detect all people in this image"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    generation_mode="hybrid"  # or "fast" / "slow"
)
result = processor.decode(outputs[0], skip_special_tokens=True)
print(result)
Hardware requirements: ~8GB VRAM minimum. Runs well on RTX 3090/4090, A100, H100.
Practical Use Cases
1. Dense Object Detection
LocateAnything excels at crowded scenes. On the VisDrone benchmark (aerial footage with dozens of tiny objects), it achieves 39.9 mean F1 vs. 35.8 for the previous best.
prompt = "Detect all vehicles and pedestrians"
2. GUI Element Grounding
Need to locate buttons, text fields, and icons in screenshots? LocateAnything hits 60.3 F1 on ScreenSpot-Pro, beating specialized GUI models.
prompt = "Find the search button"
# Returns precise coordinates for automated UI testing
3. Document Layout & OCR
For document understanding, it achieves 76.8 F1 on DocLayNet — tables, headers, paragraphs, all with tight bounding boxes.
prompt = "Locate all table cells in this invoice"
4. Referring Expression Grounding
Natural language queries work well:
prompt = "The person wearing a red jacket on the left side"
Docker Setup (One-Command Deploy)
The community has already built a nice wrapper:
git clone https://github.com/gammahazard/locate-anything
cd locate-anything
docker compose up
This gives you a web UI at localhost:7860 with speed/quality toggles for fast/hybrid/slow modes.
Integrating with Roboflow Supervision
LocateAnything outputs coordinates in [x1, y1, x2, y2] format. You can pipe these directly into Roboflow’s supervision library for visualization and downstream processing:
import supervision as sv
import numpy as np

# Parse LocateAnything output (example format)
boxes = np.array([[100, 150, 300, 400], [50, 80, 200, 250]])
class_ids = np.array([0, 1])
confidences = np.array([0.95, 0.88])

detections = sv.Detections(
    xyxy=boxes,
    class_id=class_ids,
    confidence=confidences
)

# Annotate
annotator = sv.BoxAnnotator()
annotated_image = annotator.annotate(scene=image.copy(), detections=detections)
This combo gives you NVIDIA’s speed with Roboflow’s visualization toolkit.
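The hard-coded arrays in the snippet stand in for parsed model output. A minimal parsing sketch follows, assuming the model emits boxes as `<box>x1,y1,x2,y2</box>` with coordinates normalized to [0, 1]. Both the tag layout and the normalization are assumptions, so check them against real output before relying on this. The `parse_boxes` helper is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed output format: <box>x1,y1,x2,y2</box>
BOX_RE = re.compile(
    r"<box>\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*,"
    r"\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*</box>"
)


def parse_boxes(text: str, width: int, height: int) -> List[Tuple[float, float, float, float]]:
    """Extract boxes from model text and scale normalized coordinates to pixels."""
    boxes = []
    for x1, y1, x2, y2 in BOX_RE.findall(text):
        boxes.append((
            float(x1) * width, float(y1) * height,
            float(x2) * width, float(y2) * height,
        ))
    return boxes


sample = "<box>0.25,0.30,0.75,0.80</box><box>0.1,0.2,0.3,0.4</box>"
pixel_boxes = parse_boxes(sample, width=640, height=480)
print(pixel_boxes)
```

Wrapping the result in `np.array(pixel_boxes)` yields an (N, 4) array suitable for the `xyxy` argument of `sv.Detections`. If the model already returns pixel coordinates, pass `width=1` and `height=1` to skip the scaling.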
Benchmarks at a Glance
| Model | LVIS F1 | COCO F1 | Speed (BPS) |
|---|---|---|---|
| Qwen3-VL | 45.2 | 49.1 | 1.1 |
| Rex-Omni | 47.3 | 50.3 | 5.0 |
| LocateAnything | 51.1 | 52.1 | 12.7 |
At high IoU thresholds (IoU=0.95), the gap widens: LocateAnything hits 31.1 F1 on LVIS vs. 20.7 for Rex-Omni. Tighter boxes = more useful detections.
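To see why IoU=0.95 is such a demanding threshold, here is the standard intersection-over-union computation for two [x1, y1, x2, y2] boxes. The example boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get zero intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = [100, 100, 200, 200]
shifted = [102, 102, 202, 202]  # prediction off by 2 px in both x and y
print(f"IoU: {iou(truth, shifted):.3f}")
```

A 2-pixel offset on a 100-pixel box already drops IoU to about 0.92, so that prediction would count as a miss at the 0.95 threshold.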
Hardware & Memory Requirements
Here’s what you actually need to run this:
| Hardware | VRAM | Speed (per image) | Notes |
|---|---|---|---|
| RTX 3090 | 8GB+ | ~1-2s | Minimum viable, use hybrid mode |
| RTX 4090 | 24GB | ~0.5-1s | Comfortable, all modes work |
| A100 40GB | 40GB | ~0.3-0.5s | Production-ready |
| H100 | 80GB | ~0.08s per box (12.7 BPS) | Full speed, batch inference |
| Apple M2/M3 | 16GB unified | ~3-5s | Works via MPS backend |
| CPU only | 16GB+ RAM | 30-60s | Not recommended |
Memory optimization tricks:
# Use BF16 (saves ~50% VRAM vs FP32)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, ...)
# Reduce max tokens for faster inference
outputs = model.generate(..., max_new_tokens=2048) # default is 8192
# For batch inference with FlashAttention (saves ~3x memory on dense scenes)
python batch_infer.py --attn la_flash --batch-size 4
The la_flash backend cuts VRAM from 35GB to 12GB on a 4K image while maintaining speed — critical for deployment.
Can It Run on Edge Devices?
Short answer: Not yet without work.
Longer answer: NVIDIA mentions “deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation.” Translation: you’d need to:
- Quantize to INT8 or INT4 (not officially supported yet)
- Distill to a smaller model
- Use TensorRT (not yet integrated)
For edge deployment today, consider alternatives:
- YOLO-World — smaller, runs on Jetson
- Grounding DINO — has TensorRT builds
- RF-DETR — Roboflow’s edge-optimized detector
LocateAnything is best positioned for server-side inference where you have GPU resources.
Real-World Example: Medication Counting
Here’s a practical healthcare use case — counting pills for pharmacy verification:
import re

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, AutoTokenizer

# Matches <box>x1,y1,x2,y2</box>; adjust if the model's output format differs
BOX_PATTERN = re.compile(
    r'<box>(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)</box>'
)


class MedicationCounter:
    def __init__(self, model_path="nvidia/LocateAnything-3B"):
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    def count_pills(self, image: Image.Image, description: str = "pill"):
        prompt = f"Locate all the instances that match: {description}."
        messages = [{"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]}]
        text = self.processor.py_apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        images, _ = self.processor.process_vision_info(messages)
        # Send inputs to wherever device_map placed the model instead of hard-coding "cuda"
        inputs = self.processor(text=[text], images=images, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            response = self.model.generate(
                pixel_values=inputs["pixel_values"].to(torch.bfloat16),
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                tokenizer=self.tokenizer,
                max_new_tokens=4096,
                generation_mode="hybrid",
                temperature=0.3,
            )
        # Parse boxes from the first response; each match is an (x1, y1, x2, y2) tuple of strings
        boxes = BOX_PATTERN.findall(response[0])
        return len(boxes), boxes


# Usage
counter = MedicationCounter()
image = Image.open("pills_on_tray.jpg")
count, boxes = counter.count_pills(image, "white round tablet")
print(f"Found {count} pills")
Why this matters: Pharmacy dispensing errors are a real problem. A camera + LocateAnything can verify pill counts before packaging — faster and more reliable than manual counting for high-volume pharmacies.
Limitations
- License: Non-commercial only. Fine for research, not for production products (unless you’re NVIDIA).
- Model size: 3B parameters requires a decent GPU. CPU-only inference runs, but at 30-60s per image it is not a practical path.
- Edge deployment: Requires quantization/distillation work not yet officially supported.
- Specialized task: This is for localization, not general chat. Use alongside a general VLM.
The Bottom Line
If you're building computer vision pipelines and localization speed matters, LocateAnything is a significant upgrade. The parallel decoding approach is elegant: treat boxes as atomic units, not serialized tokens. The reported benchmarks back it up on both speed and accuracy.
Try it: HuggingFace Demo | GitHub | Paper | Pill Counter Notebook
LocateAnything is part of NVIDIA’s Eagle VLM family and has been integrated into Nemotron and Cosmos for production grounding capabilities.