LocateAnything: NVIDIA's 10x Faster Object Detection with Parallel Box Decoding

By Prahlad Menon | Published 2026-06-13 | 4 min read

Vision-language models have a dirty secret: they’re slow at localization. When you ask GPT-4o or Qwen3-VL to draw a bounding box, they generate each coordinate one token at a time — x1, then y1, then x2, then y2. It’s sequential, inefficient, and doesn’t match how boxes actually work geometrically.

NVIDIA’s new LocateAnything model fixes this with Parallel Box Decoding (PBD). Instead of emitting coordinates token-by-token, it predicts the entire bounding box in one atomic step. The result? 10x faster throughput than Qwen3-VL and 2.5x faster than the next-best alternative — while also improving localization accuracy.

The Problem with Autoregressive Box Generation

Current VLMs treat object detection as a text generation problem. Ask them to locate a cat, and they’ll output something like:

<box>0.25</box><box>0.30</box><box>0.75</box><box>0.80</box>

Four tokens, generated sequentially. This creates two problems:

Speed: Sequential generation is slow. Each coordinate requires a full forward pass.
Geometric incoherence: The model learns each coordinate independently, even though x1,y1,x2,y2 form a coupled geometric structure.

In dense scenes with 100+ objects, this becomes a serious bottleneck. LocateAnything achieves 12.7 boxes per second (BPS) on an H100 GPU in hybrid mode, vs. 1.1 BPS for Qwen3-VL.
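To make that gap concrete, here is a back-of-the-envelope calculation using the two throughput figures quoted above. The function name is illustrative, and the estimate assumes throughput stays constant as the scene grows, which real workloads may not satisfy.

```python
def seconds_for_scene(num_boxes: int, boxes_per_second: float) -> float:
    """Estimate decoding time for a scene, assuming constant throughput."""
    return num_boxes / boxes_per_second


# 100 objects, using the H100 figures quoted in the text
locate_anything = seconds_for_scene(100, 12.7)
qwen3_vl = seconds_for_scene(100, 1.1)

print(f"LocateAnything: {locate_anything:.1f} s")
print(f"Qwen3-VL:       {qwen3_vl:.1f} s")
print(f"Speedup:        {qwen3_vl / locate_anything:.1f}x")
```

Under these assumptions, a 100-object scene takes roughly 8 seconds instead of about a minute and a half.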

What’s In The Box: Model Architecture

LocateAnything is a 7.67 GB download consisting of three components fused together:

Component	Source	Size	License
Vision encoder	MoonViT-SO-400M	~800 MB	MIT
Language model	Qwen2.5-3B-Instruct	~6 GB	Qwen Research
MLP projector	Trained by NVIDIA	~100 MB	NVIDIA non-commercial

Files on HuggingFace:

model-00001-of-00002.safetensors  4.96 GB
model-00002-of-00002.safetensors  2.70 GB
vocab.json + merges.txt           ~5 MB

Total: 7.67 GB disk, loads to ~8 GB VRAM in BF16, or ~16 GB in FP32.
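These figures can be sanity-checked from bytes per parameter: BF16 uses 2 bytes, FP32 uses 4. The sketch below assumes the checkpoint on disk is stored in BF16 (an assumption, not something the model card is quoted as saying here) and counts weights only, ignoring activations and the KV cache. The int8 row is a hypothetical lower bound for comparison with the quantized builds.

```python
CHECKPOINT_GB = 4.96 + 2.70  # the two safetensors shards listed above

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weights_gb(dtype: str, checkpoint_gb: float = CHECKPOINT_GB) -> float:
    """Scale the checkpoint size to another dtype, assuming the checkpoint is BF16."""
    return checkpoint_gb * BYTES_PER_PARAM[dtype] / BYTES_PER_PARAM["bf16"]


for dtype in ("bf16", "fp32", "int8"):
    print(f"{dtype}: ~{weights_gb(dtype):.1f} GB of weights")
```

Real VRAM use sits above these numbers because inference also needs room for activations and cached attention state.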

There’s only one model variant right now (nvidia/LocateAnything-3B). The community has made quantized versions for Apple Silicon:

mlx-community/LocateAnything-3B-8bit — 8-bit, ~4.5 GB
mlx-community/LocateAnything-3B-4bit — mixed 4/8-bit, ~3.2 GB

How Parallel Box Decoding Works

LocateAnything treats each bounding box as an atomic unit of fixed length (6 tokens: semantic label + 4 coordinates + padding). Instead of predicting tokens one at a time, it predicts the entire block in parallel.
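A rough step count shows where the speedup comes from. This sketch assumes one forward pass per token in autoregressive decoding and one forward pass per 6-token block in parallel decoding; it ignores prompt prefill and any end-of-sequence handling. The names are illustrative, not part of the model's API.

```python
TOKENS_PER_BOX = 6  # label + 4 coordinates + padding, per the description above


def decoding_steps(num_boxes: int, parallel: bool) -> int:
    """Forward passes needed to emit num_boxes boxes under the stated assumptions."""
    return num_boxes if parallel else num_boxes * TOKENS_PER_BOX


for n in (1, 10, 100):
    print(n, decoding_steps(n, parallel=False), decoding_steps(n, parallel=True))
```

Step count is only one factor in measured throughput, so this should be read as intuition rather than a prediction of the benchmark numbers.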

Three inference modes:

Fast Mode (MTP): Full parallel decoding — maximum throughput
Slow Mode (NTP): Autoregressive fallback — maximum stability
Hybrid Mode (default): Uses fast mode, falls back to slow when format errors or spatial ambiguity detected
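The hybrid control flow can be sketched as "try the parallel path, validate, fall back." The article does not spell out the exact fallback criteria, so the corner-ordering check below is a stand-in, and `fast_decode` / `slow_decode` are hypothetical callables rather than real model methods.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[str, float, float, float, float]  # (label, x1, y1, x2, y2)


def is_well_formed(box: Optional[Box]) -> bool:
    """Accept a box only if it parsed and its corners are correctly ordered."""
    if box is None:
        return False
    _, x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


def decode_hybrid(
    fast_decode: Callable[[], Optional[Box]],
    slow_decode: Callable[[], Optional[Box]],
) -> Tuple[Optional[Box], str]:
    """Try the parallel path first; fall back to the autoregressive path."""
    box = fast_decode()
    if is_well_formed(box):
        return box, "fast"
    return slow_decode(), "slow"


# Stand-in decoders for illustration only
good = lambda: ("cat", 0.25, 0.30, 0.75, 0.80)
bad = lambda: ("cat", 0.75, 0.30, 0.25, 0.80)  # x1 > x2: degenerate box

print(decode_hybrid(good, good))
print(decode_hybrid(bad, good))
```

The point of the design is that the slow path is only paid for on the boxes that need it, so throughput stays close to fast mode when most blocks decode cleanly.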

Quick Start: Running LocateAnything Locally

The model is available on Hugging Face. Here’s the minimal setup:

pip install transformers accelerate pillow

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model_id = "nvidia/LocateAnything-3B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

image = Image.open("your_image.jpg")
prompt = "Detect all people in this image"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    generation_mode="hybrid"  # or "fast" / "slow"
)
result = processor.decode(outputs[0], skip_special_tokens=True)
print(result)

Hardware requirements: ~8GB VRAM minimum. Runs well on RTX 3090/4090, A100, H100.

Practical Use Cases

1. Dense Object Detection

LocateAnything excels at crowded scenes. On the VisDrone benchmark (aerial footage with dozens of tiny objects), it achieves 39.9 mean F1 vs. 35.8 for the previous best.

prompt = "Detect all vehicles and pedestrians"

2. GUI Element Grounding

Need to locate buttons, text fields, and icons in screenshots? LocateAnything hits 60.3 F1 on ScreenSpot-Pro, beating specialized GUI models.

prompt = "Find the search button"
# Returns precise coordinates for automated UI testing

3. Document Layout & OCR

For document understanding, it achieves 76.8 F1 on DocLayNet — tables, headers, paragraphs, all with tight bounding boxes.

prompt = "Locate all table cells in this invoice"

4. Referring Expression Grounding

Natural language queries work well:

prompt = "The person wearing a red jacket on the left side"

Docker Setup (One-Command Deploy)

The community has already built a nice wrapper:

git clone https://github.com/gammahazard/locate-anything
cd locate-anything
docker compose up

This gives you a web UI at localhost:7860 with speed/quality toggles for fast/hybrid/slow modes.

Integrating with Roboflow Supervision

LocateAnything outputs coordinates in [x1, y1, x2, y2] format. You can pipe these directly into Roboflow’s supervision library for visualization and downstream processing:

import supervision as sv
import numpy as np

# Parse LocateAnything output (example format)
boxes = np.array([[100, 150, 300, 400], [50, 80, 200, 250]])
class_ids = np.array([0, 1])
confidences = np.array([0.95, 0.88])

detections = sv.Detections(
    xyxy=boxes,
    class_id=class_ids,
    confidence=confidences
)

# Annotate
annotator = sv.BoxAnnotator()
annotated_image = annotator.annotate(scene=image.copy(), detections=detections)

This combo gives you NVIDIA’s speed with Roboflow’s visualization toolkit.
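Before the boxes reach supervision, the model's text output has to be parsed into numbers. The sketch below assumes the `<box>x1,y1,x2,y2</box>` text format that the pill-counting regex later in this article expects, and assumes coordinates are normalized to [0, 1] as in the earlier example; if the model emits pixel coordinates, pass `width=1, height=1` to skip scaling. `parse_boxes` is a hypothetical helper, not part of either library.

```python
import re
from typing import List, Tuple

NUMBER = r"(\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(rf"<box>{NUMBER},{NUMBER},{NUMBER},{NUMBER}</box>")


def parse_boxes(
    text: str, width: int, height: int
) -> List[Tuple[float, float, float, float]]:
    """Extract boxes from model text and scale normalized coordinates to pixels."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(v) for v in match.groups())
        boxes.append((x1 * width, y1 * height, x2 * width, y2 * height))
    return boxes


sample = "<box>0.25,0.30,0.75,0.80</box><box>0.10,0.10,0.20,0.40</box>"
pixel_boxes = parse_boxes(sample, width=640, height=480)
print(pixel_boxes)
```

Wrapping the result as `np.array(pixel_boxes)` gives an array of shape (N, 4) that can be passed as the `xyxy` argument shown above.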

Benchmarks at a Glance

Model	LVIS F1	COCO F1	Speed (BPS)
Qwen3-VL	45.2	49.1	1.1
Rex-Omni	47.3	50.3	5.0
LocateAnything	51.1	52.1	12.7

At high IoU thresholds (IoU=0.95), the gap widens: LocateAnything hits 31.1 F1 on LVIS vs. 20.7 for Rex-Omni. Tighter boxes = more useful detections.
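To see why a stricter IoU threshold rewards tighter boxes, here is a minimal sketch of IoU and an F1 score at a given threshold. The benchmark's exact matching protocol is not described in this article, so the greedy one-to-one matching below is an assumption, and the boxes are made-up pixel coordinates.

```python
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(preds: List[list], truths: List[list], threshold: float) -> float:
    """Greedy one-to-one matching; a prediction counts if IoU >= threshold."""
    unmatched = list(truths)
    true_positives = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(preds)
    recall = true_positives / len(truths)
    return 2 * precision * recall / (precision + recall)


truth = [[100, 100, 200, 200]]
loose = [[110, 110, 210, 210]]  # shifted by 10 px in both axes
tight = [[101, 100, 200, 200]]  # left edge off by 1 px

print(f1_at_iou(loose, truth, 0.5), f1_at_iou(loose, truth, 0.95))
print(f1_at_iou(tight, truth, 0.5), f1_at_iou(tight, truth, 0.95))
```

The loose box (IoU of about 0.68) counts as a hit at a 0.5 threshold but as a miss at 0.95, while the tight box (IoU of 0.99) passes both.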

Hardware & Memory Requirements

Here’s what you actually need to run this:

Hardware	VRAM	Speed (per image)	Notes
RTX 3090	8GB+	~1-2s	Minimum viable, use hybrid mode
RTX 4090	24GB	~0.5-1s	Comfortable, all modes work
A100 40GB	40GB	~0.3-0.5s	Production-ready
H100	80GB	~0.08s per box (12.7 BPS)	Full speed, batch inference
Apple M2/M3	16GB unified	~3-5s	Works via MPS backend
CPU only	16GB+ RAM	30-60s	Not recommended

Memory optimization tricks:

# Use BF16 (saves ~50% VRAM vs FP32)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Reduce max tokens for faster inference (inputs prepared as in the quick start)
outputs = model.generate(**inputs, max_new_tokens=2048)  # default is 8192

# For batch inference with FlashAttention (saves ~3x memory on dense scenes)
python batch_infer.py --attn la_flash --batch-size 4

The la_flash backend cuts VRAM from 35GB to 12GB on a 4K image while maintaining speed — critical for deployment.

Can It Run on Edge Devices?

Short answer: Not yet without work.

Longer answer: NVIDIA mentions “deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation.” Translation: you’d need to:

Quantize to INT8 or INT4 (not officially supported yet)
Distill to a smaller model
Use TensorRT (not yet integrated)

For edge deployment today, consider alternatives:

YOLO-World — smaller, runs on Jetson
Grounding DINO — has TensorRT builds
RF-DETR — Roboflow’s edge-optimized detector

LocateAnything is best positioned for server-side inference where you have GPU resources.

Real-World Example: Medication Counting

Here’s a practical healthcare use case — counting pills for pharmacy verification:

import re

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor, AutoTokenizer

# Matches <box>x1,y1,x2,y2</box> with integer or decimal coordinates
BOX_PATTERN = re.compile(
    r'<box>(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)</box>'
)

class MedicationCounter:
    def __init__(self, model_path="nvidia/LocateAnything-3B"):
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    def count_pills(self, image: Image.Image, description: str = "pill"):
        prompt = f"Locate all the instances that match: {description}."

        messages = [{"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]}]

        # Build the chat-formatted prompt and extract the image inputs
        text = self.processor.py_apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        images, _ = self.processor.process_vision_info(messages)
        # Move inputs to wherever device_map="auto" placed the model
        inputs = self.processor(text=[text], images=images, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            response = self.model.generate(
                pixel_values=inputs["pixel_values"].to(torch.bfloat16),
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                tokenizer=self.tokenizer,
                max_new_tokens=4096,
                generation_mode="hybrid",
                temperature=0.3,
            )

        # Each match is a tuple of four coordinate strings (x1, y1, x2, y2)
        boxes = BOX_PATTERN.findall(response[0])
        return len(boxes), boxes

# Usage
counter = MedicationCounter()
image = Image.open("pills_on_tray.jpg")
count, boxes = counter.count_pills(image, "white round tablet")
print(f"Found {count} pills")

Why this matters: Pharmacy dispensing errors are a real problem. A camera plus LocateAnything could verify pill counts before packaging, which may be faster and more consistent than manual counting for high-volume pharmacies.

Limitations

License: Non-commercial only. Fine for research, not for production products (unless you’re NVIDIA).
Model size: 3B parameters requires a decent GPU. CPU-only inference runs but is impractically slow (30-60s per image).
Edge deployment: Requires quantization/distillation work not yet officially supported.
Specialized task: This is for localization, not general chat. Use alongside a general VLM.

The Bottom Line

If you’re building computer vision pipelines and localization speed matters, LocateAnything is a significant upgrade. The parallel decoding approach is elegant (treat boxes as atomic units, not serialized tokens), and the reported benchmarks support it.

Try it: HuggingFace Demo | GitHub | Paper | Pill Counter Notebook

LocateAnything is part of NVIDIA’s Eagle VLM family and has been integrated into Nemotron and Cosmos for production grounding capabilities.