LocateAnything: NVIDIA's 10x Faster Object Detection with Parallel Box Decoding

By Prahlad Menon 4 min read

Vision-language models have a dirty secret: they’re slow at localization. When you ask GPT-4o or Qwen3-VL to draw a bounding box, they generate each coordinate one token at a time — x1, then y1, then x2, then y2. It’s sequential, inefficient, and doesn’t match how boxes actually work geometrically.

NVIDIA’s new LocateAnything model fixes this with Parallel Box Decoding (PBD). Instead of emitting coordinates token-by-token, it predicts the entire bounding box in one atomic step. The result? 10x faster throughput than Qwen3-VL and 2.5x faster than the next-best alternative — while also improving localization accuracy.

The Problem with Autoregressive Box Generation

Current VLMs treat object detection as a text generation problem. Ask them to locate a cat, and they’ll output something like:

<box>0.25</box><box>0.30</box><box>0.75</box><box>0.80</box>

Four tokens, generated sequentially. This creates two problems:

  1. Speed: Sequential generation is slow. Each coordinate requires a full forward pass.
  2. Geometric incoherence: The model learns each coordinate independently, even though x1,y1,x2,y2 form a coupled geometric structure.

In dense scenes with 100+ objects, this becomes a serious bottleneck. LocateAnything achieves 12.7 boxes per second on an H100 GPU in hybrid mode, vs. 1.1 BPS for Qwen3-VL.

What’s In The Box: Model Architecture

LocateAnything is a 7.67 GB download consisting of two components fused together:

ComponentSourceSizeLicense
Vision encoderMoonViT-SO-400M~800 MBMIT
Language modelQwen2.5-3B-Instruct~6 GBQwen Research
MLP projectorTrained by NVIDIA~100 MBNVIDIA non-commercial

Files on HuggingFace:

model-00001-of-00002.safetensors  4.96 GB
model-00002-of-00002.safetensors  2.70 GB
vocab.json + merges.txt           ~5 MB

Total: 7.67 GB disk, loads to ~8 GB VRAM in BF16, or ~16 GB in FP32.

There’s only one model variant right now (nvidia/LocateAnything-3B). The community has made quantized versions for Apple Silicon:

  • mlx-community/LocateAnything-3B-8bit — 8-bit, ~4.5 GB
  • mlx-community/LocateAnything-3B-4bit — mixed 4/8-bit, ~3.2 GB

How Parallel Box Decoding Works

LocateAnything treats each bounding box as an atomic unit of fixed length (6 tokens: semantic label + 4 coordinates + padding). Instead of predicting tokens one at a time, it predicts the entire block in parallel.

Three inference modes:

  • Fast Mode (MTP): Full parallel decoding — maximum throughput
  • Slow Mode (NTP): Autoregressive fallback — maximum stability
  • Hybrid Mode (default): Uses fast mode, falls back to slow when format errors or spatial ambiguity detected

Quick Start: Running LocateAnything Locally

The model is available on Hugging Face. Here’s the minimal setup:

pip install transformers accelerate pillow
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

model_id = "nvidia/LocateAnything-3B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

image = Image.open("your_image.jpg")
prompt = "Detect all people in this image"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    generation_mode="hybrid"  # or "fast" / "slow"
)
result = processor.decode(outputs[0], skip_special_tokens=True)
print(result)

Hardware requirements: ~8GB VRAM minimum. Runs well on RTX 3090/4090, A100, H100.

Practical Use Cases

1. Dense Object Detection

LocateAnything excels at crowded scenes. On the VisDrone benchmark (aerial footage with dozens of tiny objects), it achieves 39.9 mean F1 vs. 35.8 for the previous best.

prompt = "Detect all vehicles and pedestrians"

2. GUI Element Grounding

Need to locate buttons, text fields, and icons in screenshots? LocateAnything hits 60.3 F1 on ScreenSpot-Pro, beating specialized GUI models.

prompt = "Find the search button"
# Returns precise coordinates for automated UI testing

3. Document Layout & OCR

For document understanding, it achieves 76.8 F1 on DocLayNet — tables, headers, paragraphs, all with tight bounding boxes.

prompt = "Locate all table cells in this invoice"

4. Referring Expression Grounding

Natural language queries work well:

prompt = "The person wearing a red jacket on the left side"

Docker Setup (One-Command Deploy)

The community has already built a nice wrapper:

git clone https://github.com/gammahazard/locate-anything
cd locate-anything
docker compose up

This gives you a web UI at localhost:7860 with speed/quality toggles for fast/hybrid/slow modes.

Integrating with Roboflow Supervision

LocateAnything outputs coordinates in [x1, y1, x2, y2] format. You can pipe these directly into Roboflow’s supervision library for visualization and downstream processing:

import supervision as sv
import numpy as np

# Parse LocateAnything output (example format)
boxes = np.array([[100, 150, 300, 400], [50, 80, 200, 250]])
class_ids = np.array([0, 1])
confidences = np.array([0.95, 0.88])

detections = sv.Detections(
    xyxy=boxes,
    class_id=class_ids,
    confidence=confidences
)

# Annotate
annotator = sv.BoxAnnotator()
annotated_image = annotator.annotate(scene=image.copy(), detections=detections)

This combo gives you NVIDIA’s speed with Roboflow’s visualization toolkit.

Benchmarks at a Glance

ModelLVIS F1COCO F1Speed (BPS)
Qwen3-VL45.249.11.1
Rex-Omni47.350.35.0
LocateAnything51.152.112.7

At high IoU thresholds (IoU=0.95), the gap widens: LocateAnything hits 31.1 F1 on LVIS vs. 20.7 for Rex-Omni. Tighter boxes = more useful detections.

Hardware & Memory Requirements

Here’s what you actually need to run this:

HardwareVRAMSpeed (per image)Notes
RTX 30908GB+~1-2sMinimum viable, use hybrid mode
RTX 409024GB~0.5-1sComfortable, all modes work
A100 40GB40GB~0.3-0.5sProduction-ready
H10080GB~0.08s (12.7 BPS)Full speed, batch inference
Apple M2/M316GB unified~3-5sWorks via MPS backend
CPU only16GB+ RAM30-60sNot recommended

Memory optimization tricks:

# Use BF16 (saves ~50% VRAM vs FP32)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, ...)

# Reduce max tokens for faster inference
outputs = model.generate(..., max_new_tokens=2048)  # default is 8192

# For batch inference with FlashAttention (saves ~3x memory on dense scenes)
python batch_infer.py --attn la_flash --batch-size 4

The la_flash backend cuts VRAM from 35GB to 12GB on a 4K image while maintaining speed — critical for deployment.

Can It Run on Edge Devices?

Short answer: Not yet without work.

Longer answer: NVIDIA mentions “deployment on embedded platforms such as NVIDIA Thor is possible with additional model optimization, including quantization, compression, or distillation.” Translation: you’d need to:

  1. Quantize to INT8 or INT4 (not officially supported yet)
  2. Distill to a smaller model
  3. Use TensorRT (not yet integrated)

For edge deployment today, consider alternatives:

  • YOLO-World — smaller, runs on Jetson
  • Grounding DINO — has TensorRT builds
  • RF-DETR — Roboflow’s edge-optimized detector

LocateAnything is best positioned for server-side inference where you have GPU resources.

Real-World Example: Medication Counting

Here’s a practical healthcare use case — counting pills for pharmacy verification:

from transformers import AutoModel, AutoTokenizer, AutoProcessor
from PIL import Image
import torch

class MedicationCounter:
    def __init__(self, model_path="nvidia/LocateAnything-3B"):
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map="auto",
            trust_remote_code=True,
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    
    def count_pills(self, image: Image.Image, description: str = "pill"):
        prompt = f"Locate all the instances that match: {description}."
        
        messages = [{"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt}
        ]}]
        
        text = self.processor.py_apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        images, _ = self.processor.process_vision_info(messages)
        inputs = self.processor(text=[text], images=images, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            response = self.model.generate(
                pixel_values=inputs["pixel_values"].to(torch.bfloat16),
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                tokenizer=self.tokenizer,
                max_new_tokens=4096,
                generation_mode="hybrid",
                temperature=0.3,
            )
        
        # Parse boxes from response
        import re
        boxes = re.findall(r'<box>(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?),(\d+(?:\.\d+)?)</box>', response[0])
        return len(boxes), boxes

# Usage
counter = MedicationCounter()
image = Image.open("pills_on_tray.jpg")
count, boxes = counter.count_pills(image, "white round tablet")
print(f"Found {count} pills")

Why this matters: Pharmacy dispensing errors are a real problem. A camera + LocateAnything can verify pill counts before packaging — faster and more reliable than manual counting for high-volume pharmacies.

Limitations

  • License: Non-commercial only. Fine for research, not for production products (unless you’re NVIDIA).
  • Model size: 3B parameters requires decent GPU. No CPU-only path.
  • Edge deployment: Requires quantization/distillation work not yet officially supported.
  • Specialized task: This is for localization, not general chat. Use alongside a general VLM.

The Bottom Line

If you’re building computer vision pipelines and localization speed matters, LocateAnything is a significant upgrade. The parallel decoding approach is elegant — treat boxes as atomic units, not serialized tokens — and the benchmarks prove it works.

Try it: HuggingFace Demo | GitHub | Paper | Pill Counter Notebook


LocateAnything is part of NVIDIA’s Eagle VLM family and has been integrated into Nemotron and Cosmos for production grounding capabilities.