SentrySearch: Natural Language Search Over Hours of Video Footage
The use case is immediately obvious: you have hours of dashcam footage, security camera recordings, or raw video files, and you need to find one specific moment. Not a timestamp. Not a filename. Just: “red truck running a stop sign.”
SentrySearch does exactly that — and exports the trimmed clip.
How It Works
The architecture is straightforward once you see it. SentrySearch:
- Splits your mp4 files into overlapping chunks (default 30s, 5s overlap)
- Embeds each chunk as a video using either Gemini Embedding 2 (API) or Qwen3-VL (local)
- Stores the vectors in a local ChromaDB database
- Searches by embedding your text query into the same vector space and finding the nearest match
- Trims the top match from the original file and saves it as a clip
The key step is #2 — video embeddings, not frame-by-frame image embeddings. The model actually watches each chunk as a video and produces a single embedding that represents the visual content, motion, and sequence of events. This is what makes “red truck running a stop sign” work as a query rather than just “red truck” or “stop sign” independently.
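The chunking step is simple arithmetic: with a 30-second window and a 5-second overlap, each chunk starts 25 seconds after the previous one, so no event falls entirely on a chunk boundary. A minimal sketch of that boundary computation (the function name and signature are illustrative, not SentrySearch's actual API):

```python
def chunk_boundaries(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Yield (start, end) pairs covering a video of `duration_s` seconds
    with fixed-length chunks that overlap by `overlap_s` seconds."""
    step = chunk_s - overlap_s  # each new chunk starts 25s after the last
    start = 0.0
    while start < duration_s:
        yield (start, min(start + chunk_s, duration_s))
        if start + chunk_s >= duration_s:
            break  # this chunk already reached the end of the file
        start += step

# A 70-second file yields chunks 0-30, 25-55, and 50-70.
print(list(chunk_boundaries(70)))
```

The overlap is what guarantees a stop-sign run that straddles the 30-second mark still appears whole in at least one chunk.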
Two Backends: Cloud vs. Fully Local
Gemini backend (default): Uses Google’s Gemini Embedding 2 API. Better search quality, no local GPU required, free tier available at aistudio.google.com. Setup is one command:
```shell
sentrysearch init   # prompts for API key, validates, done
```
Local backend (fully private): Uses Qwen3-VL-Embedding running entirely on your machine. No API calls, no data leaves your system. Auto-detects your hardware and picks the right model:
| Hardware | Model | Notes |
|---|---|---|
| Apple Silicon 24GB+ / NVIDIA 18GB+ VRAM | Qwen3-VL 8B | Full precision |
| Apple Silicon 16GB | Qwen3-VL 2B | 8B won’t fit |
| NVIDIA 8–16GB VRAM | Qwen3-VL 8B (4-bit) | needs the `.[local-quantized]` install |
| Intel Mac / CPU-only | — | Too slow; use Gemini API instead |
Install the local backend:
```shell
uv tool install ".[local]"            # Mac / NVIDIA full precision
uv tool install ".[local-quantized]"  # NVIDIA with 4-bit quantization
```
The Full Workflow
```shell
# Install
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
uv tool install .

# Index your footage
sentrysearch index /path/to/footage

# Search
sentrysearch search "red truck running a stop sign"
```
Output:
```
#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
#3 [0.61] front_2024-01-20_09-15.mp4 @ 00:30-01:00

Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4
```
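The trim itself can be a stream-copy cut, which is fast and lossless. Something like the ffmpeg invocation below would produce the saved clip; the exact flags SentrySearch uses are an assumption:

```python
def trim_cmd(src: str, start: str, end: str, out: str) -> list[str]:
    """Build an ffmpeg command that cuts [start, end] from `src` without
    re-encoding. Output-side -ss/-to keeps both timestamps relative to the
    original timeline; '-c copy' avoids a quality-losing transcode."""
    return ["ffmpeg", "-i", src, "-ss", start, "-to", end, "-c", "copy", out]

cmd = trim_cmd("front_2024-01-15_14-30.mp4", "00:02:15", "00:02:45",
               "match_front_2024-01-15_14-30_02m15s-02m45s.mp4")
print(" ".join(cmd))
# Pass `cmd` to subprocess.run(cmd, check=True) to actually cut the clip.
```

One caveat with stream copy: the cut snaps to the nearest keyframe, which for dashcam footage is usually close enough.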
Similarity scores are shown alongside each result. When the top score falls below the confidence threshold (default 0.41), SentrySearch prompts before trimming, so you never silently get a wrong clip. Pass `--save-top N` to export multiple clips instead of just the best match.
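The ranking and threshold check are easy to make concrete. Assuming the scores are cosine similarities between the query embedding and each chunk embedding (the 0.41 default comes from above; everything else here is illustrative), a stdlib-only sketch looks like:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vec, chunks, threshold=0.41):
    """Score every indexed chunk against the query embedding, best first,
    and flag whether the top hit fails to clear the confidence threshold."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in chunks),
        reverse=True,
    )
    needs_confirmation = scored[0][0] < threshold
    return scored, needs_confirmation

chunks = [("clip_a", [0.9, 0.1, 0.0]), ("clip_b", [0.1, 0.9, 0.0])]
scores, confirm = rank_chunks([1.0, 0.0, 0.0], chunks)
```

In the real tool the nearest-neighbor lookup happens inside ChromaDB rather than in a Python loop, but the geometry is the same.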
Tesla Dashcam: Search + Telemetry Overlay
If you drive a Tesla, SentrySearch has a feature that goes beyond clip extraction. Starting with Tesla firmware 2025.44.25+, dashcam videos embed telemetry data directly in SEI NAL units in the video file — speed, GPS coordinates, location name, turn signal state. SentrySearch can read that data and burn it as a HUD overlay onto any trimmed clip:
```shell
# Search and auto-apply overlay to the result
sentrysearch search "running a red light" --overlay

# Or apply overlay to any Tesla dashcam file directly
sentrysearch overlay /path/to/tesla_video.mp4
```
The overlay shows speed, GPS coordinates, reverse-geocoded location name, and turn signal status — frame-accurate, since it’s reading from the embedded telemetry rather than estimating. For insurance claims, incident documentation, or just reviewing a close call, that context matters. A clip showing 52 mph in a 25 mph zone is a different artifact than the same clip without the data.
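For the curious, locating SEI payloads in an H.264 Annex B stream is mostly a matter of walking start codes: each NAL unit begins with `00 00 01` (or `00 00 00 01`), and the low five bits of the next byte give the NAL type, where 6 means SEI. The sketch below only finds the SEI units; decoding Tesla's proprietary telemetry payload is not something it attempts:

```python
def find_sei_units(data: bytes):
    """Yield (offset, nal_bytes) for each SEI NAL unit (nal_unit_type == 6)
    in an H.264 Annex B byte stream. The leading zero of a 4-byte start code
    may remain attached to the previous unit, which is fine for scanning."""
    offsets = []
    i = 0
    while True:
        j = data.find(b"\x00\x00\x01", i)
        if j == -1:
            break
        offsets.append(j + 3)          # first byte after the start code
        i = j + 3
    offsets.append(len(data) + 3)      # sentinel so the last unit has an end
    for k in range(len(offsets) - 1):
        start, end = offsets[k], offsets[k + 1] - 3
        nal_type = data[start] & 0x1F  # low 5 bits of the NAL header byte
        if nal_type == 6:
            yield start, data[start:end]

# Synthetic stream: a slice NAL (type 5), an SEI NAL (type 6), a PPS NAL (type 8)
stream = b"\x00\x00\x01\x65AAAA" + b"\x00\x00\x01\x06SEI!" + b"\x00\x00\x01\x68"
print(list(find_sei_units(stream)))
```

A real parser would also strip emulation-prevention bytes (`00 00 03`) from the payload before interpreting it, which this sketch skips.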
Why This Matters Beyond the Demo
The obvious use case is security/dashcam footage review. But the underlying capability — semantic search over raw video using natural language — has a much wider range of applications:
Medical imaging and procedure review — search surgical recordings, endoscopy footage, or training videos by describing a specific maneuver or finding. Connects directly to the spatial reasoning work in MedOpenClaw — the same VLM spatial understanding that helps agents navigate 3D volumes could index video with the same approach.
Legal and compliance — search hours of deposition recordings, court proceedings, or workplace incident footage by event description rather than timestamp.
Sports analysis — “fast break leading to turnover,” “penalty kick save,” “player collision in the third quarter.” Frame-accurate clip extraction for coaching review.
Journalism and documentary research — search archive footage by content rather than metadata. Decades of raw footage that currently requires manual review becomes keyword-searchable.
Content moderation at scale — the local model option is particularly relevant here, where processing can’t go through third-party APIs for privacy or compliance reasons.
The Local-First Angle
The Gemini API backend is the path of least resistance for getting started. But the Qwen3-VL local option is the more interesting story.
Video content is one of the last major data types that hasn’t been made fully searchable without cloud dependency. Running a capable VLM locally for video embeddings changes the economics for anyone processing sensitive, proprietary, or high-volume footage where per-API-call costs or data residency requirements would make a cloud approach impractical.
The 8B model on Apple Silicon with 24GB+ RAM or a mid-range NVIDIA GPU is fast enough for real workloads. The 4-bit quantized option brings it down to hardware that’s genuinely accessible.
SentrySearch is a clean implementation of a capability that’s been theoretically possible for a while but hadn’t been packaged into something a developer could install and use in an afternoon. That’s the contribution.
Repo: github.com/ssrajadh/sentrysearch
OpenClaw skill: clawhub.ai/ssrajadh/natural-language-video-search