SentrySearch: Natural Language Search Over Video Footage

Find “the frame where a red car ran a stop sign” in hours of dashcam footage. SentrySearch is an open-source CLI that searches video content with natural language, with no transcription and no frame-by-frame captioning. It projects raw video pixels into the same vector space as text, then matches by similarity.

*Source: GitHub โ€” ssrajadh/sentrysearch (2.4K stars) ๅฎ็މ xp on Weibo (2026-03-31) Also available as a Claude Code skill*

How It Works

Video file (MP4)
         │
         ▼
┌──────────────────┐
│ 1. Segment       │  Split into 30s chunks (5s overlap)
│    Downsample    │  → 480p, 5fps (95% fewer pixels)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ 2. Embed         │  Gemini Embedding 2 (cloud, $2.84/hr)
│                  │  OR Qwen3-VL (local, free)
└────────┬─────────┘
         │ vectors
         ▼
┌──────────────────┐
│ 3. Store         │  ChromaDB (local vector database)
└──────────────────┘

Search:
  "red car running stop sign"
         │
         ▼  encode to same vector space
  similarity match → extract matching clip → save MP4
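
Step 1 of the pipeline can be sketched as a small function that computes chunk boundaries. This is a minimal illustration, not SentrySearch's own code: the function name `chunk_spans` is hypothetical, and it assumes each new chunk starts `chunk length - overlap` seconds after the previous one (25s for the 30s/5s settings in the diagram).

```python
def chunk_spans(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) spans in seconds that cover a video of duration_s.

    Assumption: consecutive chunks start (chunk_s - overlap_s) apart, so
    neighbouring chunks share overlap_s seconds of footage.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk length must be larger than the overlap")
    duration_s = float(duration_s)
    stride = chunk_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the last chunk already reaches the end of the video
        start += stride
    return spans


print(chunk_spans(70))
# -> [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap means an event that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some footage twice.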

The key insight: no transcription, no OCR, no frame captions. Multimodal embedding models (Gemini Embedding 2, Qwen3-VL-Embedding) natively embed video frames and text into the same vector space, so a text query can be compared directly against video chunks.
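
To make "compare directly" concrete, here is a toy sketch of ranking chunks against a query once both are vectors. It assumes cosine similarity as the metric (the source only says "similarity"), and the 3-dimensional vectors and clip names are made up for illustration; real embeddings have hundreds of dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs, top_k=2):
    """Return the top_k (score, name) pairs, best match first."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings: one vector per indexed video chunk.
chunks = {
    "front_clip": [0.9, 0.1, 0.0],
    "left_clip": [0.6, 0.6, 0.1],
    "rear_clip": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedded text query

for score, name in rank_chunks(query, chunks):
    print(f"[{score:.2f}] {name}")
```

In practice the vector database performs this nearest-neighbour lookup, so the comparison does not need to be written by hand.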

Quick Start

# Install
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch && uv tool install .

# Configure (sets up Gemini API key or local model)
sentrysearch init

# Index footage
sentrysearch index /path/to/dashcam/

# Search
sentrysearch search "red truck running a stop sign"

Output:

#1 [0.87] front_2024-01-15_14-30.mp4 @ 02:15-02:45
#2 [0.74] left_2024-01-15_14-30.mp4 @ 02:10-02:40
Saved clip: ./match_front_2024-01-15_14-30_02m15s-02m45s.mp4

Two Backends: Cloud vs Local

| | Gemini (Cloud) | Qwen3-VL (Local) |
|---|---|---|
| Cost | ~$2.84 per hour of video | Free |
| Quality | Highest | Good |
| Privacy | Video sent to Google | Everything stays local |
| Requires | Free Gemini API key | 24GB+ RAM or NVIDIA GPU |
| Speed | Fast | 2-8s per chunk |
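
A rough way to weigh the two backends for a given archive is to turn the figures above into an estimate. The sketch below is back-of-the-envelope arithmetic only: the function names are hypothetical, the price and per-chunk timings are the ones quoted in this comparison, and the chunk count assumes 30s chunks that start 25s apart.

```python
import math

GEMINI_USD_PER_VIDEO_HOUR = 2.84       # figure quoted in the comparison above
LOCAL_SECONDS_PER_CHUNK = (2.0, 8.0)   # range quoted in the comparison above


def count_chunks(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Number of chunks needed to cover duration_s (assumed 25s stride)."""
    if duration_s <= 0:
        return 0
    if duration_s <= chunk_s:
        return 1
    stride = chunk_s - overlap_s
    return math.ceil((duration_s - chunk_s) / stride) + 1


def estimate(hours_of_video):
    """Estimate cloud cost and local indexing time for a footage archive."""
    chunks = count_chunks(hours_of_video * 3600)
    return {
        "chunks": chunks,
        "gemini_usd": round(hours_of_video * GEMINI_USD_PER_VIDEO_HOUR, 2),
        "local_minutes": tuple(
            round(chunks * seconds / 60, 1) for seconds in LOCAL_SECONDS_PER_CHUNK
        ),
    }


print(estimate(1))
# -> {'chunks': 144, 'gemini_usd': 2.84, 'local_minutes': (4.8, 19.2)}
```

Under these assumptions, one hour of footage costs a few dollars in the cloud or roughly 5 to 20 minutes of local compute, so the choice mostly comes down to privacy and available hardware.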

Local Model Selection by Hardware

| Hardware | Model | Precision / memory |
|---|---|---|
| Apple Silicon, 24GB+ RAM | qwen8b | Full float16 via MPS |
| Apple Silicon, 16GB RAM | qwen2b | ~6GB usage |
| NVIDIA, 18GB+ VRAM | qwen8b | Full bf16 precision |
| NVIDIA, 8-16GB VRAM | qwen8b (4-bit) | Quantized |

# Install with local model support
uv tool install ".[local]"

# Or with quantization for limited VRAM
uv tool install ".[local-quantized]"

Performance Optimizations

| Technique | Impact |
|---|---|
| Downsample to 480p @ 5fps | ~95% fewer pixels processed |
| Max 32 frames per 30s chunk | Bounded compute per segment |
| Matryoshka dimension truncation | Only 768 embedding dimensions kept |
| Auto 4-bit quantization | Fits 8B model in 6-8GB VRAM |
| Still-frame detection | Skips static scenes (JPEG size comparison) |
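
Three of these optimizations are simple enough to sketch in a few lines. The code below is illustrative only and is not taken from SentrySearch: the function names are hypothetical, the source resolution (1080p at 30fps) and the 854x480 target size are assumptions, frames are assumed to be picked at even spacing, and Matryoshka truncation is shown in its usual form (keep the leading dimensions, then re-normalize).

```python
import math


def pixel_reduction(src_w, src_h, src_fps, dst_w=854, dst_h=480, dst_fps=5):
    """Fraction of pixels per second removed by downsampling."""
    src_rate = src_w * src_h * src_fps
    dst_rate = dst_w * dst_h * dst_fps
    return 1 - dst_rate / src_rate


def sample_frame_indices(total_frames, max_frames=32):
    """Pick at most max_frames evenly spaced frame indices (assumed strategy)."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def truncate_embedding(vec, dims=768):
    """Keep the first dims values and rescale the result to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


# Assumed source: 1920x1080 at 30fps, downsampled to 480p at 5fps.
print(f"{pixel_reduction(1920, 1080, 30):.1%} fewer pixels per second")

# A 30s chunk at 5fps has 150 frames; only 32 are kept.
print(len(sample_frame_indices(30 * 5)))

# A hypothetical 1536-dimensional embedding truncated to 768 dimensions.
print(len(truncate_embedding([0.01] * 1536)))
```

For a 1080p, 30fps source the reduction comes out slightly above the ~95% quoted in the table; the exact figure depends on the resolution and frame rate of the original footage.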

Tesla Dashcam Integration

A special adapter for Tesla footage overlays telemetry on extracted clips:

uv tool install ".[tesla]"
sentrysearch search "accident" --overlay

Displays speed, GPS location, timestamp, and road name on the clip. Requires Tesla firmware 2025.44.25+ with HW3+.

How LearnAI Team Could Use This

  • Multimodal search demos: Show how video and text can be embedded into a shared retrieval space without captioning every frame.
  • Privacy-first AI workflows: Compare Gemini cloud indexing with local Qwen3-VL indexing for sensitive footage.
  • Student projects: Build small video-search assignments around lecture clips, lab recordings, or public-domain footage.

Real-World Use Cases

| Use Case | Search Query Example |
|---|---|
| Security footage | “person entering through back door at night” |
| Sports analysis | “goalkeeper diving to the left” |
| Wildlife cameras | “deer crossing the clearing” |
| Lecture recordings | “the slide about backpropagation” |
| Manufacturing QC | “defective part on conveyor belt” |
| Body cameras | “suspect reaching into pocket” |

Works with any MP4 video and recursively indexes all files in a directory.
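
Recursive discovery of this kind is straightforward with the standard library. The sketch below shows one way it could be done; `find_videos` is a hypothetical helper, not the tool's actual function, and matching the `.mp4` extension case-insensitively is an assumption.

```python
from pathlib import Path


def find_videos(root, suffix=".mp4"):
    """Return every file under root (at any depth) with the given extension."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() == suffix
    )


if __name__ == "__main__":
    # Lists MP4 files below the current directory, including subfolders.
    for video in find_videos("."):
        print(video)
```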

Why This Matters

Video is one of the largest untapped data sources. Hours of footage sit on hard drives because scrubbing through it manually is impractical. SentrySearch makes video as searchable as text, and because it can run locally with no API key (via Qwen3-VL), sensitive footage never has to leave your machine.

The underlying technique, multimodal embedding without intermediate text, is the same approach powering next-gen search engines. Understanding it now prepares you for where retrieval is heading.