Skeleton Embeddings: How pose-search Finds Images by Body Position, Not Pixels

By Prahlad Menon · 7 min read

Most image search engines think about images as pixel patterns. CLIP converts an image to a 512-dimensional embedding that captures what things look like. Perceptual hash functions find near-duplicates. Feature extractors like SIFT or ORB match corners and edges.

pose-search does something fundamentally different. It does not care what a person is wearing, what color the background is, or how the image is lit. It cares about geometry alone: what angle is the elbow at? Where is the shoulder relative to the hip?

This is a small, clever project by developer x6ud that deserves more attention than it has gotten. Here is how it actually works.

The Core Idea

The search index is not pixel embeddings. It is joint geometry.

For every photo in the dataset (sourced from Unsplash), the system runs Google’s MediaPipe Pose to detect 33 body landmarks — each a 3D point with x, y, z coordinates and a visibility confidence score. These get stored as a compact binary file (landmarks.dat, about 870KB for the full dataset).

When you want to search, you drag joints on an interactive 3D skeleton model in the browser. That pose is then compared against every stored landmark set using angle-based geometric matching, not cosine similarity on a flat embedding vector.

The Matching Algorithm

This is where it gets interesting. Let me walk through the shoulder matcher, which is representative of how all the joint matchers work.

For a query like “left arm raised at 45 degrees from the torso,” the system:

Step 1: Compute the query direction in local space

Rather than storing absolute positions (which vary by person height, camera angle, and image crop), the system expresses joint directions relative to the body’s own coordinate frame. For the shoulder, that means computing the direction of the upper arm relative to the trunk’s local up and right axes:

this.shoulderLocalDir = getNormalInLocalSpace(
    model.rightUpperArm.originWorldPosition,
    model.leftUpperArm.originWorldPosition,    // shoulder axis
    model.trunk.originWorldPosition,
    model.trunk.controlPointWorldPosition,     // trunk orientation
    model.leftUpperArm.originWorldPosition,
    model.leftUpperArm.controlPointWorldPosition,  // arm direction
);

This gives you a unit vector that says “the arm is pointing in this direction relative to the body.” It is invariant to camera distance and body scale.
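
To make the local-frame idea concrete, here is a minimal sketch. The helpers and the function name are mine, not the project's getNormalInLocalSpace; the point is the construction: build a body-local basis from the shoulder axis and the trunk axis, then express the upper-arm direction in that basis.

type Vec3 = { x: number; y: number; z: number };

const sub = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const dot = (a: Vec3, b: Vec3): number => a.x * b.x + a.y * b.y + a.z * b.z;
const cross = (a: Vec3, b: Vec3): Vec3 => ({
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x,
});
const normalize = (v: Vec3): Vec3 => {
    const len = Math.hypot(v.x, v.y, v.z) || 1;
    return { x: v.x / len, y: v.y / len, z: v.z / len };
};

// Hypothetical sketch, not the project's implementation: the returned vector is
// the arm direction expressed in the body's own (right, up, forward) axes, so it
// ignores camera position and body scale.
function armDirInBodyFrame(
    rightShoulder: Vec3, leftShoulder: Vec3,  // shoulder axis endpoints
    trunkOrigin: Vec3, trunkTop: Vec3,        // trunk orientation
    armOrigin: Vec3, armEnd: Vec3,            // upper-arm segment
): Vec3 {
    const right = normalize(sub(leftShoulder, rightShoulder)); // body "right"
    const up = normalize(sub(trunkTop, trunkOrigin));          // body "up"
    const forward = normalize(cross(right, up));               // body "forward"
    const arm = normalize(sub(armEnd, armOrigin));
    return { x: dot(arm, right), y: dot(arm, up), z: dot(arm, forward) };
}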

Step 2: Check both the pose and the camera angle

Here is the subtle part. The world-space shoulder angle is camera-independent, so a pose photographed from the front and the same pose photographed from behind produce the same 3D angles, even though as reference images they look very different. To keep results that also match the camera relationship, the matcher checks two things:

  • errorL — the 3D world-space angle between the query arm direction and the photo’s arm direction (from MediaPipe’s world landmarks)
  • viewError1L / viewError2L — the 2D view-space angle of the trunk’s up and right axes (from MediaPipe’s normalized/screen landmarks)

A score is returned only if all three angles are within a 45-degree threshold:

score: (Math.PI - errorL) * (Math.PI - viewError1L) * (Math.PI - viewError2L)

This multiplicative scoring means all three must be good simultaneously — a near-perfect world angle but wrong camera orientation scores poorly.
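
In code, the gate plus score looks roughly like this. It is a sketch following the description above, not a verbatim excerpt; the parameter names mirror the bullets.

const ANGLE_THRESHOLD = Math.PI / 4; // the 45-degree gate, in radians

// Given the three angles (radians), return a combined score or reject the photo.
function shoulderScore(errorL: number, viewError1L: number, viewError2L: number): number | null {
    if (errorL > ANGLE_THRESHOLD || viewError1L > ANGLE_THRESHOLD || viewError2L > ANGLE_THRESHOLD) {
        return null; // any single failure rejects the photo for this joint
    }
    return (Math.PI - errorL) * (Math.PI - viewError1L) * (Math.PI - viewError2L);
}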

Step 3: Handle left-right flipping automatically

The system also checks whether the mirrored version of the query matches better, so a pose struck on the left side will still find photos where the equivalent pose appears on the right. This is done by computing a shoulderLocalDirMirror during preparation and comparing both, as sketched below.
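
A sketch of the idea, reusing the Vec3 type from the earlier sketch: flip the body-local direction across the left-right axis, score both variants, and keep whichever is better. The helper is hypothetical.

// Mirror a body-local direction across the sagittal plane (swap left/right).
const mirrorLocalDir = (d: Vec3): Vec3 => ({ x: -d.x, y: d.y, z: d.z });

// Score a photo against both the query direction and its mirror.
function bestOfDirectAndMirrored(
    scorePhoto: (queryDir: Vec3) => number | null,
    queryDir: Vec3,
): number | null {
    const direct = scorePhoto(queryDir);
    const mirrored = scorePhoto(mirrorLocalDir(queryDir));
    if (direct === null) return mirrored;
    if (mirrored === null) return direct;
    return Math.max(direct, mirrored);
}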

Two Landmark Spaces

MediaPipe Pose returns landmarks in two coordinate systems, and pose-search uses both:

World landmarks (worldLandmarks): 3D coordinates in metric space, camera-independent. Used for the core pose angle comparison. If MediaPipe says the left elbow is 30 degrees from the shoulder in 3D, that is true regardless of camera position.

Normalized landmarks (normalizedLandmarks): 2D screen-space coordinates normalized to [0,1] in the image. Used for the camera orientation check (making sure the body is facing a similar direction in both query and result), and for the bounding box of what to highlight in the result thumbnail.

The insight is that you need both. World landmarks tell you about pose geometry; normalized landmarks tell you about the camera relationship to the body.
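
As a sketch, a per-photo record holding both spaces might look like this. Field names are illustrative, not necessarily the project's exact types.

// One MediaPipe landmark (visibility is stored only for world landmarks here).
interface Landmark {
    x: number;
    y: number;
    z: number;
    visibility?: number;
}

// Per-photo record combining both coordinate systems.
interface PhotoPose {
    worldLandmarks: Landmark[];       // 33 metric, camera-independent points: pose-angle comparison
    normalizedLandmarks: Landmark[];  // 33 screen-space points in [0, 1]: camera checks, thumbnail box
    width: number;
    height: number;
    unsplashId: string;
}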

What Gets Stored

The landmarks.dat binary file stores, per photo:

  • 33 world landmarks × (x, y, z, visibility) = 132 floats
  • 33 normalized landmarks × (x, y, z) = 99 floats
  • Photo metadata (width, height, Unsplash ID)

For the Unsplash dataset of roughly 4,000 photos, this fits in under 1MB. The entire search runs in the browser — no server query needed at search time.
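
Packing one record into a flat array is straightforward. This is a sketch reusing the PhotoPose shape above; the real landmarks.dat layout and field order are not documented here, so treat it as illustrative.

// 33 × 4 world floats + 33 × 3 normalized floats = 231 floats per photo.
function packPhoto(pose: PhotoPose): Float32Array {
    const out = new Float32Array(33 * 4 + 33 * 3);
    let i = 0;
    for (const lm of pose.worldLandmarks) {
        out[i++] = lm.x; out[i++] = lm.y; out[i++] = lm.z; out[i++] = lm.visibility ?? 0;
    }
    for (const lm of pose.normalizedLandmarks) {
        out[i++] = lm.x; out[i++] = lm.y; out[i++] = lm.z;
    }
    return out;
}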

Separate Matchers per Joint Group

Rather than one monolithic pose vector, there are separate matchers for each body region:

  • MatchShoulder — upper arm direction relative to trunk
  • MatchElbow — forearm direction relative to upper arm
  • MatchHip — thigh direction relative to pelvis
  • MatchKnee — shin direction relative to thigh
  • MatchChest — torso facing direction
  • MatchCrotch — pelvis orientation
  • MatchFace — head direction

Each has a CameraUnrelated variant that drops the view-space checks, for cases where camera orientation is irrelevant to the search.

You pick one joint group to search on. The UI lets you select “left shoulder,” run pose detection on your query image (or pose the skeleton manually), and it finds all photos where that joint is in a similar configuration.
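
A sketch of the overall shape, using an illustrative interface rather than the project's actual class hierarchy: the selected matcher prepares its query-side geometry once, scores every stored photo, and the results are ranked by score. (In the real app the query comes from the posed 3D skeleton; it is collapsed to a PhotoPose here for brevity.)

// Illustrative matcher interface; the real classes in src/Search/impl/ differ in detail.
interface JointMatcher {
    prepare(query: PhotoPose): void;         // precompute e.g. shoulderLocalDir and its mirror
    score(photo: PhotoPose): number | null;  // null if the photo fails the angle thresholds
}

function search(matcher: JointMatcher, query: PhotoPose, photos: PhotoPose[]): PhotoPose[] {
    matcher.prepare(query);
    return photos
        .map(photo => ({ photo, score: matcher.score(photo) }))
        .filter((r): r is { photo: PhotoPose; score: number } => r.score !== null)
        .sort((a, b) => b.score - a.score)
        .map(r => r.photo);
}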

Why This Beats Pixel Similarity for Pose Queries

CLIP will find you a photo of “a person with their arm raised” if you type that query. But it cannot distinguish between:

  • Arm raised 30 degrees from vertical vs 60 degrees
  • Arm raised forward vs to the side
  • Standing upright vs leaning 20 degrees

Skeleton geometry handles all of these precisely, because you are comparing actual joint angles, not a compressed semantic representation.

The tradeoff: CLIP handles appearance, style, and semantics. Skeleton matching handles nothing but geometry. They are complementary.

The Native Version

There is also pose-search-native, which runs against your local photo library. A Node.js server handles the MediaPipe pose estimation for each photo on disk; a Vue.js frontend (same as the web version) handles the 3D skeleton UI and querying. The index is built locally, no data leaves your machine.

The technique generalizes to any domain where you care about body geometry:

Sports analytics: Find all frames in a match where a player has their knee at a specific flexion angle at ball contact. Pixel similarity cannot do this reliably.

Physical therapy / rehabilitation: Flag video frames where a patient’s shoulder is outside acceptable range during an exercise — regardless of camera angle or clothing.

Animation and reference: Artists searching for pose reference images want exact geometry, not visual similarity. This is exactly what pose-search was built for.

Medical imaging positioning: Radiology has strict patient positioning requirements for certain scans. Skeleton matching could flag images where a patient is not positioned within tolerance.

Security / access control: Detecting specific postures (hands raised, sitting vs standing) in CCTV feeds without facial recognition — geometry-based, not appearance-based.

Run It Yourself

The web version is live at x6ud.github.io/pose-search — you need WebGL2 enabled (standard in any modern desktop browser). The editor mode at /#/editor lets you add your own Unsplash photos to the index.

For local photos:

git clone https://github.com/x6ud/pose-search-native
cd pose-search-native/server && npm install && npm run dev
# In a second terminal:
cd pose-search-native/ui && npm install && npm run dev
# Open http://localhost:5173, click Scan to index your photo folder

The source is clean TypeScript with Vue 3, well under 10k lines. The matching logic in src/Search/impl/ is worth reading even if you never run the project — it is a compact, precise implementation of geometry-based retrieval that would be straightforward to port to Python with NumPy.

FAQ

Does it work without a human in the photo? No. It requires MediaPipe to detect at least the relevant landmarks for the body part you are searching. Low-visibility landmarks are filtered out by a confidence threshold, so partial occlusions are handled gracefully but full absence of a person returns no results.
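
A sketch of that gate, reusing the PhotoPose shape from earlier; the 0.5 threshold is an assumption for illustration, not the project's value.

// Require every landmark the matcher needs to clear a visibility threshold.
const MIN_VISIBILITY = 0.5; // assumed value

function hasUsableLandmarks(photo: PhotoPose, requiredIndices: number[]): boolean {
    return requiredIndices.every(
        i => (photo.worldLandmarks[i]?.visibility ?? 0) >= MIN_VISIBILITY,
    );
}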

Can it search for multi-person poses? The current implementation processes one person per photo (the primary detected subject). Multi-person support would require running the matcher on each detected person independently.

How is this different from OpenPose or AlphaPose? pose-search uses MediaPipe Pose (Google’s model), which runs efficiently in-browser via WebAssembly. OpenPose and AlphaPose are stronger for multi-person estimation but typically need GPU acceleration and a server-side setup. For single-person, in-browser use, MediaPipe is the right call.

Could you combine skeleton matching with CLIP? Yes, and it would be a meaningful improvement. Use CLIP to filter to semantically relevant images first (e.g., “outdoor sports photos”), then rank by skeleton geometry within that subset. The two approaches are orthogonal and composable.
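
One way to compose them, sketched with the search helper and PhotoPose shape from earlier; the clipScores map and the threshold are hypothetical.

// Hypothetical hybrid ranking: CLIP prunes to semantically relevant photos,
// skeleton geometry ranks within that subset.
function hybridSearch(
    matcher: JointMatcher,
    query: PhotoPose,
    photos: PhotoPose[],
    clipScores: Map<string, number>,  // unsplashId -> CLIP similarity to the text query
    clipThreshold = 0.25,             // assumed cutoff
): PhotoPose[] {
    const candidates = photos.filter(p => (clipScores.get(p.unsplashId) ?? 0) >= clipThreshold);
    return search(matcher, query, candidates);
}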

What is the index size limit? Each photo contributes 231 floats plus a little metadata. At raw float32 that is roughly 900 bytes per photo, while the shipped landmarks.dat works out to closer to 220 bytes per photo, so the on-disk encoding appears to be more compact than raw floats. Either way, one million photos lands between roughly 200MB and 1GB, which is feasible to load in a browser with streaming, or trivially manageable server-side.