Hindsight: Agent Memory That Actually Learns (Not Just Remembers)

By Prahlad Menon · 6 min read

Most agent memory systems remember what happened. Hindsight is designed to learn from it.

Hindsight, from Vectorize, is an open-source agent memory system that currently holds state-of-the-art performance on the LongMemEval benchmark — independently verified by Virginia Tech’s Sanghani Center for AI and Data Analytics and The Washington Post. It’s in production at Fortune 500 enterprises and a growing number of AI startups.

To understand why it matters, you need to understand where it sits relative to the approaches you’re probably already using.

The RAG ceiling

RAG is the standard: chunk your data, embed it, retrieve by semantic similarity, inject into context. It’s effective for focused lookups — “what does the refund policy say?” — but it has structural limits that can’t be patched away:

  • Flat structure: all chunks are equal, with no hierarchy, typing, or relationships between them
  • Single retrieval path: semantic similarity alone misses exact matches, entity connections, and time-sensitive context
  • No temporal awareness: something from last week and something from two years ago are retrieved identically
  • No learning: after 1000 conversations, your agent knows nothing more than it did after conversation 1

RAG retrieves. It doesn’t learn.

Where RLM comes in

Recursive Language Models (RLMs) address a different problem — exhaustive reasoning. Instead of retrieving the top-k similar chunks, an RLM has the LLM write code to programmatically navigate an entire corpus: grouping documents, spawning sub-calls to analyze subsets, aggregating results. When a user asks “compare complaints across all regional offices,” RAG will give you a best-guess sample. RLM will actually examine all of them.

As we covered in our RAG+RLM architecture post and soul.py v2’s hybrid approach: RAG handles fast lookups (~90% of queries), RLM handles exhaustive synthesis (~10% that would otherwise fail). Together they cover the full query spectrum.
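The router described above can be sketched with a cheap heuristic; the markers and threshold here are hypothetical, and a production router would more likely use an LLM classifier.

```python
# Hypothetical RAG/RLM router: focused lookups go to RAG, queries that
# demand exhaustive synthesis go to RLM. Marker words are illustrative.
EXHAUSTIVE_MARKERS = {"all", "every", "across", "compare", "summarize", "trends"}

def route(query: str) -> str:
    words = set(query.lower().replace(",", " ").split())
    return "rlm" if words & EXHAUSTIVE_MARKERS else "rag"

print(route("what does the refund policy say?"))               # focused lookup
print(route("compare complaints across all regional offices")) # exhaustive synthesis
```

In the hybrid architecture, roughly 90% of traffic takes the fast RAG branch, so the expensive recursive path is paid for only where single-shot retrieval would fail.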

Hindsight’s reflect operation sits in RLM territory conceptually — it synthesizes across raw facts and experiences to produce mental models. The key difference: Hindsight stores those mental models persistently. Where RLM re-runs reasoning every time you query, Hindsight accumulates its reasoning outputs over time, so future queries benefit from everything the agent has already figured out.

The four memory types

Hindsight abandons flat chunks entirely. Every piece of information is classified into one of four biomimetic structures:

  • World — General facts about the world. “The stove gets hot.”
  • Experiences — Things the agent has actually encountered. “I touched the stove and it really hurt.”
  • Opinions — Beliefs with quantified confidence scores. “I shouldn’t touch the stove again.” (confidence 0.99)
  • Observations/Mental Models — Complex understanding derived from reflecting across facts and experiences. “Curling irons, ovens, and fire are also hot.”

The difference isn’t cosmetic. An agent with only RAG knows the stove is hot because it retrieved that fact. An agent with Hindsight knows the stove is hot, has a first-person memory of what happened when it touched it, holds a belief about what to do next with a confidence score, and has generalized that understanding to an entire class of related objects — without being told any of that explicitly.

4-way hybrid retrieval

When the agent recalls something, Hindsight runs four retrieval paths simultaneously:

  1. Semantic search — vector similarity across stored memories
  2. Keyword search — BM25-style exact and fuzzy matching
  3. Graph retrieval — entity relationships and connected facts
  4. Temporal retrieval — time-aware context (recency, sequence, when things happened)

Results are combined and reranked before anything enters the model’s context window. Each path catches what the others miss: graph retrieval surfaces connected facts invisible to semantic search; temporal retrieval knows last week matters more than last year; keyword search catches exact terminology that drifts in embedding space.

This is how Hindsight beats systems that use any single path — including well-tuned RAG with reranking.
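The source doesn't specify Hindsight's reranker, but reciprocal rank fusion (RRF) is a standard way to merge ranked lists from heterogeneous retrievers without calibrating their scores against each other — a plausible sketch of the combine-and-rerank step:

```python
# Reciprocal rank fusion over four retrieval paths. A memory that appears
# high in several rankings accumulates score from each; k=60 is the
# conventional damping constant. Memory IDs are illustrative.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["m3", "m1", "m7"]   # vector similarity
keyword  = ["m1", "m9"]         # BM25-style exact/fuzzy match
graph    = ["m7", "m1"]         # entity-connected facts
temporal = ["m9", "m3"]         # recency-weighted

print(rrf([semantic, keyword, graph, temporal]))
```

Here `m1` wins overall despite topping only one list, because three independent paths agree on it — the cross-path consensus that no single retriever can produce.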

Reflect: where learning actually happens

The reflect operation is what separates Hindsight from memory systems that just record history. During reflection, raw facts and experiences are processed into mental models — consolidated, generalized understanding that persists.

The retrieval priority order is: Mental Models → Observations → Raw Facts. Over time, the mental model layer becomes the primary source of knowledge. An agent that’s closed 1000 support tickets doesn’t just have 1000 conversations in a log — it has a structured, searchable understanding of common issues, edge cases it’s encountered, and beliefs it’s formed, all queryable by future tasks.
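The priority order can be sketched as a sort over recalled hits — layer first, relevance second. The layer labels and support-ticket examples below are hypothetical.

```python
# Sketch of the retrieval priority order: mental models outrank
# observations, which outrank raw facts, regardless of raw relevance.
PRIORITY = {"mental_model": 0, "observation": 1, "raw_fact": 2}

def prioritize(hits: list[tuple[str, str, float]]) -> list[str]:
    """hits: (content, layer, relevance) -> contents, best layer first,
    then by relevance within each layer."""
    ordered = sorted(hits, key=lambda h: (PRIORITY[h[1]], -h[2]))
    return [content for content, _, _ in ordered]

hits = [
    ("Ticket #482: printer driver crash", "raw_fact", 0.91),
    ("Driver crashes cluster around version 11.3", "observation", 0.88),
    ("Printer issues are usually driver regressions", "mental_model", 0.80),
]
print(prioritize(hits))
```

The consolidated understanding surfaces first even though the raw ticket scored highest on relevance — which is the behavior the priority order is meant to produce.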

Integration

from hindsight_client import Hindsight

client = Hindsight(base_url="http://localhost:8888")

# Store a memory
client.retain(bank_id="my-agent", content="Alice works at Google as a software engineer")

# Recall relevant context
results = client.recall(bank_id="my-agent", query="Where does Alice work?")

Or use the LLM Wrapper for zero-config integration: two lines replace your existing LLM client, and memory becomes fully automatic. Run it locally:

docker run --rm -it --pull always -p 8888:8888 -p 9999:9999 \
  -e HINDSIGHT_API_LLM_API_KEY=$OPENAI_API_KEY \
  ghcr.io/vectorize-io/hindsight:latest

Supports OpenAI, Anthropic, Gemini, Groq, Ollama, LM Studio, and MiniMax.

Where Hindsight sits in the memory landscape

We’ve covered a lot of ground in agent memory. Here’s the honest map:

soul.py + SoulMate — 2-line drop-in, managed backend, BYOK. Fastest path from zero to persistent memory. See also: soul.py v2 RAG+RLM hybrid, soul.py vs memU, soul + CrewAI.

OpenViking — ByteDance’s filesystem-paradigm context database. Self-hosted, L0/L1/L2 tiered loading, full control over the stack. Best for serious agent platforms.

Memvid — Millions of memory chunks stored in a single MP4 file. No database, no server, completely portable. Best for offline and edge use.

The Modulizer Pattern — Structured markdown modules, no vector database at all. Best for deterministic, fully inspectable memory where you want editorial control over what’s stored and how it’s organized.

RAG + RLM Architecture — Router that sends focused queries to RAG (fast) and exhaustive queries to RLM (thorough recursive analysis). Best for production knowledge bases covering the full query spectrum.

PageIndex vs Vector DBs — Page-level indexing vs chunk-level for RAG. Best for document-heavy retrieval where chunk boundaries destroy context.

Hindsight — Biomimetic memory structures, 4-way hybrid retrieval, persistent mental models built through reflection. Strongest public benchmarks in the space. Best for agents that need to genuinely improve over many sessions — where long-term memory quality directly determines outcome quality.

The benchmark result matters. It was independently reproduced by Virginia Tech and The Washington Post. For teams that need to justify their memory layer choice, that’s a real differentiator.


Also see: OpenClaw-RL — Train Any Agent Just by Using It

Related: soul.py — Persistent Memory for LLM Agents · SoulMate: Persistent AI Memory Service · RAG + RLM: The Complete Architecture · soul.py v2 — RAG+RLM Hybrid · OpenViking — ByteDance’s Context Database · Memvid — Single-File Agent Memory · The Modulizer Pattern · 7 RAG Patterns in 2026 · Memory as File vs Memory as System · soul.py vs memU