What MemPalace Got Right (And What's Still Missing in AI Agent Memory)

By Prahlad Menon · 7 min read

A Hollywood actress and a developer open-sourced an AI memory system last week. It hit 35,000 GitHub stars in five days. The viral post called it the highest-scoring AI memory system ever benchmarked — free, local, no subscription.

That’s mostly true. But not in the way most people think.

I’ve been building in this space for a while with soul.py, an open-source agent memory library available on PyPI (pip install soul-agent). When MemPalace launched I read the README top to bottom, dug into the benchmark claims, and watched the community tear it apart in real time. Here’s an honest technical assessment — what they got right, what was overstated, and what the whole space is still missing.


What MemPalace Actually Does

The core idea is simple and defensible: store everything verbatim, then make it findable through structure.

Most AI memory systems — Mem0, Zep, and similar — use an LLM to decide what’s worth remembering. They extract facts like “user prefers Postgres” and discard the conversation where you explained why. Over time you lose nuance. The extraction itself can hallucinate.

MemPalace refuses to summarize. Every word goes in. ChromaDB handles the vector storage. Structure does the organizational work that LLM extraction was trying to do.

That structure is the “palace” — borrowed from the ancient method of loci, the memory-palace technique:

  • Wings — a person or project
  • Rooms — topics within a wing (auth, billing, deploy)
  • Halls — corridors connecting related rooms (decisions, events, discoveries)
  • Closets — summaries pointing to original content
  • Drawers — the verbatim originals, never summarized
  • Tunnels — cross-wing connections when the same topic appears in multiple projects

The hierarchy isn’t decorative. Their benchmark shows it genuinely improves retrieval:

| Search scope | R@10 |
| --- | --- |
| All closets (flat) | 60.9% |
| Within wing | 73.1% |
| Wing + hall | 84.8% |
| Wing + room | 94.8% |

A 34-percentage-point retrieval improvement (60.9% → 94.8%) from metadata filtering. That’s real — though it’s worth noting this is a standard ChromaDB feature, not proprietary MemPalace magic. The insight is that good organization makes standard tools dramatically more effective.
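
The principle behind those numbers can be sketched in a few lines of plain Python. The closet data, field names, and term-overlap scoring below are invented for illustration — a real system (ChromaDB included) ranks by embedding similarity, but the win comes from the same two-step shape: filter by metadata first, rank second.

```python
# Sketch of hierarchical metadata filtering: narrow the candidate set
# by scope before ranking. Illustrative only, not MemPalace's code.

closets = [
    {"text": "chose Postgres for auth sessions", "wing": "acme", "room": "auth"},
    {"text": "billing retries use exponential backoff", "wing": "acme", "room": "billing"},
    {"text": "auth tokens rotate weekly", "wing": "internal", "room": "auth"},
]

def search(query_terms, wing=None, room=None, k=10):
    # 1. Metadata filter: keep only closets inside the requested scope.
    candidates = [c for c in closets
                  if (wing is None or c["wing"] == wing)
                  and (room is None or c["room"] == room)]
    # 2. Rank the survivors. A real system scores by embedding
    #    similarity; simple term overlap stands in for that here.
    scored = sorted(candidates,
                    key=lambda c: -sum(t in c["text"] for t in query_terms))
    return scored[:k]

# Scoped search never even considers the other wing's "auth" closet.
hits = search(["auth", "sessions"], wing="acme", room="auth")
```

Shrinking the candidate pool is why the scoped scores climb: the unrelated "auth" closet in the other wing can never outrank the right one if it is filtered out before ranking begins.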


The Benchmark: What’s Accurate and What Isn’t

96.6% on LongMemEval R@5 — this is the headline number and it’s legitimate. Independently reproduced on an M2 Ultra in under five minutes. Zero API calls. No cloud. Raw verbatim ChromaDB retrieval on 500 questions. The highest published local-only score on this benchmark.

The README itself walked back several other claims within 48 hours of launch, which I respect:

“30x lossless AAAK compression” → AAAK is actually lossy. It uses regex abbreviation, not reversible compression. More importantly, AAAK mode regresses benchmark performance — 84.2% R@5 vs 96.6% in raw mode. The headline score comes from raw verbatim storage, not AAAK. The AAAK dialect is experimental and promising for large-scale scenarios, but it’s not production-ready and the README now says so clearly.

“+34% palace boost” → Accurate in magnitude, but the framing was misleading. This is ChromaDB metadata filtering compared to unfiltered search. Useful and real — just not a novel retrieval mechanism unique to MemPalace.

Contradiction detection → Exists as fact_checker.py but isn’t wired into the knowledge graph operations yet. It’s on the roadmap.

The team’s transparency here is actually a point in their favor. They published reproducer scripts. The community found real issues. They addressed them. That’s how open source is supposed to work.


What MemPalace Genuinely Innovates

Temporal knowledge graph with validity windows. This is underappreciated in the launch coverage. MemPalace maintains a SQLite-backed entity-relationship graph where facts have expiry dates:

```python
kg.add_triple("Maya", "assigned_to", "auth-migration", valid_from="2026-01-15")
kg.add_triple("Maya", "completed", "auth-migration", valid_from="2026-02-01")

# What was true in January?
kg.query_entity("Maya", as_of="2026-01-20")
# → [Maya → assigned_to → auth-migration (active)]
```

When Maya finishes the migration, you invalidate the fact. Historical queries still find it. Present queries don’t. This is what Zep’s Graphiti does — but MemPalace uses SQLite instead of Neo4j, keeping it free and local.
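
A minimal version of validity-windowed triples is easy to build on stdlib sqlite3. The method names below mirror the snippet above, but the schema and query logic are my own sketch, not MemPalace's internals:

```python
import sqlite3

# Minimal validity-windowed triple store on stdlib sqlite3.
# A NULL valid_to means the fact is still active.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE triples (
    subject TEXT, predicate TEXT, object TEXT,
    valid_from TEXT, valid_to TEXT)""")

def add_triple(s, p, o, valid_from):
    db.execute("INSERT INTO triples VALUES (?, ?, ?, ?, NULL)",
               (s, p, o, valid_from))

def invalidate(s, p, o, valid_to):
    # Close the validity window instead of deleting the row,
    # so historical queries still find the fact.
    db.execute("""UPDATE triples SET valid_to = ?
                  WHERE subject = ? AND predicate = ? AND object = ?
                  AND valid_to IS NULL""", (valid_to, s, p, o))

def query_entity(s, as_of):
    # A fact holds at `as_of` if it started on or before that date
    # and either never ended or ended after it.
    return db.execute("""SELECT predicate, object FROM triples
                         WHERE subject = ? AND valid_from <= ?
                         AND (valid_to IS NULL OR valid_to > ?)""",
                      (s, as_of, as_of)).fetchall()

add_triple("Maya", "assigned_to", "auth-migration", "2026-01-15")
invalidate("Maya", "assigned_to", "auth-migration", "2026-02-01")
add_triple("Maya", "completed", "auth-migration", "2026-02-01")

query_entity("Maya", "2026-01-20")  # assigned_to, not completed
query_entity("Maya", "2026-02-05")  # completed, not assigned_to
```

ISO-8601 date strings compare correctly as text, which is what lets a plain SQLite table do the temporal work Neo4j does for Graphiti.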

19 MCP tools. This is the ecosystem play that matters most for adoption. MemPalace works natively with Claude Code, ChatGPT, Cursor, and Gemini CLI. The AI calls mempalace_search automatically — you don’t run commands manually after the initial setup. For developers already living in these tools, that’s the difference between a library and an invisible layer.

Auto-save hooks. Every 15 messages, a hook triggers a structured save. A PreCompact hook fires before context compression — an emergency save before the window shrinks. This solves a real and specific pain point: the moment a long conversation gets compacted, you lose the thread of what you were building.
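
The hook mechanics are simple enough to sketch. The class and callback names here are illustrative, not MemPalace's hook API:

```python
# Sketch of a message-count save hook with a pre-compaction
# emergency save. Names and threshold are illustrative.
class SaveHook:
    def __init__(self, save_fn, every=15):
        self.save_fn = save_fn
        self.every = every
        self.count = 0

    def on_message(self, message):
        self.count += 1
        if self.count % self.every == 0:
            self.save_fn(message)   # periodic structured save

    def on_pre_compact(self, context):
        # Emergency save before the context window is compressed.
        self.save_fn(context)

saves = []
hook = SaveHook(saves.append, every=15)
for i in range(31):
    hook.on_message(f"msg {i}")
# 31 messages → periodic saves fired at message 15 and message 30
```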

Token efficiency. The 4-layer wake-up architecture loads ~170 tokens of critical facts at session start (identity + current project + preferences). Then searches only when needed. Compare that to pasting your entire conversation history — which wouldn’t even fit in most context windows.


How soul.py Approaches the Same Problem Differently

soul.py (pip install soul-agent) has been in production for a few months, running on my own agents 24/7. It takes a different architectural bet.

The core difference: query routing.

When an agent receives a question, most memory systems run one retrieval path — either vector search or full-context scan. soul.py v2.0 runs a lightweight classifier first:

  • ~90% of queries → RAG — focused semantic retrieval, sub-second, low token cost
  • ~10% of queries → RLM — full exhaustive synthesis across all memory, for questions that need cross-context reasoning

```python
from hybrid_agent import HybridAgent

agent = HybridAgent()
result = agent.ask("What decisions did we make about the auth system last quarter?")
result["route"]        # "RAG" or "RLM"
result["router_ms"]    # routing latency
result["retrieval_ms"] # retrieval latency
```

The intuition: most retrieval questions have a clear target (“what did we decide about X?”) and benefit from fast focused search. A smaller fraction require synthesis across disconnected memories (“what patterns emerge across all our architecture debates?”) and need the slower exhaustive path. Routing between them reduces average cost significantly.
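
A toy version of that router fits in a dozen lines. The keyword heuristic below is my illustration of the idea — soul.py's actual classifier may work differently:

```python
# Sketch of a lightweight RAG/RLM router: cheap classification first,
# then the appropriate retrieval path. Cue list is illustrative.
import time

SYNTHESIS_CUES = ("patterns", "across all", "summarize everything",
                  "compare all", "overall themes")

def route(question: str) -> str:
    q = question.lower()
    # Broad cross-context questions go to the exhaustive RLM path;
    # everything else gets fast, focused RAG retrieval.
    return "RLM" if any(cue in q for cue in SYNTHESIS_CUES) else "RAG"

def ask(question: str) -> dict:
    t0 = time.perf_counter()
    path = route(question)
    router_ms = (time.perf_counter() - t0) * 1000
    return {"route": path, "router_ms": router_ms}

ask("What did we decide about the auth system?")["route"]     # "RAG"
ask("What patterns emerge across all our debates?")["route"]  # "RLM"
```

The design point is that the router itself must be near-free: if classification cost approached retrieval cost, routing would erase its own savings.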

The identity layer. soul.py was designed around a concept MemPalace doesn’t address: who is the agent? The SOUL.md and MEMORY.md pattern gives each agent a persistent identity — personality, operating principles, long-term curated knowledge — separate from raw conversation logs. An agent should remember not just what happened, but what it learned, and what kind of entity it is. These compound over time in a way that verbatim storage alone doesn’t support.
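
In practice the pattern is just a deterministic prompt assembly step at session start. The loader below is my own sketch of that step, not soul.py's API — only the SOUL.md / MEMORY.md file names come from the pattern itself:

```python
# Sketch: assemble a system prompt from persistent identity files.
# Identity (SOUL.md) loads first, curated knowledge (MEMORY.md) second.
from pathlib import Path

def build_system_prompt(agent_dir: str) -> str:
    parts = []
    for name in ("SOUL.md", "MEMORY.md"):
        f = Path(agent_dir) / name
        if f.exists():
            parts.append(f.read_text())
    return "\n\n".join(parts)
```

Because the files are curated rather than appended-to automatically, what the agent "is" stays stable across sessions even as its raw conversation logs grow.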

Zero-dependency mode. pip install soul-agent (no extras) runs entirely on markdown files with no vector database. For developers who want persistent agent memory without standing up Qdrant or ChromaDB, this is the entry point. It falls back to BM25 keyword search if no vector backend is configured.
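
A zero-dependency keyword fallback of this kind is typically standard Okapi BM25, which fits in pure Python. The scorer below uses the textbook formula with common default parameters — I'm not claiming these are soul.py's exact values:

```python
# Minimal Okapi BM25 ranking in pure Python. Standard formula;
# k1/b defaults are the common textbook values.
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
            score += idf * norm
        scores.append(score)
    # Indices of documents, best-scoring first.
    return sorted(range(n), key=lambda i: -scores[i])

docs = ["we chose postgres for auth sessions",
        "billing retries use exponential backoff",
        "frontend moved to typescript"]
bm25_rank("postgres auth", docs)[0]  # → 0, the only matching document
```

It won't match embedding-based recall on paraphrased queries, but it needs no model, no index server, and no network — which is the whole point of a zero-dependency mode.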

| Feature | MemPalace | soul.py |
| --- | --- | --- |
| Storage | ChromaDB (vector) | Qdrant + BM25 fallback |
| Retrieval strategy | Hierarchical metadata filtering | RAG + RLM query router |
| Compression | AAAK (lossy, experimental) | Markdown-native |
| MCP tools | 19 native tools | Roadmap |
| Agent identity layer | Not addressed | SOUL.md + MEMORY.md |
| Temporal knowledge graph | SQLite, validity windows | Not yet |
| Zero-dep mode | No (requires ChromaDB) | Yes |
| Benchmark | 96.6% LongMemEval (raw) | Not yet published |
| Install | pip install mempalace | pip install soul-agent |
| License | MIT | MIT |

What the Whole Space Is Still Missing

Both projects, and most AI memory systems, have the same blind spots:

1. Benchmarks that test agents, not retrieval. LongMemEval tests whether a system can retrieve the right piece of information across 500 questions over long conversation histories. It doesn’t test whether an agent with persistent memory actually performs better at multi-session tasks. These are different problems. A system can score 96.6% on retrieval and still fail in practice if the retrieved context doesn’t integrate well with reasoning.

2. Memory for reasoning, not just facts. Current systems store what happened. Few capture why a decision was made, what tradeoffs were considered, or what was tried and failed. Those are the things that actually prevent teams from repeating mistakes.

3. Forgetting. Both MemPalace and soul.py are better at remembering than forgetting. A mature memory system needs principled forgetting — decay curves, relevance scoring over time, archiving vs deletion. When you’ve been working with an agent for two years, the initial setup conversations shouldn’t have equal weight with last week’s architecture decisions.
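
One plausible shape for principled forgetting is time-decayed relevance: keep everything, but downweight old memories at ranking time. Neither project implements this today — the exponential-decay formula below is a generic sketch with a hypothetical half-life parameter:

```python
# Sketch of time-decayed relevance scoring. Exponential decay with a
# configurable half-life; the 90-day default is an arbitrary example.
def decayed_score(similarity, age_days, half_life_days=90):
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

# A two-year-old setup note loses at ranking time to a fresher,
# slightly less similar memory.
old = decayed_score(0.90, age_days=730)
recent = decayed_score(0.75, age_days=7)
```

Decay for ranking, combined with archiving rather than deletion, would preserve the verbatim-storage philosophy while still letting recent context win.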

4. Cross-agent memory. Most systems assume one agent, one user. Real production systems have multiple agents with overlapping knowledge. A shared memory layer that multiple agents can read from and write to — with attribution and conflict resolution — is unsolved.


The Takeaway

MemPalace is a genuine contribution, not just a viral moment. The core idea — verbatim storage + structural organization instead of LLM extraction — is well-reasoned and the benchmark is reproducible. The temporal knowledge graph is underrated. The MCP integration is the right ecosystem bet.

The overstated claims (AAAK compression, “palace boost” novelty) were corrected quickly, which is the right response. The team is engaged, the issues are being tracked, and the architecture is sound.

What it doesn’t solve: agent identity, query routing intelligence, or principled forgetting. Those are the gaps soul.py is working in.

The broader memory space is still early. 96.6% on a retrieval benchmark is impressive. Building an agent that actually benefits from persistent memory — one whose decisions compound in quality over months — is the harder problem. Nobody’s solved that yet.


FAQ

What is MemPalace? MemPalace is an open-source AI memory system (pip install mempalace) that stores all AI conversations verbatim in ChromaDB and organizes them into a hierarchical structure — wings (projects/people), rooms (topics), closets (summaries), and drawers (original content). It scored 96.6% on the LongMemEval retrieval benchmark without any API calls.

How does MemPalace compare to Mem0 and Zep? Mem0 and Zep use LLM extraction to decide what to remember, scoring around 85% on LongMemEval but charging $19–249/month and $25/month respectively. MemPalace stores everything verbatim, scores 96.6% in raw mode, and is completely free and local. The tradeoff is storage volume: verbatim storage grows large over time.

What is the LongMemEval benchmark? LongMemEval is an academic benchmark with 500 questions testing an AI system’s ability to retrieve information from long conversation histories. It’s the current standard for evaluating AI memory retrieval accuracy. MemPalace’s 96.6% R@5 is the highest published score for a local, zero-API-call system.

Is AAAK compression really 30x? No — the original README overstated this. AAAK is a lossy abbreviation system, not lossless compression. It also currently regresses benchmark performance (84.2% vs 96.6% in raw mode). AAAK is an experimental feature designed for scenarios with many repeated entities at scale, not a production default.

What is soul.py? soul.py is an open-source agent memory library (pip install soul-agent) that takes a different approach from MemPalace: a RAG + RLM hybrid with a query router that directs ~90% of queries to fast focused retrieval and ~10% to exhaustive synthesis. It also introduces a SOUL.md + MEMORY.md pattern for persistent agent identity — not just what an agent remembers, but who it is.

Can I use soul.py without a vector database? Yes. pip install soul-agent (no extras) runs entirely on markdown files with BM25 keyword search as a fallback. Vector retrieval with Qdrant requires pip install soul-agent[anthropic] or [openai] for embeddings.

What is RLM in soul.py? RLM (Retrieval with Language Model) is an exhaustive synthesis mode that reads across all stored memory to answer questions requiring cross-context reasoning — like “what patterns have emerged across all our architecture discussions?” The query router directs only ~10% of questions here because it’s more expensive; the other 90% go to faster RAG retrieval.

Does MemPalace work with Claude Code? Yes. MemPalace ships 19 MCP (Model Context Protocol) tools that integrate natively with Claude Code, ChatGPT, Cursor, and Gemini CLI. After setup, the AI calls memory tools automatically — you don’t run manual CLI commands during sessions.

What’s missing from current AI memory systems? Most systems (including both MemPalace and soul.py) lack: (1) principled forgetting/decay so older memories don’t crowd out recent context, (2) cross-agent shared memory with conflict resolution, (3) benchmarks that measure task performance rather than just retrieval accuracy, and (4) memory that captures reasoning and tradeoffs, not just facts.