soul.py on LoCoMo: What 70% Accuracy With Zero Dependencies Actually Means
We ran soul.py through the LoCoMo benchmark — 1,986 questions across 10 long conversations designed by Snap Research to test how well AI systems remember things. We tested all five retrieval configurations side by side. This is the first reproducible multi-config benchmark comparison for soul.py, and the full results are published on GitHub Pages.
Let’s be upfront: we didn’t beat state of the art. But we think the results tell an interesting story anyway.
The Results
| Configuration | Overall | Single-Hop | Multi-Hop | Open-Domain | Temporal |
|---|---|---|---|---|---|
| RLM | 70.0% | 54.1% | 82.1% | 55.1% | 40.0% |
| Hybrid | 65.6% | — | — | — | — |
| Auto | 64.1% | — | — | — | — |
| Qdrant (RAG) | 63.4% | — | — | — | — |
| BM25 | 63.1% | — | — | — | — |
For context, current SOTA on LoCoMo is ByteRover 2.0 at 92.2%. Other strong systems include Memobase (~85%) and Hindsight (~83%). We’re not in that league — and that’s fine, because we’re playing a different game.
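The headline gaps in this post are plain differences in percentage points, and they can be checked directly from the numbers above. The short sketch below only uses values already quoted in the table and the SOTA figure; the variable and function names are ours, chosen for illustration.

```python
# Overall accuracy (%) copied from the results table above.
overall = {
    "RLM": 70.0,
    "Hybrid": 65.6,
    "Auto": 64.1,
    "Qdrant (RAG)": 63.4,
    "BM25": 63.1,
}
SOTA = 92.2  # ByteRover 2.0, as cited above


def gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


rlm_vs_rag = gap(overall["RLM"], overall["Qdrant (RAG)"])
sota_vs_rlm = gap(SOTA, overall["RLM"])
spread = gap(max(overall.values()), min(overall.values()))

print(f"RLM over pure RAG:    +{rlm_vs_rag} points")   # +6.6
print(f"SOTA over RLM:        +{sota_vs_rlm} points")  # +22.2
print(f"Best to worst config: {spread} points")        # 6.9
```

All five configurations land within about seven points of each other, while the distance to SOTA is roughly three times that.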
What’s Actually Interesting Here
1. Zero dependencies, no server, still 70%.
soul.py is a single Python file. No vector database. No Redis. No Elasticsearch. No Docker compose file with six services. You pip install soul-agent and you’re done. The entire memory system — ingestion, indexing, retrieval, reflection — runs in-process.
Getting 70% on a serious academic benchmark with that architecture isn’t world-beating, but it’s not trivial either. It means the core approach works. A flat-file memory system with thoughtful retrieval can handle complex conversational recall without the infrastructure overhead that most memory systems require.
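To make "flat-file memory with in-process retrieval" concrete, here is a minimal sketch of keyword retrieval over plain text lines using only the Python standard library. It is not soul.py's actual code: the function names, the sample memories, and the choice of Okapi BM25 with common default parameters are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75):
    """Score each document with Okapi BM25; return (score, index) pairs, best first."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, i))
    # Highest score first; ties broken by original order.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical memory lines, as they might appear in a plain text file.
memories = [
    "2023-05-01 Alice adopted a dog named Biscuit.",
    "2023-05-03 Bob started a pottery class.",
    "2023-05-07 Alice took Biscuit to the vet.",
]
ranked = bm25_rank("What is the name of Alice's dog?", memories)
print(memories[ranked[0][1]])
```

Everything here runs in-process with no index server: the "index" is just token counts rebuilt from the lines on each call, which is workable for small memory files and is the kind of trade-off a zero-dependency design accepts.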
2. RLM adds 6.6 points over pure RAG.
The gap between Qdrant (pure vector RAG, 63.4%) and RLM (Reflective Layered Memory, 70.0%) is the most interesting signal in these results. Both run over the same underlying conversations. The difference is that RLM builds structured memory layers (observations, reflections, and higher-order abstractions) on top of the raw conversation data.
That 6.6-point improvement comes from how memories are organized, not from better embeddings or a fancier retrieval engine. It supports the core thesis of soul.py: that memory architecture matters more than retrieval infrastructure.
The multi-hop score (82.1%) is particularly telling. Questions that require synthesizing information across multiple conversation turns are exactly where layered reflection should help — and it does.
3. Five configs, fully reproducible.
We didn’t just test one setup and call it a day. All five retrieval configurations — RLM, Hybrid, Auto, Qdrant, and BM25 — were run against the same 1,986 questions under identical conditions. The benchmark harness, scoring methodology, and per-question results are all published. Anyone can reproduce these numbers or run their own configurations against the same test set.
This matters because benchmarking in the AI memory space is still inconsistent. Different papers test on different subsets, use different scoring methods, and rarely publish per-question breakdowns. We wanted to do better.
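For readers who want to recompute category scores from per-question results, the aggregation can be sketched as below. This is an assumption about the shape of the data, not the published harness: we assume each question reduces to a (category, correct) pair and that the overall score is computed over all questions rather than as a mean of category scores. The function name and toy records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Map (category, is_correct) pairs to {category: accuracy %}, plus 'Overall'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    report = {c: round(100 * correct[c] / totals[c], 1) for c in totals}
    n = sum(totals.values())
    # Overall is weighted by question count, so larger categories move it more.
    report["Overall"] = round(100 * sum(correct.values()) / n, 1) if n else 0.0
    return report


# Toy records for illustration only; these are not benchmark data.
toy = [
    ("Single-Hop", True),
    ("Single-Hop", False),
    ("Multi-Hop", True),
    ("Multi-Hop", True),
    ("Temporal", False),
]
report = accuracy_by_category(toy)
print(report)
```

Under this scheme the overall number depends on how many questions each category contains, which is one reason per-question breakdowns make results easier to compare across papers.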
Where We Fall Short
Temporal reasoning is the weakest category at 40.0%. Questions like “what did they discuss last Tuesday” or “which came first” require time-aware retrieval that soul.py doesn’t yet handle well. This is a known gap and an active area of development.
The 22-point gap to SOTA is real. Systems like ByteRover 2.0 use sophisticated multi-agent architectures, web-scale retrieval, and significantly more compute. They’re solving a different optimization problem — maximum accuracy regardless of complexity. soul.py optimizes for simplicity and portability first, accuracy second.
What’s Next
These results give us a clear map of where to invest. Temporal retrieval is the obvious priority: at 40.0% it is our weakest category, so even a partial improvement there would lift the overall score. We’re also exploring whether lightweight re-ranking can bridge some of the distance to more complex systems without sacrificing the zero-dependency constraint.
The full benchmark results, including per-question breakdowns and category analysis, are available at menonpg.github.io/soul-benchmarks. The system architecture and design rationale are described in our arXiv paper.
If you want to try soul.py yourself: pip install soul-agent. That’s it. No infrastructure required.