Long Conversations
LoCoMo tests memory systems on 10 very long, multi-session conversational transcripts, each spanning hundreds of turns, to evaluate how well systems retain and recall information over time.
[ Head-to-Head Benchmarks ]
Same datasets. Same metrics. Same machine. MemBlock Hybrid achieves 97.2% Recall@5 and 87.6% Recall@1 on LongMemEval, beating MemPalace at every Recall@k cutoff with zero API calls.
[ LongMemEval Benchmark ]
Retrieval recall across ~53 conversation sessions per question; Recall@k is the fraction of questions for which a relevant session appears in the top k retrieved sessions. Higher is better.
| Metric | MemBlock Hybrid | MemBlock Basic | MemPalace Raw |
|---|---|---|---|
| Recall@1 | 87.6% | 72.0% | 80.6% |
| Recall@3 | 95.0% | 90.8% | 92.6% |
| Recall@5 | 97.2% | 94.4% | 96.6% |
| Recall@10 | 98.6% | 97.8% | 98.2% |
| Question Type | MemBlock Hybrid | MemBlock Basic | MemPalace Raw |
|---|---|---|---|
| knowledge-update | 100% | 98.7% | 100% |
| multi-session | 98.5% | 97% | 96.6% |
| single-session-assistant | 96.4% | 96.4% | 96.4% |
| single-session-preference | 93.3% | 96.7% | 96.7% |
| single-session-user | 100% | 90% | 97.1% |
| temporal-reasoning | 94% | 90.2% | 96.6% |
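
At session granularity, Recall@k reduces to a few lines of Python. The sketch below is illustrative only, assuming one gold session per question; `gold` and `ranked` are hypothetical names, not MemBlock internals.

```python
# Minimal Recall@k sketch: the fraction of questions whose gold session
# appears among the top-k retrieved sessions. Assumes one gold session
# per question; names here are illustrative, not from the MemBlock codebase.

def recall_at_k(gold: list[str], ranked: list[list[str]], k: int) -> float:
    hits = sum(1 for g, r in zip(gold, ranked) if g in r[:k])
    return hits / len(gold)

# Example: 2 of 3 questions recover their gold session within the top 3.
gold = ["s1", "s7", "s2"]
ranked = [["s1", "s4", "s9"], ["s3", "s7", "s5"], ["s8", "s6", "s4"]]
print(recall_at_k(gold, ranked, k=3))  # ≈ 0.667
```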
[ LoCoMo LLM-as-Judge ]
End-to-end accuracy using Claude Sonnet 4 as judge. Measures whether MemBlock-retrieved context produces answers semantically equivalent to a full-conversation baseline, reported as LLM-as-Judge accuracy (%).
Questions span single-hop factual recall, multi-hop inference across conversations, temporal ordering, open-domain generation, and adversarial unanswerable queries.
Published at ACL 2024, LoCoMo is a widely used benchmark for evaluating long-term memory in conversational AI systems. We use LLM-as-Judge scoring for semantic evaluation.
MemBlock retrieves only what matters — fewer tokens, faster responses, same accuracy.
LongMemEval — 500 questions across ~53 conversation sessions each. LoCoMo — 10 long conversations with 1,986 QA pairs across 5 reasoning categories. Both are published academic benchmarks.
Retrieval benchmarks report Recall@k and NDCG@k, computed for every system on the same dataset and machine. LLM-as-Judge evaluation uses Claude Sonnet 4 to score semantic equivalence on a 1-5 scale.
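
As a rough illustration of the judging step, the sketch below calls the Anthropic `messages` API; the prompt wording, model ID string, and single-digit parsing are assumptions for illustration, not MemBlock's actual judge harness.

```python
# Hedged sketch of an LLM-as-Judge call; prompt and parsing are assumed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_equivalence(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model to rate semantic equivalence on a 1-5 scale."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "On a 1-5 scale, how semantically equivalent is the candidate to the "
        "reference? Reply with a single digit."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; adjust as needed
        max_tokens=4,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.content[0].text.strip())
```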
All benchmarks use session-level granularity with local embeddings (all-MiniLM-L6-v2). No API calls required for retrieval benchmarks. Scripts included in the repository under benchmarks/.
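
The local retrieval path described above can be reproduced with `sentence-transformers`; the sessions and query below are placeholders, and the top-k search shown is plain cosine similarity rather than MemBlock's actual index.

```python
# Sketch of session-level retrieval with local embeddings (no API calls).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model

sessions = [
    "Session 1: user discusses moving to Berlin next spring...",
    "Session 2: user asks for help with Python asyncio...",
    "Session 3: user mentions a new job title at work...",
]
question = "Where did the user say they were moving?"

# Normalized embeddings make the dot product equal to cosine similarity.
session_vecs = model.encode(sessions, normalize_embeddings=True)
query_vec = model.encode([question], normalize_embeddings=True)[0]

scores = session_vecs @ query_vec
top_k = np.argsort(-scores)[:2]  # indices of the 2 best-matching sessions
print([sessions[i] for i in top_k])
```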