[ Head-to-Head Benchmarks ]

MemBlock vs MemPalace

Same datasets. Same metrics. Same machine. MemBlock Hybrid achieves 97.2% Recall@5 and 87.6% Recall@1 on LongMemEval, ahead of MemPalace on every overall Recall@k metric, with zero API calls.

[ LongMemEval Benchmark ]

LongMemEval — 500 Questions

Retrieval recall across ~53 conversation sessions per question. Higher is better.

97.2%  MemBlock Hybrid Recall@5
87.6%  MemBlock Hybrid Recall@1
96.6%  MemPalace Raw Recall@5

Metric      MemBlock Hybrid   MemBlock Basic   MemPalace Raw
Recall@1    87.6%             72.0%            80.6%
Recall@3    95.0%             90.8%            92.6%
Recall@5    97.2%             94.4%            96.6%
Recall@10   98.6%             97.8%            98.2%
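
For reference, Recall@k here is the fraction of questions whose answer-bearing session appears among the top k retrieved sessions. A minimal sketch, assuming each question carries a set of gold session ids (field names are illustrative, not the benchmark scripts' schema):

```python
def recall_at_k(results, k):
    """Recall@k over a list of questions.

    `results` uses illustrative keys:
      {"gold": set of gold session ids,
       "retrieved": ranked list of session ids}
    A question counts as a hit if any gold session appears in the top k.
    """
    hits = sum(1 for r in results if r["gold"] & set(r["retrieved"][:k]))
    return hits / len(results)
```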

Per-Type Recall@5

Question Type               MemBlock Hybrid   MemBlock Basic   MemPalace Raw
knowledge-update            100%              98.7%            100%
multi-session               98.5%             97%              96.6%
single-session-assistant    96.4%             96.4%            96.4%
single-session-preference   93.3%             96.7%            96.7%
single-session-user         100%              90%              97.1%
temporal-reasoning          94%               90.2%            96.6%

[ LoCoMo LLM-as-Judge ]

93% Accuracy — LLM-as-Judge Evaluation

End-to-end accuracy using Claude Sonnet 4 as judge. Measures whether MemBlock-retrieved context produces semantically equivalent answers to full-conversation baseline.
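
A minimal sketch of the judging loop, assuming the Anthropic Python SDK; the model id, the prompt wording, and the score-4 pass threshold are assumptions, not the exact harness used here:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """Score how well the candidate answer matches the reference
answer on a 1-5 scale (5 = semantically equivalent, 1 = unrelated).
Question: {question}
Reference answer (full-conversation baseline): {reference}
Candidate answer (retrieved-context only): {candidate}
Reply with a single digit from 1 to 5."""

def judge_score(question: str, reference: str, candidate: str) -> int:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=4,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(resp.content[0].text.strip()[0])

# Accuracy = fraction of questions judged equivalent (here: score >= 4).
```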

Results by Category

LLM-as-Judge accuracy (%) — how well MemBlock-retrieved context matches full-conversation answers.

Category      Accuracy
Adversarial   100%
Single-Hop    95%
Open Domain   92%
Temporal      90%
Multi-Hop     90%
Overall       93%

What is the LoCoMo Benchmark?

Long Conversations

LoCoMo tests memory systems on very long multi-session conversational transcripts, each spanning hundreds of turns across many sessions, to evaluate how well systems retain and recall information over time.

Five Reasoning Categories

Questions span single-hop factual recall, multi-hop inference across conversations, temporal ordering, open-domain generation, and adversarial unanswerable queries.

Industry Standard

Published at ACL 2024, LoCoMo is a widely used benchmark for evaluating long-term memory in conversational AI systems. We use LLM-as-Judge scoring for semantic evaluation.

Context Efficiency

MemBlock retrieves only what matters — fewer tokens, faster responses, same accuracy.

~2K    Avg tokens per prompt (vs 18K+ for full context)
1.2s   p95 end-to-end latency (retrieval + LLM response)
85%    Token savings (vs stuffing the full conversation)
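
The savings figure is simple arithmetic over measured prompt sizes; a sketch of the computation (the function name is illustrative):

```python
def token_savings(retrieved_tokens: float, full_context_tokens: float) -> float:
    """Fraction of prompt tokens avoided by sending a small retrieved
    context instead of stuffing the whole conversation into the prompt."""
    return 1.0 - retrieved_tokens / full_context_tokens
```

The quoted 85% is averaged per question across the benchmark; per-question full-context sizes vary widely, so plugging in the two headline averages alone will not reproduce it exactly.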

Methodology

Datasets

LongMemEval — 500 questions across ~53 conversation sessions each. LoCoMo — 10 long conversations with 1,986 QA pairs across 5 reasoning categories. Both are published academic benchmarks.

Evaluation

Retrieval benchmarks report Recall@k and NDCG@k, computed for all systems on the same dataset and the same machine. LLM-as-Judge evaluation uses Claude Sonnet 4 to score semantic equivalence on a 1-5 scale.
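
For reference, a textbook NDCG@k with binary relevance (a retrieved session is relevant iff it is a gold session); this is the standard formula, not necessarily byte-identical to the scripts in benchmarks/:

```python
import math

def ndcg_at_k(retrieved: list, gold: set, k: int) -> float:
    # Discounted cumulative gain over the top-k ranked session ids.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, sid in enumerate(retrieved[:k]) if sid in gold)
    # Ideal DCG: all gold sessions ranked first.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold), k)))
    return dcg / idcg if idcg else 0.0
```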

Reproducibility

All benchmarks use session-level granularity with local embeddings (all-MiniLM-L6-v2). No API calls required for retrieval benchmarks. Scripts included in the repository under benchmarks/.
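
A minimal sketch of the local, API-free retrieval path described above, using sentence-transformers with all-MiniLM-L6-v2; the ranking helper is illustrative, the real scripts live in benchmarks/:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # local model, no API calls

def rank_sessions(question: str, sessions: list[str]) -> list[int]:
    """Return session indices ordered by cosine similarity to the question."""
    q = model.encode([question], normalize_embeddings=True)
    s = model.encode(sessions, normalize_embeddings=True)
    scores = (s @ q.T).ravel()           # cosine similarity on unit vectors
    return np.argsort(-scores).tolist()  # best match first
```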