Long Conversations
LoCoMo tests memory systems on 10 very long, multi-session conversational transcripts, each spanning hundreds of turns, to evaluate how well systems retain and recall information over time.
[ Head-to-Head Benchmarks ]
Same datasets. Same metrics. Same machine. MemBlock Hybrid achieves 97.2% Recall@5 and 87.6% Recall@1 on LongMemEval, beating MemPalace at every Recall@k cutoff with zero API calls.
[ LongMemEval Benchmark ]
Retrieval recall across ~53 conversation sessions per question; Recall@k is the fraction of questions for which a relevant session appears in the top k retrieved sessions. Higher is better.
| Metric | MemBlock Hybrid | MemBlock Basic | MemPalace Raw |
|---|---|---|---|
| Recall@1 | 87.6% | 72.0% | 80.6% |
| Recall@3 | 95.0% | 90.8% | 92.6% |
| Recall@5 | 97.2% | 94.4% | 96.6% |
| Recall@10 | 98.6% | 97.8% | 98.2% |
| Question Type | MemBlock Hybrid | MemBlock Basic | MemPalace Raw |
|---|---|---|---|
| knowledge-update | 100% | 98.7% | 100% |
| multi-session | 98.5% | 97% | 96.6% |
| single-session-assistant | 96.4% | 96.4% | 96.4% |
| single-session-preference | 93.3% | 96.7% | 96.7% |
| single-session-user | 100% | 90% | 97.1% |
| temporal-reasoning | 94% | 90.2% | 96.6% |
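
At session granularity, Recall@k reduces to a few lines of Python. The sketch below is illustrative only, assuming one gold session per question; `gold` and `ranked` are hypothetical names, not MemBlock internals.

```python
# Minimal Recall@k sketch: the fraction of questions whose gold session
# appears among the top-k retrieved sessions. Assumes one gold session
# per question; names here are illustrative, not from the MemBlock codebase.

def recall_at_k(gold: list[str], ranked: list[list[str]], k: int) -> float:
    hits = sum(1 for g, r in zip(gold, ranked) if g in r[:k])
    return hits / len(gold)

# Example: 2 of 3 questions recover their gold session within the top 3.
gold = ["s1", "s7", "s2"]
ranked = [["s1", "s4", "s9"], ["s3", "s7", "s5"], ["s8", "s6", "s4"]]
print(recall_at_k(gold, ranked, k=3))  # ≈ 0.667
```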
[ LoCoMo LLM-as-Judge ]
End-to-end accuracy using Claude Sonnet 4 as judge. Measures whether MemBlock-retrieved context produces answers semantically equivalent to a full-conversation baseline, reported as LLM-as-Judge accuracy (%).
Questions span single-hop factual recall, multi-hop inference across conversations, temporal ordering, open-domain generation, and adversarial unanswerable queries.
Published at ACL 2024, LoCoMo is a widely used benchmark for evaluating long-term memory in conversational AI systems. We use LLM-as-Judge scoring for semantic evaluation.
MemBlock retrieves only what matters — fewer tokens, faster responses, same accuracy.
LongMemEval — 500 questions across ~53 conversation sessions each. LoCoMo — 10 long conversations with 1,986 QA pairs across 5 reasoning categories. Both are published academic benchmarks.
Retrieval benchmarks report Recall@k and NDCG@k, computed for every system on the same dataset and machine. LLM-as-Judge evaluation uses Claude Sonnet 4 to score semantic equivalence on a 1-5 scale.
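
As a rough illustration of the judging step, the sketch below calls the Anthropic `messages` API; the prompt wording, model ID string, and single-digit parsing are assumptions for illustration, not MemBlock's actual judge harness.

```python
# Hedged sketch of an LLM-as-Judge call; prompt and parsing are assumed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_equivalence(question: str, reference: str, candidate: str) -> int:
    """Ask the judge model to rate semantic equivalence on a 1-5 scale."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "On a 1-5 scale, how semantically equivalent is the candidate to the "
        "reference? Reply with a single digit."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; adjust as needed
        max_tokens=4,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.content[0].text.strip())
```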
All benchmarks use session-level granularity with local embeddings (all-MiniLM-L6-v2). No API calls required for retrieval benchmarks. Scripts included in the repository under benchmarks/.
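
The local retrieval path described above can be reproduced with `sentence-transformers`; the sessions and query below are placeholders, and the top-k search shown is plain cosine similarity rather than MemBlock's actual index.

```python
# Sketch of session-level retrieval with local embeddings (no API calls).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # local embedding model

sessions = [
    "Session 1: user discusses moving to Berlin next spring...",
    "Session 2: user asks for help with Python asyncio...",
    "Session 3: user mentions a new job title at work...",
]
question = "Where did the user say they were moving?"

# Normalized embeddings make the dot product equal to cosine similarity.
session_vecs = model.encode(sessions, normalize_embeddings=True)
query_vec = model.encode([question], normalize_embeddings=True)[0]

scores = session_vecs @ query_vec
top_k = np.argsort(-scores)[:2]  # indices of the 2 best-matching sessions
print([sessions[i] for i in top_k])
```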