Results

mnem vs MemPalace published numbers. Dense retrieval (vector + top-k); hybrid-v4 row mirrors MemPalace’s harness helper. No LLM rerank.

ONNX MiniLM-L6-v2 (bundled, in-process). 4 cores per lane.

| Benchmark | Split | Metric | MP | mnem | Δ vs MP | Latency (ms) |
|---|---|---|---|---|---|---|
| LongMemEval | 500 Q (full) | R@5 session | 0.966 | 0.966 | ±0 | 711 (retr) |
| LongMemEval | 500 Q (full) | R@10 session | 0.982 | 0.982 | ±0 | 711 (retr) |
| LoCoMo | 1986 Q (full) | R@5 session | 0.508 | $\color{green}{\textbf{0.726}}$ | +0.218 | 333 (retr) |
| LoCoMo | 1986 Q (full) | R@10 session | 0.603 | $\color{green}{\textbf{0.855}}$ | +0.252 | 333 (retr) |
| ConvoMem | 5 cat × 50 items (250) | avg recall | 0.929 | $\color{green}{\textbf{0.976}}$ | +0.047 | 398 (retr) |
| MemBench | simple/roles, 100 items | R@5 | 0.840 | $\color{green}{\textbf{0.960}}$ | +0.120 | 1874 (e2e) |
| MemBench | highlevel/movie, 100 items | R@5 | 0.950 | $\color{green}{\textbf{1.000}}$ | +0.050 | 491 (e2e) |
| LongMemEval | 500 Q, Hybrid v4 | R@5 session | 0.982 | $\color{red}{\textbf{0.976}}$ | -0.006 | 729 (retr) |

(retr) = retrieve-only mean (from summary timing). (e2e) = end-to-end mean (runtime / n) when adapter doesn’t expose phase timing.
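The two latency conventions can be sketched as follows. This is a minimal illustration, not the harness's own code: the function names and the sample numbers are made up, and it assumes per-query retrieve timings are in milliseconds and total runtime is in seconds.

```python
def retrieve_only_mean_ms(retrieve_ms: list[float]) -> float:
    """(retr): mean of per-query retrieve-phase timings, in milliseconds."""
    return sum(retrieve_ms) / len(retrieve_ms)


def end_to_end_mean_ms(total_runtime_s: float, n_items: int) -> float:
    """(e2e): whole-run wall time divided by item count, in milliseconds.

    Used when the adapter does not expose per-phase timing, so ingest and
    any other per-item work are included in the figure.
    """
    return total_runtime_s * 1000.0 / n_items


# Illustrative values only, not taken from the benchmark artifacts.
print(retrieve_only_mean_ms([300.0, 350.0, 400.0]))  # 350.0
print(end_to_end_mean_ms(50.0, 100))                 # 500.0
```

Because (e2e) figures include more than the retrieve phase, they are not directly comparable with (retr) figures in the same column.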

Headlines

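  • Matches MemPalace exactly on LongMemEval (R@5 0.966 / R@10 0.982).
  • Beats MemPalace by +0.218 (R@5) / +0.252 (R@10) on LoCoMo session-level retrieval.
  • Beats MemPalace by +0.047 average recall on ConvoMem.
  • Beats MemPalace by +0.120 (simple/roles) / +0.050 (highlevel/movie) R@5 on MemBench.
  • Trails MemPalace by 0.006 on LongMemEval Hybrid v4 (0.976 vs 0.982, no LLM rerank).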

Raw artifacts

Per-bench JSON + JSONL in benchmarks/results/v0.1.0/. Each artifact carries the question, the gold set, the retrieved top-K, and per-item recall.
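As a rough sketch of how a headline number could be re-derived from a JSONL artifact, the snippet below recomputes mean recall@k from per-item records. The field names `gold` and `retrieved` are hypothetical and may not match the actual schema. The recall definition (fraction of gold ids found in the top k) is also an assumption, since some harnesses count a hit when any gold id appears.

```python
import json
from typing import Iterable


def recall_at_k(gold: Iterable[str], retrieved: list[str], k: int) -> float:
    """Fraction of gold ids present in the first k retrieved ids."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(retrieved[:k])) / len(gold_set)


def mean_recall(jsonl_lines: Iterable[str], k: int) -> float:
    """Average recall@k over JSONL records, skipping blank lines."""
    scores = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # "gold" and "retrieved" are assumed field names, not the real schema.
        scores.append(recall_at_k(rec["gold"], rec["retrieved"], k))
    return sum(scores) / len(scores) if scores else 0.0


# Two made-up records standing in for lines of an artifact file.
sample = [
    json.dumps({"question": "q1", "gold": ["s1"], "retrieved": ["s3", "s1", "s7"]}),
    json.dumps({"question": "q2", "gold": ["s2", "s4"], "retrieved": ["s4", "s9", "s2"]}),
]
print(mean_recall(sample, k=2))  # 0.75
```

To run it against a real artifact, pass an open file handle for one of the JSONL files in place of `sample`, after adjusting the field names to the actual schema.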

Reproduce

See Reproduce. One command:

bash benchmarks/harness/run_bench.sh