Methodology

Every published number ships with the harness, the dataset hash, and the raw artifacts. If you cannot reproduce a number, that is a bug.
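As a minimal sketch of checking a dataset hash before a run: this assumes the published hash is a SHA-256 digest over the raw bytes of the dataset file, which this page does not state. The helper name `dataset_sha256` is hypothetical and is not part of the harness.

```python
import hashlib
from pathlib import Path


def dataset_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Assumption: the published dataset hash is SHA-256 over the raw file
    bytes. Swap in another hashlib algorithm if the harness uses one.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Stream the file so large datasets do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the returned string against the hash recorded with a published run shows whether the local copy of the dataset is byte-identical to the one that was scored.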

Dataset matrix

| Dataset | Version | n queries | Source |
|---|---|---|---|
| LongMemEval | longmemeval_s_cleaned.json | 500 | xiaowu0162/longmemeval-cleaned |
| LoCoMo | locomo10.json | 1986 (session-level) | snap-research/LoCoMo |
| ConvoMem | 5 cat × 50 items (250) | 250 | Salesforce/ConvoMem |
| MemBench simple/roles | 100 items | 100 | import-myself/Membench |
| MemBench highlevel/movie | 100 items | 100 | import-myself/Membench |

Embedder

ONNX MiniLM-L6-v2 (sentence-transformers/all-MiniLM-L6-v2 via Xenova/all-MiniLM-L6-v2), bundled in-process via the onnx-bundled feature. No network calls, no API keys, no per-call model load.

Hardware

Each lane is pinned to 4 cores (cpuset 0-3 / 4-7 / 8-11 / 12-15) with MNEM_ORT_INTRA_THREADS=4 and a memory cap of 3 GiB. The bench host is documented per run in benchmarks/results/.

Scoring

| Metric | Definition |
|---|---|
| R@K | hit if any gold item is in top-K retrieved |
| avg recall | mean per-item recall (ConvoMem) |
| Hybrid v4 | dense + sparse score boost (mirrors MP harness helper) |
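The first two definitions can be sketched in a few lines. This is an illustration only, not the code in benchmarks/harness/. The function names are hypothetical. It assumes "per-item recall" means the fraction of an item's gold entries found among the retrieved results, and that the same top-K cutoff applies to it. Hybrid v4 is left out because its formula is not given here.

```python
def hit_at_k(retrieved, gold, k):
    """R@K for one query: 1.0 if any gold item is in the top-k results."""
    gold_set = set(gold)
    return 1.0 if any(item in gold_set for item in retrieved[:k]) else 0.0


def item_recall(retrieved, gold, k):
    """Per-item recall: fraction of gold items found in the top-k results.

    Assumption: items with no gold entries score 0.0.
    """
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(retrieved[:k])) / len(gold_set)


def mean(values):
    """Arithmetic mean, 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two toy queries: (ranked retrieved ids, gold ids).
runs = [
    (["a", "b", "c"], ["c"]),
    (["d", "e", "f"], ["x", "d"]),
]

r_at_1 = mean(hit_at_k(r, g, 1) for r, g in runs)
r_at_3 = mean(hit_at_k(r, g, 3) for r, g in runs)
avg_recall_3 = mean(item_recall(r, g, 3) for r, g in runs)
```

The second query shows how the two metrics differ: it counts as a full hit for R@3 because one gold item was retrieved, but contributes only 0.5 to average recall because the other gold item was missed.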

Apples-to-apples pledge

  • Same dataset version, same query count.
  • Same scoring code (benchmarks/harness/).
  • No secret post-filters, no LLM rerank in the headline numbers.
  • Latency reported alongside recall, not separately.

Reproduce in 1 command

bash benchmarks/harness/run_bench.sh

See Reproduce for the full step-by-step.