Methodology

Every published number ships with the harness, the dataset hash, and the raw artifacts. If you cannot reproduce a number, that is a bug.

Dataset matrix

Dataset	Version	n queries	Source
LongMemEval	`longmemeval_s_cleaned.json`	500	xiaowu0162/longmemeval-cleaned
LoCoMo	`locomo10.json`	1986 (session-level)	snap-research/LoCoMo
ConvoMem	5 cat × 50 items (250)	250	Salesforce/ConvoMem
MemBench simple/roles	100 items	100	import-myself/Membench
MemBench highlevel/movie	100 items	100	import-myself/Membench

Embedder

ONNX MiniLM-L6-v2 (sentence-transformers/all-MiniLM-L6-v2 via Xenova/all-MiniLM-L6-v2), bundled in-process via the onnx-bundled feature. No network calls, no API keys, no per-call model load.

Hardware

Pinned 4 cores per lane (cpuset 0-3 / 4-7 / 8-11 / 12-15), MNEM_ORT_INTRA_THREADS=4, mem cap 3 GiB per lane. Bench host is documented per run in benchmarks/results/.

Scoring

Metric	Definition
R@K	hit if any gold item is in top-K retrieved
avg recall	mean per-item recall (ConvoMem)
Hybrid v4	dense + sparse score boost (mirrors MP harness helper)

Apple-to-apple pledge

Same dataset version, same query count.
Same scoring code (benchmarks/harness/).
No secret post-filters, no LLM rerank in the headline numbers.
Latency reported alongside recall, not separately.

Reproduce in 1 command

bash benchmarks/harness/run_bench.sh

See Reproduce for the full step-by-step.

Keyboard shortcuts