Reproduce
End-to-end recipe to regenerate the 0.1.0 benchmark numbers locally.
Prerequisites
- Docker 24+ (or podman with compose plugin)
- 16 cores recommended, 8 cores minimum
- 16 GiB RAM
- Datasets downloaded:
bash benchmarks/harness/download-datasets.sh
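A quick preflight check can catch an undersized host before a long run. The following is a minimal sketch, not part of the harness: the function names are hypothetical, the RAM probe relies on `os.sysconf` and therefore only works on Linux-like systems, and the 1 GiB slack is an assumption to allow for memory the firmware reserves. It checks only that a container tool is present, not that Docker is version 24 or newer.

```python
import os
import shutil

MIN_CORES = 8            # minimum from the prerequisites list
RECOMMENDED_CORES = 16   # recommended from the prerequisites list
MIN_RAM_GIB = 16
RAM_SLACK_GIB = 1        # assumption: a "16 GiB" host reports slightly less


def total_ram_gib():
    """Return physical RAM in GiB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 2**30


def preflight(cores, ram_gib, has_container_tool):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not has_container_tool:
        problems.append("neither docker nor podman found on PATH")
    if cores is None or cores < MIN_CORES:
        problems.append(f"need at least {MIN_CORES} cores, found {cores}")
    elif cores < RECOMMENDED_CORES:
        problems.append(
            f"warning: {cores} cores is below the recommended {RECOMMENDED_CORES}"
        )
    # Skip the RAM check when the platform cannot report it.
    if ram_gib is not None and ram_gib < MIN_RAM_GIB - RAM_SLACK_GIB:
        problems.append(f"need about {MIN_RAM_GIB} GiB RAM, found {ram_gib:.1f}")
    return problems


if __name__ == "__main__":
    tool = shutil.which("docker") or shutil.which("podman")
    report = preflight(os.cpu_count(), total_ram_gib(), tool is not None)
    for line in report or ["preflight OK"]:
        print(line)
```

Run it once before the one-shot script; an entry starting with "warning" means the run should still work but will likely take longer than the quoted time.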
One-shot run
bash benchmarks/harness/run_bench.sh
Expected wall-clock time: 30-50 min on a 16-core machine. Output is written to benchmarks/results/<UTC-stamp>/.
What happens
- Build the Docker image (release, FEATURES=onnx-bundled).
- Bring up 4 lanes with cpuset pinning + thread caps.
- Run 6 benches (LongMemEval, LoCoMo, ConvoMem, MemBench × 2, Hybrid v4) sequentially across the lanes via a token-bucket dispatcher.
- Render RESULTS.md from per-bench JSONs.
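The dispatch step can be pictured as a pool of lane tokens: each bench takes a free lane, runs on it, and hands the lane back. The sketch below illustrates that idea only and is an assumption about how the harness dispatcher behaves, not its actual code. Of the lane names, only `mnem-bench-1` appears in this document; the other three, the two MemBench labels, and the `run_one` callback are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Only mnem-bench-1 is named in this README; the rest are placeholders.
LANES = ["mnem-bench-1", "mnem-bench-2", "mnem-bench-3", "mnem-bench-4"]

# Six benches; the two MemBench variant labels are placeholders.
BENCHES = [
    "LongMemEval", "LoCoMo", "ConvoMem",
    "MemBench-A", "MemBench-B", "Hybrid v4",
]


def dispatch(benches, lanes, run_one):
    """Run every bench on some lane, never two benches on one lane at once.

    run_one(bench, lane) is a caller-supplied function that performs the run.
    Returns (bench, lane, result) tuples in the same order as `benches`.
    """
    tokens = queue.Queue()
    for lane in lanes:
        tokens.put(lane)              # one token per lane

    def worker(bench):
        lane = tokens.get()           # block until a lane is free
        try:
            return bench, lane, run_one(bench, lane)
        finally:
            tokens.put(lane)          # always hand the lane back

    with ThreadPoolExecutor(max_workers=len(lanes)) as pool:
        return list(pool.map(worker, benches))


if __name__ == "__main__":
    for bench, lane, result in dispatch(BENCHES, LANES, lambda b, l: "done"):
        print(f"{bench:12s} ran on {lane}: {result}")
```

Holding a lane token for the full duration of a run is what keeps each bench alone on its pinned cores, which matters for timing stability more than for correctness.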
Per-bench manual run
docker compose -f benchmarks/harness/compose.yml up -d mnem-bench-1
python benchmarks/harness/adapters/longmemeval_session.py \
--dataset benchmarks/datasets/longmemeval/longmemeval_s_cleaned.json \
mnem http serve --bind 127.0.0.1:9876 \
--limit 500 --top-k 10 \
--out benchmarks/results/longmemeval-500q.json
docker compose -f benchmarks/harness/compose.yml down
Verify against shipped numbers
python benchmarks/harness/comparison_table.py \
--results benchmarks/results/<UTC-stamp> \
--out /tmp/RESULTS.md
diff /tmp/RESULTS.md benchmarks/results/RESULTS.md
If your recall numbers differ from the shipped ones by more than 0.01 in either direction, open an issue that includes the host spec and the bench logs.
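A plain diff flags any textual difference, including ones inside the tolerance, so a numeric comparison can be a useful second check. The sketch below assumes each per-bench JSON is an object with a top-level `recall` field; that key, the function names, and the idea of keeping the shipped per-bench JSONs in a directory of their own are all assumptions, not something this README specifies.

```python
import json
from pathlib import Path

TOLERANCE = 0.01  # recall tolerance quoted above


def load_recalls(results_dir):
    """Map bench name (file stem) to recall for every *.json in results_dir.

    Assumes a top-level "recall" key; files without it are skipped.
    """
    recalls = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        if isinstance(data, dict) and "recall" in data:
            recalls[path.stem] = float(data["recall"])
    return recalls


def diverging(local, shipped, tol=TOLERANCE):
    """Return {bench: local - shipped} for benches outside the tolerance.

    Only benches present in both mappings are compared.
    """
    out = {}
    for name in sorted(local.keys() & shipped.keys()):
        delta = local[name] - shipped[name]
        if abs(delta) > tol:
            out[name] = delta
    return out
```

With two such directories, `diverging(load_recalls(local_dir), load_recalls(shipped_dir))` returns an empty dict when every shared bench is within tolerance; a non-empty result lists the benches worth mentioning in an issue.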