Ingest pipeline

mnem ingest is the only way content enters the graph. Every input passes through five stages, in order:

parse -> chunk -> extract -> embed -> commit

Sources

file path (mnem ingest README.md)
glob (mnem ingest 'docs/**/*.md')
stdin (cat data.txt | mnem ingest -)
structured JSON (mnem ingest data.json --json)
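
As a rough illustration of how one positional argument can cover the first three source types, the Python sketch below classifies an argument. It is not mnem's code: the function name classify_source is hypothetical, and the rules (a lone "-" means stdin, any of * ? [ marks a glob) are assumptions borrowed from common CLI conventions. Structured JSON is not a separate case here because, per the example above, it is a file path combined with the --json flag.

```python
def classify_source(arg):
    """Classify a source argument as 'stdin', 'glob' or 'file'.

    Assumption: '-' means stdin and any of * ? [ marks a glob pattern,
    following common CLI conventions rather than mnem's actual rules.
    """
    if arg == "-":
        return "stdin"
    if any(ch in arg for ch in "*?["):
        return "glob"
    return "file"
```

Quoting the glob on the command line, as in the example above, keeps the shell from expanding it before mnem sees it.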

Chunking

Default: ~1k-token chunks with sentence-boundary alignment. Override via config:

[ingest]
chunk_size_tokens = 512
chunk_overlap_tokens = 50
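
To make the two settings concrete, here is a minimal Python sketch of sentence-aligned chunking with overlap. It is not mnem's implementation: the function name chunk_sentences, the whitespace word count standing in for a real tokenizer, and the regex sentence splitter are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, chunk_size_tokens=512, chunk_overlap_tokens=50):
    """Group whole sentences into chunks of up to chunk_size_tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    A single sentence longer than the limit becomes its own oversized chunk.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0

    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > chunk_size_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit the overlap budget.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if overlap_len + prev_len > chunk_overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_len += prev_len
            # Drop the overlap if it would push the next chunk over the limit.
            if overlap_len + n > chunk_size_tokens:
                overlap, overlap_len = [], 0
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is counted in whole sentences, each chunk still starts and ends on a sentence boundary; the carried-over text can be shorter than chunk_overlap_tokens but never longer.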

Document-aware chunkers exist for code (Tree-sitter) and for Markdown (heading-aware). The chunker is selected automatically from the file extension.
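
Extension-based selection can be pictured as a small lookup. The Python sketch below is only an illustration: the function name pick_chunker, the returned chunker names, and the extension sets are hypothetical, since this page does not list which extensions mnem maps to which chunker.

```python
from pathlib import Path

# Hypothetical extension tables; the extensions mnem actually recognizes
# are not listed in this document.
CODE_EXTENSIONS = {".py", ".rs", ".go", ".ts"}
MARKDOWN_EXTENSIONS = {".md", ".markdown"}


def pick_chunker(path):
    """Return an illustrative chunker name for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix in CODE_EXTENSIONS:
        return "tree-sitter"
    if suffix in MARKDOWN_EXTENSIONS:
        return "markdown-heading"
    # Anything else falls back to the default sentence-boundary chunker.
    return "sentence"
```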

Extractors

Optional ingest-time enrichment:

Extractor	What it does
`none` (default)	raw text only
`keybert`	KeyBERT keyphrase extraction; phrases stored in node metadata

Enable via flag:

mnem ingest README.md --extractor keybert

Labels

Pass --label <str> to scope the ingested nodes:

mnem ingest user-42-chat.json --label user-42 --json

Subsequent retrieve calls that pass --label user-42 see only the nodes ingested under that label.

Idempotency

Ingesting the same content twice produces the same CID; the second commit is a no-op (parent points at the same tree). Editing the content and re-ingesting it produces a new CID and a child commit.
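
The behavior follows from content addressing: the identifier is derived from the bytes themselves, so identical input always maps to the same identifier. The Python sketch below illustrates the idea under stated assumptions. It uses a plain SHA-256 hex digest as a stand-in for a real CID, and ToyStore is a hypothetical in-memory model, not mnem's storage layer; it simply skips the repeated commit rather than modeling trees.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of the content."""
    return hashlib.sha256(data).hexdigest()


class ToyStore:
    """Hypothetical in-memory store that mimics no-op re-ingestion."""

    def __init__(self):
        self.head = None    # ID of the most recently committed content
        self.commits = []   # list of (content_id, parent_id) pairs

    def ingest(self, data: bytes) -> bool:
        """Commit data; return False when the content is unchanged."""
        cid = content_id(data)
        if cid == self.head:
            return False    # same content, same ID: nothing to commit
        self.commits.append((cid, self.head))
        self.head = cid
        return True


store = ToyStore()
first = store.ingest(b"hello")           # new content: committed
second = store.ingest(b"hello")          # identical content: no-op
third = store.ingest(b"hello, world")    # edited content: child commit
```

In this model the third commit records the first commit's ID as its parent, which mirrors the "new CID and a child commit" behavior described above.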