Codecs and metadata

Atlas.create takes three independent knobs that control the on-disk encoding:

atlas.Atlas.create(
    path,
    codec="zstd",            # array codec — see below
    meta_format="json",      # metadata format
    meta_compression="none", # metadata compression
)

You never need to pass any of them when reopening: Atlas.open detects the metadata format and compression from the on-disk filename, and the array codec is recorded per array.

Array codec — `codec=`

| Value | When to pick it |
| --- | --- |
| `"zstd"` (default) | Best ratio at moderate CPU. Pick this unless you have a reason not to. |
| `"lz4"` | ~2× larger files than `"zstd"`, but faster to decompress. Worth it for read-heavy scan loops where decompression throughput matters more than file size. |
| `"none"` / `"uncompressed"` | Fastest write path, no size reduction. Use for tiny stores, or when you'll compress the whole directory externally. |

The codec is recorded per-array, so reading is automatic regardless of which codec the store was opened with. Existing blocks always decompress with whichever codec wrote them; the codec= kwarg only affects new blocks. This means you can switch codecs mid-life without rewriting.
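
The idea of tagging each block with the codec that wrote it can be sketched with a toy model. This is not atlas's on-disk layout: `write_block`, `read_block`, and the `CODECS` table are hypothetical names, and stdlib `zlib` stands in for a real compressor.

```python
import zlib

# Stand-in codecs: zlib plays the role of a real compressor here. The names
# and the tagging scheme are illustrative, not atlas's actual format.
CODECS = {
    "none": (lambda raw: raw, lambda blob: blob),
    "zlib": (zlib.compress, zlib.decompress),
}


def write_block(raw: bytes, codec: str) -> tuple[str, bytes]:
    """Encode one block and tag it with the codec that wrote it."""
    encode, _ = CODECS[codec]
    return codec, encode(raw)


def read_block(block: tuple[str, bytes]) -> bytes:
    """Decode using the tag stored with the block, not a caller-supplied codec."""
    codec, blob = block
    _, decode = CODECS[codec]
    return decode(blob)


# Blocks written before and after a codec switch live side by side.
payload = b"atlas" * 1000
blocks = [write_block(payload, "none"), write_block(payload, "zlib")]
assert all(read_block(b) == payload for b in blocks)
```

Because the reader dispatches on the stored tag, changing the writer's codec never invalidates older blocks.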

See examples/05_codecs.py for a head-to-head on a smooth float32 field (smooth data exposes codec differences; pure noise compresses uniformly badly across all three).
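
To see why smooth data is the more informative benchmark, here is a minimal stdlib-only sketch. It is not the contents of examples/05_codecs.py: `zlib` and `lzma` are stand-ins for the real codecs, and the sine-wave field and sample count are arbitrary choices.

```python
import lzma
import math
import random
import struct
import zlib

N = 65536  # number of float32 samples

# Smooth field: a slowly varying sine wave packed as little-endian float32.
smooth = struct.pack(f"<{N}f", *(math.sin(2 * math.pi * i / 1024) for i in range(N)))

# Pure noise: random bytes of the same length (fixed seed for repeatability).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(smooth)))


def ratios(data: bytes) -> dict[str, float]:
    """Compressed size as a fraction of raw size, per stand-in codec."""
    return {
        "zlib": len(zlib.compress(data)) / len(data),
        "lzma": len(lzma.compress(data)) / len(data),
    }


smooth_ratios = ratios(smooth)
noise_ratios = ratios(noise)
print("smooth:", smooth_ratios)
print("noise: ", noise_ratios)
```

On the noise input both stand-ins land close to a ratio of 1.0, so there is little left to compare; the smooth input is where the compressors separate.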

Metadata format — `meta_format=`

The metadata file is read on every Atlas.open and written on every flush. It contains the dataset registry, every array schema, every attribute, and every persisted stat — for stores with thousands of datasets it can grow into tens of MB.

| Value | Filename | Notes |
| --- | --- | --- |
| `"json"` (default) | `atlas.json` | Human-readable; greppable; backwards compatible. |
| `"msgpack"` / `"mp"` | `atlas.msgpack` | ~30–50% smaller, faster to parse, not human-readable. |

JSON is the default because the wins from msgpack only become visible on stores big enough that the metadata file is a meaningful fraction of total size. For small stores msgpack saves milliseconds you won't notice.

Metadata compression — `meta_compression=`

Applied on top of the encoded metadata file:

| Value | Filename suffix | Notes |
| --- | --- | --- |
| `"none"` / `"uncompressed"` (default) | (no suffix) | |
| `"zstd"` | `.zst` (e.g. `atlas.json.zst`) | Best ratio. |
| `"lz4"` | `.lz4` (e.g. `atlas.msgpack.lz4`) | Faster decode, larger file. |

Mostly useful for stores with thousands of datasets on a high-latency object store: the metadata file is read in full on every open, so trimming 30–50% off the wire size pays for itself.
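
A rough feel for the wire-size effect can be had with the standard library alone. The sketch below builds a synthetic 1000-dataset registry and compresses its JSON encoding. The field names are invented for illustration and do not mirror atlas's real metadata layout, and `zlib` stands in for zstd / lz4, so the printed numbers are not predictions for a real store.

```python
import json
import zlib

# Synthetic registry: 1000 datasets, each with a small schema and one attribute.
# The structure is hypothetical, chosen only to be repetitive like real metadata.
registry = {
    f"dataset_{i:04d}": {
        "arrays": {"values": {"dtype": "float32", "shape": [1024, 1024]}},
        "attrs": {"units": "kelvin"},
    }
    for i in range(1000)
}

encoded = json.dumps(registry).encode("utf-8")
compressed = zlib.compress(encoded)  # zlib stands in for zstd / lz4

print(f"raw JSON:   {len(encoded):>8} bytes")
print(f"compressed: {len(compressed):>8} bytes")
```

Metadata of this kind repeats the same keys for every dataset, which is why a general-purpose compressor tends to do well on it.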

examples/04_meta_formats.py walks all six combinations on a 30-dataset × 4-array store and prints the size ratios.

Auto-detection on open

Atlas.open(path) looks for any of atlas.json, atlas.json.zst, atlas.json.lz4, atlas.msgpack, atlas.msgpack.zst, atlas.msgpack.lz4 in the directory and picks the first one it finds. You don't pass meta_format= / meta_compression= to open — the filename is the source of truth.

# Whatever combination the store was created with, this just works.
store = atlas.Atlas.open("/tmp/store")
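
Filename-based detection of this kind can be sketched as follows. `detect_meta` is a hypothetical helper, not the library's implementation, and the probing order is an assumption: it simply follows the order in which the filenames are listed above.

```python
import tempfile
from pathlib import Path

# Candidate names in the order the text lists them (assumed probing order).
CANDIDATES = [
    ("atlas.json", "json", "none"),
    ("atlas.json.zst", "json", "zstd"),
    ("atlas.json.lz4", "json", "lz4"),
    ("atlas.msgpack", "msgpack", "none"),
    ("atlas.msgpack.zst", "msgpack", "zstd"),
    ("atlas.msgpack.lz4", "msgpack", "lz4"),
]


def detect_meta(store_dir: Path) -> tuple[str, str]:
    """Return (meta_format, meta_compression) for the first metadata file found."""
    for name, meta_format, meta_compression in CANDIDATES:
        if (store_dir / name).is_file():
            return meta_format, meta_compression
    raise FileNotFoundError(f"no atlas metadata file in {store_dir}")


with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp)
    (store_dir / "atlas.msgpack.zst").write_bytes(b"")  # empty placeholder file
    detected = detect_meta(store_dir)
    print(detected)  # ('msgpack', 'zstd')
```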

Picking a combination

For most use cases the defaults (codec="zstd", meta_format="json", meta_compression="none") are right. Reach for the alternatives when:

Lots of small datasets (1000+ datasets, simple schemas) → meta_format="msgpack", optionally meta_compression="zstd". The metadata file shrinks dramatically; reads are unaffected because the array codec is independent.
Read-heavy scan loops over already-warm chunks → codec="lz4". The decompression rate is what matters.
Already compressing the whole directory (e.g. tarballed and shipped through an object store with HTTP gzip) → codec="none" for the arrays and let the outer compression do the work.
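
The rules of thumb above can be written down as a small helper. `pick_options` is a hypothetical function, not part of atlas. The 1000-dataset threshold comes from the guidance above, while the precedence when several conditions hold at once is an assumption.

```python
def pick_options(
    n_datasets: int = 1,
    read_heavy: bool = False,
    externally_compressed: bool = False,
) -> dict[str, str]:
    """Map workload traits to keyword arguments for Atlas.create."""
    # Start from the documented defaults.
    options = {"codec": "zstd", "meta_format": "json", "meta_compression": "none"}
    if n_datasets >= 1000:
        options["meta_format"] = "msgpack"
        options["meta_compression"] = "zstd"  # optional per the guidance above
    if read_heavy:
        options["codec"] = "lz4"
    if externally_compressed:
        # Assumption: outer compression takes precedence over the lz4 choice.
        options["codec"] = "none"
    return options


print(pick_options(n_datasets=5000))
# {'codec': 'zstd', 'meta_format': 'msgpack', 'meta_compression': 'zstd'}
```

The returned dict is meant to be splatted into the call, for example `atlas.Atlas.create(path, **pick_options(n_datasets=5000))`.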
