Shared physical arrays
Atlas's defining layout trick: arrays are shared by name across all
datasets in a store. When N datasets all define_array("temperature",
...) with the same schema, they all write into the same physical
temperature/data.af file, keyed by dataset name inside the file. Adding
the 1000th dataset that re-uses an existing schema is one append + one
atlas.json rewrite — not 1000 new directories.
What the on-disk layout looks like
For a store with 50 sensor datasets, each declaring temperature and
pressure:
/tmp/store/
├─ atlas.json          # registry of all 50 datasets + array schemas
├─ temperature/
│  └─ data.af          # holds 50 entries, keyed by dataset name
└─ pressure/
   └─ data.af          # holds 50 entries, keyed by dataset name
Atlas.list_arrays() returns the distinct array names across the
whole store (here: ["temperature", "pressure"]), which is usually a
small, fixed set. Atlas.list_datasets() returns the logical dataset
names, which may run into the thousands.
examples/07_shared_arrays.py
walks this with 50 stations × 2 variables and prints the on-disk
directory listing — exactly two array directories regardless of N.
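The same layout can be mimicked with the standard library alone. The sketch below is a toy model, not Atlas's on-disk format: build_toy_store and array_dirs are hypothetical helpers, and JSON files stand in for data.af. It only illustrates that the number of array directories tracks the number of distinct array names, not the number of datasets.

```python
import json
import tempfile
from pathlib import Path


def build_toy_store(root: Path, n_datasets: int, array_names: list[str]) -> None:
    """Toy stand-in for a shared-array store (not Atlas's real file format)."""
    registry = {"datasets": [], "arrays": array_names}
    # One table per array name; each table is keyed by dataset name.
    tables = {name: {} for name in array_names}
    for i in range(n_datasets):
        ds_name = f"station_{i:03d}"
        registry["datasets"].append(ds_name)
        for name in array_names:
            tables[name][ds_name] = [float(i)]  # placeholder payload
    (root / "atlas.json").write_text(json.dumps(registry))
    for name, table in tables.items():
        array_dir = root / name
        array_dir.mkdir()
        # The real store writes data.af; JSON is used here for illustration only.
        (array_dir / "data.json").write_text(json.dumps(table))


def array_dirs(root: Path) -> list[str]:
    """Return the names of the per-array directories under root."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_toy_store(root, 50, ["temperature", "pressure"])
    dirs = array_dirs(root)
    entries = json.loads((root / "temperature" / "data.json").read_text())
    print(dirs, len(entries))  # two directories, 50 entries in each file
```

Raising n_datasets to 1000 leaves dirs unchanged; only the entry count inside each file grows.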
Why this scales
- Fewer inodes. A fleet of 1000 datasets that share 5 array names uses 5 physical files, not 5000. Filesystem walk time stays constant as the dataset count grows.
- One file handle per array. Cross-dataset bulk reads (read_array_across_stacked) share one RwLock::read guard on each array's file and dispatch N per-dataset reads on the tokio runtime. The kernel-level open/close cost is paid once per array, not once per dataset.
- One delta file per array per flush. N consecutive add_xr_dataset calls write one delta segment per touched array name on the next flush, no matter how many datasets contributed. See Durability and flushing.
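The per-flush batching can be pictured as grouping pending writes by array name before anything is written out. The sketch below is a toy model under that assumption, not Atlas's implementation: ToyWriteBuffer, add_dataset, and flush are hypothetical names, and a "segment" is reduced to a count of the dataset entries it would contain.

```python
from collections import defaultdict


class ToyWriteBuffer:
    """Toy model of per-array write batching (not Atlas's implementation)."""

    def __init__(self) -> None:
        # array name -> {dataset name: payload}
        self._pending = defaultdict(dict)

    def add_dataset(self, ds_name: str, arrays: dict[str, list[float]]) -> None:
        """Queue one dataset's arrays; nothing is written yet."""
        for array_name, payload in arrays.items():
            self._pending[array_name][ds_name] = payload

    def flush(self) -> dict[str, int]:
        """Emit one segment per touched array; return entries per segment."""
        segments = {name: len(entries) for name, entries in self._pending.items()}
        self._pending.clear()
        return segments


buf = ToyWriteBuffer()
for i in range(1000):
    buf.add_dataset(f"ds_{i}", {"temperature": [20.0], "pressure": [1013.0]})

segments = buf.flush()
print(len(segments), segments["temperature"])  # 2 segments, 1000 entries each
```

The number of segments depends only on how many array names were touched since the last flush, which is the property the bullet above describes.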
When schemas don't line up
There's no requirement that every dataset declare every array. If five out
of 1000 datasets skip humidity:
- The humidity/data.af file holds 995 entries, not 1000.
- view.array_meta("humidity") returns None on the five that don't declare it.
- Atlas.read_array_across("humidity", all_names, ...) returns None in those five positions; read_array_across_stacked errors (because there's no positional "missing" sentinel in the stacked output).
So shared physical files are an optimisation that activates when schemas do line up; mixed-schema stores work too, you just don't get the bulk-read density.
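The difference between the two read paths comes down to whether the result can carry a hole. The sketch below illustrates that with plain Python lists; toy_read_across and toy_read_across_stacked are hypothetical stand-ins that copy only the missing-entry behaviour described above, not the real signatures or return types.

```python
from typing import Optional

# Toy shared file: dataset name -> values. Dataset "c" does not declare the array.
humidity = {"a": [0.41, 0.43], "b": [0.52, 0.50]}


def toy_read_across(table: dict, names: list[str]) -> list[Optional[list[float]]]:
    """One slot per requested name; None marks a dataset without the array."""
    return [table.get(name) for name in names]


def toy_read_across_stacked(table: dict, names: list[str]) -> list[list[float]]:
    """Dense rows only: a missing dataset has no valid row, so raise."""
    missing = [name for name in names if name not in table]
    if missing:
        raise KeyError(f"datasets without this array: {missing}")
    return [table[name] for name in names]


names = ["a", "b", "c"]
sparse = toy_read_across(humidity, names)  # last slot is None

try:
    toy_read_across_stacked(humidity, names)
    stacked_failed = False
except KeyError:
    stacked_failed = True  # no sentinel row exists for "c"

# Filtering to the datasets that declare the array makes the dense read valid.
dense = toy_read_across_stacked(humidity, [n for n in names if n in humidity])
```

In a mixed-schema store, one workable pattern is to filter the name list to datasets that declare the array before asking for a stacked read.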
Dtype rules across a shared array
All datasets that share a physical array file must declare it with the
same dtype. The first define_array(name, dtype=...) in the store
fixes the on-disk dtype; subsequent definitions with a different dtype
raise an error.
ds1.define_array("temperature", dtype="float32", ...) # ok
ds2.define_array("temperature", dtype="float32", ...) # ok, joins the shared file
ds3.define_array("temperature", dtype="float64", ...) # raises — dtype mismatch
Shape and chunk shape can differ per dataset — only the dtype is global to the physical file.
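The first-definition-wins rule can be modelled with a small registry. This is a sketch under assumptions: ToyArrayRegistry is hypothetical, and raising TypeError is an arbitrary choice, since the text above does not say which exception Atlas raises. Shape is stored per dataset, while dtype is stored once per array name.

```python
class ToyArrayRegistry:
    """Toy model: dtype is fixed per array name, shape is per dataset."""

    def __init__(self) -> None:
        self._dtypes: dict[str, str] = {}                 # array name -> dtype
        self._shapes: dict[str, dict[str, tuple]] = {}    # array name -> {dataset: shape}

    def define_array(self, dataset: str, name: str, dtype: str, shape: tuple) -> None:
        # The first definition fixes the dtype; later ones must match it.
        fixed = self._dtypes.setdefault(name, dtype)
        if fixed != dtype:
            # Exception type is an assumption made for this sketch.
            raise TypeError(f"{name!r} is stored as {fixed}, got {dtype}")
        self._shapes.setdefault(name, {})[dataset] = shape

    def members(self, name: str) -> list[str]:
        """Datasets that have joined the shared array."""
        return sorted(self._shapes.get(name, {}))


reg = ToyArrayRegistry()
reg.define_array("ds1", "temperature", "float32", (24,))     # fixes float32
reg.define_array("ds2", "temperature", "float32", (48, 2))   # different shape is fine

try:
    reg.define_array("ds3", "temperature", "float64", (24,))
    mismatch_raised = False
except TypeError:
    mismatch_raised = True  # dtype mismatch is rejected

members = reg.members("temperature")
```

The rejected definition leaves the registry untouched, so ds3 never appears among the members of the shared array.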
When to not share
If two datasets happen to have an array named temperature but with
incompatible meanings (different units, different reference frames), give
them distinct names (indoor_temperature, surface_temperature). The
shared-file layout is a performance and storage win, not a semantic
contract: Atlas won't enforce that two datasets' temperature arrays mean
the same thing, so you should.