Dask streaming and lazy reads
Atlas integrates with dask at both ends of the pipeline:
- Writes — dask-backed
xr.Datasetvariables stream one block at a time into atlas; peak memory ≈ one chunk per variable. - Reads — variables stored with
chunk_shape != shapecome back as dask arrays, one task per on-disk chunk.
Streaming writes
If a variable's .data is a dask.array.Array (e.g. from
xr.open_dataset(path, chunks=...) or ds.chunk({...})),
atlas.add_xr_dataset / ds.atlas.write iterates the dask chunk grid and
calls view.write_array(start=..., data=chunk) once per chunk. The whole
array is never materialised.
import xarray as xr
import atlas
ds = xr.open_dataset("big.nc", chunks={"time": 100, "lat": -1, "lon": -1})
with atlas.Atlas.create("/tmp/store") as atlas:
atlas.add_xr_dataset(ds, "big") # streams chunk-by-chunk
The dask chunk shape is also used as the atlas on-disk chunk_shape, so
the layout maps 1:1 with no extra alignment cost. Override per-variable
with chunks={"var_name": [...]}:
This decouples write-time memory budget from read-side chunk layout.
Peak memory ≈ one dask chunk per variable, plus dask's own task graph overhead. For a 10 GB chunked NetCDF with 100 MB chunks, you pay 100 MB of RAM per variable, not 10 GB.
Lazy reads
atlas.to_xarray(name) returns each variable dask-backed when it was
stored with non-trivial chunking (chunk_shape != shape). The dask
chunks tuple mirrors the on-disk chunk grid one-to-one, and each
on-disk chunk becomes a single dask task. Full-shape arrays and 0-D
scalars come back eager as numpy.
ds = xr.open_dataset("big.nc", chunks={"time": 100, "lat": -1, "lon": -1})
with atlas.Atlas.create("/tmp/store") as atlas:
atlas.add_xr_dataset(ds, "big")
ds_back = atlas.Atlas.open("/tmp/store").to_xarray("big")
type(ds_back["temperature"].data) # -> dask.array.Array
ds_back["temperature"][0:100].compute() # reads exactly one chunk
.compute() materialises the requested slice. Use xarray's normal
operators (isel, sel, mean, map_blocks, …) to operate lazily.
Scheduler restrictions
The dask graph captures the DatasetView directly, so dask's default
threaded scheduler works out of the box. Distributed and multiprocessing
schedulers aren't supported in this release — the DatasetView isn't
picklable across process boundaries.
# Works
ds_back["temperature"][0:100].compute(scheduler="threads")
# Does NOT work — DatasetView is not picklable
ds_back["temperature"][0:100].compute(scheduler="processes")
ds_back["temperature"][0:100].compute(scheduler=Client(...)) # dask.distributed
If you need cross-process parallelism, either:
- Materialise first —
.compute()(on the threaded scheduler) and hand off numpy arrays to your workers. - Use the bulk-read APIs —
Atlas.read_array_across_stackedandDatasetView.read_arraysare PyO3 calls into Rust that return eager numpy arrays, with parallelism handled internally by the tokio runtime. See Bulk reads.
When chunking is worth it
| Storage shape | Result |
|---|---|
chunk_shape == shape (default) |
Whole array stored as one block. Reads are eager numpy. Slice reads decompress the full array. |
chunk_shape != shape |
One block per chunk. Reads come back as dask arrays. Slice reads decompress only the touching chunks. |
For small per-dataset shapes (e.g. the profile benchmark case — (50,
168) per variable) chunking adds more overhead than it saves; keep the
default single-chunk layout. For large per-dataset shapes (e.g. the
gridded case — (100, 100, 48)), chunking is what unlocks the
decompression-only-what-you-touch behaviour. See
Benchmarks for the actual numbers.