xarray integration

xarray and dask are required dependencies, and importing atlas registers an accessor at xr.Dataset.atlas, so the integration is on without any extra setup.

import atlas        # registers ds.atlas on import
import xarray as xr

Writing an xr.Dataset

An Atlas must exist first; you then add xr.Datasets to it. There are two equivalent entry points:

import numpy as np, xarray as xr, atlas

ds = xr.Dataset(
    data_vars={
        "temperature": (["lat", "lon"],
                        np.arange(8 * 16, dtype=np.float32).reshape(8, 16),
                        {"units": "C", "long_name": "surface temperature"}),
    },
    coords={"lat": np.arange(8, dtype=np.float32),
            "lon": np.arange(16, dtype=np.float32)},
    attrs={"month": 1, "station": "KNMI"},
)

with atlas.Atlas.create("/tmp/store") as store:   # `store`, so the `atlas` module name is not rebound
    store.add_xr_dataset(ds, "jan_2024")     # method on Atlas
    ds.atlas.write(store, "feb_2024")        # accessor on xr.Dataset (same effect)

add_xr_dataset doesn't flush: N consecutive calls accumulate, and a single flush() call on the Atlas (or exiting the with block) persists everything. See Bulk ingestion below.

Reading back

store = atlas.Atlas.open("/tmp/store")       # keep the module name `atlas` free
ds_back = store.to_xarray("jan_2024")
xr.testing.assert_identical(ds, ds_back)     # round-trip is bit-identical

Variables stored with chunk_shape != shape come back dask-backed (one dask task per on-disk chunk); full-shape and 0-D variables come back eager as numpy. See Dask streaming and lazy reads.
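
As a rough illustration of the rule above, the sketch below predicts whether a variable would come back lazy and how many dask tasks it would map to. It assumes one task per on-disk chunk and that edge chunks may be partial; the helper names is_lazy and n_chunks are hypothetical and not part of atlas.

```python
import math

def n_chunks(shape, chunk_shape):
    """Count on-disk chunks, rounding up so partial edge chunks are included."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))

def is_lazy(shape, chunk_shape):
    """Mirror the documented rule: 0-D and full-shape variables are eager."""
    return len(shape) > 0 and tuple(chunk_shape) != tuple(shape)

print(is_lazy((8, 16), (4, 8)), n_chunks((8, 16), (4, 8)))    # True 4
print(is_lazy((8, 16), (8, 16)), n_chunks((8, 16), (8, 16)))  # False 1
print(is_lazy((), ()), n_chunks((), ()))                      # False 1
```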

Storage conventions

| xr.Dataset item | How it's stored in atlas |
|---|---|
| Each data_var and each coord | A separate atlas array; dims map 1:1 onto atlas dimension names. |
| Dataset.attrs | Atlas dataset attributes, plain keys. |
| Per-variable attrs | Flattened as {var}.{attr} at the dataset attribute level. |
| Per-variable _FillValue | Consumed by define_array as a typed fill value (not flattened as a regular attr). The source Dataset.attrs is not mutated. |
| Coord vs data_var distinction | JSON list in the internal _pyatlas_coords attribute. |
| Non-scalar attr values (list, ndarray, dict) | JSON-encoded with a json: string prefix. |
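
To make the attribute conventions concrete, here is a minimal sketch of the flattening and the json: prefix encoding in plain Python. It is an illustration of the documented layout, not the atlas implementation; flatten_attrs, encode_value and decode_value are hypothetical names, and ndarray values are left out (they would presumably need .tolist() first).

```python
import json

JSON_PREFIX = "json:"

def encode_value(value):
    """Scalars pass through; lists and dicts become 'json:'-prefixed strings."""
    if isinstance(value, (list, dict)):
        return JSON_PREFIX + json.dumps(value)
    return value

def decode_value(value):
    """Inverse of encode_value for values carrying the 'json:' prefix."""
    if isinstance(value, str) and value.startswith(JSON_PREFIX):
        return json.loads(value[len(JSON_PREFIX):])
    return value

def flatten_attrs(dataset_attrs, var_attrs):
    """Merge dataset attrs (plain keys) with per-variable attrs ({var}.{attr})."""
    flat = {key: encode_value(value) for key, value in dataset_attrs.items()}
    for var, attrs in var_attrs.items():
        for key, value in attrs.items():
            if key == "_FillValue":
                continue  # documented as a typed fill value, not a regular attr
            flat[f"{var}.{key}"] = encode_value(value)
    return flat

flat = flatten_attrs(
    {"month": 1, "station": "KNMI"},
    {"temperature": {"units": "C", "valid_range": [-90, 60], "_FillValue": -999.0}},
)
print(flat["temperature.units"])                        # C
print(flat["temperature.valid_range"])                  # json:[-90, 60]
print(decode_value(flat["temperature.valid_range"]))    # [-90, 60]
```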

Reading back without the _pyatlas_coords marker falls back to a "1-D-array-named-after-its-dim-is-a-coord" heuristic, so atlas datasets written via the raw DatasetView API still load cleanly into xarray.
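
The fallback heuristic can be sketched in a few lines, assuming the only input is a mapping from variable name to its dimension names. guess_coords is a hypothetical helper written for illustration, not an atlas function.

```python
def guess_coords(var_dims):
    """Return names of 1-D variables whose single dimension shares their name."""
    return sorted(name for name, dims in var_dims.items() if tuple(dims) == (name,))

var_dims = {
    "temperature": ("lat", "lon"),   # 2-D: data variable
    "lat": ("lat",),                 # 1-D, named after its dim: coord
    "lon": ("lon",),                 # 1-D, named after its dim: coord
    "station_id": ("station",),      # 1-D, but name differs from dim: data variable
    "trajectory": (),                # 0-D scalar: data variable
}
print(guess_coords(var_dims))  # ['lat', 'lon']
```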

Supported variable dtypes

| numpy dtype | atlas dtype |
|---|---|
| int8/int16/int32/int64, uint*, float32/float64 | matching numeric |
| datetime64[ns] | timestamp_nanoseconds (round-trips to datetime64[ns]) |
| object (Python str/bytes), \|S<n>, \|U<n> | string (variable-length; reads return Python str) |

0-D scalar variables (e.g. a NetCDF TRAJECTORY identifier) round-trip natively. See Supported dtypes for the full list and the restrictions.
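
A quick pre-flight check along the lines of the table might look like the sketch below. It assumes uint* means uint8 through uint64 and returns only a coarse category, since the exact atlas dtype names for numerics are not listed here; atlas_category is a hypothetical helper, not part of the library.

```python
import numpy as np

_NUMERIC = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",   # assumption: what "uint*" covers
    "float32", "float64",
}

def atlas_category(dtype):
    """Classify a numpy dtype per the table above, or raise if it is not listed."""
    dt = np.dtype(dtype)
    if dt.name in _NUMERIC:
        return "numeric"
    if dt == np.dtype("datetime64[ns]"):
        return "timestamp_nanoseconds"
    if dt.kind in ("O", "S", "U"):
        # For object arrays the elements must still be str/bytes;
        # this dtype-only check cannot verify that.
        return "string"
    raise TypeError(f"dtype not covered by this sketch: {dt}")

print(atlas_category(np.float32))        # numeric
print(atlas_category("datetime64[ns]"))  # timestamp_nanoseconds
print(atlas_category("U8"))              # string
```

Anything else, such as bool, raises TypeError in this sketch, which lines up with the restrictions noted under Limitations.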

Bulk ingestion

add_xr_dataset never flushes by itself — N consecutive calls accumulate in memory and one flush at the end persists everything. One delta file is written per array name touched across the whole batch, not one per input file:

import glob, os
import atlas, xarray as xr

with atlas.Atlas.create("/tmp/store") as store:   # `store`, so the `atlas` module name is not rebound
    for nc_path in sorted(glob.glob("*.nc")):
        name = os.path.splitext(os.path.basename(nc_path))[0]
        store.add_xr_dataset(xr.open_dataset(nc_path), name)
# Single flush on `with` exit.
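
To see why batching matters, the toy model below counts delta files under the stated rule of one file per array name per flush. PendingBatch is a hypothetical stand-in for the in-memory buffer, written only to illustrate the bookkeeping; it is not how atlas is implemented.

```python
from collections import defaultdict

class PendingBatch:
    """Toy write buffer: groups pending writes by array name until flush."""

    def __init__(self):
        self._by_array = defaultdict(list)

    def add(self, dataset_name, array_names):
        """Record that `dataset_name` touches each array; nothing is written yet."""
        for array_name in array_names:
            self._by_array[array_name].append(dataset_name)

    def flush(self):
        """Return one entry per array name touched, then empty the buffer."""
        deltas = {name: list(datasets) for name, datasets in self._by_array.items()}
        self._by_array.clear()
        return deltas

batch = PendingBatch()
for month in ("jan_2024", "feb_2024", "mar_2024"):
    batch.add(month, ["temperature", "lat", "lon"])

deltas = batch.flush()
print(len(deltas))  # 3
```

In this model, three input datasets with the same three arrays yield three delta files from a single flush. Flushing after every dataset would yield nine.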

If the NetCDF files are opened with dask chunks (xr.open_dataset(path, chunks=...)), those dask chunks become the atlas on-disk chunks automatically; see Dask streaming and lazy reads.

Overriding the on-disk chunk shape

# `store` is an open Atlas instance, e.g. from atlas.Atlas.create(...) or atlas.Atlas.open(...)
store.add_xr_dataset(ds, "jan_2024", chunks={"temperature": [4, 8]})
ds.atlas.write(store, "feb_2024", chunks={"temperature": [4, 8]})

This is independent of dask's chunking: the dask chunks are still what gets streamed at write time, but the on-disk chunk_shape is whatever you pass. Use this when the read-side chunk layout should differ from the chunk size your write-side memory budget allows.

Limitations

  • No append-into-existing. Each call to add_xr_dataset / ds.atlas.write creates a new atlas dataset. Append-style updates to an existing dataset go through the raw DatasetView API.
  • Threaded scheduler only for lazy reads. The DatasetView captured in the dask graph isn't picklable, so distributed / multiprocessing schedulers don't work. Call .compute() before crossing a process boundary, or use the bulk-read APIs which return eager numpy.
  • No bool / binary / list arrays yet (see Supported dtypes).
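
The scheduler limitation comes down to pickling: anything sent to another process must survive pickle.dumps. The sketch below shows the failure mode with a hypothetical FakeView class that holds a thread lock. It is only a stand-in; what actually makes DatasetView unpicklable is not specified here.

```python
import pickle
import threading

class FakeView:
    """Hypothetical stand-in for a handle that owns process-local state."""

    def __init__(self):
        self._lock = threading.Lock()   # locks cannot be pickled

def can_cross_process_boundary(obj):
    """True if `obj` can be pickled, which multiprocess schedulers require."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False

print(can_cross_process_boundary(FakeView()))        # False
print(can_cross_process_boundary([1.0, 2.0, 3.0]))   # True
```

Eager results, such as the numpy arrays produced by .compute(), pickle without trouble, which is why computing first avoids the problem.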