Atlas
Python bindings to atlas-rust — the
Rust core that powers ATLAS (Aggregated Tensor Large Array Store). Think of it as a "zip"
for N-dimensional data: store thousands of NetCDF / Zarr-style datasets in one
high-performance collection, then read any of them back as NumPy or
xarray.
import numpy as np
import atlas

with atlas.Atlas.create("/tmp/my_store", codec="zstd") as store:
    ds = store.create_dataset("jan_2024")
    ds.define_array(
        "temperature",
        dtype="float32",
        dims=["lat", "lon"],
        shape=[8, 16],
        chunk_shape=[4, 8],
    )
    ds.write_array("temperature", start=[0, 0],
                   data=np.full((8, 16), 20.0, dtype=np.float32))
    ds.set_attribute("month", 1)
# `with` exit == store.close() == single flush to disk
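The example declares `shape=[8, 16]` with `chunk_shape=[4, 8]`. As a rough sketch of what that implies, the arithmetic of a regular chunk grid can be worked out in plain Python. This is generic chunking math, not Atlas's internal layout (which this page does not describe), and `chunk_grid` is a hypothetical helper, not part of the `atlas` API.

```python
import math
from itertools import product


def chunk_grid(shape, chunk_shape):
    """Yield (start, extent) for every chunk of a regular grid.

    `start` is the index of the chunk's first element; `extent` is the
    chunk's size along each dimension (edge chunks may be smaller).
    Hypothetical helper for illustration only.
    """
    # Number of chunks along each dimension, rounding up for partial chunks.
    counts = [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]
    for idx in product(*(range(k) for k in counts)):
        start = [i * c for i, c in zip(idx, chunk_shape)]
        extent = [min(c, n - s) for c, n, s in zip(chunk_shape, shape, start)]
        yield start, extent


chunks = list(chunk_grid(shape=[8, 16], chunk_shape=[4, 8]))
for start, extent in chunks:
    print(start, extent)
# [0, 0] [4, 8]
# [0, 8] [4, 8]
# [4, 0] [4, 8]
# [4, 8] [4, 8]
```

So the 8 x 16 array splits into a 2 x 2 grid of four 4 x 8 chunks. The `start=` argument of `write_array` in the example suggests that writes can target an offset like these, but whether chunk-aligned partial writes are required or preferred is not stated here.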
What's here
- Multi-dataset stores: one `Atlas` directory holds many named datasets, each with its own arrays and attributes.
- Shared physical arrays: N datasets that all declare `temperature` share one `temperature/data.af` file, so the 1000th dataset is just an append.
- Compression: zstd / lz4 / uncompressed array codecs and json / msgpack metadata, optionally compressed.
- xarray + dask integration: `atlas.add_xr_dataset(ds, name)` to ingest, `atlas.to_xarray(name)` to read back. Dask-backed variables stream chunk-by-chunk on write and come back lazily on read.
- Bulk cross-dataset APIs: `to_xarray_many` and `read_array_across_stacked` collapse N per-dataset reads into one PyO3 call, sharing a single file handle.
- Sync API, GIL-released: a multi-threaded tokio runtime backs every blocking call; the GIL is released so other Python threads can run.
- Local fs or cloud storage: pass a path string for local, or an obstore handle for S3 / GCS / Azure / HTTP via `pip install "atlas-python[cloud]"`. The full read / write / flush / compact / bulk-read API works identically against either. See Cloud storage.
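To make the "Sync API, GIL-released" point concrete: when a blocking call releases the GIL, several such calls issued from Python threads can overlap instead of running one after another. The sketch below models that with the standard library only. `blocking_read` is a hypothetical stand-in (it just sleeps, and `time.sleep` releases the GIL); it is not an Atlas function, and the timings say nothing about Atlas's actual performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_read(name):
    # Hypothetical stand-in for a blocking call that releases the GIL
    # while it waits. time.sleep releases the GIL, which is the
    # behaviour being modelled here.
    time.sleep(0.2)
    return name


names = [f"ds_{i}" for i in range(4)]

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order in its results.
    results = list(pool.map(blocking_read, names))
elapsed = time.perf_counter() - t0

# The four 0.2 s waits overlap, so wall time is close to 0.2 s
# rather than the 0.8 s a serial loop would take.
print(results, f"{elapsed:.2f}s")
```

If the blocking call held the GIL instead, the same code would still be correct but the threads would take turns, and the wall time would approach the serial total.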
How does this compare to Zarr / netCDF?
netCDF and Zarr both put one logical "dataset" in one file (or one chunk directory), so a fleet of N similar datasets becomes N stores. Atlas inverts that layout: one store, N datasets, with arrays of the same name sharing a physical file across all of them. That design choice is what makes Atlas faster on "many small datasets, same schema" workloads and on cross-dataset slice reads.
See vs Zarr / netCDF for the head-to-head, and Benchmarks for the numbers.
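The "one store, N datasets" layout can be illustrated with a toy model: a single append-only buffer holds the same-named array for every dataset, plus an index from dataset name to byte offset. This is only a sketch of the idea under simplifying assumptions (fixed-size float32 arrays, no chunking, no compression). The real `data.af` format is not described on this page, and `SharedArrayFile` and its methods are hypothetical names.

```python
import io
from array import array


class SharedArrayFile:
    """Toy append-only file shared by many datasets (hypothetical layout)."""

    def __init__(self, n_elements):
        self.n_elements = n_elements  # elements in each dataset's array
        self.buf = io.BytesIO()       # stands in for temperature/data.af
        self.index = {}               # dataset name -> byte offset

    def append(self, dataset, values):
        if len(values) != self.n_elements:
            raise ValueError("array length does not match the declared size")
        # Adding a dataset never rewrites existing data: seek to the end,
        # record the offset, and write the new float32 payload.
        self.buf.seek(0, io.SEEK_END)
        self.index[dataset] = self.buf.tell()
        self.buf.write(array("f", values).tobytes())

    def read(self, dataset, start, stop):
        itemsize = array("f").itemsize
        self.buf.seek(self.index[dataset] + start * itemsize)
        out = array("f")
        out.frombytes(self.buf.read((stop - start) * itemsize))
        return out.tolist()

    def read_across(self, datasets, start, stop):
        # One buffer (one "file handle"), N seeks: the same slice
        # read from every requested dataset.
        return [self.read(d, start, stop) for d in datasets]


store = SharedArrayFile(n_elements=4)
for i in range(1000):
    store.append(f"ds_{i}", [float(i)] * 4)

print(store.read("ds_999", 0, 2))
# [999.0, 999.0]
print(store.read_across(["ds_0", "ds_500", "ds_999"], 1, 2))
# [[0.0], [500.0], [999.0]]
```

In the one-file-per-dataset layout, the last call would mean opening three separate stores; here it is three seeks into one buffer, which is the intuition behind the cross-dataset read claim above.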
Next steps
- Installation — install the wheel, or build from source.
- Quickstart — first store in five minutes.
- Guides — the mental model, dtypes, attributes, xarray, dask, bulk reads, stats.
- API reference — auto-generated from the source-level docstrings.
Status
- Local filesystem and any `object_store` backend (S3 / GCS / Azure / HTTP / local) via the optional obstore dependency. `pip install "atlas-python[cloud]"` enables it; see Cloud storage (S3, GCS, Azure).
- Lazy dask reads use the threaded scheduler only; call `.compute()` before crossing into distributed/multiprocessing schedulers.
- `bool`, `binary`, `list[...]`, `fixed_size_list[...,N]` are reserved for a later release as array element types.