Getting started
This walkthrough mirrors the public API exposed by the SDK so you can go from zero to a working query in minutes.
1. Install and import
Beacon API supports Python 3.10+. Install it from PyPI and import the pieces you need:
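A minimal sketch, assuming the distribution is published as beacon-api and exposes Client from a beacon_api package; substitute the names your release documents:
# Shell: pip install beacon-api   (package name assumed; check the project's PyPI entry)
from beacon_api import Client  # top-level import path assumed; adjust to the SDK's actual module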
Editor support
The project ships typed stubs (py.typed) so VS Code, PyCharm, or notebooks provide signature help and inline documentation out of the box.
2. Create a client
Instantiate Client with the Beacon base URL and (optionally) headers for authentication. The constructor normalizes headers, sets JSON defaults, and validates connectivity by calling /api/health.
client = Client(
    "https://beacon.example.com",
    # jwt_token="<optional bearer token>",
    # proxy_headers={"X-Forwarded-For": "<ip>"},
    # basic_auth=("user", "pass") is also supported
)
Use client.check_status() to verify connectivity and print the Beacon version, or client.get_server_info() to inspect the metadata returned by /api/info.
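A quick usage sketch (the exact shape of the returned metadata depends on your node):
client.check_status()            # verifies connectivity and prints the Beacon version
info = client.get_server_info()  # metadata returned by /api/info
print(info)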
3. Discover tables and datasets
list_tables() returns a mapping of table names to DataTable helpers that already know their description and type.
tables = client.list_tables()
stations = tables["stations-collection"]
print(stations.get_table_description())
schema = stations.get_table_schema()
schema_arrow = stations.get_table_schema_arrow()  # Arrow schema with pyarrow fields
If your Beacon Node is running v1.4.0 or later, use list_datasets() to enumerate file-backed resources and derive a query directly from a Dataset:
datasets = client.list_datasets(pattern="*.parquet", limit=5)
file = datasets["wod/2024-01.parquet"]
print(file.get_file_format(), file.get_file_name())
dataset_schema = file.get_schema()
dataset_query = file.query()
Deprecation notice
Client.query() and Client.subset() are still available for backwards compatibility but emit deprecation warnings. Prefer starting from a table (tables["default"].query()) or dataset (file.query()).
4. Build a JSON query
Calling .query() on a table or dataset returns a JSONQuery, a fluent builder with chainable selects and filters:
df = (
    tables["argo"]                      # Select the 'argo' table as our data source
    .query()                            # Create a new query on the selected table
    .add_select_column("LONGITUDE")
    .add_select_column("LATITUDE")
    .add_select_column("JULD")
    .add_select_column("PRES")
    .add_select_column("TEMP")
    .add_select_column("PSAL")
    .add_select_column(".featureType")
    .add_select_column("DATA_TYPE")
    .add_range_filter("JULD", "2020-01-01T00:00:00", "2021-01-01T00:00:00")  # Keep JULD within 2020
    .add_range_filter("PRES", 0, 10)    # Keep PRES between 0 and 10 dbar
    .to_pandas_dataframe()              # Execute the query and return a pandas DataFrame
)
df
For finer control, the builder also supports column aliases, coalesced selects, and polygon filters:
from datetime import datetime

query = (
    stations
    .query()
    .add_select_columns([               # (column, alias) pairs; None keeps the original name
        ("LONGITUDE", None),
        ("LATITUDE", None),
        ("JULD", None),
        ("TEMP", "temperature_c"),
        ("PSAL", "salinity"),
    ])
    .add_select_coalesced(["SEA_NAME", "BASIN"], alias="water_body")
    .add_range_filter("JULD", datetime(2024, 1, 1), datetime(2024, 6, 1))
    .add_range_filter("PRES", 0, 10)
    .add_polygon_filter(
        longitude_column="LONGITUDE",
        latitude_column="LATITUDE",
        polygon=[(-5.2, 52.0), (-5.2, 52.5), (-4.2, 52.5), (-4.2, 52.0), (-5.2, 52.0)],
    )
)
Need a quick spatial/temporal subset without writing filters manually? DataTable.subset() wraps the same builder and automatically selects longitude/latitude/depth/time columns.
subset_query = stations.subset(
    longitude_column="LONGITUDE",
    latitude_column="LATITUDE",
    time_column="JULD",
    depth_column="PRES",
    columns=["TEMP", "PSAL"],
    bbox=(-20, 40, -10, 50),
    depth_range=(0, 50),
)
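Because subset() returns the same builder, the result executes like any other query:
subset_df = subset_query.to_pandas_dataframe()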
5. Execute the query
Every query inherits rich output helpers from BaseQuery:
query = tables["argo"].query()
... # build up the query as shown above
# Serialize results into various formats
df = query.to_pandas_dataframe()
gdf = query.to_geo_pandas_dataframe("LONGITUDE", "LATITUDE")
query.to_parquet("subset.parquet")
query.to_geoparquet("subset.geoparquet", "LONGITUDE", "LATITUDE")
query.to_netcdf("subset.nc")
query.to_nd_netcdf("subset_nd.nc", dimension_columns=["LONGITUDE", "LATITUDE", "JULD"])
query.to_zarr("subset.zarr")
Beacon compatibility
to_nd_netcdf requires Beacon Node v1.5.0 or newer.
Need lazy/out-of-core execution? Use to_dask_dataframe() or to_xarray_dataset() with chunking, or call to_dask_dataframe().head() for quick inspection.
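A sketch of the lazy path (chunking arguments, if any, follow those libraries' conventions; check the API reference for exact signatures):
ddf = query.to_dask_dataframe()  # lazy Dask DataFrame; nothing is materialized yet
print(ddf.head())                # pulls just enough data for a quick look
ds = query.to_xarray_dataset()   # xarray Dataset for labeled, chunked arrays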
Profiling and explain
Call query.explain() to retrieve the Beacon execution plan, or query.execute() to inspect the raw HTTP response.
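For example (the plan and response formats depend on the server and SDK version):
plan = query.explain()      # Beacon execution plan for the built query
print(plan)
response = query.execute()  # raw HTTP response, handy for debugging
print(response)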
6. Run SQL directly
When you already have SQL, skip the builder and call:
sql = client.sql_query("""
    SELECT lon, lat, juld, temperature
    FROM <some-collection-name>
    WHERE juld BETWEEN '2024-01-01T00:00:00' AND '2024-06-30T23:59:59'
""")
df = sql.to_pandas_dataframe()
print(df)
Next steps
- Deep dive into exploring datasets and tables.
- Learn the fluent APIs in Querying the Beacon Data Lake.
- Browse the auto-generated API reference.