Introduction

Open Source (AGPL V3)

Beacon is open source under the AGPL V3 license. Source code and contributions: github.com/maris-development/beacon

Beacon is a data lakehouse query engine built for scientific datasets. Point it at your existing files — on disk or in S3 — and it exposes a SQL query API instantly, with no data migration or preprocessing required.

Clients query Beacon using SQL or JSON and receive results as a file (Parquet, NetCDF, Arrow IPC, …) or a streaming Arrow IPC response. Beacon handles filtering, aggregation, and joins across files entirely server-side.

Quick setup

Start Beacon with Docker, mounting a local datasets folder that holds your files:

bash

docker run -d \
  --name beacon \
  -p 5001:5001 \
  -v ./datasets:/beacon/data/datasets \
  ghcr.io/maris-development/beacon:latest

Drop your .nc (or Parquet, Zarr, CSV, …) files into ./datasets and they are queryable immediately at http://localhost:5001. For Compose, S3-backed storage, and more, see the getting started guide.

Your first query

The same query can be run several ways: over the HTTP API against Parquet or NetCDF files, or from the Python SDK. No table registration is required, because the read_*() functions read the files in place. Paths are always relative to the datasets directory you mounted.

HTTP · Parquet | HTTP · NetCDF | Python

http

POST /api/query
Content-Type: application/json

{
  "sql": "SELECT time, latitude, longitude, temperature FROM read_parquet(['example.parquet']) WHERE temperature > 20 LIMIT 100",
  "output": { "format": "csv" }
}

http

POST /api/query
Content-Type: application/json

{
  "sql": "SELECT time, latitude, longitude, temperature FROM read_netcdf(['example.nc']) WHERE temperature > 20 LIMIT 100",
  "output": { "format": "csv" }
}

python

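# Notebook magic; from a shell, run: pip install beacon-api --upgrade
%pip install beacon-api --upgrade
from beacon_api import Client

# Replace with your node's URL, e.g. http://localhost:5001 for the Docker setup above
client = Client("https://your-beacon-node")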

df = client.sql_query(
    "SELECT time, latitude, longitude, temperature "
    "FROM read_netcdf(['example.nc']) "
    "WHERE temperature > 20 LIMIT 100"
).to_pandas_dataframe()

SQL is sent over the HTTP API (POST /api/query, with BEACON_ENABLE_SQL=true) or Arrow Flight SQL. Prefer querying by name? Register the files as an external table first.
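If you would rather not install the SDK, the HTTP request shown above can also be sent with the Python standard library alone. The following is a minimal sketch, not an official client: the helper names `build_query_request` and `run_query` are hypothetical, the base URL assumes the Docker setup from Quick setup, and the response is treated as opaque bytes because the exact response headers are not described on this page.

```python
import json
import urllib.request


def build_query_request(base_url, sql, output_format="csv"):
    """Build (but do not send) a POST /api/query request.

    The body mirrors the JSON shown in the HTTP examples:
    {"sql": ..., "output": {"format": ...}}.
    """
    payload = {"sql": sql, "output": {"format": output_format}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(base_url, sql, output_format="csv"):
    """Send the query and return the raw response body as bytes."""
    request = build_query_request(base_url, sql, output_format)
    with urllib.request.urlopen(request) as response:
        return response.read()


# Example usage (assumes a Beacon node on localhost:5001 with SQL enabled):
# sql = (
#     "SELECT time, latitude, longitude, temperature "
#     "FROM read_parquet(['example.parquet']) "
#     "WHERE temperature > 20 LIMIT 100"
# )
# print(run_query("http://localhost:5001", sql).decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the exact URL and body before anything goes over the network.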

Supported formats

Format	Notes
NetCDF	`.nc`, `.nc4`, `.cdf`
Zarr	v2 and v3
Atlas	Array store optimized for NetCDF/Zarr query performance
Parquet	Native columnar, Hive partitioning supported
GeoTIFF / COG	Cloud-Optimized GeoTIFF supported
ODV ASCII	Ocean Data View spreadsheet format
CSV	Header row required, delimiter configurable
Arrow IPC	`.arrow`, `.ipc` stream files
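Before mounting a datasets folder, it can help to check which files fall under the formats in the table. The sketch below guesses a format from a file extension. Only the NetCDF and Arrow IPC extensions come from the table above; the remaining suffixes are assumptions based on common naming conventions, and `guess_format` is a hypothetical helper, not part of Beacon.

```python
from pathlib import PurePosixPath
from typing import Optional

FORMAT_BY_SUFFIX = {
    # Extensions listed in the table above
    ".nc": "NetCDF",
    ".nc4": "NetCDF",
    ".cdf": "NetCDF",
    ".arrow": "Arrow IPC",
    ".ipc": "Arrow IPC",
    # Assumed conventional extensions (not stated in the table)
    ".parquet": "Parquet",
    ".zarr": "Zarr",
    ".csv": "CSV",
    ".tif": "GeoTIFF / COG",
    ".tiff": "GeoTIFF / COG",
}


def guess_format(path: str) -> Optional[str]:
    """Return the likely format for a path, or None if the suffix is unknown."""
    suffix = PurePosixPath(path).suffix.lower()
    return FORMAT_BY_SUFFIX.get(suffix)
```

This only inspects file names; it does not open files or confirm that Beacon can read them.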

Key concepts

A few terms used throughout the docs:

Dataset — an individual file Beacon reads in place (NetCDF, Zarr, Parquet, …).
External table — a registered name over one or more files (a folder or glob pattern), with a merged schema across them.
View — a saved query exposed as a table.
Managed table — an Iceberg-backed table Beacon owns and can mutate (INSERT / UPDATE / DELETE).
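To make "merged schema" concrete, here is a small illustration of one common approach: taking the union of column names across files. This is a conceptual sketch only. Beacon's actual merge rules (type promotion, conflict handling) are not described on this page, and `merge_schemas` is a hypothetical function operating on plain dictionaries rather than real files.

```python
def merge_schemas(schemas):
    """Union-by-name merge of {column: dtype} mappings.

    Columns keep the order in which they are first seen. A column that
    appears with two different dtypes raises ValueError (an assumption;
    a real engine might promote types instead).
    """
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"column {column!r} has conflicting types: "
                    f"{merged[column]} vs {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# Two files that share "time" but differ in their measurement columns
file_a = {"time": "timestamp", "temperature": "float64"}
file_b = {"time": "timestamp", "salinity": "float64"}
table_schema = merge_schemas([file_a, file_b])
```

Under this approach, a row that comes from a file lacking a given column would carry a null in that column.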

Next steps

Get started — run Beacon with Docker, locally or against S3.
Connect a client — JetBrains DataGrip, Python ADBC/SDK, or the CLI.
Write queries — the SQL guide.
Register your data — external tables and views.
