Datasets
All datasets intended for querying through Beacon must be placed in the following directory inside the Docker container: /beacon/data/datasets/
No other configuration is required. Beacon infers the schema (columns, variables, attributes) of each dataset and makes it available for querying.
Supported File Formats
Beacon supports the following file formats for querying datasets:
NetCDF
Beacon supports NetCDF, but only native data types. User-defined types are not supported.
NetCDF4 (recommended)
NetCDF3
- char* arrays with a string-like dimension (e.g. STRLEN) will be inferred as fixed-size strings
Parquet
Beacon supports Parquet natively through DataFusion.
Limitations:
Hive partitioning is not supported.
Cloud storage support (S3, GCS, etc.) is currently disabled but might be re-enabled in the future.
CSV
Beacon supports CSV files with the following limitations:
The first row of the CSV file must contain the column names.
The CSV file must be well-formed and properly formatted.
The CSV file must be encoded in UTF-8.
By default, Beacon infers the schema from the entire CSV file. This behavior can be changed in the configuration.
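The CSV requirements above can be checked before a file is copied into the datasets directory. The following is a minimal sketch, not part of Beacon: the function name `check_csv` is hypothetical, and the rule that every row must have as many fields as the header is an assumption about what "well-formed" means here.

```python
import csv
import io
from pathlib import Path


def check_csv(path: Path) -> list[str]:
    """Return the column names if the file passes some basic pre-flight checks.

    Hypothetical helper; Beacon itself may accept or reject files differently.
    """
    # Decoding raises UnicodeDecodeError if the file is not valid UTF-8.
    text = path.read_bytes().decode("utf-8")

    # newline="" lets the csv module handle line endings inside quoted fields.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        raise ValueError(f"{path} is empty")

    # The first row must contain the column names.
    header = rows[0]
    if any(not name.strip() for name in header):
        raise ValueError(f"{path} has an empty column name in its first row")

    # Assumption: a well-formed file has the same field count on every row.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"{path}: row {line_no} has {len(row)} fields, expected {len(header)}"
            )
    return header
```

Calling `check_csv(Path("stations.csv"))` on a valid file returns its header names; any failed check raises an exception that names the file.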
ODV ASCII
Fully supported. It is recommended to store ODV ASCII files with zstd compression. This can be done using the zstd command line tool:
zstd -9 < input.txt > output.txt.zst
Beacon will automatically detect the compression and decompress the file on the fly.
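When many ODV ASCII files need compressing, the same zstd call can be scripted. The sketch below is one possible approach and assumes the zstd command line tool is installed and on the PATH; the helper names `zstd_command` and `compress_all` and the default `*.txt` pattern are hypothetical.

```python
import subprocess
from pathlib import Path


def zstd_command(src: Path) -> list[str]:
    """Build a zstd call that writes <src>.zst next to the input file."""
    dst = src.with_name(src.name + ".zst")
    # -9: compression level, -q: quiet, -f: overwrite an existing output file,
    # -o: output path. zstd keeps the original input file by default.
    return ["zstd", "-9", "-q", "-f", str(src), "-o", str(dst)]


def compress_all(folder: Path, pattern: str = "*.txt") -> list[Path]:
    """Compress every file in folder that matches pattern; return the outputs."""
    outputs = []
    for src in sorted(folder.glob(pattern)):
        # check=True raises CalledProcessError if zstd exits with an error.
        subprocess.run(zstd_command(src), check=True)
        outputs.append(src.with_name(src.name + ".zst"))
    return outputs
```

For example, `compress_all(Path("./odv"))` would compress each `.txt` file in a hypothetical `./odv` folder and leave the originals in place.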
Arrow IPC
Fully supported.
Beacon Binary Format (Standard Edition only)
Fully supported. Learn more about this format and its usage.
Mounting Datasets into the Docker Container
If datasets are stored on the host machine, they need to be mounted into the container. This can be done when starting the Docker container, for example with the following Docker Compose file:
version: "3.8"
services:
  beacon:
    image: ghcr.io/maris-development/beacon:community-nightly-latest
    container_name: beacon
    restart: unless-stopped
    ports:
      - "8080:8080" # Adjust the port mapping as needed
    environment:
      - BEACON_ADMIN_USERNAME=admin # Replace with your admin username
      - BEACON_ADMIN_PASSWORD=securepassword # Replace with your admin password
      - BEACON_VM_MEMORY_SIZE=4096 # Adjust memory allocation as needed (in MB)
      - BEACON_DEFAULT_TABLE=default # Set default table name
      - BEACON_LOG_LEVEL=INFO # Adjust log level
      - BEACON_HOST=0.0.0.0 # Set IP address to listen on
      - BEACON_PORT=8080 # Set port number
    volumes:
      - ./data/datasets:/beacon/data/datasets # Adjust the volume mapping as required
Managing Datasets
To add datasets, copy them to the ./data/datasets directory on the host machine. They then automatically become available for querying through Beacon, with no need to restart the container. The same applies to removing datasets: delete the files from the directory.
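Adding and removing datasets is plain file management on the host. As a minimal sketch, assuming the `./data/datasets` volume mapping from the Compose file above, it could be scripted as follows; the helper names `add_dataset` and `remove_dataset` are hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: host side of the volume mapping in the Compose file.
DATASETS_DIR = Path("./data/datasets")


def add_dataset(src: Path, datasets_dir: Path = DATASETS_DIR) -> Path:
    """Copy a dataset file into the mounted datasets directory."""
    datasets_dir.mkdir(parents=True, exist_ok=True)
    dst = datasets_dir / src.name
    # copy2 also preserves file metadata such as the modification time.
    shutil.copy2(src, dst)
    return dst


def remove_dataset(name: str, datasets_dir: Path = DATASETS_DIR) -> None:
    """Delete a dataset file from the mounted datasets directory."""
    (datasets_dir / name).unlink()
```

For example, `add_dataset(Path("example.nc"))` places the file where the container sees it as `/beacon/data/datasets/example.nc`.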
Exploring Datasets
Once the datasets are mounted into the container, Beacon will automatically discover them and make them available for querying. To list the available datasets, you can use the following API endpoint:
GET /api/datasets
Listing available columns/variables/attributes of a dataset
GET /api/dataset-schema?file=example.nc
It is also possible to list the merged schema of all datasets that match a glob path:
GET /api/dataset-schema?file=*.nc
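The endpoints above can be called from any HTTP client. The sketch below uses only the Python standard library and rests on several assumptions: Beacon is reachable at `http://localhost:8080` (the port mapping from the Compose file), the endpoints return JSON, and no authentication is needed for these calls. The names `BASE_URL`, `schema_url`, and `get_json` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Beacon is exposed on the host as in the Compose file.
BASE_URL = "http://localhost:8080"


def schema_url(file_pattern: str, base_url: str = BASE_URL) -> str:
    """Build the dataset-schema URL for a single file or a glob path."""
    # urlencode percent-encodes characters such as '*' in glob patterns.
    query = urllib.parse.urlencode({"file": file_pattern})
    return f"{base_url}/api/dataset-schema?{query}"


def get_json(url: str):
    """Send a GET request and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)
```

With a running instance, `get_json(f"{BASE_URL}/api/datasets")` would list the datasets, and `get_json(schema_url("*.nc"))` would request the merged schema of all NetCDF files.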