
Performance Tuning

Beacon Query Engine Settings

This chapter focuses on practical performance knobs that are safe to tune in production.

Beacon is built on DataFusion. Most performance tuning comes down to:

  • How much parallelism Beacon is allowed to use (threads)
  • How much memory is available to the query engine before spilling to disk
  • Whether Beacon can avoid unnecessary IO (projection pushdown, caches)

TIP

All settings below are environment variables. See the full list in configuration.md.

CPU and concurrency

BEACON_WORKER_THREADS

Beacon uses this value to size its Tokio runtime (the executor that runs API requests and query work).

  • If you run Beacon on a dedicated machine, start with BEACON_WORKER_THREADS ~= number of physical cores.
  • If the same host also runs other heavy services, cap this to leave headroom.

For IO-heavy workloads (remote object storage, NetCDF reads over HTTP), more threads can help. For CPU-heavy workloads (aggregations, joins), throughput is bounded by the available cores, so extra threads add little.
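For example, a starting configuration for a dedicated 8-core host (the value is illustrative, not a default):

# Dedicated host: roughly one worker thread per physical core
export BEACON_WORKER_THREADS=8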

Memory and disk spilling

BEACON_VM_MEMORY_SIZE

This controls the DataFusion memory pool size (in MB). When queries exceed this pool, DataFusion can spill to disk.

  • Larger values generally improve performance by reducing spill frequency.
  • If you see high disk activity and slow queries under load, increase this first.

WARNING

Spilling uses the OS temp area (DataFusion disk manager). For best performance, ensure your temp directory is on fast storage and has enough free space.
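As a sketch, a host with 32 GB of RAM dedicated to Beacon might reserve half for the query memory pool and keep spills on fast local storage. Both values below are placeholders for your environment; on Unix-like systems the OS temp area usually honors TMPDIR:

# 16 GB memory pool; the value is in MB
export BEACON_VM_MEMORY_SIZE=16384
# Direct OS temp (and therefore spill files) to fast local storage
export TMPDIR=/mnt/nvme/tmp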

Avoiding unnecessary reads

BEACON_ENABLE_PUSHDOWN_PROJECTION

When enabled, Beacon builds the scan to read only the columns referenced by your JSON query's select list, rather than every column in the dataset.

  • Set to true when you frequently query “wide” datasets but only select a few columns.
  • Leave as false if you suspect projection bugs or you want the simplest behavior.

Query language and parsing

BEACON_ENABLE_SQL

SQL parsing/execution is guarded by this flag.

  • Set to true to allow SQL queries.
  • Keep false if you only use the JSON query API and want to reduce exposed surface area.
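Taken together, a minimal sketch of these two flags for a JSON-only deployment over wide datasets (illustrative, not defaults):

# Read only the selected columns from wide datasets
export BEACON_ENABLE_PUSHDOWN_PROJECTION=true
# JSON query API only; keep the SQL surface disabled
export BEACON_ENABLE_SQL=false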

Geospatial function cache

BEACON_ST_WITHIN_POINT_CACHE_SIZE

Controls the cache capacity used by the ST_WithinPoint implementation.

  • Increase this if you run many repeated point-in-polygon style queries over similar geometries.
  • Reduce it if memory pressure is an issue and you don’t benefit from reuse.
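As a starting point, the sketch below assumes the capacity is a count of cached entries; check configuration.md for the exact unit and default:

# Assumed here to be a number of cached entries
export BEACON_ST_WITHIN_POINT_CACHE_SIZE=10000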

Filesystem and object-store listing

BEACON_ENABLE_FS_EVENTS

When using local filesystem datasets, enabling filesystem events allows Beacon’s object-store layer to maintain an in-memory view of changes and avoid expensive directory rescans.

  • Enable (true) when datasets change frequently and you care about fast “list/search datasets” operations.
  • Keep disabled if you are on a platform where file watching is noisy/unsupported.

BEACON_S3_DATA_LAKE, BEACON_ENABLE_S3_EVENTS, BEACON_S3_BUCKET, BEACON_S3_ENABLE_VIRTUAL_HOSTING

These control whether Beacon uses S3-compatible object storage for datasets and how it addresses buckets.

  • For performance, prefer placing Beacon close (network-wise) to the object store.
  • If you see high latency when listing, consider enabling event-driven updates (BEACON_ENABLE_S3_EVENTS) if your S3 backend supports it.
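A sketch of an S3-backed setup; the bucket name and addressing choice are placeholders for your environment:

# Serve datasets from an S3-compatible object store
export BEACON_S3_DATA_LAKE=true
export BEACON_S3_BUCKET=my-datasets
# Path-style addressing is common for self-hosted S3-compatible backends;
# enable virtual hosting for AWS-style bucket-in-hostname addressing
export BEACON_S3_ENABLE_VIRTUAL_HOSTING=false
# Event-driven listing updates, if your backend supports them
export BEACON_ENABLE_S3_EVENTS=true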

NetCDF Tuning

NetCDF performance in Beacon is mainly affected by:

  • How often Beacon needs to open the file and infer schema
  • Whether opened readers/schemas are cached
  • Whether NetCDF reads are offloaded to a multi-process worker pool (MPIO mode)

TIP

NetCDF scans currently read a single Arrow RecordBatch per file. If you have extremely large .nc files, performance may improve by splitting them into smaller files or converting to chunk-friendly formats (e.g. Zarr), depending on your access pattern.

Schema cache (fast repeated schema inference)

BEACON_NETCDF_USE_SCHEMA_CACHE and BEACON_NETCDF_SCHEMA_CACHE_SIZE

Beacon discovers an Arrow schema for NetCDF datasets by opening files and inspecting variables/attributes. With schema caching enabled, these discovered schemas are cached in memory and keyed by:

  • the object path
  • the object last-modified timestamp

Recommendations:

  • Keep BEACON_NETCDF_USE_SCHEMA_CACHE=true (default) for most deployments.
  • Increase BEACON_NETCDF_SCHEMA_CACHE_SIZE when you query many distinct NetCDF files/tables and see repeated schema inference.
  • Reduce cache size if memory is constrained and your workload touches only a small working set.

Reader cache (avoid reopening files)

BEACON_NETCDF_USE_READER_CACHE and BEACON_NETCDF_READER_CACHE_SIZE

With reader caching enabled, Beacon reuses opened NetCDF readers (also keyed by path + last-modified time). This helps when the same files are accessed repeatedly across queries.

Recommendations:

  • Keep BEACON_NETCDF_USE_READER_CACHE=true (default) for repeated-access workloads.
  • Increase BEACON_NETCDF_READER_CACHE_SIZE if your “hot set” of NetCDF files is larger than the default (128).
  • Disable reader caching if you have extremely high file churn and want to minimize open file handles.
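The two extremes look like this (values are illustrative; the suggested starting points below give a fuller set):

# Hot-set workload: keep readers open across queries
export BEACON_NETCDF_USE_READER_CACHE=true
export BEACON_NETCDF_READER_CACHE_SIZE=512

# High-churn workload: keep the schema cache, release file handles
# export BEACON_NETCDF_USE_READER_CACHE=false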

NetCDF multiplexer (MPIO): multi-process NetCDF reads

BEACON_ENABLE_MULTIPLEXER_NETCDF

When enabled, Beacon routes NetCDF schema reads and batch reads through a pool of external worker processes (beacon-arrow-netcdf-mpio). This can improve throughput by:

  • parallelizing NetCDF reads across processes
  • isolating blocking NetCDF operations from the main async runtime

Enable this when:

  • you have many concurrent queries reading NetCDF
  • NetCDF open/read calls are a bottleneck
  • you want better isolation from native library behavior

Keep it disabled when:

  • you run a minimal deployment and prefer fewer moving parts

BEACON_NETCDF_MULTIPLEXER_PROCESSES

Controls the number of worker processes in the MPIO pool.

  • Default is half of the available CPU cores.
  • Increase for high concurrency / IO-bound workloads.
  • Decrease if the workers compete too aggressively with query execution CPU.

BEACON_NETCDF_MPIO_WORKER

Optional explicit path to the beacon-arrow-netcdf-mpio executable.

Use this if:

  • the worker is not on PATH

BEACON_NETCDF_MPIO_REQUEST_TIMEOUT_MS

Per-request timeout for calls to the MPIO worker pool.

  • Default is 0 (disabled).
  • Set this in production to bound tail latency if you’ve observed stuck or pathological reads.
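Putting the MPIO settings together, a sketch for a 16-core host. The worker path and timeout are illustrative and only needed in the situations described above:

# Route NetCDF schema and batch reads through the worker pool
export BEACON_ENABLE_MULTIPLEXER_NETCDF=true
# Half the cores of a hypothetical 16-core host
export BEACON_NETCDF_MULTIPLEXER_PROCESSES=8
# Placeholder path; only needed if the worker is not on PATH
export BEACON_NETCDF_MPIO_WORKER=/opt/beacon/bin/beacon-arrow-netcdf-mpio
# Bound tail latency at 30 s (default 0 disables the timeout)
export BEACON_NETCDF_MPIO_REQUEST_TIMEOUT_MS=30000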

Suggested starting points

For a “many users, many NetCDF files” deployment:

  • BEACON_NETCDF_USE_SCHEMA_CACHE=true
  • BEACON_NETCDF_SCHEMA_CACHE_SIZE=4096
  • BEACON_NETCDF_USE_READER_CACHE=true
  • BEACON_NETCDF_READER_CACHE_SIZE=512
  • BEACON_ENABLE_MULTIPLEXER_NETCDF=true
  • BEACON_NETCDF_MULTIPLEXER_PROCESSES ~= CPU/2 (then tune up/down)

For a small single-user deployment:

  • Keep defaults, and only enable the MPIO multiplexer if NetCDF reads dominate runtime.

Zarr statistics (predicate pruning)

Beacon’s Zarr reader can use lightweight statistics for predicate-aware IO reduction. When enabled, Beacon can:

  • Prune entire Zarr groups that cannot satisfy your filter (based on per-column min/max)
  • Push down 1D slicing for selected “coordinate-like” arrays (for example time/lat/lon), so only relevant ranges are read

Enable statistics for a Zarr collection

When you create a logical Zarr table/collection, include statistics.columns under the Zarr table definition.

POST /api/admin/create-table/
Content-Type: application/json

{
	"table_name": "my_zarr_table",
	"table_type": {
		"logical": {
			"paths": [
				"datasets/**/*.zarr/zarr.json"
			],
			"file_format": "zarr",
			"statistics": {
				"columns": ["valid_time", "latitude", "longitude"]
			}
		}
	}
}

Enable statistics for ad-hoc reads (SQL)

read_zarr supports an optional second argument: a list of statistics_columns.

POST /api/query
Content-Type: application/json

{
	"sql": "SELECT * FROM read_zarr(['datasets/**/*.zarr/zarr.json'], ['valid_time', 'latitude', 'longitude']) WHERE valid_time >= '2025-01-01' AND longitude < 30 LIMIT 100",
	"output": { "format": "csv" }
}

Enable statistics for ad-hoc reads (JSON)

For JSON queries, set from.zarr.statistics_columns:

POST /api/query
Content-Type: application/json

{
	"from": {
		"zarr": {
			"paths": ["datasets/**/*.zarr/zarr.json"],
			"statistics_columns": ["valid_time", "latitude", "longitude"]
		}
	},
	"select": ["valid_time", "latitude", "longitude"],
	"filters": [
        { "column": "valid_time", "min": "2025-01-01" },
        { "column": "longitude", "min": 15, "max": 30 }
    ],
	"limit": 100,
	"output": { "format": "csv" }
}

WARNING

Statistics are computed by reading the full values of the selected arrays (and slice pushdown only applies to 1D arrays). Only enable statistics for a small set of frequently-filtered, coordinate-like columns.