The xcube generator

Introduction
The generator is an xcube feature which allows users to create, manipulate, and write xcube datasets according to a supplied configuration. The same configuration can be used to generate a dataset on the user’s local computer or remotely, using an online server.
The generator offers two main user interfaces: A Python API, configured using Python objects; and a command-line interface, configured using YAML or JSON files. The Python and file-based configurations have the same structure and are interconvertible.
The online generator service interfaces with the xcube client via a well-defined REST API; it is also possible for third-party clients to make use of this API directly, though it is expected that the Python and command-line interfaces will be more convenient in most cases.
Further documentation
This document aims to provide a brief overview of the generation process and the available configuration options. More details are available in other documents and in the code itself:
Probably the most thorough documentation is available in the Jupyter demo notebooks in the xcube repository. These can be run in any JupyterLab environment containing an xcube installation. They combine explanation with interactive worked examples to demonstrate practical usage of the generator in typical use cases.

For the Python API in particular, the xcube API documentation is generated from the docstrings included in the code itself and serves as a detailed low-level reference for individual Python classes and methods. The docstrings can also be read from a Python environment (e.g. using the ? postfix in IPython or JupyterLab) or, of course, by browsing the source code itself.

For the YAML/JSON configuration syntax used with the command-line interface, there are several examples available in the examples/gen2/configs subdirectory of the xcube repository.

For the REST API underlying the Python and command-line interfaces, there is a formal definition on SwaggerHub, and one of the example notebooks demonstrates its usage with the Python requests library.
The generation process

The usual cube generation process is as follows:

1. The generator opens the input data store using the store identifier and parameters in the input configuration.
2. The generator reads from the input store the data specified in the cube configuration and uses them to create a data cube, often with additional manipulation steps such as resampling the data.
3. If an optional code configuration has been given, the user-supplied code is run on the created data cube, potentially modifying it.
4. The generator writes the generated cube to the data store specified in the output configuration.
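The steps above map directly onto the sections of a generator request. As a sketch only, with illustrative (not real) store, data, and variable identifiers, a minimal YAML request might look like this:

```yaml
input_configs:            # steps 1-2: where and what to read
  - store_id: sentinelhub
    data_id: S2L2A
cube_config:              # step 2: how to shape the cube
  variable_names: [B04]
  time_range: ["2021-01-01", "2021-01-31"]
# step 3 (the optional code_config section) omitted here
output_config:            # step 4: where to write the result
  store_id: s3
  data_id: my-cube.zarr
  store_params:
    root: cube-outputs
```

Each of the sections is described in detail under "Configuration syntax" below.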
Invoking the generator from a Python environment

The configurations for the various parts of the generator are used to initialize a GeneratorRequest, which is then passed to xcube.core.gen2.generator.CubeGenerator.generate_cube. The generate_cube method returns a cube reference which can be used to open the cube from the output data store.

The generator can also be directly invoked with a configuration file from a Python environment, using the xcube.core.gen2.generator.CubeGenerator.from_file method.
Invoking the generator from the command line

The generator can be invoked from the command line using the xcube gen2 subcommand. (Note: the subcommand xcube gen invokes an earlier, deprecated generator feature which is not compatible with the generator framework described here.)
Configuration syntax

All Python configuration classes are defined in the xcube.core.gen2 package, except for CodeConfig, which is in xcube.core.byoa.

The types in the parameter tables are given in an ad-hoc, semi-formal notation whose corresponding Python and JSON representations should be obvious. For the formal Python type definitions, see the signatures of the __init__ methods of the configuration classes; for the formal JSON type definitions, see the JSON schemata (in JSON Schema format) produced by the get_schema methods of the configuration classes.
Remote generator service configuration

The command-line interface allows a service configuration for the remote generator service to be provided as a YAML or JSON file. This file defines the endpoint and access credentials for an online generator service. If it is provided, the specified remote service will be used to generate the cube. If it is omitted, the cube will be generated locally.

The configuration file defines three values: endpoint_url, client_id, and client_secret. A typical service configuration YAML file might look as follows:
endpoint_url: "https://xcube-gen.brockmann-consult.de/api/v2/"
client_id: "93da366d7c39517865e4f141ddf1dd81"
client_secret: "d2l0aG91dCByZXN0cmljdGlvbiwgaW5jbHVkaW5nIHd"
Store configuration
In the command-line interface, an additional YAML or JSON file containing one or more store configurations may be supplied. A store configuration encapsulates a data store ID and an associated set of store parameters, which can then be referenced by an associated store configuration identifier. This identifier can be used in the input configuration, as described below. A typical YAML store configuration might look as follows:
sentinelhub_eu:
title: SENTINEL Hub (Central Europe)
description: Datasets from the SENTINEL Hub API deployment in Central Europe
store_id: sentinelhub
store_params:
api_url: https://services.sentinel-hub.com
client_id: myid123
client_secret: 0c5892208a0a82f1599df026b5e19017
cds:
title: C3S Climate Data Store (CDS)
description: Selected datasets from the Copernicus CDS API
store_id: cds
store_params:
normalize_names: true
num_retries: 3
my_data_bucket:
title: S3 output bucket
description: An S3 bucket for output data sets
store_id: s3
store_params:
root: cube-outputs
storage_options:
key: qwerty12345
secret: 7ff889c0aea254d5e00440858289b85c
client_kwargs:
endpoint_url: https://my-endpoint.some-domain.org/
Input configuration

The input configuration defines the data store from which data for the cube are to be read, and any additional parameters which this data store requires.

The Python configuration object is InputConfig; the corresponding YAML configuration section is input_configs.
| Parameter | Required? | Type | Description |
|---|---|---|---|
| store_id | N | str | Identifier for the data store |
| opener_id | N | str | Identifier for the data opener |
| data_id | Y | str | Identifier for the dataset |
| store_params | N | map(str→any) | Parameters for the data store |
| open_params | N | map(str→any) | Parameters for the data opener |
store_id is a string identifier for a particular xcube data store, defined by the data store itself. If a store configuration file has been supplied (see above), a store configuration identifier can also be supplied here in place of a 'plain' store identifier. Store configuration identifiers must be prefixed by an @ symbol. If a store configuration identifier is supplied in place of a store identifier, store_params values will be supplied from the predefined store configuration and can be omitted from the input configuration.

data_id is a string identifier for the dataset within a particular store.

The format and content of the store_params and open_params dictionaries are defined by the individual store or opener.
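As a sketch, an input configuration referencing the sentinelhub_eu store configuration defined in the store configuration example above might look as follows (the data identifier and opener parameters are illustrative assumptions):

```yaml
input_configs:
  - store_id: "@sentinelhub_eu"   # '@' marks a store configuration identifier
    data_id: S2L2A                # dataset identifier within the store
    open_params:
      tile_size: [512, 512]       # hypothetical opener parameter
```

Because "@sentinelhub_eu" refers to a predefined store configuration, the store_params values are taken from that configuration and are omitted here.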
The generator service does not yet provide a remote interface to list available data stores, datasets, and store parameters (i.e. allowed values for the parameters in the table above). In a local xcube Python environment, you can list the currently available store identifiers with the expression list(map(lambda e: e.name, xcube.core.store.find_data_store_extensions())).
You can create a local store object for an identifier store_id with xcube.core.store.get_data_store_instance(store_id).store. The store object provides methods get_data_ids, get_data_store_params_schema, and get_open_data_params_schema to describe the allowed values for the corresponding parameters. Note that the available stores and datasets on a remote xcube generator server may not be the same as those available in your local xcube environment.
Cube configuration

This configuration element defines the characteristics of the cube that should be generated. The Python configuration class is called CubeConfig, and the YAML section cube_config. All parameters are optional and will be filled in with defaults if omitted; the default values depend on the data store and dataset.
| Parameter | Type | Units/Description |
|---|---|---|
| variable_names | [str, …] | Available variables are data store dependent. |
| crs | str | PROJ string, JSON string with PROJ parameters, CRS WKT string, or authority string |
| bbox | [float, float, float, float] | Bounding box (x1, y1, x2, y2) |
| spatial_res | float or [float, float] | CRS-dependent, usually degrees |
| tile_size | int or [int, int] | pixels |
| time_range | str or [str, str] | ISO 8601 subset |
| time_period | str | integer + unit |
| chunks | map(str→null/int) | maps variable names to chunk sizes |
The crs parameter string is interpreted using the CRS.from_string method of the pyproj package (https://pyproj4.github.io/pyproj/dev/api/crs/crs.html#pyproj.crs.CRS.from_string) and therefore accepts the same specifiers.
time_range specifies the start and end of the requested time range. These can be specified either as a date in the format YYYY-MM-DD or as a date and time in the format YYYY-MM-DD HH:MM:SS. If the time is omitted, it is taken to be 00:00:00 (the start of the day) for the start specifier and 24:00:00 (the end of the day) for the end specifier. The end specifier may be omitted; in this case the current time is used.
time_period specifies the duration of a single time step in the requested cube, which determines the temporal resolution. It consists of an integer denoting the number of time units, followed by a single upper-case letter denoting the time unit. Valid time unit specifiers are D (day), W (week), M (month), and Y (year). Examples of time_period values: 1Y (one year), 2M (two months), 10D (ten days).
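The "integer + unit" format can be illustrated with a few lines of plain Python. This is a sketch of the format only, not the parser used by xcube itself:

```python
import re

def parse_time_period(spec: str):
    """Split a time_period value such as "10D" into (count, unit)."""
    # One or more digits followed by exactly one of the valid unit letters.
    match = re.fullmatch(r"([0-9]+)([DWMY])", spec)
    if match is None:
        raise ValueError(f"invalid time_period: {spec!r}")
    return int(match.group(1)), match.group(2)

# parse_time_period("10D") → (10, "D"); parse_time_period("1Y") → (1, "Y")
```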
The value of the chunks mapping determines how the generated data are chunked for storage. The chunking has no effect on the data itself, but can have a dramatic impact on data access speeds in different scenarios. The value of chunks is structured as a map from variable names (corresponding to those specified by the variable_names parameter) to chunk sizes.
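Putting these parameters together, a cube_config section might be sketched as follows. The variable names and coordinates are illustrative, and the parameter names assume the CubeConfig parameters described above:

```yaml
cube_config:
  variable_names: [B04, B08]
  crs: EPSG:4326
  bbox: [9.0, 53.0, 11.0, 55.0]
  spatial_res: 0.001
  time_range: ["2021-01-01", "2021-12-31"]
  time_period: 10D
  chunks:
    B04: 1024    # chunk size for this variable
    B08: null    # null selects a default chunking
```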
Code configuration
The code configuration supports multiple ways to define a dataset processor – fundamentally, a Python function which takes a dataset and returns a processed version of the input dataset. Since the code configuration can work directly with instantiated Python objects (which can’t be stored in a YAML file), there are some differences in code configuration between the Python API and the YAML format.
| Parameter | Type | Units/Description |
|---|---|---|
| _callable † | Callable | Function to be called to process the data cube. Only available via the Python API |
| callable_ref | str (non-empty) | A reference to a Python class or function, in the format <module>:<function_or_class> |
| callable_params | map(str→any) | Parameters to be passed to the specified callable |
| inline_code † | str (non-empty) | An inline snippet of Python code |
| file_set † | FileSet (Python) / map (YAML) | A bundle of Python modules or packages (see details below) |
| install_required | boolean | If set, indicates that file_set contains a package which should be installed |
All parameters are optional (as is the entire code configuration itself). The three parameters marked † are mutually exclusive: at most one of them may be given.
_callable provides the dataset processor directly and is only available in the Python API. It must be either a function or a class.

If a function, it takes a Dataset and optional additional named parameters, and returns a Dataset. Any additional parameters are supplied in the callable_params parameter of the code configuration.

If a class, it must implement a method process_dataset, which is treated like the function described above, and may optionally implement a class method get_process_params_schema, which returns a JsonObjectSchema describing the additional parameters. For convenience and clarity, the class may extend the abstract base class DatasetProcessor, which declares both these methods.
callable_ref is a string with the structure <module>:<function_or_class>, and specifies the function or class to call when inline_code or file_set is provided. The specified function or class is handled like the _callable parameter described above.
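The resolution of such a reference can be sketched in plain Python. This illustrates the <module>:<function_or_class> format only; it is not code taken from xcube:

```python
import importlib

def resolve_callable_ref(ref: str):
    """Resolve a "<module>:<function_or_class>" reference to a Python object."""
    module_name, sep, attr_name = ref.partition(":")
    if not (module_name and sep and attr_name):
        raise ValueError(f"invalid callable reference: {ref!r}")
    # Import the named module, then look up the function or class by name.
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# Resolving a standard-library function as a stand-in for user code:
join = resolve_callable_ref("posixpath:join")
```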
callable_params specifies a dictionary of named parameters which are passed to the processor function or method.
inline_code is a string containing Python source code. If supplied, it should contain the definition of a function or class as described for the _callable parameter. The module and callable identifiers for the callable in the inline code snippet should be specified in the callable_ref parameter.
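A sketch of a YAML code configuration using inline_code; the module name user_code is a hypothetical label, chosen only so that callable_ref can refer to the snippet:

```yaml
code_config:
  inline_code: |
    def process_dataset(dataset, factor=1.0):
        return dataset * factor
  callable_ref: user_code:process_dataset
  callable_params:
    factor: 2.0
```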
file_set specifies a set of files which should be read from an fsspec file system and which contain a definition of a dataset processor. As with inline_code, the parameter callable_ref should also be supplied to tell the generator which class or function in the file set is the actual processor. The parameters of file_set are identical with those of the constructor of the corresponding Python FileSet class, and are as follows:
| Parameter | Type | Description |
|---|---|---|
| path | str | fsspec-compatible root path specifier |
| sub_path | str | optional sub-path to append to main path |
| includes | [str] | include files matching any of these patterns |
| excludes | [str] | exclude files matching any of these patterns |
| storage_params | map(str→any) | FS-specific parameters (passed to fsspec) |
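A sketch of a code configuration using file_set; the bucket path, module name, and pattern values are illustrative assumptions:

```yaml
code_config:
  file_set:
    path: s3://my-code-bucket/processors/
    includes: ["*.py"]
  callable_ref: my_processors:process_dataset
```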
Output configuration

This configuration element determines where the generated cube should be written. The Python configuration class is called OutputConfig, and the YAML section output_config.
| Parameter | Type | Units/Description |
|---|---|---|
| store_id | str | Identifier of the output store |
| writer_id | str | Identifier of the data writer |
| data_id | str | Identifier under which to write the cube |
| store_params | map(str→any) | Store-dependent parameters for the output store |
| write_params | map(str→any) | Writer-dependent parameters for the output writer |
| replace | bool | If true, replace any existing data with the same identifier. |
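A sketch of an output configuration writing to an S3 bucket like the one in the store configuration example above; the identifiers are illustrative:

```yaml
output_config:
  store_id: s3
  data_id: my-output-cube.zarr
  store_params:
    root: cube-outputs
  replace: true
```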