Common data store conventions

This document is a work in progress.

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Useful references related to this document include:

The JSON Schema Specification and the book Understanding JSON Schema
xcube Issue #330 (‘Establish common data store conventions’)
The existing xcube store plugins xcube-sh, xcube-cci, and xcube-cds
The xcube.util.jsonschema source code

Naming Identifiers

This section explains various identifiers used by the xcube data store framework and defines their format.

In the data store framework, identifiers are used to denote data sources, data stores, and data accessors. Data store, data opener, and data writer identifiers are used to register the component as extension in a package’s plugin.py. Identifiers MUST be unambiguous in the scope of the data store. They SHOULD be unambiguous across the entirety of data stores.

There are no further restrictions for data source and data store identifiers.

A data accessor identifier MUST correspond to the following scheme:

<data_type>:<format>:<storage>[:<version>]

<data_type> identifies the in-memory data type to represent the data, e.g., dataset (or xarray.Dataset), geodataframe (or geopandas.GeoDataFrame). <format> identifies the data format that may be accessed, e.g., zarr, netcdf, geojson. <storage> identifies the kind of storage or data provision the accessor can access. Example values are file (the local file system), s3 (AWS S3-compatible object storage), or sentinelhub (the Sentinel Hub API), or cciodp (the ESA CCI Open Data Portal API). The <version> finally is an optional notifier about a data accessor’s version. The version MUST follow the Semantic Versioning.

Examples for valid data accessors identifiers are:

dataset:netcdf:file
dataset:zarr:sentinelhub
geodataframe:geojson:file
geodataframe:shapefile:cciodp:0.4.1

Open Parameters

This section aims to provide an overview of the interface defined by an xcube data store or opener in its open parameters schema, and how this schema may be used by a UI generator to automatically construct a user interface for a data opener.

Specification of open parameters

Every implementation of the xcube.core.store.DataOpener or xcube.core.store.DataStore abstract base classes MUST implement the get_open_data_params_schema method in order to provide a description of the allowed arguments to open_data for each dataset supported by the DataOpener or DataStore. The description is provided as a JsonObjectSchema object corresponding to a JSON Schema. The intention is that this description should be full and detailed enough to allow the automatic construction of a user interface for access to the available datasets. Note that, under this system:

Every dataset provided by an opener can support a different set of open parameters.
The schema does not allow the representation of interdependencies between values of open parameters within a dataset. For instance, the following interdependencies between two open parameters sensor_type and variables would not be representable in an open parameters schema:

sensor_type: A or B
variables: [temperature, humidity] for sensor type A; [temperature, pressure] for sensor type B

To work around some of the restrictions of point (2) above, a dataset MAY be presented by the opener as multiple “virtual” datasets with different parameter schemas. For instance, the hypothetical dataset described above MAY be offered not as a single dataset envdata but as two datasets envdata:sensor-a (with a fixed sensor_type of A) and envdata:sensor-b, (with a fixed sensor_type of B), offering different sets of permitted variables. Sometimes, the interdependencies between parameters are too complex to be fully represented by splitting datasets in this manner. In these cases:

The JSON Schema SHOULD describe the smallest possible superset of the allowed parameter combinations.
The additional restrictions on parameter combinations MUST be clearly documented.
If illegal parameter combinations are supplied, the opener MUST raise an exception with an informative error message, and the user interface SHOULD present this message clearly to the user.

Common parameters

While an opener is free to define any open parameters for any of its datasets, there are some common parameters which are likely to be used by the majority of datasets. Furthermore, there are some parameters which are fundamental for the description of a dataset and therefore MUST be included in a schema (these parameters are denoted explicitly in the list below). In case that an opener does not support varying values of one of these parameters, a constant value must defined. This may be achieved by the JSON schema’s const property or by an enum property value whose is a one-element array.

Any dataset requiring the specification of these parameters MUST use the standard parameter names, syntax, and semantics defined below, in order to keep the interface consistent. For instance, if a dataset allows a time aggregation period to be specified, it MUST use the time_period parameter with the format described below rather than some other alternative name and/or format. Below, the parameters are described with their Python type annotations.

variable_names: List[str]
A list of the identifiers of the requested variables. This parameter MUST be included in an opener parameters schema.
bbox: Union[str,Tuple[float, float, float, float]] The bounding box for the requested data, in the order xmin, ymin, xmax, ymax. Must be given in the units of the specified spatial coordinate reference system crs. This parameter MUST be included in an opener parameters schema.
crs: str
The identifier for the spatial coordinate reference system of geographic data.
spatial_res: float
The requested spatial resolution (x and y) of the returned data. Must be given in the units of the specified spatial coordinate reference system crs. This parameter MUST be included in an opener parameters schema.
time_range: Tuple[Optional[str], Optional[str]]
The requested time range for the data to be returned. The first member of the tuple is the start time; the second is the end time. See section ‘Date, time, and duration specifications’. This parameter MUST be included in an opener parameters schema. If a date without a time is given as the start time, it is interpeted as 00:00 on the specified date. If a date without a time is given as the end time, it is interpreted as 24:00 on the specified date (identical with 00:00 on the date following the specified date). If the end time is specified as None, it is interpreted as the current time.
time_period: str
The requested temporal aggregation period for the data. See section ‘Date, time, and duration specifications’. This parameter MUST be included in an opener parameters schema.
force_cube: bool
Whether to return results as a specification-compliant xcube. If a store supports this parameter and if a dataset is opened with this parameter set to True, the store MUST return a specification-compliant xcube dataset. If this parameter is not supported or if a dataset is opened with this parameter set to False, the caller MUST NOT assume that the returned data conform to the xcube specification.

Semantics of list-valued parameters

The variables parameter takes as its value a list, with no duplicated members and the values of its members drawn from a predefined set. The values of this parameter, and other parameters whose values also follow such a format, are interpreted by xcube as a restriction, much like a bounding box or time range. That is:

By default (if the parameter is omitted or if a None value is supplied for it), all the possible member values MUST be included in the list. In the case of variables, this will result in a dataset containing all the available variables.
If a list containing some of the possible members is given, a dataset corresponding to those members only MUST be returned. In the case of variables, this will result in a dataset containing only the requested variables.
A special case of the above: if an empty list is supplied, a dataset containing no data MUST be returned – but with the requested spatial and temporal dimensions.

Date, time, and duration specifications

In the common parameter time_range, times can be specified using the standard JSON Schema formats date-time or date. Any additional time or date parameters supported by an xcube opener dataset SHOULD also use these formats, unless there is some good reason to prefer a different format.

The formats are described in the JSON Schema Validation 2019 draft, which adopts definitions from RFC 3339 Section 5.6. The JSON Schema date-time format corresponds to RFC 3339’s date-time production, and JSON Schema’s date format to RFC 3339’s full-date production. These formats are subsets of the widely adopted ISO 8601 format.

The date format corresponds to the pattern YYYY-MM-DD (four-digit year – month – day), for example 1995-08-20. The date-time format consists of a date (in the date format), a time (in HH:MM:SS format), and timezone (Z for UTC, or +HH:MM or -HH:MM format). The date and time are separated by the letter T. Examples of date-time format include 1961-03-23T12:22:45Z and 2018-04-01T21:12:00+08:00. Fractions of a second MAY also be included, but are unlikely to be relevant for xcube openers.

The format for durations, as used for aggregation period, does not conform to the syntax defined for this purpose in the ISO 8601 standard (which is also quoted as Appendix A of RFC 3339). Instead, the required format is a small subset of the pandas time series frequency syntax, defined by the following regular expression:

^([1-9][0-9]*)?[HDWMY]$

That is: an optional positive integer followed by one of the letters H (hour), D (day), W (week), M (month), and Y (year). The letter specifies the time unit and the integer specifies the number of units. If the integer is omitted, 1 is assumed.

Time limits: an extension to the JSON Schema

JSON Schema itself does not offer a way to impose time limits on a string schema with the date or date-time format. This is a problem for xcube generator UI creation, since it might be reasonably expected that a UI will show and enforce such limits. The xcube opener API therefore defines an unofficial extension to the JSON string schema: a JsonStringSchema object (as returned as part of a JsonSchema by a call to get_open_data_params_schema) MAY, if it has a format property with a value of date or date-time, also have one or both of the properties min_datetime and max_datetime. These properties must also conform to the date or date-time format. xcube provides a dedicated JsonDatetimeSchema for this purpose. Internally, it extends JsonStringSchema by adding the required properties to the JSON string schema.

Generating a UI from a schema

With the addition of the time limits extension described above, the JSON Schema returned by get_open_data_params_schema is expected to be extensive and detailed enough to fully describe a UI for cube generation.

Order of properties in a schema

Sub-elements of a JsonObjectSchema are passed to the constructor using the properties parameter with type signature Mapping[str, JsonSchema]. Openers SHOULD provide an ordered mapping as the value of properties, with the elements placed in an order suitable for presentation in a UI, and UI generators SHOULD lay out the UI in the provided order, with the exception of the common parameters discussed below. Note that the CPython dict object preserves the insertion order of its elements as of Python 3.6, and that this behaviour is officially guaranteed as of Python 3.7, so additional classes like OrderedDict are no longer necessary to fulfil this requirement.

Special handling of common parameters

Any of the common parameters listed above SHOULD, if present, be recognized and handled specially. They SHOULD be presented in a consistent position (e.g. at the top of the page for a web GUI), in a consistent order, and with user-friendly labels and tooltips even if the title and description annotations (see below) are absent. The UI generator MAY provide special representations for these parameters, for instance an interactive map for the bbox parameter.

An opener MAY provide title, description, and/or examples annotations for any of the common parameters, and a UI generator MAY choose to use any of these to supplement or modify its standard presentation of the common parameters.

Schema annotations (title, description, examples, and default)

For JSON Schemas describing parameters other than the common parameters, an opener SHOULD provide the title and description annotations. A UI generator SHOULD make use of these annotations, for example by taking the label for a UI control from title and the tooltip from description. The opener and UI generator MAY additionally make use of the examples annotation to record and display example values for a parameter. If a sensible default value can be envisaged, the opener SHOULD record this default as the value of the default annotation and the UI generator SHOULD set the default value in the UI accordingly. If the title annotation is absent, the UI generator SHOULD use the key corresponding to the parameter’s schema in the parent schema as a fallback.

Generalized conversion of parameter schemas

For parameters other than the common parameters, the UI can be generated automatically from the schema structure. In the case of a GUI, a one-to-one conversion of values of JSON Schema properties into GUI elements will generally be fairly straightforward. For instance:

A schema of type boolean can be represented as a checkbox.
A schema of type string without restrictions on allowed items can be represented as an editable text field.
A schema of type string with an enum keyword giving a list of allowed values can be represented as a drop-down menu.
A schema of type string with the keyword setting "format": "date" can be represented as a specialized date selector.
A schema of type array with the keyword setting "uniqueItems": true and an items keyword giving a fixed list of allowed values can be represented as a list of checkboxes.