Transforms

What

A transform is, very broadly speaking, a node in a computation plan that accepts as input something other than a “scalar” parameter and produces something other than a “scalar” output. In practice, this currently means taking one or more datacubes as input and producing one or more datacubes as output. See here for more details on datacubes. One can think of a transform as an arbitrary computation lifted into the domain of datacubes.
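To make the "lifting" idea concrete, here is a minimal sketch in plain Python. It assumes a datacube can be modeled as a dictionary from coordinates to values, which is a deliberate simplification and not how SuperMaaS represents datacubes; the names `Datacube`, `lift`, and `double` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a datacube: (latitude, longitude) -> value.
# This is an illustration only, not the SuperMaaS datacube format.
Coord = Tuple[float, float]
Datacube = Dict[Coord, float]


def lift(fn: Callable[[float], float]) -> Callable[[Datacube], Datacube]:
    """Turn a scalar-to-scalar function into a datacube-to-datacube transform."""
    def transform(cube: Datacube) -> Datacube:
        # Apply the scalar function to every value, keeping coordinates intact.
        return {coord: fn(value) for coord, value in cube.items()}
    return transform


# A scalar computation (doubling) lifted into the domain of datacubes.
double = lift(lambda v: v * 2.0)

cube = {(38.9, -77.0): 1.5, (45.5, -122.7): 0.25}
doubled = double(cube)
```

The lifted function never inspects a single scalar in isolation from the caller's point of view: it consumes a whole datacube and returns a whole datacube, which is the property that makes it a transform in the sense described above.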

Why

The gap between data producers and data consumers can be vast. Some of the guarantees SuperMaaS provides and the restrictions it imposes help to bridge the gap between producers and a single, well-defined, fairly highly structured method of data representation. Transform infrastructure, and transforms themselves, help to bridge the gap between that method of representation and disparate data consumption spaces.

For example, consider two models that produce rainfall data. Model A produces CSV files specifying inches of rainfall in named cities, and model B produces GeoTIFF images specifying centimeters of rainfall at latitude/longitude points. The registration process and the metadata it requires will tell SuperMaaS how to interpret these models’ outputs, store them internally, and provide API access to fetch their output data in a normalized format. Transforms interface with that API to fetch the data as SuperMaaS stores it, perform arbitrary computation over it, and return it to SuperMaaS for storage and potential further fetching, by another transform or by an end consumer. In this example, one can imagine the following workflow:

../_images/pipeline.svg
  1. Model A’s output is piped to transform C, which converts the city names model A provides into latitude/longitude coordinate pairs.

  2. Model B’s output is piped to transform D, which converts the numeric values B produces in centimeters to inches.

  3. Transform C’s output is piped to transform E as the first of two expected inputs. E determines the difference/error between two sources of data representing the same concept. This is an intentionally vague description as, again, the transform’s computation can be arbitrary.

  4. Transform D’s output is also piped to transform E, as the second of two expected inputs.

  5. Transform E’s output can be shipped off elsewhere, to an end consumer or to another transform.
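The workflow above can be sketched in plain Python. This is only an illustration under loud assumptions: datacubes are modeled as dictionaries, the city-to-coordinate table `CITY_COORDS` is made up, the function names are hypothetical, and transform E is arbitrarily taken to be a pointwise difference. None of this reflects the actual SuperMaaS API or storage format.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float]

# Hypothetical lookup table; a real transform C would consult a geocoding source.
CITY_COORDS: Dict[str, Coord] = {
    "Portland": (45.5, -122.7),
    "Washington": (38.9, -77.0),
}

CM_PER_INCH = 2.54


def transform_c(inches_by_city: Dict[str, float]) -> Dict[Coord, float]:
    """Step 1: replace city names with latitude/longitude pairs."""
    return {CITY_COORDS[city]: inches for city, inches in inches_by_city.items()}


def transform_d(cm_by_coord: Dict[Coord, float]) -> Dict[Coord, float]:
    """Step 2: convert centimeters to inches."""
    return {coord: cm / CM_PER_INCH for coord, cm in cm_by_coord.items()}


def transform_e(first: Dict[Coord, float], second: Dict[Coord, float]) -> Dict[Coord, float]:
    """Steps 3 and 4: pointwise difference over the coordinates both inputs share."""
    shared = first.keys() & second.keys()
    return {coord: first[coord] - second[coord] for coord in shared}


# Made-up model outputs, in the units each model is described as producing.
model_a_output = {"Portland": 2.0, "Washington": 1.0}                  # inches, by city
model_b_output = {(45.5, -122.7): 5.08, (38.9, -77.0): 3.81}           # centimeters, by coordinate

# Step 5: the result is another datacube-like value, ready for further use.
error = transform_e(transform_c(model_a_output), transform_d(model_b_output))
```

Because each transform both consumes and produces the same kind of value, the outputs of C and D can feed E without E knowing anything about the models that originally produced the data.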

How

Though nothing in SuperMaaS strictly requires it, transforms tend to be written in Python (>= 3.8). One reason for this is the supermaas_utils Python library, authored and maintained by Galois, which provides some API abstractions to ease datacube pull-modify-push workflows, the bread and butter of many transforms. For a literate sample transform, see here.
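The shape of a pull-modify-push workflow can be sketched as follows. This deliberately does not use supermaas_utils, whose API is not described here: `InMemoryStore`, `pull`, `push`, and `run_transform` are hypothetical stand-ins invented for this illustration, and the dictionary-based datacube is a simplification.

```python
from typing import Callable, Dict, Tuple

Coord = Tuple[float, float]
Datacube = Dict[Coord, float]


class InMemoryStore:
    """Hypothetical stand-in for SuperMaaS storage. NOT the supermaas_utils API."""

    def __init__(self) -> None:
        self._cubes: Dict[str, Datacube] = {}

    def pull(self, cube_id: str) -> Datacube:
        # Return a copy so callers cannot mutate stored data in place.
        return dict(self._cubes[cube_id])

    def push(self, cube_id: str, cube: Datacube) -> None:
        self._cubes[cube_id] = dict(cube)


def run_transform(
    store: InMemoryStore,
    input_id: str,
    output_id: str,
    modify: Callable[[Datacube], Datacube],
) -> None:
    """Pull a datacube, apply an arbitrary computation, push the result."""
    cube = store.pull(input_id)      # pull
    result = modify(cube)            # modify
    store.push(output_id, result)    # push


store = InMemoryStore()
store.push("rainfall_cm", {(45.5, -122.7): 2.54})

# Convert centimeters to inches and store the result under a new identifier.
run_transform(
    store,
    "rainfall_cm",
    "rainfall_in",
    lambda cube: {coord: cm / 2.54 for coord, cm in cube.items()},
)
```

Writing the result under a new identifier rather than overwriting the input leaves the original data available to other transforms or consumers, which matches the idea that a transform's output is stored for potential further fetching.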