.. _pipelines:

# Pipelines

A *pipeline* is a series of data-processing steps in RiskScape.
RiskScape uses pipelines to transform rows of data, or *tuples*, as they flow through the system.

All RiskScape models are implemented using an underlying pipeline.
However, advanced users can define their own models directly in pipeline code.
This lets you interact with RiskScape's data-processing at a much lower level.

.. tip::
    For a how-to guide on writing pipelines, see :ref:`pipelines_tutorial`.

## Steps

A pipeline is made up of one or more steps. A step is a component that will
process *tuples*. For example, the `input` step will output *tuples* from a
data source, whilst the `filter` step will remove *tuples* that do not match
the filter expression.

Steps get *chained* together, so that the output from one step feeds into the input for another step.

Most steps have parameters that are used to alter how the step processes
*tuples*.
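
As a minimal sketch of chaining and parameters, the two steps below reuse the file and attribute names from the example later on this page, so treat `assets.csv` and `asset.construction` as placeholders for your own data. The `input` step outputs *tuples* from the data source, and the `->` operator (covered under [Connecting steps](#connecting-steps)) feeds them into a `filter` step. The `filter` parameter holds the expression that *tuples* must match to be kept.

```text
  input('assets.csv', name: 'asset') -> filter(filter: asset.construction = 'timber')
```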

RiskScape has many built-in pipeline steps. To list the available steps, or to inspect the
parameters of a particular step, use the following commands:

```shell
  riskscape pipeline step list
  riskscape pipeline step info STEP_NAME
```

## Defining pipelines

Pipelines can be defined in the :ref:`project <projects>` file as a pipeline model.

The following sections use a simple pipeline example that:

- takes some assets
- filters them to remove assets that are not constructed from timber
- joins the assets to their region

How the definition works is explained under [Defining steps](#defining-steps) and [Connecting steps](#connecting-steps).

### Pipeline file

The pipeline can be defined in a separate text file, e.g. `my_pipeline.txt`:

```text
  input('regions.shp', name: 'region') as regions
  input('assets.csv', name: 'asset') -> filter(filter: asset.construction = 'timber')
  join(on: region.name = asset.region) as with_region
  filter -> with_region.lhs
  regions -> with_region.rhs
```

Then add a pipeline model entry to the project INI file, with a `location` that points
to that file:

```ini
[model my_pipeline]
framework = pipeline
description = demonstrates how to write a pipeline

location = my_pipeline.txt
```

### In project INI

Pipeline models can also be defined in the project INI file itself with a `pipeline`
entry. For example:

```ini
[model my_pipeline]
framework = pipeline
description = demonstrates how to write a pipeline

pipeline = \
  input('regions.shp', name: 'region') as regions \
  input('assets.csv', name: 'asset') -> filter(filter: asset.construction = 'timber') \
  join(on: region.name = asset.region) as with_region \
  filter -> with_region.lhs \
  regions -> with_region.rhs
```

.. note::

  Pipelines defined in an INI file need an ``\`` at the end of every line, *except*
  for the last line.
  This is because the INI file format expects entries to only span one line. The ``\``
  is a line continuation that makes the following line part of the current entry.

### Defining steps

Steps are defined using a syntax similar to a function call:

```text
  step_id([optional parameter, ...])[as optional_name]
```

The `step_id` specifies the _type_ of step to use.
It must correspond to one of the IDs listed in `riskscape pipeline step list`.
The `optional_name` is used to uniquely identify the step in the pipeline.

Some examples of defining the _same_ pipeline step:

```text
  input('my_bookmark')
```

With a step name:

```text
  input('my_bookmark') as input_assets
```

With parameter keywords:

```text
  input(relation: 'my_bookmark')
```

If you do not specify a step name, RiskScape will assign a default identifier to each step.
The default name is simply the `step_id`, e.g. `input`.
To ensure the name is unique, RiskScape may also append a number that reflects the order the step
appeared in the pipeline, e.g. `input_2`, `input_3`, etc.

.. tip::
    You only need to assign a step name if you want to *reference* your step from elsewhere in the pipeline.
    For example, a ``join`` step requires *two* inputs, so at least one of them will be a reference to another step.


### Connecting steps

Steps are connected so that the output of one step is passed to the input of the
next.
A sequence of connected steps is called a pipeline _chain_.

Connecting steps is done with the `->` operator. For example:

```text
source -> destination
```

The `source` and `destination` in this example must both be unique pipeline steps, but they can be either:

- a step definition e.g. `input() as my_input`
- the name of a previously defined step e.g. `my_input`

The exception is a destination step that has more than one input, such as a `join` step. In that
case the destination must be written as `<step-name>.<input-name>`, e.g. `with_region.lhs`. Input names are listed in the
`riskscape pipeline step list` output for steps that require them.

The example pipeline in [Defining pipelines](#defining-pipelines) above uses both forms, as well as the named inputs `with_region.lhs` and `with_region.rhs`.
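
The following sketch shows these forms side by side. It reuses the file names and the `lhs`/`rhs` input names from the example pipeline in [Defining pipelines](#defining-pipelines); the step name `assets` is purely illustrative, and the snippet has not been checked against a particular RiskScape version. The first connection uses a step definition as its source, and the second uses the name of a previously defined step. Both destinations use the `<step-name>.<input-name>` form because the `join` step has more than one input.

```text
  join(on: region.name = asset.region) as with_region
  input('regions.shp', name: 'region') -> with_region.rhs
  input('assets.csv', name: 'asset') as assets
  assets -> with_region.lhs
```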

.. tip::
    When you *reference* a previously defined step, make sure you have explicitly named the step.
    RiskScape will implicitly assign a unique name to each unnamed step, e.g. ``select({*})``
    might be named ``select_2``. However, these implicit names can change as you edit your pipeline,
    so it is not recommended to reference an implicit step name.
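
For instance, the example pipeline in [Defining pipelines](#defining-pipelines) refers to its unnamed `filter` step by the implicit name `filter`. A sketch of the same pipeline with that step named explicitly might look like the following, where `timber_assets` is an illustrative name chosen here, not one RiskScape requires:

```text
  input('regions.shp', name: 'region') as regions
  input('assets.csv', name: 'asset') -> filter(filter: asset.construction = 'timber') as timber_assets
  join(on: region.name = asset.region) as with_region
  timber_assets -> with_region.lhs
  regions -> with_region.rhs
```

With an explicit name, the reference to `timber_assets` should keep working if other `filter` steps are later added to or removed from the pipeline.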

## Advanced pipeline features

This page has covered the basics of defining pipelines.
You may be interested in more advanced topics, such as :ref:`parameterized_pipelines`.