.. _python-step:

# Python step

.. note::
  This page is for advanced users. If you are just looking for a simple way to
  process your data with Python, we recommend using :ref:`python-functions`.

.. note::
  To use the `python` step, you need to have the :ref:`beta-plugin` enabled
  and have configured RiskScape to use :ref:`cpython-impl`.

As well as supporting Python functions within RiskScape expressions, RiskScape can also pass
an entire dataset to a single CPython function.  This lets you use libraries such as Pandas
and NumPy across the whole dataset, rather than one row at a time.

Consider the case where you want to use NumPy to compute some loss statistics from an event
loss table in RiskScape.

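```python
import numpy as np

# some kind of pandas/numpy concoction
def compute_aal(losses):
    # Placeholder only: treats the AAL as the mean loss per event.
    # Replace this with your own calculation.
    return float(np.mean(losses))
```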

To integrate this code into your pipeline, you could add the following step:

```
event_loss
->
python(
    script: 'compute-aal.py',
    result-type: 'struct(aal: floating, peril: text)'
)
```

Then add the following to your Python script (`compute-aal.py`):

```python
import pandas as pd


def function(rows):
    # 1. construct a dataframe from all the rows
    df = pd.DataFrame(rows)

    # 2. call your aal function (from the first example)
    aal_eq = compute_aal(df['eq_loss'])

    # 3. return a result to riskscape
    yield {'aal': aal_eq, 'peril': 'earthquake'}
```

This example sends all the tuples from the `event_loss` step in your pipeline to
the `compute-aal.py` script.  The script first loads all these rows into a DataFrame,
then passes the `eq_loss` column of that DataFrame to your existing AAL function.  Finally,
the function yields the result as a dictionary, so that RiskScape can convert it back into a tuple.

This feature is not limited to returning a single result.  The example can be adapted to return
multiple rows back to RiskScape:

```python
    # 4. call your aal function (from the first example)
    aal_flood = compute_aal(df['eq_flood'])

    # 5. return a second result to riskscape
    yield {'aal': aal_flood, 'peril': 'fluvial_flooding'}
```

The script only finishes once the function has run past its final `yield`.


## Generator functions

RiskScape makes use of a feature of the Python language called
[generator functions](https://wiki.python.org/moin/Generators) to support whole-dataset processing. Tuples arrive in
your function through a generator, and rows are sent back to RiskScape in the same way. For the most part, you
don't need to know much about how these work, as long as you remember to send rows back to RiskScape using the `yield`
keyword instead of `return`.
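
To make the mechanics concrete, here is a minimal plain-Python sketch of what happens around your function. The `fake_rows` generator and the `loss` attribute are hypothetical stand-ins for whatever RiskScape would send; no RiskScape API is used.

```python
def fake_rows():
    # Hypothetical stand-in for the tuples RiskScape sends to the step
    for loss in (10.0, 20.0, 30.0):
        yield {'loss': loss}


def function(rows):
    # `rows` is consumed lazily, one item at a time
    total = 0.0
    for row in rows:
        total += row['loss']

    # `yield` (not `return`) hands a row back to the caller
    yield {'total_loss': total}


# Calling the function only creates a generator; the body runs when it is iterated
results = list(function(fake_rows()))
print(results)
```

Running a script this way, with a handful of hand-written rows, can be a convenient way to check your logic before wiring it into a pipeline.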

## Outputs

In addition to being able to process your data, CPython also has robust tools
for displaying it. For example, `matplotlib` allows you to easily make
plots and figures. RiskScape provides a special function to code running in the
`python` step: `model_output(file_name)`. Calling this function with a file name
registers that file with RiskScape. RiskScape then knows that it is an output
file, and will move it to the output directory along with your other outputs
when the model completes.

You do not need to add anything to your pipeline file in order to register 
additional outputs. In fact, if your Python file only registers outputs and
does not yield any rows, you can omit the result type from the `python` step 
definition.

For a worked example of using the `python` step to produce a PDF report, see :ref:`python-outputs`.
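
As a rough sketch of how registering an output might look, the script below writes a small text summary and registers it. It assumes that `model_output` is already available as a plain name inside the script when it runs in the `python` step; the `try`/`except` stand-in, the `loss` attribute, and the `loss-summary.txt` file name are all hypothetical and only there so the sketch can run outside RiskScape.

```python
# Stand-in so the sketch also runs outside RiskScape. Inside the `python`
# step, the real `model_output` is assumed to be available already.
try:
    model_output
except NameError:
    registered = []

    def model_output(file_name):
        registered.append(file_name)


def function(rows):
    # Hypothetical `loss` attribute on each incoming row
    losses = [row['loss'] for row in rows]

    # Write a small plain-text summary in the current working directory
    file_name = 'loss-summary.txt'
    with open(file_name, 'w') as f:
        f.write(f'rows: {len(losses)}\n')
        f.write(f'total loss: {sum(losses)}\n')

    # Tell RiskScape that this file is a model output
    model_output(file_name)

    # Also yield a row so the step has a result to pass on
    yield {'total_loss': sum(losses)}
```

Because this sketch yields a row, the step would still need a matching `result-type`.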

### Sub-directories

RiskScape will move all registered outputs into a flat output directory. If you
register outputs that have the same name, but are in different directories 
(e.g. `flood/map.png` and `landslide/map.png`) only the first can be moved to the output
directory (e.g. `output/map.png`), and the other output will be discarded.

We recommend writing all your Python outputs to a single directory, giving each file a unique name
(e.g. `flood-map.png` and `landslide-map.png`).
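
If your existing code already builds nested relative paths, one way to derive unique flat names is to fold the directory parts into the file name. The `flat_name` helper below is a hypothetical illustration using only the standard library; it is not part of RiskScape.

```python
from pathlib import PurePosixPath


def flat_name(path):
    # Join the directory parts and the file name with '-' so that each
    # output gets a distinct name in a flat directory
    return '-'.join(PurePosixPath(path).parts)


print(flat_name('flood/map.png'))
print(flat_name('landslide/map.png'))
```

This sketch is only intended for relative paths with forward slashes.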

## More examples

### Batch-processing

This example shows how computation can be batched up, which can be beneficial when using
advanced features like GPU offloading.

```python
import itertools

import pandas as pd

BATCH_SIZE = 100


def function(rows):
    # use python stdlib itertools to batch the rows coming in so we
    # can operate on them en masse (itertools.batched needs Python 3.12+)
    for batch in itertools.batched(rows, BATCH_SIZE):
        df = pd.DataFrame(list(batch))

        # call the function that benefits from running across many rows at once
        # (reticulate_splines is a placeholder for your own function)
        df = reticulate_splines(df)

        # return each row of the dataframe back to RiskScape as a dictionary
        for new_row in df.to_dict('records'):
            yield new_row
```
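
`itertools.batched` was added in Python 3.12. If your CPython installation is older, a small helper built on `itertools.islice` can do the same job. The following is a sketch of one such helper, not something RiskScape provides:

```python
import itertools


def batched(iterable, size):
    # Yield successive tuples of up to `size` items until the input runs out
    iterator = iter(iterable)
    while True:
        batch = tuple(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))
```

Like `itertools.batched`, this helper only pulls `size` items at a time from the incoming generator, so the whole dataset is never held in memory at once.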

### Row-at-a-time

This example shows how you can use the `python` step more like a traditional CPython function in
RiskScape.  Assume you already have a script that has `compute_damage` and `compute_loss`
functions:

```python
def function(rows):
    for row in rows:
        dr = compute_damage(row)
        loss = compute_loss(row, dr)

        # Return a row back to RiskScape for each row we are given
        yield {'dr': dr, 'loss': loss}
```

Note that, unlike a standard RiskScape function called from a `select` step, the `python` step only
returns the attributes that your function yields.  The input attributes are not carried through automatically.
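
If you do want to keep the input attributes, one option is to copy them into the dictionary you yield. The sketch below assumes each incoming row behaves like a Python dictionary; `compute_damage`, `depth`, and `id` are hypothetical placeholders.

```python
def compute_damage(row):
    # Hypothetical placeholder for your own damage function
    return min(1.0, row['depth'] / 2.0)


def function(rows):
    for row in rows:
        dr = compute_damage(row)

        # Copy every input attribute, then add the new one
        yield {**row, 'dr': dr}


print(list(function(iter([{'id': 1, 'depth': 1.0}]))))
```

With this approach, the step's `result-type` would presumably need to declare the carried-over attributes as well as the new ones.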