# Geoprocessing your input data

## Before we start

This tutorial is aimed at users who want to use RiskScape to model damage to larger elements-at-risk, such as roads.
We expect that you:

- Have completed the :ref:`intermediate_models` tutorial.
- Have a basic understanding of geospatial data and risk analysis.
- Have a GIS application (e.g. QGIS) installed that you can use to view shapefiles.

The aim of this tutorial is to explore the ways in which RiskScape can modify the geometry in your input data on the fly,
so that you can build accurate loss models.

## Getting started

### Setup

Click [here](../geoprocessing-tutorial.zip) to download the example project we will use in this guide.
Unzip the file into a new directory, e.g. `C:\RiskScape_Projects\`.

Open a command prompt and `cd` to the directory where you unzipped the files, e.g.

```none
cd C:\RiskScape_Projects\geoprocessing-tutorial
```

You will use this command prompt to run the RiskScape commands in this tutorial.

The unzipped `geoprocessing-tutorial\data` sub-directory contains the input data files we will use in this tutorial.
This data is *similar* to the :ref:`Upolu tsunami data <Upolu_data>` that we used in the previous tutorials,
except that our exposure-layer is now a dataset of _roads_ rather than _buildings_.

.. note::
    This input data was provided by `NIWA <https://niwa.co.nz>`_ and is based on
    `Open Street Map <https://www.openstreetmap.org>`_ data for south-eastern Upolu.
    The regional map of Samoa was sourced from the `PCRAFI <http://pcrafi.spc.int/>`_
    (Pacific Risk Information System) website.
    The data files have been adapted slightly for this tutorial.

The `geoprocessing-tutorial` directory also contains a `project.ini` file that
contains some pre-existing models we will use to demonstrate the various geoprocessing techniques.

## Background

So far we have been using a building dataset for all our example models.
Instead of modelling damage to _buildings_, if we were to model damage to _roads_ it introduces some additional complexities:
- The element-at-risk is much larger. For example, a road may be many kilometres long
and so some parts of the road may be completely unexposed to the hazard while others are severely damaged.
It may be useful to **cut** the road into smaller segments and model the damage to each segment individually.
- Roading data is typically stored as line-string geometry, whereas in real-life the road may be several metres wide.
You may want to **enlarge** the roads so they more realistically cover the area that will be exposed to the hazard.
- Working with larger, more complicated geometry is more computationally expensive and therefore your model may be slower to run.
When you have a large dataset you can spatially **filter** the input data so that only elements-at-risk in the area of interest are run through the model.

RiskScape's geoprocessing features allow us to modify the input data _before_ it gets used in our model.
We can cut long roads into 10-metre segments, enlarge them by a designated width so they become polygons,
or filter our input data by the bounds of the hazard or a region of interest.

So far we have skipped over RiskScape's geoprocessing functionality, but this tutorial will now look at the
optional geoprocessing phase of the model workflow (shown below, click to enlarge the image) in more detail.

.. image:: ../diagrams/model_framework.svg
    :target: ../_images/model_framework.svg
    :alt: Flowchart showing the different processing phases of the RiskScape model workflow

.. note::
    This tutorial will focus on roads, but the same concepts apply to any large elements-at-risk, such as
    farmland, water pipes, or transmission lines.

.. tip::
    We will focus on geoprocessing the exposure-layer, but RiskScape supports geoprocessing any *vector* layer (i.e. any shapefile data).
    For example, if your hazard-layer contained point data, you could buffer the points to turn them into larger polygon circles.

## Cutting

How you handle large elements-at-risk, such as roads, depends somewhat on what you are trying to model.

Let's start off with a simple model that tries to work out how much roading is exposed to the tsunami.
Our input data already has a length (in metres) for each road segment, so we can try summing the lengths of all exposed road segments.

Run the following command in your command prompt:

```none
riskscape model run simple-exposure
```

.. note::
    You may notice that the models in this tutorial are slightly slower to run.
    This is because the road geometry is more complex than the simple building data we have been using so far.
    When the input data involves more complex geometry, or the hazard grid is smaller resolution,
    then the model's processing will be more computationally expensive.

Running this model should produce an `event-impact.csv` results file that looks like the following:

```none
Region,Exposed_segments,Exposed_km
Aleipata Itupa i Lalo,24,24.287
Aleipata Itupa i Luga,30,19.661
Falealili,67,42.263
Lepa,10,7.69
Lotofaga,41,16.58
```

The problem with this approach is that it overestimates how much road is exposed to the tsunami.

The road segments can be very long, but only a small area of road might be exposed to the tsunami.
So counting the _whole_ length of the road will over-represent the true clean-up or repair cost.

We will now look at a few different ways we can make our RiskScape model more accurate.

### Cutting by distance

The simplest approach to deal with large elements-at-risk is to cut them into smaller segments.
For example, instead of trying to model the damage to a 5km stretch of road, you could cut the road into smaller
10m segments and then model the damage to each segment individually.

RiskScape cuts geometry using the following approach:
- Line-string input geometry, such as roads, will be cut into segments of the specified length (or less, for the last segment).
- Polygon input geometry will be cut using a square grid. Each side of the grid will have the specified length.
So cutting a polygon by a distance of 100m will result in 100m x 100m segments (10000m²) or smaller.

.. tip::
    Line-strings can also be cut by a grid, rather than by length, but this is currently a *pipeline* feature.
    This lets you cut line-strings into pieces that exactly match the hazard grid.

For example, the following diagram shows how a block of farmland (grey outline) would be cut into segments
that better correspond to the underlying hazard layer.

.. image:: ../diagrams/segment_by_grid.svg
    :target: ../_images/segment_by_grid.svg
    :width: 75%
    :alt: Example of cutting farmland into smaller segments

In this case we are dealing with road data, so we will cut them by length rather than grid.
The project contains a `cut-by-distance` model, which will cut our roads into 50m segments,
which is the same size as the hazard grid resolution.

Run the following command:

```none
riskscape model run cut-by-distance
```

This should produce an `event-impact.csv` results file that looks like this:

```none
Region,Exposed_segments,Exposed_km
Aleipata Itupa i Lalo,200,9.558
Aleipata Itupa i Luga,237,11.301
Falealili,347,15.936
Lepa,116,5.588
Lotofaga,108,4.138
```

You can see that the `Exposed_segments` values are much larger, because they are now dealing with smaller 50m segments.
Whereas the `Exposed_km` values are much smaller because large parts of the original roads were not exposed to the hazard.

This model also saves the raw results after the geoprocessing has cut the exposure-layer, as a `exposure-geoprocessed.shp` output file.
This lets you see exactly how RiskScape has modified your exposure-layer.

Open the `exposure-geoprocessed.shp` in QGIS, or your preferred GIS application.
Use the 'Identify Features' tool (`Ctrl+Shift+I`) in QGIS and
click on the road segments. You should see that each line-string is now no more than 50m long (i.e. the `length` attribute).

If you open the attribute table in QGIS, you will see that RiskScape has assigned each element-at-risk a `SegmentID` attribute.
The `SegmentID` gets incremented for each segment cut from the same original line-string.

### Recomputing attributes

When you cut an element-at-risk into smaller segments, you may also want to update its attributes to reflect its change in size.
For example, your input data may contain attributes such as `area`, `length`, or `repair_cost` that will no longer be valid for the smaller segments.

The RiskScape wizard gives you two options after cutting your input data:
- _Scale attributes_: RiskScape can automatically adjust attributes based on the change in size.
This can be useful for attributes such as repair cost.
For example if a 1km road is cut into 10 segments, then RiskScape can adjust the `repair_cost` for each segment so that it is one-tenth the original value.
- _Recompute attributes_: For attributes such as `area` or `length`, it can be simpler just to recompute a new value for the attribute.
For example, the `cut-by-distance` model replaced the input road's `length` attribute by measuring the new length of the cut segment.

.. tip::
    You can use the *expression* ``measure(exposure)`` to recompute the length of line-strings in your exposure-layer, or the area for polygon data.
    We will learn more about RiskScape expressions soon.

### Alternatives to cutting

Instead of cutting the input data, you can use the _aggregate by consequence_ functionality
we covered in the :ref:`intermediate_models` tutorial. It depends somewhat on what you want to model.

The _all-intersections_ spatial sampling essentially already cuts the element-at-risk into segments that intersect with the hazard-layer.
It also calculates an `exposed_ratio` attribute, which is the amount that each element-at-risk was exposed to the hazard.

This `exposed_ratio` attribute is typically used to scale the loss.
However, we can also use the `exposed_ratio` here to calculate the total length of road exposed to the tsunami, i.e. `exposed_ratio * exposure.length`.

Try running the `exposed-length` model using the following command:

```none
riskscape model run exposed-length
```

This should produces a `raw-analysis.shp` output file, which contains the intermediate results.
This shapefile contains the exposure-layer geometry segmented by where it intersects the hazard-layer.

.. note::
    Although the hazard-layer effectively has a 50m resolution, the pixel size in the GeoTIFF is actually 5m.
    So the roads are cut using a 5m grid here, rather than a 50m grid.

The command should also produce an `event-impact.csv` results file that looks like this:

```none
Region,Exposed_segments,Exposed_km
Aleipata Itupa i Lalo,24,10.661
Aleipata Itupa i Luga,30,12.333
Falealili,67,14.088
Lepa,10,0.902
Lotofaga,41,5.489
```

Notice that the `Exposed_km` values are similar to the `cut-by-distance` model, _except_ for the Lepa region which is quite different (0.9km vs 5.6km).
This difference is because the Lepa road segments are large and are actually centred in the neighbouring regions, such as Lotofaga.
The `cut-by-distance` cuts these roads into smaller segments which get assigned to the Lepa region.
Whereas the `exposed-length` model matches the whole original road against the area-layer.

RiskScape has one more cutting feature that can help in this case.

### Cutting by layer

Instead of cutting by distance, you can also cut your input data against geometry in another layer.
For example, we can cut our roads by the regional boundaries in our area-layer.

The following diagram shows a simple example of how this works.

.. image:: ../diagrams/cut_layer.svg
    :target: ../_images/cut_layer.svg
    :width: 75%
    :alt: Example of cutting roads by regional boundary

The `cut-by-layer` model works the same as the `exposed-length` model, except it cuts the roads along the area-layer boundaries _before_ doing any other processing.
Try running the model using the following command:

```none
riskscape model run cut-by-layer
```

This should produce an `event-impact.csv` results file that looks like this:

```none
Region,Exposed_segments,Exposed_km
Aleipata Itupa i Lalo,25,9.216
Aleipata Itupa i Luga,37,10.614
Falealili,68,14.799
Lepa,12,5.225
Lotofaga,42,3.62
```

The `Exposed_km` value for Lepa now looks more reasonable because the original roads were cut along the regional boundaries.

A `exposure-geoprocessed.shp` output file should also have been produced.
Open this file in QGIS to see how RiskScape cut the input geometry.

.. tip::
    You can optionally copy attributes from the cut-layer into your exposure-layer, such as the ``Region`` name.
    This can avoid having to specify a separate area-layer input, as it will typically be the same as the cut-layer.

## Filtering

As you may have noticed, RiskScape models tend to run slower as they do more complex geometry processing.
We can speed models up by using _filtering_ to throw away input data that we are not interested in.

Filtering avoids unnecessary processing.
For example, cutting can be computationally expensive.
Therefore, cutting up input data that we will not use in our results can waste a lot of time.

There are three different ways you can filter your input data:
- by expression
- by a layer's bounds
- by a feature in another layer, e.g. a region

### Filter by expression

This approach checks whether an attribute in the input data satisfies a given condition before the data will be used in the model.

The following diagram is a simple illustration of filtering, where the squares represent input data attributes and the colours represent different values.

.. image:: ../diagrams/attr_filter.svg
    :target: ../_images/attr_filter.svg
    :width: 60%
    :alt: Rows of data are only included if a given attribute value is present.

For example, if you were only interested in what bridges were exposed to the tsunami, then you could use the filter: `exposure.bridge = 'T'`.

.. tip::
    The wizard will help you build these filter expressions interactively.
    This filtering is similar to the 'Marine area' :ref:`bookmark filter <bookmark_filter>` in the project tutorial.

### Filter by layer bounds

We will usually not be interested in elements-at-risk that fall _outside_ the bounds of the hazard-layer,
as we know they will not be impacted by the hazard at all.
We can spatially filter our exposure-layer by the bounds of the hazard-layer to avoid unnecessarily processing these elements-at-risk.

The following diagram is a simple illustration of filtering out roads based on the bounds of flood inundation.

.. image:: ../diagrams/filter_bounds.svg
    :target: ../_images/filter_bounds.svg
    :width: 75%
    :alt: Roads are only included in the model if they overlap the bounds of the hazard

This filtering approach is useful when the exposure-layer dataset is wide-spread, but hazard is localized.

The exposure-layer data in this project has _already_ been filtered by the bounds of the hazard to speed up running models.
We know the tsunami hazard only impacts south-eastern Upolu, so cutting roads located on other islands of Samoa,
such as Savai'i, would be unnecessary extra processing.

.. note::
    Filtering by layer bounds can currently only be used when both input layers are in the same CRS.

### Filter by region

Sometimes we might only be interested in the impact to a specific region.

The wizard reporting options let us filter the _output_ results by region.
However, this still means that a lot of unnecessary processing gets done for the other regions that we are ignoring.

RiskScape lets you filter your _input_ data by a region of interest.
For example, the following diagram is a simple illustration of filtering farmland based on a region of interest.

.. image:: ../diagrams/filter_regions.svg
    :target: ../_images/filter_regions.svg
    :width: 75%
    :alt: Only the farmland that falls within the region of interest gets included in the model

The `filter-by-region` model filters the input data so that only roads for the Falealili region are included in the model.
Try running this model using the following command:

```none
riskscape model run filter-by-region
```

This should produce a `exposure-geoprocessed.shp` output file, which shows the roads that will actually be used in the model.
Open `exposure-geoprocessed.shp` in QGIS and take a look at it.

The command should also produce an `event-impact.csv` results file that looks like this:

```none
Region,Exposed_segments,Exposed_km
Falealili,67,14.088
Lotofaga,1,1.328
```

.. note::
    This filtering includes any input data that *intersects* the region of interest.
    Whereas matching against the area-layer uses the *centroid* of the road.
    So here the results include one road that is centred in the Lotofaga region,
    but still intersects the Falealili region.

As well as the Lotofaga road, there will also be other roads that extend into neighbouring regions.
Let's look at another example where we include *only* the roads that fall within the bounds of a given region.

### Including only a region

Let's say we want to only include the roads within the Falealili region in our model,
and exclude any roading that falls outside that region.

To do this, we could combine several different approaches we have already learnt about:
- Filter the input data by the Falealili region.
- Cut by area-layer. This will cut any roads that 'straddle' regions into two.
- Filter results by the Falealili region again. This will remove any segments that fall into neighbouring regions.

However, we will now look at the `include-only-region` model which uses a slightly different approach.
The cut-by-layer functionality can behave like the [GIS 'clip' operation](http://wiki.gis.com/wiki/index.php/Clip),
in that RiskScape can throw away any geometry that falls outside the polygon(s) in the cut-layer.

For this to do work, we want the layer we are cutting against to *only* include the Falealili region polygon.
We can add a simple bookmark with a `filter` statement that does this:

```ini
[bookmark Only_Falealili]
location = data/Samoa_constituencies.shp
filter = Region = 'Falealili'
```

When we cut against this `Only_Falealili` layer, it will cut all the roads at the Falealili boundary and throw away
any parts of the original road that fall outside that region.

Try running this model using the following command:

```none
riskscape model run include-only-region
```

This should produce an `event-impact.csv` results file that looks like this:

```none
Region,Exposed_segments,Exposed_km
Falealili,68,14.799
```

This will also produce a `exposure-geoprocessed.shp` output file, which shows the roads after they have been 'clipped'.
Open `exposure-geoprocessed.shp` in QGIS and see how it differs compared to the `filter-by-region` shapefile.

## Enlarging

The final geoprocessing feature we will look at is _enlarging_, or [buffering](http://wiki.gis.com/wiki/index.php/Buffer_(GIS%29).
Buffering lets you enlarge polygons, or turn points or line-strings into polygons.
The following diagram shows road line-strings that are buffered and turned into polygons.

.. image:: ../diagrams/buffer.svg
    :target: ../_images/buffer.svg
    :width: 75%
    :alt: Shows road line-strings being turned into polygons that better represent the width of the road

Buffering can be useful when working with roading data.
The road geometry typically only follows the centre-line, whereas the real-life road might be several metres wide.
Enlarging the road line-string means that you detect hazards that intersect the road, but do not intersect the centre-line.

When you buffer shapes you have choices about what the enlarged shape will look like.
For line-strings and polygons you can choose whether the corners should be rounded, bevelled, or mitred.
The following diagram shows the difference between the approaches:

.. image:: ../diagrams/buffer_corners.png
    :target: ../_images/buffer_corners.png
    :width: 75%
    :alt: Shows rounded, bevelled, and mitred corners when a line-string is buffered

For line-strings, you can also choose whether the end of the line should be rounded, flat, or square.
The following diagram shows the difference between the approaches.

.. image:: ../diagrams/buffer_ends.png
    :target: ../_images/buffer_ends.png
    :width: 75%
    :alt: Shows rounded, flat, and square line-ends when a line-string is buffered

.. tip::
    When enlarging roads, we recommend using *mitred* corners and *flat* line-endings.
    This keeps the original shape of the line-string, but the road does not become any longer.

### Buffering roads

Try running the `buffer-roads` model using the following command:

```none
riskscape model run buffer-roads
```

This will produce a `exposure-geoprocessed.shp` output file, which shows what the roads look like after they have been buffered.

.. note::
    This example data deals with a mixture of main highways, single-lane roads, and footpaths.
    So we have enlarged the road by a variable amount based on the type of road (its ``fclass`` attribute).
    You may simply choose to enlarge all roads by a constant, fixed distance.

In this model we calculate a `width_metres` attribute in the bookmark, based on the number of lanes that the road has.
We then use half this value to buffer the road, i.e. `exposure.width_metres / 2`.

.. tip::
    The line-string will be enlarged by the given distance in *each* direction.
    So the buffer distance should be *half* the total width of the road.

The model should also produce an `event-impact.csv` output file that looks like this:

```none
Region,Exposed_segments,Exposed_km,Exposed_m2
Aleipata Itupa i Lalo,25,24.427,30515
Aleipata Itupa i Luga,30,19.661,32572
Falealili,67,42.263,36775
Lepa,10,7.69,2277
Lotofaga,41,16.58,11937
```

We have not cut the roads at all, and the buffered roads are now *polygons* rather than line-strings.
So the `Exposed_km` is based on the original road-length, similar to the first `simple-exposure` model.

The hazard-layer here is a coarse 50-metre grid, and so buffering the road by a few metres has only detected one extra road as being exposed.
However, the results now shows us how many square metres of road were exposed to the hazard.

### Removing overlaps

In QGIS, if you zoom in closely to the buffered `exposure-geoprocessed.shp` file,
you will notice that the buffered roads now overlap slightly at intersections.

These overlaps mean that the area impacted by the hazard can be counted multiple times at intersections.
In a city grid with a lot of intersections, this double-counting could potentially over-represent the true damage from the hazard.

RiskScape has an option to remove these overlaps by cutting the buffered polygon.
Overlaps are removed on a 'best effort' basis.

Try running the `buffer-no-overlap` model using the following command:

```none
riskscape model run buffer-no-overlap
```

Open the resulting `exposure-geoprocessed.shp` output file in QGIS.
This file looks similar to the previous `buffer-roads` model, except now if you zoom in to look at intersections
you will see that the polygons have been cut so they do not overlap.

If you open the `event-impact.csv` output file, it should look like this:

```none
Region,Exposed_segments,Exposed_km,Exposed_m2
Aleipata Itupa i Lalo,25,24.427,30452
Aleipata Itupa i Luga,30,19.661,32480
Falealili,67,42.263,36640
Lepa,10,7.69,2246
Lotofaga,41,16.58,11915
```

The `Exposed_m2` values are now slightly lower compared with the `buffer-roads` model.
However, as the tsunami impacts mostly coastal roads here, the total difference is only approximately 350 square metres.

## Geoprocessing performance

RiskScape's geoprocessing features let you alter your input geometry on the fly.
This avoids cumbersome and potentially error-prone steps where you modify the data by hand.

However, you may have noticed that geoprocessing can add a performance overhead.
In particular, cutting geometry into very small pieces, such as 1-2 metres, can be very slow
and can greatly increase the memory (RAM) that running your model consumes.

If you want to run the model repeatedly, it can be quicker to save a copy of the geoprocessed data
and use that file as the input to your model. This means RiskScape only needs to cut the data once.

.. tip::
    You can save and run a model that *only* does geoprocessing.
    In the wizard, you can access the ``Ctrl + c``  menu part way through building your model,
    and that lets you save it.

## Expressions

RiskScape uses its own expression language as the 'glue' that holds your model together.
RiskScape expressions are simple statements that are similar to what you would use in a spreadsheet formula.

For example, to produce the `Exposed_segments` attribute in the results, we use the *aggregate* expression `count(exposure) as Exposed_segments`.
The wizard will usually help you build these expressions, but there are limits to what you can do in the wizard.
In order to do more advanced things in RiskScape, it is useful to be able to modify and write expressions yourself.

For example, you can use the wizard to build an expression like `sum(exposure.length) as Exposed_metres`.
But converting this value to *kilometres* instead of metres is a slightly more complicated expression
that you would have to enter manually, e.g. `sum(exposure.length) / 1000 as Exposed_km`.

.. tip::
    In your ``project.ini`` file, you can edit the expressions in your saved model configuration manually.
    Writing your own expressions gives you more flexibility with your model than the wizard provides by default.

We will look at :ref:`expression_tutorial` in more detail in the next tutorial.

## Recap

Let's review some of the key points we have covered so far:

- Geoprocessing lets you modify your input geometry on-the-fly, in a deterministic way.
You can cut large elements-at-risk into smaller segments, buffer them to enlarge the geometry,
or spatially filter the input data to avoid unnecessary processing.
- RiskScape supports geoprocessing any *vector* layer in your model.
Typically geoprocessing is useful for things like roads, water pipes, transmission lines, and farmland.
- Geoprocessing essentially creates a new exposure-layer that your model will use.
Geoprocessing always takes effect *before* any other processing in your model, such as spatial sampling or consequence analysis.
- Cutting your input data lets you deal with large assets on a finer-grain level.
You can either cut by a fixed distance or cut using geometry in another vector layer.
However, cutting by a small distance can greatly increase your model's memory usage and execution time.
- After cutting your exposure-layer, you may need to adjust attributes to reflect the change in size, i.e. `area`, `length`, `repair_cost`, etc.
You can either automatically scale the attributes by the amount they have decreased in size, or recompute the values from scratch.
- Filtering lets you remove input data that you are not interested in from your model.
Excluding input data means your model will run quicker.
This can be useful when you have a large exposure-layer dataset, but a small area of interest.
- Buffering or enlarging will increase the area of each element-at-risk.
This can mean that road line-strings are more realistically spatially matched against the hazard-layer.
Buffering will turn line-strings into polygons.
- To do more complicated things in your models it is useful to be able to modify RiskScape expressions.

## Extra for experts

If you want to explore the concepts we have covered a little further,
try using the `riskscape wizard` command to build your own geoprocessing models.

- _Tip_: Focus on the geoprocessing side of things.
Select the option to 'Save the modified layer after the geoprocessing has taken effect' so you can see what the geoprocessing does.
For simplicity, just use the 'closest' spatial sampling, and filter the results based on `consequence = 1`
(i.e. only include the exposed road segments in the results).
- Try creating models that:
  - Cut the roads by a fixed distance, e.g. 10m.
  - Filter the road input data by the Lepa region.
  - Enlarge all the roads by 3 metres in either direction.
- Try running the `cut-by-distance` model with a different cut-distance, i.e. `-p "input-exposures.cut-distance=5"`.
Try it with 5 metres and then again with 100 metres. Pay attention to how long the model takes to run each time.