How to write advanced pipelines

Before we start

This tutorial is aimed at advanced RiskScape users who want a better understanding of the inner workings of their risk models. We expect that you:

  • Have been through the How to write basic pipelines tutorial already.

  • Are proficient at using RiskScape from the command line, along with other shell commands.

  • Are comfortable working with geospatial data and risk analysis models.

The aim of this tutorial is to build a pipeline for a risk model from scratch. This will hopefully give you a better understanding of the RiskScape pipeline features that lie beneath your risk models, such as models built using the wizard.

Background

Types of layers

There are essentially two types of input data layers that RiskScape uses:

  • Relations, which hold rows or records of data. Relations include CSV and shapefile vector data. The previous pipeline tutorial only covered relation data.

  • Coverages, which are spatial lookup tables. Coverages include grid-based GeoTIFF and ASC raster files. We will look at coverages a little later in the tutorial.

The input pipeline step only works with relation datasets. Subsequent pipeline steps can transform the data, but each step ultimately still produces a relation, which then becomes the input for the next step.

Because pipeline steps primarily deal with relations, you may notice some similarities with SQL. If you are not familiar with SQL, it may help to think of a relation like a CSV file or spreadsheet. A relation holds rows of data, and each row has several columns (attributes).

Each pipeline step essentially takes a relation as input, transforms the data row by row, and passes the new relation on to the next step. Conceptually, it may help to visualize the relation data at any one point in the pipeline as being akin to CSV or spreadsheet data.

Combining layers

The previous pipeline tutorial only used a single dataset as input. But to do any real modelling, you will want to combine multiple datasets together. The most obvious example is finding geospatial matches between your exposure-layer and your hazard-layer.

There are two ways we can combine datasets:

  • Using the join pipeline step to combine two different pipeline chains (relations) into one. This is similar to a ‘join’ database operation.

  • Using a function call, such as sample(), to retrieve any matching geospatial data from a coverage, one row of data at a time. This tutorial will mostly look at the sampling approach.

Setup

Click Here to download the input data we will use in these examples. Unzip the file into a new directory, e.g. C:\RiskScape_Projects\.

Open a command prompt and cd to the directory where you unzipped the files, e.g.

cd C:\RiskScape_Projects\pipeline-advanced

You will use this command prompt to run the RiskScape commands in this tutorial.

The unzipped directory contains an empty file called pipeline.txt. Open this file in a text editor, such as Notepad or gedit. This is where you will paste the pipeline examples.

The project.ini is pretty simple - it loads the pipeline.txt contents as a pipeline model called advanced.

The input data files that we will use in our model are exactly the same ones from the Getting started modelling in RiskScape tutorial.

Note

This input data was provided by NIWA, as well as the PCRAFI (Pacific Risk Information System) website. The data files have been adapted slightly for this tutorial.

Loading the input data

Let’s start off by simply loading the shapefile building data. Add the following to your pipeline.txt file and save it.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        exposure,
        # TODO: we need to work this out still...
        '???' as Region,
        '???' as hazard,
        '???' as consequence
       }) as event_impact_table

Notice that the input step in the pipeline uses the name parameter. This wraps up each row of input data in a Struct instance called exposure. This makes it easier to distinguish our exposure-layer data from our area-layer and hazard-layer data.

Note

When you name the input data Struct, it changes how you access the attributes. For example, in our previous pipeline tutorial we simply used Use_Cat to access the use-category attribute, whereas now we would have to use exposure.Use_Cat.
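For example, a later select step could pick out the use-category via the wrapper Struct. This is just a minimal sketch to illustrate the attribute access:

select({
        exposure.Use_Cat as use_category
       })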

Try running the pipeline with the following command.

riskscape model run advanced

This should output an event_impact_table.shp file. Because our input data contains geometry, RiskScape defaults to saving it as a shapefile.

You can open the shapefile in QGIS, but the results are not very interesting yet. The data is exactly the same as the Buildings_SE_Upolu.shp file, except we have added some extra attributes that all have the value ???.

Spatial sampling

Our pipeline uses the input step to read an input data relation one row at a time. We can also use the bookmark() function to return a handle to the entire relation or coverage, which can then be passed to other functions.

Note

If there is no configured bookmark with the given name, RiskScape will try to find a file with that name instead. We have not set up any bookmarks in this tutorial, so we are only using filenames.

So we could use the following pipeline, which would load the MaxEnv_All_Scenarios_50m.tif hazard file. Because the file is a GeoTIFF, the bookmark() function returns it as a coverage.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
       })

However, this pipeline is still not quite what we need. The last step adds the whole coverage to the results as the hazard_coverage attribute, whereas we want a single hazard intensity measure for each building. To get that intensity measure, we still need to sample the coverage data.

Tip

Remember that * in a pipeline works as a wildcard to select all the attributes that are already present in the pipeline data.

Sampling a coverage

A coverage is a bit like a spatial lookup table, so for any given geometry (e.g. a building), a coverage can return the corresponding data (if any) at that point. We call this geometry-based lookup sampling.

You could think of sampling as being like putting a pin into a map and plucking out the data at the point where it lands.

RiskScape supports several different sample() functions that all take a coverage argument and a geometry value to ‘look up’ in the coverage. We will use sample_centroid() here, which takes the centre of each building’s geometry and uses it as an index into the coverage.

Copy the following into your pipeline.txt file and save it.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
       })
  ->
select({
        exposure,
        hazard,
        # TODO: we need to work this out still...
        '???' as Region,
        '???' as consequence
       }) as event_impact_table

Then re-run the model using the following command:

riskscape model run advanced

Open the event_impact_table.shp results file produced in QGIS. If you open the attribute table, each building should now have a valid hazard value associated with it.

Note

When there is no corresponding hazard value sampled for a building, -9999.0 will be written to the shapefile. RiskScape’s sample() functions will always handle reprojection if the two layers use different CRSs.

Sampling a relation

To sample spatial data, we always need a coverage layer. If we have relation data from a CSV or shapefile, such as our area-layer data, then we can convert it into a coverage layer.

To turn a relation into a coverage, we use the to_coverage() function. As you may recall, we can use the bookmark() function to get the whole relation.

Once we have turned our area-layer into a coverage, we can then use a sampling function, such as sample_centroid(), to match each building to a corresponding region in our area-layer.

Copy the following into your pipeline.txt file and save it.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
       })
  ->
select({
        *,
        to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, area_coverage) as area
       })
  ->
select({
        exposure,
        hazard,
        area.Region,
        # TODO: we need to work this out still...
        '???' as consequence
       }) as event_impact_table

Then re-run the model using the following command:

riskscape model run advanced

Open the event_impact_table.shp results file produced in QGIS. If you open the attribute table, each building should now have a valid Region value associated with it.

Note

A few buildings fall outside the area-layer boundaries, so they will have a NULL value for the Region attribute. As we have seen in previous examples, we could use a buffer-distance when sampling in order to better match buildings to the closest region. To do this, we would need to use the sample_one() function instead of sample_centroid().
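For example, the area sampling step could become something like the following sketch, which uses the named-argument form of sample_one() (the 5000 metre buffer is just an illustrative value):

select({
        *,
        sample_one(geometry: exposure, coverage: area_coverage, buffer-distance: 5000) as area
       })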

Consequence analysis

The most important part of any model is computing a consequence based on the hazard intensity measure. This simply involves calling your user-defined function from a select step. As we have seen, we can call any RiskScape function from a select step, including Python functions that you can write yourself.

Let’s use the simplest possible consequence function - the built-in is_exposed() function. This function accepts Anything as arguments and returns 1 if the hazard argument is not null, and 0 if it is.

Save the following into your pipeline.txt file.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
       })
  ->
select({
        *,
        to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, area_coverage) as area
       })
  ->
select({
        *,
        is_exposed(exposure, hazard) as consequence
       })
  ->
select({
        exposure,
        hazard,
        area.Region,
        consequence
       }) as event_impact_table

Use riskscape model run advanced to re-run the model and check the results. We now have a simple exposure model that we have written completely from scratch.

Note

The is_exposed() function simply checks whether the hazard is null. So an alternative to the is_exposed() function here would be if(is_null(hazard), 0, 1).
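In pipeline form, that alternative would be a select step like this minimal sketch:

select({
        *,
        if(is_null(hazard), 0, 1) as consequence
       })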

Pipeline parameters

As you may recall, the riskscape model run command lets us change the model’s parameters, or assumptions, on the fly. RiskScape lets you define your own parameters for pipeline models.

For example, say that you feel the input-exposures.layer wizard parameter name is too verbose. You might prefer to simply call this parameter buildings in your pipeline model.

First, in the project.ini file you need to define the parameters your model will take, along with default values. The current project.ini already has some parameters defined for you: buildings, hazard, regions. These are the main input layers that the model takes.

[model advanced]
framework = pipeline
location = pipeline.txt
description = An empty pipeline file we will use for our tutorial
param.buildings = 'data/Buildings_SE_Upolu.shp'
param.hazard = 'data/MaxEnv_All_Scenarios_50m.tif'
param.regions = 'data/Samoa_constituencies.shp'

Note

The parameter values need to be in the exact format that the pipeline code expects. For example, the values for input files or bookmarks always need single quotes around the name, i.e. 'data/Buildings_SE_Upolu.shp', not data/Buildings_SE_Upolu.shp.

The next step is to replace the parts of the pipeline code that you want to parameterize. You use $ followed by the parameter name, e.g. $buildings.

For example, try replacing pipeline.txt with the following content, and save the file.

input($buildings, name: 'exposure')
  ->
select({
        *,
        bookmark($hazard) as hazard_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
       })
  ->
select({
        *,
        to_coverage(bookmark($regions)) as area_coverage
       })
  ->
select({
        *,
        sample_centroid(exposure, area_coverage) as area
       })
  ->
select({
        *,
        is_exposed(exposure, hazard) as consequence
       })
  ->
select({
        exposure,
        hazard,
        area.Region,
        consequence
       }) as event_impact_table

Try running the model with a different hazard-layer by using the following command:

riskscape model run advanced -p "hazard='data/MaxEnv_All_Scenarios_10m.tif'"

The results produced will use the 10m hazard grid this time, rather than the 50m grid.

Note

RiskScape always saves a pipeline.txt file in the output directory. This file contains the replaced values for any parameters, so you can always see exactly what was used to produce the model’s results.
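For example, after the parameterized run above, the saved pipeline.txt would contain the substituted hazard value rather than the $hazard placeholder, i.e. a line something like:

bookmark('data/MaxEnv_All_Scenarios_10m.tif') as hazard_coverage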

With pipelines, you can pass your parameter values right through to your Python code, if you need to. See Customizing your own model parameters for an example of this.

Customizing wizard models

Remember that wizard models are simply a framework built around an underlying RiskScape pipeline. There are several ways you can access the pipeline code for a wizard model:

  • Whenever you save a wizard model, it also saves a pipeline version of the model.

  • Whenever you run a wizard model, it generates a pipeline.txt file in the output directory.

  • Whenever you build a wizard model, you can use the ctrl+c menu to view the corresponding pipeline code as you go along.

Tip

Instead of writing a pipeline from scratch, it’s often simpler to build a wizard model and then adjust the generated pipeline to suit your needs.

Wizard models have many different parameters, but not all of the parameters will need to be changed on the fly. The parameter names are also fairly generic, as they need to fit many different problem domains.

Defining your own pipeline parameters can be useful when you want to share models with less experienced RiskScape users. It can reduce complexity, as you can limit the user’s options to one or two clearly-named parameters. It also means users cannot inadvertently change the parts of the model that they shouldn’t, such as the spatial sampling technique.
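For example, a colleague could then run the model against their own building dataset (a hypothetical file path here) by changing a single, clearly-named parameter:

riskscape model run advanced -p "buildings='data/My_Buildings.shp'"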

Pipeline code formatting

The pipeline examples on this page have been formatted in a way that tries to make it clear what the pipeline is doing. However, you have some flexibility in how you organize and format your pipeline code. There are several different ways of achieving the same thing with pipelines.

Pipelines generated by the wizard will typically look more like the following example. This produces exactly the same results as the previous pipeline - the main difference is in how the code is formatted.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({*, bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard})
 -> select({*, sample_centroid(exposure, hazard) as hazard})
 -> select({*, to_coverage(bookmark('data/Samoa_constituencies.shp')) as area})
 -> select({*, sample_centroid(exposure, area) as area})
 -> select({*, is_exposed(exposure, hazard) as consequence})
 -> select({exposure, hazard, area.Region, consequence}) as event_impact_table

Alternatively, we could reorganize the pipeline code to reduce the total number of steps. For example, the following pipeline produces exactly the same result as the previous example, but uses nested function calls so that fewer steps are needed.

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
select({
        *,
        sample_centroid(exposure, bookmark('data/MaxEnv_All_Scenarios_50m.tif')) as hazard,
        sample_centroid(exposure, to_coverage(bookmark('data/Samoa_constituencies.shp'))).Region as Region
       }) as spatial_sampling
  ->
select({
        *,
        is_exposed(exposure, hazard) as consequence
       }) as consequence_analysis

Note

The wizard generates pipeline code so that it is easy to construct, rather than easy to read. RiskScape has to handle many different paths through the wizard, so sometimes it includes pipeline steps that essentially do nothing, e.g. select({*}) as NAMED_STEP.

Joins

In SQL databases, you can join two different tables, or datasets, together. The tables are typically joined on a shared attribute, or key, that the two tables have in common.

RiskScape also supports a join pipeline step that lets you combine two different inputs, or pipeline chains, into one. In RiskScape, each pipeline chain produces a relation, so this is conceptually the same as an SQL join - we are simply joining two relations together.

The join pipeline step goes through the rows of data in each layer looking for rows that match. The match criterion may be an attribute the two rows have in common, or it could be some other true/false conditional expression.

The join step is a little different to other pipeline steps, as it takes two different pipeline chains as input. One pipeline chain will be the left-hand-side (lhs) of the join, and the other will be the right-hand-side (rhs).

When we use the join step we need to explicitly state what side of the join the pipeline chain is connecting to, e.g. step_foo -> join.lhs or step_bar -> join.rhs.

Note

If you omit the lhs or rhs input names, the step will be chained to the left-hand side of the join step by default.

The side of the join does matter, as it can affect how the join behaves. As a rule of thumb, you want the primary (or largest) dataset to be on the left-hand side of the join.

A simple join

Let’s look at a simple example where we want to join additional information from another dataset to our main exposure-layer. We have the following Cons_Frame.csv data that we want to combine with our building data.

Material,DS5_median
Masonry,2.49
Reinforced Concrete,7.3
Reinforced_Concrete,7.3
Timber,1.62
Steel,2.77
,2.77

Warning

The Cons_Frame.csv data is used here to demonstrate RiskScape’s pipeline functionality, rather than as an example of scientific best practice. For example, you could encode these values into your Python function rather than joining datasets. The values here have been taken from Empirical building fragilities from observed damage in the 2009 South Pacific tsunami.

To join this data to our exposure-layer, we need to specify the matching condition we will join on. In this case, the construction framing attribute is common to both datasets (the Cons_Frame attribute in our exposure-layer and Material in Cons_Frame.csv). So we can simply join two rows if the construction material is equal, i.e. exposure.Cons_Frame = Material.

Copy and paste the following into your pipeline.txt file, save it, and then run the pipeline using riskscape model run advanced.

input('data/Cons_Frame.csv')
  ->
join_cons_frame.rhs

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
join(on: exposure.Cons_Frame = Material) as join_cons_frame

This should produce a join_cons_frame.shp shapefile output. Open it and look at the attribute table. You should now see the Material and DS5_median included with the building data.

Notice that the pipeline code now has two input steps, but produces one output file. This is because the join step combines the two input relations into one.

Merging attributes into your exposure-layer

It is a little hard to see in the shapefile output, but there are a few undesirable things about this join still:

  1. The Material attribute is redundant, because it simply duplicates the Cons_Frame attribute.

  2. Because the data comes from a CSV file, the DS5_median is actually text rather than a floating-point number. We could potentially fix this by using a bookmark.

  3. Although the DS5_median is now included in every row of data, it is still separate from the exposure struct. This makes it more awkward to pass the DS5_median to a Python function, for example.

The following pipeline adds an extra step that addresses these issues. Try copying it into your pipeline.txt file, saving it, and running the advanced model again.

input('data/Cons_Frame.csv')
  ->
join_cons_frame.rhs

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
  ->
join(on: exposure.Cons_Frame = Material) as join_cons_frame
  ->
select({
         {
          exposure.*,
          float(DS5_median) as DS5_median
         } as exposure
       })

You will not notice much difference in the select.shp output file it generates. However, the DS5_median attribute is now part of the exposure struct, and so could be accessed via exposure.DS5_median.
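For example, a later pipeline step could now reference the merged attribute like this minimal sketch:

select({
        exposure.Cons_Frame as material,
        exposure.DS5_median as ds5_median
       })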

Wizard joins

If you look closely at the pipeline code that gets generated by the wizard, you will see that it uses join steps. For example, spatially sampling the area-layer would look something like this:

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> join(on: true) as exposures_join_areas
 -> select({*, sample_one(geometry: exposure, coverage: area, buffer-distance: 1000) as area})
input(relation: 'data/Samoa_constituencies.shp', name: 'area') as areas_input
 -> group({to_coverage(area) as area}) as areas
 -> exposures_join_areas.rhs

As you have seen in the previous examples on this page, we managed to spatially sample the area-layer without using any join steps. There are several different approaches to achieving the same result with RiskScape pipelines. The wizard uses join steps because it allows you to do optional geoprocessing on the area-layer data.

The group({to_coverage(area) as area}) turns the area-layer relation into a coverage. We end up with a single row of data, with a single area attribute, which is all the area-layer data in coverage form. The join(on: true) step then adds this area attribute to every row of building data, so that we can then sample the coverage.

Note

Only use join(on: true) when you have exactly one row of data on the right-hand side. Otherwise the rows of data on the left-hand side will get duplicated.

Join complexities

In these examples we have only covered the simple case where there is always a matching row between the left-hand-side and the right-hand-side datasets.

When some rows do not match between the two datasets, you have a choice of what type of join you want:

  • Inner join: drop any non-matching rows (the default).

  • Left-outer join: include any non-matching rows from the left-hand-side.

You could also use the join step to match based on geometry; however, we recommend using a spatial sampling function instead. For example, we could potentially join our area-layer to our exposure-layer using a step like join(on: contains(area, centroid(exposure))). Currently, geometry-based join conditions will only work if both input files are in the same CRS.

Recap

Hopefully you now have a better understanding of the underlying data processing that is going on in a risk model. Here are some key takeaway points we have covered:

  • There are two types of input data layers: relations (e.g. CSV, shapefile) and coverages (e.g. ASC, GeoTIFF).

  • There are two ways to combine geospatial layers in RiskScape: using the join step or sampling from a coverage.

  • The join step has some complexities and limitations, so we recommend using the sampling approach to combine layers, where possible.

  • A sampling function uses a given geometry (e.g. a building) as an index into a spatial coverage file.

  • When you have a relation (CSV or shapefile data), you can still turn it into a coverage and sample it.

  • The consequence analysis is simply a function call from a select step.

  • You can define your own parameters for a pipeline model. This means you can tailor your parameters so your model is simpler for less experienced RiskScape users to use.

  • Wizard models are always backed by pipeline code. It can be simpler to start off customizing a wizard model, rather than writing your own pipeline from scratch.

  • The join step brings together two different pipeline chains, or relation layers. You define the on condition that determines how the rows of data are matched up.

Additional exercises

Congratulations! You have reached the end of the RiskScape tutorials. If you want to test your pipeline abilities further, here are some exercises to try out on your own. Start off with the pipeline we used in the Consequence analysis section.

  • Make sure that every building gets matched to a region. Replace the sample_centroid(exposure, area_coverage) function call with sample_one(exposure, area_coverage). Try adding the buffer-distance function argument and setting it to 5000 metres.

  • Change the consequence to either ‘Exposed’ or ‘Not Exposed’ in the results. Tip: replace the is_exposed() function call with if(). You can use is_null(hazard) to check if a building was not exposed to the hazard. Remember you can use riskscape function info FUNCTION_NAME to find out more about a function.

  • Add a group step so that the results are grouped by consequence. Use count(*) to count the number of buildings in each group (i.e. ‘Exposed’ and ‘Not exposed’).

  • Update the group step so that the results are also broken down by construction type, i.e. group by consequence and Cons_Frame. Tip: to group by multiple things, you use a Struct expression. There was an example of Aggregating by multiple attributes in the previous tutorial.

  • Include the median hazard depth for each group, i.e. median(hazard). For bonus points, report the depth in centimetres rather than metres, and round it to the nearest whole number.

  • Add a sort step so that the ‘Exposed’ results are all grouped together. Tip: you can use the riskscape pipeline step info STEP_NAME command to find out more about a new step.

  • Change the output filename so that the results are saved as regional-exposure.csv.

If you get stuck, there is an answers.txt file in your directory with the solutions.