How to write advanced pipelines
Before we start
This tutorial is aimed at advanced RiskScape users who want a better understanding of the inner workings of their risk models. We expect that you:
Have been through the How to write basic pipelines tutorial already.
Are proficient using RiskScape and other shell commands.
Are comfortable working with geospatial data and risk analysis models.
The aim of this tutorial is to build a pipeline for a risk model from scratch. This will hopefully give you a better understanding of the RiskScape pipeline features that lie beneath your risk models, such as models built using the wizard.
Background
Types of layers
There are essentially two types of input data layers that RiskScape uses:
Relations, which hold rows or records of data. Relations include CSV and shapefile vector data. The previous pipeline tutorial only covered relation data.
Coverages, which are spatial lookup tables. Coverages include grid-based GeoTIFF and ASC raster files. We will look at coverages a little later in the tutorial.
The input pipeline step only works with relation datasets. Subsequent pipeline steps can transform the data, but each step ultimately still produces a relation, which then becomes the input for the next step.
Because pipeline steps primarily deal with relations, you may notice some similarities with SQL. If you are not familiar with SQL, it may help to think of a relation as being like a CSV file or spreadsheet: it holds rows of data, and each row has several columns (attributes). Each pipeline step essentially takes a relation as input, transforms the data row by row, and passes the new relation on to the next step.
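For example, a few rows of our building relation might conceptually look like the following CSV-style sketch. The values shown are made up for illustration, although Use_Cat and Cons_Frame are real attributes in the building data we will use:

Use_Cat,Cons_Frame,geometry
Residential,Timber,POLYGON (...)
Commercial,Masonry,POLYGON (...)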
Combining layers
The previous pipeline tutorial only used a single dataset as input. But to do any real modelling, you will want to combine multiple datasets together. The most obvious example is finding geospatial matches between your exposure-layer and your hazard-layer.
There are two ways we can combine datasets:
Using the join pipeline step to combine two different pipeline chains (relations) into one. This is similar to a ‘join’ database operation.
Using a function call, such as sample(), to retrieve any matching geospatial data from a coverage, one row of data at a time.
This tutorial will mostly look at the sampling approach.
Setup
Click Here to download the input data we will use in these examples.
Unzip the file into a new directory, e.g. C:\RiskScape_Projects\.
Open a command prompt and cd to the directory where you unzipped the files, e.g.
cd C:\RiskScape_Projects\pipeline-advanced
You will use this command prompt to run the RiskScape commands in this tutorial.
The unzipped directory contains an empty file called pipeline.txt. Open this file in a text editor, such as Notepad or gedit. This is where you will paste the pipeline examples.
The project.ini is pretty simple - it loads the pipeline.txt contents as a pipeline model called advanced.
The input data files that we will use in our model are exactly the same ones from the Getting started modelling in RiskScape tutorial.
Loading the input data
Let’s start off by simply loading the shapefile building data.
Add the following to your pipeline.txt file and save it.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
exposure,
# TODO: we need to work this out still...
'???' as Region,
'???' as hazard,
'???' as consequence
}) as event_impact_table
Notice that the input step in the pipeline uses the name parameter. This wraps up each row of input data in a Struct instance called exposure. This makes it easier to distinguish our exposure-layer data from our area-layer and hazard-layer data.
Note
When you name the input data Struct, it changes how you access the attributes. For example, in our previous pipeline tutorial we simply used Use_Cat to access the use-category attribute, whereas now we would have to use exposure.Use_Cat.
Try running the pipeline with the following command.
riskscape model run advanced
This should output an event_impact_table.shp file. Because our input data contains geometry, RiskScape defaults to saving it as a shapefile. You can open the shapefile in QGIS, but the results are not very interesting yet. The data is exactly the same as the Buildings_SE_Upolu.shp file, except we have added some extra attributes that all have the value ???.
Spatial sampling
Our pipeline uses the input step to read an input data relation one row at a time. We can also use the bookmark() function to return a handle to the entire relation or coverage, which can then be passed to other functions.
Note
If there is no configured bookmark with the given name, RiskScape will try to find a file with the given name. We have not set up any bookmarks in this tutorial, so we are only using filenames.
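For reference, a bookmark could be defined in your project.ini along these lines - a minimal sketch, and the bookmark name here is made up:

[bookmark buildings]
location = data/Buildings_SE_Upolu.shp

The pipeline could then use bookmark('buildings') instead of the raw filename.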
So we could use the following pipeline, which would load the MaxEnv_All_Scenarios_50m.tif hazard file. Because the file is a GeoTIFF, the bookmark() function returns it as a coverage.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
*,
bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
})
However, this pipeline is still not quite what we need. The last step adds the whole coverage to the results as the hazard_coverage attribute, whereas we want a single hazard intensity measure for each building. To get a hazard intensity measure, we still need to sample the coverage data.
Tip
Remember that * in a pipeline works as a wildcard to select all the attributes that are already present in the pipeline data.
Sampling a coverage
A coverage is a bit like a spatial lookup table, so for any given geometry (e.g. a building), a coverage can return the corresponding data (if any) at that point. We call this geometry-based lookup sampling. You could think of sampling as being like putting a pin into a map and plucking out the data at the point where it lands.
RiskScape supports several different sample() functions that all take a coverage argument and a given geometry value to ‘lookup’ in the coverage. We will use sample_centroid() here, which takes the centre of each building’s geometry and uses it as an index into the coverage.
Copy the following into your pipeline.txt file and save it.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
*,
bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
})
->
select({
*,
sample_centroid(exposure, hazard_coverage) as hazard
})
->
select({
exposure,
hazard,
# TODO: we need to work this out still...
'???' as Region,
'???' as consequence
}) as event_impact_table
Then re-run the model using the following command:
riskscape model run advanced
Open the event_impact_table.shp results file in QGIS. If you open the attribute table, each building should now have a valid hazard value associated with it.
Note
When there is no corresponding hazard value sampled for a building, -9999.0 will be written to the shapefile. RiskScape’s sample() functions will always handle reprojection if the CRSs of the two layers differ.
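If you would rather drop the unexposed buildings from the results altogether, one option is a filter step after the sampling. This is a sketch only, and assumes the filter step accepts a boolean expression written this way:

-> filter(is_null(hazard) = false)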
Sampling a relation
In order to sample spatial data, we always need to use a coverage layer. If we have relation data from a CSV or shapefile, such as our area-layer data, then we can convert it into a coverage layer. To turn a relation into a coverage, we use the to_coverage() function. To get the whole relation, as you may recall, we can use the bookmark() function. Once we have turned our area-layer into a coverage, we can then use a sampling function, such as sample_centroid(), to match each building to a corresponding region in our area-layer.
Copy the following into your pipeline.txt file and save it.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
*,
bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
})
->
select({
*,
sample_centroid(exposure, hazard_coverage) as hazard
})
->
select({
*,
to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
})
->
select({
*,
sample_centroid(exposure, area_coverage) as area
})
->
select({
exposure,
hazard,
area.Region,
# TODO: we need to work this out still...
'???' as consequence
}) as event_impact_table
Then re-run the model using the following command:
riskscape model run advanced
Open the event_impact_table.shp results file in QGIS. If you open the attribute table, each building should now have a valid Region value associated with it.
Note
A few buildings fall outside the area-layer boundaries, so they will have a NULL value for the Region attribute. As we have seen in previous examples, we could use a buffer-distance when sampling in order to better match buildings to the closest region. To do this, we would need to use the sample_one() function instead of sample_centroid().
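The sampling step might then look something like the following sketch, which borrows the named-argument style from the wizard-generated code shown later on this page (the 5000 metre buffer matches the additional exercises at the end):

-> select({
    *,
    sample_one(geometry: exposure, coverage: area_coverage, buffer-distance: 5000) as area
})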
Consequence analysis
The most important part of any model is computing a consequence based on the hazard intensity measure. This simply involves calling your user-defined function from a select step. As we have seen, we can call any RiskScape function from a select step, including Python functions that you can write yourself.
Let’s use the simplest possible consequence function - the built-in is_exposed() function. This function accepts Anything as arguments and returns 1 if the hazard argument is not null, and 0 if the hazard argument is null.
Save the following into your pipeline.txt file.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
*,
bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
})
->
select({
*,
sample_centroid(exposure, hazard_coverage) as hazard
})
->
select({
*,
to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
})
->
select({
*,
sample_centroid(exposure, area_coverage) as area
})
->
select({
*,
is_exposed(exposure, hazard) as consequence
})
->
select({
exposure,
hazard,
area.Region,
consequence
}) as event_impact_table
Use riskscape model run advanced to re-run the model and check the results.
We now have a simple exposure model that we have written completely from scratch.
Note
The is_exposed() function simply checks whether the hazard is null. So an alternative to using the is_exposed() function here would be if(is_null(hazard), 0, 1).
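If you wanted something more sophisticated than is_exposed(), this is where a user-defined Python function would slot in. The sketch below is illustrative only - the convention of naming the Python function function, the exact [function] section keys, and the my_is_exposed name are assumptions to verify against the RiskScape documentation on writing Python functions:

# exposed.py - a minimal consequence-function sketch
def function(exposure, hazard):
    # a null hazard value means the building was not exposed
    if hazard is None:
        return 0
    return 1

Registered in project.ini with something like:

[function my_is_exposed]
location = exposed.py

you could then call my_is_exposed(exposure, hazard) from the select step, just like the built-in function.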
Pipeline parameters
As you may recall, the riskscape model run command lets us change the model’s parameters, or assumptions, on the fly. RiskScape lets you define your own parameters for pipeline models. For example, say that you feel the input-exposures.layer wizard parameter name is too verbose. You might prefer to simply call this parameter buildings in your pipeline model.
First, in the project.ini file you need to define the parameters your model will take, along with default values. The current project.ini already has some parameters defined for you: buildings, hazard, regions. These are the main input layers that the model takes.
[model advanced]
framework = pipeline
location = pipeline.txt
description = An empty pipeline file we will use for our tutorial
param.buildings = 'data/Buildings_SE_Upolu.shp'
param.hazard = 'data/MaxEnv_All_Scenarios_50m.tif'
param.regions = 'data/Samoa_constituencies.shp'
Note
The parameter values need to be in the exact format that the pipeline code expects. For example, the values for input files or bookmarks always need single quotes around the name, i.e. 'data/Buildings_SE_Upolu.shp', not data/Buildings_SE_Upolu.shp.
The next step is to replace the pipeline code that you want to parameterize. You use $ along with the parameter name, e.g. $buildings. For example, try replacing pipeline.txt with the following content, and save the file.
input($buildings, name: 'exposure')
->
select({
*,
bookmark($hazard) as hazard_coverage
})
->
select({
*,
sample_centroid(exposure, hazard_coverage) as hazard
})
->
select({
*,
to_coverage(bookmark($regions)) as area_coverage
})
->
select({
*,
sample_centroid(exposure, area_coverage) as area
})
->
select({
*,
is_exposed(exposure, hazard) as consequence
})
->
select({
exposure,
hazard,
area.Region,
consequence
}) as event_impact_table
Try running the model with a different hazard-layer by using the following command:
riskscape model run advanced -p "hazard='data/MaxEnv_All_Scenarios_10m.tif'"
The results produced will use the 10m hazard grid this time, rather than the 50m grid.
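If you need to change several assumptions at once, repeating the -p flag for each parameter should work, e.g.:

riskscape model run advanced -p "hazard='data/MaxEnv_All_Scenarios_10m.tif'" -p "regions='data/Samoa_constituencies.shp'"

Check riskscape model run --help if in doubt about the supported options.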
Note
RiskScape always saves a pipeline.txt file in the output directory. This file contains the replaced values for any parameters, so you can always see exactly what was used to produce the model’s results.
With pipelines, you can pass your parameter values right through to your Python code, if you need to. See Customizing your own model parameters for an example of this.
Customizing wizard models
Remember that wizard models are simply a framework built around an underlying RiskScape pipeline. There are several ways you can access the pipeline code for a wizard model:
Whenever you save a wizard model, it also saves a pipeline version of the model.
Whenever you run a wizard model, it generates a pipeline.txt file in the output directory.
Whenever you build a wizard model, you can use the ctrl+c menu to view the corresponding pipeline code as you go along.
Tip
Instead of writing a pipeline from scratch, it’s simpler to build a wizard model and then adjust the pipeline to suit your needs.
Wizard models have many different parameters, but not all of the parameters will need to be changed on the fly. The parameter names are also fairly generic, as they need to fit many different problem domains.
Defining your own pipeline parameters can be useful when you want to share models with less experienced RiskScape users. It can reduce complexity, as you can limit the user’s options to one or two clearly-named parameters. It also means users cannot inadvertently change the parts of the model that they shouldn’t, such as the spatial sampling technique.
Pipeline code formatting
The pipeline examples on this page have been formatted in a way that tries to make it clear what the pipeline is doing. However, you have some flexibility in how you organize and format your pipeline code. There are several different ways of achieving the same thing with pipelines.
Pipelines generated by the wizard will typically look more like the following example. This produces exactly the same results as the previous pipeline - the main difference is in how the code is formatted.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
-> select({*, bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard})
-> select({*, sample_centroid(exposure, hazard) as hazard})
-> select({*, to_coverage(bookmark('data/Samoa_constituencies.shp')) as area})
-> select({*, sample_centroid(exposure, area) as area})
-> select({*, is_exposed(exposure, hazard) as consequence})
-> select({exposure, hazard, area.Region, consequence}) as event_impact_table
Alternatively, we could reduce the total number of pipeline steps by reorganizing the pipeline code. For example, the following pipeline produces the exact same result as the previous example, but it uses nested function calls to reduce the total number of pipeline steps.
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
select({
*,
sample_centroid(exposure, bookmark('data/MaxEnv_All_Scenarios_50m.tif')) as hazard,
sample_centroid(exposure, to_coverage(bookmark('data/Samoa_constituencies.shp'))).Region as Region
}) as spatial_sampling
->
select({
*,
is_exposed(exposure, hazard) as consequence
}) as consequence_analysis
Note
The wizard generates pipeline code so that it is easy to construct, rather than easy to read. RiskScape has to handle many different paths through the wizard, so sometimes pipeline steps will be included that essentially do nothing, e.g. select({*}) as NAMED_STEP.
Joins
With database implementations, such as SQL, you can join two different tables, or datasets, together. The tables are typically joined on a shared attribute, or key, that the two tables have in common.
RiskScape also supports a join pipeline step that lets you combine two different inputs, or pipeline chains, into one. In RiskScape, each pipeline chain produces a relation, so it is conceptually the same as an SQL operation - we are simply joining two relations together.
The join pipeline step goes through the rows of data in each layer looking for two rows that match somehow. The match criteria may be an attribute they have in common, or it could be some other true/false conditional expression.
The join step is a little different to other pipeline steps, as it takes two different pipeline chains as input. One pipeline chain will be the left-hand-side (lhs) of the join, and the other will be the right-hand-side (rhs). When we use the join step we need to explicitly state what side of the join the pipeline chain is connecting to, e.g. step_foo -> join.lhs or step_bar -> join.rhs.
Note
If you omit the lhs or rhs input names, the step will be chained to the left-hand side of the join step by default.
The side of the join does matter as it can affect how the join behaves. As a rule of thumb you want the primary (or largest) dataset to be on the left-hand-side of the join.
A simple join
Let’s look at a simple example where we want to join additional information from another dataset to our main exposure-layer.
We have the following Cons_Frame.csv data that we want to combine with our building data.
Material,DS5_median
Masonry,2.49
Reinforced Concrete,7.3
Reinforced_Concrete,7.3
Timber,1.62
Steel,2.77
,2.77
Warning
The Cons_Frame.csv data is used here to demonstrate RiskScape’s pipeline functionality, rather than as an example of scientific best practice. For example, you could encode these values into your Python function rather than joining datasets. The values here have been taken from Empirical building fragilities from observed damage in the 2009 South Pacific tsunami.
To join this data to our exposure-layer, we need to specify the matching condition we will join on. In this case, the construction framing attribute is common to both datasets (the Cons_Frame attribute in our exposure-layer and Material in Cons_Frame.csv). So we can simply join two rows if the construction material is equal, i.e. exposure.Cons_Frame = Material.
Copy and paste the following into your pipeline.txt file, save it, and then run the pipeline using riskscape model run advanced.
input('data/Cons_Frame.csv')
->
join_cons_frame.rhs
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
join(on: exposure.Cons_Frame = Material) as join_cons_frame
This should produce a join_cons_frame.shp shapefile output. Open it and look at the attribute table. You should now see the Material and DS5_median included with the building data.
Notice that the pipeline code now has two input steps, but produces one output file. This is because the join step combines the two input relations into one.
Merging attributes into your exposure-layer
It is a little hard to see in the shapefile output, but there are a few undesirable things about this join still:
The Material attribute is redundant, because it simply duplicates the Cons_Frame attribute.
Because the data comes from a CSV file, the DS5_median is actually text rather than a floating-point number. We could potentially fix this by using a bookmark.
Although the DS5_median is now included in every row of data, it is still separate to the exposure struct. This makes it more awkward if we wanted to pass the DS5_median to a Python function, for example.
The following pipeline adds an extra step that addresses these issues. Try copying it into your pipeline.txt file, saving it, and running the advanced model again.
input('data/Cons_Frame.csv')
->
join_cons_frame.rhs
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
->
join(on: exposure.Cons_Frame = Material) as join_cons_frame
->
select({
{
exposure.*,
float(DS5_median) as DS5_median
} as exposure
})
You will not notice much difference in the select.shp output file it generates. However, the DS5_median attribute is now part of the exposure struct, and so could be accessed via exposure.DS5_median.
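To show why this matters, a later step could now use the fragility median directly in an expression. The following sketch is purely illustrative (as the earlier warning notes, this is not scientific best practice), and the maybe_collapsed attribute name is made up; in practice you would also guard against hazard being null:

-> select({
    *,
    if(hazard > exposure.DS5_median, 1, 0) as maybe_collapsed
})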
Wizard joins
If you look closely at the pipeline code that gets generated by the wizard, you will see that it uses join steps. For example, spatially sampling the area-layer would look something like this:
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
-> join(on: true) as exposures_join_areas
-> select({*, sample_one(geometry: exposure, coverage: area, buffer-distance: 1000) as area})
input(relation: 'data/Samoa_constituencies.shp', name: 'area') as areas_input
-> group({to_coverage(area) as area}) as areas
-> exposures_join_areas.rhs
As you have seen in the previous examples on this page, we managed to spatially sample the area-layer without using any join steps. There are several different approaches to achieving the same result with RiskScape pipelines. The wizard uses join steps because it allows you to do optional geoprocessing on the area-layer data.
The group({to_coverage(area) as area}) step turns the area-layer relation into a coverage. We end up with a single row of data, with a single area attribute, which is all the area-layer data in coverage form. The join(on: true) step then adds this area attribute to every row of building data, so that we can then sample the coverage.
Note
Only use join(on: true) when you have exactly one row of data on the right-hand side. Otherwise the rows of data on the left-hand side will get duplicated.
Join complexities
In these examples we have only covered the simple case where there is always a matching row between the left-hand-side and the right-hand-side datasets.
When some rows do not match between the two datasets, you have a choice of what type of join you want:
Inner join: drop any non-matching rows (the default).
Left-outer join: include any non-matching rows from the left-hand-side.
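For example, to keep the buildings that have no matching construction data in our earlier join, we would want a left-outer join. The sketch below assumes the step takes a join-type parameter - confirm the exact parameter name with riskscape pipeline step info join:

join(on: exposure.Cons_Frame = Material, join-type: 'left-outer') as join_cons_frame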
You could also use the join step to match based on geometry, however, we recommend using a spatial sampling function instead. For example, we could potentially join our area-layer to our exposure-layer using a step like: join(on: contains(area, centroid(exposure))). Currently, geometry-based join conditions will only work if both input files are in the same CRS.
Recap
Hopefully you now have a better understanding of the underlying data processing that is going on in a risk model. Here are some key takeaway points we have covered:
There are two types of input data layers: relations (e.g. CSV, shapefile) and coverages (e.g. ASC, GeoTIFF).
There are two ways to combine geospatial layers in RiskScape: using the join step or sampling from a coverage.
The join step has some complexities and limitations, so we recommend using the sampling approach to combine layers, where possible.
A sampling function uses a given geometry (e.g. a building) as an index into a spatial coverage file.
When you have a relation (CSV or shapefile data), you can still turn it into a coverage and sample it.
The consequence analysis is simply a function call from a select step.
You can define your own parameters for a pipeline model. This means you can tailor your parameters so your model is simpler for less experienced RiskScape users to use.
Wizard models are always backed by pipeline code. It can be simpler to start off customizing a wizard model, rather than writing your own pipeline from scratch.
The join step brings together two different pipeline chains or relation layers. You define the join’s on condition, which determines how the rows of data are matched up.
Additional exercises
Congratulations! You have reached the end of the RiskScape tutorials. If you want to test your pipeline abilities further, here are some exercises to try out on your own. Start off with the pipeline we used in the Consequence analysis section.
Make sure that every building gets matched to a region. Replace the sample_centroid(exposure, area_coverage) function call with sample_one(exposure, area_coverage). Try adding the buffer-distance function argument and setting it to 5000 metres.
Change the consequence to either ‘Exposed’ or ‘Not Exposed’ in the results. Tip: replace the is_exposed() function call with if(). You can use is_null(hazard) to check if a building was not exposed to the hazard. Remember you can use riskscape function info FUNCTION_NAME to find out more about a function.
Add a group step so that the results are grouped by consequence. Use count(*) to count the number of buildings in each group (i.e. ‘Exposed’ and ‘Not exposed’).
Update the group step so that the results are also broken down by construction type, i.e. group by consequence and Cons_Frame. Tip: to group by multiple things, you use a Struct expression. There was an example of Aggregating by multiple attributes in the previous tutorial.
Include the median hazard depth for each group, i.e. median(hazard). For bonus points, report the depth in centimetres rather than metres, and round it to the nearest whole number.
Add a sort step so that the ‘Exposed’ results are all grouped together. Tip: you can use the riskscape pipeline step info STEP_NAME command to find out more about a new step.
Change the filename that the results are saved in to be regional-exposure.csv.
If you get stuck, there is an answers.txt file in your directory with the solutions.