.. _advanced_pipelines:

# How to write advanced pipelines

## Before we start

This tutorial is aimed at _advanced_ RiskScape users who want a better understanding of the inner workings of their risk models. We expect that you:

- Have been through the :ref:`pipelines_tutorial` tutorial already.
- Are proficient using RiskScape and other shell commands.
- Are comfortable working with geospatial data and risk analysis models.

The aim of this tutorial is to build a pipeline for a risk model from scratch. This will hopefully give you a better understanding of the RiskScape pipeline features that lie beneath your risk models, such as models built using the wizard.

## Background

.. _relation_vs_coverage:

### Types of layers

There are essentially two types of input data layers that RiskScape uses:

- [Relations](https://en.wikipedia.org/wiki/Relation_(database%29), which hold rows or records of data. Relations include CSV and shapefile _vector_ data. The previous :ref:`pipeline tutorial <pipelines_tutorial>` only covered relation data.
- [Coverages](https://en.wikipedia.org/wiki/Coverage_data), which are spatial lookup tables. Coverages include grid-based GeoTIFF and ASC _raster_ files. We will look at coverages a little later in the tutorial.

The `input` pipeline step only works with _relation_ datasets. Subsequent pipeline steps can transform the data, but each step ultimately still produces a _relation_, which then becomes the input for the next step. Because pipeline steps primarily deal with relations, you may notice some similarities with SQL.

If you are not familiar with SQL, it may help to think of a relation like a CSV file or spreadsheet. A relation holds rows of data, and each row has several columns (attributes). Each pipeline step essentially takes a relation as input, transforms the data row by row, and passes the new relation on to the next step. Conceptually, it may help to visualize the relation data at any one point in the pipeline as being akin to CSV or spreadsheet data.

### Combining layers

The previous pipeline tutorial only used a single dataset as input. But to do any real modelling, you will want to combine multiple datasets together. The most obvious example is finding geospatial matches between your exposure-layer and your hazard-layer.

There are two ways we can combine datasets:

- Using the `join` pipeline step to combine two different pipeline chains (relations) into one. This is similar to a 'join' database operation.
- Using a function call, such as `sample()`, to retrieve any matching geospatial data from a _coverage_, one row of data at a time.

This tutorial will mostly look at the sampling approach.

## Setup

Click [here](../pipeline-advanced.zip) to download the input data we will use in these examples. Unzip the file into a new directory, e.g. `C:\RiskScape_Projects\`.

Open a command prompt and `cd` to the directory where you unzipped the files, e.g.

```none
cd C:\RiskScape_Projects\pipeline-advanced
```

You will use this command prompt to run the RiskScape commands in this tutorial.

The unzipped directory contains an empty file called `pipeline.txt`. Open this file in a text editor, such as Notepad or `gedit`. This is where you will paste the pipeline examples.

The `project.ini` is pretty simple - it loads the `pipeline.txt` contents as a pipeline model called `advanced`.
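If you open `project.ini`, the `advanced` model definition will look something like the minimal sketch below. (The actual file in the download also contains a `description` and some `param.` entries, which we will come back to later in this tutorial.)

```ini
[model advanced]
framework = pipeline
location = pipeline.txt
```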
The input data files that we will use in our model are exactly the same ones from the :ref:`getting-started` tutorial.

.. note::
   This input data was provided by `NIWA `_, as well as the `PCRAFI `_ (Pacific Risk Information System) website. The data files have been adapted slightly for this tutorial.

## Loading the input data

Let's start off by simply loading the shapefile building data. Add the following to your `pipeline.txt` file and save it.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        exposure,
        # TODO: we need to work this out still...
        '???' as Region,
        '???' as hazard,
        '???' as consequence
    }) as event_impact_table
```

Notice that the `input` step in the pipeline uses the `name` parameter. This wraps up each row of input data in a `Struct` instance called `exposure`. This makes it easier to distinguish our exposure-layer data from our area-layer and hazard-layer data.

.. note::
   When you give the input data ``Struct`` a name, it changes how you access the attributes. For example, in our previous pipeline tutorial we simply used ``Use_Cat`` to access the use-category attribute, whereas now we would have to use ``exposure.Use_Cat``.

Try running the pipeline with the following command.

```none
riskscape model run advanced
```

This should output an `event_impact_table.shp` file. Because our input data contains geometry, RiskScape defaults to saving it as a shapefile. You can open the shapefile in QGIS, but the results are not very interesting yet. The data is exactly the same as the `Buildings_SE_Upolu.shp` file, except we have added some extra attributes that all have the value `???`.

## Spatial sampling

Our pipeline uses the `input` step to read an input data _relation_ one row at a time. We can also use the `bookmark()` function to return a [handle](https://en.wikipedia.org/wiki/Handle_(computing%29) to the _entire_ relation or coverage, which can then be passed to other functions.

.. note::
   If there is no configured bookmark with the given name, RiskScape will try to find a file with the given name. We have not set up any bookmarks in this tutorial, so we are only using filenames.

So we _could_ use the following pipeline, which would load the `MaxEnv_All_Scenarios_50m.tif` hazard file. Because the file is a GeoTIFF, the `bookmark()` function returns it as a coverage.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
    })
```

However, this pipeline is still not quite what we need. The last step adds the *whole* coverage to the results as the `hazard_coverage` attribute, whereas we want a single hazard intensity measure for each building. To get a hazard intensity measure, we still need to *sample* the coverage data.

.. tip::
   Remember that ``*`` in a pipeline works as a wildcard to select *all* the attributes that are already present in the pipeline data.

### Sampling a coverage

A coverage is a bit like a spatial lookup table, so for any given geometry (e.g. a building), a coverage can return the corresponding data (if any) at that point. We call this geometry-based lookup _sampling_. You could think of sampling as being like putting a pin into a map and plucking out the data at the point where it lands.

RiskScape supports several different `sample()` functions that all take a coverage argument and a given geometry value to 'lookup' in the coverage. We will use `sample_centroid()` here, which takes the centre of each building's geometry and uses it as an index into the coverage.
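To give a feel for this family of functions, here is a sketch of two sampling expressions, based on how they are used later in this tutorial. Both would appear inside a `select` step; the `hazard_coverage` attribute is assumed to already be in scope (we add it in the next pipeline), and the `buffer-distance` value is purely illustrative:

```none
# look up the coverage at the centre (centroid) of the building's geometry
sample_centroid(exposure, hazard_coverage) as hazard

# sample_one() matches against the building's geometry instead, and accepts
# an optional buffer-distance (in metres) to catch near misses
sample_one(exposure, hazard_coverage, buffer-distance: 10) as hazard
```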
Now copy the following into your `pipeline.txt` file and save it.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
    })
 -> select({
        exposure,
        hazard,
        # TODO: we need to work this out still...
        '???' as Region,
        '???' as consequence
    }) as event_impact_table
```

Then re-run the model using the following command:

```none
riskscape model run advanced
```

Open the `event_impact_table.shp` results file in QGIS. If you open the attribute table, each building should now have a valid `hazard` value associated with it.

.. note::
   When there is no corresponding hazard value sampled for a building, ``-9999.0`` will be written to the shapefile. RiskScape's ``sample()`` functions will always handle reprojection if the two layers use different CRSs.

.. _sampling_relations:

### Sampling a relation

To sample spatial data, we always need a _coverage_ layer. If we have _relation_ data from a CSV or shapefile, such as our area-layer data, then we can _convert_ it into a coverage layer. To turn a relation into a coverage, we use the `to_coverage()` function. As you may recall, we can use the `bookmark()` function to get the _whole_ relation.

Once we have turned our area-layer into a coverage, we can then use a sampling function, such as `sample_centroid()`, to match each building to a corresponding region in our area-layer. Copy the following into your `pipeline.txt` file and save it.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
    })
 -> select({
        *,
        to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, area_coverage) as area
    })
 -> select({
        exposure,
        hazard,
        area.Region,
        # TODO: we need to work this out still...
        '???' as consequence
    }) as event_impact_table
```

Then re-run the model using the following command:

```none
riskscape model run advanced
```

Open the `event_impact_table.shp` results file in QGIS. If you open the attribute table, each building should now have a valid `Region` value associated with it.

.. note::
   A few buildings fall outside the area-layer boundaries, so they will have a ``NULL`` value for the ``Region`` attribute. As we have seen in previous examples, we could use a ``buffer-distance`` when sampling in order to better match buildings to the closest region. To do this, we would need to use the ``sample_one()`` function instead of ``sample_centroid()``.

.. _simple_risk_pipeline:

## Consequence analysis

The most important part of any model is computing a _consequence_ based on the hazard intensity measure. This simply involves calling your user-defined function from a `select` step. As we have seen, we can call *any* RiskScape function from a `select` step, including Python functions that you can write yourself.
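For example, if you had written your own Python damage function, the consequence step might look something like this hypothetical sketch. Note that `my_damage_function` is a made-up name for illustration, not a function that exists in this tutorial's project:

```none
# hypothetical pipeline fragment: compute the consequence by calling
# a user-defined function instead of a built-in one
 -> select({
        *,
        my_damage_function(exposure, hazard) as consequence
    })
```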
Let's use the simplest possible consequence function - the built-in `is_exposed()` function. This function accepts `Anything` as arguments and returns `1` if the hazard argument is not null, and `0` if the hazard argument is null. Save the following into your `pipeline.txt` file.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        *,
        bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
    })
 -> select({
        *,
        to_coverage(bookmark('data/Samoa_constituencies.shp')) as area_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, area_coverage) as area
    })
 -> select({
        *,
        is_exposed(exposure, hazard) as consequence
    })
 -> select({
        exposure,
        hazard,
        area.Region,
        consequence
    }) as event_impact_table
```

Use `riskscape model run advanced` to re-run the model and check the results. We now have a simple exposure model that we have written completely from scratch.

.. note::
   The ``is_exposed()`` function simply checks whether the hazard is *null*. So an alternative to using the ``is_exposed()`` function here would be ``if_then_else(is_null(hazard), 0, 1)``.

## Pipeline parameters

As you may recall, the `riskscape model run` command lets us change the model's parameters, or assumptions, on the fly. RiskScape lets you define your own parameters for pipeline models. For example, say that you feel the `input-exposures.layer` wizard parameter name is too verbose. You might prefer to simply call this parameter `buildings` in your pipeline model.

First, in the `project.ini` file you need to define the parameters your model will take, along with default values. The current `project.ini` already has some parameters defined for you: `buildings`, `hazard`, and `regions`. These are the main input layers that the model takes.

```ini
[model advanced]
framework = pipeline
location = pipeline.txt
description = An empty pipeline file we will use for our tutorial
param.buildings = 'data/Buildings_SE_Upolu.shp'
param.hazard = 'data/MaxEnv_All_Scenarios_50m.tif'
param.regions = 'data/Samoa_constituencies.shp'
```

.. note::
   The parameter values need to be in the exact format that the pipeline code expects. For example, the values for input files or bookmarks always need single quotes around the name, i.e. ``'data/Buildings_SE_Upolu.shp'``, *not* ``data/Buildings_SE_Upolu.shp``.

The next step is to replace the parts of the pipeline code that you want to parameterize. You use `$` along with the parameter name, e.g. `$buildings`. For example, try replacing `pipeline.txt` with the following content, and save the file.

```none
input($buildings, name: 'exposure')
 -> select({
        *,
        bookmark($hazard) as hazard_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, hazard_coverage) as hazard
    })
 -> select({
        *,
        to_coverage(bookmark($regions)) as area_coverage
    })
 -> select({
        *,
        sample_centroid(exposure, area_coverage) as area
    })
 -> select({
        *,
        is_exposed(exposure, hazard) as consequence
    })
 -> select({
        exposure,
        hazard,
        area.Region,
        consequence
    }) as event_impact_table
```

Try running the model with a different hazard-layer by using the following command:

```none
riskscape model run advanced -p "hazard='data/MaxEnv_All_Scenarios_10m.tif'"
```

The results produced will use the 10m hazard grid this time, rather than the 50m grid.

.. note::
   RiskScape always saves a ``pipeline.txt`` file in the output directory. This file contains the *replaced* values for any parameters, so you can always see exactly what was used to produce the model's results.

With pipelines, you can pass your parameter values right through to your Python code, if you need to. See :ref:`parameterized_pipelines` for an example of this.
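Because the parameter values are substituted directly into the pipeline code, parameters are not limited to input files. For example, you could make the sampling buffer distance configurable. This is a hypothetical sketch - the tutorial's `project.ini` does not actually define a `buffer` parameter:

```ini
param.buffer = 5000
```

```none
# $buffer is replaced with 5000 before the pipeline runs
select({*, sample_one(exposure, area_coverage, buffer-distance: $buffer) as area})
```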
### Customizing wizard models

Remember that wizard models are simply a framework built around an underlying RiskScape pipeline. There are several ways you can access the pipeline code for a wizard model:

- Whenever you save a wizard model, it also saves a pipeline version of the model.
- Whenever you run a wizard model, it generates a `pipeline.txt` file in the output directory.
- Whenever you build a wizard model, you can use the `ctrl+c` menu to view the corresponding pipeline code as you go along.

.. tip::
   Instead of writing a pipeline from scratch, it's simpler to build a wizard model and then adjust the pipeline to suit your needs.

Wizard models have many different parameters, but not all of the parameters will need to be changed on the fly. The parameter names are also fairly generic, as they need to fit many different problem domains.

Defining your own pipeline parameters can be useful when you want to share models with less experienced RiskScape users. It can reduce complexity, as you can limit the user's options to one or two clearly-named parameters. It also means users cannot inadvertently change the parts of the model that they shouldn't, such as the spatial sampling technique.

### Pipeline code formatting

The pipeline examples on this page have been formatted in a way that tries to make it clear what the pipeline is doing. However, you have some flexibility in how you organize and format your pipeline code. There are several different ways of achieving the same thing with pipelines.

Pipelines generated by the wizard will typically look more like the following example. This produces exactly the same results as the previous pipeline - the main difference is in how the code is formatted.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({*, bookmark('data/MaxEnv_All_Scenarios_50m.tif') as hazard})
 -> select({*, sample_centroid(exposure, hazard) as hazard})
 -> select({*, to_coverage(bookmark('data/Samoa_constituencies.shp')) as area})
 -> select({*, sample_centroid(exposure, area) as area})
 -> select({*, is_exposed(exposure, hazard) as consequence})
 -> select({exposure, hazard, area.Region, consequence}) as event_impact_table
```

Alternatively, we could reduce the total number of pipeline steps by reorganizing the code. For example, the following pipeline produces exactly the same result as the previous example, but it uses *nested* function calls, so both layers are sampled in a single step.

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> select({
        *,
        sample_centroid(exposure, bookmark('data/MaxEnv_All_Scenarios_50m.tif')) as hazard,
        sample_centroid(exposure, to_coverage(bookmark('data/Samoa_constituencies.shp'))).Region as Region
    }) as spatial_sampling
 -> select({
        *,
        is_exposed(exposure, hazard) as consequence
    }) as consequence_analysis
```

.. note::
   Pipeline code generated by the wizard is written so that it is easy to *construct*, rather than easy to *read*. RiskScape has to handle many different paths through the wizard, so sometimes pipeline steps will be included that essentially do nothing, e.g. ``select({*}) as NAMED_STEP``.

.. _advanced_pipeline_joins:

## Joins

In database query languages, such as SQL, you can *join* two different tables, or datasets, together. The tables are typically joined on a shared attribute, or *key*, that the two tables have in common.

RiskScape also supports a `join` pipeline step that lets you combine two different inputs, or pipeline *chains*, into one. In RiskScape, each pipeline chain produces a relation, so conceptually this is the same as an SQL join - we are simply joining two relations together.

The `join` pipeline step goes through the rows of data in each layer looking for two rows that _match_ somehow. The match criteria may be an attribute that the two layers have in common, or some other true/false conditional expression.

The `join` step is a little different to other pipeline steps, as it takes _two_ different pipeline chains as input. One pipeline chain will be the _left-hand-side_ (`lhs`) of the join, and the other will be the _right-hand-side_ (`rhs`). When we use the `join` step we need to explicitly state what side of the join the pipeline chain is connecting to, e.g. `step_foo -> join.lhs` or `step_bar -> join.rhs`.

.. note::
   If you omit the ``lhs`` or ``rhs`` input names, the step will be chained to the left-hand side of the ``join`` step by default.

The side of the join does matter, as it can affect how the join behaves. As a rule of thumb, you want the primary (or largest) dataset to be on the left-hand-side of the join.
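Putting this together, the wiring of a join generally has the following shape. This is just a sketch - the file names, the join condition, and the step name `my_join` are placeholders rather than part of the tutorial data:

```none
# the secondary dataset is piped into the right-hand side of a named join step
input('other_dataset.csv') -> my_join.rhs

# the primary dataset chains straight into the join step,
# becoming its left-hand side by default
input('main_dataset.csv')
 -> join(on: SOME_CONDITION) as my_join
```

We will use exactly this pattern in the example below.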
### A simple join

Let's look at a simple example where we want to join additional information from another dataset to our main exposure-layer. We have the following `Cons_Frame.csv` data that we want to combine with our building data.

```none
Material,DS5_median
Masonry,2.49
Reinforced Concrete,7.3
Reinforced_Concrete,7.3
Timber,1.62
Steel,2.77
,2.77
```

.. warning::
   The ``Cons_Frame.csv`` data is used here to demonstrate RiskScape's pipeline functionality, rather than as an example of scientific best practice. For example, you could encode these values into your Python function rather than joining datasets. The values here have been taken from `Empirical building fragilities from observed damage in the 2009 South Pacific tsunami `_.

To join this data to our exposure-layer, we need to specify the matching condition we will join _on_. In this case, the construction framing attribute is common to both datasets (the `Cons_Frame` attribute in our exposure-layer and `Material` in `Cons_Frame.csv`). So we can simply join two rows if the construction material is equal, i.e. `exposure.Cons_Frame = Material`.

Copy and paste the following into your `pipeline.txt` file, save it, and then run the pipeline using `riskscape model run advanced`.

```none
input('data/Cons_Frame.csv') -> join_cons_frame.rhs

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> join(on: exposure.Cons_Frame = Material) as join_cons_frame
```

This should produce a `join_cons_frame.shp` shapefile output. Open it and look at the attribute table. You should now see the `Material` and `DS5_median` attributes included with the building data.

Notice that the pipeline code now has *two* input steps, but produces *one* output file. This is because the `join` step combines the two input relations into one.
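To make the matching concrete, consider a building whose `Cons_Frame` attribute is `Timber`. The join pairs that building with the `Timber,1.62` row from the CSV, so the joined row carries the attributes of both sides. Sketched informally (this is not actual RiskScape output):

```none
# one joined row, informally
exposure: { Cons_Frame: 'Timber', ...the building's other attributes... }
Material: 'Timber'
DS5_median: '1.62'
```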
### Merging attributes into your exposure-layer

It is a little hard to see in the shapefile output, but there are still a few undesirable things about this join:

1. The `Material` attribute is redundant, because it simply duplicates the `Cons_Frame` attribute.
2. Because the data comes from a CSV file, the `DS5_median` is actually text rather than a floating-point number. We could potentially fix this by using a bookmark.
3. Although the `DS5_median` is now included in every row of data, it is still separate to the `exposure` struct. This makes it more awkward if we wanted to pass the `DS5_median` to a Python function, for example.

The following pipeline adds an extra step that addresses these issues. Try copying it into your `pipeline.txt` file, saving it, and running the `advanced` model again.

```none
input('data/Cons_Frame.csv') -> join_cons_frame.rhs

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> join(on: exposure.Cons_Frame = Material) as join_cons_frame
 -> select({
        { exposure.*, float(DS5_median) as DS5_median } as exposure
    })
```

You will not notice much difference in the `select.shp` output file it generates. However, the `DS5_median` attribute is now part of the `exposure` struct, and so could be accessed via `exposure.DS5_median`.

### Wizard joins

If you look closely at the pipeline code that gets generated by the wizard, you will see that it uses `join` steps. For example, spatially sampling the area-layer would look something like this:

```none
input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> join(on: true) as exposures_join_areas
 -> select({*, sample_one(geometry: exposure, coverage: area, buffer-distance: 1000) as area})

input(relation: 'data/Samoa_constituencies.shp', name: 'area') as areas_input
 -> group({to_coverage(area) as area}) as areas
 -> exposures_join_areas.rhs
```

As you have seen in the previous examples on this page, we managed to spatially sample the area-layer *without* using any `join` steps. There are several different approaches to achieving the same result with RiskScape pipelines. The wizard uses `join` steps because it allows you to do optional geoprocessing on the area-layer data.

The `group({to_coverage(area) as area})` step turns the area-layer relation into a coverage. We end up with a single row of data, with a single `area` attribute, which is all the area-layer data in coverage form. The `join(on: true)` step then adds this `area` attribute to every row of building data, so that we can then sample the coverage.

.. note::
   Only use ``join(on: true)`` when you have *exactly* one row of data on the right-hand side. Otherwise the rows of data on the left-hand side will get duplicated.

### Join complexities

In these examples we have only covered the simple case where there is _always_ a matching row between the left-hand-side and the right-hand-side datasets. When some rows do _not_ match between the two datasets, you have a choice of what _type_ of join you want:

- [Inner join](https://en.wikipedia.org/wiki/Join_(SQL%29#Inner_join): drop any non-matching rows (the default).
- [Left-outer join](https://en.wikipedia.org/wiki/Join_(SQL%29#Left_outer_join): include any non-matching rows from the _left_-hand-side.

You *could* also use the `join` step to match based on geometry; however, we recommend using a spatial sampling function instead. For example, we could potentially join our area-layer to our exposure-layer using a step like `join(on: contains(area, centroid(exposure)))`. Currently, geometry-based `join` conditions will only work if both input files are in the *same* CRS.
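For completeness, here is a sketch of how such a geometry-based join might be wired up, reusing the join pattern from earlier. Treat it as an illustration only - the sampling pipelines shown earlier are the recommended approach, and this assumes both files are in the same CRS:

```none
# hypothetical: match a building to a region when the region's geometry
# contains the building's centroid (both layers must be in the same CRS)
input('data/Samoa_constituencies.shp', name: 'area') -> join_areas.rhs

input('data/Buildings_SE_Upolu.shp', name: 'exposure')
 -> join(on: contains(area, centroid(exposure))) as join_areas
```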
## Recap

Hopefully you now have a better understanding of the underlying data processing that is going on in a risk model. Here are some key takeaway points we have covered:

- There are two types of input data layers: _relations_ (e.g. CSV, shapefile) and _coverages_ (e.g. ASC, GeoTIFF).
- There are two ways to combine geospatial layers in RiskScape: using the `join` step or _sampling_ from a coverage.
- The `join` step has some complexities and limitations, so we recommend using the _sampling_ approach to combine layers, where possible.
- A sampling function uses a given geometry (e.g. a building) as an index into a spatial coverage file.
- When you have a relation (CSV or shapefile data), you can still turn it into a coverage and sample it.
- The consequence analysis is simply a function call from a `select` step.
- You can define your own parameters for a pipeline model. This means you can tailor your parameters so your model is simpler for less experienced RiskScape users to use.
- Wizard models are always backed by pipeline code. It can be simpler to start off customizing a wizard model, rather than writing your own pipeline from scratch.
- The `join` step brings together two different pipeline chains or _relation_ layers. You define the join `on` condition that determines how the rows of data are matched up.

## Additional exercises

Congratulations! You have reached the end of the RiskScape tutorials.

If you want to test your pipeline abilities further, here are some exercises to try out on your own. Start off with the pipeline we used in the :ref:`simple_risk_pipeline` section.

- Make sure that every building gets matched to a region. Replace the `sample_centroid(exposure, area_coverage)` function call with `sample_one(exposure, area_coverage)`. Try adding the `buffer-distance` function argument and setting it to 5000 metres.
- Change the `consequence` to either 'Exposed' or 'Not Exposed' in the results. _Tip_: replace the `is_exposed()` function call with `if_then_else()`. You can use `is_null(hazard)` to check if a building was *not* exposed to the hazard. Remember you can use `riskscape function info FUNCTION_NAME` to find out more about a function.
- Add a `group` step so that the results are grouped by `consequence`. Use `count(*)` to count the number of buildings in each group (i.e. 'Exposed' and 'Not Exposed').
- Update the `group` step so that the results are also broken down by construction type, i.e. group by `consequence` _and_ `Cons_Frame`. _Tip_: to group by multiple things, you use a `Struct` expression. There was an example of :ref:`group_multiple_attributes` in the previous tutorial.
- Include the median hazard depth for each group, i.e. `median(hazard)`. For bonus points, report the depth in centimetres rather than metres, and round it to the nearest whole number.
- Add a `sort` step so that the 'Exposed' results are all grouped together. _Tip_: you can use the `riskscape pipeline step info STEP_NAME` command to find out more about a new step.
- Change the filename that the results are saved to, so that it is `regional-exposure.csv`.

If you get stuck, there is an `answers.txt` file in your directory with the solutions.