OpenQuake data

This page describes how to read shaking data from an HDF5 file generated by OpenQuake v3.11 or later. An OpenQuake HDF5 file is split into several datasets. Of particular interest are:

  • sitecol: The OpenQuake shaking data is organized by a mesh of sites. sitecol is a group of datasets that specify the lat, lon coordinates of each site present in the file. Each site has a unique ID (sids).

  • gmf_data: This is a group of datasets that contains the actual shaking data (gmv_0). Each shaking data-point corresponds to a specific site (sid) and event (eid).

  • events: Maps each event (id) to the rupture (rup_id) that generated it.

  • ruptures: Contains information about the source fault that generated the event, such as the number of occurrences (n_occ) or the annual probability of occurring (occurrence_rate).

Note

In order to read an OpenQuake file, you need to have the HDF5 plugin enabled. Refer to Plugins for instructions on how to enable RiskScape plugins.

Overview

Because the OpenQuake data is split across several datasets, we need to piece the hazard data back together in order to match it to our exposure-layer:

  1. First, we need to geospatially match our exposure-layer data to the hazard-layer. To do this, we turn the sitecol data into a special type of coverage that lets us find the site closest to each element-at-risk.

  2. We can then go through all the shaking data (gmf_data). This gives us shaking intensities at each site (and now also, for each element-at-risk), for every event in the HDF5 file.

  3. Optionally, we can use the event ID to lookup the rupture that generated it. For probabilistic modelling, this lets you access the metadata (e.g. annual probability of occurring) associated with each event.

Site information

To read the sitecol information, you need a bookmark like this:

[bookmark sitecol]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /sitecol/lat
dataset = /sitecol/lon
dataset = /sitecol/sids
set-attribute.location = create_point(lat, lon)
crs-name = EPSG:4326

This returns relational data, which you can then turn into a coverage that can be spatially sampled more easily. Because the sites are a mesh of points, you will need to use a Nearest-neighbour coverage to map your exposure-layer data to the closest site.

For example, to turn the sitecol bookmark into a nearest neighbour coverage (with an 11km cut-off distance), you can add the following step to your pipeline:

select({*,
        to_coverage(bookmark('sitecol'),
                    options: { index: 'nearest_neighbour',
                               nearest_neighbour_max_distance: 11000 }) as site_coverage
       })

You can then sample this coverage to find the closest site to your element-at-risk, e.g.

select({*,
        sample_centroid(exposure, site_coverage) as site
       }) as sample_sitecol

Shaking data

To read the shaking data, you need a bookmark like this:

[bookmark gmf_data]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /gmf_data/eid
dataset = /gmf_data/sid
dataset = /gmf_data/gmv_0

Tip

As you may have a lot of shaking data, it can be useful to add filter = gmv_0 > THRESHOLD to your bookmark, so that shaking data below a minimum threshold is excluded from your model. Just replace THRESHOLD with a number you are comfortable with, e.g. 0.01.
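
For example, the gmf_data bookmark with a 0.01 threshold applied would look like this:

[bookmark gmf_data]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /gmf_data/eid
dataset = /gmf_data/sid
dataset = /gmf_data/gmv_0
filter = gmv_0 > 0.01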

In the previous section, we saw how to match our exposure-layer geometry to a site in the OpenQuake data. We can then join our shaking data to our exposure-layer data based on site ID, e.g.

input('gmf_data', name: 'hazard') as hazard_input
 ->
join(on: hazard.sid = site.sids) as join_hazard_to_exposures

Optimizing the join

If your exposure-layer contains a lot of data (e.g. several hundred thousand records), then you can potentially optimize the join step by grouping by site ID first. For example, after the sample_sitecol step you could add:

group(by: site,
      select: {
        to_list(exposure) as exposure,
        site
      })

This combines your exposure-layer data so that you have a single list for each site. Each list contains all the elements-at-risk affected by shaking at that site.

Then after the join_hazard_to_exposures step, you simply ‘unnest’ the list again, e.g.

unnest(exposure)

This turns each item in the list back into a separate row of data.

This can improve the join performance because there are fewer matches to find, i.e. the join only needs to match one list per site rather than dozens of individual elements-at-risk.
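
Putting this together, the relevant part of the pipeline might look something like the following sketch, which is based on the example pipeline at the end of this page with the group and unnest steps added (the group_by_site and unnest_exposures step names are just illustrative):

# match each element-at-risk to a site in the OpenQuake data
select({
        exposure,
        sample_centroid(exposure, site_coverage) as site
       }) as sample_sitecol
 ->
# combine the elements-at-risk at each site into a single list
group(by: site,
      select: {
        to_list(exposure) as exposure,
        site
      }) as group_by_site
 ->
join_hazard_to_exposures.rhs

# read in the shaking data
input('gmf_data', name: 'hazard') as hazard_input
 ->
# match it up to the exposure-layer, based on site
join(on: hazard.sid = site.sids) as join_hazard_to_exposures
 ->
# turn each element-at-risk in the list back into a separate row
unnest(exposure) as unnest_exposures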

Events and ruptures

To read the events or ruptures datasets, you can use bookmarks like the following:

[bookmark events]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /events

[bookmark rupture]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /ruptures

Currently you would need to join this input data to your pipeline by matching on the event ID (eid) or rupture ID (rup_id).
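
For example, a sketch of joining the events metadata onto the shaking data might look like this (the events_input and join_events_to_hazard step names are just illustrative):

# read in the events metadata
input('events', name: 'event') as events_input
 ->
# match each event to the shaking data, based on event ID
join(on: event.id = hazard.eid) as join_events_to_hazard

The output of the join_hazard_to_exposures step would then connect to join_events_to_hazard.rhs, in the same way that the exposure-layer chain connects to join_hazard_to_exposures.rhs in the example pipeline below.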

Example pipeline

Here is a full example of a simple, working pipeline. It does not include any optimizations for performance.

# read in data from your exposure-layer bookmark
input('exposure_layer', name: 'exposure')
 ->
select({*,
        to_coverage(bookmark('sitecol'),
                    options: { index: 'nearest_neighbour',
                               nearest_neighbour_max_distance: 11000 }) as site_coverage
       })
 ->
# match each element-at-risk to a site in the OpenQuake data
select({
        exposure,
        sample_centroid(exposure, site_coverage) as site
       }) as sample_sitecol
 ->
join_hazard_to_exposures.rhs

# read in the shaking data
input('gmf_data', name: 'hazard') as hazard_input
 ->
# match it up to the exposure-layer, based on site
join(on: hazard.sid = site.sids) as join_hazard_to_exposures

# TODO: call your Python function with exposure, hazard.gmv_0

# TODO: aggregate results, e.g. by event or by exposure
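
As a rough illustration, the last two TODO steps could look something like the sketch below, which assumes a hypothetical damage_ratio function that takes an element-at-risk and a shaking intensity (the function, the step names, and the aggregation shown are purely illustrative and will depend on your model):

# apply your consequence function to each element-at-risk and its shaking intensity
select({*, damage_ratio(exposure, hazard.gmv_0) as consequence}) as apply_function
 ->
# aggregate the results, e.g. sum up the consequence for each event
group(by: hazard.eid,
      select: {
        hazard.eid as event,
        sum(consequence) as total_consequence
      }) as aggregate_by_event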