Site-based hazard data

Some probabilistic datasets organize the hazard intensity values around fixed geographic locations (or sites), where the same sites are used across all events.

One example is the OpenQuake HDF5 file format used for modelling earthquake ruptures. The shaking measurements (e.g. PGA values) are recorded across a mesh of sites, which are fixed geospatial points, each with a unique site ID (or sid). The OpenQuake events dataset is indexed by sid, so the hazard intensities across all events can be efficiently looked up for a single site.
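
For instance, you can peek inside an OpenQuake calculation file yourself to see this site-based structure. The following Python sketch uses the h5py package; the sitecol and gmf_data names are assumptions based on typical OpenQuake output and may vary between engine versions.

# exploratory sketch: peek inside an OpenQuake calculation HDF5 file with h5py.
# The 'sitecol' and 'gmf_data' names are assumptions based on typical OpenQuake
# output and may differ between engine versions.
import h5py

with h5py.File('my-project/calc_420.hdf5', 'r') as f:
    # list the top-level groups/datasets to see how the file is organized
    print(list(f.keys()))

    # the site collection holds the fixed geospatial points, one per site ID (sid)
    if 'sitecol' in f:
        print(f['sitecol'])

    # the ground motion fields are keyed by site ID and event ID, which is
    # what makes the per-site hazard lookup efficient
    if 'gmf_data' in f:
        print(f['gmf_data'])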

This site-based approach to organizing the hazard data is useful because geospatial matching is an expensive part of the model processing. With site-based hazard data, you can spatially match each element-at-risk to its nearest site once, and then look up all the event data for that site in a single operation.

This leads to more efficient data processing, particularly when you are dealing with large probabilistic datasets that have thousands, or even millions, of events. In some cases, such as with OpenQuake, using a site-based approach is the only efficient way to look up the hazard intensity values.
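
In plain Python terms, the idea looks something like the sketch below. The nearest_site and events_for_site helpers are hypothetical placeholders for the spatial match and the site-indexed lookup; they are not RiskScape functions.

# conceptual sketch only (plain Python, not RiskScape's API).
# nearest_site() and events_for_site() stand in for the one-off spatial match
# and the site-indexed hazard lookup respectively.
def site_losses(buildings, nearest_site, events_for_site, loss_function):
    results = []
    for building in buildings:
        # the expensive geospatial matching happens once per element-at-risk
        site_id = nearest_site(building)
        # all the event data for that site is then fetched in a single lookup
        for event_id, gmv in events_for_site(site_id):
            results.append((event_id, loss_function(building, gmv)))
    return results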

Tip

If you have a NetCDF or HDF5 data file where the hazard intensities can be indexed by lat and long, or X and Y coordinates, then you are probably dealing with site-based hazard data.

OpenQuake example

RiskScape has a specific bookmark format for managing the site-based nature of OpenQuake probabilistic data. The bookmark supports different modes, depending on what data we want to extract from the HDF5 file:

  • coverage_site_ids produces a coverage that can geospatially match an element-at-risk to the closest site ID.

  • lookup_gmv_by_site efficiently looks up the ground motion value (gmv) readings for a specific site.

Note

To use OpenQuake functionality with RiskScape, the OpenQuake plugin must be configured and enabled.

This example assumes you have an OpenQuake-generated HDF5 file containing a probabilistic dataset with many different events. Make sure you have a bookmark configured for this file in your project.ini file. In this example we have called the OpenQuake bookmark oq-hdf5, e.g.

[bookmark oq-hdf5]
location = my-project/calc_420.hdf5
format = openquake
# only match sites that are within 10km of the element-at-risk
max-site-distance-metres = 10000

You will also need some sort of loss function defined in your project.ini file. This example assumes the function is called loss_function, e.g.

[function loss_function]
location = my-project/loss_function.py
argument-types = [building: anything, gmv: floating]
return-type = floating
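
For completeness, the Python file behind this function might look something like the following sketch. The damage calculation is entirely made up, and the def function(...) entry point is an assumption about RiskScape's Python function conventions, so check the RiskScape documentation on Python functions for the exact requirements.

# my-project/loss_function.py
# Illustrative sketch only: the damage model is made up, and the 'function'
# entry-point name is an assumption about RiskScape's Python function conventions.
def function(building, gmv):
    # 'building' is the element-at-risk; 'gmv' is the ground motion value for one event
    damage_ratio = min(1.0, gmv / 2.0)   # placeholder fragility: caps at total damage
    replacement_cost = 100000.0          # placeholder flat replacement cost
    return damage_ratio * replacement_cost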

Then, assuming we also have a buildings bookmark configured for our exposure-layer data, we can use the following pipeline to generate the event loss table.

# start by reading the exposure-layer dataset
input('buildings', name: 'element_at_risk')
->
# next, we find the site closest to each element_at_risk
select({
  *,
  # here, the 'coverage_site_ids' mode returns the OpenQuake bookmark's sites as a coverage.
  # We then query the coverage for the site id closest to our element_at_risk
  sample_centroid(
    geometry: element_at_risk,
    coverage: bookmark('oq-hdf5', {mode: 'coverage_site_ids'})
  ).sid as site_id
})
->
# now we group all the elements-at-risk by site into a list. Each element-at-risk
# in the list shares the same site id, and so will be exposed to the same hazard intensities
group(
  by: site_id,
  select: {
    site_id,
    to_list(element_at_risk) as elements_at_risk
  }
)
->
# fetch all the hazard intensity and event data for each site
# note that because we have grouped the elements at risk by site, we only have to
# do this lookup once per site, rather than once per element-at-risk.
# For most models, this will be far fewer lookups and thus will complete faster
select({
  *,
  lookup(
    bookmark('oq-hdf5', {mode: 'lookup_gmv_by_site'}),
    site_id
  ) as events
})
->
# unnest the list of events so we get one event and one site per row
unnest(events)
->
select({
  events as event,
  site_id,
  sum(
       # apply our loss function to every element-at-risk at the site
       map(elements_at_risk, building -> loss_function(building, events.gmv))
     ) as total_loss_at_site_per_event
})
->
# now we group by event, so we produce a total_loss per event
group(
  by: event,
  select: {
    sum(total_loss_at_site_per_event) as total_loss
  }
) as event_loss_table
->
# voila, we have our event loss table
save('event-loss', format: 'csv')

NetCDF files

You may find that NetCDF files lend themselves well to storing probabilistic hazard data, as NetCDF supports multi-dimensional data.

Typically NetCDF data has a time dimension, but for storing probabilistic data this could be an event dimension instead. For example, the NetCDF file might contain a series of lat, long coordinates, where each coordinate holds the hazard intensity measurements across all events for that particular geospatial point. In other words, such a NetCDF file would contain site-based hazard data.

Read through the How to use NetCDF files in a model tutorial to understand how RiskScape processes NetCDF data. For your probabilistic pipeline, you will likely need to replace the time dimension with an event dimension. In particular, the Scaling to large datasets section is the best example to follow for probabilistic data.
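
As a quick way to check whether a NetCDF file is organized this way, you can inspect its dimensions in Python. The sketch below uses the netCDF4 package; the file name, the event dimension, and the hazard variable name are hypothetical examples, so substitute whatever names your file actually uses.

# exploratory sketch using the netCDF4 package.
# 'my-hazard-data.nc' and the 'hazard' variable name are hypothetical examples.
from netCDF4 import Dataset

with Dataset('my-hazard-data.nc') as nc:
    # site-based probabilistic data typically has an 'event' dimension
    # in place of (or alongside) the usual 'time' dimension
    print(nc.dimensions)

    # in a site-based layout the hazard variable is indexed by event and
    # by the site coordinates, e.g. hazard[event, lat, lon]
    print(nc.variables['hazard'].dimensions)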