.. _probabilistic_site_based_hazard:

# Site-based hazard data

Some probabilistic datasets organize the hazard intensity values around geographic locations (or sites), where the same sites are used across all events. One example is the OpenQuake HDF5 files for modelling earthquake ruptures. The shaking measurements (i.e. PGA values) are recorded across a mesh of _sites_, which are fixed geospatial points, each with a unique _site ID_ (or `sid`). The OpenQuake `events` dataset is indexed by `sid` so that the hazard intensities across _all_ events can be efficiently looked up for a single site.

This site-based approach to organizing the hazard data is useful because geospatial matching is an expensive part of the model processing. With site-based hazard data, you can spatially match elements-at-risk to their nearest site _once_, and then look up _all_ the event data for that site in a single operation. This leads to more efficient data processing, particularly when you are dealing with large probabilistic datasets that have thousands, or even millions, of events. In some cases, such as with OpenQuake, a site-based approach is the _only_ efficient way to look up the hazard intensity values.

.. tip::
   If you have a NetCDF or HDF5 data file where the hazard intensities can be indexed by ``lat`` and ``long``, or ``X`` and ``Y`` coordinates, then you are probably dealing with site-based hazard data.

## OpenQuake example

RiskScape has a specific bookmark format for managing the site-based nature of OpenQuake probabilistic data. The bookmark supports different modes, depending on what data we want to extract from the HDF5 file:

- `coverage_site_ids` produces a coverage that can geospatially match an element-at-risk to the closest site ID.
- `lookup_gmv_by_site` efficiently looks up the ground motion value (`gmv`) readings for a specific site.

.. note::
   To use OpenQuake functionality with RiskScape, the OpenQuake :ref:`plugin ` must be configured and enabled.

This example assumes you have an OpenQuake-generated `hdf5` file containing a probabilistic dataset with many different events. Make sure you have a bookmark configured for this file in your `project.ini` file. We have called the OpenQuake bookmark in this example `oq-hdf5`, e.g.

```ini
[bookmark oq-hdf5]
location = my-project/calc_420.hdf5
format = openquake
# only match sites that are within 10km of the element-at-risk
max-site-distance-metres = 10000
```

You will also need some sort of loss function present in your `project.ini` file. This example will assume the function is called `loss_function`, e.g.

```ini
[function loss_function]
location = my-project/loss_function.py
argument-types = [building: anything, gmv: floating]
return-type = floating
```
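The contents of `loss_function.py` are up to you. As a rough guide only, the sketch below shows one possible shape for such a function. The entry-point name, the way the `building` struct is accessed, and the vulnerability curve are all assumptions here, so check the RiskScape documentation on Python functions for the exact conventions your version expects.

```python
# my-project/loss_function.py -- illustrative sketch only, not a real vulnerability model
def function(building, gmv):
    """Return a loss estimate for one building exposed to ground motion value `gmv`."""
    # hypothetical damage ratio: no damage below 0.1g, total loss at 1.0g and above
    damage_ratio = min(max((gmv - 0.1) / 0.9, 0.0), 1.0)
    # 'value' is an assumed replacement-cost attribute on the building struct
    return damage_ratio * building["value"]
```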
Then, assuming we also have a `buildings` bookmark configured for our exposure-layer data, we can use the following pipeline to generate the event loss table.

```
# start by reading the exposure-layer dataset
input('buildings', name: 'element_at_risk')
->
# next, we find the site closest to each element_at_risk
select({
    *,
    # the 'coverage_site_ids' mode returns the OpenQuake bookmark's sites as a coverage.
    # We then query the coverage for the site ID closest to our element_at_risk
    sample_centroid(
        geometry: element_at_risk,
        coverage: bookmark('oq-hdf5', {mode: 'coverage_site_ids'})
    ).sid as site_id
})
->
# now we group all the elements-at-risk by site into a list. Each element-at-risk
# in the list shares the same site ID, and so will be exposed to the same hazard intensities
group(
    by: site_id,
    select: {
        site_id,
        to_list(element_at_risk) as elements_at_risk
    }
)
->
# fetch all the hazard intensity and event data for each site.
# Note that because we have grouped the elements-at-risk by site, we only have to
# do this lookup once per site, rather than once per element-at-risk.
# For most models, this will be far fewer lookups and thus will complete faster
select({
    *,
    lookup(
        bookmark('oq-hdf5', {mode: 'lookup_gmv_by_site'}),
        site_id
    ) as events
})
->
# unnest the list of events so we get one event and one site per row
unnest(events)
->
select({
    events as event,
    site_id,
    sum(
        # apply our loss function to every element-at-risk at the site
        map(elements_at_risk, building -> loss_function(building, events.gmv))
    ) as total_loss_at_site_per_event
})
->
# now we group by event, so we produce a total_loss per event
group(
    by: event,
    select: {
        sum(total_loss_at_site_per_event) as total_loss
    }
) as event_loss_table
->
# voila, we have our event loss table
save('event-loss', format: 'csv')
```

## NetCDF files

You may find that NetCDF files lend themselves well to storing probabilistic hazard data, as NetCDF supports multi-dimensional data. Typically NetCDF data has a _time_ dimension, but for probabilistic data this could be an _event_ dimension instead. For example, the NetCDF file might contain a series of `lat`, `long` coordinates, where each coordinate holds the hazard intensity measurements across _all_ events for that particular geospatial point. In other words, such a NetCDF file would contain site-based hazard data.

Read through the :ref:`netcdf` tutorial to understand how RiskScape processes NetCDF data. For your probabilistic pipeline, you will likely need to replace the `time` dimension with an `event` dimension. In particular, the :ref:`scalable_netcdf_pipeline` is the best example to follow for probabilistic data.
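If you are unsure whether a given NetCDF file is organized this way, a quick look at its dimensions will usually tell you. The sketch below uses the `xarray` Python library to do this; the file name and the `event` dimension name are assumptions that will vary with your data.

```python
import xarray as xr

# Open the NetCDF file lazily and print a summary of its structure.
# If the hazard intensity variable is indexed by an event dimension plus
# lat/long coordinates (rather than a time dimension), then the file
# contains site-based probabilistic hazard data.
ds = xr.open_dataset("my-project/hazard.nc")  # hypothetical file name

print(ds)        # dimensions, coordinates, and data variables
print(ds.sizes)  # e.g. {'event': 10000, 'lat': 200, 'lon': 300}
```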