.. _openquake:

# OpenQuake data

The following page describes how to read shaking data from an HDF5 file generated by OpenQuake v3.11 or later.

An OpenQuake HDF5 file is split up into several different datasets. Of particular interest are:

- `sitecol`: The OpenQuake shaking data is organized by a mesh of sites. `sitecol` is a group of datasets that specify the lat/lon coordinates of each site present in the file. Each site has a unique ID (`sids`).
- `gmf_data`: This is a group of datasets that contains the actual shaking data (`gmv_0`). Each shaking data-point corresponds to a specific site (`sid`) and event (`eid`).
- `events`: Maps each event (`id`) to the rupture (`rup_id`) that generated it.
- `ruptures`: Contains information about the source fault that generated the event, such as the number of occurrences (`n_occ`) or the annual probability of occurrence (`occurrence_rate`).

.. note::
    In order to read an OpenQuake file, you need to have the HDF5 plugin enabled. Refer to :ref:`plugins` for instructions on how to enable RiskScape plugins.

## Overview

Because the OpenQuake data is split across several datasets, we need to piece the hazard data back together in order to match it to our exposure-layer:

1. First, we need to geospatially match our exposure-layer data to the hazard-layer. To do this, we turn the `sitecol` data into a special type of coverage that lets us find the site closest to each element-at-risk.
2. We can then go through all the shaking data (`gmf_data`). This gives us the shaking intensity at each site (and now also at each element-at-risk) for every event in the HDF5 file.
3. Optionally, we can use the event ID to look up the rupture that generated it. For probabilistic modelling, this lets you access the metadata (e.g. annual probability of occurrence) associated with each event.

## Site information

To read the `sitecol` information, you need a bookmark like this:

```ini
[bookmark sitecol]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /sitecol/lat
dataset = /sitecol/lon
dataset = /sitecol/sids
set-attribute.location = create_point(lat, lon)
crs-name = EPSG:4326
```

This will return _relational_ data, which you can then turn into a :ref:`coverage` that can be more easily spatially sampled. Because the sites are a mesh of points, you will need to use a :ref:`nn_coverage` to map your exposure-layer data to the closest site.

For example, to turn the `sitecol` bookmark into a nearest neighbour coverage (with an 11km cut-off distance), you can add the following step to your pipeline:

```none
select({*, to_coverage(bookmark('sitecol'),
                       options: { index: 'nearest_neighbour',
                                  nearest_neighbour_max_distance: 11000 }) as site_coverage })
```

You can then sample this coverage to find the closest site to your element-at-risk, e.g.

```none
select({*, sample_centroid(exposure, site_coverage) as site }) as sample_sitecol
```

### Shaking data

To read the shaking data, you need a bookmark like this:

```ini
[bookmark gmf_data]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /gmf_data/eid
dataset = /gmf_data/sid
dataset = /gmf_data/gmv_0
```

.. tip::
    As you may have *a lot* of shaking data, it can be useful to add ``filter = gmv_0 > THRESHOLD`` to your bookmark, so that you exclude any shaking data below a certain minimum threshold from your model. Just replace ``THRESHOLD`` with a number you are comfortable with, e.g. ``0.01``.
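For example, the same bookmark with an illustrative threshold of `0.01` applied would look like this (the threshold value is only a placeholder; pick one that suits your own model):

```ini
[bookmark gmf_data]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /gmf_data/eid
dataset = /gmf_data/sid
dataset = /gmf_data/gmv_0
# exclude negligible shaking values so the model processes less data
filter = gmv_0 > 0.01
```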
In the previous section, we saw how to match our exposure-layer geometry to a site in the OpenQuake data. We can then join our shaking data to our exposure-layer data based on site ID, e.g.

```none
input('gmf_data', name: 'hazard') as hazard_input
->
join(on: hazard.sid = site.sids) as join_hazard_to_exposures
```

### Optimizing the join

If your exposure-layer contains a lot of data (i.e. several hundred thousand records), then you can potentially optimize the `join` step by grouping by site ID first. For example, after the `sample_sitecol` step you could add:

```none
group(by: site, select: { to_list(exposure) as exposure, site })
```

This combines your exposure-layer data so that you have a single list for each site. Each list contains all the elements-at-risk affected by shaking at that site.

Then, after the `join_hazard_to_exposures` step, you simply 'unnest' the list again, e.g.

```none
unnest(exposure)
```

This turns each item in the list back into a separate row of data. Grouping this way can improve the join performance because the join step looks for fewer matches, i.e. it only needs to match _one_ list rather than _dozens_ of elements-at-risk.

### Events and ruptures

To read the `events` or `ruptures` datasets, you can use bookmarks like the following:

```ini
[bookmark events]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /events

[bookmark rupture]
location = PATH_TO_YOUR_FILE.hdf5
format = hdf5
dataset = /ruptures
```

Currently you would need to join this input data to your pipeline by matching on the event ID (`eid`) or rupture ID (`rup_id`), as shown in the sketch after the example pipeline below.

### Example pipeline

Here is a full example of a simple, working pipeline. It does not include any optimizations for performance.

```none
# read in data from your exposure-layer bookmark
input('exposure_layer', name: 'exposure')
->
select({*, to_coverage(bookmark('sitecol'),
                       options: { index: 'nearest_neighbour',
                                  nearest_neighbour_max_distance: 11000 }) as site_coverage })
->
# match each element-at-risk to a site in the OpenQuake data
select({ exposure, sample_centroid(exposure, site_coverage) as site }) as sample_sitecol
->
join_hazard_to_exposures.rhs

# read in the shaking data
input('gmf_data', name: 'hazard') as hazard_input
->
# match it up to the exposure-layer, based on site
join(on: hazard.sid = site.sids) as join_hazard_to_exposures

# TODO: call your Python function with exposure, hazard.gmv_0
# TODO: aggregate results, e.g. by event or by exposure
```
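For probabilistic modelling, you could extend this pipeline to attach the event metadata, using the same `input` and `join` pattern as before. The following is only a sketch: the step names are illustrative, it assumes the output of the `join_hazard_to_exposures` step feeds the right-hand side of the new join, and the `ruptures` data could be attached in the same way by matching on `rup_id`.

```none
# read in the event metadata
input('events', name: 'event') as events_input
->
# match each shaking data-point to the event that generated it
join(on: event.id = hazard.eid) as join_events

# connect the output of the earlier hazard/exposure join to this new join
join_hazard_to_exposures -> join_events.rhs
```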