Progress statistics
When RiskScape runs a model, it displays real-time progress statistics on the CLI. These numbers give you an indication of how long the model will take to run.
Example output
The progress statistics output focuses on the number of tuples (rows of data) that RiskScape processes. Example output from a wizard model might look something like this.
Progress:
995917 / 2025135 49.18% exposures_input.complete
995727 total, 20122.089/s avg: exposures.in
995727 total, 20122.091/s avg: exposures.out
995921 total, 20140.616/s avg: exposures_input.out
994823 total, 19947.583/s avg: exposures_join_areas.in
994823 total, 19947.583/s avg: exposures_join_areas.out
995727 total, 20121.692/s avg: exposures_join_hazards.in
995727 total, 20121.692/s avg: exposures_join_hazards.out
995242 total, 20058.763/s avg: sample_hazard_layer.in
995239 total, 20058.474/s avg: sample_hazard_layer.out
993367 total, 19928.925/s avg: steps-select_3~>report_event-impact-sink.in
The first line shows you how far through the input data RiskScape is (i.e. 49.18% complete for the exposure-layer). Often you can use this as a rough guide to how far through the overall model RiskScape is.
Note
The percentage complete does not include any invalid tuples or tuples that get removed by a bookmark filter. The reported progress could be inaccurate if your bookmark has a filter that removes a large number of features. You could potentially move the filter from your bookmark into your model for more accurate progress.
The remaining lines show you a breakdown of the pipeline steps that are currently in progress. Each line shows you:
The total tuples processed so far.
The avg tuples processed per second.
The name of the pipeline step(s) that are doing the work.
The names of the pipeline steps are further broken down into in and out. This is because some pipeline steps can emit more tuples than they consume (e.g. unnest and join steps), and others can emit fewer tuples (e.g. filter steps).
Note
By default, these statistics are also saved to a stats.txt file in the output directory, although viewing the statistics in real-time makes it easier to see what is happening.
Pipeline processing
In general, RiskScape tries to ‘stream’ your input data so that it is spread out through the entire pipeline.
Model processing can involve a large number of data-points, potentially more than it is practical to hold in memory all at once (this is especially true of probabilistic models). So RiskScape tries to move data from one end of the pipeline to the other, and out of memory, as quickly as it can.
In a ‘waterfall’ approach, all the input data would be read before moving on to the next step (geospatial matching), and so on. RiskScape does not do this. Instead, RiskScape will read just enough input data to keep the rest of the pipeline steps busy. When the ‘geospatial matching’ step starts to run out of data, then RiskScape will read some more input data.
The goal of this approach is to make maximal use of your CPU cores, by parallelizing work, while using your available RAM efficiently, by holding minimal data in memory at once.
What this generally means is that the percentage complete metric for the input step (i.e. exposures_input.complete) is often a good indication of the progress through the model itself.
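As a rough illustration of the difference, the Python sketch below is not RiskScape code: it simply contrasts a 'waterfall' reader, which loads every row before the next step can start, with a streaming reader that only pulls rows through as the downstream step consumes them. The file name and the spatial_match step are made up for the example.

# Illustrative only -- not RiskScape code.
def read_rows_waterfall(path):
    # Waterfall: every row is read (and held in memory) before the
    # next step can even start.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

def read_rows_streaming(path):
    # Streaming: rows are yielded one at a time, so the next step can
    # start immediately and only a handful of rows sit in memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def spatial_match(rows):
    # Stand-in for an expensive downstream step (e.g. geospatial matching).
    for row in rows:
        yield row.upper()

# The streaming pipeline reads more input only as spatial_match() asks for it.
for result in spatial_match(read_rows_streaming("exposures.csv")):
    pass  # consume the result, e.g. write it to an output file

In practice RiskScape also runs pipeline steps in parallel across CPU cores, which this single-threaded sketch does not show.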
FAQ
Why do I not see a percentage complete?
Not all input formats support a percentage complete metric, e.g. WFS input data does not.
Some formats, like Shapefile, may not report a percentage complete when a filter is configured for the bookmark.
Furthermore, some model pipelines are organized such that it is hard to report their progress. For example, you might have an exposure-layer that contains a few hundred roads, but your model then cuts these roads into small (e.g. 1-metre) pieces. In this case, RiskScape may read all the input data before it gets to the slow part of the model (the cutting).
Unfortunately, it is not practical to report a percentage complete for every step in the pipeline. Due to the nature of pipelines, most steps can easily change the total number of tuples in the model, which makes it hard to know exactly how many tuples the next step can expect to process.
Why is my model slow?
Typically, the slowest processing in a model pipeline involves geometry operations. If your model is running particularly slowly, here are some things to check.
Are you loading remote input data? E.g. your model uses WFS bookmarks or HTTP links. If so, try saving a local copy of the data to your file system and using that instead.
Are you geoprocessing your input data? E.g. cutting or buffering the input geometry. If so, you could try:
Filtering the input data before you cut or buffer it. This means you do the expensive geometry processing on less of the input data (i.e. only the data you care about).
If cutting by distance, try using a larger distance.
Avoid doing the geoprocessing every time you run the model. Instead, you could cut or buffer the geometry once, save the result, and then use that as the input layer to your models.
Do you have a large (e.g. national-level) dataset and a localized hazard? If so, you could try filtering your input data by the bounds of your hazard. Refer to the geoprocessing options in the wizard.
Do you have large polygons in your exposure-layer and a small grid distance for your hazard-layer? And are you using all-intersections or closest sampling? E.g. if you had farmland polygons and a 1-metre resolution hazard grid, then the sampling operation may still end up cutting each farm polygon into 1-metre segments. You could use centroid sampling instead. For better sampling accuracy, you could cut the polygons into a more reasonable size first (e.g. 100-metre segments), and then use centroid sampling.
Are you using shapefiles for the hazard-layer or area-layer? These can hold complex geometry, which can slow down spatial sampling operations. Here are a few things to check:
Does your area-layer contain a marine or oceanic feature that encloses the other features? If so, we recommend using bookmark Filtering to remove this feature, as it will slow down spatial matching.
Try running the riskscape bookmark info command for your shapefile bookmarks. If this command seems to hang for a minute or more, then it may indicate something is wrong with your geometry input data. You can use Ctrl+C to stop the command. Try setting validate-geometry = off in your bookmark and repeat. Refer to Invalid geometry on how to fix invalid geometry.
If you have large hazard-layer or area-layer shapes, try cutting the geometry into smaller pieces (e.g. 1km by 1km polygon segments) using the geoprocessing features in the wizard. This should not make any difference to the results, but it can mean that the geospatial matching is quicker because it is matching against smaller geometry.
Are you filtering your output results when you could be filtering your input? It's more efficient to filter out any data as early in the model as you can. So if you were filtering by region, say, you would want to filter in the geoprocessing phase of your model, rather than in the reporting phase (which is after the bulk of the processing work has been done).
If you have lots of rows of data and are aggregating it, some aggregation operations are more efficient than others. For example, count(), sum(), and mean() should be fairly efficient, whereas stddev(), percentile(), and others can consume a lot more memory (the sketch after this list illustrates why). You could try temporarily replacing aggregation functions with sum() or count() as a sanity-check, and see if performance improves. You could also try filtering out unnecessary data before you aggregate it.
By default, the memory that RiskScape can consume is capped by Java. If you have plenty of free RAM available on your system, you could try increasing this Java limit and see if RiskScape runs more efficiently. For more details on how to do this, see Java memory utilization.
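The Python sketch below (again, not RiskScape code) illustrates the memory difference mentioned in the aggregation point above: a running mean only ever holds a couple of numbers, whereas a percentile has to buffer and sort every value it is given. The loss values are made up for the example.

# Illustrative only -- not RiskScape code.
def running_mean(values):
    # Constant memory: only a running total and a count are kept.
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count if count else 0.0

def percentile(values, p):
    # All values must be buffered and sorted before anything can be returned.
    buffered = sorted(values)
    if not buffered:
        return None
    index = min(len(buffered) - 1, int(p / 100.0 * len(buffered)))
    return buffered[index]

losses = (x * 0.5 for x in range(1_000_000))  # stand-in for per-event losses
print(running_mean(losses))  # never holds the million values at once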
For advanced users who have written their own pipeline code, you could also check:
Are you running a probabilistic model with a large number of data-points? The total data-points will be your number of elements-at-risk multiplied by the number of events. If this seems like a big number, try thinking of ways to process the data more efficiently. For example, you could filter (i.e. remove) data you are not interested in, or use interpolation (refer to the create_continuous() function) so that your Python code gets called less.
Are you manually joining two different datasets or layers together, e.g. based on a common attribute? If so, make sure that the smaller dataset is on the rhs of the join step (the sketch after this list illustrates why this matters).
Are you doing any unnecessary spatial operations? E.g. if you are filtering your data, then do any geospatial matching to region after the filter, not before.
Do you have large polygons (i.e. land area) that you are sampling repeatedly against different GeoTIFFs? The repeated sampling operations will repeatedly cut your polygons, so it can be quicker to cut your polygons once up front, to match the GeoTIFF grid resolution (use the segment_by_grid() function).
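As a sketch of why the rhs matters, the Python code below shows a generic hash join, not RiskScape's actual join implementation: it loads the right-hand side into an in-memory lookup table and then streams the left-hand side past it. Assuming the join step behaves roughly like this, keeping the smaller dataset on the rhs keeps the in-memory table small. The exposure and region data are made up for the example.

# Illustrative only -- a generic hash join, not RiskScape's join step.
def hash_join(lhs_rows, rhs_rows, key):
    # Build an in-memory lookup table from the (ideally small) right-hand side.
    lookup = {}
    for row in rhs_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the (potentially huge) left-hand side past the table.
    for left in lhs_rows:
        for right in lookup.get(left[key], []):
            yield {**left, **right}

# Hypothetical data: many exposures joined to a handful of regions.
exposures = ({"id": i, "region": "R%d" % (i % 3)} for i in range(100_000))
regions = [{"region": "R%d" % r, "name": "Region %d" % r} for r in range(3)]
matched = list(hash_join(exposures, regions, key="region"))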