Quantify the percentage of data nearby

Calculates the proportion of sample (reference) points lying in the vicinity of prediction (target) points in multivariate environmental space. See the 'Details' section and King & Zeng (2007) for a technical explanation.

compute_nearby(
  samples,
  covariate.names,
  prediction.grid,
  coordinate.system,
  nearby,
  max.size = 1e+07,
  no.partitions = 10,
  resolution = NULL,
  verbose = TRUE
)

Arguments

samples	Sample (reference) dataset used for model building and calibration. This corresponds to the `segment.data` used when building density surface models in `dsm`. It must contain one column for each of the covariates in `covariate.names`.
covariate.names	Character vector. Names of the covariates of interest.
prediction.grid	Prediction data.frame. This contains both geographic coordinates (`x`, `y`) and covariate values associated with the target locations for which predictions are desired. Typically, these locations are taken as the centroids of the grid cells in a spatial prediction grid/raster. See `predict.dsm`.
coordinate.system	Projected coordinate system relevant to the study location. Can be either a character string or an object of class `CRS`.
nearby	Scalar indicating which reference data points are considered to be 'nearby' (i.e. within `nearby` geometric mean Gower's distances of) prediction points. Typically set to 1, as per Mannocci et al. (2018) and Virgili et al. (2017).
max.size	Size threshold above which computations are partitioned. The size of a problem is calculated as `prod(nrow(samples), nrow(prediction.grid))` and compared against this value. Has a default value of `1e7`. See the 'Details' section.
no.partitions	Integer. Number of desired partitions of the data (default of 10). See the 'Details' section.
resolution	Resolution of the output raster (in units relevant to `coordinate.system`). Only required if `prediction.grid` is irregular, and thus needs to be rasterised. Defaults to NULL.
verbose	Logical. Show or hide possible warnings and messages.

Value

A raster object mapping the proportion of reference data nearby each point in prediction.grid.

Details

While extrapolation is often seen as a binary concept (i.e. it either does or does not take place), it is reasonable to expect that predictions made at target points situated just outside the sampled environmental space may be more reliable than those made at points far outside it. The ExDet tool available through compute_extrapolation inherently quantifies this notion of distance from the envelope of the reference data.

However, the multivariate distribution of reference data points is often far from homogeneous. It is possible, therefore, for target points representing analogue conditions to fall within sparsely sampled regions of the reference space; or conversely, for two target points reflecting an equal degree of extrapolation to have very different amounts of reference data within their vicinity.

The notion of neighbourhood (or percentage/proportion of data nearby, %N) captures this idea, and provides an additional measure of the reliability of extrapolations in multivariate environmental space (Virgili et al. 2017; Mannocci et al. 2018). In practice, %N for any target point can be defined as the proportion of reference data within a radius of one geometric mean Gower’s distance (G^2), calculated between all pairs of reference points (King and Zeng 2007). The Gower’s distance between two points i and j defined along the axes of K covariates is calculated as the average absolute distance between the values of these two points in each dimension, divided by the range of the data, such that:

\(G_{ij}^2=\frac{1}{K}\sum_{k=1}^{K}\frac{\left|x_{ik}-x_{jk}\right|}{\textrm{max}(X_k)-\textrm{min}(X_k)}\)
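As an illustration of the formula above, the following base R sketch computes Gower's distances between every target point and every reference point. The helper name `gower_dist` and the toy data are hypothetical and not part of dsmextra; the sketch also assumes that covariate ranges are taken from the reference data only and that no covariate is constant (a zero range would cause a division by zero).

```r
# Hypothetical helper (not part of dsmextra): Gower's distance between
# each row of `target` and each row of `reference`.
gower_dist <- function(target, reference) {
  target <- as.matrix(target)
  reference <- as.matrix(reference)
  stopifnot(ncol(target) == ncol(reference))
  # Range of each covariate, max(X_k) - min(X_k).
  # Assumption: ranges are computed from the reference data only.
  rng <- apply(reference, 2, max) - apply(reference, 2, min)
  d <- matrix(0, nrow(target), nrow(reference))
  for (k in seq_len(ncol(reference))) {
    # Range-scaled absolute difference along covariate k, summed over k
    d <- d + abs(outer(target[, k], reference[, k], "-")) / rng[k]
  }
  # Average over the K covariates
  d / ncol(reference)
}

# Toy data: three reference points and one target point, two covariates
ref <- cbind(depth = c(0, 50, 100), sst = c(10, 15, 20))
tgt <- cbind(depth = 50, sst = 20)
gower_dist(tgt, ref)
```

Each row of the result corresponds to a target point and each column to a reference point.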

The compute_nearby function is adapted from the code given in Mannocci et al. (2018) and allows the calculation of Gower’s distances as a basis for defining the neighbourhood.
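The definition of %N given above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used by compute_nearby or WhatIf: the helper names `gower` and `prop_nearby` are hypothetical, covariate ranges are taken from the reference data, and the neighbourhood radius is assumed to be the plain average of Gower's distances over all distinct pairs of reference points, multiplied by `nearby`. The exact averaging applied internally may differ.

```r
# Hypothetical helper: mean over covariates of range-scaled absolute differences
gower <- function(a, b, rng) {
  d <- matrix(0, nrow(a), nrow(b))
  for (k in seq_len(ncol(a))) {
    d <- d + abs(outer(a[, k], b[, k], "-")) / rng[k]
  }
  d / ncol(a)
}

# Hypothetical helper: proportion of reference points near each target point
prop_nearby <- function(target, reference, nearby = 1) {
  target <- as.matrix(target)
  reference <- as.matrix(reference)
  rng <- apply(reference, 2, max) - apply(reference, 2, min)
  # Distances between all pairs of reference points
  d_rr <- gower(reference, reference, rng)
  # Assumed radius: average distance over distinct pairs, scaled by `nearby`
  radius <- nearby * mean(d_rr[upper.tri(d_rr)])
  # Distances from each target point (rows) to each reference point (columns)
  d_tr <- gower(target, reference, rng)
  rowMeans(d_tr <= radius)
}

ref <- cbind(depth = c(0, 50, 100), sst = c(10, 15, 20))
# First target sits in the middle of the reference data, second lies far outside
tgt <- cbind(depth = c(50, 300), sst = c(15, 40))
prop_nearby(tgt, ref)
```

In this toy case the first target point has all reference points within the radius, whereas the second has none.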

The whatif function from the WhatIf package (Gandrud et al. 2017), which compute_nearby calls internally, may not run on very large datasets. Running calculations on partitions of the data may circumvent this problem and lead to speed gains. Two arguments can be used to do this:

`max.size`	Threshold above which partitioning will be triggered
`no.partitions`	Number of required partitions

In practice, a run of compute_nearby begins with a quick assessment of the dimensions of the input data, i.e. the reference and target data.frames. If the product of their dimensions (i.e. number of samples multiplied by number of prediction grid cells) exceeds the value set for max.size, then no.partitions subsets of the data will be created and the computations run on each using map functions from the purrr package (Henry and Wickham 2019). This means that a smaller max.size will trigger partitioning on correspondingly smaller datasets. By default, max.size is set to 1e7. This value was chosen arbitrarily, and should be sufficiently large as to obviate the need for partitioning on most datasets.
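The partitioning rule described above can be sketched as follows. This is a simplified illustration, not the package's implementation: the helper `run_in_partitions` and the toy function `count_close` are hypothetical, the sketch assumes that only the prediction grid is split (into contiguous chunks of rows), and base `lapply()` stands in for the purrr map functions.

```r
# Hypothetical helper illustrating the max.size / no.partitions logic
run_in_partitions <- function(samples, prediction.grid, fun,
                              max.size = 1e7, no.partitions = 10) {
  # Problem size: number of samples times number of prediction grid cells
  size <- prod(nrow(samples), nrow(prediction.grid))
  # Below the threshold, run everything in a single pass
  if (size <= max.size) return(fun(samples, prediction.grid))
  # Assumption: only the prediction grid is partitioned, in row order
  n <- nrow(prediction.grid)
  chunk.id <- ceiling(seq_len(n) * no.partitions / n)
  chunks <- split(prediction.grid, chunk.id)
  # Run the computation on each chunk and stitch the results back together
  unlist(lapply(chunks, function(chunk) fun(samples, chunk)),
         use.names = FALSE)
}

# Toy computation: number of samples within 3 units of each grid value
count_close <- function(s, g) {
  sapply(g$SST, function(v) sum(abs(s$SST - v) <= 3))
}

samples <- data.frame(SST = c(10, 15, 20))
grid <- data.frame(SST = seq(8, 26, by = 2))

# 3 x 10 = 30 falls below the default max.size: no partitioning
whole <- run_in_partitions(samples, grid, count_close)
# Lowering max.size to 20 triggers partitioning into 3 chunks
parts <- run_in_partitions(samples, grid, count_close,
                           max.size = 20, no.partitions = 3)
identical(whole, parts)
```

Because each target point is processed independently of the others, the partitioned run should return the same values as the single-pass run, only computed in smaller pieces.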

References

Bouchet PJ, Miller DL, Roberts JJ, Mannocci L, Harris CM and Thomas L (2019). From here and now to there and then: Practical recommendations for extrapolating cetacean density surface models to novel conditions. CREEM Technical Report 2019-01, 59 p. https://research-repository.st-andrews.ac.uk/handle/10023/18509

Gandrud C, King G, Stoll H, Zeng L (2017). WhatIf: Evaluate Counterfactuals. R package version 1.5-9. https://CRAN.R-project.org/package=WhatIf.

Henry L, Wickham H (2019). purrr: Functional Programming Tools. R package version 0.3.2. https://CRAN.R-project.org/package=purrr.

King G, Zeng L (2007). When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly 51, 183–210. DOI: 10.1111/j.1468-2478.2007.00445.x

Mannocci L, Roberts JJ, Halpin PN, Authier M, Boisseau O, Bradai MN, Cañadas A, Chicote C, David L, Di-Méglio N, Fortuna CM, Frantzis A, Gazo M, Genov T, Hammond PS, Holcer D, Kaschner K, Kerem D, Lauriano G, Lewis T, Notarbartolo Di Sciara G, Panigada S, Raga JA, Scheinin A, Ridoux V, Vella A, Vella J (2018). Assessing cetacean surveys throughout the Mediterranean Sea: A gap analysis in environmental space. Scientific Reports 8, art3126. DOI: 10.5061/dryad.4pd33.

Virgili A, Racine M, Authier M, Monestiez P, Ridoux V (2017). Comparison of habitat models for scarcely detected species. Ecological Modelling 346, 88–98. DOI: 10.1016/j.ecolmodel.2016.12.013.

Examples

library(dsmextra)

# Load the Mid-Atlantic sperm whale data (see ?spermwhales)
data(spermwhales)

# Extract the data
segs <- spermwhales$segs
predgrid <- spermwhales$predgrid

# Define relevant coordinate system
my_crs <- sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0
 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")

# Assess the percentage of data nearby
spermw.nearby <- compute_nearby(samples = segs,
                               prediction.grid = predgrid,
                               coordinate.system = my_crs,
                               covariate.names = c("Depth", "DistToCAS", "SST", "EKE", "NPP"),
                               nearby = 1)
#> Preprocessing data ...
#> Calculating distances ....
#> Calculating the geometric variance...
#> Calculating cumulative frequencies ...
#> Finishing up ...
#> Done!