compute_nearby.Rd
Calculates the proportion of sample (reference) points lying in the vicinity of prediction (target) points in multivariate environmental space. See the 'Details' section and King & Zeng (2007) for a technical explanation.
compute_nearby(samples, covariate.names, prediction.grid, coordinate.system, nearby, max.size = 1e+07, no.partitions = 10, resolution = NULL, verbose = TRUE)
samples | Sample (reference) dataset used for model building and calibration. This corresponds to the |
---|---|
covariate.names | Character vector. Names of the covariates of interest. |
prediction.grid | Prediction data.frame. This contains both geographic coordinates ( |
coordinate.system | Projected coordinate system relevant to the study location. Can be either a character string or an object of class |
nearby | Scalar indicating which reference data points are considered to be 'nearby' (i.e. within ‘nearby’ geometric mean Gower's distances of) prediction points. Defaults to 1, as per Mannocci et al. (2018) and Virgili et al. (2017). |
max.size | Minimum size threshold for partitioning computations. Calculated as |
no.partitions | Integer. Number of desired partitions of the data (default of 10). See the 'Details' section. |
resolution | Resolution of the output raster (in units relevant to |
verbose | Logical. Show or hide possible warnings and messages. |
A raster object mapping the proportion of reference data nearby each point in prediction.grid.
While extrapolation is often seen as a binary concept (i.e. it either does or does not take place), it is reasonable to expect that predictions made at target points situated just outside the sampled environmental space may be more reliable than those made at points far outside it. The ExDet tool available through compute_extrapolation inherently quantifies this notion of distance from the envelope of the reference data.
However, the multivariate distribution of reference data points is often far from homogeneous. It is possible, therefore, for target points representing analogue conditions to fall within sparsely sampled regions of the reference space; or conversely, for two target points reflecting an equal degree of extrapolation to have very different amounts of reference data within their vicinity.
The notion of neighbourhood (or percentage/proportion of data nearby, %N) captures this idea, and provides an additional measure of the reliability of extrapolations in multivariate environmental space (Virgili et al. 2017; Mannocci et al. 2018). In practice, %N for any target point can be defined as the proportion of reference data within a radius of one geometric mean Gower’s distance (G^2), calculated between all pairs of reference points (King and Zeng 2007). The Gower’s distance between two points i and j defined along the axes of K covariates is calculated as the average absolute distance between the values of these two points in each dimension, divided by the range of the data, such that:
\(G_{ij}^2=\frac{1}{K}\sum_{k=1}^{K}\frac{\left|x_{ik}-x_{jk}\right|}{\textrm{max}(X_k)-\textrm{min}(X_k)}\)
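The formula above can be illustrated with a minimal base R sketch. The helper gower_distance and the toy covariate matrix X are hypothetical and are not part of dsmextra or WhatIf; the sketch only shows how a single pairwise distance follows from the definition, not how the package computes it internally.

```r
# Hypothetical helper (not part of dsmextra): Gower's distance between
# rows i and j of a numeric covariate matrix X, following the formula above.
gower_distance <- function(X, i, j) {
  # Range of each covariate (column) across the data
  ranges <- apply(X, 2, max) - apply(X, 2, min)
  # Average of the range-standardised absolute differences
  mean(abs(X[i, ] - X[j, ]) / ranges)
}

# Toy data: three points described by two made-up covariates
X <- cbind(depth = c(100, 300, 500), sst = c(10, 15, 20))

g12 <- gower_distance(X, 1, 2)  # points halfway apart on both axes
g13 <- gower_distance(X, 1, 3)  # points at opposite ends of both axes
```

Because each term is divided by the covariate's range, distances computed among the points that define the ranges lie between 0 and 1. A covariate with zero range would lead to a division by zero, which this sketch does not handle.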
The compute_nearby function is adapted from the code given in Mannocci et al. (2018) and allows the calculation of Gower’s distances as a basis for defining the neighbourhood.
The whatif function from the WhatIf package (Gandrud et al. 2017), which compute_nearby calls internally, may not run on very large datasets. Running the calculations on partitions of the data may circumvent this problem and lead to speed gains. Two arguments control this behaviour:
max.size | Threshold above which partitioning will be triggered |
no.partitions | Number of required partitions |
In practice, a run of compute_nearby begins with a quick assessment of the dimensions of the input data, i.e. the reference and target data.frames. If the product of their dimensions (i.e. number of samples multiplied by number of prediction grid cells) exceeds the value set for max.size, then no.partitions subsets of the data will be created and the computations run on each using map functions from the purrr package (Henry and Wickham 2019). This means that a smaller max.size will trigger partitioning on correspondingly smaller datasets. By default, max.size is set to 1e7. This value was chosen arbitrarily, and should be sufficiently large as to obviate the need for partitioning on most datasets.
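The partitioning rule described above can be sketched in base R as follows. The helpers needs_partitioning and partition_rows are hypothetical, the dataset sizes are made up, and the sketch assumes that it is the rows of the prediction grid that get split; the actual function relies on purrr map functions and may organise the subsets differently.

```r
# Hypothetical helper: would partitioning be triggered for these data sizes?
needs_partitioning <- function(n.samples, n.cells, max.size = 1e7) {
  n.samples * n.cells > max.size
}

# Hypothetical helper: split row indices into contiguous partitions.
# Assumption: the prediction grid rows are what gets partitioned.
partition_rows <- function(n.cells, no.partitions = 10) {
  groups <- ceiling(seq_len(n.cells) * no.partitions / n.cells)
  split(seq_len(n.cells), groups)
}

# Made-up sizes: 5000 samples and 3000 grid cells give a product of 1.5e7
big <- needs_partitioning(n.samples = 5000, n.cells = 3000)
# 500 samples and 3000 grid cells give a product of 1.5e6
small <- needs_partitioning(n.samples = 500, n.cells = 3000)

# Row indices of a 3000-cell grid split into 10 partitions
parts <- partition_rows(n.cells = 3000, no.partitions = 10)
```

With the default max.size, only the first of the two made-up cases exceeds the threshold. Lowering max.size (e.g. to 1e6) would trigger partitioning for the second case as well.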
Bouchet PJ, Miller DL, Roberts JJ, Mannocci L, Harris CM and Thomas L (2019). From here and now to there and then: Practical recommendations for extrapolating cetacean density surface models to novel conditions. CREEM Technical Report 2019-01, 59 p. https://research-repository.st-andrews.ac.uk/handle/10023/18509
Gandrud C, King G, Stoll H, Zeng L (2017). WhatIf: Evaluate Counterfactuals. R package version 1.5-9. https://CRAN.R-project.org/package=WhatIf.
Henry L, Wickham H (2019). purrr: Functional Programming Tools. R package version 0.3.2. https://CRAN.R-project.org/package=purrr.
King G, Zeng L (2007). When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly 51, 183–210. DOI: 10.1111/j.1468-2478.2007.00445.x
Mannocci L, Roberts JJ, Halpin PN, Authier M, Boisseau O, Bradai MN, Cañadas A, Chicote C, David L, Di-Méglio N, Fortuna CM, Frantzis A, Gazo M, Genov T, Hammond PS, Holcer D, Kaschner K, Kerem D, Lauriano G, Lewis T, Notarbartolo Di Sciara G, Panigada S, Raga JA, Scheinin A, Ridoux V, Vella A, Vella J (2018). Assessing cetacean surveys throughout the Mediterranean Sea: A gap analysis in environmental space. Scientific Reports 8, art3126. DOI: 10.5061/dryad.4pd33.
Virgili A, Racine M, Authier M, Monestiez P, Ridoux V (2017). Comparison of habitat models for scarcely detected species. Ecological Modelling 346, 88–98. DOI: 10.1016/j.ecolmodel.2016.12.013.
library(dsmextra)

# Load the Mid-Atlantic sperm whale data (see ?spermwhales)
data(spermwhales)

# Extract the data
segs <- spermwhales$segs
predgrid <- spermwhales$predgrid

# Define relevant coordinate system
my_crs <- sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")

# Assess the percentage of data nearby
spermw.nearby <- compute_nearby(samples = segs,
                                prediction.grid = predgrid,
                                coordinate.system = my_crs,
                                covariate.names = c("Depth", "DistToCAS", "SST", "EKE", "NPP"),
                                nearby = 1)