Introduction

This vignette highlights the application of bespoke R functions designed to quantify and visualise model extrapolation in environmental space, with particular relevance to density surface model (DSMs) as implemented in package dsm (Miller et al. 2015). The underlying theory behind extrapolation detection is covered at length in Mesgaran et al. (2014) and King and Zeng (2007). Illustrative case studies rely on a combined dataset of shipborne and aerial line transect surveys for cetaceans conducted in the Northwestern Atlantic and the Gulf of Mexico, and fully described in Roberts et al. (2016) and Mannocci et al. (2017) (Figure 1). A basic level of familiarity with density surface modelling is assumed - for technical details, see Miller et al. (2013).

Note that this vignette accompanies a technical report on extrapolation in cetacean density surface models. See Bouchet et al. (2019) for more information.

<b>Fig. 1:</b> Visual line transect surveys undertaken in the North Atlantic and Gulf of Mexico. <u>Source</u>: @Roberts2016.

Fig. 1: Visual line transect surveys undertaken in the North Atlantic and Gulf of Mexico. Source: Roberts et al. (2016).

Preamble

A number of R packages are required in order to run this vignette. The code below both installs any missing packages (if not already present on your system), and loads them (and their dependencies, as required), in addition to setting some general options. Note that the pacman package must first be installed

#'--------------------------------------------------------------------
# Load (+ install, if needed) all packages
#'--------------------------------------------------------------------

pacman::p_load(tidyverse,          # Tidyverse
               tools,              # Tools in base R
               plyr,               # The split-apply-combine paradigm for R
               raster,             # Raster and GIS operations
               smoothr,            # Smooth and Tidy Spatial Features
               utils,              # R utility functions
               htmltools,          # HTML things
               leaflet,            # Interactive maps
               maps,               # Basic maps
               cowplot,            # Streamlined theme + annotations for 'ggplot2'
               knitr,              # Dynamic report generation
               dichromat,          # Colour palettes 
               # http://stat545.com/block018_colors.html
               pals,               # Colour palettes
               WhatIf)             # Counterfactuals - Gower's distances

#'--------------------------------------------------------------------
# Set tibble options
#'--------------------------------------------------------------------

options(tibble.width = Inf) # All tibble columns shown
options(pillar.neg = FALSE) # No colouring negative numbers
options(pillar.subtle = TRUE)
options(pillar.sigfig = 4)

#'--------------------------------------------------------------------
# Set knitr options
#'--------------------------------------------------------------------

knitr::opts_chunk$set(echo = TRUE)

All the code needed for the analysis has been pre-packaged in the ExDet-functions.R R file. It builds upon the functions available in the ecospat (Broennimann et al. 2016) and WhatIf (Gandrud et al. 2017) packages to provide user-friendly tools relevant to the analysis of distance sampling data. Specifically, the functions enable an a priori evaluation of environmental extrapolation, when using a DSM model built in a reference system to make predictions in a separate target system (eg. a different geographical area and/or a past/future time period) (Sequeira et al. 2018). A description of each individual function is available in the Appendix.

#'---------------------------------------------
# Import required functions 
#'---------------------------------------------

source(file.path("../ExDet_functions.R"))

Example 1: Sperm whales

Overview

This section covers all the steps required to undertake an extrapolation assessment, and explains how to:

  • Quantify extrapolation using compute_extrapolation.
  • Obtain a summary of extrapolation using summarise_extrapolation.
  • Identify covariates responsible for extrapolation using compare_covariates.
  • Calculate a neighbourhood metric of extrapolation using compute_nearby.
  • Generate maps of extrapolation using map_extrapolation.
  • Run a full assessment using analyse_extrapolation.

Data

The data for this example come from two shipboard surveys run by the National Oceanic and Atmospheric Administration (NOAA)’s Northeast and Southeast Fisheries Science Centers. They represent a subset of the survey effort shown in Roberts et al. (2016) and Mannocci et al. (2017), and are stored in an .RData file called sperm_whale.RData.

#'---------------------------------------------
# Import the sperm whale data
#'---------------------------------------------

load("/Users/philippebouchet/Dropbox/bouchet-extrapolation/data/Models/AFTT_Sperm_whales/sperm_whale.RData")

We are primarily interested in the following two data.frames:

segs holds the segment data, ie. the line transects that have been ‘chopped’ into segments. These are referred to as the reference (or calibration) data/system/conditions (Sequeira et al. 2018).

knitr::kable(head(segs), format = "pandoc")
Sample.Label CenterTime SegmentID Depth DistToCAS SST EKE NPP x y Effort Survey
1 2004/06/24 07:27:04 1 118.5027 14468.1533 15.54390 0.0014443 1908.129 214544.0 689074.3 10288.91 en04395
2 2004/06/24 08:08:04 2 119.4853 10262.9648 15.88358 0.0014198 1889.540 222654.3 682781.0 10288.91 en04395
3 2004/06/24 09:03:18 3 177.2779 6900.9829 16.21920 0.0011705 1842.057 230279.9 675473.3 10288.91 en04395
4 2004/06/24 09:51:27 4 527.9562 1055.4124 16.45468 0.0004102 1823.942 239328.9 666646.3 10288.91 en04395
5 2004/06/24 10:25:39 5 602.6378 1112.6293 16.62554 0.0002553 1721.949 246686.5 659459.2 10288.91 en04395
6 2004/06/24 11:00:22 6 1094.4402 707.5795 16.83725 0.0006556 1400.281 254307.0 652547.2 10288.91 en04395

predgrid holds the coordinates and covariate values for grid cells over which density surface model predictions are ultimately sought (although model fitting is outside the scope of this vignette). These are referred to as the target data/system/conditions (Sequeira et al. 2018).

knitr::kable(head(predgrid), format = "pandoc")
x y Depth SST NPP DistToCAS EKE off.set LinkID
126 547984.6 788254 153.5983 12.0461 1462.521 11788.974 0.0008 1e+08 1
127 557984.6 788254 552.3107 12.8138 1465.410 5697.248 0.0010 1e+08 2
258 527984.6 778254 96.8199 12.9025 1429.432 13722.626 0.0012 1e+08 3
259 537984.6 778254 138.2376 13.2139 1424.862 9720.671 0.0013 1e+08 4
260 547984.6 778254 505.1439 13.7566 1379.351 8018.690 0.0027 1e+08 5
261 557984.6 778254 1317.5952 14.4252 1348.544 3775.462 0.0046 1e+08 6

Further, we also have a record of individual sperm whale sightings, as well as a spatial representation of survey track lines (Figure 2). These are stored in the data.frame object obsand the SpatialLinesDataFrame object transects, respectively, and will be useful for plotting.

knitr::kable(head(obs), format = "pandoc")
Survey SeaState SightingTime coords.x1 coords.x2 distance object Sample.Label size detected observer
en04395 3.0 2004/06/28 10:22:21 -65.636 39.576 246.0173 1 48 2 1 1
en04395 2.5 2004/06/28 13:18:14 -65.648 39.746 1632.3934 2 50 2 1 1
en04395 3.0 2004/06/28 14:13:34 -65.692 39.843 2368.9941 3 51 1 1 1
en04395 3.5 2004/06/28 15:06:01 -65.717 39.967 244.6977 4 52 1 1 1
en04395 4.0 2004/06/29 10:48:31 -65.820 40.279 2081.3468 5 56 1 1 1
en04395 2.4 2004/06/29 14:35:34 -65.938 40.612 1149.2632 6 59 1 1 1

A quick map in base R is useful to get a feel for the distribution of sightings throughout the area.

#'---------------------------------------------
# Create an outline of the study area boundaries
#'---------------------------------------------

study_area <- predgrid[, c("x", "y")]
study_area$value <- 1
study_area <- raster::rasterFromXYZ(study_area, crs = sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"))
study_area <- raster::rasterToPolygons(study_area, dissolve = TRUE)
study_area <- sp::spTransform(study_area, CRSobj = sp::proj4string(transects))
study_area <- smoothr::smooth(study_area, method = "ksmooth", smoothness = 5)

#'---------------------------------------------
# Produce a simple plot
#'---------------------------------------------

plot(study_area, border = "darkorange") # Region boundary
plot(transects, add = TRUE) # Survey tracks
maps::map("world", fill = TRUE, col = "grey", 
          xlim = range(obs$coords.x1), 
          ylim = range(obs$coords.x2), add = TRUE)
pts <- obs # Sightings
coordinates(pts) <- ~ coords.x1 + coords.x2
axis(1)
axis(2)
box()
points(pts, pch = 21, col = "#2082C4")
<b>Fig. 2:</b> Distribution of sperm whale (<em>Physeter macrocephalus</em>) sightings (in blue) within the study area (orange outline) off the US East coast. Surveyed transects are shown as solid lines.

Fig. 2: Distribution of sperm whale (Physeter macrocephalus) sightings (in blue) within the study area (orange outline) off the US East coast. Surveyed transects are shown as solid lines.

Analysis

In this particular example, we already know what projected coordinate system is in use (ie. Albers Equal Area, +proj=aea). The explanatory covariates of interest are: seabed depth (Depth), sea surface temperature (SST), primary productivity (NPP), distance to the nearest canyons and seamounts (DistToCAS), and eddy kinetic energy (EKE) – see Roberts et al. (2016) for details.

#'---------------------------------------------
# Define projected coordinate system
#'---------------------------------------------

aftt_crs <- sp::CRS("+proj=aea +lat_1=38 +lat_2=30 +lat_0=34 +lon_0=-73 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0")

#'---------------------------------------------
# Define environmental covariates of interest
#'---------------------------------------------

covariates.spermw <- c("Depth", "SST", "NPP", "DistToCAS", "EKE")

Quantifying extrapolation

To assess extrapolation in environmental space, we can run the extrapolation detection (ExDet) tool proposed by Mesgaran et al. (2014) using compute_extrapolation.

Note that this tool was originally implemented in the ecospat package, although with more limited capabilities.

The function takes the following arguments:

  • segments: reference data
  • covariate.names: names of covariates of interest
  • prediction.grid: target data
  • coordinate.system: projected coordinate system relevant to study location
  • print.summary: if TRUE, prints a summary of the results in the R console
  • print.precision: number of significant figures printed in the summary
  • save.summary: if TRUE, saves summary to the function’s output object

Here, we store the results in an object called spermw.extrapolation. Values are expressed as both the number (n) of grid cells subject to extrapolation, and the proportion (%) of the target system that this represents.

spermw.extrapolation <- compute_extrapolation(segments = segs, 
                                   covariate.names = covariates.spermw, 
                                   prediction.grid = predgrid,
                                   coordinate.system = aftt_crs,
                                   print.summary = TRUE,
                                   save.summary = FALSE,
                                   print.precision = 2)
## 
## 
## Table: Extrapolation
## 
## --------------  ------------  ------------
## Analogue        n = 5,175     97.92 %     
## -----------     -----------   ----------- 
## Univariate      n = 65        1.23 %      
## Combinatorial   n = 45        0.85 %      
## -----------     -----------   ----------- 
## Sub-total       n = 110       2.08 %      
## -----------     -----------   ----------- 
## Total           n = 5,285     100 %       
## --------------  ------------  ------------
## 
## 
## Table: Most influential covariates (MIC)
## 
## --------------  ------------  ------------  ------------
## Univariate      EKE           n = 24        0.45 %      
## Univariate      SST           n = 17        0.32 %      
## Univariate      Depth         n = 16        0.3 %       
## Univariate      NPP           n =  6        0.11 %      
## Univariate      DistToCAS     n =  2        0.04 %      
## -----------     -----------   -----------   ----------- 
## Sub-total                     n = 65        1.23 %      
## -----------     -----------   -----------   ----------- 
## Combinatorial   NPP           n = 39        0.74 %      
## Combinatorial   Depth         n =  6        0.11 %      
## -----------     -----------   -----------   ----------- 
## Sub-total                     n = 45        0.85 %      
## -----------     -----------   -----------   ----------- 
## Total                         n = 110       2.08 %      
## --------------  ------------  ------------  ------------

In this example, the vast majority of prediction grid cells (n = 5,175 – ie. 97.92%) are characterised by environmental conditions (defined along the axes of the covariates specified above) similar to those captured in the original sample (segs).

A quick check reveals that those results align with our expectations from the data in predgrid.

# Number of cells subject to univariate extrapolation (see below for definition)

predgrid %>% 
  dplyr::filter(!dplyr::between(Depth, min(segs$Depth), max(segs$Depth)) | 
                  !dplyr::between(SST, min(segs$SST), max(segs$SST)) |
                  !dplyr::between(NPP, min(segs$NPP), max(segs$NPP)) |
                  !dplyr::between(DistToCAS, min(segs$DistToCAS), max(segs$DistToCAS)) |
                  !dplyr::between(EKE, min(segs$EKE), max(segs$EKE))) %>% 
  nrow()
## [1] 65

In practice, compute_extrapolation returns a list containing both data.frames (extrapolation values) and rasters (spatial representation of these values), with the following structure:

tibble::glimpse(spermw.extrapolation)
## List of 2
##  $ data   :List of 4
##   ..$ all          :'data.frame':    5285 obs. of  6 variables:
##   .. ..$ x                : num [1:5285] 547985 557985 527985 537985 547985 ...
##   .. ..$ y                : num [1:5285] 788254 788254 778254 778254 778254 ...
##   .. ..$ ExDet            : num [1:5285] -0.1067 -0.0587 -0.0532 -0.0337 0.3445 ...
##   .. ..$ mic_univariate   : int [1:5285] 2 2 2 2 NA NA NA 2 2 NA ...
##   .. ..$ mic_combinatorial: int [1:5285] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ mic              : num [1:5285] 2 2 2 2 0 0 0 2 2 0 ...
##   ..$ univariate   :'data.frame':    65 obs. of  6 variables:
##   .. ..$ x                : num [1:65] 547985 557985 527985 537985 507985 ...
##   .. ..$ y                : num [1:65] 788254 788254 778254 778254 768254 ...
##   .. ..$ ExDet            : num [1:65] -0.1067 -0.0587 -0.0532 -0.0337 -0.0394 ...
##   .. ..$ mic_univariate   : int [1:65] 2 2 2 2 2 2 2 2 2 2 ...
##   .. ..$ mic_combinatorial: int [1:65] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ mic              : num [1:65] 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ combinatorial:'data.frame':    45 obs. of  6 variables:
##   .. ..$ x                : num [1:45] 557985 567985 577985 587985 607985 ...
##   .. ..$ y                : num [1:45] 728254 728254 728254 728254 728254 ...
##   .. ..$ ExDet            : num [1:45] 1.11 1.09 1.45 1.44 1.19 ...
##   .. ..$ mic_univariate   : int [1:45] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ mic_combinatorial: int [1:45] 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ mic              : num [1:45] 3 3 3 3 3 3 3 3 3 3 ...
##   ..$ analogue     :'data.frame':    5175 obs. of  6 variables:
##   .. ..$ x                : num [1:5175] 547985 557985 567985 527985 537985 ...
##   .. ..$ y                : num [1:5175] 778254 778254 778254 768254 768254 ...
##   .. ..$ ExDet            : num [1:5175] 0.344 0.263 0.212 0.346 0.279 ...
##   .. ..$ mic_univariate   : int [1:5175] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ mic_combinatorial: int [1:5175] NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ mic              : num [1:5175] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rasters:List of 2
##   ..$ ExDet:List of 4
##   .. ..$ all          :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ univariate   :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ combinatorial:Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ analogue     :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   ..$ mic  :List of 4
##   .. ..$ all          :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ univariate   :Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ combinatorial:Formal class 'RasterLayer' [package "raster"] with 12 slots
##   .. ..$ analogue     :Formal class 'RasterLayer' [package "raster"] with 12 slots

Three types of extrapolation can be identified (Figure 3):

  • Univariate extrapolation occurs when ExDet values < 0. It is also known as mathematical, strict, or Type 1 extrapolation, and represents conditions outside the range of individual covariates in the reference sample.
  • Combinatorial extrapolation occurs when ExDet values > 1. It is also known as multivariate or Type 2 extrapolation, and describes novel combinations of values encountered within the univariate range of reference covariates. Such combinations are identified based on the Mahalanobis distance metric (D2), a well-known and scale-invariant measure of multivariate outliers (Rousseeuw and Zomeren 1990).
  • Lastly, values between 0 and 1 denote predictions made in analogue conditions. These correspond to what is commonly referred to as interpolation, although if found in a different region in space or past/future period in time, then the terms geographical/temporal extrapolation are sometimes also used.
<b>Fig. 3:</b> Schematic representation of extrapolation in multivariate environmental space, based on two hypothetical covariates. Reference data points are represented as grey circles. Shaded areas correspond to different types of extrapolation outside the envelope of the reference data. Univariate extrapolation occurs beyond the range of individual covariates. Combinatorial extrapolation occurs within this range, but outside the reference hyperspace/hypervolume. <u>Source</u>: @Bouchet2019, adapted from @Mesgaran2014.

Fig. 3: Schematic representation of extrapolation in multivariate environmental space, based on two hypothetical covariates. Reference data points are represented as grey circles. Shaded areas correspond to different types of extrapolation outside the envelope of the reference data. Univariate extrapolation occurs beyond the range of individual covariates. Combinatorial extrapolation occurs within this range, but outside the reference hyperspace/hypervolume. Source: Bouchet et al. (2019), adapted from Mesgaran et al. (2014).

compute_extrapolation also determines which covariate makes the largest contribution to extrapolation for any given grid cell, ie. the most influential covariate (mic). In univariate extrapolation, this is the covariate that leads to the highest negative univariate distance from the initial covariate range. In combinatorial extrapolation, this corresponds to the covariate whose omission (while retaining all others) makes the largest reduction in the Mahalanobis distance to the centroid of the reference data. See Mesgaran et al. (2014) for formulae and technical explanations.

ExDet values can be retrieved from the returned list, as below:

# Example from combinatorial extrapolation

head(spermw.extrapolation$data$combinatorial)
##          x      y    ExDet mic_univariate mic_combinatorial mic
## 1 557984.6 728254 1.106042             NA                 3   3
## 2 567984.6 728254 1.090527             NA                 3   3
## 3 577984.6 728254 1.447405             NA                 3   3
## 4 587984.6 728254 1.437475             NA                 3   3
## 5 607984.6 728254 1.186687             NA                 3   3
## 6 527984.6 718254 1.240072             NA                 3   3

Summarising extrapolation

The summary table generated by compute_extrapolation is the result of an internal call to the function summarise_extrapolation. This function can also be run separately on the output of compute_extrapolation, if needed.

The function takes the following arguments:

  • extrapolation.object: output from compute_extrapolation
  • covariate.names: names of covariates of interest
  • extrapolation: if TRUE, returns a summary of univariate/combinatorial extrapolation
  • mic: if TRUE, returns a summary of the most influential covariates
  • print.precision: number of significant figures printed in the summary

Note that values of the summary statistics can be saved to an object in list form (here the summary.values object).

summary.values <- summarise_extrapolation(extrapolation.object = spermw.extrapolation,
                        covariate.names = covariates.spermw,
                        extrapolation = TRUE,
                        mic = TRUE,
                        print.precision = 2)
## 
## 
## Table: Extrapolation
## 
## --------------  ------------  ------------
## Analogue        n = 5,175     97.92 %     
## -----------     -----------   ----------- 
## Univariate      n = 65        1.23 %      
## Combinatorial   n = 45        0.85 %      
## -----------     -----------   ----------- 
## Sub-total       n = 110       2.08 %      
## -----------     -----------   ----------- 
## Total           n = 5,285     100 %       
## --------------  ------------  ------------
## 
## 
## Table: Most influential covariates (MIC)
## 
## --------------  ------------  ------------  ------------
## Univariate      EKE           n = 24        0.45 %      
## Univariate      SST           n = 17        0.32 %      
## Univariate      Depth         n = 16        0.3 %       
## Univariate      NPP           n =  6        0.11 %      
## Univariate      DistToCAS     n =  2        0.04 %      
## -----------     -----------   -----------   ----------- 
## Sub-total                     n = 65        1.23 %      
## -----------     -----------   -----------   ----------- 
## Combinatorial   NPP           n = 39        0.74 %      
## Combinatorial   Depth         n =  6        0.11 %      
## -----------     -----------   -----------   ----------- 
## Sub-total                     n = 45        0.85 %      
## -----------     -----------   -----------   ----------- 
## Total                         n = 110       2.08 %      
## --------------  ------------  ------------  ------------
summary.values
## $extrapolation
## $extrapolation$univariate.n
## [1] 65
## 
## $extrapolation$univariate.p
## [1] 1.229896
## 
## $extrapolation$combinatorial.n
## [1] 45
## 
## $extrapolation$combinatorial.p
## [1] 0.8514664
## 
## $extrapolation$analogue.n
## [1] 5175
## 
## $extrapolation$analogue.p
## [1] 97.91864
## 
## 
## $mic
## $mic$Depth
## $mic$Depth$.n
## [1] 16
## 
## $mic$Depth$.p
## [1] 0.3027436
## 
## 
## $mic$DistToCAS
## $mic$DistToCAS$.n
## [1] 2
## 
## $mic$DistToCAS$.p
## [1] 0.03784295
## 
## 
## $mic$EKE
## $mic$EKE$.n
## [1] 24
## 
## $mic$EKE$.p
## [1] 0.4541154
## 
## 
## $mic$NPP
## $mic$NPP$.n
## [1] 6
## 
## $mic$NPP$.p
## [1] 0.1135289
## 
## 
## $mic$SST
## $mic$SST$.n
## [1] 17
## 
## $mic$SST$.p
## [1] 0.3216651
## 
## 
## $mic$Depth
## $mic$Depth$.n
## [1] 6
## 
## $mic$Depth$.p
## [1] 0.1135289
## 
## 
## $mic$NPP
## $mic$NPP$.n
## [1] 39
## 
## $mic$NPP$.p
## [1] 0.7379376

Comparing covariates

The extent and magnitude of extrapolation naturally varies with the type and number of covariates considered. It may be useful, therefore, to test different combinations of covariates to inform their selection a priori, ie. before model fitting, thereby supporting model parsimony.

The compare_covariates function is available for this purpose. It is designed to assess extrapolation iteratively for all combinations of n.covariates, and return an overview of those that minimise/maximise extrapolation (as measured by the number of grid cells subject to extrapolation in the prediction area).

n.covariates can be supplied as any vector of integers from 1 to p, where p is the total number of covariates available. Its default value of NULL is equivalent to 1:p, meaning that all possible combinations are tested, unless otherwise specified.

Additional arguments include:

  • extrapolation.type: one of 'univariate', 'combinatorial' or 'both'
  • segments: reference data
  • covariate.names: names of covariates of interest
  • prediction.grid: target data
  • coordinate.system: projected coordinate system relevant to study location
  • create.plots: if TRUE, generates boxplots of results
  • display.percent: if TRUE, the y-axis of the boxplots is expressed as a percentage of the total number of grid cells in prediction.grid

Below is an example for both types of extrapolation, and the full set of five covariates.

compare_covariates(extrapolation.type = "both",
                   covariate.names = covariates.spermw,
                   n.covariates = NULL,
                   segments = segs, 
                   prediction.grid = predgrid,
                   coordinate.system = aftt_crs,
                   create.plots = TRUE,
                   display.percent = TRUE)
## Computing ...
## 
## Creating summaries ...
## Done!

## 
## 
## Extrapolation   Minimum             n_min  Maximum                           n_max 
## --------------  -----------------  ------  --------------------------------  ------
## Univariate      DistToCAS               2  Depth, SST, NPP, DistToCAS, EKE   65    
## Combinatorial   Depth                   0  Depth, SST, NPP                   213   
##                 SST                     0  -                                 -     
##                 NPP                     0  -                                 -     
##                 DistToCAS               0  -                                 -     
##                 EKE                     0  -                                 -     
##                 Depth, DistToCAS        0  -                                 -     
##                 Depth, EKE              0  -                                 -     
##                 NPP, DistToCAS          0  -                                 -     
##                 NPP, EKE                0  -                                 -     
##                 Depth, SST, EKE         0  -                                 -     
##                 Depth, NPP, EKE         0  -                                 -     
## Both            DistToCAS               2  Depth, SST, NPP                   252

The top graph summarises the extent of univariate (yellow) and combinatorial (blue) extrapolation for all combinations of \(1, 2, ..., n\) covariates. The total number of combinations is given above each boxplot as \(N_c\). As expected, univariate extrapolation increases with the number of covariates considered.

The bottom graph displays the same results, summarised by covariate of interest. This may help identify any covariates that consistently lead to extrapolation, and should be avoided. Here, none of the covariates stand out as driving extrapolation more than any others, although combinatorial extrapolation is marginally more prevalent with NPP and SST.

Finding nearby data

While extrapolation is often seen as a binary concept (ie. it either does or does not take place), it is reasonable to expect that predictions made at target points situated just outside the sampled environmental space may be more reliable than those made at points far outside it. The ExDet tool available through compute_extrapolation inherently quantifies this notion of ‘distance’ from the envelope (solid line) of the reference data (grey circles) (Figure 4A).

However, the multivariate distribution of reference data points is often far from homegeneous. It is possible, therefore, for target points representing analogue conditions to fall within sparsely sampled regions of the reference space; or conversely, for two target points reflecting an equal degree of extrapolation to have very different amounts of reference data within their vicinity.

An example of this is shown in Figure 4B, where three target points \(x_1\), \(x_2\) and \(x_3\) are located equally close to the envelope of the reference data. In essence, these reflect identical degrees of extrapolation, as defined under the ExDet framework. However, given the shape of the data cloud in multivariate space, it is apparent that predictions made at target point \(x_1\) will be much better ‘informed’ than those made at \(x_2\) or \(x_3\), as a bigger cluster of reference points lies nearby (green circle around \(x_1\)).

<b>Fig. 4:</b> Conceptual representation of two key extrapolation metrics. (A) Distance from the sampled environmental space. A target point far from the envelope of the reference data (eg. falling in the yellow area) is arguably 'more of an extrapolation' than one close to it (eg. falling in the purple area).  (B) Neighbourhood (syn. percentage of data nearby). Due to the shape of the reference data cloud, the amount of sample information available to 'inform' predictions made at target points can vary considerably. <u>Source</u>: @Bouchet2019.

Fig. 4: Conceptual representation of two key extrapolation metrics. (A) Distance from the sampled environmental space. A target point far from the envelope of the reference data (eg. falling in the yellow area) is arguably ‘more of an extrapolation’ than one close to it (eg. falling in the purple area). (B) Neighbourhood (syn. percentage of data nearby). Due to the shape of the reference data cloud, the amount of sample information available to ‘inform’ predictions made at target points can vary considerably. Source: Bouchet et al. (2019).

The notion of neighbourhood (or percentage/proportion of data nearby, hereafter %N) captures this idea, and provides an additional measure of the reliability of extrapolations in multivariate environmental space (Virgili et al. 2017; Mannocci et al. 2018). In practice, %N for any target point can be defined as the proportion of reference data within a radius of one geometric mean Gower’s distance (G2, calculated between all pairs of reference points) (King and Zeng 2007). The Gower’s distance between two points \(i\) and \(j\) defined along the axes of \(K\) covariates is calculated as the average absolute distance between the values of these two points in each dimension, divided by the range of the data, such that:

\[G_{ij}^2=\frac{1}{K}\sum_{k=1}^{K}\frac{\left|x_{ik}-x_{jk}\right|}{\textrm{max}(X_k)-\textrm{min}(X_k)}\]

The compute_nearby function is adapted from the code given in Mannocci et al. (2018) and allows the calculation of Gower’s distances (G2) as a basis for defining the neighbourhood.

The neighbourhood argument to this function corresponds to the radius within which neighbouhood calculations will occur. It has a default value of 1, as per Mannocci et al. (2018) and Virgili et al. (2017). Increasing its value (eg. to say, 2) would mean doubling the size of the green circles in Figure 4B, and would lead to larger percentages of data nearby per target point.

Important: The sperm whale dataset used here is sufficiently small that compute_nearby should run in a matter of seconds on most machines. However, applying the function to larger datasets may burn out memory allocation and lead to an R session failure. Additional arguments can be passed to compute_nearby to accommodate such ‘big’ data. Their use is unnecessary here, but is illustrated in the subsequent case study on beaked whales (see below).

#'---------------------------------------------
# Calculate Gower's distances and %N
#'---------------------------------------------

spermw.nearby <- compute_nearby(segments = segs, 
                   prediction.grid = predgrid, 
                   coordinate.system = aftt_crs,
                   covariate.names = covariates.spermw,
                   neighbourhood = 1)
## Preprocessing data ...
## Calculating distances ....
## Calculating the geometric variance...
## Calculating cumulative frequencies ...
## Finishing up ...
## Done!

Visualising extrapolation

Lastly, we can use the map_extrapolation function to create a series of interactive html maps of extrapolation in the prediction area.

Arguments to this function are identical to those of compute_extrapolation, with the exception of:

  • map.type: output to be shown on the map. One of 'extrapolation' for a map of extrapolation values, 'mic' for a map of the most influential covariates, or 'nearby' for a neighbourhood map (proportion of reference data near each target grid cell, a.k.a %N).
  • extrapolation.values: output from compute_extrapolation (only valid for map.type = 'extrapolation' and map.type = 'mic')
  • gower.values: output from compute_nearby (only valid for map.type = 'nearby')

sightings and tracks are optional arguments, and only plotted if specified. If either is provided as a matrix, then coordinates must be labelled x and y, lest a warning be returned. Tracks can also be given as a SpatialLinesDataFrame object.

#'---------------------------------------------
# Rename coordinates and convert to SpatialPointsdf
#'---------------------------------------------

obs.sp <- obs %>% 
  dplyr::rename(., x = coords.x1, y = coords.x2) %>% 
  sp::SpatialPointsDataFrame(coords = cbind(.$x, .$y), data = ., proj4string = sp::CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")) %>% 
  sp::spTransform(., CRSobj = aftt_crs)

The resulting maps can be panned, zoomed, and all layers toggled on and off independently. However, note that maps rely on the leaflet package (Cheng et al. 2018), and thus require an internet connection - they will not work offline. Facilities to save map tiles locally for offline viewing are currently not implemented but may be rolled into a future version of the code. Finally, note that prediction.grid is only used here to set the map extent when plotting.

Here’s a map of extrapolation (ExDet) values:

map_extrapolation(map.type = "extrapolation",
                  extrapolation.values = spermw.extrapolation,
                  covariate.names = covariates.spermw,
                  prediction.grid = predgrid, 
                  coordinate.system = aftt_crs,
                  sightings = NULL, 
                  tracks = transects)

Most influential covariates (MIC):

map_extrapolation(map.type = "mic",
                  extrapolation.values = spermw.extrapolation,
                  covariate.names = covariates.spermw,
                  prediction.grid = predgrid, 
                  coordinate.system = aftt_crs,
                  sightings = obs.sp, 
                  tracks = transects)

Proportion of data nearby (%N):

map_extrapolation(map.type = "nearby",
                  gower.values = spermw.nearby,
                  covariate.names = covariates.spermw,
                  prediction.grid = predgrid, 
                  coordinate.system = aftt_crs,
                  sightings = obs.sp, 
                  tracks = transects)

Full analysis

For simplicity, all the above analyses can be performed in a single step by calling the wrapper function analyse_extrapolation. Most arguments are derived from the individual functions presented above, and should thus be self-explanatory. Note, however, that specific parts of the analysis can be run/bypassed using the following arguments:

  • summarise.extrapolation: if TRUE, run summarise_extrapolation
  • compare.covariates = if TRUE, run compare_covariates
  • compute.nearby = if TRUE, run compute_nearby
  • generate.maps = if TRUE, run map_extrapolation
#'---------------------------------------------
# One function to rule them all, one funtion to find them, 
# One function to bring them all, and in the darkness bind them.
#'---------------------------------------------

spermw.analysis <- analyse_extrapolation(segments = segs,
                                          covariate.names = covariates.spermw,
                                          prediction.grid = predgrid,
                                          coordinate.system = aftt_crs,
                                          summarise.extrapolation = TRUE,
                                          summary.mic = TRUE,
                                          summary.print.precision = 2,
                                          compare.covariates = TRUE,
                                          compare.extrapolation.type = "both",
                                          compare.n.covariates = NULL,
                                          compare.create.plots = TRUE,
                                          compare.display.percent = TRUE,
                                          compute.nearby = TRUE,
                                          nearby.neighbourhood = 1,
                                          generate.maps = TRUE,
                                          sightings = obs.sp,
                                          tracks = transects)

Results are packaged in a list object with three levels, ie. extrapolation, nearby, and maps, representing the outputs of each step in the analysis.

purrr::map(spermw.analysis, ~class(.))
## $extrapolation
## [1] "list"
## 
## $nearby
## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
## 
## $maps
## [1] "list"

Example 2: Beaked whales

Overview

This section demonstrates an application of the code to a much larger dataset comprising 100 times as many segments (ie. reference/calibration data points) as in the preceding sperm whale example, and a target system spanning half the of the North Atlantic Ocean.

Specifically, it shows:

  • How to run the compute_nearby function on ‘big’ data
  • How covariate transformations, which are common in ecological research, may affect extrapolation assessments.

Data

The data for this example are fully described in Roberts et al. (2016) and Mannocci et al. (2017), and are very similar in nature to the sperm whale data from the previous example.

#'---------------------------------------------
# Import the beaked whale data
#'---------------------------------------------

load("/Users/philippebouchet/Dropbox/bouchet-extrapolation/data/Models/AFTT_Beaked_whales/segments_ziphiids.RData")

As before, segs.ziphiids is a data.frame containing the segment data. Note, however, that coordinates for the segment mid-points are not available.

knitr::kable(head(segs.ziphiids), format = "pandoc")
Depth DistToCanyonOrSeamount Chl1 CurrentSpeed
3.359937 2.449904 -1.113293 -0.9454697
3.349395 3.655052 -1.075477 -0.9704471
3.263835 4.467572 -1.034892 -0.9796294
3.355105 4.710572 -1.029381 -0.9759339
3.335688 4.556134 -1.049879 -0.9735497
3.246040 4.381402 -1.092257 -0.9678591

The environmental covariates likely to affect beaked whale density and distribution also differ. Two of those covariates are static (ie. Depth and DistToCanyonOrSeamount), and two are dynamic (Chl1 and CurrentSpeed).

#'---------------------------------------------
# Define environmental covariates of interest
#'---------------------------------------------

covariates.ziphiids <- c("Depth", "DistToCanyonOrSeamount", "Chl1", "CurrentSpeed")

Unlike for sperm whales, we do not have a ready-made prediction data.frame, so we need to build one from individual covariate rasters. For simplicity, we calculate the yearly mean of the twelve monthly rasters available for each dynamic covariate.

#'---------------------------------------------
# Import environmental covariates
#'---------------------------------------------

# Bathymetry

depth <- raster::raster(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "Depth", "NA_Depth_10km_mean_10km.img"))

# Distance to the nearest canyon or seamount

distcanyonseamount <- raster::raster(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "DistToCanyonOrSeamount", "Dist_to_canyons_or_seamounts_10km.img"))

# Chlorophyll-a concentration

chla <- purrr::map(list.files(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "Chl1"),
          pattern = "\\.img$"),
        ~raster::raster(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "Chl1",.))) %>%
  raster::stack(.) %>% 
  mean(.)

# Current speed

currentspeed <- purrr::map(list.files(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "CurrentSpeed"), pattern = "\\.img$"),
       ~raster::raster(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "covariates", "CurrentSpeed",.))) %>%
  raster::stack(.) %>% 
  mean(.)

The next step is to clip these rasters to the extent of the study region, namely the US Navy’s Atlantic Fleet Training and Testing area (AFTT).

#'---------------------------------------------
# Load shapefile of study area
#'---------------------------------------------

aftt <- raster::shapefile(file.path("/Users/philippebouchet/Dropbox/bouchet-extrapolation", "data", "models", "aftt.shp"))
aftt_crs <- sp::CRS(sp::proj4string(aftt))

#'---------------------------------------------
# Clip covariate rasters
#'---------------------------------------------

rasters.ziphiids <- purrr::map(.x = list(depth, distcanyonseamount, chla, currentspeed),
                               .f = ~raster::crop(x = ., y = aftt) %>% 
                                 raster::mask(x = ., mask = aftt)) %>% 
  purrr::set_names(., covariates.ziphiids)

Analysis

Transforming covariates

Transforming explanatory covariates (eg. to reduce the influence of outliers, or make skewed distributions more symmetric) is common practice in ecological modelling. We can explore the effects of data transformations on extrapolation metrics by constructing two separate datasets, ie. one transformed, the other not.

Here, the data in segs.ziphiids have already been transformed, whereas those in rasters.ziphiids are not. Note that in this case, \(log_{10}\) and square root transformations are applied to Depth/CurrentSpeed and to DistToCanyonOrSeamount, respectively. Remote-sensed chlorophyll-a values are frequently only used, and already provided, in \(log_{10}\) form - these will not be touched.

We use this to create the complementary datasets below:

segments.ziphiids <- predgrid.ziphiids <- list(transformed = NULL, untransformed = NULL)

#'---------------------------------------------
# Back-transform the segment-level covariate values
#'---------------------------------------------

segments.ziphiids$transformed <- segs.ziphiids
segments.ziphiids$untransformed <- segs.ziphiids %>% 
  dplyr::mutate(Depth = 10^Depth) %>% 
  dplyr::mutate(DistToCanyonOrSeamount = (DistToCanyonOrSeamount^2)*1000) %>% 
  dplyr::mutate(CurrentSpeed = 10^CurrentSpeed)

#'---------------------------------------------
# Transform covariate rasters
#'---------------------------------------------

predgrid.ziphiids$untransformed <- rasters.ziphiids
predgrid.ziphiids$transformed <- purrr::map_at(.x = rasters.ziphiids, 
                                      .at = c("Depth", "CurrentSpeed"), 
                                      .f = ~log10(.x)) %>% 
                        purrr::map_at(.x = ., 
                                      .at = "DistToCanyonOrSeamount", 
                                      .f = ~sqrt(.x/1000))

The final prediction grids (ie. target data) can now be put together.

#'---------------------------------------------
# Untransformed data
#'---------------------------------------------

predgrid.ziphiids$untransformed <- purrr::map_dfc(.x = predgrid.ziphiids$untransformed,
                                                  .f = ~raster::as.data.frame(.)) %>% 
                                   cbind(., coordinates(predgrid.ziphiids$untransformed[[1]]))
names(predgrid.ziphiids$untransformed) <- c(covariates.ziphiids, "x", "y")
predgrid.ziphiids$untransformed <- tidyr::drop_na(predgrid.ziphiids$untransformed) # Remove NAs

#'---------------------------------------------
# Transformed data
#'---------------------------------------------

predgrid.ziphiids$transformed <- purrr::map_dfc(.x = predgrid.ziphiids$transformed,
                                                .f = ~raster::as.data.frame(.)) %>% 
                                 cbind(., coordinates(predgrid.ziphiids$transformed[[1]]))
names(predgrid.ziphiids$transformed) <- c(covariates.ziphiids, "x", "y")
predgrid.ziphiids$transformed <- tidyr::drop_na(predgrid.ziphiids$transformed) # Removes NAs

Quantifying extrapolation

ExDet values can be calculated on both the transformed and untransformed data.

ziphiid.extrapolation <- list(transformed = NULL, untransformed = NULL)

ziphiid.extrapolation$untransformed <- compute_extrapolation(segments = segments.ziphiids$untransformed, 
    covariate.names = covariates.ziphiids, prediction.grid = predgrid.ziphiids$untransformed, 
    coordinate.system = aftt_crs, print.summary = TRUE, save.summary = FALSE, 
    print.precision = 2)
## 
## 
## Table: Extrapolation
## 
## --------------  ------------  ------------
## Analogue        n = 88,185    79.98 %     
## -----------     -----------   ----------- 
## Univariate      n = 22,067    20.01 %     
## Combinatorial   n = 2         0 %         
## -----------     -----------   ----------- 
## Sub-total       n = 22,069    20.02 %     
## -----------     -----------   ----------- 
## Total           n = 110,254   100 %       
## --------------  ------------  ------------
## 
## 
## Table: Most influential covariates (MIC)
## 
## --------------  -----------------------  ------------  ------------
## Univariate      Chl1                     n = 21,818    19.79 %     
## Univariate      CurrentSpeed             n =    239    0.22 %      
## Univariate      DistToCanyonOrSeamount   n =      7    0.01 %      
## Univariate      Depth                    n =      3    0 %         
## -----------     -----------              -----------   ----------- 
## Sub-total                                n = 22,067    20.01 %     
## -----------     -----------              -----------   ----------- 
## Combinatorial   CurrentSpeed             n = 2         0 %         
## -----------     -----------              -----------   ----------- 
## Sub-total                                n = 2         0 %         
## -----------     -----------              -----------   ----------- 
## Total                                    n = 22,069    20.02 %     
## --------------  -----------------------  ------------  ------------
ziphiid.extrapolation$transformed <- compute_extrapolation(segments = segments.ziphiids$transformed, 
    covariate.names = covariates.ziphiids, prediction.grid = predgrid.ziphiids$transformed, 
    coordinate.system = aftt_crs, print.summary = TRUE, save.summary = FALSE, 
    print.precision = 2)
## 
## 
## Table: Extrapolation
## 
## ------------  ------------  ------------
## Analogue      n = 88,187    79.99 %     
## -----------   -----------   ----------- 
## Univariate    n = 22,067    20.01 %     
## -----------   -----------   ----------- 
## Total         n = 110,254   100 %       
## ------------  ------------  ------------
## 
## 
## Table: Most influential covariates (MIC)
## 
## ------------  -----------------------  ------------  ------------
## Univariate    Chl1                     n = 21,818    19.79 %     
## Univariate    CurrentSpeed             n =    239    0.22 %      
## Univariate    DistToCanyonOrSeamount   n =      7    0.01 %      
## Univariate    Depth                    n =      3    0 %         
## -----------   -----------              -----------   ----------- 
## Total                                  n = 22,067    20.01 %     
## ------------  -----------------------  ------------  ------------

Transformations appear to have a minimal effect on extrapolation results, as the extent of univariate extrapolation is identical amongst datasets. Current speed is the only covariate responsible for combinatorial extrapolation in the untransformed case, albeit to a minute degree.

Comparing covariates

Running the code on the untransformed data:

compare_covariates(extrapolation.type = "both",
                   covariate.names = covariates.ziphiids,
                   n.covariates = NULL,
                   segments = segments.ziphiids$untransformed, 
                   prediction.grid = predgrid.ziphiids$untransformed,
                   coordinate.system = aftt_crs,
                   create.plots = TRUE,
                   display.percent = TRUE)
## Computing ...
## 
## Creating summaries ...
## Done!

## 
## 
## Extrapolation   Minimum                                       n_min  Maximum                                             n_max 
## --------------  -------------------------------------------  ------  --------------------------------------------------  ------
## Univariate      Depth                                             3  Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed   22067 
## Combinatorial   Depth                                             0  Depth, DistToCanyonOrSeamount                       2     
##                 DistToCanyonOrSeamount                            0  Depth, DistToCanyonOrSeamount, Chl1                 2     
##                 Chl1                                              0  Depth, DistToCanyonOrSeamount, CurrentSpeed         2     
##                 CurrentSpeed                                      0  Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed   2     
##                 Depth, CurrentSpeed                               0  -                                                   -     
##                 DistToCanyonOrSeamount, Chl1                      0  -                                                   -     
##                 DistToCanyonOrSeamount, CurrentSpeed              0  -                                                   -     
##                 Chl1, CurrentSpeed                                0  -                                                   -     
##                 DistToCanyonOrSeamount, Chl1, CurrentSpeed        0  -                                                   -     
## Both            Depth                                             3  Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed   22069

Running the code on the transformed data:

compare_covariates(extrapolation.type = "both",
                   covariate.names = covariates.ziphiids,
                   n.covariates = NULL,
                   segments = segments.ziphiids$transformed, 
                   prediction.grid = predgrid.ziphiids$transformed,
                   coordinate.system = aftt_crs,
                   create.plots = TRUE,
                   display.percent = TRUE)
## Computing ...
## 
## Creating summaries ...
## Done!

## 
## 
## Extrapolation   Minimum                                              n_min  Maximum                                             n_max 
## --------------  --------------------------------------------------  ------  --------------------------------------------------  ------
## Univariate      Depth                                                    3  Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed   22067 
## Combinatorial   Depth                                                    0  Depth, DistToCanyonOrSeamount                       905   
##                 DistToCanyonOrSeamount                                   0  -                                                   -     
##                 Chl1                                                     0  -                                                   -     
##                 CurrentSpeed                                             0  -                                                   -     
##                 DistToCanyonOrSeamount, Chl1                             0  -                                                   -     
##                 DistToCanyonOrSeamount, CurrentSpeed                     0  -                                                   -     
##                 Chl1, CurrentSpeed                                       0  -                                                   -     
##                 Depth, Chl1, CurrentSpeed                                0  -                                                   -     
##                 DistToCanyonOrSeamount, Chl1, CurrentSpeed               0  -                                                   -     
##                 Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed        0  -                                                   -     
## Both            Depth                                                    3  Depth, DistToCanyonOrSeamount, Chl1, CurrentSpeed   22067

Finding nearby data

As previously stated, the whatif function from the Whatif package (Gandrud et al. 2017), which is called internally by compute_nearby, may not run on very large datasets. While this was not an issue with the sperm whale data in the previous example, the beaked whale dataset is several orders of magnitude larger.

size.df <- tibble(species = c("Sperm whales", "Beaked whales"), size = c(format(prod(nrow(segs), 
    nrow(predgrid)), nsmall = 0, big.mark = ","), format(prod(nrow(segs.ziphiids), 
    nrow(predgrid.ziphiids$transformed)), nsmall = 0, big.mark = ",")))
knitr::kable(size.df, format = "pandoc")
species size
Sperm whales 5,015,465
Beaked whales 13,781,198,730

To circumvent this problem, the whatif.opt sets the calculations performed by whatif to run on partitions of the data instead, for greater efficiency. whatif.opt can be called internally within compute_nearby by using two additional arguments, namely:

  • max.size: Threshold above which partitioning will be triggered.
  • no.partitions: Number of required partitions.

In practice, a run of compute_nearby begins with a quick assessment of the dimensions of the input data, ie. the reference and target data.frames. If the product of their dimensions (ie. number of segments multiplied by number of prediction grid cells) exceeds the value set for max.size, then no.partitions subsets of the data will be created and the computations run on each using map functions from the purrr package (Henry and Wickham 2019). This means that a smaller max.size will trigger partitioning on correspondingly smaller datasets. By default, max.size is set to 1e7. This value was chosen arbitrarily, and is sufficiently large as to obviate the need for partitioning on the sperm whale dataset. That said, note that max.size makes little difference to computation time in the analysis of the sperm whale surveys.

It is quite crucial, however, to use partitioning on the beaked whale data. Given the large size of the dataset, the default value of max.size is just fine in this instance, but we could lower it if we wanted. No sensitivity analysis has been conducted on no.partitions, but manual testing suggests that 10 provides a reasonable all-rounder value.

The below code ran in a couple of hours on a recent machine.

Visualising extrapolation

Despite trivial differences in the geographical extent of extrapolation across datasets, the maps below illustrate that the magnitude of (univariate) extrapolation is marginally greater after covariate transformation.

map_extrapolation(map.type = "extrapolation", 
                  extrapolation.values = ziphiid.extrapolation$untransformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$untransformed, 
                  coordinate.system = aftt_crs,
                  base.layer = "gray")

map_extrapolation(map.type = "extrapolation",
                  extrapolation.values = ziphiid.extrapolation$transformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$transformed, 
                  coordinate.system = aftt_crs,
                   base.layer = "gray")

Most influential covariates:

map_extrapolation(map.type = "mic",
                  extrapolation.values = ziphiid.extrapolation$untransformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$untransformed,
                  coordinate.system = aftt_crs)

map_extrapolation(map.type = "mic",
                  extrapolation.values = ziphiid.extrapolation$transformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$transformed,
                  coordinate.system = aftt_crs)

Proportion of data nearby (%N):

map_extrapolation(map.type = "nearby",
                  gower.values = ziphiids.nearby$untransformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$untransformed, 
                  coordinate.system = aftt_crs)

map_extrapolation(map.type = "nearby",
                  gower.values = ziphiids.nearby$transformed,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$transformed, 
                  coordinate.system = aftt_crs)

Difference between the two rasters:

map_extrapolation(map.type = "nearby",
                  gower.values = ziphiids.nearby$difference,
                  covariate.names = covariates.ziphiids,
                  prediction.grid = predgrid.ziphiids$transformed, 
                  coordinate.system = aftt_crs)

breakpoints <- c(-35,-0.5,0.5,10)
plot(ziphiids.nearby$difference, breaks = breakpoints, col = pals::viridis(3))
plot(aftt, add=T)

Patterns in the %N metric are broadly similar across datasets within much of the Davis Strait (to the northeast) and the Sargasso Sea (to the southeast). However, a dichotomy between extrapolation levels on and the off the continental shelf is apparent, with covariate transformations tempering extrapolation in deeper waters, at the cost of reduced confidence in extrapolations made closer to shore.

Conclusion

This document has outlined an analysis of extrapolation in environmental space using bespoke code adapted from the ecospat (Broennimann et al. 2016) and WhatIf (Gandrud et al. 2017) packages. Note that many possible models (and model types) can be fitted to distance sampling or related data, with each potentially behaving differently outside reference conditions for any given level of extrapolation (Yates et al. 2018). Because of this, the tools presented herein are model-agnostic and only designed to provide an a priori assessment. This may, in turn, help inform modelling decisions and model interpretations, but caution should always be exercised when performing model selection, discrimination and criticism.

Software

The code is freely available here.

Appendix

The ExDet-functions.R R file contains a number of useful functions for assessing extrapolation in environmental space, namely:

  • ExDet: Adaptation of the ecospat.climan function from ecospat (formerly ecospat.exdet in previous releases of the package). Assesses the degree of environmental similarity between a reference and a target system, as per Mesgaran et al. (2014). Adds functionality to identify the most influential covariate(s) - MIC - ie. contributing most to departures from reference conditions.
  • whatif.opt: Adaptation of the whatif function from the WhatIf package, modified to run on large datasets via matrix partitioning. Relies on the purrr package (Henry and Wickham 2019). See Example 2 (beaked whales) for more details.
  • compute_extrapolation: Calls ExDet and return results in data.frame and raster formats.
  • summarise_extrapolation: Summarises extrapolation results in tabular form. Is called internally by compute_extrapolation by default.
  • compare_covariates: Wrapper around compute_extrapolation used to assess extrapolation for different combinations of covariates. Useful for informing covariate selection during model development.
  • compute_nearby: Wrapper around whatif and whatif.opt. Quantifies the proportion of reference points located in the vicinity of each target point in multivariate space, as an additional metric of extrapolation.
  • map_extrapolation: Creates interactive html maps of the outputs from compute_extrapolation. Aids in the visual assessment of extrapolation. Relies on the leaflet package (Cheng et al. 2018).
  • extrapolation_analysis: Wrapper around compute_extrapolation, summarise_extrapolation, and map_extrapolation. Allows a full assessment of extrapolation (ie. calculations, summary, and visualisation) to be conducted in one function call.

References

Bouchet PJ, Miller DL, Harris CM, Thomas L (2019). From here and now to there and then: Practical recommendations for the extrapolation of cetacean density surface models to novel conditions. CREEM Technical report 2019-01. University of St Andrews, St Andrews. Available at: https://risweb.st-andrews.ac.uk/portal/en/researchoutput/from-here-and-now-to-there-and-then-practical-recommendations-for-extrapolating-cetacean-density-surface-models-to-novel-conditions(d93b3865-dfc9-4b1c-bdf1-193183897df0).html

Broennimann O, Di Cola V, Guisan A (2016). ‘Ecospat: Spatial ecology miscellaneous methods. R package version 2.1.1’. Available at: https://CRAN.R-project.org/package=ecospat

Cheng J, Karambelkar B, Xie Y (2018). ‘Leaflet: Create interactive web maps with the javascript ’leaflet’ library. R package version 2.0.2.’ Available at: https://CRAN.R-project.org/package=leaflet

Gandrud C, King G, Stoll H, Zeng L (2017). ‘WhatIf: Evaluate counterfactuals. R package version 1.5-9.’ Available at: https://CRAN.R-project.org/package=WhatIf

Henry L, Wickham H (2019). ‘Purrr: Functional programming tools. R package version 0.3.0.’ Available at: https://CRAN.R-project.org/package=purrr

King G, Zeng L (2007). When can history be our guide? The pitfalls of counterfactual inference. International Studies Quarterly 51, 183–210. DOI: 10.1111/j.1468-2478.2007.00445.x

Mannocci L, Roberts JJ, Halpin PN, Authier M, Boisseau O, Bradai MN, Canãdas A, Chicote C, David L, Di-Méglio N, Fortuna CM, Frantzis A, Gazo M, Genov T, Hammond PS, Holcer D, Kaschner K, Kerem D, Lauriano G, Lewis T, Notarbartolo Di Sciara G, Panigada S, Raga JA, Scheinin A, Ridoux V, Vella A, Vella J (2018). Assessing cetacean surveys throughout the mediterranean sea: A gap analysis in environmental space. Scientific Reports 8, art3126. DOI: 10.5061/dryad.4pd33

Mannocci L, Roberts JJ, Miller DL, Halpin PN (2017). Extrapolating cetacean densities to quantitatively assess human impacts on populations in the high seas. Conservation Biology 31, 601–614. DOI: 10.1111/cobi.12856

Mesgaran MB, Cousens RD, Webber BL (2014). Here be dragons: A tool for quantifying novelty due to covariate range and correlation change when projecting species distribution models. Diversity and Distributions 20, 1147–1159. DOI: 10.1111/ddi.12209

Miller DL, Burt ML, Rexstad EA, Thomas L (2013). Spatial models for distance sampling data: Recent developments and future directions. Methods in Ecology and Evolution 4, 1001–1010. DOI: 10.1111/2041-210X.12105

Miller DL, Rexstad E, Burt L, Bravington MV, Hedley. S (2015). ‘Dsm: Density surface modelling of distance sampling data. R package version 2.2.9.’ Available at: https://CRAN.R-project.org/package=dsm

Roberts JJ, Best BD, Mannocci L, Fujioka E, Halpin PN, Palka DL, Garrison LP, Mullin KD, Cole TVN, Khan CB, McLellan WA, Pabst DA, Lockhart GG (2016). Habitat-based cetacean density models for the u.s. Atlantic and gulf of mexico. Scientific Reports 6, art22615. DOI: 10.1038/srep22615

Rousseeuw PJ, Zomeren BC van (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85, 633–639. DOI: 10.1080/01621459.1990.10474920

Sequeira A, Bouchet PJ, Yates K, Mengersen K, Caley J (2018). Transferring biodiversity models for conservation: Opportunities and challenges. Methods in Ecology and Evolution 9, 1250–1264. DOI: 10.1111/2041-210X.12998

Virgili A, Racine M, Authier M, Monestiez P, Ridoux V (2017). Comparison of habitat models for scarcely detected species. Ecological Modelling 346, 88–98. DOI: 10.1016/j.ecolmodel.2016.12.013

Yates KL, Bouchet PJ, Caley MJ, Mengersen K, Randin CF, Parnell S, Fielding AH, Bamford AJ, Ban S, Barbosa AM, Dormann CF, Elith J, Embling CB, Ervin GN, Fisher R, Gould S, Graf RF, Gregr EJ, Halpin PN, Heikkinen RK, Heinänen S, Jones AR, Krishnakumar PK, Lauria V, Lozano-Montes H, Mannocci L, Mellin C, Mesgaran MB, Moreno-Amat E, Mormede S, Novaczek E, Oppel S, Ortuño Crespo G, Peterson AT, Rapacciuolo G, Roberts JJ, Ross RE, Scales KL, Schoeman D, Snelgrove P, Sundblad G, Thuiller W, Torres LG, Verbruggen H, Wang L, Wenger S, Whittingham MJ, Zharikov Y, Zurell D, Sequeira AMM (2018). Outstanding challenges in the transferability of ecological models. Trends in Ecology and Evolution 33, 790–802. DOI: 10.1016/j.tree.2018.08.001