Lesson 10 Encounter Rate
In this lesson, we’ll estimate the encounter rate of Wood Thrush on eBird checklists in June in BCR 27, where encounter rate is defined as the probability of an eBirder encountering a species on a standard eBird checklist. We’ll be using random forests in this lesson, a machine learning technique that uses an ensemble of many decision trees, each of which is fit using a bootstrap sampled of the data. For the purposes of this tutorial, we’ll treat the random forest as a black box method.
Let’s start by loading all the packages and data we’ll need for this lesson.
library(sf) library(raster) library(dggridR) library(lubridate) library(ranger) library(scam) library(PresenceAbsence) library(verification) library(edarf) library(ebirdst) library(fields) library(gridExtra) library(tidyverse) # resolve namespace conflicts select <- dplyr::select projection <- raster::projection map <- purrr::map set.seed(1) # ebird data ebird <- read_csv("data/ebd_woothr_june_bcr27_zf.csv") %>% # year required to join to habitat data mutate(year = year(observation_date)) # modis habitat covariates habitat <- read_csv("data/pland-elev_location-year.csv") %>% mutate(year = as.integer(year)) # combine ebird and habitat data ebird_habitat <- inner_join(ebird, habitat, by = c("locality_id", "year")) # prediction surface pred_surface <- read_csv("data/pland-elev_prediction-surface.csv") r <- raster("data/prediction-surface.tif") # load gis data for making maps map_proj <- st_crs(102003) ne_land <- read_sf("data/gis-data.gpkg", "ne_land") %>% st_transform(crs = map_proj) %>% st_geometry() bcr <- read_sf("data/gis-data.gpkg", "bcr") %>% st_transform(crs = map_proj) %>% st_geometry() ne_country_lines <- read_sf("data/gis-data.gpkg", "ne_country_lines") %>% st_transform(crs = map_proj) %>% st_geometry() ne_state_lines <- read_sf("data/gis-data.gpkg", "ne_state_lines") %>% st_transform(crs = map_proj) %>% st_geometry()
10.1 Data preparation
As we learned in Part I of this workshop, spatiotemporal subsampling can reduce spatial and temporal bias, and class imbalance, provided we sample detections and non-detections separately. So, we’ll apply subsampling prior to fitting the random forest model.
Sampling detections and non-detections separately will change the prevalence rate of the detections in the data. As a result, the estimated probability of occurrence based on these subsampled data will be larger than the true occurrence rate. When examining the outputs from the models it will be important to recall that we altered the prevalence rate at this stage.
# generate hexagonal grid with ~ 5 km betweeen cells dggs <- dgconstruct(spacing = 5) #> Resolution: 13, Area (km^2): 31.9926151554038, Spacing (km): 5.58632116604266, CLS (km): 6.38233997895802 # get hexagonal cell id and week number for each checklist checklist_cell <- ebird_habitat %>% mutate(cell = dgGEO_to_SEQNUM(dggs, longitude, latitude)$seqnum, year = year(observation_date), week = week(observation_date))
For very rare species, a more drastic approach to dealing with class imbalance is needed: only subsampling the non-detections and keeping all the detections. Here’s one way of accomplishing this.
This approach leads to many more detections being kept in the data.
However, some of the extra detections we have with this approach are in the same 5km cell and the same week, they may not be independent. There are trade-offs to many of these decisions about post-hoc sampling.
In preparation for modeling, we’ll select only the the columns that will be used as predictors in the model. We include both habitat predictors, which we expect to influence whether a species is present at a site, and also effort predictors to help control for variation in detectability.
Finally, we’ll hold 20% of the data aside so we have an independent test set, which we can later use to assess the performance of our model.
10.2 Random forests
Random forests are an excellent, general purpose machine learning method suitable for modeling encounter rate in a wide variety of scenarios. To address the issue of class imbalance, we’ll use a balanced random forest approach, a modification of the traditional random forest algorithm specifically designed to handle scenarios in which one class (in our case: species detections) is much more common than the other (non-detections). To implement a balanced random forest, we’ll first need to calculate the frequency of detections (the smaller class).
Now we can use the
ranger package to fit a random forest model to the eBird data.
# ranger requires a factor response to do classification ebird_split$train$species_observed <- factor(ebird_split$train$species_observed) # grow random forest rf <- ranger(formula = species_observed ~ ., data = ebird_split$train, importance = "impurity", probability = TRUE, replace = TRUE, sample.fraction = c(detection_freq, detection_freq))
10.2.1 Model assessment
To assess model quality, we’ll validate the model’s ability to predict the observed patterns of occurrence using independent validation data (i.e. the 20% test data set that we removed earlier). We’ll use a range of predictive performance metrics to compare the predictions to the actual observations. Different performance metrics reveal different aspects of the data and we can choose to emphasise some over others, depending on the specific goals of our analysis. For example, if we want to minimise the number of false negatives (i.e. we don’t want to miss any places where the species actually occurs), we would focus on sensitivity.
# predict on test data preds <- predict(rf, data = ebird_split$test, type = "response") # extract probability of detection preds <- preds$predictions[, 2] rf_pred_test <- data.frame(id = seq_along(preds), # actual detection/non-detection obs = ebird_split$test$species_observed, # predictions pred = preds) %>% drop_na() # mean squared error (mse) mse <- mean((rf_pred_test$obs - rf_pred_test$pred)^2, na.rm = TRUE) # pick threshold to maximize kappa opt_thresh <- optimal.thresholds(rf_pred_test, opt.methods = "MaxKappa") # calculate accuracy metrics: auc, kappa, sensitivity, specificity, brier pa_metrics <- presence.absence.accuracy(rf_pred_test, threshold = opt_thresh$pred, na.rm = TRUE, st.dev = FALSE) # combine various performance metrics together rf_assessment <- tibble( metric = c("mse", "sensitivity", "specificity", "auc", "kappa"), value = c(mse, pa_metrics$sensitivity, pa_metrics$specificity, pa_metrics$AUC, pa_metrics$Kappa) ) knitr::kable(rf_assessment, digits = 3)
10.3 Habitat associations
From the random forest model, we can glean two important sources of information about the association between Wood Thrush detection and features of their local environment. First, predictor importance is a measure of the predictive power of each covariate, and is calculated as a byproduct of fitting a random forest model. Second, partial dependence plots estimate the marginal effect of one predictor holding all other predictors constant.
10.3.1 Predictor importance
During the process of fitting a random forest model, some variables are removed at each node of the trees that make up the random forest. Predictor importance is based on the mean decrease in accuracy of the model when a given covariate is not used.
pi <- enframe(rf$variable.importance, "predictor", "importance") # plots ggplot(pi) + aes(x = fct_reorder(predictor, importance), y = importance) + geom_col() + geom_hline(yintercept = 0, size = 2, colour = "#555555") + scale_y_continuous(expand = c(0, 0)) + coord_flip() + labs(x = NULL, y = "Predictor Importance (Gini Index)") + theme_minimal() + theme(panel.grid = element_blank(), panel.grid.major.x = element_line(colour = "#cccccc", size = 0.5))
Consult the file
data/mcd12q1_classes.csv for a key to the different
—— ———————————— 0 Water bodies
1 Evergreen Needleleaf Forests
2 Evergreen Broadleaf Forests
3 Deciduous Needleleaf Forests
4 Deciduous Broadleaf Forests
5 Mixed Forests
6 Closed Shrublands
7 Open Shrublands
8 Woody Savannas
11 Permanent Wetlands
13 Urban and Built-up Lands
14 Cropland/Natural Vegetation Mosaics 15 Non-Vegetated Lands
10.3.2 Partial dependence
Partial dependence plots show the marginal effect of a given predictor on encounter rate averaged across the other predictors. We’ll use the R package
edarf to construct partial dependence plots for the most important predictors.
# top 9 predictors other than date top_pred <- pi %>% filter(!predictor %in% c("year", "day_of_year")) %>% top_n(n = 9, wt = importance) %>% arrange(desc(importance)) # calculate partial dependence for each predictor pd <- top_pred %>% mutate(pd = map(predictor, partial_dependence, fit = rf, data = ebird_split$train), pd = map(pd, ~ .[, c(1, 3)]), pd = map(pd, set_names, nm = c("value", "encounter_rate"))) %>% unnest(cols = pd) # plot ggplot(pd) + aes(x = value, y = encounter_rate) + geom_line() + geom_point() + scale_y_continuous(labels = scales::percent) + facet_wrap(~ as_factor(predictor), nrow = 3, scales = "free") + labs(x = NULL, y = "Encounter Rate") + theme_minimal() + theme_minimal() + theme(panel.grid = element_blank(), axis.line = element_line(color = "grey60"), axis.ticks = element_line(color = "grey60"))
Finally, we can use the random forest model to make a map of Wood Thrush encounter rate in BCR 27! The data package contains a prediction surface consisting of the PLAND habitat covariates summarized on a regular grid of points across BCR 27. We’ll make predictions of encounter rate at these points. However, first we need to bring effort variables into this prediction surface. We’ll make predictions for a standard eBird checklist: a 1 km, 1 hour traveling count at the peak time of day for detecting this species.
To find the time of day with the highest detection probability, we can look for the peak of the partial dependence plot, constraining the search to times of day for which there are enough data to make reasonable predictions (hours with at least 1% of checklists).
# find peak time of day from partial dependence pd_time <- partial_dependence(rf, vars = "time_observations_started", # make estimates at 30 minute intervals # use the entire training dataset for estimation n = c(24 * 2, nrow(ebird_split$train)), data = ebird_split$train) %>% select(time_observations_started, encounter_rate = "TRUE") # hours with at least 1% of checklists search_hours <- ebird_split$train %>% mutate(hour = floor(time_observations_started)) %>% count(hour) %>% mutate(pct = n / sum(n)) %>% filter(pct >= 0.01) # constrained peak time t_peak <- pd_time %>% filter(floor(time_observations_started) %in% search_hours$hour) %>% top_n(1, wt = desc(time_observations_started)) %>% pull(time_observations_started) t_peak #>  5.08
Based on this analysis, the best time for detecting Wood Thrush is at 5:05 AM. Now we can use this time to make predictions.
# add effort covariates to prediction pred_surface_eff <- pred_surface %>% mutate(observation_date = ymd("2018-06-15"), year = year(observation_date), day_of_year = yday(observation_date), time_observations_started = t_peak, duration_minutes = 60, effort_distance_km = 1, number_observers = 1) # predict pred_rf <- predict(rf, data = pred_surface_eff, type = "response") pred_rf <- pred_rf$predictions[, 2] # add to prediction surface pred_er <- bind_cols(pred_surface_eff, encounter_rate = pred_rf) %>% select(latitude, longitude, encounter_rate) %>% mutate(encounter_rate = pmin(pmax(encounter_rate, 0), 1))
Next, we’ll convert this data frame to spatial features using
sf, then rasterize the points using the prediction surface raster template.
Finally, we can map these predictions!
# project predictions r_pred_proj <- projectRaster(r_pred, crs = map_proj$proj4string, method = "ngb") par(mar = c(3.5, 0.25, 0.25, 0.25)) # set up plot area plot(bcr, col = NA, border = NA) plot(ne_land, col = "#dddddd", border = "#888888", lwd = 0.5, add = TRUE) # encounter rate r_max <- ceiling(10 * cellStats(r_pred_proj, max)) / 10 brks <- seq(0, r_max, by = 0.025) lbl_brks <- seq(0, r_max, by = 0.1) # ebird status and trends color palette pal <- abundance_palette(length(brks) - 1) plot(r_pred_proj, col = pal, breaks = brks, maxpixels = ncell(r_pred_proj), legend = FALSE, add = TRUE) # borders plot(bcr, border = "#000000", col = NA, lwd = 1, add = TRUE) plot(ne_state_lines, col = "#ffffff", lwd = 0.75, add = TRUE) plot(ne_country_lines, col = "#ffffff", lwd = 1.5, add = TRUE) box() # legend par(new = TRUE, mar = c(0, 0, 0, 0)) title <- "Wood Thrush Encounter Rate" image.plot(zlim = range(brks), legend.only = TRUE, col = pal, breaks = brks, smallplot = c(0.25, 0.75, 0.06, 0.09), horizontal = TRUE, axis.args = list(at = lbl_brks, labels = lbl_brks, fg = "black", col.axis = "black", cex.axis = 0.75, lwd.ticks = 0.5, padj = -1.5), legend.args = list(text = title, side = 3, col = "black", cex = 1, line = 0))
Now that you’ve completed this lesson, try modifying your script to complete at least one of the following exercises:
How does changing the subsampling grid cell size affect the model performance?
What happens to the predictions if you make them for an eBirder traveling further than 1 km, or birding for longer than 1 hour?
Filter the data to only shorter duration checklists or shorter distances traveled. How does this affect model performance?
An alternative approach to dealing with class imbalance, is to grid sample only the non-detections, while keeping all the detections. Try this subsampling approach and see what the affect is on the predictive performance metrics.