Lesson 12 Occupancy

In this lesson, we’ll use occupancy models to estimate the occupancy of occupancy of Australian Ibis on eBird checklists in September in the Temperate and Subtropical Forest CMZ, while explicitly accounting for imperfect detection. First, we’ll give a short presentation introducing occupancy modeling. The presentation can be downloaded in PowerPoint or PDF format, or viewed on SpeakerDeck.

Let’s start by loading the packages and data required for this lesson.

12.1 Data preparation

Since we’ll be fitting a single-season occupancy models, we’ll need to start by focusing on observations from September of a single year, in this case the most recent year for which we have data. At this point, we also suggest subsetting the data to observations with 5 or fewer observers since there are few checklists with more than 5 observers.

As we learned in Part I of this workshop, we can use the auk function filter_repeat_visits() to extract just the eBird records that are suitable for occupancy modeling. Specifically, we want repeat visits to the same location by the same observer, so we’ll use latitude, longitude, and observer ID to define ‘sites’.

Exercise

Suppose you define the temporal period of closure as week long blocks, rather than the whole month of September. Use filter_repeat_visits() to extract eBird data accordingly.

Exercise

Subsetting eBird data to just the observations suitable for occupancy modeling will inevitably reduce the amount of data. What proportion of observations remain after calling filter_repeat_visits()? How many unique sites do we have?

Next, we’ll use the auk function format_unmarked_occu() to put data in the specific format required by unmarked, as discussed in Part I. Prior knowledge of Australian Ibis, as well as the predictor importance results from the encounter rate lesson, inform what land cover variables we choose as occupancy covariates. The five effort variables should always be included as detection covariates, but we’ll also examine whether different habitat types affect the detection probability.

As described in lesson 7, we’ll use spatial subsampling to reduce spatial bias. However, here we’ll subsample at the level of ‘sites’ rather than observations.

Finally, we’ll convert this data frame of observations into an unmarked object in order to fit occupancy models.

12.2 Occupancy modeling

Now that the data are prepared, we can fit a single-season occupancy model to using the occu() function, specifying the detection and occupancy covariates, respectively, via a double right-hand sided formula of the form ~ detection covariates ~ occupancy covariates.

12.2.1 Assessment

The MacKenzie and Bailey [-@mackenzieAssessingFitSiteoccupancy2004] goodness-of-fit test can be used to assess the occupancy model fit. Note that to produce accurate results, this process requires simulating about 1,000 bootstrap samples, which can take a long time to run. For the sake of speed, if you want to run the below code, we suggest using nsim = 5.

#> 
#> MacKenzie and Bailey goodness-of-fit for single-season occupancy model
#> 
#> Chi-square statistic = 612 
#> Number of bootstrap samples = 1000
#> P-value = 0.989
#> 
#> Quantiles of bootstrapped statistics:
#>    0%   25%   50%   75%  100% 
#>   451  1248  1609  2072 26914 
#> 
#> Estimate of c-hat = 0.31

12.3 Prediction

Now we can estimate the distribution of Australian Ibis in the CMZ and produce a map. Recall that when we predicted encouter rate, we had to include effort variables in our prediction surface. We don’t need to do that here because the estimated occupancy doesn’t depend on the effort covariates, these only occur in the detection submodel. In addition, predict() can produce both predictions as well as estimates of the standard error.

Checkpoint

Predicting on the full prediction surface will typically take several minutes. As the above code runs, let’s take a short break.

Next, we want to plot these predictions. We’ll convert this data frame to spatial features using sf, then rasterize the points using the prediction surface raster template.

Finally, we can map these predictions!

12.4 Model selection

So far, we’ve been using a global model that includes all of the covariates we believe will influence the occupancy and detection probabilities; however, a more thorough approach is to use model selection to compare and rank a set of candidate models, each containing different subsets of the covariates. We’ll use the dredge() function, which evaluates a set of candidate models generated by using different combinations of the terms in the global model. Since we know from prior experience that the effort covariates are almost always important, we’ll lock these variables in, and consider a candidate set consisting of all possible combinations of the ecological covariates in the occupancy submodel.

psi(pland_02) psi(pland_08) psi(pland_09) psi(pland_13) df AICc delta weight
-3.56 11.06 11 468 0.000 0.329
-3.18 12.63 0.733 12 468 0.365 0.274
-3.60 11.31 -0.225 12 470 2.147 0.112
-3.01 12.56 0.420 0.911 13 470 2.383 0.100
-2.90 10 471 3.014 0.073
-2.67 0.382 11 472 4.680 0.032
-2.89 0.106 11 473 5.218 0.024
11.71 1.075 11 474 6.323 0.014
-2.43 0.568 0.635 12 474 6.415 0.013
11.15 0.891 1.457 12 475 7.456 0.008
8.19 10 476 8.348 0.005
9 476 8.358 0.005
0.692 10 476 8.550 0.005
0.941 1.107 11 477 9.383 0.003
8.59 -0.191 11 478 10.499 0.002
0.082 10 478 10.549 0.002

Tip

To determine which candidate model each row corresponds to, note that he columns beginning with psi( give the model coefficients for each of the habitat covariates in the occupancy submodel; missing values indicate that that covariate was not included in the given model.

The corrected Akaike Information Criterion (AICc) measures the performance of each model, relative to the other models in the candidate set, adjusting for the number of parameters. Lower values indicate models with a better fit to the data, penalizing for the number of parameters. Delta is the difference between the AICc values for the given model and that for the top model (i.e. the one with the lowest AICc). Finally, the AIC weight is a transformation of delta that can be interpreted as the likelihood that the given model is the most likely one of the candidate models to have generated the data.

There doesn’t appear to be a clear single model, or even a small set of models, that are most likely to have generated our data. Given this, we’ll average across the top models (those comprising 95% of the weights), weighted by AICc, to produce a model-averaged prediction.

Making model-averaged predictions works in the same way as for making predictions from a single global model. However, it takes significantly more time to run, so we won’t run the below code or produce maps based on model-averaged predictions during the workshop, instead leaving it as an exercise.

12.4.1 Detection model

A unique feature of occupancy models, is that we can investigate whether certain covariates influence detection, which wasn’t possible using the machine learning approach in lesson 11. We included evergreen broadleaf forest (pland_02) and woody savanna (pland_08) as detection covariates in the global model, and we can compare a set of models with and without these covariates to assess their importance.

Now we can perform model selection on this set to compare the different candidate models.

From these results, it’s clear that including these habitat covariates (especially woody savanna) leads to an improvement in model performance, as shown by the AIC and AIC weights. Let’s look at the coefficients from the global model for these covariates to see how they’re impacting detection and occupancy.

The psi() coefficients are from the occupancy submodel and the p() coefficients are from the detection submodel.

12.5 Exercises

Now that you’ve completed this lesson, try modifying your script to complete at least one of the following exercises:

  1. Try sampling more than a single checklist per grid cell in the spatiotemporal sampling. How does that affect model fit and predictions?

  2. What happens to the size of dataset if you only use stationary counts, or reduce the distance traveled to 1 km? How does it impact the results? How does the different input data affect your interpretation of the results?

  3. What happens to the size of the dataset if you allow repeat visits to be by multiple observers? How does this impact the results.

  4. Produce a map based on model averaged predictions. Note that making these predictions may take up to an hour.

  5. Try estimating occupancy for one of the other species in the example data.