Lesson 7 Spatiotemporal Subsampling

Despite the strengths of eBird data, species observations collected through citizen science projects present a number of challenges that are not found in conventional scientific data. In this chapter, we’ll discuss three of these challenges: spatial bias, temporal bias, and class imbalance. Spatial and temporal bias refers to the tendency of eBird checklists to be distributed non-randomly in space and time, while class imbalance refers to fact that there will be many more non-detections than detections for most species. All three can impact our ability to make reliable inferences from eBird data.

Exercise

Think of some examples of birder behavior that can lead to spatial and temporal bias.

  • Spatial bias: most eBirders sample near their homes, in easily accessible areas such as roadsides, or in areas and habitats of known high biodiversity.
  • Temporal bias: eBirders preferentially sample when they are available, such as weekends, and at times of year when they expect to observe more birds, such as spring migration in North America.

Fortunately, all three of these challenges can largely be addressed by spatiotemporal subsampling of the eBird data prior to modeling. In particular, this consists of dividing space and time up into a regular grid (e.g. 5 km x 5 km x 1 week), and only selecting a subset of checklists from each grid cell. To deal with class imbalance, we can subsample detections and non-detections separately to ensure we don’t lose too many detections.

For the spatial part of the subsampling we’ll use the package dggridR to generate a regular hexagonal grid and assign points to the cells of this grid. Hexagonal grids are preferable to square grids because they exhibit significantly less spatial distortion.

7.2 Subsampling eBird data

Now let’s apply this same approach to the zero-filled American Flamingo data we produced in the previous lesson; however, now we’ll temporally sample as well, at a resolution of one week, and sample presences and absences separately. We start by reading in the eBird data and assigning each checklist to a hexagonal grid cell and a week.

Now we sample a single checklist from each grid cell for each week, using group_by() to sample presences and absences separetely.

Exercise

How did the spatiotemporal subsampling affect the overall sample size as well as the prevalence of detections?