Lesson 6 Presence-absence Data

Up to this point we’ve been working with presence-only data. The EBD, and eBird checklists in general, only explicitly record positive observations of species. However, if we limit ourselves to complete checklists, we can fill in the implied zero counts for any checklists on which a given species isn’t explicitly reported to generate presence-absence data. We refer to this process as zero-filling the eBird data.

Zero-filling relies on the Sampling Event Data, which is a tab-separated text file containing checklist-level information. This file contains the full population of checklists in the eBird database. If we apply exactly the same set of filters to both the EBD and the Sampling Event Data, we can assume that any checklist with no observations for a given species in the EBD should get a zero-count record added to the dataset. So, producing presence-absence eBird data is a two-step process:

  1. Simultaneously filter the EBD and Sampling Event Data, making sure to only use complete checklists.
  2. Read both files into R and zero-fill the EBD using the full population of checklists from the Sampling Event Data.
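To make the idea concrete, here is a minimal base R sketch of what zero-filling does conceptually. The two small data frames are made up for illustration and stand in for the filtered Sampling Event Data and EBD; this is not how auk implements it internally.

```r
# Toy Sampling Event Data: every complete checklist that passed the filters
sampling_events <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4"),
  stringsAsFactors = FALSE
)

# Toy EBD: only the checklists on which the species was reported
observations <- data.frame(
  checklist_id = c("S2", "S4"),
  observation_count = c("3", "12"),
  stringsAsFactors = FALSE
)

# Left join: keep every checklist, attach a count where one was reported
zf <- merge(sampling_events, observations, by = "checklist_id", all.x = TRUE)

# A missing count after the join means the species was not reported
zf$species_observed <- !is.na(zf$observation_count)
zf$observation_count[!zf$species_observed] <- "0"

print(zf)
```

Checklists S1 and S3 never appear in the toy EBD, so they receive a count of "0" and species_observed = FALSE.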

Tip

When we say “presence-absence”, what we really mean by “absence” is that the species was not detected; it’s entirely possible that the species was present but the observer didn’t detect it.

Checkpoint

Are there any conceptual questions about the process of zero-filling?

6.1 Filtering

Simultaneously filtering the EBD and Sampling Event Data is done in almost exactly the same way as filtering the EBD alone. The only difference is that we provide both files to auk_ebd() and two corresponding output files to auk_filter(). For example, we can extract all American Flamingo observations from January in the Mexican state of Yucatán in preparation for zero-filling.
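A sketch of what this filtering step might look like is given here. All four file paths are hypothetical placeholders, and the code only runs if auk is installed and the input files exist (auk_filter() also needs AWK on your system).

```r
# Hypothetical paths: replace with the location of your own files
f_ebd_in  <- "data/ebd_input.txt"
f_sed_in  <- "data/ebd_sampling_input.txt"
f_ebd_out <- "data/ebd_amefla.txt"
f_sed_out <- "data/sed_amefla.txt"

# Run only when auk and both input files are available
if (requireNamespace("auk", quietly = TRUE) &&
    file.exists(f_ebd_in) && file.exists(f_sed_in)) {
  # Supplying file_sampling tells auk to filter both files together
  filters <- auk::auk_ebd(f_ebd_in, file_sampling = f_sed_in)
  filters <- auk::auk_species(filters, "American Flamingo")
  filters <- auk::auk_state(filters, "MX-YUC")
  # "*" is a wildcard for the year, so this keeps January of any year
  filters <- auk::auk_date(filters, c("*-01-01", "*-01-31"))
  # Complete checklists only, which zero-filling requires
  filters <- auk::auk_complete(filters)
  # Two output paths: the filtered EBD and the filtered Sampling Event Data
  auk::auk_filter(filters, file = f_ebd_out, file_sampling = f_sed_out,
                  overwrite = TRUE)
}
```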

We now have two output files that have been extracted using the same set of filters, apart from the species filter, which only applies to the EBD. We can read these files into R individually:
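A minimal sketch of that import step follows, assuming the filtered files were written to the hypothetical paths shown. The object names ebd_only and sed_only are also just illustrative.

```r
# Hypothetical paths to the two filtered output files
f_ebd <- "data/ebd_amefla.txt"
f_sed <- "data/sed_amefla.txt"

# Run only when auk and both files are available
if (requireNamespace("auk", quietly = TRUE) &&
    file.exists(f_ebd) && file.exists(f_sed)) {
  ebd_only <- auk::read_ebd(f_ebd)       # species-level records
  sed_only <- auk::read_sampling(f_sed)  # checklist-level records
  # Rows in the Sampling Event Data are checklists;
  # rows in the EBD are reports of the species
  print(nrow(sed_only))
  print(nrow(ebd_only))
}
```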

So, we have 291 checklists in the Sampling Event Data and, of those, 47 have American Flamingo observations on them.

Checkpoint

Were you able to filter and import the EBD and Sampling Event Data? Did you get the correct number of rows in both files?

6.2 Zero-filling

Now that we have these two datasets, containing checklist and species information respectively, we can use the function auk_zerofill() to combine them to produce presence-absence data. This function also imports the data and handles group checklists and taxonomic rollup automatically; we just have to pass it the paths to the two files. Let’s do this with the American Flamingo data.
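The call might look roughly like the sketch below. The file paths are hypothetical placeholders, and the result is stored as ebd_zf to match the output shown in this section.

```r
# Hypothetical paths to the filtered EBD and Sampling Event Data
f_ebd <- "data/ebd_amefla.txt"
f_sed <- "data/sed_amefla.txt"

# Run only when auk and both files are available
if (requireNamespace("auk", quietly = TRUE) &&
    file.exists(f_ebd) && file.exists(f_sed)) {
  # Pass the EBD path first and the Sampling Event Data path second
  ebd_zf <- auk::auk_zerofill(f_ebd, sampling_events = f_sed)
  print(names(ebd_zf))
}
```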

By default, auk_zerofill() returns the data as a list of two dataframes: sampling_events contains all the checklist information, and observations contains just the counts and presence-absence data for each species on each checklist. This compact format reduces the size of the data because checklist information isn’t replicated for every species observation.

glimpse(ebd_zf$observations)
#> Observations: 291
#> Variables: 4
#> $ checklist_id      <chr> "G1089999", "G1092350", "G1095290", "G1095467"…
#> $ scientific_name   <chr> "Phoenicopterus ruber", "Phoenicopterus ruber"…
#> $ observation_count <chr> "0", "0", "0", "0", "3", "0", "0", "0", "0", "…
#> $ species_observed  <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE…
glimpse(ebd_zf$sampling_events)
#> Observations: 291
#> Variables: 31
#> $ checklist_id              <chr> "S16201726", "S21515362", "S21431825",…
#> $ last_edited_date          <chr> "2014-01-03 11:28:47", "2015-01-24 11:…
#> $ country                   <chr> "Mexico", "Mexico", "Mexico", "Mexico"…
#> $ country_code              <chr> "MX", "MX", "MX", "MX", "MX", "MX", "M…
#> $ state                     <chr> "Yucatán", "Yucatán", "Yucatán", "Yuca…
#> $ state_code                <chr> "MX-YUC", "MX-YUC", "MX-YUC", "MX-YUC"…
#> $ county                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ county_code               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ iba_code                  <chr> NA, "MX_183", "MX_183", NA, NA, NA, NA…
#> $ bcr_code                  <int> 56, 55, 55, 55, 55, 55, 55, 55, 56, 56…
#> $ usfws_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ atlas_block               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ locality                  <chr> "VALLADOLID", "Celestun Casa Palmera",…
#> $ locality_id               <chr> "L2502912", "L3305787", "L3305787", "L…
#> $ locality_type             <chr> "P", "P", "P", "H", "H", "H", "P", "P"…
#> $ latitude                  <dbl> 20.7, 20.9, 20.9, 21.0, 20.7, 21.0, 21…
#> $ longitude                 <dbl> -88.2, -90.4, -90.4, -89.6, -89.7, -89…
#> $ observation_date          <date> 2014-01-01, 2015-01-24, 2015-01-20, 2…
#> $ time_observations_started <chr> "10:15:00", "09:00:00", "06:45:00", "0…
#> $ observer_id               <chr> "obs439605", "obs170749", "obs170749",…
#> $ sampling_event_identifier <chr> "S16201726", "S21515362", "S21431825",…
#> $ protocol_type             <chr> "Traveling", "Stationary", "Traveling"…
#> $ protocol_code             <chr> "P22", "P21", "P22", "P22", "P21", "P2…
#> $ project_code              <chr> "EBIRD_MEX", "EBIRD", "EBIRD", "EBIRD"…
#> $ duration_minutes          <int> 90, 150, 120, 45, 30, 120, 5, 450, 105…
#> $ effort_distance_km        <dbl> 1.609, NA, 0.322, 0.322, NA, 2.000, NA…
#> $ effort_area_ha            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ number_observers          <int> 4, 1, 1, 2, 2, 1, 1, 3, 13, 1, 1, 5, 5…
#> $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ trip_comments             <chr> "RECORRIDO POR UNA HACIENDA.", "from p…

However, in this case object size isn’t an issue, and it’s easier to work with a single dataframe, so we can collapse the data with collapse_zerofill().

ebd_zf_df <- collapse_zerofill(ebd_zf)
glimpse(ebd_zf_df)
#> Observations: 291
#> Variables: 34
#> $ checklist_id              <chr> "S16201726", "S21515362", "S21431825",…
#> $ last_edited_date          <chr> "2014-01-03 11:28:47", "2015-01-24 11:…
#> $ country                   <chr> "Mexico", "Mexico", "Mexico", "Mexico"…
#> $ country_code              <chr> "MX", "MX", "MX", "MX", "MX", "MX", "M…
#> $ state                     <chr> "Yucatán", "Yucatán", "Yucatán", "Yuca…
#> $ state_code                <chr> "MX-YUC", "MX-YUC", "MX-YUC", "MX-YUC"…
#> $ county                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ county_code               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ iba_code                  <chr> NA, "MX_183", "MX_183", NA, NA, NA, NA…
#> $ bcr_code                  <int> 56, 55, 55, 55, 55, 55, 55, 55, 56, 56…
#> $ usfws_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ atlas_block               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ locality                  <chr> "VALLADOLID", "Celestun Casa Palmera",…
#> $ locality_id               <chr> "L2502912", "L3305787", "L3305787", "L…
#> $ locality_type             <chr> "P", "P", "P", "H", "H", "H", "P", "P"…
#> $ latitude                  <dbl> 20.7, 20.9, 20.9, 21.0, 20.7, 21.0, 21…
#> $ longitude                 <dbl> -88.2, -90.4, -90.4, -89.6, -89.7, -89…
#> $ observation_date          <date> 2014-01-01, 2015-01-24, 2015-01-20, 2…
#> $ time_observations_started <chr> "10:15:00", "09:00:00", "06:45:00", "0…
#> $ observer_id               <chr> "obs439605", "obs170749", "obs170749",…
#> $ sampling_event_identifier <chr> "S16201726", "S21515362", "S21431825",…
#> $ protocol_type             <chr> "Traveling", "Stationary", "Traveling"…
#> $ protocol_code             <chr> "P22", "P21", "P22", "P22", "P21", "P2…
#> $ project_code              <chr> "EBIRD_MEX", "EBIRD", "EBIRD", "EBIRD"…
#> $ duration_minutes          <int> 90, 150, 120, 45, 30, 120, 5, 450, 105…
#> $ effort_distance_km        <dbl> 1.609, NA, 0.322, 0.322, NA, 2.000, NA…
#> $ effort_area_ha            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ number_observers          <int> 4, 1, 1, 2, 2, 1, 1, 3, 13, 1, 1, 5, 5…
#> $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ trip_comments             <chr> "RECORRIDO POR UNA HACIENDA.", "from p…
#> $ scientific_name           <chr> "Phoenicopterus ruber", "Phoenicopteru…
#> $ observation_count         <chr> "0", "0", "0", "0", "0", "0", "0", "15…
#> $ species_observed          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

Notice that in addition to the observation_count column, we now have a binary species_observed column specifying whether or not the species was observed on this checklist. You can also automatically collapse the data by using the collapse = TRUE argument to auk_zerofill().
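Because species_observed is logical, its mean gives the proportion of checklists that detected the species. The sketch below takes the counts quoted earlier in this lesson (291 checklists, 47 with flamingo records) at face value and assumes each EBD record corresponds to exactly one checklist; the vector is a stand-in for ebd_zf_df$species_observed.

```r
# Stand-in for the species_observed column: 291 checklists, 47 detections
species_observed <- rep(c(TRUE, FALSE), times = c(47, 291 - 47))

# The mean of a logical vector is the proportion of TRUE values
detection_frequency <- mean(species_observed)
print(round(detection_frequency, 3))
```

With real data, the same calculation would be mean(ebd_zf_df$species_observed).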

Exercise

Zero-fill and collapse the Hooded Warbler data you extracted in the previous exercise. What proportion of checklists detected this species?

Tip

Whenever you’re zero-filling data it’s critical that you think about region and season (i.e. where and when) in addition to just the species. If you don’t do that, you’ll zero-fill the entire global EBD and your computer will explode! For example, consider a highly localized species like the Cozumel Vireo, endemic to the small island of Cozumel off the coast of Mexico. Let’s try just filtering on species.

What we have here is the entire EBD (22 thousand checklists in the example dataset, and 40 million in the full EBD!) for a species that only occurs on one small island. Do we really care that a checklist in Anchorage, Alaska doesn’t have Cozumel Vireo? In this situation, you would be better off identifying the boundaries of the island and using auk_bbox() to spatially subset the data.
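A sketch of the spatially restricted filter is shown here. The bounding box coordinates are a rough, assumed approximation of Cozumel and the file paths are hypothetical, so check both before relying on them. auk_bbox() expects the box as c(lng_min, lat_min, lng_max, lat_max).

```r
# Hypothetical paths: replace with the location of your own files
f_ebd_in  <- "data/ebd_input.txt"
f_sed_in  <- "data/ebd_sampling_input.txt"
f_ebd_out <- "data/ebd_cozvir.txt"
f_sed_out <- "data/sed_cozvir.txt"

# Assumed, approximate box around Cozumel: lng_min, lat_min, lng_max, lat_max
cozumel_bbox <- c(-87.1, 20.2, -86.7, 20.6)

# Run only when auk and both input files are available
if (requireNamespace("auk", quietly = TRUE) &&
    file.exists(f_ebd_in) && file.exists(f_sed_in)) {
  filters <- auk::auk_ebd(f_ebd_in, file_sampling = f_sed_in)
  filters <- auk::auk_species(filters, "Cozumel Vireo")
  # Restrict both files to checklists inside the bounding box
  filters <- auk::auk_bbox(filters, cozumel_bbox)
  filters <- auk::auk_complete(filters)
  auk::auk_filter(filters, file = f_ebd_out, file_sampling = f_sed_out,
                  overwrite = TRUE)
}
```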

We have the same number of positive observations, but have now drastically reduced the number of checklists that didn’t detect Cozumel Vireo.

6.3 Tidying up

We now have a zero-filled presence-absence dataset with duplicate group checklists removed and all observations at the species level. There are a couple of remaining steps that we typically run to clean up the data. First, you may have noticed some cases where observation_count is "X" in the data. This is what eBirders enter for the count to indicate that they didn’t count the number of individuals for a given species.

It’s more appropriate to have the count as NA rather than "X" in this scenario. This will also allow us to convert the count column to integer rather than character. At this point, we’ll also assign an explicit distance of 0 to stationary checklists.
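These two clean-up steps can be sketched in base R as follows. The small data frame is made up for illustration and uses the same column names as the collapsed zero-filled data.

```r
# Toy stand-in for the collapsed zero-filled data
zf <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4"),
  observation_count = c("0", "X", "3", "12"),
  protocol_type = c("Stationary", "Traveling", "Stationary", "Traveling"),
  effort_distance_km = c(NA, 1.5, NA, 0.3),
  stringsAsFactors = FALSE
)

# "X" means the species was present but not counted: treat the count as NA
zf$observation_count[zf$observation_count %in% "X"] <- NA

# With "X" gone, the column converts cleanly from character to integer
zf$observation_count <- as.integer(zf$observation_count)

# Stationary checklists cover no distance, so make the zero explicit
zf$effort_distance_km[zf$protocol_type %in% "Stationary"] <- 0

str(zf)
```

Using %in% rather than == keeps the logical index free of NA values, so the assignment still works if a column already contains missing values.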

Finally, depending on your application, you’ll likely want to do some further filtering of the data. For many uses, it’s a good idea to reduce the variation in detectability between checklists by imposing some constraints on the effort variables. You can think of this as partially standardizing the observation process in a post hoc fashion. For example, in part II of this workshop, we’ll restrict observations to those from checklists less than 5 hours in duration and less than 5 km in length, and with 10 or fewer observers.
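A minimal base R sketch of these effort constraints is given below, again on a made-up data frame. The strict inequalities follow the wording above ("less than"); switch to <= if you want to include checklists exactly at the limits.

```r
# Toy checklists with the three effort variables
zf <- data.frame(
  checklist_id = c("S1", "S2", "S3", "S4", "S5"),
  duration_minutes = c(90, 450, 30, 120, 60),
  effort_distance_km = c(1.6, 2.0, 0, 7.5, 0.3),
  number_observers = c(4, 1, 2, 1, 13),
  stringsAsFactors = FALSE
)

# Under 5 hours, under 5 km, and 10 or fewer observers
keep <- zf$duration_minutes < 5 * 60 &
  zf$effort_distance_km < 5 &
  zf$number_observers <= 10

# which() also drops rows where an effort variable is NA
zf_effort <- zf[which(keep), ]
print(zf_effort$checklist_id)
```

Here S2 is dropped for duration, S4 for distance, and S5 for the number of observers.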

We’ve reduced the amount of data, but also decreased the variability in effort, which will lead to better model performance if we use these data to model species distributions. At this point, we can save the resulting processed eBird data, so that we can use it later in our analysis workflow.
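One simple way to save the processed data is as a CSV file, sketched below. The data frame is a toy stand-in, and the file name and temporary directory are placeholders; in practice you would write to a folder in your project.

```r
# Toy stand-in for the processed, effort-filtered data
zf_effort <- data.frame(
  checklist_id = c("S1", "S3"),
  observation_count = c(0L, 3L),
  species_observed = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Placeholder output location: a file in the session's temporary directory
f_out <- file.path(tempdir(), "ebd_zf_processed.csv")
write.csv(zf_effort, f_out, row.names = FALSE)

# Read it back to confirm the round trip
reloaded <- read.csv(f_out, stringsAsFactors = FALSE)
str(reloaded)
```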