Lesson 5 Importing Data

In the previous lesson, we extracted a subset of the EBD containing Yellow-rumped Warbler observations from Guatemala. The output file created by auk_filter() is a tab-separated text file and could be read into R using read.delim() or readr::read_tsv(); however, auk has a function specifically for reading the EBD. read_ebd() does the following:

  1. Reads the data using data.table::fread(), which is much faster than read.delim().
  2. Sets the correct data types for the columns.
  3. Cleans up the column names so they are all snake_case.
  4. Automatically performs some post-processing steps, which are covered later in this lesson.

Let’s read in the data!

library(auk)
library(tidyverse)

# import the extract as-is: keep group checklist copies and subspecies records
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)
glimpse(ebd)
#> Observations: 160
#> Variables: 46
#> $ global_unique_identifier     <chr> "URN:CornellLabOfOrnithology:EBIRD:…
#> $ last_edited_date             <chr> "2018-09-09 12:59:27", "2017-08-29 …
#> $ taxonomic_order              <dbl> 32863, 32859, 32858, 32858, 32858, …
#> $ category                     <chr> "issf", "issf", "species", "species…
#> $ common_name                  <chr> "Yellow-rumped Warbler", "Yellow-ru…
#> $ scientific_name              <chr> "Setophaga coronata", "Setophaga co…
#> $ subspecies_common_name       <chr> "Yellow-rumped Warbler (Goldman's)"…
#> $ subspecies_scientific_name   <chr> "Setophaga coronata goldmani", "Set…
#> $ observation_count            <chr> "12", "8", "11", "3", "1", "15", "4…
#> $ breeding_bird_atlas_code     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ breeding_bird_atlas_category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ age_sex                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ country                      <chr> "Guatemala", "Guatemala", "Guatemal…
#> $ country_code                 <chr> "GT", "GT", "GT", "GT", "GT", "GT",…
#> $ state                        <chr> "Huehuetenango", "Petén", "Huehuete…
#> $ state_code                   <chr> "GT-HU", "GT-PE", "GT-HU", "GT-JA",…
#> $ county                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ county_code                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ iba_code                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ bcr_code                     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ usfws_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ atlas_block                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ locality                     <chr> "Cerro de los Cuervos", "Tikal Area…
#> $ locality_id                  <chr> "L2713729", "L4754006", "L2713730",…
#> $ locality_type                <chr> "P", "P", "P", "P", "H", "P", "P", …
#> $ latitude                     <dbl> 15.5, 17.2, 15.5, 14.7, 14.7, 15.5,…
#> $ longitude                    <dbl> -91.5, -89.6, -91.5, -90.0, -91.5, …
#> $ observation_date             <date> 2014-03-07, 2014-01-19, 2014-03-05…
#> $ time_observations_started    <chr> "07:10:00", "14:00:00", "10:55:00",…
#> $ observer_id                  <chr> "obsr200421", "obsr837809", "obsr20…
#> $ sampling_event_identifier    <chr> "S17445204", "S36593691", "S1744520…
#> $ protocol_type                <chr> "Traveling", "Traveling", "Travelin…
#> $ protocol_code                <chr> "P22", "P22", "P22", "P22", "P22", …
#> $ project_code                 <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD",…
#> $ duration_minutes             <int> 170, 210, 55, 300, 120, 210, 180, 2…
#> $ effort_distance_km           <dbl> 2.01, 1.61, 4.83, 5.00, 2.41, 2.01,…
#> $ effort_area_ha               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ number_observers             <int> 2, 3, 3, 6, 1, 3, 3, 14, 4, 12, 14,…
#> $ all_species_reported         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ group_identifier             <chr> "G828557", "G2390002", "G828560", N…
#> $ has_media                    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ approved                     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ reviewed                     <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, TR…
#> $ reason                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ trip_comments                <chr> NA, NA, "Driving w/ stops", NA, "Wa…
#> $ species_comments             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…

We’ll cover the use of unique = FALSE and rollup = FALSE next. For now, let’s just look at the data.

Exercise

Take a minute to explore these data using glimpse() and View(). Familiarize yourself with the columns. Be sure you can find the effort columns and the observation_count column.

Checkpoint

Do you have the data in a data frame? Does anyone have any questions about the data so far?

5.1 Group checklists

eBird allows users to share checklists with other eBird users that they’re birding with. This results in multiple copies of some checklists in the database. Group checklists can be identified in the data because they have the group_identifier column populated. Let’s take a look at some of these checklists.
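One way to pull out the group checklists is to keep only the rows where group_identifier is populated and sort them so that copies of the same shared checklist sit next to each other. This is a minimal sketch, assuming the extract from the previous lesson is at data/ebd_yerwar.txt; the object name group_checklists is only illustrative.

```r
library(auk)
library(dplyr)

# import without collapsing group checklists or rolling up taxonomy
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# keep only shared checklists and sort so copies of a group are adjacent
group_checklists <- ebd %>% 
  filter(!is.na(group_identifier)) %>% 
  arrange(group_identifier) %>% 
  select(sampling_event_identifier, group_identifier)
group_checklists
```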

In these data, multiple checklists share the same group_identifier, which implies that they have been shared and are duplicates. Let’s look at one of these on the eBird website: https://ebird.org/view/checklist/S20741847

As it turns out, group checklists aren’t exact duplicates; once a checklist has been shared the individual checklists can diverge in terms of the species seen, the counts for each species, and even the protocol and effort. For an example, look at this checklist with six observers each of whom saw a different set of species.

In most cases, you’ll only want to retain one of these checklists, but it’s not trivial to do so because the checklists are only partial duplicates. The function auk_unique() manages this for you. Specifically, for each species, it retains only the first observation of that species, which is typically the one submitted by the primary observer (i.e. the person who submitted the checklist to eBird). Note that the resulting “checklist” will be a combination of all the species seen across all copies of the group checklist.

When auk_unique() is run, it creates a new field, checklist_id, which is populated with group_identifier for group checklists and sampling_event_identifier otherwise; this field is a unique identifier for checklists. In addition, the full set of observer and sampling event identifiers is retained in a comma-separated format.
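The effect of auk_unique() can be checked by comparing row counts before and after and by inspecting the new checklist_id field. The sketch below assumes the same file path as earlier in this lesson; ebd_unique and collapsed are illustrative names, and the comma test relies on the comma-separated identifiers described above.

```r
library(auk)
library(dplyr)

# import with group checklist copies still present
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# collapse shared checklists to one record per species per group
ebd_unique <- auk_unique(ebd)

# number of records before and after removing duplicates
nrow(ebd)
nrow(ebd_unique)

# collapsed group checklists carry several sampling event ids joined by commas
collapsed <- ebd_unique %>% 
  filter(grepl(",", sampling_event_identifier, fixed = TRUE)) %>% 
  select(checklist_id, sampling_event_identifier, observer_id)
collapsed
```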

By default, read_ebd() calls auk_unique() automatically whenever you import data; however, this behavior can be controlled with the unique argument. For example, leaving unique at its default value of TRUE imports the data and removes duplicates.
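As a sketch of that import, the call below spells out unique = TRUE for clarity even though it is the default, and keeps rollup = FALSE so that only the duplicate handling changes. The file path is the one used earlier in this lesson and ebd_nodupes is an illustrative name.

```r
library(auk)

# remove duplicate group checklists on import, but leave the taxonomy untouched
ebd_nodupes <- read_ebd("data/ebd_yerwar.txt", unique = TRUE, rollup = FALSE)

# the checklist_id field added by auk_unique() should now be present
"checklist_id" %in% names(ebd_nodupes)
```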

Tip

auk_unique() takes a long time to run on large datasets. Consider using read_ebd(unique = FALSE) when importing large text files to speed up the process.

5.2 Taxonomy

eBird users can enter data for a wide range of taxa in addition to species. Observations can be reported at a level more granular than species (e.g. subspecies or recognizable forms) or at a higher level than species (e.g. spuhs, slashes, and hybrids). All the different taxa that can be reported are contained in the eBird taxonomy, which is updated every year in August. The eBird Science page has a subsection with details on the eBird taxonomy, and the taxonomy itself is available as a data frame in the auk package.
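The taxonomy data frame bundled with auk is called ebird_taxonomy. A quick way to get oriented is to glimpse its columns and tally the taxa by category; this sketch only assumes that auk and dplyr are installed.

```r
library(auk)
library(dplyr)

# columns available in the packaged eBird taxonomy
glimpse(ebird_taxonomy)

# number of taxa in each category (species, subspecies-level forms, spuhs, ...)
taxa_by_category <- count(ebird_taxonomy, category)
taxa_by_category
```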

For taxa below the species level, the report_as field of the taxonomy specifies the species that the taxon falls under. For example, Myrtle Warbler rolls up to Yellow-rumped Warbler.
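To see this roll-up relationship in the taxonomy, one option is to list every taxon whose common name contains “Yellow-rumped Warbler” alongside its report_as value. This is a sketch; matching on the common name is a choice made here for illustration, and yrwa_taxa is an illustrative name.

```r
library(auk)
library(dplyr)

# all Yellow-rumped Warbler taxa, with the species each one reports as
yrwa_taxa <- ebird_taxonomy %>% 
  filter(grepl("Yellow-rumped Warbler", common_name, fixed = TRUE)) %>% 
  select(common_name, category, species_code, report_as)
yrwa_taxa
```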

The EBD contains subspecies columns (subspecies_common_name and subspecies_scientific_name), which are populated when an observer has identified a bird below species level. In the EBD extract we’re working with, we have three different subspecies of Yellow-rumped Warbler.
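The subspecies present in the extract can be tallied with count(). The sketch below assumes the file path used earlier in this lesson; records identified only to species level appear as a row with NA in the subspecies columns.

```r
library(auk)
library(dplyr)

# import without taxonomic rollup so subspecies records are preserved
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# number of records for each subspecies (NA = reported at species level)
ssp_counts <- ebd %>% 
  count(subspecies_common_name, subspecies_scientific_name)
ssp_counts
```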

It’s even possible to have multiple subspecies of the same species on a single checklist.
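Checklists like this can be found by counting the distinct subspecies reported on each sampling event. This is a sketch under the same file-path assumption as earlier in the lesson; multi_ssp and n_subspecies are illustrative names.

```r
library(auk)
library(dplyr)

# import without rollup so each subspecies is a separate record
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# checklists on which more than one subspecies was reported
multi_ssp <- ebd %>% 
  filter(!is.na(subspecies_common_name)) %>% 
  group_by(sampling_event_identifier) %>% 
  summarise(n_subspecies = n_distinct(subspecies_common_name)) %>% 
  filter(n_subspecies > 1)
multi_ssp
```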

For most uses, you’ll want eBird data at the species level, which means dropping higher-level taxa and rolling lower-level taxa up to species level, making sure to sum the counts if multiple subspecies were present. The function auk_rollup() handles these taxonomic matters for you.
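A sketch of applying auk_rollup() to data that were imported without rollup is shown below. It assumes the file path used earlier in this lesson, and ebd_rollup is an illustrative name. Comparing row counts gives a rough sense of how many records were merged or dropped.

```r
library(auk)
library(dplyr)

# import with subspecies and higher-level taxa still present
ebd <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# roll subspecies up to species and drop taxa above species level
ebd_rollup <- auk_rollup(ebd)

# records before and after the rollup
nrow(ebd)
nrow(ebd_rollup)

# taxa remaining after the rollup
count(ebd_rollup, common_name, scientific_name)
```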

By default, read_ebd() calls auk_rollup() automatically when you import data; however, this behavior can be controlled with the rollup argument. For example, calling read_ebd() with both unique and rollup left at their default values imports the data, removes duplicates, and reports all records at species level.
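Putting both defaults together, a single call is enough for most analyses. This sketch assumes the same file path as earlier in the lesson; ebd_raw and ebd_clean are illustrative names, and the raw import is included only so the two can be compared.

```r
library(auk)

# raw import: group checklist copies and subspecies records are kept
ebd_raw <- read_ebd("data/ebd_yerwar.txt", unique = FALSE, rollup = FALSE)

# default import: duplicates removed and records rolled up to species
ebd_clean <- read_ebd("data/ebd_yerwar.txt")

# the cleaned data should have no more records than the raw import
nrow(ebd_raw)
nrow(ebd_clean)
```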

Checkpoint

Any questions on data import, taxonomy, or group checklists?