Lesson 4 Filtering

The EBD is huge, much too large to be read into R. So, if we want to work with these data, we first need to extract a subset small enough to be processed in R. This is the main purpose of the auk package: it uses the Unix command line utility AWK to extract data from the EBD. There are three steps to this filtering process:

  1. Set up a reference to the EBD text file with auk_ebd().
  2. Define a set of filters specifying the subset of data you want to extract.
  3. Compile those filters into an AWK script and run it to produce a text file with the desired subset of the data.

Tip

Filtering with auk can be fairly coarse; we just need to make the data small enough to read into R. Once the data are in R, further filtering can be used to refine the dataset.

4.1 Defining filters

The types of filters that can be applied to the EBD fall into four categories:

  • Species
  • Region
  • Season
  • Protocol and effort

Each specific filter is implemented by a different function in auk. Visit the documentation on filters on the auk website for a complete list. Each of these functions defines a filter on a column within the EBD. For example, auk_country() will define a filter allowing us to extract data from a subset of countries from the EBD.

Tip

Every filtering function in auk begins with auk_ for easy tab completion!

To define a filter, start by creating an auk_ebd object, then pipe this into one of the filtering functions.
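As a minimal sketch of this step, the code below creates an auk_ebd object and pipes it into auk_country(). It assumes R 4.1 or later for the native pipe |>, and it points auk_ebd() at the small sample file bundled with auk as a stand-in for the full EBD. The object names f_ebd and ebd_filters are illustrative only.

```r
library(auk)

# Stand-in for the full EBD: the small sample file that ships with auk.
# Replace this with the path to your own EBD download.
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Create the auk_ebd reference, then pipe it into a filtering function
ebd_filters <- auk_ebd(f_ebd) |>
  auk_country(country = "Guatemala")

# Printing the object lists the filters that have been defined so far
print(ebd_filters)
```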

Notice that when the auk_ebd object is printed, it tells us what filters have been defined. At this point, nothing has been done to the EBD: we’ve just defined the filter; we haven’t executed it yet.

Tip

Consult the Function Reference section of the auk website for a full list of available filters.

In general, you should think about filtering on region, season, and species, so let’s build upon what we already have and add some more filters. For example, if we wanted all Resplendent Quetzal records from Guatemala in June 2015 we would use the following filters:
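A sketch of what these filters might look like is given here. As an assumption for illustration, it uses the sample file bundled with auk in place of the full EBD, and the native pipe |> from R 4.1 or later; f_ebd and ebd_filters are illustrative names.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Species, region, and season filters chained together
ebd_filters <- auk_ebd(f_ebd) |>
  auk_species(species = "Resplendent Quetzal") |>
  auk_country(country = "Guatemala") |>
  auk_date(date = c("2015-06-01", "2015-06-30"))

# Nothing has been extracted yet; this only shows the defined filters
print(ebd_filters)
```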

Tip

The filtering functions in auk check the arguments you provide and will throw an error if there’s something wrong. Filtering the EBD takes a long time, so it’s better to get an error now rather than realizing you made a mistake after waiting several hours for the extraction process to complete.
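To see this checking in action, the sketch below passes a deliberately invalid country name to auk_country() and captures the resulting error. The name "Atlantis" is made up for illustration, and the bundled sample file again stands in for the full EBD. The exact wording of the error message is not shown because it may differ between auk versions.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")
ebd <- auk_ebd(f_ebd)

# "Atlantis" is not a real country, so the filter definition should fail
# immediately, long before any slow extraction is attempted
msg <- tryCatch(
  {
    auk_country(ebd, country = "Atlantis")
    NA_character_  # only reached if no error was thrown
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```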

Tip

In general, when using the effort filters like auk_time() or auk_distance(), it’s best to be a bit coarse. You can always refine the filters later once the data are in R, and starting with a coarse filter gives you some wiggle room if you later realize you want to make adjustments. Remember: the initial filtering with auk takes a long time, so it’s best to limit the number of times you do this.
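For example, a deliberately coarse pair of effort filters might look like the sketch below. The specific limits (checklists started between 05:00 and 12:00 and covering at most 10 km) are arbitrary values chosen for illustration, and the bundled sample file stands in for the full EBD.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Coarse effort filters: generous limits now, tighter ones later in R
ebd_filters <- auk_ebd(f_ebd) |>
  auk_time(start_time = c("05:00", "12:00")) |>  # checklist start time
  auk_distance(distance = c(0, 10))              # distance traveled, in km

print(ebd_filters)
```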

When filtering by date, you may need to extract records from a given date range regardless of year. For this situation, the auk_date() function can accept wildcards for the year. For example, we can rewrite the above Resplendent Quetzal example to get observations from June of any year.
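A sketch of the wildcard version, under the same assumptions as before (bundled sample file as a stand-in for the full EBD, native pipe, illustrative object names):

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# The "*" in place of the year matches June 1 to June 30 of any year
ebd_filters <- auk_ebd(f_ebd) |>
  auk_species(species = "Resplendent Quetzal") |>
  auk_country(country = "Guatemala") |>
  auk_date(date = c("*-06-01", "*-06-30"))

print(ebd_filters)
```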

4.1.1 Complete checklists

One of the most important filters is auk_complete(), which limits observations to those from complete checklists. As we’ve already seen, with complete checklists we can infer non-detections from the data. For most scientific applications, it’s critical that we have complete checklists, so we can generate presence-absence data.
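Adding this filter takes a single extra call, since auk_complete() needs no arguments beyond the auk_ebd object. The sketch below assumes the bundled sample file as a stand-in for the full EBD; the country filter is included only to show auk_complete() combined with another filter.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Restrict the extraction to complete checklists from Guatemala
ebd_filters <- auk_ebd(f_ebd) |>
  auk_country(country = "Guatemala") |>
  auk_complete()

print(ebd_filters)
```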

Exercise

Define filters to extract Horned Guan and Highland Guan records from complete checklists in Chiapas, Mexico. Hint: look at the help for the auk_state() filter.
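One possible solution is sketched below. It assumes that the state code for Chiapas is "MX-CHP" (country code, hyphen, state code); check this against the help for auk_state() before relying on it. As elsewhere, the bundled sample file stands in for the full EBD.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Two species, one state, complete checklists only
ebd_filters <- auk_ebd(f_ebd) |>
  auk_species(species = c("Horned Guan", "Highland Guan")) |>
  auk_state(state = "MX-CHP") |>  # assumed code for Chiapas, Mexico
  auk_complete()

print(ebd_filters)
```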

Checkpoint

Are there any questions about defining filters on the EBD?

4.2 Execute filters

Once you have an auk_ebd object with a set of filters defined, you can execute those filters with auk_filter(). This function compiles the filters into an AWK script, then runs that script to produce a text file with the defined subset of the EBD. The processing with AWK is done outside of R, line by line, only selecting rows that meet the criteria specified in the various auk filters. We’ll store the output file within the data/ subdirectory of the project directory. Note that filtering on the full EBD will take at least a couple of hours, so be prepared to wait a while.

Let’s define filters to extract Yellow-rumped Warbler observations in Guatemala that appear on complete traveling or stationary checklists, then execute those filters.
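A sketch of these two steps follows. It assumes AWK is installed and, so that it runs in seconds rather than hours, uses the sample file bundled with auk in place of the full EBD; with that stand-in the output will be tiny and may contain no matching records. The output file name ebd_yerwar.txt is hypothetical.

```r
library(auk)

# Stand-in for the full EBD; replace with the path to your own download
f_ebd <- system.file("extdata/ebd-sample.txt", package = "auk")

# Step 1: define the filters
ebd_filters <- auk_ebd(f_ebd) |>
  auk_species(species = "Yellow-rumped Warbler") |>
  auk_country(country = "Guatemala") |>
  auk_protocol(protocol = c("Traveling", "Stationary")) |>
  auk_complete()

# Output goes in the data/ subdirectory of the project directory
dir.create("data", showWarnings = FALSE)
f_out <- file.path("data", "ebd_yerwar.txt")  # hypothetical file name

# Step 2: compile the filters into an AWK script and run it
ebd_filtered <- auk_filter(ebd_filters, file = f_out, overwrite = TRUE)
```

Setting overwrite = TRUE lets the sketch be rerun without an error about an existing output file; for a real extraction that takes hours, you may prefer to leave it off so that results are not replaced by accident.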

Take a look at this file and notice that we’ve drastically reduced the size. It can now be imported into R without any issues.

Checkpoint

Were you able to correctly extract the Yellow-rumped Warbler data? Any questions on filtering?