A big year is an informal competition among birders to see the most bird species in a given geographical area between January 1st and December 31st of the same calendar year. Big years can be as big or small as the participant desires: entire countries, single provinces or states, or even your own backyard.

Starting in Antarctica on January 1, 2015, American birder Noah Strycker embarked on the biggest of all big years, a global Big Year. His goal was to bird and travel non-stop for 365 days, visiting dozens of countries and every continent, in a quest to beat the previous record of 4,341 species. In particular, he set himself a target of 5,000 species, which is just under half of every bird species on earth!

Noah kept a daily blog entitled Birding Without Borders on the Audobon website. Along with this, he kept a list of every new species he saw. When he crossed the finish line on December 31st in India, Noah had shattered the previous big year record and his own target, seeing 6,042 species in 2015!

In what follows I use R to get the species list off the Audubon website, clean it up, and take a look at how Noah’s year progressed. If you don’t care about R, skip to the analysis and have a look at the plots.

Big Year Species List

Required packages

suppressPackageStartupMessages(library(dplyr))
library(rvest)
library(lubridate)
library(ggplot2)
library(ggalt)
library(scales)
library(knitr)
library(geonames)
# you'll need to run the following line with your Geonames username
# or put it in your .Rprofile
# options(geonamesUsername = 'YOURUSERNAME')
library(countrycode)
library(sp)
library(raster)
library(rgeos)

Web scraping

Noah’s list of species is available as a big HTML table on the Audubon website. I download the html and parse it into a dataframe using rvest:

bigyear <- read_html("http://www.audubon.org/news/the-species-list") %>% 
  html_node("table") %>% 
  html_table

Turns out this is only the final quarter of the year, January to June and July to September are on a different pages.

# third quarter of year
bigyear <- read_html("http://www.audubon.org/news/the-species-list-part-2") %>% 
  html_node("table") %>% 
  html_table %>% 
  bind_rows(bigyear, .)

# first half of year
bigyear <- read_html("http://www.audubon.org/news/the-species-list-part-1") %>% 
  html_node("table") %>% 
  html_table %>% 
  bind_rows(bigyear, .)

Tidy

First, wow, rvest makes web scraping almost too easy! Unfortunately, the table didn’t import cleanly because the html table has no header with column names, and for each day there is a row spanning all the columns that acts as a header. I can easily fix this by removing the invalid rows.

bigyear <- bigyear %>% 
  setNames(c("species_num", "species", "date", "country", "site")) %>% 
  mutate(species_num = suppressWarnings(as.integer(species_num))) %>% 
  filter(!is.na(species_num))
head(bigyear) %>% 
  kable(row.names = F)
species_num species date country site
6042 Silver-breasted Broadbill 12/31 India Soraipong
6041 White-winged Duck 12/31 India Digboi Oilfields
6040 White-cheeked Partridge 12/31 India Digboi Oilfields
6039 Large Woodshrike 12/31 India Digboi Oilfields
6038 Buff-chested Babbler 12/31 India Digboi Oilfields
6037 White-hooded Babbler 12/31 India Digboi Oilfields

Typos in species names

Having looked at the data, I know there are a handful of problems in the species names. Some are typos and some are inconsistencies with the species names on eBird. Although, it shouldn’t affect the analysis, I’d prefer a clean dataset, so I’ll manually fix them here.

typos <- c("Northern Island Brown Kiwi",
           "Black-baced Swamphen",
           "Fulvous-chested Jungle Flycatcher",
           "Arctic/Kamchatka Leaf/Japanese Leaf Warbler",
           "Ruby-cheeked Sunbir",
           "Pied Shrike-Babbler")
fixes <- c("North Island Brown Kiwi",
           "Black-backed Swamphen",
           "Fulvous-chested Jungle-Flycatcher",
           "Arctic Warbler",
           "Ruby-cheeked Sunbird",
           "Blyth's Shrike-Babbler")
bigyear <- data_frame(typos, fixes) %>% 
  left_join(bigyear, ., by = c("species" = "typos")) %>% 
  mutate(species = ifelse(is.na(fixes), species, fixes))

Fix dates

Next, I notice that the date field is of type character with just the month and day; a year isn’t needed since all these dates fall within 2015 by definition. I parse this into a proper date.

bigyear <- bigyear %>% 
  mutate(date = paste("2015", date, sep = "/")) %>% 
  mutate(date = as.Date(ymd(date)))
glimpse(bigyear)
#> Observations: 6,042
#> Variables: 6
#> $ species_num <int> 6042, 6041, 6040, 6039, 6038, 6037, 6036, 6035, 60...
#> $ species     <chr> "Silver-breasted Broadbill", "White-winged Duck", ...
#> $ date        <date> 2015-12-31, 2015-12-31, 2015-12-31, 2015-12-31, 2...
#> $ country     <chr> "India", "India", "India", "India", "India", "Indi...
#> $ site        <chr> "Soraipong", "Digboi Oilfields", "Digboi Oilfields...
#> $ fixes       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

I also think it would be valuable to have a day number variable, i.e. number of days since January 1, 2015.

bigyear <- bigyear %>% 
  mutate(day = yday(date))
tail(bigyear) %>% 
  kable(row.names = F)
species_num species date country site fixes day
6 Antarctic Shag 2015-01-01 Antarctica Mikkelsen Harbor NA 1
5 Brown Skua 2015-01-01 Antarctica Mikkelsen Harbor NA 1
4 Kelp Gull 2015-01-01 Antarctica Mikkelsen Harbor NA 1
3 Antarctic Tern 2015-01-01 Antarctica Mikkelsen Harbor NA 1
2 Southern Fulmar 2015-01-01 Antarctica Off Trinity Island NA 1
1 Cape Petrel 2015-01-01 Antarctica Off Trinity Island NA 1

Tidying up countries

Having a look at the countries column I see that it’s not strictly country:

unique(bigyear$country) %>% 
  sort
#>  [1] "Antarctica"            "Argentina"            
#>  [3] "Australia"             "Bali"                 
#>  [5] "Borneo"                "Brazil"               
#>  [7] "Cameroon"              "Chile"                
#>  [9] "China"                 "Colombia"             
#> [11] "Costa Rica"            "Ecuador"              
#> [13] "Falklands"             "France"               
#> [15] "Germany"               "Ghana"                
#> [17] "Guatemala"             "High Seas"            
#> [19] "Iceland"               "India"                
#> [21] "Jamaica"               "Kenya"                
#> [23] "Lesotho"               "Madagascar"           
#> [25] "Malaysia"              "Mexico"               
#> [27] "Myanmar"               "New Britain Island"   
#> [29] "New Zealand"           "Norway"               
#> [31] "Panama"                "Papua New Guinea"     
#> [33] "Peru"                  "Philippines"          
#> [35] "Queensland, Australia" "Singapore"            
#> [37] "South Africa"          "Spain"                
#> [39] "Sri Lanka"             "Sulawesi"             
#> [41] "Taiwan"                "Tanzania"             
#> [43] "Tasmania"              "Thailand"             
#> [45] "Turkey"                "UAE"                  
#> [47] "Uganda"                "US-Arizona"           
#> [49] "US-California"         "US-NY"                
#> [51] "US-Oregon"             "US-Texas"             
#> [53] "Victoria, Australia"   "Western Australia"

Borneo, Bali, and Sulawesi are all islands in Southeast Asia. Borneo consists of three countries (Malaysia, Brunei, and Indonesia), but I happen to know Noah only visited the Malaysian province of Sabah. Bali and Sulawesi are both Indonesian islands.

The US and Australia are broken into states. New Britain Island is part of Papua New Guinea.

The Falkland islands are technically a British overseas territory, but are off the coast of Argentina and there is an ongoing sovereignty dispute. I’ll leave them as is. Antarctica is not technically a country, but I’ll also leave it as is because it doesn’t nicely fit anywhere else. Finally, the records for High Seas are from the Drake Passage, the area between South America and Antarctica. I’ll group these with Antarctica.

bigyear <- bigyear %>% 
  mutate(location = country,
         country = ifelse(country == "Borneo", "Malaysia", country),
         country = ifelse(country == "Bali", "Indonesia", country),
         country = ifelse(country == "Sulawesi", "Indonesia", country),
         country = ifelse(grepl("^US", country), "United States", country),
         country = ifelse(grepl("Australia", country), "Australia", country),
         country = ifelse(country == "Tasmania", "Australia", country),
         country = ifelse(country == "New Britain Island", "Papua New Guinea", country),
         country = ifelse(country == "High Seas", "Antarctica", country)) %>% 
  dplyr::select(species_num, species, date, day, country, location, site) %>% 
  arrange(desc(species_num))
unique(bigyear$country) %>% 
  sort
#>  [1] "Antarctica"       "Argentina"        "Australia"       
#>  [4] "Brazil"           "Cameroon"         "Chile"           
#>  [7] "China"            "Colombia"         "Costa Rica"      
#> [10] "Ecuador"          "Falklands"        "France"          
#> [13] "Germany"          "Ghana"            "Guatemala"       
#> [16] "Iceland"          "India"            "Indonesia"       
#> [19] "Jamaica"          "Kenya"            "Lesotho"         
#> [22] "Madagascar"       "Malaysia"         "Mexico"          
#> [25] "Myanmar"          "New Zealand"      "Norway"          
#> [28] "Panama"           "Papua New Guinea" "Peru"            
#> [31] "Philippines"      "Singapore"        "South Africa"    
#> [34] "Spain"            "Sri Lanka"        "Taiwan"          
#> [37] "Tanzania"         "Thailand"         "Turkey"          
#> [40] "UAE"              "Uganda"           "United States"

Analysis

Noah visited 42 countries (counting the Falklands as a distinct country) over 365 days and saw an average of 143.86 new species per country and 16.55 new species per day. Let’s dive into this a little more with some plots.

Species accumulation

I start by looking at the species accumulation curve, i.e. trend in cumulative species seen over the course of the big year.

bigyear %>% 
  group_by(date) %>% 
  summarize(n_species = max(species_num)) %>% 
  {ggplot(., aes(x = date, y = n_species)) +
    geom_line(color = "#FA6900", size = 1) +
    geom_smooth(method = "lm", se = F, color = "grey30", 
                linetype = "dashed", size = 0.5) +
    scale_x_date(labels = date_format("%b"),
                 breaks = date_breaks("month")) +
    scale_y_continuous(breaks = 1000 * (0:6),
                       labels = comma) +
    labs(x = NULL, y = "Cumulative number of species") +
    annotate("text", x = as.Date("2015-04-15"), y = 4100,
             size = 4, colour = "grey30", family = "Helvetica Neue Light",
             label = paste0("Average = ", 
                            round(coef(lm(n_species ~ date, data = .))[[2]], 1),
                            " birds / day")) +
    geom_hline(yintercept = 0, size = 1,colour = "grey30") +
    theme_bw() +
    theme(text = element_text(family = "Helvetica Neue Light",
                              size = 12, colour = "grey30"),
          panel.background = element_rect(fill = "grey90"),
          plot.background = element_rect(fill = "grey90"),
          panel.border = element_blank(),
          panel.grid.major = element_line(colour = "grey80", size = 0.5),
          panel.grid.minor = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_text(family = "Helvetica Neue"),
          plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")
    )}

plot of chunk trend

This is extremely impressive, Noah has managed to maintain a nearly linear increase in species sightings over the course of the year. A typical species accumulation curve shows a pattern of saturation; there is an effect of diminishing returns as more time is spent searching. Of course, here he’s managed to mostly avoid this by constantly changing locations to encounter new species pools.

We can also look at number of new species by month.

bigyear %>% 
  mutate(month = month(date, label = T)) %>% 
  ggplot(aes(x = month)) + 
    geom_bar(width = 0.7, color = "grey30") +
    labs(x = NULL, y = "# new species") +
    scale_y_continuous(limits = c(0, 810), expand = c(0, 0)) +
    geom_hline(yintercept = 0, size = 1, colour = "grey30") +
    theme_bw() +
    theme(text = element_text(family = "Helvetica Neue Light",
                              size = 12, colour = "grey30"),
          panel.background = element_rect(fill = "grey90"),
          plot.background = element_rect(fill = "grey90"),
          panel.border = element_blank(),
          panel.grid.major = element_line(colour = "grey80", size = 0.5),
          panel.grid.major.x = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_text(family = "Helvetica Neue"),
          plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))

plot of chunk by-month

Here the effect of saturation is more clear. There seems to be three distinct groups and, within each group, there is a decline in species over time, then a jump when the next group is reached. I suspect these jumps are related to Noah changing continents.

Daily new species counts

How much variability was there in the number of new species Noah saw each day? A simple histogram should provide some insight here. Days with no birds seen will not appear in the dataset, so I bring these days in as well.

all_dates <- seq(ymd("2015-01-01"), ymd("2015-12-31"), by = "day") %>% 
  as.Date %>% 
  data.frame(date = .)
bigyear_by_day <- bigyear %>% 
  count(date) %>%
  left_join(all_dates, ., by = "date") %>% 
  mutate(n = ifelse(is.na(n), 0, n))
ggplot(bigyear_by_day, aes(x = n)) +
  geom_histogram(binwidth = 5, fill = "grey30") +
  geom_vline(xintercept = mean(bigyear_by_day$n), linetype="dashed", 
             color = "#FA6900") +
  annotate("text", x = mean(bigyear_by_day$n) + 27, y = 75,
           size = 4, colour = "grey30", family = "Helvetica Neue Light",
           label = paste0("Average = ", 
                          round(mean(bigyear_by_day$n), 1), " birds / day")) +
  scale_x_continuous(breaks = seq(0, 110, by = 10)) +
  scale_y_continuous(breaks = seq(0, 100, by = 10), expand = c(0, 0), 
                     limits = c(0, 100)) +
  labs(x = "# new species seen per day", y = "# days",
       title = "Number of New Species Added Each Day",
       subtitle = "Distribution of daily new species counts") +
    geom_hline(yintercept = 0, size = 1, colour = "grey30") +
    theme_bw() +
    theme(text = element_text(family = "Helvetica Neue Light",
                              size = 12, colour = "grey30"),
          plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
          panel.background = element_rect(fill = "grey90"),
          plot.background = element_rect(fill = "grey90"),
          panel.border = element_blank(),
          panel.grid.major = element_line(colour = "grey80", size = 0.5),
          panel.grid.major.x = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_text(family = "Helvetica Neue"),
          plot.title = element_text(hjust = 0, size=16, 
                                    family = "Helvetica Neue",
                                    margin = margin(b=10)),
          plot.subtitle = element_text(hjust = 0, size = 12,
                                       margin = margin(b = 15)))

plot of chunk by-day

Best days

From this plot it’s clear that Noah had a handful of very productive days. Where was Noah birding on his three most birdy days?

top_n(bigyear_by_day, 3, n) %>% 
  inner_join(bigyear, by = "date") %>% 
  dplyr::select(day, date, country, n) %>% 
  distinct %>% 
  arrange(desc(n)) %>% 
  kable
day date country n
14 2015-01-14 Argentina 108
335 2015-12-01 Papua New Guinea 81
177 2015-06-26 Ghana 75

Zero bird days

At the other end of the spectrum, are days where Noah didn’t see any new birds.

filter(bigyear_by_day, n == 0) %>% 
  .$date
#> [1] "2015-01-03" "2015-06-08" "2015-10-01"

Delving into these requires consulting Noah’s blog. On January 3 (day 3), he was in Antarctica where there is a very small pool of species to see. On June 8 (day 159), Noah was stuck in various New York airports due to flight delays. Finally, on October 1 (day 274), Noah was gain stuck in transit on his way from India to Myanmar.

Of course, as with this entire analysis, I only have data on new species seen, not total species seen. So, Noah likely did see some birds on these zero days, just no new ones. More on this later.

Geographical Analysis

Next I connect to the GeoNames API to join in some country level data. GeoNames is a database of geographical information with an API interface. rOpenSci has built the geonames package to access this API.

Download countryInfo from API

The countryInfo table contains the basic country-level information I’m after:

country_info <- GNcountryInfo()
country_info <- country_info %>% 
  dplyr::select(continent_code = continent, continent_name = continentName, 
         country_code = countryCode, country_name = countryName,
         area = areaInSqKm, north, south, east, west)

Of course, the big year dataset uses non-standard country names. Here’s where the countrycode package comes in handy, it aids conversion between various country coding schemes and country names. In particular, GNcountryInfo()$countryCode is the ISO alpha-2 country code, which I’ll use in joining. Thanks to Andrew MacDonald for this tip.

bigyear <- bigyear %>% 
  mutate(country_code = countrycode(country, "country.name", "iso2c")) %>% 
  right_join(country_info, ., by = "country_code")

Geographical visualization

Now it’s possible to delve deeper into the spatial pattern of the sightings. First, I look at new species by continent.

cont_summ <- bigyear %>% 
  group_by(continent_name) %>% 
  summarize(n_species = n(), 
            birding_days = n_distinct(day),
            species_per_day = n_species / birding_days,
            first_day = min(date)) %>% 
  ungroup %>% 
  mutate(continent_name = factor(continent_name),
         continent_name = reorder(continent_name, first_day, min)) %>% 
  dplyr::select(-first_day)

ggplot(cont_summ, aes(x = continent_name, y = n_species)) + 
  geom_bar(stat = "identity", colour = "grey30", width = 0.7) +
  scale_x_discrete(labels = c("North America" = "N America",
                              "South America" = "S America")) +
  scale_y_continuous(labels = comma, expand = c(0, 0), limits = c(0, 2000)) +
  labs(x = NULL, y = "# new species",
       title = "New Big Year Species by Continent",
       subtitle = "Contribution of each continent to Noah's big year") +
  geom_hline(yintercept = 0, size = 1, colour = "grey30") +
  theme_bw() +
  theme(text = element_text(family = "Helvetica Neue Light",
                            size = 12, colour = "grey30"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        panel.background = element_rect(fill = "grey90"),
        plot.background = element_rect(fill = "grey90"),
        panel.border = element_blank(),
        panel.grid.major = element_line(colour = "grey80", size = 0.5),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_text(family = "Helvetica Neue"),
        axis.text.x = element_text(family = "Helvetica Neue", size = 10),
        plot.title = element_text(hjust = 0, size = 16, 
                                  family = "Helvetica Neue",
                                  margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = 0, size = 12,
                                     margin = margin(b = 15)))

plot of chunk by-continent

No surprise here, tons of species seen in the tropics, especially South America. Of course, this is confounded by the amount of time Noah spent on each continent. So, now I look at number of species seen per day within each continent.

bigyear %>% 
  count(continent_name, date) %>% 
  ungroup %>% 
  mutate(continent_name = factor(continent_name),
         continent_name = reorder(continent_name, date, min)) %>% 
  ggplot(aes(x = continent_name, y = n)) +
    geom_boxplot(color = "grey30", outlier.colour = "#FA6900", 
                 outlier.size = 1.25) +
    geom_text(data = cont_summ, 
              aes(x = continent_name, y = 78, label = birding_days),
              size = 4.5, colour = "grey30", 
              family = "Helvetica Neue Light") +
    annotate("text", x = 1, y = 82, label = "# days",
             size = 4.5, colour = "grey30", 
             family = "Helvetica Neue Light") +
    scale_x_discrete(labels = c("North America" = "N America",
                                "South America" = "S America")) +
    scale_y_continuous(breaks = seq(0, 80, 10), 
                       limits = c(0, 83), expand = c(0, 0.1)) +
    labs(x = NULL, y = "# new species",
         title = "Contribution of Each Continent to Noah's Big Year",
         subtitle = "Average number of new species seen in each continent") +
  geom_hline(yintercept = 0, size = 1, colour = "grey30") +
  theme_bw() +
  theme(text = element_text(family = "Helvetica Neue Light",
                            size = 12, colour = "grey30"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        panel.background = element_rect(fill = "grey90"),
        plot.background = element_rect(fill = "grey90"),
        panel.border = element_blank(),
        panel.grid.major = element_line(colour = "grey80", size = 0.5),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_text(family = "Helvetica Neue"),
        axis.text.x = element_text(family = "Helvetica Neue", size = 10),
        plot.title = element_text(hjust = 0, size=16, 
                                  family = "Helvetica Neue",
                                  margin = margin(b=10)),
        plot.subtitle = element_text(hjust = 0, size = 12,
                                     margin = margin(b = 15)))

plot of chunk per-day


arrange(cont_summ, species_per_day) %>% 
  kable(digits = 2, col.names = c("Continent", "# species", "Days", 
                                  "Species per day"), 
        row.names = F)
Continent # species Days Species per day
Antarctica 25 4 6.25
North America 770 55 14.00
Africa 1140 79 14.43
Europe 190 12 15.83
Asia 1395 87 16.03
South America 1955 98 19.95
Oceania 567 28 20.25

This is pretty interesting, Noah has seen around 15 new species per day regardless of continent. This suggests some impressive skill in planning on his part, he’s clearly optimized his itinerary to fight that tendency for the species accumulation curve to saturate, by moving around constantly and spending less time in less diverse areas.

Antarctica is an outlier with only 6.25 species per day, but these are all exciting species that aren’t found elsewhere. Noah saw the most species per day in South America and Oceania at about 20 per day. Oceania has a very distinctive avifauna and Noah was there for only 28 days, yielding some very concentrated birding. South America has extremely high alpha and beta diversity of birds. Have a look at these maps of global bird diversity from the Biodiversity Mapping project.

In addition, South America has amazing infrastructure for birding. A huge number of birders visit this region and many ornithologists conduct research here. These facts were demonstrated by a team from Louisiana State University that saw 354 bird species in a single 24 period in Peru in 2014, the record for most species seen in a day (Edit: turns out this big day record was destroyed in October 2015 by a team who saw 425 species in a single day in Ecuador; thanks to Tim Boucher for this correction).

And, no surprise about the country where Noah saw the most birds… Peru!

country_summ <- bigyear %>% 
  group_by(continent_name, country, country_code) %>% 
  summarize(n_species = n(), 
            birding_days = n_distinct(day),
            species_per_day = n_species / birding_days) %>% 
  ungroup %>% 
  mutate(country = factor(country),
         country = reorder(country, n_species, max),
         continent_name = ifelse(continent_name == "North America",
                                 "N America", continent_name),
         continent_name = ifelse(continent_name == "South America",
                                 "S America", continent_name),
         continent_name = factor(continent_name),
         continent_name = reorder(continent_name, -n_species, min))

ggplot(country_summ, aes(x = n_species, y = country, color = continent_name)) + 
  geom_lollipop(point.size = 2, horizontal = TRUE) +
  scale_color_brewer(name = NULL, palette = "Set1") +
  labs(x = "# new species", y = NULL,
         title = "Country Contributions to Noah's Big Year",
         subtitle = "Total number of new species added in each country") +
  scale_x_continuous(limits = c(0, 525), expand = c(0, 0)) +
  guides(color = guide_legend(override.aes = list(size = 4))) +
  geom_vline(xintercept = 0, size = 1, colour = "grey30") +
  theme_bw() +
  theme(text = element_text(family = "Helvetica Neue Light",
                            size = 12, colour = "grey30"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        panel.background = element_rect(fill = "grey90"),
        plot.background = element_rect(fill = "grey90"),
        panel.border = element_blank(),
        legend.background =  element_rect(fill = "white", color = "grey30"),
        legend.margin = margin(5, 10, 5, 10),
        legend.key = element_blank(),
        legend.justification = c(0, 1), legend.position = c(0.5, 0.5),
        panel.grid.major = element_line(colour = "grey80", size = 0.5),
        panel.grid.major.y = element_blank(),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_text(family = "Helvetica Neue"),
        axis.text.y = element_text(family = "Helvetica Neue"),
        plot.title = element_text(hjust = 0, size=16, 
                                  family = "Helvetica Neue",
                                  margin = margin(b=10)),
        plot.subtitle = element_text(hjust = 0, size = 12,
                                     margin = margin(b = 15)))

plot of chunk by-country

Finally, I plot the trend in species seen again, but colour each segment by continent. Since Noah visited Asia three times, with visits to Africa and Oceania in between, I need to group the dataset by continuous stretches of time in each continent. My solution comes from this StackOverflow response.

trend <- bigyear %>% 
  group_by(date, continent_name) %>% 
  summarize(n_species = max(species_num)) %>% 
  ungroup %>% 
  mutate(idx = (continent_name != lag(continent_name)),
         continent_name = reorder(factor(continent_name), date, min))
trend$grp <- c(1, which(trend$idx), nrow(trend) + 1) %>% 
  diff %>% 
  rep(1:length(.), .)
explanation_caption <- wrap_format(100)(
  paste(
    "Noah's rate of species accumulation was nearly costant at 16.5 species per day",
    "throughout the course of his Big Year. He was able to",
    "maintain this impressive rate by frequently moving within",
    "and between countries as he began encountering fewer new birds.",
    sep = " "))
ggplot(trend, aes(x = date, y = n_species, color = continent_name)) +
  geom_line(aes(group = grp), size = 1) +
  scale_x_date(labels = date_format("%b"),
               breaks = date_breaks("month")) +
  scale_y_continuous(breaks = 1000 * (0:6),
                     labels = comma) +
  scale_color_brewer(name = NULL, palette = "Set1") +
  labs(x = NULL, y = "# species",
       title = "Big Year Species Accumulation",
       subtitle = explanation_caption) +
  geom_hline(yintercept = 0, size = 1, colour = "grey30") +
  theme_bw() +
  theme(text = element_text(family = "Helvetica Neue Light",
                            size = 12, colour = "grey30"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        panel.background = element_rect(fill = "grey90"),
        plot.background = element_rect(fill = "grey90"),
        panel.border = element_blank(),
        legend.background =  element_rect(fill = "white", color = "grey30"),
        legend.margin = margin(5, 10, 5, 10),
        legend.key = element_blank(),
        legend.text = element_text(size = 12, family = "Helvetica Neue"),
        legend.justification = c(0, 1), legend.position = c(0.65, 0.61),
        panel.grid.major = element_line(colour = "grey80", size = 0.5),
        panel.grid.minor = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_text(family = "Helvetica Neue"),
        axis.text.x = element_text(family = "Helvetica Neue Bold"),
        plot.title = element_text(hjust = 0, size=16, 
                                  family = "Helvetica Neue",
                                  margin = margin(b=10)),
        plot.subtitle = element_text(hjust = 0, size = 8,
                                     family = "Helvetica Neue",
                                     margin = margin(b = 10)))

plot of chunk trend-cont

Map of sightings

To finish things off, I’ll look at a map of Noah’s sightings. Since I’ve already been using ggplot in this post, I’ll use it to produce the map, although I typically prefer other tools for mapping in R. The following is inspired by this nice blog post on making world maps with ggplot.

First I download, unzip, and load shapefiles for country boundaries, graticules, and a bounding box.

base_url <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/"
tf <- tempfile()
download.file(paste0(base_url, "cultural/ne_110m_admin_0_countries.zip"), tf)
unzip(tf, exdir = "data/big-year/", overwrite = TRUE)
unlink(tf)
download.file(paste0(base_url, "physical/ne_110m_graticules_all.zip"), tf)
unzip(tf, exdir = "data/big-year/", overwrite = TRUE)
unlink(tf)

world_wgs <- shapefile("data/big-year/ne_110m_admin_0_countries.shp")
bbox_wgs <- shapefile("data/big-year/ne_110m_wgs84_bounding_box.shp")
grat_wgs <- shapefile("data/big-year/ne_110m_graticules_20.shp")

This shapfile is more granular than I require, so I aggregate it so that each ISO alpha-2 code corresponds to a single polygon.

world_wgs <- gUnaryUnion(world_wgs, id = world_wgs$iso_a2)

These shapefiles are currently in unprojected coordinates (i.e. lat/long), so I project them to the Winkel tripel projection, a nice compromise projection for global maps, which is used by National Geographic. In addition, I convert the spatial objects to data frames to appease ggplot.

world_wk_df <- spTransform(world_wgs, "+proj=wintri") %>% 
  fortify
bbox_wk_df <- spTransform(bbox_wgs, "+proj=wintri") %>% 
  fortify
grat_wk_df <- spTransform(grat_wgs, "+proj=wintri") %>% 
  fortify

Now I bring in the country-level sightings data.

world_wk_df <- left_join(world_wk_df, country_summ, 
                         by = c("id" = "country_code"))

Finally, I create the map:

ggplot(bbox_wk_df, aes(long, lat, group = group)) +
  geom_polygon(fill = "light blue") +
  geom_path(data = grat_wk_df, aes(long, lat, group = group, fill = NULL), 
            linetype = "dashed", color = "grey70", size = 0.25) +
  geom_polygon(data = world_wk_df, 
               aes(long, lat, group = group, fill = n_species), 
               color = "white", size = 0.15) +
  scale_fill_gradient(name = "# new species", limits = c(0, 500), 
                       low = "yellow", high = "red") +
  labs(title = "New Species Added to Noah's Big Year",
       subtitle = "Number of new species added in each country") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_equal() +
  theme(axis.line = element_blank(),
        axis.text.x = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank(),
        legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(hjust = 0, size = 18, margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = 0, size = 13, margin = margin(b = 15)))

plot of chunk map

Far from perfect in my opinion, but not a terrible map by any means. Here’s one more, with species per day.

ggplot(bbox_wk_df, aes(long, lat, group = group)) +
  geom_polygon(fill = "light blue") +
  geom_path(data = grat_wk_df, aes(long, lat, group = group, fill = NULL), 
            linetype = "dashed", color = "grey70", size = 0.25) +
  geom_polygon(data = world_wk_df, 
               aes(long, lat, group = group, fill = species_per_day), 
               color = "white", size = 0.15) +
  scale_fill_gradient(name = "# new species / day", limits = c(0, 40),
                       low = "yellow", high = "red") +
  labs(title = "New Species Added to Noah's Big Year",
       subtitle = "Average number of new species added per day") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_equal() +
  theme(axis.line = element_blank(),
        axis.text.x = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank(),
        legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(hjust = 0, size = 18, margin = margin(b = 10)),
        plot.subtitle = element_text(hjust = 0, size = 13, margin = margin(b = 15)))

plot of chunk map-by-day

Conclusions

Since I only have data on number of new species seen each day, all these continent and country level patterns are confounded by the order in which countries are visited. Obviously the distribution of many birds spans country and continental lines. So, for example, since Noah visited Ecuador after neighbouring Peru, many of the birds he saw in Ecuador would already have been seen in Peru. Also, many many tropical birds migrate poleward in spring, so Noah would have already seen many North American migrants on their wintering grounds in Central and South America.

It would be really interesting to get ahold of the full set of Noah’s eBird checklists for the year. With these data, this analysis could be expanded and refined. Regardless, I think some interesting insights were gained from the existing dataset.