On more than one occasion I’ve taken the brutally long-haul flight from Toronto to Hong Kong with Air Canada. Given that I’m totally unable to sleep on planes, almost 16 hours crammed into a tiny economy class seat is pretty rough! This got me thinking: what is the longest regularly scheduled, commercial long-haul flight?

Wikipedia has the answer (no surprise) in the form of a table listing the top 30 longest flights by distance. Turns out the longest flight is from Dallas to Sydney, clocking in at almost 17hours and 13,804 km. This is 1.5 hours longer than my Hong Kong-Toronto flight, which comes in at number 24 on the list.

Of course, I couldn’t resist scraping these data from Wikipedia and mapping the flights. I’ll use this opportunity to practice plotting with ggplot, since I’ve recently been trying to gain more experience using this package for mapping spatial data.

Note: while in the process of making this map, I found a post on the Vizual Statistix blog with a very similar idea. However, he doesn’t discuss how the map is map or provide source code.

If you don’t care about R, skip to the bottom for the final map.

Required packages

library(knitr)
library(sp)
library(raster)
library(rgdal)
library(rgeos)
library(cleangeo)
library(geosphere)
library(plyr)
library(dplyr)
library(rvest)
library(stringr)
library(tidyr)
library(lubridate)
library(ggplot2)
library(ggmap)
library(ggrepel)
library(ggalt) #devtools::install_github("hrbrmstr/ggalt")
library(viridis)

Scraping and cleaning

The rvest package makes web scraping a breeze. I just read the html, extract out any tables, pick the first table on the page (this is the only one I’m interested in), and parse it into a dataframe with html_table().

flights <- read_html('https://en.wikipedia.org/wiki/Non-stop_flight') %>% 
  html_nodes('.wikitable') %>% 
  .[[1]] %>% 
  html_table(fill = TRUE)

As usual there are some issues with the imported data. First, the Wikipedia table has cells spanning multiple rows corresponding to flights on the same route with different airlines. The rvest help explicitly states that it can’t handle rows spanning multiple columns. In addition, the column headers are not nice variable names.

I fix these issues below.

# variable names
names(flights) <- c("rank", "from", "to", "airline", "flight_no", "distance",
                    "duration", "aircraft", "first_flight")
# cells spanning multiple rows
row_no <- which(is.na(flights$first_flight))
problem_rows <- flights[row_no, ]
fixed_rows <- flights[row_no - 1, ]
fixed_rows$rank <- problem_rows[, 1]
fixed_rows$airline <- problem_rows[, 2]
fixed_rows$flight_no <- problem_rows[, 3]
fixed_rows$duration <- problem_rows[, 4]
fixed_rows$aircraft <- problem_rows[, 5]
flights <- flights[-row_no, ]
flights <- rbind(flights, fixed_rows) %>% 
  arrange(rank)

The next step is cleaning the data, and there are a variety of issues here: 1. Footnotes need to be cleaned out of some cells 2. Destinations sometimes have city and airport 3. Some routes have multiple flight numbers for the same airline 4. Distances are given in three units all within the same cell 5. Durations aren’t in a nice format to work with 5. Some routes have different durations for winter and summer

Nothing stringr and some regular expressions can’t handle!

flights <- flights %>% 
  mutate(rank = as.integer(str_extract(rank, "^[:digit:]+")),
         from = str_extract(from, "^[[:alpha:] ]+"),
         to = str_extract(to, "^[[:alpha:] ]+"))
# make multiple flight numbers comma separated
flights$flight_no <- str_replace_all(flights$flight_no, "[:space:]", "") %>% 
  str_extract_all("[:alpha:]+[:digit:]+") %>% 
  laply(paste, collapse = ",")
# only consider distances in km, convert to integer
flights$distance <- str_extract(flights$distance, "^[0-9,]+") %>% 
  str_replace(",", "") %>% 
  as.integer
# convert duration to minutes and separate into summer/winter schedules
flights <- str_match_all(flights$duration, "([:digit:]{2}) hr ([:digit:]{2}) min") %>% 
  llply(function(x) {60 * as.integer(x[, 2]) + as.integer(x[, 3])}) %>% 
  llply(function(x) c(x, x)[1:2]) %>% 
  do.call(rbind, .) %>% 
  data.frame %>% 
  setNames(c("duration_summer", "duration_winter")) %>% 
  mutate(duration_max = pmax(duration_summer, duration_winter)) %>% 
  cbind(flights, .)
# first_flight to proper date
flights$first_flight <- str_extract(flights$first_flight, "^[0-9-]+") %>% 
  ymd %>% 
  as.Date
flights <- flights %>% 
  mutate(route = paste(from, to, sep = "-")) %>% 
  dplyr::select(rank, route, from, to, airline, flight_no, distance, 
         duration = duration_max, duration_summer, duration_winter,
         first_flight)

Now the table is in a nice clean format and ready for display.

dplyr::select(flights, rank, route, airline, distance, duration) %>% 
  kable(format.args =  list(big.mark = ','),
        col.names = c("rank", "route", "airline", "distance (km)", "duration (min)"))
rank route airline distance (km) duration (min)
1 Auckland-Dubai Emirates 14,203 1,035
2 Dallas-Sydney Qantas 13,804 1,015
3 Johannesburg-Atlanta Delta Air Lines 13,582 1,000
4 Abu Dhabi-Los Angeles Etihad Airways 13,502 990
5 Dubai-Los Angeles Emirates 13,420 995
6 Jeddah-Los Angeles Saudia 13,409 1,015
7 Doha-Los Angeles Qatar Airways 13,367 985
8 Dubai-Houston Emirates 13,144 980
9 Abu Dhabi-San Francisco Etihad Airways 13,128 975
10 Dallas-Hong Kong American Airlines 13,072 1,025
11 Dubai-San Francisco Emirates 13,041 950
12 New York-Hong Kong Cathay Pacific 12,983 975
13 Newark-Hong Kong Cathay Pacific 12,980 960
14 Newark-Hong Kong United Airlines 12,980 NA
15 Abu Dhabi-Dallas Etihad Airways 12,962 980
16 Doha-Houston Qatar Airways 12,951 980
17 Dubai-Dallas Emirates 12,940 980
18 New York-Guangzhou China Southern Airlines 12,878 965
19 Boston-Hong Kong Cathay Pacific 12,827 950
20 Johannesburg-New York South African Airways 12,825 965
21 Houston-Taipei EVA Air 12,776 955
22 Doha-Dallas Qatar Airways 12,764 980
23 Los Angeles-Melbourne Qantas 12,748 950
24 Los Angeles-Melbourne United Airlines 12,748 950
25 Toronto-Hong Kong Cathay Pacific 12,569 930
26 Toronto-Hong Kong Air Canada 12,569 935
27 New York-Taipei EVA Air 12,566 970
28 New York-Taipei China Airlines 12,566 920
29 Mumbai-Newark United Airlines 12,565 965
30 Mumbai-Newark Air India 12,565 955

Geocoding

If I’m going to map these flights, I’ll need coordinates for each city in the dataset. Fortunately, the ggmaps package has a function for geocoding locations based on their name using Google Maps.

cities <- c(flights$from, flights$to) %>% 
  unique
cities[cities == "Melbourne"] <- "Melbourne, Australia"
cities <- cities %>% 
  cbind(city = ., geocode(., output = "latlon", source = "google"))
cities <- cities %>% 
  mutate(city = as.character(city),
         city = ifelse(city == "Melbourne, Australia", "Melbourne", city))

Now I bring these coordinates into the flights dataframe.

flights <- flights %>% 
  left_join(cities, by = c("from" = "city")) %>% 
  left_join(cities, by = c("to" = "city")) %>% 
  rename(lng_from = lon.x, lat_from = lat.x, lng_to = lon.y, lat_to = lat.y)

Flight paths

A great circle is the path on a spherical surface (such as the Earth) that gives the shortest distance between two points. Although I have no way of knowing what the actual flight path is for these routes, it’s likely to be reasonably approximated by a great circle. First I subset the flights dataset to only include unique routes.

flights_unique <- flights %>% 
  group_by(route) %>% 
  filter(row_number(desc(duration)) == 1)

Then I use the geosphere package to get great circle routes for each of the above flights. Since flights over the Pacific cross the International Date Line, I use breakAtDateLine = TRUE to ensure the great circle lines are broken as they cross.

gc_routes <- gcIntermediate(flights_unique[c("lng_from", "lat_from")],
                            flights_unique[c("lng_to", "lat_to")],
                            n = 360, addStartEnd = TRUE, sp = TRUE, 
                            breakAtDateLine = TRUE)
gc_routes <- SpatialLinesDataFrame(gc_routes, 
                                   data.frame(rank = flights_unique$rank,
                                              route = flights_unique$route,
                                              stringsAsFactors = FALSE))
row.names(gc_routes) <- as.character(gc_routes$rank)

Global map

As a background on which to map the flight paths, I’ll use the global map provided by Natural Earth.

base_url <- 'http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/'
tf <- tempfile()
download.file(paste0(base_url, '110m/cultural/ne_110m_admin_0_countries_lakes.zip'), tf)
unzip(tf, exdir = 'data/long-flights/', overwrite = TRUE)
unlink(tf)
world <- shapefile('data/long-flights/ne_110m_admin_0_countries_lakes.shp')

ggplot can’t handle spatial objects directly, it only works with data frames. So, I use the fortify() function to convert each spatial object to a data frame ready for plotting.

world_df <- fortify(world)
gc_routes_df <- fortify(gc_routes)

Mapping

Now that all the data are prepared, I’ll create the map. Rather than just showing the final product, I’ll build it up in steps in the hope that it’ll be instructive.

First attempt

All coordinates are currently in unprojected (i.e. lat/long) coordinates, I project them to the Kavrayskiy VII projection, a nice compromise projection for global maps. Typically, I’d project all my spatial data with sp::spTransform() before plotting, but here I’ll make use of the new coord_proj() function from the ggalt package package, which projects coordinates on the fly.

ggplot() +
  geom_polygon(data = world_df, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  geom_point(data = cities, aes(lon, lat), color = "grey20", size = 0.5) +
  geom_path(data = gc_routes_df, 
            aes(long, lat, group = group), alpha = 0.5, color = "#fa6900") +
  geom_text(data = cities, aes(lon, lat, label = city),
            size = 3, color = "grey20", alpha = 0.9, nudge_y = 2, 
            check_overlap = TRUE) +
  coord_proj("+proj=kav7") +
  scale_x_continuous(breaks = seq(-180, 180, 30)) +
  scale_y_continuous(breaks = seq(-90, 90, 15)) +
  theme(panel.grid.major = element_line(size = 0.5, linetype = 2),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

plot of chunk first-map

Looks OK, but there’s tons of room for improvement.

Changing central meridian

The default Kavrayskiy VII projection takes the Greenwich Prime Meridian as its central meridian. This is a poor choice in this case since most routes have a North American city at one end, which results in many of the routes going off the edge of the map. Centering the map on the US (around 90°W) seems the best bet, but this puts the edges of the map at 90°E, right in the middle of Asia. This makes a mess of the polygons spanning the edge.

central_meridian <- -90
proj <- sprintf("+proj=kav7 +lon_0=%i", central_meridian)
ggplot() +
  geom_polygon(data = world_df, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  geom_path(data = gc_routes_df, 
            aes(long, lat, group = group), alpha = 0.5, color = "#fa6900") +
  coord_proj(proj, ylim = c(-60, 90)) +
  theme(panel.grid.major = element_line(size = 0.5, linetype = 2),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

plot of chunk messy-boundary

Sorting this issue out turned out to be a much bigger challenge than I expected; it seems there’s no elegant solution in R. The nowrapRecenter() function from the maptools package is meant to address this issue, but it only appears to work when the date line is the new center and it still leads to artifacts when the data are projected. Simply removing a small sliver of the polygons at what will become the edge in the new projection works, but this means removing data, it results in a world map that looks chopped off at the edges, and gives a seam at the date line.

trim_edge <- function(x, lng_center, edge_tol = 1e-8) {
  if (lng_center < -180 || lng_center > 180) {
    stop("invalid longitude")
  }
  
  if (is.projected(x) || is.na(is.projected(x))) {
    stop("trim_edge only works with unprojected coordinates")
  }
  
  edge <- (lng_center + 360) %% 360 - 180
  clip <- as(extent(edge, edge + edge_tol, -90, 90), "SpatialPolygons")
  projection(clip) <- projection(x)
  row.names(clip) <- "edge"
  gd <- gDifference(x, clip, byid = TRUE)
  # return features ids to original values
  row.names(gd) <- gsub(" edge$", "", row.names(gd))
  # bring back attribute data
  if (inherits(x, "SpatialPolygonsDataFrame")) {
    gd <- SpatialPolygonsDataFrame(gd, x@data, match.ID = TRUE)
  } else if (inherits(x, "SpatialLinesDataFrame")) {
    gd <- SpatialLinesDataFrame(gd, x@data, match.ID = TRUE)
  }
  gd
}
world_trimmed <- trim_edge(world, central_meridian) %>% 
  fortify
ggplot() +
  geom_polygon(data = world_trimmed, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  coord_proj(proj)

plot of chunk trim_edge

Pretty close, but I’m quite picky about aesthetics so this isn’t going to cut it! After much frustration, I went with a fairly hacky solution, which requires two functions that split the map into two at what will become the new edge, project it, then manually flip polygon vertices on the left edge that should actually be on the right edge.

split <- function(x, edge, spacing = 0.1) {
  if (edge < -180 || edge > 180) {
    stop("invalid longitude")
  }
  if (is.projected(x) || is.na(is.projected(x))) {
    stop("split only works with unprojected coordinates")
  }
  
  if (is.na(spacing)) {
    lat_seq <- c(-90, 90)
  } else {
    lat_seq <- seq(-90, 90, length.out = ceiling(180 / spacing))
  }
  left <- rbind(
    data.frame(long = c(edge, -180, -180, edge), lat = c(90, 90, -90, -90)),
    data.frame(long = edge, lat = lat_seq)) %>% 
    {SpatialPolygons(list(Polygons(list(Polygon(.)), "left")))}
  right <- rbind(
    data.frame(long = c(edge, 180, 180, edge), lat = c(90, 90, -90, -90)),
    data.frame(long = edge, lat = lat_seq)) %>% 
    {SpatialPolygons(list(Polygons(list(Polygon(.)), "right")))}
  sides <- rbind(left, right)
  projection(sides) <- projection(x)
  
  # add original id as attribute
  gi <- gIntersection(x, sides, byid = TRUE, drop_lower_td = TRUE)
  ids <- data.frame(.id = gsub(" (left|right)$", "", row.names(gi)),
                       stringsAsFactors = FALSE)
  row.names(ids) <- row.names(gi)
  if (inherits(gi, "SpatialPolygons")) {
    gi <- SpatialPolygonsDataFrame(gi, ids, match.ID = TRUE)
  } else if (inherits(gi, "SpatialLines")) {
    gi <- SpatialLinesDataFrame(gi, ids, match.ID = TRUE)
  }
  
  # bring back attribute data
  if ("data" %in% slotNames(x)) {
    df <- x@data
    df$.id <- row.names(x)
    gi <- merge(gi, df, by = ".id")
  }
  return(gi)
}

project_recenter <- function(x, proj, union_field, union_scale = getScale()) {
  if (!(missing(union_field) || union_field == ".id" ||
        ("data" %in% slotNames(x) && union_field %in% names(x)))) {
    stop("invalid union_field; must be an attribute")
  }
  # find center from proj4 string
  central_meridian <- proj %>% 
    {regmatches(., regexec("\\+lon_0=([-0-9]+)", .))[[1]][2]} %>% 
    as.integer
  if(is.na(central_meridian)) {
    stop("Must specify lon_0 in proj4 string.")
  }
  
  # split at edge and project
  edge <- (central_meridian + 360) %% 360 - 180
  edge_sl <- sprintf("LINESTRING(%s -90,%s 90)", edge, edge) %>% 
    readWKT(p4s = projection(x))
  if (all(!gIntersects(x, edge_sl, byid = TRUE)) || central_meridian == 0) {
    return(x)
  }
  x_proj <- spTransform(split(x, edge), proj)
  
  # fix points on boundary that have been projected onto opposite boundary
  l <- gIntersects(x_proj, spTransform(edge_sl, proj), byid = TRUE)[1,]
  l <- which(l & grepl(ifelse(edge > 0, "right$", "left$"), names(l)))
  if (inherits(x_proj, "SpatialPolygons")) {
    for (i in l) {
      for (j in 1:length(x_proj@polygons[[i]]@Polygons)) {
        if (edge > 0) {
          prob <- which(x_proj@polygons[[i]]@Polygons[[j]]@coords[,1] > 0)
        } else {
          prob <- which(x_proj@polygons[[i]]@Polygons[[j]]@coords[,1] < 0)
        }
        x_proj@polygons[[i]]@Polygons[[j]]@coords[prob, 1] <-
          -x_proj@polygons[[i]]@Polygons[[j]]@coords[prob, 1]
      }
    }
    x_proj <- clgeo_Clean(x_proj)
    s <- getScale()
    setScale(union_scale)
    if (!missing(union_field)) {
      x_proj <- gUnaryUnion(x_proj, id = x_proj@data[, union_field])
    }
    setScale(s)
  } else if (inherits(x_proj, "SpatialLines")) {
    for (i in l) {
      for (j in 1:length(x_proj@lines[[i]]@Lines)) {
        if (edge > 0) {
          prob <- which(x_proj@lines[[i]]@Lines[[j]]@coords[,1] > 0)
        } else {
          prob <- which(x_proj@lines[[i]]@Lines[[j]]@coords[,1] < 0)
        }
        x_proj@lines[[i]]@Lines[[j]]@coords[prob, 1] <-
          -x_proj@lines[[i]]@Lines[[j]]@coords[prob, 1]
      }
    }
    s <- getScale()
    setScale(union_scale)
    if (!missing(union_field)) {
      x_proj <- gLineMerge(x_proj, id = x_proj@data[, union_field])
    }
    setScale(s)
  }
  return(x_proj)
}

A lot of work for such a seemingly simple task! Unfortunately, this approach means moving away from the lovely coord_proj() function from the new ggalt package. On the up side, I believe this approach is fairly general and produces the nicest results. Before I proceed, I need to reproject the routes and points. Furthermore, I no longer want the routes to be split at the Date Line, so I regenerate them.

# routes
routes_nodl <- gcIntermediate(flights_unique[c("lng_from", "lat_from")],
                              flights_unique[c("lng_to", "lat_to")],
                              n = 360, addStartEnd = TRUE, sp = TRUE, 
                              breakAtDateLine = FALSE)
routes_nodl <- SpatialLinesDataFrame(routes_nodl, 
                                     data.frame(rank = flights_unique$rank,
                                                route = flights_unique$route,
                                                stringsAsFactors = FALSE))
row.names(routes_nodl) <- as.character(routes_nodl$rank)
# Auckland-Dubai crosses edge
crosses_edge <- (routes_nodl$route == "Auckland-Dubai")
crosses_edge_sl <- project_recenter(routes_nodl[crosses_edge, ], proj,
                                    union_field = "rank",
                                    union_scale = 1e6)
crosses_edge_sl$rank <- routes_nodl[crosses_edge, ]$rank
crosses_edge_sl$route <- routes_nodl[crosses_edge, ]$route
row.names(crosses_edge_sl) <- row.names(routes_nodl[crosses_edge, ])
routes_kav_df <- spTransform(routes_nodl[!crosses_edge, ], proj) %>% 
  rbind(crosses_edge_sl, .) %>% 
  fortify
# cities
cities_wgs <- cities
coordinates(cities_wgs) <- ~ lon + lat
projection(cities_wgs) <- projection(world)
cities_kav_df <- spTransform(cities_wgs, proj) %>% 
  as.data.frame

Plotting the newly centered data.

world_kav_df <- project_recenter(world, proj, union_field = "sov_a3", 
                                 union_scale = 1e6) %>% 
  fortify
ggplot() +
  geom_polygon(data = world_kav_df, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  geom_point(data = cities_kav_df, aes(lon, lat), color = "grey20", size = 0.5) +
  geom_path(data = routes_kav_df, aes(long, lat, group = group), 
            alpha = 0.5, color = "#fa6900") +
  geom_text(data = cities_kav_df, aes(lon, lat, label = city),
            size = 3, color = "grey20", alpha = 0.9, nudge_y = 2, 
            check_overlap = TRUE) +
  theme(panel.grid.major = element_line(size = 0.5, linetype = 2),
        axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

plot of chunk meridian

Bounding box and graticules

ggplot colours the whole background the same colour, including the corners which aren’t actually part of the globe. Also, now that I’m no longer using coord_proj, I’ll have to define my own graticules. To fix this I’ll define functions to create a bounding box:

make_bbox <- function(lng, lat, spacing, proj) {
  if (is.na(spacing[1])) {
    lng_seq <- lng
  } else {
    lng_seq <- seq(lng[1], lng[2], length.out = ceiling(diff(lng) / spacing[1]))
  }
  if (is.na(spacing[2])) {
    lat_seq <- lat
  } else {
    lat_seq <- seq(lat[1], lat[2], length.out = ceiling(diff(lat) / spacing[2]))
  }
  bb <- rbind(
    data.frame(lng = lng_seq, lat = lat[1]),
    data.frame(lng = lng[2], lat = lat_seq),
    data.frame(lng = rev(lng_seq), lat = lat[2]),
    data.frame(lng = lng[1], lat = rev(lat_seq))
  ) %>% 
  {SpatialPolygons(list(Polygons(list(Polygon(.)), "bb")))}
  if (!missing(proj)) {
    projection(bb) <- proj
  }
  return(bb)
}
bb <- make_bbox(c(-180, 180), c(-90, 90), spacing = c(NA, 0.1), proj = projection(world))

And graticules:

lng_label <- function(x) {
  ifelse(x < 0, paste0("E", abs(round(x))), paste0("W", round(x)))
}
lat_label <- function(x) {
  ifelse(x < 0, paste0("S", abs(round(x))), paste0("N", round(x)))
}
make_graticules <- function(lng_breaks, lat_breaks, spacing, proj) {
  if (is.na(spacing[1])) {
    lng_seq <- c(-180, 180)
  } else {
    lng_seq <- seq(-180, 180, length.out = ceiling(360 / spacing[1]))
  }
  if (is.na(spacing[2])) {
    lat_seq <- c(-90, 90)
  } else {
    lat_seq <- seq(-90, 90, length.out = ceiling(180 / spacing[2]))
  }
  meridians <- lapply(lng_breaks, 
                      function(x, lat_seq) {
                        Lines(list(Line(cbind(x, lat_seq))), ID = lng_label(x))
                      }, lat_seq)
  parallels <- lapply(lat_breaks, 
                      function(x, lng_seq) {
                        Lines(list(Line(cbind(lng_seq, x))), ID = lat_label(x))
                      }, lng_seq)
  grat <- SpatialLines(c(meridians, parallels))
  if (!missing(proj)) {
    projection(grat) <- proj
  }
  return(grat)
}
grat <- make_graticules(seq(-150, 180, 30), seq(-90, 90, 30), 
                        spacing = c(10, 1), proj = projection(world))

These will also need to be projected, re-centered, and converted to data frames for ggplot.

bb_df <- project_recenter(bb, proj, union_field = ".id", union_scale = 1e6) %>% 
  fortify
grat_df <- project_recenter(grat, proj) %>% 
  fortify

Including these in the plot.

ggplot() +
  geom_polygon(data = bb_df, aes(long, lat, group = group),
               fill = "light blue", color = NA) +
  geom_path(data = grat_df, aes(long, lat, group = group),
            color = "grey60", size = 0.1) +
  geom_polygon(data = world_kav_df, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  geom_point(data = cities_kav_df, aes(lon, lat), color = "grey20", size = 0.5) +
  geom_path(data = routes_kav_df, aes(long, lat, group = group), 
            alpha = 0.5, color = "#fa6900") +
  geom_polygon(data = bb_df, aes(long, lat, group = group),
               fill = NA, color = "grey20") +
  geom_text(data = cities_kav_df, aes(lon, lat, label = city),
            size = 3, color = "grey20", alpha = 0.9, nudge_y = 2, 
            check_overlap = TRUE) +
  theme_nothing()

plot of chunk add-bbox

Finally, after a huge amount of work, this North America centered projection is looking good.

Finishing touches

Now for the finishing touches: adding nicer labels with ggrepel, fine tuning the overall formatting, and colouring the routes according to duration. This last task requires joining the flight attribute data to the data frame of spatial data.

routes_att_df <- mutate(flights_unique, id = as.character(rank)) %>% 
  left_join(routes_kav_df, ., by = "id")

Then applying a color gradient to the routes. I use the excellent viridis package here, which provides perceptually uniform and colour blind friendly colour gradients.

set.seed(1)
ggplot() +
  geom_polygon(data = bb_df, aes(long, lat, group = group),
               fill = "light blue", color = NA) +
  geom_path(data = grat_df, aes(long, lat, group = group),
            color = "grey60", size = 0.1) +
  geom_polygon(data = world_kav_df, aes(long, lat, group = group), 
               fill = "grey80", color = "grey60", size = 0.1) +
  geom_path(data = routes_att_df, aes(long, lat, group = group, color = duration / 60),
            size = 0.6) +
  geom_point(data = cities_kav_df, aes(lon, lat), color = "grey20", size = 0.5) +
  geom_polygon(data = bb_df, aes(long, lat, group = group),
               fill = NA, color = "grey20") +
  annotate("text", x = 0.25 * max(bb_df$long), y = 0.97 * min(bb_df$lat), 
           label = "strimas.com - data source: wikipedia", 
           color = "grey20", size = 3, family = "Times") +
  geom_text_repel(data = cities_kav_df, aes(lon, lat, label = city),
                  size = 3, color = "white", fontface = "bold",
                  segment.color = "black", segment.size = 0.25,
                  box.padding = unit(0.1, 'lines'), force = 0.1) +
  # colour gradient applied to routes, and corresponding legend
  scale_color_viridis(name = "Top 25 Longest Non-stop Flights\n", option = "D",
                      breaks = c(16, 16.5, 17),
                      labels = c("16h", "16h30m", "17h")) +
  guides(color = guide_colorbar(
    nbin = 256, title.position = "top", title.hjust = 0.5, 
    barwidth = unit(18, "lines"), barheight = unit(1, "lines"))) +
  # reduce axis padding
  scale_x_continuous(expand = c(0.01, 0)) +
  scale_y_continuous(expand = c(0.01, 0)) +
  # 1:1 ratio between x and y scales
  coord_equal() +
  #blank_theme +
  theme(text = element_text(family = "Helvetica"),
        plot.margin = unit(c(0, 0, 0, 0), "lines"),
        # position legend within plot
        legend.position = c(0.5, 0.13),
        legend.direction = "horizontal",
        legend.background = element_rect(color = "grey20"),
        legend.title = element_text(size = 16, lineheight = 0.1),
        # remove axes
        axis.line = element_blank(),
        axis.text.x = element_blank(), axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        # remove grid
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank())

plot of chunk final

rank route airline distance (km) duration (hours)
1 Auckland-Dubai Emirates 14,203 17.25
2 Dallas-Hong Kong American Airlines 13,072 17.08
3 Dallas-Sydney Qantas 13,804 16.92
4 Jeddah-Los Angeles Saudia 13,409 16.92
5 Johannesburg-Atlanta Delta Air Lines 13,582 16.67
6 Dubai-Los Angeles Emirates 13,420 16.58
7 Abu Dhabi-Los Angeles Etihad Airways 13,502 16.50
8 Doha-Los Angeles Qatar Airways 13,367 16.42
9 Dubai-Houston Emirates 13,144 16.33
10 Abu Dhabi-Dallas Etihad Airways 12,962 16.33
11 Doha-Houston Qatar Airways 12,951 16.33
12 Dubai-Dallas Emirates 12,940 16.33
13 Doha-Dallas Qatar Airways 12,764 16.33
14 Abu Dhabi-San Francisco Etihad Airways 13,128 16.25
15 New York-Hong Kong Cathay Pacific 12,983 16.25
16 New York-Taipei EVA Air 12,566 16.17
17 New York-Guangzhou China Southern Airlines 12,878 16.08
18 Johannesburg-New York South African Airways 12,825 16.08
19 Mumbai-Newark United Airlines 12,565 16.08
20 Newark-Hong Kong Cathay Pacific 12,980 16.00
21 Houston-Taipei EVA Air 12,776 15.92
22 Dubai-San Francisco Emirates 13,041 15.83
23 Boston-Hong Kong Cathay Pacific 12,827 15.83
24 Los Angeles-Melbourne Qantas 12,748 15.83
25 Toronto-Hong Kong Air Canada 12,569 15.58

Overall, I’m increasingly impressed with ggplot as a tool for mapping and I think the new packages ggalt and ggrepel are great additions. Unfortunately, making a map with a non-standard central meridian is a huge pain in R, but in the end I’m quite happy with the results!