# Chapter 3 Habitat Covariates

## 3.1 Introduction

Species distribution models work by finding associations between species occurrence or abundance and environmental variables. Using these relationships, it’s possible to predict the distribution in areas that aren’t sampled, provided we know the value of the environmental variables in these areas. Therefore, to proceed with the modeling in the next several chapters, we’ll need to prepare a suite of environmental variables to be used as covariates in our models. The particular set of covariates that’s most suitable for a given study will depend on the focal species, region, and time period, as well as the availability of data. When species distributions are well defined by the environmental covariates, extrapolations to unsurveyed areas will be more accurate. So, itss worth considering which environmental covariates are important for your species. Scientists can use many variables to characterise a species distribution - for example, climate, weather, and soil type. Here we use only landcover and elevation as example environmental covariates.

Fortunately, there are an abundance of freely available, satellite-based land cover products derived from satellites such as Landsat, SPOT, and MODIS that are suitable for distribution modeling. This land cover data will act as a proxy for habitat and throughout this book we’ll often use habitat and land cover interchangeably. In addition, we’ll include elevation as an additional covariate, which can be important for many species.

For the examples in this book, we’ll use land cover covariates derived from the MODIS MCD12Q1 v006 land cover product (Friedl and Sulla-Menashe 2015). This product has global coverage at 500 m spatial resolution and annual temporal resolution from 2001-2018. These data are available for several different classification schemes. We’ll use the University of Maryland (UMD) land cover classification, which provides a globally accurate classification of land cover in our experience. This system classifies pixels into one of 16 different land cover classes:

Class Name Description
0 Water bodies At least 60% of area is covered by permanent water bodies.
1 Evergreen Needleleaf Forests Dominated by evergreen conifer trees (canopy >2m). Tree cover >60%.
2 Evergreen Broadleaf Forests Dominated by evergreen broadleaf and palmate trees (canopy >2m). Tree cover >60%.
3 Deciduous Needleleaf Forests Dominated by deciduous needleleaf (e.g. larch) trees (canopy >2m). Tree cover >60%.
4 Deciduous Broadleaf Forests Dominated by deciduous broadleaf trees (canopy >2m). Tree cover >60%.
5 Mixed Forests Dominated by neither deciduous nor evergreen (40-60% of each) tree type (canopy >2m). Tree cover >60%.
6 Closed Shrublands Dominated by woody perennials (1-2m height) >60% cover.
7 Open Shrublands Dominated by woody perennials (1-2m height) 10-60% cover.
8 Woody Savannas Tree cover 30-60% (canopy >2m).
9 Savannas Tree cover 10-30% (canopy >2m).
10 Grasslands Dominated by herbaceous annuals (<2m).
11 Permanent Wetlands Permanently inundated lands with 30-60% water cover and >10% vegetated cover.
12 Croplands At least 60% of area is cultivated cropland.
13 Urban and Built-up Lands At least 30% impervious surface area including building materials, asphalt, and vehicles.
14 Cropland/Natural Vegetation Mosaics Mosaics of small-scale cultivation 40-60% with natural tree, shrub, or herbaceous vegetation.
15 Non-Vegetated Lands At least 60% of area is non-vegetated barren (sand, rock, soil) or permanent snow and ice with less than 10% vegetation.
255 Unclassified Has not received a map label because of missing inputs.

For a wide range of studies, this MODIS land cover dataset will be suitable for generating habitat covariates; however, there may be particular cases where the study species, habitat, or ecological question requires different, or more specialized, data. For example, shorebird distribution modeling would benefit from data on the extent of tidal flats, seabirds distributions are often influenced by ocean depth, and in many regions elevation plays a critical role in shaping species distributions. Regardless of which habitat data you decide to use for your project, this chapter should provide a template for how to prepare these data as covariates for modeling species distributions.

The following section will cover how to access and download MODIS land cover data. Next, we’ll demonstrate how to summarize these data within a neighborhood around each checklist location. Then, we’ll calculate a set of covariates over a regular grid, which we’ll use to make predictions of species distributions throughout our study area. Finally, as an example of including covariate data from multiple sources, we’ll demonstrate how to incorporate elevation data as an additional covariate. If you want to skip this section and jump straight to the modeling, you can download the data package, which includes all the prepared MODIS data that we’ll use in the remainder of this book.

## 3.2 Downloading MODIS data

As with most satellite data, MODIS data are provided as 1200 km by 1200 km tiles for ease of download. Each tile is a raster GIS dataset consisting of a regular grid of 500 m resolution cells. The surface of the Earth is divided up into a grid of these tiles, each given an ID, for example, h10v12 is the tile from the 10th column and 12th row of the grid. Compiling MODIS data for a given region requires figuring out which set of tiles covers the region, downloading those tiles, combining the tiles together into a single raster dataset, and converting from the native MODIS HDF format, which R can’t read, to a standard GeoTIFF format. This needs to be done for each year for which we want habitat data, and can be a time consuming and error prone process. Fortunately, the R package MODIS automates most of these steps. Unfortunately, this package can be challenging and confusing to get working. With this in mind, this section will provide detailed instruction for setting up and using the MODIS package.

Let’s start by figuring out the tile IDs for the tiles that BCR 27 spans. Recall that we prepared a BCR boundary in Section 1.3.6 of the Introduction; if you haven’t already done so, download the data package now to get that boundary. Given a set of spatial features, the MODIS package can quickly tell us which MODIS tiles we need.

So, we’ll need to download these three tiles for each of the 10 years from 2010-2019.

### 3.2.1MODIS setup

Before we start using MODIS for the first time, a bit of setup is required. First, sign up for a NASA Earthdata account to get access to MODIS, and other NASA data. Then use MODIS::EarthdataLogin(usr = "username", pwd = "password"), with the username and password you just created, to store your login credentials so the MODIS package can access them.

Next, you’ll need to install GDAL, an open source library for working with geospatial data that’s needed for processing the MODIS tiles. The steps for installing GDAL are system dependent:

• Mac OS X: First, check if GDAL is installed with HDF4 support by running gdal-config --formats in Terminal. If you see hdf4 in the list, you don’t need to do anything else! If not, install the Homebrew package manager by following the instructions on the website. Then, run the following commands in Terminal to install GDAL:
brew tap osgeo/osgeo4mac
brew install hdf4
brew link --overwrite hdf4
brew install osgeo-gdal
brew link --force osgeo-gdal
• Windows: install GDAL using OSGeo4W, a suite of open source geospatial tools. In R, run MODIS:::checkTools("GDAL"), which will search your system for GDAL and suggest a command such as MODIS::MODISoptions(gdalPath = "c:/OSGeo4W64/bin") that will make GDAL available to the MODIS package. Run this command and, when it asks, agree to making the settings permanent.
• Linux: run sudo apt-get install gdal-bin in the terminal.

Finally, run MODIS:::checkTools("GDAL") to check that GDAL is installed and that the MODIS package can find it. If GDAL can’t be found, you’ll need to manually locate it and use MODIS::MODISoptions(gdalPath = "path/to/gdal/") to tell the MODIS package where it is.

### 3.2.2 Download using R

Once all the setup steps have been completed, we can start downloading some data! The MODIS function runGdal() downloads and processes MODIS tiles into a single GeoTIFF for each year. Note that at the time of writing, land cover data from 2019 haven’t been prepared yet, so we’ll use 2018 data for both 2018 and 2019. The key arguments to runGdal() are:

• product: is the specific MODIS product to download. For a full list of available datasets use MODIS::getProduct().
• collection: each MODIS product may have multiple collections, corresponding roughly to versions. Use getCollection() to find the available collection for a given product.
• SDSstring: a string specifying which bands to extract, with zeros for bands to drop and 1 for bands to keep. Most MODIS products have multiple bands stored in a single raster file, for example, reflectances in different wavelength ranges or, in our case, land cover using different land cover classification systems. The documentation for the MCD12Q1 dataset shows that there are 13 bands in the downloaded files, and we’re interested in band 2, which contains the UMD landcover classification.
• extent: any of several different spatial objects specifying the region that we want data for. In our case, we’ll use the BCR polygon; however, for a list of available options consult the help for getTile(). Note that runGdal() will return raster data in the same projection as the input extent, which is why we projected the BCR boundary to the MODIS sinusoidal projection.
• begin and end: the start and end dates of the time period from which to extract data. Although the land cover data are only available annually, we need to specify full dates because some other products are available on a more granular basis.
• outDirPath: directory to store processed MODIS data.
• job: a name for this task, which will become the sub-directory of outDirPath within which the processed data are stored.

If everything ran smoothly, we now have annual GeoTIFFs of MODIS land cover data that we can load into R. You may see error messages stating Cannot find proj.db, or something similar, these can be safely ignored provided the modis have been created in data/modis/ directory.

These data have not been prepared yet for the last couple years, so we’ll need to fill in the missing years using the most recent year for which there is data. To facilitate that, let’s figure out which is the most recent year with data.

So, we have landcover data up to 2018.

### 3.2.3 Troubleshooting

If the call to runGDAL() didn’t work for you, don’t worry, you’re not alone! It’s challenging to get the MODIS package working and errors are common when you’re first trying to get it set up. The most common error is not having GDAL installed correctly, which will give an error like GDAL not installed or configured. Either you don’t have GDAL at all or you have it, but it doesn’t have support for HDF4 files (this is the native format for MODIS data). Try following the above instructions again. If it still doesn’t work, consult the instructions on the MODIStsp website for installing GDAL.

Another error you may see is: Make sure either 'wget' or 'curl' is available in order to download data from LP DAAC or NSIDC.. This should only arise on versions of Windows before Windows 10. If you see this error, you’ll need to install curl, which is used by R to download the MODIS tiles. There is a StackOverflow question with excellent instructions for installing curl and getting it setup on your system.

If these tips haven’t solved your particular problem, you’ll need to turn to Google to troubleshoot or find someone who has experience with these tools and ask them to help. Good luck!

## 3.3 Landscape metrics

At this point we could use the MODIS land cover data directly, simply extracting the land cover class for each checklist location. However, we instead advocate summarizing the land cover data within a neighborhood around the checklist locations. As discussed in Section 1.1, checklist locations are not precise, so it’s more appropriate to use the habitat in the surrounding area, rather than only at the checklist location. More fundamentally, organisms interact with their environment not at a single point, but at the scale of a landscape, so it’s important to include habitat information characterizing a suitably-sized landscape around the observation location.

There are a variety of landscape metrics that can be used to characterize the composition (what habitat is available) and configuration (how that habitat is arranged spatially) of landscapes. The simplest metric of landscape composition is the proportion of the landscape in each land cover class (PLAND in the parlance of FRAGSTATS). For a broad range of scenarios, PLAND is a reliable choice for calculating habitat covariates in distribution modeling. Based on our experience working with eBird data, an approximately 2.5 km by 2.5 km neighborhood (5 by 5 MODIS cells) centered on the checklist location is sufficient to account for the spatial precision in the data when the maximum distance of travelling counts has been limited to 5 km, while being a relevant ecological scale for many bird species.

We’ll start by finding the full set of unique checklists locations for each year in the eBird data. Then we convert these locations to spatial sf features and project them to the sinusoidal equal area projection used by MODIS. We’ll buffer these points to create a neighborhood around each location with a diamter equal to 5 MODIS cells. Finally, we split the neighborhoods up by year so we can match to MODIS land cover data from the corresponding year.

Now, we’ll loop over the years and for each square neighborhood extract all the raster values within that neighborhood. We use the velox package for this, since it’s often orders of magnitude faster than using raster::extract().

Now we have the set of land cover values within a neighborhood around each checklist location. We can summarize these data within each neighborhood to calculate PLAND: the proportion of the neighborhood within each land cover class.

## 3.4 Prediction surface

After fitting species distribution models, the goal is typically to make predictions throughout the study area. To do this, we’ll need a regular grid of habitat covariates over which to make predictions. In this section, we’ll create such a prediction surface for BCR 27 using the MODIS land cover data from the most recent year for which they’re available. To start, we’ll need a template raster with cells equal in size to the neighborhoods we defined in the previous section: 5 by 5 MODIS land cover cells. We can use raster::aggregate() to achieve this. We’ll also use raster::rasterize() to assign the value 1 to all cells within BCR 27 and leave all cells outside BCR 27 empty.

Next, for each cell of this raster, we’ll calculate the PLAND metrics using the same approach as the previous section. Note that we will only be creating this prediction surface for the most current year of landcover data in our example.

Keeping these data in a data frame is a compact way to store them and will be required once we make model predictions in later chapters. However, we can always use the raster template to convert these PLAND metrics into a spatial format, for example, if we want to map them. Let’s look at how this works for land cover class 4: deciduous broadleaf forest.

## 3.5 Elevation

In some scenarios, you may want to include additional covariates to complement the land cover variables. There is a wealth of open access raster data available for this purpose; however, in most cases, these data will not have a simple R interface for accessing them. Instead, you’ll typically have to manually download and process these data. As an example of how this works, we’ll demonstrate how to include covariates for elevation, which frequently plays an important role in shaping species distributions.

Amatulli et al. (2018) provide a suite of global, 1km resolution topographic variables designed for use in distribution modeling. A range of variables are available, including elevation, slope, roughness, and many others; we’ll focus on elevation here, but the approach can easily be applied to other variables. To start, visit the website for these data, download the 1 km resolution median elevation product, and save the file (elevation_1KMmd_GMTEDmd.tif) in the data/ subdirectory of your project:

Next we’ll load the file, crop it down from it’s full global extent to just the portion we need for BCR 27, and reproject it to the MODIS sinusoidal projection.

Now we extract the elevation values within the neighborhood of each checklist location just as we did before for the land cover data. Then we’ll calculate the median and standard deviation of the elevation within each neighborhood.

We’ll need to do repeat this process to calculate the elevation covariates for the prediction surface.

Finally, we’ll combine these elevation covariates with the land cover covariates.

# checklist covariates
pland_elev_checklist <- inner_join(pland, elev_checklists, by = "locality_id")
write_csv(pland_elev_checklist, "data/pland-elev_location-year.csv")

# prediction surface covariates
pland_elev_pred <- inner_join(pland_coords, elev_pred, by = "id")
write_csv(pland_elev_pred, "data/pland-elev_prediction-surface.csv")
glimpse(pland_elev_pred)
#> Observations: 90,949
#> Variables: 22
#> $id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24… #>$ longitude        <dbl> -77.3, -77.3, -77.3, -77.3, -77.3, -77.2, -77.4, -77.3, -77.3, -77.3, -77.3, -77.2, -…
#> $latitude <dbl> 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 37.2, 3… #>$ year             <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
#> $pland_08 <dbl> 0.952, 0.857, 0.810, 0.619, 0.905, 0.714, 0.619, 1.000, 0.810, 0.857, 0.762, 0.190, 0… #>$ pland_10         <dbl> 0.0476, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0476, 0.0000, 0.000…
#> $pland_04 <dbl> 0.0000, 0.1429, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.047… #>$ pland_09         <dbl> 0.0000, 0.0000, 0.1905, 0.1905, 0.0000, 0.0000, 0.0000, 0.0000, 0.0952, 0.0476, 0.190…
#> $pland_05 <dbl> 0.0000, 0.0000, 0.0000, 0.1905, 0.0952, 0.2857, 0.0000, 0.0000, 0.0476, 0.0952, 0.000… #>$ pland_13         <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.3810, 0.0000, 0.0000, 0.0000, 0.000…
#> $pland_12 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000… #>$ pland_14         <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000…
#> $pland_01 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.000… #>$ pland_06         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $pland_00 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… #>$ pland_11         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $pland_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… #>$ pland_03         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $pland_15 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… #>$ pland_07         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $elevation_median <dbl> 40.7, 41.8, 42.5, 43.2, 42.6, 44.3, 45.5, 44.4, 43.7, 41.0, 40.9, 42.9, 42.2, 73.3, 6… #>$ elevation_sd     <dbl> 4.347, 2.135, 3.059, 3.373, 1.331, 1.717, 2.020, 3.863, 2.705, 1.817, 1.802, 0.743, 1…

This completes the data preparation. The following chapters will focus on using these data to model species distributions.

### References

Amatulli, Giuseppe, Sami Domisch, Mao-Ning Tuanmu, Benoit Parmentier, Ajay Ranipeta, Jeremy Malczyk, and Walter Jetz. 2018. “A Suite of Global, Cross-Scale Topographic Variables for Environmental and Biodiversity Modeling.” Scientific Data 5 (March): 180040. https://doi.org/10.1038/sdata.2018.40.

Friedl, Mark, and Damien Sulla-Menashe. 2015. “MCD12Q1 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V006.” NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/MODIS/MCD12Q1.006.