This is part II in a series of posts about using RStudio on Amazon Web Services (AWS). In part I, I outlined how to quickly get RStudio Server running on an AWS EC2 instance. This gives you access to RStudio via a web browser with as much (or as little) computing power as you need for any given task.
Once you have RStudio up and running, you’ll likely want to transfer or sync some data and/or code between the remote AWS instance and your local machine. This tutorial covers this process.
There are two broad types of scenarios in which I use RStudio on AWS. I’ll address both of these scenarios in this tutorial.
- Code development: because my laptop is barely functional and I often work on different machines in different locations, I’ll often use a free-tier EC2 instance with RStudio to develop code. These free instances aren’t very powerful, but they’re great if you just want to write code using RStudio in your browser. Wherever you are you can open a browser, point it to the IP of your EC2 instance, and pick up coding where you left off.
- Running code: the real power of AWS comes from the ability to handle large, computationally intensive jobs. You can spin up an EC2 instance with as much memory and as many cores as needed; you’re really only limited by your budget.
Typically, each project I’m working on gets its own folder, which is also an RStudio project and a git repository. Inside the project directory is an
R/ sub-directory containing R scripts, a
data/ directory that contains the data I’ll be analyzing, and a
results/ directory containing results. If the data I’m working with is large, or I don’t want to put it online, I’ll avoid putting the data directory on GitHub.
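As a minimal sketch, the layout described above could be created from the shell as follows. The helper name create_project and the .gitignore entry are assumptions for illustration; they are not taken from the example repository.

```shell
# Hypothetical helper: create the project skeleton described above.
# The directory passed in is a placeholder of your choosing.
create_project() {
  project_dir="$1"
  # R/ holds scripts, data/ holds inputs, results/ holds outputs
  mkdir -p "$project_dir/R" "$project_dir/data" "$project_dir/results"
  # Assumption: keep large or private data out of git (and off GitHub)
  printf 'data/\n' >> "$project_dir/.gitignore"
}
```

Calling create_project my-project and then running git init inside my-project would give a folder that can also be opened as an RStudio project.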
In this tutorial the goal will be as follows:
- Deploy an AWS EC2 instance with RStudio Server (see part I)
- Get the code and data onto the instance
- Make changes to the code (scenario 1) or run the script to create outputs (scenario 2)
- Get the updated code or analysis outputs off of the instance and onto your local machine or the cloud
- Terminate the instance
For the sake of having a concrete example to work with, I’ve created a simple RStudio project for demonstration purposes. In this project, I look at global trends in forest loss using a data set taken from the UN Food and Agriculture Organization’s Forest Resources Assessment.
Git and GitHub (Scenario 1)
In general, if you’re writing code on multiple machines (e.g. your local machine and an EC2 instance), your best bet for keeping everything in sync is to use git and GitHub. These tools are specifically designed for version control, collaborative coding, and keeping code in sync between different machines. Furthermore, RStudio has great integration with git and GitHub. If you’re new to these tools, there are many great tutorials online; for R users, by far the best is Jenny Bryan’s Happy Git and GitHub for the useR. I won’t delve into the gory details of git here, so it’s best to look at Jenny’s tutorial for a proper introduction.
If your needs fall under scenario 1 above, then GitHub is a good way of getting your code onto an EC2 instance, particularly if you’re already using GitHub anyway. In addition, if you just want to run some code on AWS, you can use this approach provided you don’t need to transfer large files. If you will be working with large files, take a look at the next section on S3.
Following the instructions in part I, deploy a free-tier EC2 instance. Be sure to install the
tidyverse package, either with a start-up script or by running
install.packages("tidyverse") after start up. Navigate to the URL of your EC2 instance and log on to RStudio Server.
Now, under the Tools menu, click Shell… to open a command prompt. Run the following three commands to introduce yourself to git and turn on the credential helper, which stores your GitHub password so you don’t have to type it every time. Be sure to substitute your name and the email address associated with your GitHub account.
git config --global user.name 'Your Name'
git config --global user.email 'email@example.com'
git config --global credential.helper 'cache --timeout=10000000'
Getting data and code on to EC2
From the RStudio File menu, select New Project…, and click Version Control then Git to create a new project based on a git repository. Finally, fill in the URL for the example repository I’ve created (
https://github.com/mstrimas/aws-example), or a repo of your own. If you have a GitHub account, you may want to fork my example repository into your own account so you can push changes you’ve made back to GitHub. To do so, navigate to the GitHub repository for the example RStudio project and click the Fork button in the upper-right corner of the page. Then use the URL for your own copy of the repository when creating a new RStudio project.
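If you prefer the shell to the New Project dialog, the same step can be sketched with git clone. The helper name clone_project and the destination directory are hypothetical; only the repository URL comes from this tutorial.

```shell
# Hypothetical helper: clone a repository into a destination directory.
clone_project() {
  repo_url="$1"   # e.g. https://github.com/mstrimas/aws-example, or your fork
  dest_dir="$2"   # directory to create on the EC2 instance
  git clone "$repo_url" "$dest_dir"
}
```

For example, clone_project https://github.com/mstrimas/aws-example ~/aws-example, after which the project can be opened from the RStudio File menu.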
The RStudio project should now be copied onto your EC2 instance. Open the R script
R/forest-loss.r, which uses FAO data to calculate the trend in forest loss by continent. Make some changes to the code, for example add
glimpse(fra) on line 4 to get a concise summary of the data immediately after reading it in. Now run the whole script, which should create two new output files:
fao-fra-region.csv is a time series of forest change by continent and
forest-change.png is a graphical representation of these data.
Getting data and code changes off of EC2
You’ve now made changes to the code and created a couple of output files that you’ll likely want to get off of the EC2 instance. Provided none of the files are too large, you can just add them to your git repository and then push them to GitHub. Note that this will only work if you’re using your own GitHub repository, for example if you followed the instructions above for forking the example repository I created.
RStudio has excellent git integration, so doing this is easy. In the upper-right pane, click on the Git tab. This tab lists any files that are new or changed since your last git commit; you should see three files listed: a csv file, an image, and the R script that you modified. Tick the check boxes next to these files to add them to the next commit. Then click the Commit button and enter a commit message to commit these changes to your local git repository. Finally, click the Push button (green up arrow) to push these changes to GitHub.
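For reference, the same stage, commit, and push sequence can be sketched at the shell prompt. The helper name commit_and_push is hypothetical, and the sketch assumes you are inside a repository that was cloned from your own GitHub account.

```shell
# Hypothetical helper: stage everything new or changed, commit, and push.
commit_and_push() {
  message="$1"
  git add --all              # like ticking every check box in the Git tab
  git commit -m "$message"   # commit to the local repository
  git push                   # push to the remote the repository was cloned from
}
```

A call such as commit_and_push "Add forest loss outputs" would then mirror the three clicks described above.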
Now point your browser to your GitHub repository and you should see the new files and changes there.
S3 (Scenario 2)
Amazon’s Simple Storage Service (S3) offers extremely cheap (~$0.03 per GB per month), highly scalable cloud-based storage of objects of almost any size. In S3, data (i.e. files) are bundled together with metadata into objects, and objects are organized into buckets. There are many ways to move data to and from an EC2 instance, but S3 is perhaps the simplest.
To use S3, you’ll need to install the AWS Command Line Interface (CLI) on both your local machine and the EC2 instance, then configure the AWS CLI with your credentials.
Local CLI install
For Linux and Mac OS X users, run the following commands in the Terminal:
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
rm -r awscli-bundle*
For Windows users, use the installer provided by AWS.
EC2 CLI install
Follow the instructions in part I to connect to your EC2 instance via SSH. Then run the following command to install the CLI:
sudo apt install awscli
Once this finishes, disconnect from the EC2 instance.
Configure the AWS CLI
Now you’ll need to configure the AWS CLI to interact with AWS. This will need to be done on your local machine and the EC2 instance. For this, you’ll need a pair of security keys, which you can generate on the Security Credentials page of the AWS Console. Expand the Access Keys section, click Create New Access Key and a
rootkey.csv file should be downloaded (don’t lose this file). This file contains two keys that you’ll need: an access key and a secret access key. Note that these are your root account credentials, which AWS suggests not using for security reasons. In a later post I’ll address the correct way to do this.
On your local machine, open a Terminal window, type
aws configure to start the CLI configuration, and paste in your access keys from the
rootkey.csv file when prompted:
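aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE # replace with your key
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # replace with your key
Default region name [None]: # optional, press Enter to skip
Default output format [None]: # optional, press Enter to skip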
Finally, while logged in to your RStudio Server on your EC2 instance, select Shell… from the Tools menu and run the same commands to configure the CLI. You’re now ready to start using the AWS CLI to work with S3.
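Before moving on, it may be worth confirming that the CLI is installed and picking up your keys on each machine. The sketch below is one way to do that; the helper name check_aws_cli is hypothetical, while aws --version and aws configure list are standard CLI commands.

```shell
# Hypothetical helper: confirm the AWS CLI is installed and configured.
check_aws_cli() {
  # Stop early if no aws command is available in this shell
  command -v aws >/dev/null 2>&1 || { echo "aws CLI not found" >&2; return 1; }
  aws --version        # prints the installed CLI version
  aws configure list   # shows which access key and region are in use
}
```

Run check_aws_cli once locally and once in the RStudio Server shell; the listed access key should be a masked version of the one you pasted in.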
Create a bucket
On S3, all files are stored in buckets, so let’s create our first bucket. Open a Terminal window on your local machine and enter:
aws s3 mb s3://example-bucket/
This creates a bucket named
example-bucket. Note that bucket names must be globally unique across all of S3, so make sure you replace
example-bucket with something more unique and use that bucket in all subsequent commands.
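Bucket names also have to follow some formatting rules, so a quick local check can save a failed request. The sketch below covers only the basic rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). The helper name is hypothetical, and it cannot tell you whether a name is already taken.

```shell
# Hypothetical helper: test a candidate name against the basic S3 bucket
# naming rules. It does NOT check global uniqueness or the finer rules
# (for example, names formatted like IP addresses are also rejected by S3).
is_valid_bucket_name() {
  # First and last characters: lowercase letter or digit.
  # Middle (1 to 61 characters): lowercase letters, digits, dots, hyphens.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}
```

For instance, is_valid_bucket_name "example-bucket" succeeds, while a name containing uppercase letters or starting with a hyphen fails.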
Moving files to and from S3
Download and unzip the example project for this tutorial. This should create a directory named
aws-example-master/. In the Terminal, navigate to the parent directory of
aws-example-master/, and run the following command to copy the entire directory to S3:
aws s3 cp aws-example-master s3://example-bucket/aws-example-master/ --recursive
The --recursive flag copies files recursively and is useful when you want to copy entire directories. To copy just a single file to S3, use:
aws s3 cp filename.csv s3://example-bucket/
Now that the project we want to work with is on S3, we’ll need to bring that project onto our EC2 instance. Once again, in your RStudio Server session on EC2 open a command prompt by selecting “Shell…” from the “Tools” menu. Change to the RStudio home directory and copy the project from S3 to your EC2 instance with
cd ~
aws s3 cp s3://example-bucket/aws-example-master/ aws-example-master/ --recursive
Next open the RStudio project that you just copied from S3 (
aws-example-master/aws-example.Rproj). Open and run the script in the
R/ directory to generate the output csv and image files. Now that we’re done running the script, we want to get the output off the EC2 instance so we can terminate it (remember you’re paying by the hour!). Here we’ll use
sync rather than
cp, which will only upload new or changed files to S3. At the shell prompt in your cloud-based RStudio instance run:
aws s3 sync . s3://example-bucket/aws-example-master/
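As an optional extra, a sync can be previewed before anything is transferred. The sketch below uses the CLI’s --dryrun and --exclude options; the helper name preview_sync and the decision to skip .git/ and .Rproj.user/ are assumptions about what you would not want stored in the bucket.

```shell
# Hypothetical helper: list what a sync of the current directory would
# upload, without transferring anything.
preview_sync() {
  destination="$1"   # e.g. s3://example-bucket/aws-example-master/
  # --dryrun prints the planned operations only; --exclude skips matching paths
  aws s3 sync . "$destination" --exclude ".git/*" --exclude ".Rproj.user/*" --dryrun
}
```

Dropping --dryrun from the command performs the actual upload with the same exclusions.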
Now that the files are on S3, you can terminate the EC2 instance safely. If you want to bring these files onto your local machine, change directory to the project directory then run the following command:
aws s3 sync s3://example-bucket/aws-example-master/ .
I’ve described two approaches to transferring entire RStudio projects, including data, to an EC2 instance. However, you may also have a scenario where your code is on GitHub, but your data remains on your local machine because it’s quite large. In this case, you can use a hybrid approach in which you transfer your code to the EC2 instance using GitHub and you transfer individual data files or the whole data directory using S3. Once you understand how each of these tools works it’s easy to combine them in different ways.
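The hybrid approach might look something like the sketch below, run in the shell on the EC2 instance. The helper name fetch_project and all three arguments are placeholders; it assumes the data lives under a data/ prefix in your bucket and belongs in the project’s data/ directory.

```shell
# Hypothetical helper combining the two approaches: code from GitHub,
# data from S3. All three arguments are placeholders you supply.
fetch_project() {
  repo_url="$1"      # e.g. the URL of your fork of the example repository
  data_source="$2"   # e.g. s3://example-bucket/aws-example-master/data/
  project_dir="$3"   # where to put the project on the EC2 instance
  # Get the code; stop if the clone fails
  git clone "$repo_url" "$project_dir" || return 1
  # Pull only the data directory from S3 into the cloned project
  aws s3 sync "$data_source" "$project_dir/data/"
}
```

When the job finishes, code changes go back via git push and large outputs go back via aws s3 sync, as in the two scenarios above.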
Further S3 details
To learn more about using S3 through the AWS CLI consult the AWS CLI Command Reference. Some particularly useful commands are:
- List buckets
aws s3 ls
- List files within a bucket
aws s3 ls s3://example-bucket/
- Remove files from a bucket
aws s3 rm s3://example-bucket/fao-fra.csv
- Remove a bucket (must empty bucket first)
aws s3 rb s3://example-bucket/
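Since a bucket has to be empty before it can be removed, the last two commands are often used together. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: delete every object in a bucket, then the bucket
# itself. This cannot be undone, so double-check the bucket name first.
remove_bucket() {
  bucket_url="$1"   # e.g. s3://example-bucket/
  # Empty the bucket; stop here if that fails
  aws s3 rm "$bucket_url" --recursive || return 1
  # Remove the now-empty bucket
  aws s3 rb "$bucket_url"
}
```

The CLI also offers aws s3 rb with the --force flag, which empties and removes a bucket in one step.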
Finally, to make files publicly available use the
--acl public-read flag with
cp or sync. For example:
aws s3 cp fao-fra.csv s3://example-bucket/ --acl public-read
Files made public in this way are available through standard URLs of the form
http://bucket.s3.amazonaws.com/file. For example,
fao-fra.csv could be downloaded from