Amazon RDS + R

Intro

Welcome to my first post! To start things off at Data Insights, I’m going to show you how to connect to an AWS RDS instance from R.

For those of you who don’t know, RDS is an easy way to create a database in the cloud. In this post, I won’t be showing you how to setup an RDS instance, but I will show you how to connect to it if you have one running.

Let’s get started.

Step 1: The one where you got connected

You’ll need a few packages to get started.

  • DBI and RMySQL are used to connect to the database, although RMySQL is usually called without explicitely loading it (that’s just the standard)
  • tidyquant is just there to help us download some data to put in and get out of our database
  • dplyr will be used to show off an alternate way to query from the database. Note that you should get the most recent github version of dplyr, along with the database specific pieces from dbplyr.
library(DBI)
# library(RMySQL)

library(tidyquant)

# devtools::install_github("tidyverse/dplyr")
# devtools::install_github("tidyverse/dbplyr")
library(dplyr)
# library(dbplyr)

Getting connected isn’t too hard once you know what you’re looking for.

cn <- dbConnect(drv      = RMySQL::MySQL(), 
                username = "user1", 
                password = "testpassword", 
                host     = "davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com", 
                port     = 3306, 
                dbname   = "firstdb")

Let’s go through the arguments to dbConnect(), the function from DBI that we used to connect.

  • drv - The driver I used is from the RMySQL package, an implementation of the general interface provided by DBI. I’ll leave it to the experts to explain all of this.
  • username / password - You’ll have to have created a user and password on AWS first, but then you can use them here.
  • host - The host name is the Endpoint of your RDS server, without the port on the end. I’ve attached a screenshot to show where to find this. Basically, on the RDS Dashboard Instances page, hit the drop down arrow beside “MySQL” to show the Endpoint.

  • port - The rest of the Endpoint shows the port that you’ll need to access your RDS instance through. That goes here.
  • dbname - Finally, you’ll need the DB Name you used when setting up the instance. This can be found by clicking Instance Actions -> See Details, and then under Configuration Details you’ll find DB Name.

Step 2: The one where you take it for a test spin

Well, alright…that was…cool? How do we know it’s working? Let’s get some data to load into the database. We will use some Apple stock data retrieved through tidyquant.

aapl <- tq_get("AAPL")

slice(aapl, 1:10)
## # A tibble: 10 x 7
##    date        open  high   low close    volume adjusted
##    <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 2007-01-03  13.7  13.7  13.0  12.0 309579900     10.8
##  2 2007-01-04  13.3  13.6  13.3  12.2 211815100     11.0
##  3 2007-01-05  13.6  13.7  13.4  12.2 208685400     10.9
##  4 2007-01-08  13.6  13.7  13.5  12.2 199276700     11.0
##  5 2007-01-09  13.7  14.8  13.5  13.2 837324600     11.9
##  6 2007-01-10  15.0  15.5  14.8  13.9 738220000     12.5
##  7 2007-01-11  15.2  15.4  15.1  13.7 360063200     12.3
##  8 2007-01-12  15.0  15.1  14.8  13.5 328172600     12.2
##  9 2007-01-16  15.2  15.4  15.2  13.9 311019100     12.5
## 10 2007-01-17  15.5  15.5  15.1  13.6 411565000     12.2

To write the tibble (data frame) to the database, we will use another function called dbWriteTable(). It’s pretty straightforward. “name” is the name of the table you are creating, and “value” is the data frame you want to write.

dbWriteTable(cn, name = "apple", value = aapl)

Now the fun part! Let’s use a SQL query to pull it back down with dbGetQuery(). This function is a combination of dbSendQuery(), which returns a result set for your query, and dbFetch() which returns the rows from that result set.

apple_from_db <- dbGetQuery(cn, "SELECT * FROM apple;")

# This effectively is the same as
# dbReadTable(cn, "apple")

slice(apple_from_db, 1:10)
## # A tibble: 10 x 8
##    row_names date        open  high   low close    volume adjusted
##    <chr>     <chr>      <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 1         2007-01-03  86.3  86.6  81.9  83.8 309579900     10.9
##  2 2         2007-01-04  84.1  85.9  83.8  85.7 211815100     11.1
##  3 3         2007-01-05  85.8  86.2  84.4  85.0 208685400     11.1
##  4 4         2007-01-08  86.0  86.5  85.3  85.5 199276700     11.1
##  5 5         2007-01-09  86.5  93.0  85.2  92.6 837324600     12.0
##  6 6         2007-01-10  94.7  97.8  93.5  97.0 738220000     12.6
##  7 7         2007-01-11  95.9  96.8  95.1  95.8 360063200     12.5
##  8 8         2007-01-12  94.6  95.1  93.2  94.6 328172600     12.3
##  9 9         2007-01-16  95.7  97.3  95.5  97.1 311019100     12.6
## 10 10        2007-01-17  97.6  97.6  94.8  94.9 411565000     12.4

There are a huge number of functions from DBI that you can use to communicate with databases. Maybe I will cover more in a separate post, but for now, let’s move on to dplyr.

Step 3: The one with the pliers

Before dplyr 0.6.0 was announced, you’d have to disconnect, and then reconnect through a dplyr specific function, src_mysql(). That would look something like the code below. Since then, however, you can now use the DBI connection with dplyr!

# There is no need for this code anymore!
dbDisconnect(cn)

cn <- src_mysql(user     = "user1",
                password = "testpassword",
                host     = "davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com",
                port     = 3306,
                dbname   = "firstdb")

Select the apple table from the database. This does not actually pull the data into memory. It just makes a connection!

# With dplyr 0.6.0 we can just use the DBI connection!
apple_table <- tbl(cn, "apple")

# By default the first 1000 rows are displayed
apple_table
## # Source:   table<apple> [?? x 8]
## # Database: mysql 5.6.27-log
## #   [user1@davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com:/firstdb]
##    row_names date        open  high   low close    volume adjusted
##    <chr>     <chr>      <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 1         2007-01-03  86.3  86.6  81.9  83.8 309579900     10.9
##  2 2         2007-01-04  84.1  85.9  83.8  85.7 211815100     11.1
##  3 3         2007-01-05  85.8  86.2  84.4  85.0 208685400     11.1
##  4 4         2007-01-08  86.0  86.5  85.3  85.5 199276700     11.1
##  5 5         2007-01-09  86.5  93.0  85.2  92.6 837324600     12.0
##  6 6         2007-01-10  94.7  97.8  93.5  97.0 738220000     12.6
##  7 7         2007-01-11  95.9  96.8  95.1  95.8 360063200     12.5
##  8 8         2007-01-12  94.6  95.1  93.2  94.6 328172600     12.3
##  9 9         2007-01-16  95.7  97.3  95.5  97.1 311019100     12.6
## 10 10        2007-01-17  97.6  97.6  94.8  94.9 411565000     12.4
## # ... with more rows

The best part is that we can use almost any dplyr command with this! It queries the database, and does not do the manipulation in R. All of the familiar syntax of dplyr, but with databases. Let’s use filter() to get all of the rows after January 1, 2009.

filter(apple_table, date > "2009-01-01")
## # Source:   lazy query [?? x 8]
## # Database: mysql 5.6.27-log
## #   [user1@davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com:/firstdb]
##    row_names date        open  high   low close    volume adjusted
##    <chr>     <chr>      <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
##  1 505       2009-01-02  85.9  91.0  85.2  90.8 186503800     11.8
##  2 506       2009-01-05  93.2  96.2  92.7  94.6 295402100     12.3
##  3 507       2009-01-06  96.0  97.2  92.4  93.0 322327600     12.1
##  4 508       2009-01-07  91.8  92.5  90.3  91.0 188262200     11.8
##  5 509       2009-01-08  90.4  93.2  90.0  92.7 168375200     12.1
##  6 510       2009-01-09  93.2  93.4  90.1  90.6 136711400     11.8
##  7 511       2009-01-12  90.5  91.0  87.6  88.7 154429100     11.5
##  8 512       2009-01-13  88.2  89.7  86.4  87.7 199599400     11.4
##  9 513       2009-01-14  86.2  87.3  84.7  85.3 255416000     11.1
## 10 514       2009-01-15  80.6  84.1  80.1  83.4 457908500     10.8
## # ... with more rows

To do any serious manipulation outside of dplyr, you’ll likely have pull the data into memory to be able to use it with other R functions. Here, I’ll use the dplyr equivalent to dbWriteTable() to add the stock prices for IBM to the database.

ibm <- tq_get("IBM")

copy_to(cn, df = ibm, temporary = FALSE)

To actually retrieve the data to memory, first make the connection using tbl() like before, and then use collect() to create the in memory tibble. Unfortunately, dates are stored as characters in the table, and collect() won’t try to fix that, so I’ll also take advantage of the readr package’s type_convert() function to do the thinking for me.

Once we have the data in memory, we can calculate the daily return with tidyquant and tq_mutate().

# Connection
ibm_table <- tbl(cn, "ibm")

# Collect to tibble
real_tibble <- collect(ibm_table) %>%
  readr::type_convert()

# Daily return
real_tibble <- real_tibble %>% 
  tq_mutate(select     = adjusted, 
            mutate_fun = periodReturn, 
            period     = "daily")

real_tibble
## # A tibble: 2,542 x 8
##    date        open  high   low close   volume adjusted daily.returns
##    <date>     <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>         <dbl>
##  1 2007-01-03  97.2  98.4  96.3  97.3  9196800     78.4       0      
##  2 2007-01-04  97.2  98.8  96.9  98.3 10524500     79.2       0.0107 
##  3 2007-01-05  97.6  97.9  96.9  97.4  7221300     78.5      -0.00905
##  4 2007-01-08  98.5  99.5  98.3  98.9 10340000     79.7       0.0152 
##  5 2007-01-09  99.1 100.   99.1 100.  11108200     80.6       0.0118 
##  6 2007-01-10  98.5  99.1  97.9  98.9  8744800     79.7      -0.0118 
##  7 2007-01-11  99    99.9  98.5  98.7  8000700     79.5      -0.00243
##  8 2007-01-12  99.0  99.7  98.5  99.3  6636500     80.0       0.00699
##  9 2007-01-16  99.4 101.   99.3 101.   9602200     81.2       0.0149 
## 10 2007-01-17 101.  101.   99.9 100.   8200700     80.6      -0.00793
## # ... with 2,532 more rows

Always disconnect when you’re finished!

dbDisconnect(cn)
## [1] TRUE

Last words

Hopefully I’ve been able to show you the power of DBI + dplyr with Amazon RDS. This integration has come a long way, and is just one of the huge advancements that the RStudio team has been working on in collaboration with other R users in the community.

Until next time!