library(DBI)
# library(RMySQL)
library(tidyquant)
# devtools::install_github("tidyverse/dplyr")
# devtools::install_github("tidyverse/dbplyr")
library(dplyr)
# library(dbplyr)
Warning
The code and advice in this post is still valid, but I have shut down the AWS database that this post pulled from, so it no longer runs. Additionally, the output is no longer shown.
Intro
Welcome to my first post! To start things off at Data Insights, I’m going to show you how to connect to an AWS RDS instance from R.
For those of you who don’t know, RDS is an easy way to create a database in the cloud. In this post, I won’t be showing you how to setup an RDS instance, but I will show you how to connect to it if you have one running.
Let’s get started.
Step 1: The one where you got connected
You’ll need a few packages to get started.
DBI
andRMySQL
are used to connect to the database, althoughRMySQL
is usually called without explicitely loading it (that’s just the standard)tidyquant
is just there to help us download some data to put in and get out of our databasedplyr
will be used to show off an alternate way to query from the database. Note that you should get the most recent github version ofdplyr
, along with the database specific pieces fromdbplyr
.
Getting connected isn’t too hard once you know what you’re looking for.
<- dbConnect(drv = RMySQL::MySQL(),
cn username = "user1",
password = "testpassword",
host = "davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com",
port = 3306,
dbname = "firstdb")
Let’s go through the arguments to dbConnect()
, the function from DBI
that we used to connect.
drv
- The driver I used is from theRMySQL
package, an implementation of the general interface provided byDBI
. I’ll leave it to the experts to explain all of this.username
/password
- You’ll have to have created a user and password on AWS first, but then you can use them here.host
The host name is the Endpoint of your RDS server, without the port on the end. I’ve attached a screenshot to show where to find this. Basically, on the RDS Dashboard Instances page, hit the drop down arrow beside “MySQL” to show the Endpoint.
port
- The rest of the Endpoint shows the port that you’ll need to access your RDS instance through. That goes here.dbname
- Finally, you’ll need the DB Name you used when setting up the instance. This can be found by clicking Instance Actions -> See Details, and then under Configuration Details you’ll find DB Name.
Step 2: The one where you take it for a test spin
Well, alright…that was…cool? How do we know it’s working? Let’s get some data to load into the database. We will use some Apple stock data retrieved through tidyquant
.
<- tq_get("AAPL")
aapl
slice(aapl, 1:10)
To write the tibble (data frame) to the database, we will use another function called dbWriteTable()
. It’s pretty straightforward. “name” is the name of the table you are creating, and “value” is the data frame you want to write.
dbWriteTable(cn, name = "apple", value = aapl)
Now the fun part! Let’s use a SQL query to pull it back down with dbGetQuery()
. This function is a combination of dbSendQuery()
, which returns a result set for your query, and dbFetch()
which returns the rows from that result set.
<- dbGetQuery(cn, "SELECT * FROM apple;")
apple_from_db
# This effectively is the same as
# dbReadTable(cn, "apple")
slice(apple_from_db, 1:10)
There are a huge number of functions from DBI
that you can use to communicate with databases. Maybe I will cover more in a separate post, but for now, let’s move on to dplyr
.
Step 3: The one with the pliers
Before dplyr
0.6.0 was announced, you’d have to disconnect, and then reconnect through a dplyr
specific function, src_mysql()
. That would look something like the code below. Since then, however, you can now use the DBI
connection with dplyr
!
# There is no need for this code anymore!
dbDisconnect(cn)
<- src_mysql(user = "user1",
cn password = "testpassword",
host = "davisdbinstance.crarljboc8to.us-west-2.rds.amazonaws.com",
port = 3306,
dbname = "firstdb")
Select the apple table from the database. This does not actually pull the data into memory. It just makes a connection!
# With dplyr 0.6.0 we can just use the DBI connection!
<- tbl(cn, "apple")
apple_table
# By default the first 1000 rows are displayed
apple_table
The best part is that we can use almost any dplyr command with this! It queries the database, and does not do the manipulation in R. All of the familiar syntax of dplyr, but with databases. Let’s use filter()
to get all of the rows after January 1, 2009.
filter(apple_table, date > "2009-01-01")
To do any serious manipulation outside of dplyr
, you’ll likely have pull the data into memory to be able to use it with other R functions. Here, I’ll use the dplyr
equivalent to dbWriteTable()
to add the stock prices for IBM to the database.
<- tq_get("IBM")
ibm
copy_to(cn, df = ibm, temporary = FALSE)
To actually retrieve the data to memory, first make the connection using tbl()
like before, and then use collect()
to create the in memory tibble. Unfortunately, dates are stored as characters in the table, and collect()
won’t try to fix that, so I’ll also take advantage of the readr
package’s type_convert()
function to do the thinking for me.
Once we have the data in memory, we can calculate the daily return with tidyquant
and tq_mutate()
.
# Connection
<- tbl(cn, "ibm")
ibm_table
# Collect to tibble
<- collect(ibm_table) %>%
real_tibble ::type_convert()
readr
# Daily return
<- real_tibble %>%
real_tibble tq_mutate(select = adjusted,
mutate_fun = periodReturn,
period = "daily")
real_tibble
Always disconnect when you’re finished!
dbDisconnect(cn)
Last words
Hopefully I’ve been able to show you the power of DBI
+ dplyr
with Amazon RDS. This integration has come a long way, and is just one of the huge advancements that the RStudio team has been working on in collaboration with other R users in the community.
Until next time!