Tidying Excel cash flow spreadsheets using R
If you love using Excel, and it fits your
needs, then by all means do your thing. However, there is also interest
out there for moving Excel analyses into R. If you are one of those people,
and your Excel data is “messy,” then this post is for you. We will be using
the unpivotr
package (GitHub) to tidy up some Excel cash flow spreadsheets.
Note: A few functions in the unpivotr
package that are used here have been deprecated. This post runs with unpivotr
0.3.1, but I recommend learning about the newest version if you have an interest in the package.
devtools::install_version("unpivotr", "0.3.1")
The problem
Often, cash flow spreadsheets contain valuable info about a company’s performance,
but they generally come in a non-tidy format. For R users, especially those who
use the tidyverse
, this poses a real challenge,
as most of our data analysis toolkit revolves around working with tidy data.
If you aren’t familiar with “tidy” data, check out this vignette. The basic 3 tenants of tidy data are:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Below is a typical cash flow statement for 1 year of performance, broken down by month. This does not fit the “tidy” data standards, but is incredibly common in the Financial world. Don’t worry too much about how hard this is to see, I’ll zoom in on relevant features as we go along. But do notice how columns C through Z are the months, and the left two columns represent main and sub-headers for the cash flow sheet.
Why doesn’t this fit the “tidy” principles? Here are just a few things we might change, and a picture to highlight the problems.
Cash Inflows (Income)
in column A is a “main header” with “sub-headers” in column B rows 7-13. This violates principle 2, because each row is not a standalone observation. You need information from row 6 (the title of the category) to figure out what categoryCash Collections
belongs in. Ideally,Cash Inflows (Income)
would be repeated in column A, rows 7-10 and in row 13 to uniquely identify that row.- Row 4 identifies 13 columns in this dataset (12 months and a total column). In the tidy world, this violates principle 1, because those 13 can actually be reduced to just 1 column,
month
, where we will just understand thatTOTALS
is the sum of the others. Some people would even drop theTOTALS
column, as it is easily computed from the 12 months of data and may cause confusion when looking at monthly totals. - A general problem is that there are spaces in every other column. We can fix that easy enough.
- A last major problem is that some “main headers” actually contain data in their rows, and some don’t. For instance, row 5 has a main header of
Beginning Cash Balance
and also has data for each month. Compare that toCash Inflows (Income)
which instead has subcategories. We will deal with all this as we go through the tidying process.
Following along
If you want to follow along, feel free to fork this Github repo which contains the data and the RMarkdown script.
The data comes from an Excel cash flow template I found and filled in with random data just to test out.
A tale of two packages
The main two packages that we need are readxl
to extract the data itself from the sheet, and unpivotr
, a relatively little known package that is incredibly powerful
for this kind of work. We will additionally use dplyr
and ggplot2
and purrr
.
library(readxl)
library(dplyr)
library(ggplot2)
library(unpivotr)
library(purrr)
First, let’s load the data and see what we get.
wb <- "./data/untidy.xlsx"
untidy_raw <- read_excel(wb)
rmarkdown:::print.paged_df(untidy_raw)
Hmm, that’s not helpful. Let’s try again, but this time let’s skip the first two rows and ignore the column names for now.
untidy <- read_excel(wb, skip = 2, col_names = FALSE)
rmarkdown:::print.paged_df(untidy)
At this point you can kind of match up the structure of the Excel sheet
to the R tibble
. What now? This is where unpivotr
comes in. The basic
idea is to:
- Tokenize your Excel sheet into a
tibble
with three columns: row, column, and value. - Then you identify your row and column headers (main and sub headers).
- Finally, you use that information to perform a series of “coordinate based” transformations that help tidy up your data (
unpivotr
performs most of the magic for us).
Tokenize your Excel sheet
unpivotr
comes equipped with a function that can help turn our imported
spreadsheet into a tokenized version of itself, where each row of the
resulting tibble
corresponds to a cell identified by a row
and col
number. tidy_table()
is just that function.
untidy_tokens <- tidy_table(untidy)
untidy_tokens
## # A tibble: 1,161 x 4
## row col chr lgl
## <int> <int> <chr> <lgl>
## 1 1 1 <NA> NA
## 2 2 1 Beginning Cash Balance NA
## 3 3 1 Cash Inflows (Income): NA
## 4 4 1 <NA> NA
## 5 5 1 <NA> NA
## 6 6 1 <NA> NA
## 7 7 1 <NA> NA
## 8 8 1 <NA> NA
## 9 9 1 <NA> NA
## 10 10 1 <NA> NA
## # … with 1,151 more rows
You might notice that “Beginning Cash Balance” corresponds to row 4 in the actual spreadsheet, not row 2, but remember that we skipped 2 rows on the import step, so this is actually correct.
You’ll also notice that two other columns have been created, chr
and lgl
.
chr
is the tokenized version of any column in the spreadsheet that contained
character data, and lgl
is the same but for logical data. If you look at
untidy
again, you’ll see that X__4
is a logical column, and just represents
one of the blank columns used for spacing in the original data set, so that is
the only reason we have a lgl
column.
We don’t care about the lgl
column, and at this point we also remove any cells
that contain NA
values. This solves problem 3.
tokens <- untidy_tokens %>%
select(-lgl) %>%
filter(!is.na(chr))
tokens
## # A tibble: 506 x 3
## row col chr
## <int> <int> <chr>
## 1 2 1 Beginning Cash Balance
## 2 3 1 Cash Inflows (Income):
## 3 11 1 Available Cash Balance
## 4 12 1 Cash Outflows (Expenses):
## 5 35 1 Other Cash Out Flows:
## 6 43 1 Ending Cash Balance
## 7 4 2 Cash Collections
## 8 5 2 Credit Collections
## 9 6 2 Investment Income
## 10 7 2 Other:
## # … with 496 more rows
Separating out summary rows
Remember problem 4 from above about how some main headers contain data in their
rows and some don’t? That is going to cause us problems later on if we don’t
take care of it now. As it turns out, this happens in 3 places,
"Beginning Cash Balance"
, "Available Cash Balance"
, and "Ending Cash Balance"
.
These 3 rows all happen to be summary rows computed from other rows, and it makes
sense to think about them separately anyways. For that reason, let’s pull them out.
Luckily, this is incredibly simple thanks to our tokenized spreadsheet.
main_headers <- c(
"Beginning Cash Balance",
"Available Cash Balance",
"Ending Cash Balance"
)
main_header_rows <- tokens %>%
filter(chr %in% main_headers) %>%
pull(row)
# Only the three summary rows
# Also include row 1 so we keep the row headers (months)
tokens_summary <- tokens %>%
filter(row %in% c(1, main_header_rows))
# Everything else
tokens_main <- tokens %>%
filter(!(row %in% main_header_rows))
tokens_main
## # A tibble: 467 x 3
## row col chr
## <int> <int> <chr>
## 1 3 1 Cash Inflows (Income):
## 2 12 1 Cash Outflows (Expenses):
## 3 35 1 Other Cash Out Flows:
## 4 4 2 Cash Collections
## 5 5 2 Credit Collections
## 6 6 2 Investment Income
## 7 7 2 Other:
## 8 10 2 Total Cash Inflows
## 9 13 2 Advertising
## 10 14 2 Bank Service Charges
## # … with 457 more rows
Row and Column Headers
At this point, we need to identify the sets of row and column header cells that help put structure around our spreadsheet. In the actual workbook, row 4 corresponds to our column headers, and columns A and B correspond to our two sets of row headers. For now, we will just focus on the main tokens.
We just need to filter our dataset down
to the cells that correspond to our row and column headers. These will be used
later on to tell unpivotr
how to tidy up.
First we get row headers…
row_headers <- tokens_main %>%
# Only columns A and B (1 and 2)
filter(col <= 2) %>%
# A quick rename
rename(header = chr) %>%
# Split into two tibbles by the column (a nice base R function)
# A list of two tibbles is returned
split(.$col) %>%
# Name the elements of the list for easy access
set_names(c("main_headers", "sub_headers"))
row_headers
## $main_headers
## # A tibble: 3 x 3
## row col header
## <int> <int> <chr>
## 1 3 1 Cash Inflows (Income):
## 2 12 1 Cash Outflows (Expenses):
## 3 35 1 Other Cash Out Flows:
##
## $sub_headers
## # A tibble: 32 x 3
## row col header
## <int> <int> <chr>
## 1 4 2 Cash Collections
## 2 5 2 Credit Collections
## 3 6 2 Investment Income
## 4 7 2 Other:
## 5 10 2 Total Cash Inflows
## 6 13 2 Advertising
## 7 14 2 Bank Service Charges
## 8 15 2 Insurance
## 9 16 2 Interest
## 10 17 2 Inventory Purchases
## # … with 22 more rows
And now column headers…
col_headers <- tokens_main %>%
# Only the first row (where the month names are)
filter(row == 1) %>%
rename(header = chr)
col_headers
## # A tibble: 13 x 3
## row col header
## <int> <int> <chr>
## 1 1 3 April
## 2 1 5 May
## 3 1 7 June
## 4 1 9 July
## 5 1 11 Aug
## 6 1 13 Sept
## 7 1 15 Oct
## 8 1 17 Nov
## 9 1 19 Dec
## 10 1 21 Jan
## 11 1 23 Feb
## 12 1 25 Mar
## 13 1 27 TOTALS
Coordinate tidying
With these in hand, we get to the fun (magic?) part. The key piece of unpivotr
(at least to me) seems to be a handful of functions that perform transformations
on your data based on map directions. This might be something like North with N()
,
South with S()
, or combinations like North-NorthWest with NNW()
.
Internally, these are a mix of dplyr
and data.table
join functions.
It takes a bit to wrap your head around the purpose of them, but I’ll try and use
some pictures to help. For each of our 3 header groups, we need to tell unpivotr
how to join our main data cells (where the numbers are) to the actual headers. For instance,
for the column header, all we have to do is go directly North from any data point
to run into the corresponding header.
We tell unpivotr
this like so:
tokens_main %>%
N(header = col_headers) %>% # passing in our col_headers tibble we extracted earlier
rmarkdown:::print.paged_df()
unpivotr
identified any data cell that had a corresponding column header, and
created a new dataset for us that is essentially an inner join of the column
headers and the original tokens. Notice that columns A and B from the original
sheet do not have column headers, so they don’t show up here.
Next, we need to take care of our row headers. To do so, we need two transformations. For the sub-headers, it is as simple as the column headers, we just go directly west from any cell to run into the sub-header.
For the main headers we need to do something slightly more complicated. For each
data cell, we need to run all the way to the west wall of the sheet, and then run
north from there to run into, for example Cash Inflows (Income)
.
unpivotr
gives us the power to do this in the WNW()
(West-NorthWest) function. It searches
West and NorthWest for the first matching row header. Combining this with the
W()
function that is needed for the sub-headers, we get:
tokens_main %>%
W(header = row_headers$sub_headers) %>%
WNW(header = row_headers$main_headers) %>%
rmarkdown:::print.paged_df()
Notice how each cell in the chr
column now has extra columns that identify
the corresponding main and sub headers that go with it. Click Next
a few times
on the interative data frame to get down to actual numbers. You should see
that each cell has the correct main and sub header added on.
At this point, the actual sub headers are also still in the chr
column, mapped to their corresponding main
header and sub header (itself). When we combine this row header step with the
column header step, those will be removed as well.
So let’s do that. All together now:
tokens_main_tidy <- tokens_main %>%
N(col_headers) %>%
WNW(row_headers$main_headers) %>%
W(row_headers$sub_headers) %>%
# Also rename to give us cleaner names
rename(main_header = header.data, month = i.header, sub_header = header.header)
rmarkdown:::print.paged_df(tokens_main_tidy)
Nice! We successfully tidied the main section of the worksheet.
Tidying the summary rows
We can also quickly tidy up the summary rows that we removed earlier. Now that we know the steps, it’s a quick task to find the row and column headers and perform the corresponding transformation steps.
row_headers_summary <- tokens_summary %>%
filter(col == 1) %>%
rename(header = chr)
col_headers_summary <- col_headers
tokens_summary_tidy <- tokens_summary %>%
W(row_headers_summary) %>%
N(col_headers_summary) %>%
rename(main_header = header.data, month = header.header)
tokens_summary_tidy
## # A tibble: 36 x 5
## row col chr main_header month
## <int> <int> <chr> <chr> <chr>
## 1 2 3 0 Beginning Cash Balance April
## 2 11 3 -4407 Available Cash Balance April
## 3 43 3 -39895 Ending Cash Balance April
## 4 2 5 -39895 Beginning Cash Balance May
## 5 11 5 -34824 Available Cash Balance May
## 6 43 5 -43080 Ending Cash Balance May
## 7 2 7 -43080 Beginning Cash Balance June
## 8 11 7 -40529 Available Cash Balance June
## 9 43 7 -39830 Ending Cash Balance June
## 10 2 9 -39830 Beginning Cash Balance July
## # … with 26 more rows
Visualizing yearly cash flows
So what can you do with this? Well, now that we have a tidy dataset we can
use any of our standard tools, like ggplot2
, and analyze the yearly cash flow
statement.
One thing we can do is divide up the sheet into it’s 3 main headers (that aren’t
summary headers), Cash Inflows (Income):
, Cash Outflows (Expenses):
, and
Other Cash Out Flows:
and then create plots of their sub headers over the year.
# The year goes from April -> March, we will need to create an ordered factor using this
ordered_months <- c("April", "May", "June", "July", "Aug", "Sept",
"Oct", "Nov", "Dec", "Jan", "Feb", "Mar")
# We are going to iterate over the 3 Main headers (that aren't summary headers)
main_headers <- unique(tokens_main_tidy$main_header)
main_headers
## [1] "Cash Inflows (Income):" "Cash Outflows (Expenses):"
## [3] "Other Cash Out Flows:"
# For each main header, we are going to plot all of the sub-headers
yearly_plot <- function(x, header) {
p <- x %>%
filter(main_header == header) %>%
ggplot(aes(x = month, y = chr, group = sub_header)) +
geom_line() +
facet_wrap(~sub_header, ncol = 3) +
labs(title = header) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
}
# We need to manipulate the data a little more to get it ready for plotting
tokens_plot_ready <- tokens_main_tidy %>%
# Remove any total rows
filter(month != "TOTALS") %>%
# To use the month column as our x-axis, we need it to be an ordered factor
# Additionally, the chr column need to be converted over to numeric
mutate(month = factor(month, levels = ordered_months, ordered = TRUE),
chr = as.numeric(chr))
tokens_plot_ready
## # A tibble: 384 x 6
## row col chr main_header month sub_header
## <int> <int> <dbl> <chr> <ord> <chr>
## 1 4 3 2227 Cash Inflows (Income): April Cash Collections
## 2 5 3 -4712 Cash Inflows (Income): April Credit Collections
## 3 6 3 -2412 Cash Inflows (Income): April Investment Income
## 4 7 3 490 Cash Inflows (Income): April Other:
## 5 10 3 -4407 Cash Inflows (Income): April Total Cash Inflows
## 6 13 3 -324 Cash Outflows (Expenses): April Advertising
## 7 14 3 3221 Cash Outflows (Expenses): April Bank Service Charges
## 8 15 3 960 Cash Outflows (Expenses): April Insurance
## 9 16 3 936 Cash Outflows (Expenses): April Interest
## 10 17 3 2522 Cash Outflows (Expenses): April Inventory Purchases
## # … with 374 more rows
Using purrr
, we can walk
over the 3 main headers, producing a plot
at each one. walk
is like map
except it is mainly called for its
“side effects” (like producing a plot) rather than for something like
manipulating data and returning it to the user.
walk(main_headers, ~yearly_plot(tokens_plot_ready, .x))
Conclusion
I really think that unpivotr
is a powerful package. Using map directions to
tidy these data sets is a pretty neat idea! That’s all for now!