This term, we’ve spent a lot of time getting cozy with Python and specifically the pandas package for working with dataframes. Beyond INFO 2950, you may encounter situations where you need to work in R for data analysis. Let’s get to know R a little bit, as well as the tidyverse, a suite of packages built for tidy data analysis in R. We’ll focus on the dplyr package for data manipulation, which is analogous to pandas in Python.

Preamble

Installation

R is the interpreter, while RStudio is a convenient integrated development environment. I recommend installing both; install R first, since RStudio needs it. You do not need to install this software for our course, but you may wish to use these instructions in the future.

To download R, select the mirror from this list that is closest to your geographic location and then follow the system-specific instructions on that page.

To download RStudio, select the system-specific installer here.

R Markdown

The source of this document is an R Markdown file (.Rmd), which is similar to Jupyter Notebooks in that it allows you to weave text and executable code. Here is a quick tour and comprehensive guide of R Markdown for reference.

While a Jupyter notebook provides more of a what-you-see-is-what-you-get experience, R Markdown notebooks need to be compiled before they can be shared with the code executed. For example, this document has been compiled to an .html file in which all of the code has been executed, but if we open the source .Rmd in RStudio, we have to execute the code manually.

In .Rmd files, you write text in Markdown, and then you can execute R code in chunks like the following:

print("Hello World!")
## [1] "Hello World!"

You can execute all of the code within a chunk interactively by clicking the green play button at top right of the chunk (or use the keyboard shortcut CMD/CTRL + SHIFT + ENTER while within the chunk). If you want to execute just the line of code where your cursor currently is, use the keyboard shortcut CMD/CTRL + ENTER. The result will display within the .Rmd in RStudio, though it will not be saved in the document. To save the code output in a portable document you can send around, you need to compile the document, which can be done in RStudio using the Knit button (or the keyboard shortcut CMD/CTRL + SHIFT + K).

You can also execute code chunks in other languages, including Python!

import sys
print(sys.version)
## 3.6.10 | packaged by conda-forge | (default, Apr 24 2020, 16:29:39) 
## [GCC Clang 9.0.1 ]

Welcome to the tidyverse

The tidyverse packages share a core philosophy and a human-friendly syntax style that make learning one easier when you already work with another. Today, we’ll focus on tidyverse approaches, but know that there may be good ways of accomplishing the same tasks in base R as well.

While the tidyverse includes a number of individual packages, you can conveniently load them all as follows:

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::collapse() masks glue::collapse()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()

While this is convenient, it might be overkill depending on the task at hand. You may be better off just loading the specific packages you need.
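As a rough sketch of what loading only what you need can look like (the choice of packages here is just an illustration, and it assumes they are already installed):

```r
## attach only the packages needed for manipulation and plotting
library(dplyr)
library(ggplot2)

## alternatively, call a single function without attaching its
## package by using the package::function() form
stringr::str_to_upper("hello")
## [1] "HELLO"
```

The `::` form is also a handy way to be explicit about which package a function comes from when two packages share a function name.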

Core tidyverse packages worth mentioning are:

  • ggplot2: for intuitive and hassle-free plotting (like seaborn)
  • dplyr: for data manipulation
  • tidyr: for tidying data
  • readr: for reading rectangular data
  • purrr: for functional programming (replacing for loops!)
  • stringr: for working with strings
  • forcats: for working with categorical data

Basic data structures and operations

The most basic data structures in R are vectors and lists.

Vectors

Vectors are arrays where each entry must be of the same type.

## 'c' is short for 'combine' (it is often read as 'concatenate')
my_vec <- c(10, 30, 50)
print(my_vec)
## [1] 10 30 50

Indexing in R starts from 1, and you don’t need to explicitly call print to display values:

my_vec[1]
## [1] 10

If we forgot this and tried to index with 0, we would get back an empty vector rather than an error:

my_vec[0]
## numeric(0)

When subsetting or looping over a range, both the start and the stop values are inclusive, unlike in Python, where the stop value is excluded:

my_range <- 1:5

for (i in my_range){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
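The same inclusive behaviour applies when subsetting. Here is a small sketch (the `empty_vec` name is made up for illustration) that also shows one pitfall of inclusive ranges and a common way around it:

```r
my_vec <- c(10, 30, 50)

## both endpoints are included: this returns elements 2 AND 3
my_vec[2:3]
## [1] 30 50

## the range 1:5 has five elements (Python's range(1, 5) has four)
length(1:5)
## [1] 5

## pitfall: 1:length(x) counts down when x is empty
empty_vec <- numeric(0)
1:length(empty_vec)
## [1] 1 0

## seq_along() returns an empty index vector instead
seq_along(empty_vec)
## integer(0)
```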

A lot of functions in R are vectorized: they act on whole vectors, element by element, by default. If you think you need a for loop in R, there is often a vectorized alternative:

paste("Here is", "a sentence.")
## [1] "Here is a sentence."
net_id <- c("ip98", "dz352", "as3934")
email_suffix <- rep("@cornell.edu", times = 3)

paste(net_id, email_suffix, sep = "")
## [1] "ip98@cornell.edu"   "dz352@cornell.edu"  "as3934@cornell.edu"
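To make the comparison with a loop concrete, here is a small sketch using made-up prices and a made-up 8% tax rate. Both versions give the same answer, but the vectorized one is a single line:

```r
## hypothetical prices per kilo
prices <- c(1.4, 4, 2)

## loop version: fill a result vector one element at a time
with_tax_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_tax_loop[i] <- prices[i] * 1.08
}

## vectorized version: the multiplication applies to every element
with_tax_vec <- prices * 1.08

all.equal(with_tax_loop, with_tax_vec)
## [1] TRUE

## many built-in functions are vectorized too
nchar(c("ip98", "dz352", "as3934"))
## [1] 4 5 6
```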

The + operator sadly does not concatenate strings in R:

## this will throw an error! :(
"Here is" + "a sentence"
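A minimal sketch of the usual workarounds in base R: `paste0()` joins strings with no separator, and the `collapse` argument of `paste()` squashes a whole vector into one string:

```r
## paste0() is paste() with sep = ""
paste0("Here is", " a sentence")
## [1] "Here is a sentence"

## collapse joins the elements of one vector into a single string
paste(c("a", "b", "c"), collapse = "-")
## [1] "a-b-c"
```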

Nor can we easily iterate over the characters in a string:

for (i in "my string"){
  print(i)
}
## [1] "my string"
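If you do need the individual characters, one option in base R is to split the string first. This is only a sketch of one approach (the `chars` name is mine):

```r
## strsplit() returns a list with one character vector per input
## string, so [[1]] pulls out the vector for our single string
chars <- strsplit("my string", split = "")[[1]]
chars
## [1] "m" "y" " " "s" "t" "r" "i" "n" "g"

length(chars)
## [1] 9

## now a loop visits one character at a time
for (ch in chars) {
  print(ch)
}
```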

Lists

Lists are more general and can contain entries of different types; you can even have lists of lists!

my_list <- list("bananas", 20, FALSE, list("another list!", "hello :)", 50))
glue::glue("my_list length: {length(my_list)}")
## my_list length: 4
my_list
## [[1]]
## [1] "bananas"
## 
## [[2]]
## [1] 20
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [[4]][[1]]
## [1] "another list!"
## 
## [[4]][[2]]
## [1] "hello :)"
## 
## [[4]][[3]]
## [1] 50

The printed indices suggest how to access each list element:

my_list[[4]]
## [[1]]
## [1] "another list!"
## 
## [[2]]
## [1] "hello :)"
## 
## [[3]]
## [1] 50
my_list[[4]][[2]]
## [1] "hello :)"
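Double and single brackets behave differently on lists, which is easy to trip over. A short sketch using a cut-down version of the list above:

```r
my_list <- list("bananas", 20, FALSE)

## double brackets extract the element itself
my_list[[2]]
## [1] 20
class(my_list[[2]])
## [1] "numeric"

## single brackets return a (shorter) list containing the element
class(my_list[2])
## [1] "list"
length(my_list[2:3])
## [1] 2
```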

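You can convert a list to a vector, but since a vector can only hold one type, every entry is coerced to the most general type needed to represent them all (here, character). Nested lists are flattened along the way, which is why the result is longer than the original list: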

unlisted <- unlist(my_list)
unlisted[2]
## [1] "20"
class(unlisted[2])
## [1] "character"
length(my_list)
## [1] 4
length(unlisted)
## [1] 6
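The same coercion happens whenever you mix types in a vector. A small sketch of the ordering (logical, then numeric, then character):

```r
## all logical: stays logical
class(c(TRUE, FALSE))
## [1] "logical"

## logical mixed with numeric: TRUE becomes 1
class(c(TRUE, 20))
## [1] "numeric"
c(TRUE, 20)
## [1]  1 20

## anything mixed with a string: everything becomes character
class(c(TRUE, 20, "bananas"))
## [1] "character"
```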

Dataframes (tibbles) and dplyr

Both vectors and lists can be used to construct our best friends: dataframes. Dataframes exist in base R, but I prefer their tidyverse counterpart, tibbles:

my_tbl <- tibble(ID = 1:5,
                fruit = c("bananas", "peaches", "tangerine",
                          "kiwis", "grapes"),
                price_per_kilo = c(1.4, 4, 2, 1.2, 3))
my_tbl

Tidy data tables like this (with one observation per row and one variable per column) are the main currency of the tidyverse. Getting real data into this format can be challenging (which is where the tidyr package comes in handy), but once we have this, we’re laughing!
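To give a flavour of what tidying can involve, here is a sketch with a made-up "wide" table in which the year variable is hidden in the column names. The table and its values are invented for illustration; `pivot_longer()` is the tidyr function doing the work:

```r
library(tibble)
library(tidyr)

## hypothetical wide table: one price column per year
wide_tbl <- tibble(fruit = c("bananas", "kiwis"),
                   `2019` = c(1.3, 1.1),
                   `2020` = c(1.4, 1.2))

## gather the year columns so that each row is one observation
## (one fruit in one year)
tidy_tbl <- pivot_longer(wide_tbl,
                         cols = c(`2019`, `2020`),
                         names_to = "year",
                         values_to = "price_per_kilo")

names(tidy_tbl)
## [1] "fruit"          "year"           "price_per_kilo"
dim(tidy_tbl)
## [1] 4 3
```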

dplyr

dplyr is a package that implements a grammar of data manipulation, built around five basic verbs that transform data tables:

| dplyr function | description | pandas counterpart |
|---|---|---|
| mutate() | add new variables (columns) | df["new_var"] = vals* |
| select() | select variables (columns) by name | df[["col1"]] |
| filter() | select observations (rows) by values | query() |
| summarise() | summarise values (e.g. take the mean) | agg() |
| arrange() | sort observations (rows) | sort_values() |

* mutate() is generally used to make new variables from old ones. In pandas, the corresponding syntax can be a little more complicated if you want to avoid the infamous SettingWithCopyWarning.

These functions can be combined with group_by(), which groups observations based on the values in the specified columns. There is also distinct(), a special kind of filter() that drops duplicate observations (like drop_duplicates() in pandas).
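Here is a small sketch of distinct() and group_by() in action. The `purchases` table is invented for illustration, and each function is called directly with the table as its first argument:

```r
library(dplyr)

## hypothetical purchases table; the first two rows are duplicates
purchases <- tibble(fruit = c("kiwis", "kiwis", "grapes", "grapes", "grapes"),
                    store = c("A", "A", "A", "B", "B"),
                    kilos = c(1, 1, 2, 3, 5))

## distinct() drops the repeated ("kiwis", "A", 1) row
unique_purchases <- distinct(purchases)
nrow(unique_purchases)
## [1] 4

## group_by() + summarise() gives one row per fruit
by_fruit <- group_by(unique_purchases, fruit)
totals <- summarise(by_fruit, total_kilos = sum(kilos))

totals$fruit
## [1] "grapes" "kiwis"
totals$total_kilos
## [1] 10  1
```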

More comparisons between R and Python can be found here.

The pipe operator (%>%)

One of the handiest features of dplyr is the pipe operator (%>%), which it re-exports from the magrittr package and which lets you set up data pipelines easily. It passes whatever is on its left (usually a data frame) as the first argument of the function on its right.

"Hello World!" %>% print()
## [1] "Hello World!"
"ip98" %>% paste("@cornell.edu", sep = "") %>% print()
## [1] "ip98@cornell.edu"
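In other words, `x %>% f(y)` is just another way of writing `f(x, y)`. A quick sketch comparing nested calls with the piped version (the `nested` and `piped` names are mine):

```r
library(dplyr)

## nested calls are read from the inside out
nested <- toupper(paste("ip98", "@cornell.edu", sep = ""))

## the piped version reads left to right, one step per function
piped <- "ip98" %>%
  paste("@cornell.edu", sep = "") %>%
  toupper()

identical(nested, piped)
## [1] TRUE
piped
## [1] "IP98@CORNELL.EDU"
```

The payoff grows with the number of steps: a pipeline of five or six transformations stays readable, whereas the nested equivalent quickly does not.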

penguins demo

Let’s take dplyr out for a spin by exploring the now familiar penguins dataset:

library(palmerpenguins)

We can inspect this data:

## get the table dimensions
dim(penguins)
## [1] 344   8
## peek at the first few observations of the table
head(penguins)
penguins %>% head()
## drop NAs in a specified column
penguins %>% drop_na(bill_length_mm)
## count the number of observations by year
penguins %>% group_by(year) %>% count()

Make a new column (and select specific columns):

penguins %>%
  ## make a column where the observations are twice the
  ## observations in another column
  mutate(twice_bill_length_mm = 2*bill_length_mm) %>%
  ## drop some columns for nicer display
  select(-(bill_depth_mm:year))
(penguins
 ## make an indicator variable column
 %>% mutate(is_adelie = case_when(
   ## when the logical condition on the left is true,
   ## map to ("~") the value on the right
   species == "Adelie" ~ 1,
   ## all other observations
   TRUE ~ 0))
 %>% select(species, is_adelie)
)
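When there are only two outcomes, `if_else()` is a more compact alternative to `case_when()`. A sketch on a tiny stand-in table (the `birds` rows are made up so the example runs on its own):

```r
library(dplyr)

## small hypothetical stand-in for the penguins table
birds <- tibble(species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"))

birds_flagged <- birds %>%
  mutate(
    ## case_when(), as in the penguins example
    is_adelie_cw = case_when(species == "Adelie" ~ 1,
                             TRUE ~ 0),
    ## if_else(condition, value if TRUE, value if FALSE)
    is_adelie_ie = if_else(species == "Adelie", 1, 0)
  )

identical(birds_flagged$is_adelie_cw, birds_flagged$is_adelie_ie)
## [1] TRUE
birds_flagged$is_adelie_ie
## [1] 1 0 0 1
```

`case_when()` earns its keep once you have three or more cases to distinguish.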

Filter the observations:

penguins %>%
  ## filter on one condition
  filter(bill_length_mm > 40)
penguins %>%
  ## or filter on arbitrarily many conditions simultaneously
  filter(bill_length_mm > 40, species == "Chinstrap")
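Conditions separated by commas are combined with AND. For OR, join them with `|` inside a single condition. A sketch on a made-up four-row table:

```r
library(dplyr)

## small hypothetical stand-in for the penguins table
birds <- tibble(species = c("Adelie", "Chinstrap", "Chinstrap", "Gentoo"),
                bill_length_mm = c(39.1, 46.5, 38.0, 47.3))

## comma-separated conditions must all hold ...
both <- birds %>% filter(bill_length_mm > 40, species == "Chinstrap")
nrow(both)
## [1] 1

## ... which matches joining them with &
both_amp <- birds %>% filter(bill_length_mm > 40 & species == "Chinstrap")
nrow(both_amp)
## [1] 1

## use | when either condition is enough
either <- birds %>% filter(bill_length_mm > 40 | species == "Chinstrap")
nrow(either)
## [1] 3
```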

Summarise variables:

(penguins
 ## group by two variables
 %>% group_by(species, island)
 ## compute the grouped means
 %>% summarise(mean_body_mass_g = mean(body_mass_g,
                                       na.rm = TRUE))
)

Note that the summary table has one row per group, so the original observations are gone!
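If you want the group summary alongside the original rows, one option is to use mutate() on the grouped table instead of summarise(). A sketch with made-up body masses:

```r
library(dplyr)

## small hypothetical stand-in for the penguins table
weights <- tibble(species = c("Adelie", "Adelie", "Gentoo"),
                  body_mass_g = c(3700, 3800, 5000))

## summarise() collapses each group to a single row
collapsed <- weights %>%
  group_by(species) %>%
  summarise(mean_body_mass_g = mean(body_mass_g))
nrow(collapsed)
## [1] 2

## mutate() on grouped data keeps every row and repeats the group mean
with_means <- weights %>%
  group_by(species) %>%
  mutate(mean_body_mass_g = mean(body_mass_g)) %>%
  ungroup()
nrow(with_means)
## [1] 3
with_means$mean_body_mass_g
## [1] 3750 3750 5000
```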

We can also compute the desired summary for specific columns in a more programmatic way:

(penguins
 ## group by two variables
 %>% group_by(species, island)
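 ## compute the grouped means for all variables ending with
 ## "mm"; the ~ lambda passes na.rm = TRUE to mean() (passing
 ## extra arguments through across() is deprecated as of
 ## dplyr 1.1.0)
 %>% summarise(across(ends_with("mm"), ~ mean(.x, na.rm = TRUE)))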
)

Sort the observations:

(penguins
 ## filter by row index (first five rows)
 %>% slice(1:5)
 ## sort by flipper length---increasing by default
 ## for descending, wrap variable with desc()
 %>% arrange(desc(flipper_length_mm))
 %>% select(species, island, flipper_length_mm)
)

Bonus: just add ggplot2

OK, so we know how to manipulate dataframes now… but what about plotting? To whet your appetite, here’s how slick ggplot2 can be:

## initialize a ggplot object that defines the "grammar" of
## the plot (this doesn't draw anything yet!)
(ggplot(data = penguins %>% drop_na(),
        mapping = aes(x = bill_length_mm,
                      y = bill_depth_mm,
                      col = sex))
  ## (the + operator adds plot layers to each other)
  ## add a linear model (with a 95% confidence band)
  + geom_smooth(formula = y ~ x,
                method = "lm")
  ## add the raw data points
  + geom_point(size = 0.5)
  ## facet by species and year
  + facet_grid(
    scales = "free", ## to allow each row and column to have
    ## different x and y scales (axis limits) --- more
    ## zoomed in, but can be harder to compare panels
    rows = vars(year),
    cols = vars(species)
    )
  ## adjust labels
  + labs(x = "bill length (mm)",
         y = "bill depth  (mm)")
)
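Because a plot is just an object that layers get added to, you can also build it up step by step. A minimal sketch on a made-up four-row data frame (the `toy` values are invented so the example runs without the penguins data):

```r
library(ggplot2)

## hypothetical measurements standing in for the penguins table
toy <- data.frame(bill_length_mm = c(39.1, 46.5, 47.3, 50.0),
                  bill_depth_mm  = c(18.7, 17.9, 15.1, 19.5))

## step 1: data and aesthetic mappings only
p <- ggplot(data = toy,
            mapping = aes(x = bill_length_mm, y = bill_depth_mm))

## step 2: each + adds one layer or setting to the plot object
p <- p + geom_point(size = 0.5)
p <- p + labs(x = "bill length (mm)", y = "bill depth (mm)")

## p is an ordinary R object; printing it (e.g. print(p)) draws it
inherits(p, "ggplot")
## [1] TRUE
```

Storing the plot in a variable like this makes it easy to reuse a base plot and try out different layers on top of it.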

Base R vs the tidyverse

Base R is quite useful and stable, though a bit clunky for data analysis at times. If you are going to do any data analysis in R, I would highly recommend learning to do it the tidyverse way. The documentation is fantastic, and there are many well-written (and free!) online textbooks and tutorials authored by tidyverse package developers to help you out. A great place to start is R for Data Science.

As with any software, the tidyverse isn’t perfect, and there are certainly contexts where it isn’t the best choice. In my experience with data science, the tidyverse tools are indispensable; I use them virtually every time I work on data projects.

The tidyverse in Python?

The tidyverse tools have been so popular that similar principles have been integrated into packages for other languages. For instance, there are a few implementations of tidyverse-like tools in Python, including this one.

Sources

The content of this tutorial was heavily informed by the following sources: