pandas to dplyr: R for Python users in data science
This term, we’ve spent a lot of time getting cozy with Python and specifically the pandas package for working with dataframes. Beyond INFO 2950, you may encounter situations where you need to work in R for data analysis. Let’s get to know R a little bit, as well as the tidyverse, a suite of packages built for tidy data analysis in R. We’ll focus on the dplyr package for data manipulation, which is analogous to pandas in Python.
R is the interpreter, while RStudio is a convenient integrated development environment. I would recommend you install both, though you would need to install R first. You do not need to install this software for our course, but you may wish to use these instructions in the future.
To download R, select the mirror from this list that is closest to your geographic location and then follow the system-specific instructions on that page.
To download RStudio, select the system-specific installer here.
The source of this document is an R Markdown file (.Rmd), which is similar to Jupyter Notebooks in that it allows you to weave text and executable code. Here is a quick tour and comprehensive guide of R Markdown for reference.
While a Jupyter notebook provides more of a what-you-see-is-what-you-get experience, R Markdown notebooks need to be compiled to be shared with the code executed. For example, this document has been compiled to a .html file, and all of the code is executed in there, but if we open the source .Rmd in RStudio, we have to execute the code manually.
In .Rmd files, you write text in Markdown, and then you can execute R code in chunks like the following:
print("Hello World!")
## [1] "Hello World!"
You can execute all of the code within a chunk interactively by clicking the green play button at the top right of the chunk (or use the keyboard shortcut CMD/CTRL + SHIFT + ENTER while within the chunk). If you want to execute just the line of code where your cursor currently is, use the keyboard shortcut CMD/CTRL + ENTER. The result will display within the .Rmd in RStudio, though it will not be saved in the document. To save the code output in a portable document you can send around, you need to compile the document, which can be done in RStudio using the Knit button (or the keyboard shortcut CMD/CTRL + SHIFT + K).
You can also execute code chunks in other languages, including Python!
import sys
print(sys.version)
## 3.6.10 | packaged by conda-forge | (default, Apr 24 2020, 16:29:39)
## [GCC Clang 9.0.1 ]
The tidyverse packages share a core philosophy and a human-friendly syntax style that make learning one easier when you already work with another. Today, we’ll focus on tidyverse approaches, but know that there may be good ways of accomplishing the same tasks in base R as well.
While the tidyverse includes a number of individual packages, you can conveniently load them all as follows:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::collapse() masks glue::collapse()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
While this is convenient, it might be overkill depending on the task at hand. You may be better off just loading the specific packages you need.
Core tidyverse packages worth mentioning are:
ggplot2: for intuitive and hassle-free plotting (like seaborn)
dplyr: for data manipulation
tidyr: for tidying data
readr: for reading rectangular data
purrr: for functional programming (replacing for loops!)
stringr: for working with strings
forcats: for working with categorical data
The most basic data structures in R are vectors and lists.
Vectors are arrays where each entry must be of the same type.
## the 'c' is short for 'concatenate'
my_vec <- c(10, 30, 50)
print(my_vec)
## [1] 10 30 50
Indexing in R starts from 1, and you don’t need to explicitly call print to display values:
my_vec[1]
## [1] 10
If we forgot this and tried to index with 0, we would not get an error; instead, an empty vector is returned:
my_vec[0]
## numeric(0)
When subsetting or looping over a range, both the start and the stop values are inclusive, unlike in Python, where the stop value is excluded:
my_range <- 1:5
for (i in my_range){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
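For comparison, here is a minimal Python sketch of the same two habits (0-based indexing and an excluded stop value). The variable names `r_style`, `py_default`, and `first` are only illustrative.

```python
# R's 1:5 includes both endpoints; Python's range() excludes the stop value,
# so we have to add 1 to the stop to get the same numbers.
r_style = list(range(1, 5 + 1))   # same values as R's 1:5
py_default = list(range(1, 5))    # stops at 4

# Python indexes from 0, so the first entry is my_vec[0] (R: my_vec[1]).
my_vec = [10, 30, 50]
first = my_vec[0]

print(r_style)
print(py_default)
print(first)
```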
A lot of functions in R are vectorized; they act over vectors by default. If you think you need a for loop in R, you’re probably wrong:
paste("Here is", "a sentence.")
## [1] "Here is a sentence."
net_id <- c("ip98", "dz352", "as3934")
email_suffix <- rep("@cornell.edu", times = 3)
paste(net_id, email_suffix, sep = "")
## [1] "ip98@cornell.edu" "dz352@cornell.edu" "as3934@cornell.edu"
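In plain Python, the closest counterpart to the vectorized paste() call above is probably a list comprehension. This is only a sketch of the Python side; `emails` is an illustrative name.

```python
net_id = ["ip98", "dz352", "as3934"]

# Python strings are not vectorized, so we build the result element by
# element; R's paste() does this loop for us internally.
emails = [n + "@cornell.edu" for n in net_id]

print(emails)
```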
The + operator sadly does not concatenate strings in R:
## this will throw an error! :(
"Here is" + "a sentence"
Nor can we easily iterate over the characters in a string:
for (i in "my string"){
print(i)
}
## [1] "my string"
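Both behaviours differ from what Python users are used to. As a reminder of the Python side (a small sketch, with illustrative variable names):

```python
# In Python, + concatenates strings...
sentence = "Here is" + " " + "a sentence."

# ...and a for loop over a string visits one character at a time,
# whereas the R loop above runs once over the whole string.
chars = [ch for ch in "my string"]

print(sentence)
print(len(chars))
```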
Lists are more general and can contain entries of different types; you can even have lists of lists!
my_list <- list("bananas", 20, FALSE, list("another list!", "hello :)", 50))
library(glue)  ## glue() is not attached by library(tidyverse)
glue("my_list length: {length(my_list)}")
## my_list length: 4
my_list
## [[1]]
## [1] "bananas"
##
## [[2]]
## [1] 20
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [[4]][[1]]
## [1] "another list!"
##
## [[4]][[2]]
## [1] "hello :)"
##
## [[4]][[3]]
## [1] 50
The printed indices suggest how to access each list element:
my_list[[4]]
## [[1]]
## [1] "another list!"
##
## [[2]]
## [1] "hello :)"
##
## [[3]]
## [1] 50
my_list[[4]][[2]]
## [1] "hello :)"
You can convert from the more complicated list to a vector, but the entries will all convert to the most general data type available for the entries of the list:
unlisted <- unlist(my_list)
unlisted[2]
## [1] "20"
class(unlisted[2])
## [1] "character"
length(my_list)
## [1] 4
length(unlisted)
## [1] 6
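A loose Python analogy to unlist() is sketched below. Python has no built-in equivalent, so `flatten` is a hypothetical helper written for this illustration, and the conversion to strings has to be done explicitly because a Python list can keep mixed types.

```python
def flatten(items):
    """Recursively flatten nested lists into one flat list."""
    flat = []
    for item in items:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


my_list = ["bananas", 20, False, ["another list!", "hello :)", 50]]
flat = flatten(my_list)

# R coerces every entry to character automatically; in Python we ask for it.
as_strings = [str(x) for x in flat]

print(len(my_list), len(flat))
print(as_strings[1])
```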
dplyr
Both vectors and lists can be used to construct our best friends: dataframes. Dataframes exist in base R, but I prefer their tidyverse counterpart, tibbles:
my_tbl <- tibble(ID = 1:5,
fruit = c("bananas", "peaches", "tangerine",
"kiwis", "grapes"),
price_per_kilo = c(1.4, 4, 2, 1.2, 3))
my_tbl
Tidy data tables like this (with one observation per row and one variable per column) are the main currency of the tidyverse. Getting real data into this format can be challenging (which is where the tidyr package comes in handy), but once we have this, we’re laughing!
dplyr
dplyr is a package which implements the grammar of data manipulation, with a set of five basic verbs that perform transformations on data tables:
dplyr function | description | pandas counterpart |
---|---|---|
mutate() | add new variables (columns) | df["new_var"] = vals * |
select() | select variables by name | df[['col1']] |
filter() | select observations (rows) by values | query() |
summarise() | summarise values (e.g. take the mean) | agg() |
arrange() | sort observations | sort_values() |
* Mutate is generally used to make new variables from old ones. In this case, the syntax for pandas is a little more complicated to avoid the infamous SettingWithCopyWarning.
These functions can be combined with group_by(), which groups observations based on values in the specified columns. There is also the distinct() function, a special filter() method that drops duplicate observations (like drop_duplicates() in pandas).
More comparisons between R and Python can be found here.
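For readers coming from pandas, the five verbs might look roughly like this on the Python side. This is a sketch, not the only spelling: pandas usually offers several ways to do each step. The data are the fruit table from earlier in this document; `twice_price` is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "fruit": ["bananas", "peaches", "tangerine", "kiwis", "grapes"],
    "price_per_kilo": [1.4, 4.0, 2.0, 1.2, 3.0],
})

# mutate(): assign() returns a new frame, which sidesteps SettingWithCopyWarning
mutated = df.assign(twice_price=lambda d: 2 * d["price_per_kilo"])

# select(): pick columns by name
selected = df[["fruit", "price_per_kilo"]]

# filter(): keep rows matching a condition
filtered = df.query("price_per_kilo > 1.5")

# summarise(): reduce a column to a single value
mean_price = df["price_per_kilo"].agg("mean")

# arrange(): sort rows (descending here, like arrange(desc(...)))
arranged = df.sort_values("price_per_kilo", ascending=False)

# distinct(): drop duplicate rows
deduped = df.drop_duplicates()

print(filtered)
print(mean_price)
```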
Pipe operator (%>%)
One of the most handy features of dplyr is the pipe operator (%>%), which allows you to easily set up data pipelines. It passes whatever is on the left of it (usually a data frame) through to the first argument of the function on the right.
"Hello World!" %>% print()
## [1] "Hello World!"
"ip98" %>% paste("@cornell.edu", sep = "") %>% print()
## [1] "ip98@cornell.edu"
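The nearest pandas idiom is method chaining, optionally with the .pipe() method, which passes the dataframe as the first argument of a function much like %>% does. The sketch below reuses the fruit table from earlier; `top_n` is a hypothetical helper defined only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["bananas", "peaches", "tangerine", "kiwis", "grapes"],
    "price_per_kilo": [1.4, 4.0, 2.0, 1.2, 3.0],
})


def top_n(data, n):
    """Return the first n rows of a dataframe."""
    return data.head(n)


# Wrapping the chain in parentheses lets each step sit on its own line,
# which reads much like a dplyr pipeline.
pipeline = (
    df
    .query("price_per_kilo > 1.5")   # keep the pricier fruit
    .sort_values("price_per_kilo")   # cheapest of those first
    .pipe(top_n, 2)                  # df is passed as top_n's first argument
)

cheapest_two = pipeline["fruit"].tolist()
print(cheapest_two)
```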
penguins demo
Let’s take dplyr out for a spin by exploring the now familiar penguins dataset:
library(palmerpenguins)
We can inspect this data:
## get the table dimensions
dim(penguins)
## [1] 344 8
## peek at the first few observations of the table
head(penguins)
penguins %>% head()
## drop NAs in a specified column
penguins %>% drop_na(bill_length_mm)
## count the number of observations by year
penguins %>% group_by(year) %>% count()
Make a new column (and select specific columns):
penguins %>%
## make a column where the observations are twice the
## observations in another column
mutate(twice_bill_length_mm = 2*bill_length_mm) %>%
## drop some columns for nicer display
select(-(bill_depth_mm:year))
(penguins
## make an indicator variable column
%>% mutate(is_adelie = case_when(
## when the logical condition on the left is true,
## map to ("~") the value on the right
species == "Adelie" ~ 1,
## all other observations
TRUE ~ 0))
%>% select(species, is_adelie)
)
Filter the observations:
penguins %>%
## filter on one condition
filter(bill_length_mm > 40)
penguins %>%
## or filter on arbitrarily many conditions simultaneously
filter(bill_length_mm > 40, species == "Chinstrap")
Summarise variables:
(penguins
## group by two variables
%>% group_by(species, island)
## compute the grouped means
%>% summarise(mean_body_mass_g = mean(body_mass_g,
na.rm = TRUE))
)
Note that the summary table loses all of the original observations!
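In pandas, a grouped summary like this is usually written with groupby() and agg(). The sketch below uses a tiny made-up table (the values are invented for illustration and are not real penguin measurements) so that it runs without downloading the dataset. One difference worth noting: pandas skips missing values in mean() by default, whereas R needs na.rm = TRUE.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, NOT taken from the penguins dataset
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "island": ["X", "X", "Y", "Y"],
    "body_mass_g": [3000.0, np.nan, 4000.0, 5000.0],
})

summary = (
    toy
    .groupby(["species", "island"], as_index=False)
    # named aggregation: new_column=(source_column, function)
    .agg(mean_body_mass_g=("body_mass_g", "mean"))
)

print(summary)
```

As in dplyr, the result has one row per group, so the original observations are gone.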
We can also compute the desired summary for specific columns in a more programmatic way:
(penguins
## group by two variables
%>% group_by(species, island)
## compute the grouped means for all variables ending with
## "mm"
%>% summarise(across(ends_with("mm"), ~ mean(.x, na.rm = TRUE)))
)
Sort the observations:
(penguins
## filter by row index (first five rows)
%>% slice(1:5)
## sort by flipper length---increasing by default
## for descending, wrap variable with desc()
%>% arrange(desc(flipper_length_mm))
%>% select(species, island, flipper_length_mm)
)
ggplot2
OK, so we know how to manipulate dataframes now… but what about plotting? To whet your appetite, here’s how slick ggplot2 can be:
## initialize a ggplot object that defines the "grammar" of
## the plot (this doesn't draw anything yet!)
(ggplot(data = penguins %>% drop_na(),
mapping = aes(x = bill_length_mm,
y = bill_depth_mm,
col = sex))
## (the + operator adds plot layers to each other)
## add a linear model (with a 95% confidence band)
+ geom_smooth(formula = y ~ x,
method = "lm")
## add the raw data points
+ geom_point(size = 0.5)
## facet by species and year
+ facet_grid(
scales = "free", ## to allow each row and column to have
## different x and y scales (axis limits) --- more
## zoomed in, but can be harder to compare panels
rows = vars(year),
cols = vars(species)
)
## adjust labels
+ labs(x = "bill length (mm)",
y = "bill depth (mm)")
)
Base R is quite useful and stable, though a bit clunky for data analysis at times. If you are going to do any data analysis in R, I would highly recommend learning to do it the tidyverse way. The documentation is fantastic, and there are many well-written (and free!) online textbooks and tutorials authored by tidyverse package developers to help you out. A great place to start is R for Data Science.
As with any software, the tidyverse isn’t perfect, and there are certainly contexts where it isn’t the best choice. In my experience with data science, the tidyverse tools are indispensable; I use them virtually every time I work on data projects.
The tidyverse tools have been so popular that similar principles have been integrated into packages for other languages. For instance, there are a few implementations of tidyverse-like tools in Python, including this one.
The content of this tutorial was heavily informed by the following sources: