In this lab, we’ll learn about the fancy and beautiful R. We’ll first learn the basics of the dplyr library and then move on to making state-of-the-art data visualizations using ggplot2. Both of those libraries are part of a collection of packages called tidyverse. According to their website (https://www.tidyverse.org): ‘The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures’. Our goal is to master those two libraries.
The common theme of this lab is the use of grammar. ggplot2 presents a grammar for data visualization while dplyr presents a grammar for data wrangling. We’ll need to learn those grammar rules before we can use those libraries, but you’ll see shortly how a little grammar can pay off.
Let’s start with a bit of data wrangling before we make some pretty graphics.
The five main verbs for data wrangling are:
select(): take a subset of the columns (i.e., variables)
filter(): take a subset of the rows (i.e., observations)
mutate(): add new columns or modify existing ones
arrange(): sort the rows
summarize(): aggregate the data across rows (e.g., group it according to some criteria)
Each of those verbs is an actual function, and each of those functions takes a dataframe as input and returns another dataframe. Combining those verbs gives us a great many ways to compute descriptive statistics. In this lab, we are covering select(), filter() and summarize(). We’ll cover them all in the next lab.
If you have not installed dplyr or ggplot2 before, then you can use install.packages:
install.packages('dplyr')
install.packages('ggplot2')
Let’s first load the dplyr library and the babynames dataset again:
library(dplyr)
library(babynames)
We already talked about how to select columns by index or by names, and how to filter rows by index or by logical. Here we are going to repeat all that but this time we’ll do it better.
Note that if you are reading your data from a source file, then you’ll need a function like read.csv or read.table to read the data.
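As a small illustration of reading from a file, the sketch below writes a tiny CSV to a temporary location and reads it back with read.csv. The file contents and the name toy_names are made up for this example; with real data you would pass the path of your own file.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,year,n",
             "Mary,1880,7065",
             "Anna,1880,2604"), csv_path)

# read.csv() returns a data frame; by default the first line is the header.
toy_names <- read.csv(csv_path)

# Inspect the column names and types that were detected.
str(toy_names)
```

read.table works the same way but expects you to state the separator and whether a header is present, for example read.table(csv_path, sep = ",", header = TRUE).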
Now we can select any set of columns using the select() function. The first input is the name of the dataframe we are using, and the following inputs are the column names we want to select. Note that it is not necessary to wrap column names in quotation marks.
select(babynames, year, name, n)
## # A tibble: 1,924,665 x 3
## year name n
## <dbl> <chr> <int>
## 1 1880 Mary 7065
## 2 1880 Anna 2604
## 3 1880 Emma 2003
## 4 1880 Elizabeth 1939
## 5 1880 Minnie 1746
## 6 1880 Margaret 1578
## 7 1880 Ida 1472
## 8 1880 Alice 1414
## 9 1880 Bertha 1320
## 10 1880 Sarah 1288
## # … with 1,924,655 more rows
The equivalent command we learned in the last lab was
babynames[, c('year','name','n')]
## # A tibble: 1,924,665 x 3
## year name n
## <dbl> <chr> <int>
## 1 1880 Mary 7065
## 2 1880 Anna 2604
## 3 1880 Emma 2003
## 4 1880 Elizabeth 1939
## 5 1880 Minnie 1746
## 6 1880 Margaret 1578
## 7 1880 Ida 1472
## 8 1880 Alice 1414
## 9 1880 Bertha 1320
## 10 1880 Sarah 1288
## # … with 1,924,655 more rows
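To see the two styles side by side without loading the full dataset, here is a minimal sketch on a made-up toy table (the values in toy are invented for illustration). It also shows that select() can drop a column with a minus sign.

```r
library(dplyr)

# Hypothetical toy data, loosely shaped like babynames.
toy <- data.frame(
  year = c(1880, 1880, 1881),
  sex  = c("F", "F", "M"),
  name = c("A", "B", "C"),
  n    = c(10, 20, 30)
)

by_select  <- select(toy, year, name, n)      # dplyr: bare column names
by_bracket <- toy[, c("year", "name", "n")]   # base R: quoted column names

# Both approaches keep the same columns with the same values.
all.equal(by_select, by_bracket, check.attributes = FALSE)

# A minus sign drops a column and keeps everything else.
without_sex <- select(toy, -sex)
names(without_sex)
```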
Similarly, the first input to filter() is the dataframe, followed by the logical conditions that should be evaluated on every row. Let’s say we only want female names:
filter(babynames, sex == 'F')
## # A tibble: 1,138,293 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,138,283 more rows
The equivalent command we covered in the last lab was
babynames[babynames$sex == 'F' , ]
## # A tibble: 1,138,293 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,138,283 more rows
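filter() also accepts more than one condition, which is how you would restrict the rows to a given sex and a given year at the same time. A minimal sketch on a made-up toy table (all values are invented):

```r
library(dplyr)

# Hypothetical toy data with two years and two sexes.
toy <- data.frame(
  year = c(1975, 1975, 1976, 1976),
  sex  = c("F", "M", "F", "M"),
  name = c("A", "B", "C", "D"),
  n    = c(10, 20, 30, 40)
)

# Conditions separated by commas are combined with AND.
with_commas <- filter(toy, sex == "F", year == 1975)

# The same filter written with an explicit & operator.
with_and <- filter(toy, sex == "F" & year == 1975)

with_commas
```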
Now, what if we want to do those two together: we first want to filter rows that meet a specific condition and then we want to select only a specific column? To do this, we can use the so-called pipe operator %>%, which essentially takes the output of the command on the left as input to the command on the right. Let’s do an example:
babynames %>%
filter(year == 1950 & sex == 'F') %>%
select(name)
## # A tibble: 6,111 x 1
## name
## <chr>
## 1 Linda
## 2 Mary
## 3 Patricia
## 4 Barbara
## 5 Susan
## 6 Nancy
## 7 Deborah
## 8 Sandra
## 9 Carol
## 10 Kathleen
## # … with 6,101 more rows
Which should return the female names in 1950.
Note that dataframe %>% filter(condition) is actually equivalent to filter(dataframe, condition). However, using the %>% operator makes our code very readable.
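The equivalence can be checked directly. The sketch below, on a made-up toy table, writes the same two steps once with pipes and once as nested calls and compares the results.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(x = 1:5, y = c("a", "b", "a", "b", "a"))

# Piped version: read top to bottom, one step per line.
piped <- toy %>%
  filter(y == "a") %>%
  select(x)

# Nested version: the same calls, read from the inside out.
nested <- select(filter(toy, y == "a"), x)

identical(piped, nested)
```

The nested form gets harder to read with every extra step, which is the main reason to prefer the pipe.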
Until now, we have covered how to select columns or filter rows – but we haven’t really talked about descriptive statistics. summarize() will convert a set of numbers (or measurements) into a single number. It is how we can use the descriptive statistic functions we discussed last time. It will only output a single row. Let’s see an example:
babynames %>%
summarize( N=n(),
first_year = min(year),
last_year=max(year),
avg_n = mean(n),
max_n= max(n),
min_n= min(n))
## # A tibble: 1 x 6
## N first_year last_year avg_n max_n min_n
## <int> <dbl> <dbl> <dbl> <int> <int>
## 1 1924665 1880 2017 181. 99686 5
Here, n() counts the rows in the current group; since we have not grouped anything yet, it is the same as nrow(babynames). Also note that we defined a completely new set of variables, and in each of those new variables we used column names (as vectors) and did whatever we wanted with those columns to make summaries.
But this doesn’t really give enough information. We can use group_by() to break those summaries by a grouping variable like sex.
babynames %>%
group_by(sex) %>%
summarize( N=n(),
first_year = min(year),
last_year=max(year),
n_females=sum(sex == 'F') ,
avg_n = mean(n))
## # A tibble: 2 x 6
## sex N first_year last_year n_females avg_n
## <chr> <int> <dbl> <dbl> <int> <dbl>
## 1 F 1138293 1880 2017 1138293 151.
## 2 M 786372 1880 2017 0 223.
Note that we now have two rows, one for each grouping level. Only with grouped summaries like this can we compare the groups against each other.
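To make the effect of group_by() easy to verify by hand, here is a minimal sketch on a three-row toy table (values invented). The same summarize() call is run once on the whole table and once per group.

```r
library(dplyr)

# Hypothetical toy data: two rows for "F", one row for "M".
toy <- data.frame(
  sex = c("F", "F", "M"),
  n   = c(10, 30, 50)
)

# Without grouping: one row describing the whole table.
overall <- toy %>%
  summarize(rows = n(), avg_n = mean(n))

# With grouping: one row per level of sex.
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(rows = n(), avg_n = mean(n))

overall   # rows = 3, avg_n = 30
by_sex    # F: rows = 2, avg_n = 20; M: rows = 1, avg_n = 50
```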
One last example: how do we get the total number of recorded babies from 2015 onward, by sex, on a year-by-year basis?
popular_names <- babynames %>%
filter(year >= 2015) %>%
group_by(sex, year) %>%
summarize(total_n=sum(n)) %>%
arrange(desc(total_n))
popular_names
## # A tibble: 6 x 3
## # Groups: sex [2]
## sex year total_n
## <chr> <dbl> <int>
## 1 M 2015 1909804
## 2 M 2016 1889052
## 3 M 2017 1834490
## 4 F 2015 1778883
## 5 F 2016 1763916
## 6 F 2017 1711811
We now have the total counts by sex and year, sorted from largest to smallest. Notice that here we used arrange() to sort the rows by the expression inside it; desc() asks for descending order.
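A related question is which single name is the most popular within each sex and year. One possible way to answer it is slice_max(), which keeps the rows with the largest value in each group. This is a sketch under assumptions: it requires dplyr 1.0.0 or later, and the table toy is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: three names over two years.
toy <- data.frame(
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  sex  = c("F", "F", "M", "F", "F", "M"),
  name = c("A", "B", "C", "A", "B", "C"),
  n    = c(10, 30, 20, 40, 5, 25)
)

top_names <- toy %>%
  group_by(year, sex) %>%
  # Keep the row with the largest n in each (year, sex) group.
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(year, sex)

top_names
```

The same pipeline should work on babynames after filtering to the years of interest, although the result will be much longer.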
Let’s now shift our attention to data visualization. R has a set of basic visualization tools (e.g., plot, hist, etc) but we’ll learn a very cool data visualization package called ggplot2.
First, let’s install the gapminder package
install.packages('gapminder')
Then, we import ggplot2 and the gapminder dataset.
library(ggplot2)
library(gapminder)
data <- gapminder
Now let’s use data wrangling and ggplot2 together on this dataset. First, have a look at the head of the table:
head(data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
There is actually a very cool function that makes a useful summary:
summary(data)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
This shows that we have 12 records for Afghanistan, 12 records for Albania, etc. For the numerical columns, we see the minimum, maximum, mean, median and quartiles of each column. Now that we have a feel for those numbers, let’s make some visualizations.
Let’s start by making a summary of all the data:
data_summary <- data %>%
summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 1 x 1
## avg_life_exp
## <dbl>
## 1 59.5
But this little summary will be much better if we break it down by year using group_by():
data_summary <- data %>%
group_by(year) %>%
summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 12 x 2
## year avg_life_exp
## <int> <dbl>
## 1 1952 49.1
## 2 1957 51.5
## 3 1962 53.6
## 4 1967 55.7
## 5 1972 57.6
## 6 1977 59.6
## 7 1982 61.5
## 8 1987 63.2
## 9 1992 64.2
## 10 1997 65.0
## 11 2002 65.7
## 12 2007 67.0
Now that we have the average life expectancy by year, we can go ahead and use ggplot to visualize.
ggplot(data_summary, aes(x=year, y=avg_life_exp))+
geom_point()
Which makes a nice plot of black dots that shows the progress over the years in life expectancy.
So, speaking of the code, what is going on here? And how do we read this code?
The ggplot function is based on the idea of layers: each data visualization is composed of layers and a common logic. The common logic is set in the first line where we identify the data we are using, and then inside the aes() function we define the aesthetics of the plot like the x-variable and the y-variable. After we are done with those basic definitions, we want to add layers on top of that.
ggplot(data, aes(x= name_of_x_variable, y=name_of_y_variable)) +
geom_layer(properties_of_this_layer) +
geom_layer(properties_of_this_layer) +
geom_layer(properties_of_this_layer)
The plus sign is what we use to add layers on top of one another.
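Because each + adds one layer, a plot can be built up step by step and stored in a variable along the way. The sketch below uses a made-up three-row table (the numbers are invented, not taken from gapminder) and counts the layers before and after adding them.

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(year = c(1952, 1957, 1962), avg = c(49, 51, 54))

# Common logic only: data and aesthetics, no layers yet.
base <- ggplot(toy, aes(x = year, y = avg))

# Add two layers on top of the common logic.
p <- base + geom_point() + geom_line()

length(base$layers)   # 0 layers
length(p$layers)      # 2 layers

# Draw the plot.
print(p)
```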
Let’s now look at some basic layers:
ylab() and xlab() change the labels of the y-axis and the x-axis
ggplot(data_summary, aes(x=year, y=avg_life_exp))+
geom_point() +
xlab('Year') +
ylab('Average Life Expectancy')
ggtitle() adds a title to the plot
ggplot(data_summary, aes(x=year, y=avg_life_exp))+
geom_point() +
xlab('Year') +
ylab('Average Life Expectancy') +
ggtitle('Life expectancy over the years')
Now, we want to make things more interesting. We want to see the same plot, but broken down continent by continent so that we can see which continents are doing better than others. To do so, we’ll need to add continent to the group_by function:
data_summary <- data %>%
group_by(year, continent) %>%
summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 60 x 3
## # Groups: year [12]
## year continent avg_life_exp
## <int> <fct> <dbl>
## 1 1952 Africa 39.1
## 2 1952 Americas 53.3
## 3 1952 Asia 46.3
## 4 1952 Europe 64.4
## 5 1952 Oceania 69.3
## 6 1957 Africa 41.3
## 7 1957 Americas 56.0
## 8 1957 Asia 49.3
## 9 1957 Europe 66.7
## 10 1957 Oceania 70.3
## # … with 50 more rows
Hopefully you now see that we have a separate column for continent, which we can use to color things. In ggplot2, we do this by adding color to aes():
ggplot(data_summary, aes(x=year, y=avg_life_exp, color=continent))+
geom_point() +
xlab('Year') +
ylab('Average Life Expectancy') +
ggtitle('Life expectancy over the years')
You should now see five different sets of points, each in a different color. Note that we can easily connect the points by adding a new layer, geom_line():
ggplot(data_summary, aes(x=year, y=avg_life_exp, color=continent))+
geom_point() +
geom_line() +
xlab('Year') +
ylab('Average Life Expectancy') +
ggtitle('Life expectancy over the years')
Which now makes a line that connects all points – again broken down by continents.
Now I am going to use a different plot: a bar plot for life expectancy. We won’t be using the year information, so here we are computing an overall average. Note three things: first, the x-axis inside aes has now changed from year to continent. Second, we have a new input called fill. Finally, we use a different layer, geom_bar(stat='identity'), which simply makes a bar plot using the exact numbers mapped to the y-axis.
data_summary <- data %>%
group_by(continent) %>%
summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 5 x 2
## continent avg_life_exp
## <fct> <dbl>
## 1 Africa 48.9
## 2 Americas 64.7
## 3 Asia 60.1
## 4 Europe 71.9
## 5 Oceania 74.3
ggplot(data_summary, aes(x=continent, y=avg_life_exp, fill=continent))+
geom_bar(stat='identity') +
xlab('Continent') +
ylab('Average Life Expectancy') +
ggtitle('Average life expectancy by continent')
Note that in both color and fill, when used inside aes, we are not deciding the colors to use but rather the variable that we should use to fill that color.
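The difference between mapping a variable to fill inside aes and setting one fixed fill color outside aes can be seen in the sketch below. The table toy and the color name "steelblue" are illustrative choices, not part of the lab data.

```r
library(ggplot2)

# Hypothetical toy data.
toy <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 4))

# Mapping: fill follows the variable `group`, ggplot2 picks the colors.
mapped <- ggplot(toy, aes(x = group, y = value, fill = group)) +
  geom_bar(stat = "identity")

# Setting: a constant fill given outside aes(), the same for every bar.
fixed <- ggplot(toy, aes(x = group, y = value)) +
  geom_bar(stat = "identity", fill = "steelblue")

print(mapped)
print(fixed)
```

The first plot gets a legend because the color carries information; the second does not, because the color is only decoration.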
This marks the end of today’s lesson. We’ll now do some exercises.
How long are people living? Make a histogram using the geom_histogram layer. Experiment with bins and colors (e.g., geom_histogram(bins=20, color="black")).
Draw a scatter plot of life expectancy by year using only Spain. (Here, you’ll need to use dplyr’s filter command).
Let’s make the same scatter plot but for multiple countries and using gdpPercap. Again, you’ll use the filter command. Let’s say we want to use Mexico, Canada and Iran. We can use the %in% operator to find which elements belong to a given list. For example,
countries <- c('France', 'Brazil', 'China', 'Canada', 'Iran', 'United States')
countries %in% c('Canada','Iran')
## [1] FALSE FALSE FALSE TRUE TRUE FALSE
This prints a logical vector with TRUE for each element of countries that appears in the second, smaller list. You’ll want to use the same idea inside filter().
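As a hint for how %in% fits inside filter(), here is a minimal sketch on a made-up toy table rather than on gapminder itself. The names toy, wanted and subset_rows are hypothetical.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(
  country = c("France", "Canada", "Iran", "Brazil"),
  value   = 1:4
)

wanted <- c("Canada", "Iran")

# Keep only the rows whose country appears in `wanted`.
subset_rows <- toy %>%
  filter(country %in% wanted)

subset_rows
```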
Don’t forget to make a scatter plot and a line (geom_line). Make them in different colors, and also use labeled axes.