In this lab, we’ll learn about the fancy and beautiful side of R. We’ll first learn the basics of the dplyr library, and then we’ll make state-of-the-art data visualizations using ggplot2. Both libraries are part of a collection of packages called the tidyverse. According to their [website](https://www.tidyverse.org): ‘The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures’. Our goal is to master those two libraries.

The common theme of this lab is the use of grammar. ggplot2 presents a grammar for data visualization while dplyr presents a grammar for data wrangling. We’ll need to learn those grammar rules before we can use those libraries, but you’ll see shortly how a little grammar can pay off.

The dplyr library

Let’s start with a bit of data wrangling before we make some pretty graphics.

The five main verbs for data wrangling are select(), filter(), mutate(), arrange() and summarize().

Each of those verbs is an actual function, and each of those functions takes a dataframe as input and returns another dataframe. Combining those verbs gives us countless ways to compute descriptive statistics. In this lab, we focus on select(), filter() and summarize(). We’ll cover all five in the next lab.

If you have not installed dplyr or ggplot2 before, then you can use install.packages:

install.packages('dplyr')
install.packages('ggplot2')

Let’s first load dplyr and the babynames dataset again:

library(dplyr)
library(babynames)

select() and filter()

We already talked about how to select columns by index or by name, and how to filter rows by index or by logical condition. Here we are going to repeat all that, but this time we’ll do it better.

Note that if you are reading your data from a source file, then you’ll need a function like read.csv or read.table to read the data.
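As a minimal sketch of reading data from a file, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with read.csv. The file contents and the names tmp and toy are invented for illustration and are not part of the babynames data.

```r
# Write a small, made-up CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("year,name,n",
             "1880,Mary,7065",
             "1880,Anna,2604",
             "1880,Emma,2003"), tmp)

# read.csv() returns a data frame; the first line is used as the header
toy <- read.csv(tmp)
str(toy)
```

With a real file, you would pass its path to read.csv instead of the temporary file.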

Now we can select any set of columns using the select() function. The first input is the dataframe we are using, and the following inputs are the names of the columns we want to select. Note that it is not necessary to wrap column names in quotation marks.

select(babynames, year, name, n)
## # A tibble: 1,924,665 x 3
##     year name          n
##    <dbl> <chr>     <int>
##  1  1880 Mary       7065
##  2  1880 Anna       2604
##  3  1880 Emma       2003
##  4  1880 Elizabeth  1939
##  5  1880 Minnie     1746
##  6  1880 Margaret   1578
##  7  1880 Ida        1472
##  8  1880 Alice      1414
##  9  1880 Bertha     1320
## 10  1880 Sarah      1288
## # … with 1,924,655 more rows

The equivalent command we learned in the last lab was

babynames[, c('year','name','n')]
## # A tibble: 1,924,665 x 3
##     year name          n
##    <dbl> <chr>     <int>
##  1  1880 Mary       7065
##  2  1880 Anna       2604
##  3  1880 Emma       2003
##  4  1880 Elizabeth  1939
##  5  1880 Minnie     1746
##  6  1880 Margaret   1578
##  7  1880 Ida        1472
##  8  1880 Alice      1414
##  9  1880 Bertha     1320
## 10  1880 Sarah      1288
## # … with 1,924,655 more rows

Similarly, the first input to filter() is the dataframe, followed by the logical conditions that should be evaluated on every row. Let’s say we only want female names:

filter(babynames, sex == 'F')
## # A tibble: 1,138,293 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,138,283 more rows

The equivalent command we covered in the last lab was

babynames[babynames$sex == 'F' , ]
## # A tibble: 1,138,293 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,138,283 more rows

Now, what if we want to do those two together: first filter the rows that meet a specific condition, and then select only a specific column? To do this, we can use the so-called pipe operator %>%, which takes the output of the command on its left and passes it as the first input to the command on its right. Let’s do an example:

babynames %>%
    filter(year == 1950 & sex == 'F') %>%
    select(name)
## # A tibble: 6,111 x 1
##    name    
##    <chr>   
##  1 Linda   
##  2 Mary    
##  3 Patricia
##  4 Barbara 
##  5 Susan   
##  6 Nancy   
##  7 Deborah 
##  8 Sandra  
##  9 Carol   
## 10 Kathleen
## # … with 6,101 more rows

This returns the female names from 1950.

Note that

dataframe %>% filter(condition)

is actually equivalent to filter(dataframe, condition). However, using the %>% operator makes our code much more readable, especially when several steps are chained.
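To make the equivalence concrete, here is a small sketch on a made-up table (the data frame toy and its values are invented for illustration). It runs the same two steps once with the pipe and once as nested function calls, then checks that the results match.

```r
library(dplyr)

# Toy data frame, made up for illustration
toy <- data.frame(name = c("Mary", "Anna", "John"),
                  sex  = c("F", "F", "M"),
                  n    = c(7065, 2604, 9655))

# Pipe form: read top to bottom, one step per line
piped <- toy %>%
  filter(sex == "F") %>%
  select(name)

# Nested form: same steps, but read from the inside out
nested <- select(filter(toy, sex == "F"), name)

identical(piped, nested)
```

Both forms do the same work. The pipe only changes the order in which we read the steps.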

summarize() and group_by()

Until now, we have covered how to select columns or filter rows – but we haven’t really talked about descriptive statistics. summarize() will convert a set of numbers (or measurements) into a single number. It is how we can use the descriptive statistic functions we discussed last time. It will only output a single row. Let’s see an example:

babynames %>% 
  summarize( N=n(), 
             first_year = min(year),  
             last_year=max(year), 
             avg_n = mean(n), 
             max_n= max(n), 
             min_n= min(n))
## # A tibble: 1 x 6
##         N first_year last_year avg_n max_n min_n
##     <int>      <dbl>     <dbl> <dbl> <int> <int>
## 1 1924665       1880      2017  181. 99686     5

Here, n() counts the rows in the current group. Since we have not grouped anything yet, it is the same as nrow(babynames). Also note that we defined a completely new set of variables, and in each of them we used the column names as vectors and applied whatever function we wanted to those columns to make summaries.

But this doesn’t really give enough information. We can use group_by() to break those summaries by a grouping variable like sex.

babynames %>% 
  group_by(sex)  %>% 
  summarize( N=n(), 
             first_year = min(year),  
             last_year=max(year),   
             n_females=sum(sex == 'F') , 
             avg_n = mean(n))
## # A tibble: 2 x 6
##   sex         N first_year last_year n_females avg_n
##   <chr>   <int>      <dbl>     <dbl>     <int> <dbl>
## 1 F     1138293       1880      2017   1138293  151.
## 2 M      786372       1880      2017         0  223.

Note that we now have two rows, one for each level of the grouping variable. Only then can we compare the groups with each other.
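The effect of group_by() on n() and on the other summary functions can be seen on a tiny made-up table. This is only a sketch: the data frame toy and its values are invented for illustration.

```r
library(dplyr)

# Toy data frame, made up for illustration
toy <- data.frame(sex = c("F", "F", "M"),
                  n   = c(10, 20, 30))

# Ungrouped: n() counts every row, mean(n) averages the whole column
overall <- toy %>%
  summarize(N = n(), avg_n = mean(n))

# Grouped: the same functions are applied once per level of sex
by_sex <- toy %>%
  group_by(sex) %>%
  summarize(N = n(), avg_n = mean(n))

overall
by_sex
```

The summarize() call is identical in both cases. Only the grouping changes how many rows come out.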

One last example: how many babies were recorded for each sex, year by year, from 2015 onward?

popular_names <- babynames %>% 
                    filter(year >= 2015) %>% 
                    group_by(sex, year)  %>%  
                    summarize(total_n=sum(n)) %>% 
                    arrange(desc(total_n))
popular_names
## # A tibble: 6 x 3
## # Groups:   sex [2]
##   sex    year total_n
##   <chr> <dbl>   <int>
## 1 M      2015 1909804
## 2 M      2016 1889052
## 3 M      2017 1834490
## 4 F      2015 1778883
## 5 F      2016 1763916
## 6 F      2017 1711811

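Despite the variable name popular_names, what we get here is the total number of recorded babies by sex and year, sorted from largest to smallest. Notice that we used arrange() to sort the rows by the column given inside it, and desc() to sort in descending order.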
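If the goal is instead the single most popular name for each sex and year, one possible approach is to keep, within each group, the row whose n is the largest. The sketch below does this on a made-up table (toy, its names and its counts are invented for illustration, not taken from babynames).

```r
library(dplyr)

# Toy data frame, made up for illustration
toy <- data.frame(
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  sex  = c("F", "F", "M", "F", "F", "M"),
  name = c("Emma", "Olivia", "Noah", "Emma", "Olivia", "Noah"),
  n    = c(200, 190, 195, 180, 185, 190)
)

top_names <- toy %>%
  group_by(sex, year) %>%
  filter(n == max(n)) %>%   # keep the row(s) with the largest count per group
  ungroup() %>%
  arrange(year, sex)

top_names
```

Note that if two names tie for the largest count in a group, this approach keeps both rows.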

Basics of ggplot2

Let’s now shift our attention to data visualization. R has a set of basic visualization tools (e.g., plot, hist, etc) but we’ll learn a very cool data visualization package called ggplot2.

First, let’s install the gapminder package

install.packages('gapminder')

Then, we import ggplot2 and the gapminder dataset.

library(ggplot2)
library(gapminder)
data <- gapminder

Now let’s use data wrangling and ggplot2 together on this dataset. First, have a look at the head of the table:

head(data)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

There is actually a very cool function that makes a useful summary

summary(data)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

It shows that we have 12 records for Afghanistan, 12 records for Albania, etc. For the numerical columns, we see the minimum, the maximum, the mean and the quartiles of each column. Once we are satisfied with those numbers, let’s make some visualizations.

Let’s start with making a summary of all the data

data_summary <- data %>% 
                  summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 1 x 1
##   avg_life_exp
##          <dbl>
## 1         59.5

But this little summary will be much more useful if we break it down by year using group_by

data_summary <- data %>%  
                  group_by(year) %>% 
                  summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 12 x 2
##     year avg_life_exp
##    <int>        <dbl>
##  1  1952         49.1
##  2  1957         51.5
##  3  1962         53.6
##  4  1967         55.7
##  5  1972         57.6
##  6  1977         59.6
##  7  1982         61.5
##  8  1987         63.2
##  9  1992         64.2
## 10  1997         65.0
## 11  2002         65.7
## 12  2007         67.0

Now that we have the average life expectancy by year, we can go ahead and use ggplot to visualize.

ggplot(data_summary, aes(x=year, y=avg_life_exp))+
  geom_point()

This makes a nice plot of black dots that shows how life expectancy has progressed over the years.

So, speaking of the code, what is going on here? And how do we read this code?

The ggplot function is based on the idea of layers: each data visualization is composed of layers and a common logic. The common logic is set in the first line where we identify the data we are using, and then inside the aes() function we define the aesthetics of the plot like the x-variable and the y-variable. After we are done with those basic definitions, now we want to add layers on top of that.

ggplot(data, aes(x= name_of_x_variable, y=name_of_y_variable)) +
    geom_layer(properties_of_this_layer) +
    geom_layer(properties_of_this_layer) +
    geom_layer(properties_of_this_layer)

The plus sign is what we use to add layers on top of one another.
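The layering idea can be sketched by building a plot one step at a time and storing it in a variable. The small table toy below is made up for illustration (its values are borrowed from the first rows of the yearly summary above), and p is just a name chosen for the plot object.

```r
library(ggplot2)

# Toy data frame for illustration
toy <- data.frame(year         = c(1952, 1957, 1962),
                  avg_life_exp = c(49.1, 51.5, 53.6))

# Common logic: the data and the aesthetics, with no layers yet
p <- ggplot(toy, aes(x = year, y = avg_life_exp))

p <- p + geom_point()   # first layer: points
p <- p + geom_line()    # second layer: a line through the same points

# Printing the object draws the plot
print(p)
```

Each + returns a new plot object, so a plot can be extended gradually or written as one long chain, whichever reads better.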

Let’s now look at some basic layers:

ylab() and xlab()

ylab() and xlab() change the labels of the y-axis and the x-axis

ggplot(data_summary, aes(x=year, y=avg_life_exp))+
  geom_point() +
  xlab('Year') +
  ylab('Average Life Expectancy')

ggtitle()

ggtitle() adds a title to the plot

ggplot(data_summary, aes(x=year, y=avg_life_exp))+
  geom_point() +
  xlab('Year') +
  ylab('Average Life Expectancy') +
  ggtitle('Life expectancy over the years')

Color and Fill

Now, we want to make things more interesting. We want to see the same plot, but broken down continent by continent, so that we see which continents are doing better than others. To do so, we’ll need to add continent to the group_by function

data_summary <- data %>%  
                  group_by(year, continent) %>% 
                  summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 60 x 3
## # Groups:   year [12]
##     year continent avg_life_exp
##    <int> <fct>            <dbl>
##  1  1952 Africa            39.1
##  2  1952 Americas          53.3
##  3  1952 Asia              46.3
##  4  1952 Europe            64.4
##  5  1952 Oceania           69.3
##  6  1957 Africa            41.3
##  7  1957 Americas          56.0
##  8  1957 Asia              49.3
##  9  1957 Europe            66.7
## 10  1957 Oceania           70.3
## # … with 50 more rows

Hopefully you now see that we have a separate column for continent, which we can use to color things. In ggplot2, we do this by adding color to the aes

ggplot(data_summary, aes(x=year, y=avg_life_exp, color=continent))+
  geom_point() +
  xlab('Year') +
  ylab('Average Life Expectancy') +
  ggtitle('Life expectancy over the years')

I hope you see that we now have five sets of points, each in a different color. Note that we can easily connect the points of each continent by adding a new layer, geom_line()

ggplot(data_summary, aes(x=year, y=avg_life_exp, color=continent))+
  geom_point() +
  geom_line() +
  xlab('Year') +
  ylab('Average Life Expectancy') +
  ggtitle('Life expectancy over the years')

This now draws a line that connects the points, again broken down by continent.

Fill fills in color

Now I am going to use a different plot: a bar plot for life expectancy. We won’t be using the year information, so here we compute an overall average for each continent. Note three things. First, the x variable inside aes has changed from year to continent. Second, we have a new input called fill. Finally, we use a different layer, geom_bar(stat='identity'), which simply makes a bar plot using the exact numbers in the y variable.

data_summary <- data %>%  
                  group_by(continent) %>% 
                  summarize(avg_life_exp = mean(lifeExp))
data_summary
## # A tibble: 5 x 2
##   continent avg_life_exp
##   <fct>            <dbl>
## 1 Africa            48.9
## 2 Americas          64.7
## 3 Asia              60.1
## 4 Europe            71.9
## 5 Oceania           74.3

ggplot(data_summary, aes(x=continent, y=avg_life_exp, fill=continent))+
  geom_bar(stat='identity') +
  xlab('Continent') +
  ylab('Average Life Expectancy') +
  ggtitle('Average life expectancy by continent')

Note that when color and fill are used inside aes, we are not choosing the colors themselves but rather the variable whose values decide how the colors are assigned.
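The difference between mapping a variable to fill and setting a fixed fill color can be sketched as follows. The table toy is made up for illustration (its values are borrowed from the continent summary above), and the color name "steelblue" is an arbitrary choice.

```r
library(ggplot2)

# Toy data frame for illustration
toy <- data.frame(continent    = c("Africa", "Asia", "Europe"),
                  avg_life_exp = c(48.9, 60.1, 71.9))

# Mapping: fill is inside aes(), so ggplot2 picks one color per continent
p_mapped <- ggplot(toy, aes(x = continent, y = avg_life_exp, fill = continent)) +
  geom_bar(stat = "identity")

# Setting: fill is outside aes(), so every bar gets the same fixed color
p_fixed <- ggplot(toy, aes(x = continent, y = avg_life_exp)) +
  geom_bar(stat = "identity", fill = "steelblue")

print(p_mapped)
print(p_fixed)
```

In the first plot a legend appears because the colors carry information about continent. In the second plot the color is purely decorative, so there is no legend.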

This marks the end of today’s lesson. We’ll now do some exercises.

Exercises

Exercise 1

How long are people living? Make a histogram using the geom_histogram layer. Experiment with bins and colors (e.g., geom_histogram(bins=20, color="black")).

Exercise 2

Draw a scatter plot of life expectancy by year using only Spain. (Here, you’ll need to use dplyr’s filter command).

Exercise 3

Let’s make the same scatter plot but for multiple countries and using gdpPercap. Again, you’ll use the filter command. Let’s say we want to use Mexico, Canada and Iran. We can use the %in% operator to find which elements belong to a given list. For example,

countries <- c('France', 'Brazil', 'China', 'Canada', 'Iran', 'United States')
countries %in% c('Canada','Iran')
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE

This prints a logical vector with one value per element of countries: TRUE if that element appears in the second, smaller vector and FALSE otherwise. You want to use that inside filter too.
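As a hint, here is a minimal sketch of %in% inside filter() on a made-up table. The data frame toy, its value column and the vector wanted are invented for illustration and are not the gapminder data.

```r
library(dplyr)

# Toy data frame, made up for illustration
toy <- data.frame(country = c("France", "Brazil", "Canada", "Iran"),
                  value   = c(1, 2, 3, 4))

wanted <- c("Canada", "Iran")

# Keep only the rows whose country appears in wanted
subset_rows <- toy %>%
  filter(country %in% wanted)

subset_rows
```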

Don’t forget to make a scatter plot and a line (geom_line). Make them in different colors, and also use labeled axes.

More resources