What is EDA

In this class we’ll focus on the Exploratory Data Analysis (EDA). In short, it refers to what we have been doing so far: data transformation and data visualization. We are going to dive deeper into those two but in a more coherent framework so that you become as comfortable as possible with data exploration in R. Nothing fundamentally new in this lab – we’ll only dive deeper into using the tools we already know.

You should think of the goal of exploratory data analysis as the following: you want to explore as many aspects as possible in your data so that you gain a better understanding of what is already there. The way you go about this is by so many iterations of questions -> visualizations to answer those questions.

ggplo2: one more time

First, let’s revisit some basics of ggplot2 library. We’ll use a new dataset called mpg. The dataset comes from the US Enviromental Protection Agency which collected a couple of models along with the following features for each model: manufacturer, model, year, engine size in liters, transmission type along with other variables.

library(ggplot2)

You can read more about the dataset by typing this command in the console:

?mpg

Let’s look at the first few lines:

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

We can asl a very simple question about fuel efficiency: what is the relationship between car engine size (i.e., displ) and fuel consumption (i.e., hwy)? We can simply answer this question using the two variables: displ (which codes for engine size in liters) and hwy (which codes for fuel efficiancy in miles per gallon in highways). As both variables are continous numbers, we can use a scatter plot:

ggplot(mpg, aes(x=displ, y=hwy)) + 
  geom_point()

As you see, we see a negative relationship between engine size and fuel efficincy. Now we can cliam that we are done with the first iteration of Q->V routine. We’ll do that throughout this lesson.

plotting 3 variables

In ggplot2, we have a lot more options we’ll explore in this short tutorial. Let’s vary the points in the scatter plots by a third variable. For example, we can color those cars using the class of the car:

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + 
  geom_point()

All what we did was adding color=class inside the aes command in the plotting command. We can also use shape as well

ggplot(mpg, aes(x=displ, y=hwy, shape=class)) + 
  geom_point()

If the third variable was continous (made of numbers), then we can use other options including size (which does not make much sense if your third variable is categorical):

ggplot(mpg, aes(x=displ, y=hwy, size=class)) + 
  geom_point()

or alpha:

ggplot(mpg, aes(x=displ, y=hwy, alpha=class)) + 
  geom_point()

faceting

Another feature of ggplot is called faceting. It is the splitting of a plot into many subplots by a given thrid variable. For example, we can plot the same relationship but by making a small subplot for each class using the function facet_wrap:

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + 
  geom_point() + 
  facet_wrap(~class, nrow=2)

And now we can see the relationship clearer. You can also use a combination of two variables by writing the varialbes around the telda ~ symbol

ggplot(mpg, aes(x=displ, y=hwy)) + 
  geom_point() + 
  facet_wrap(cyl~class)

geoms

Geoms are geometrical objects in ggplot. We can use a lot of those shapes when plotting. For example, we already seen “geom_point” Let’s try “geom_smooth”

ggplot(mpg, aes(x=displ, y=hwy)) + 
  geom_point() + 
  geom_smooth(method='lm')

Which will draw a smooth line that run across all the points, overlayed over the scatter plot. To make separate lines per a third variable, we’ll need to provide that third variable inside an aes funciton

ggplot(mpg, aes(x=displ, y=hwy, color=class)) + 
  geom_point() + 
  geom_smooth(method='lm', se=FALSE)

Note the relationship between aes and the lines: If we specify the aes at the base layer (i.e., inside ggplot()), then it will be applied to all the following layers. In the plot above, we specified color=class inside ggplot() and this will be inhereted by all the following layers. However, if we want to specify aesthetics for a certain layer, then we’ll need to define those inside that layer.

ggplot(mpg, aes(x=displ, y=hwy)) +
  geom_point(aes(color=class)) + 
  geom_smooth(method='lm', se=FALSE)

Also note that anything is defined outside aes will apply a static value to all the geoms. For example, we want to have the fit line in black. We can simply write color='black' inside the geom_smooth but not inside an aes function:

ggplot(mpg, aes(x=displ, y=hwy)) + 
  geom_point(aes(color=class)) + 
  geom_smooth(method='lm', se=FALSE, color='black')

Let’s add this smooth line over the points:

Let’s now explore “geom_boxplot”

ggplot(mpg, aes(x=class, y=hwy)) + 
  geom_boxplot()

We can also use “coord_flip()” to make this graph more readible

ggplot(mpg, aes(x=class, y=hwy)) + 
  geom_boxplot() + 
  coord_flip()

dplyr one more time

Now that we have explored some basics, let’s discuss data transformation and wrangling using dplyr. This is probably the last time we’re going to introduce those concepts. In the future labs, we’ll just go ahead and apply those concepts to understand statistics.

We’ll use a unique dataset called nycflights. First, you want to install the package:

install.packages('nycflights13')

Then we’ll load both packages inside our envirnment

library(dplyr)
library(nycflights13)

You can read a bit more about the dataset by typing

?flights

The dataset contains about 337K flights departing from New York City (JFK, LGA and EWR) in 2013. Each row contains: year, month, day, departure time, scheled departure time, departure delay, arrival time along with other variables.

Let’s revisit the basic verbs in dplyr:

filter

Lets begin by selecting flights on 4th of July:

flights %>% 
  filter(day==4 & month==7)
## # A tibble: 737 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7     4       11           2359        12      400
##  2  2013     7     4       59           2359        60      501
##  3  2013     7     4      454            500        -6      635
##  4  2013     7     4      535            536        -1      802
##  5  2013     7     4      538            540        -2      835
##  6  2013     7     4      539            545        -6      918
##  7  2013     7     4      542            545        -3      814
##  8  2013     7     4      553            600        -7      659
##  9  2013     7     4      554            600        -6      822
## 10  2013     7     4      556            600        -4      705
## # … with 727 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Or select flights in July, October and December:

flights %>% 
  filter(month %in% c(7, 10, 12))
## # A tibble: 86,449 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013    10     1      447            500       -13      614
##  2  2013    10     1      522            517         5      735
##  3  2013    10     1      536            545        -9      809
##  4  2013    10     1      539            545        -6      801
##  5  2013    10     1      539            545        -6      917
##  6  2013    10     1      544            550        -6      912
##  7  2013    10     1      549            600       -11      653
##  8  2013    10     1      550            600       -10      648
##  9  2013    10     1      550            600       -10      649
## 10  2013    10     1      551            600        -9      727
## # … with 86,439 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Let’s also use multiple logical operators: we can select flights that have not been delayed by mre than 2 hours:

flights %>% 
  filter(arr_delay<120 & dep_delay<120)
## # A tibble: 315,868 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 315,858 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

We can also say

flights %>% 
  filter( !(arr_delay>120 & dep_delay>120)  )
## # A tibble: 320,060 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 320,050 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

arrange

With arrange, we can sort the rows in either an ascending order (default) or a descending order.

For example, let’s find the most delayed flight in 2013 – note the way I use desc to get the descending order:

flights %>% arrange(desc(dep_delay)) %>% head(5)
## # A tibble: 5 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     9      641            900      1301     1242
## 2  2013     6    15     1432           1935      1137     1607
## 3  2013     1    10     1121           1635      1126     1239
## 4  2013     9    20     1139           1845      1014     1457
## 5  2013     7    22      845           1600      1005     1044
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

So the top flight was delayed by 1301 minutes (or ~22 hours).

select

Imagine we have a dataset with so many columns (think of hundreds – which is very possible in real-life datasets). select is the best way to zoom in specific columns and ignore the rest. Let’s for example select only the year, month and day:

flights %>% select(year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

We can also use a range. Remember, 1:3 will make a list from 1 to 3 (inclusive). with year:day, it will select all columns between year and day inclusive:

flights %>% select(year:day) #
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # … with 336,766 more rows

We can also use the minus sign to indicate execlusions: select all columns except for those specificed columns:

flights %>% select(-(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # … with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You might not be happy about a given column name. You can use rename to rename that column: new name = old name.

flights %>% rename(tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Finally, we can use everything inside select: we want time_hour and air_time to be shown first and then show the remaining columns as is:

flights %>% select(time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # … with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

mutate

In mutate, we can modify or add new variables based on current variables. For example, we will add two new variables here: the flight departure time but in hours and minutes:

flights %>%
    mutate(hour    = dep_time %/% 100, 
           minutes = dep_time %% 100) %>%
  select(dep_time, hour, minutes)
## # A tibble: 336,776 x 3
##    dep_time  hour minutes
##       <int> <dbl>   <dbl>
##  1      517     5      17
##  2      533     5      33
##  3      542     5      42
##  4      544     5      44
##  5      554     5      54
##  6      554     5      54
##  7      555     5      55
##  8      557     5      57
##  9      557     5      57
## 10      558     5      58
## # … with 336,766 more rows

summarize

In summarize, we compute any summary number from a set of measurements. Summarize will always collapse many rows into a single value. You alomst always want to use summarize paired with group_by so that you can print summary statistics for each group. For example, let’s print the average of delays (in minutes) for each day in the year

flights %>% 
  group_by(year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # … with 355 more rows

We can also explore the relationship between delay time and the average distance in the air. Do longer flights have more delays than shorter ones? Lets make a scatter plot as well.

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)

Let’s deal with only non cancelled flights and print the average delay again:

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day  mean
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.4 
##  2  2013     1     2 13.7 
##  3  2013     1     3 10.9 
##  4  2013     1     4  8.97
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.56
##  9  2013     1     9  2.30
## 10  2013     1    10  2.84
## # … with 355 more rows

counts

A good practice is to use counts whenever you do a summarization so you know we are not dealing with small data.

Let’s explore the number of flights vs average delay.

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
 
ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

Which shows that little flights => more variation. As the number of flights increases, we have more tight variaiton. We can filter out the flights with small n

delays %>% 
  filter(n > 25)
## # A tibble: 2,922 x 3
##    tailnum  delay     n
##    <chr>    <dbl> <int>
##  1 N0EGMQ   9.98    352
##  2 N10156  12.7     145
##  3 N102UW   2.94     48
##  4 N103US  -6.93     46
##  5 N104UW   1.80     46
##  6 N10575  20.7     269
##  7 N105UW  -0.267    45
##  8 N107US  -5.73     41
##  9 N108UW  -1.25     60
## 10 N109UW  -2.52     48
## # … with 2,912 more rows
ggplot(delays, aes(x = n, y = delay)) + 
    geom_point(alpha = 1/10)

Misc.

We can use subsetting inside summarize. As arr_delay have negative numbers (for flights that arrived earlier than scheduled), it may be better if we only select the positive values to reflect the truly delayed flights.

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # … with 355 more rows

Why is distance to some destinations more variable than to others? (This question is for you to think about)

not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance), count=n()) %>% 
  arrange(desc(distance_sd))
## # A tibble: 104 x 3
##    dest  distance_sd count
##    <chr>       <dbl> <int>
##  1 EGE         10.5    207
##  2 SAN         10.4   2709
##  3 SFO         10.2  13173
##  4 HNL         10.0    701
##  5 SEA          9.98  3885
##  6 LAS          9.91  5952
##  7 PDX          9.87  1342
##  8 PHX          9.86  4606
##  9 LAX          9.66 16026
## 10 IND          9.46  1981
## # … with 94 more rows

Which destinations have the most carriers?

not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

How many flights in each distination ? Let’s make a descending order list.

not_cancelled %>% 
  count(dest) %>%
  arrange(desc(n))
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ATL   16837
##  2 ORD   16566
##  3 LAX   16026
##  4 BOS   15022
##  5 MCO   13967
##  6 CLT   13674
##  7 SFO   13173
##  8 FLL   11897
##  9 MIA   11593
## 10 DCA    9111
## # … with 94 more rows

What proportion of flights are delayed by more than an hour, on a daily basis?

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # … with 355 more rows

Exercises

Find all flights that:

Is there a relationship between proportion of cancelled flights and the average delays? Remember that a missing dep_delay or arr_delay (i.e., is.na() is TRUE) represents a cancelled flight. Make a scatter plot that have the proportion of cancelled flights on the x-axis, and the average delay on the y-axis. Group the observations by month. Also make sure you plot straight line using geom_smooth(method='lm').