In this class we’ll focus on the Exploratory Data Analysis (EDA). In short, it refers to what we have been doing so far: data transformation and data visualization. We are going to dive deeper into those two but in a more coherent framework so that you become as comfortable as possible with data exploration in R. Nothing fundamentally new in this lab – we’ll only dive deeper into using the tools we already know.
You should think of the goal of exploratory data analysis as the following: you want to explore as many aspects as possible in your data so that you gain a better understanding of what is already there. The way you go about this is by so many iterations of questions -> visualizations to answer those questions.
First, let’s revisit some basics of ggplot2 library. We’ll use a new dataset called mpg
. The dataset comes from the US Enviromental Protection Agency which collected a couple of models along with the following features for each model: manufacturer, model, year, engine size in liters, transmission type along with other variables.
library(ggplot2)
You can read more about the dataset by typing this command in the console:
?mpg
Let’s look at the first few lines:
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
We can asl a very simple question about fuel efficiency: what is the relationship between car engine size (i.e., displ
) and fuel consumption (i.e., hwy
)? We can simply answer this question using the two variables: displ (which codes for engine size in liters) and hwy (which codes for fuel efficiancy in miles per gallon in highways). As both variables are continous numbers, we can use a scatter plot:
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point()
As you see, we see a negative relationship between engine size and fuel efficincy. Now we can cliam that we are done with the first iteration of Q->V routine. We’ll do that throughout this lesson.
In ggplot2, we have a lot more options we’ll explore in this short tutorial. Let’s vary the points in the scatter plots by a third variable. For example, we can color those cars using the class of the car:
ggplot(mpg, aes(x=displ, y=hwy, color=class)) +
geom_point()
All what we did was adding color=class
inside the aes
command in the plotting command. We can also use shape
as well
ggplot(mpg, aes(x=displ, y=hwy, shape=class)) +
geom_point()
If the third variable was continous (made of numbers), then we can use other options including size (which does not make much sense if your third variable is categorical):
ggplot(mpg, aes(x=displ, y=hwy, size=class)) +
geom_point()
or alpha:
ggplot(mpg, aes(x=displ, y=hwy, alpha=class)) +
geom_point()
Another feature of ggplot is called faceting. It is the splitting of a plot into many subplots by a given thrid variable. For example, we can plot the same relationship but by making a small subplot for each class using the function facet_wrap
:
ggplot(mpg, aes(x=displ, y=hwy, color=class)) +
geom_point() +
facet_wrap(~class, nrow=2)
And now we can see the relationship clearer. You can also use a combination of two variables by writing the varialbes around the telda ~ symbol
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
facet_wrap(cyl~class)
Geoms are geometrical objects in ggplot. We can use a lot of those shapes when plotting. For example, we already seen “geom_point” Let’s try “geom_smooth”
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method='lm')
Which will draw a smooth line that run across all the points, overlayed over the scatter plot. To make separate lines per a third variable, we’ll need to provide that third variable inside an aes funciton
ggplot(mpg, aes(x=displ, y=hwy, color=class)) +
geom_point() +
geom_smooth(method='lm', se=FALSE)
Note the relationship between aes and the lines: If we specify the aes at the base layer (i.e., inside ggplot()), then it will be applied to all the following layers. In the plot above, we specified color=class
inside ggplot() and this will be inhereted by all the following layers. However, if we want to specify aesthetics for a certain layer, then we’ll need to define those inside that layer.
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point(aes(color=class)) +
geom_smooth(method='lm', se=FALSE)
Also note that anything is defined outside aes will apply a static value to all the geoms. For example, we want to have the fit line in black. We can simply write color='black'
inside the geom_smooth but not inside an aes function:
ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point(aes(color=class)) +
geom_smooth(method='lm', se=FALSE, color='black')
Let’s add this smooth line over the points:
Let’s now explore “geom_boxplot”
ggplot(mpg, aes(x=class, y=hwy)) +
geom_boxplot()
We can also use “coord_flip()” to make this graph more readible
ggplot(mpg, aes(x=class, y=hwy)) +
geom_boxplot() +
coord_flip()
Now that we have explored some basics, let’s discuss data transformation and wrangling using dplyr. This is probably the last time we’re going to introduce those concepts. In the future labs, we’ll just go ahead and apply those concepts to understand statistics.
We’ll use a unique dataset called nycflights. First, you want to install the package:
install.packages('nycflights13')
Then we’ll load both packages inside our envirnment
library(dplyr)
library(nycflights13)
You can read a bit more about the dataset by typing
?flights
The dataset contains about 337K flights departing from New York City (JFK, LGA and EWR) in 2013. Each row contains: year, month, day, departure time, scheled departure time, departure delay, arrival time along with other variables.
Let’s revisit the basic verbs in dplyr:
filter()
select()
arrange()
either in ascending or descending ordermutate()
group_by()
followed by summarize()
Lets begin by selecting flights on 4th of July:
flights %>%
filter(day==4 & month==7)
## # A tibble: 737 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 4 11 2359 12 400
## 2 2013 7 4 59 2359 60 501
## 3 2013 7 4 454 500 -6 635
## 4 2013 7 4 535 536 -1 802
## 5 2013 7 4 538 540 -2 835
## 6 2013 7 4 539 545 -6 918
## 7 2013 7 4 542 545 -3 814
## 8 2013 7 4 553 600 -7 659
## 9 2013 7 4 554 600 -6 822
## 10 2013 7 4 556 600 -4 705
## # … with 727 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Or select flights in July, October and December:
flights %>%
filter(month %in% c(7, 10, 12))
## # A tibble: 86,449 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 10 1 447 500 -13 614
## 2 2013 10 1 522 517 5 735
## 3 2013 10 1 536 545 -9 809
## 4 2013 10 1 539 545 -6 801
## 5 2013 10 1 539 545 -6 917
## 6 2013 10 1 544 550 -6 912
## 7 2013 10 1 549 600 -11 653
## 8 2013 10 1 550 600 -10 648
## 9 2013 10 1 550 600 -10 649
## 10 2013 10 1 551 600 -9 727
## # … with 86,439 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Let’s also use multiple logical operators: we can select flights that have not been delayed by mre than 2 hours:
flights %>%
filter(arr_delay<120 & dep_delay<120)
## # A tibble: 315,868 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 315,858 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
We can also say
flights %>%
filter( !(arr_delay>120 & dep_delay>120) )
## # A tibble: 320,060 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 320,050 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
With arrange, we can sort the rows in either an ascending order (default) or a descending order.
For example, let’s find the most delayed flight in 2013 – note the way I use desc
to get the descending order:
flights %>% arrange(desc(dep_delay)) %>% head(5)
## # A tibble: 5 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## # … with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>
So the top flight was delayed by 1301 minutes (or ~22 hours).
Imagine we have a dataset with so many columns (think of hundreds – which is very possible in real-life datasets). select
is the best way to zoom in specific columns and ignore the rest. Let’s for example select only the year, month and day:
flights %>% select(year, month, day)
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
We can also use a range. Remember, 1:3 will make a list from 1 to 3 (inclusive). with year:day, it will select all columns between year and day inclusive:
flights %>% select(year:day) #
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # … with 336,766 more rows
We can also use the minus sign to indicate execlusions: select all columns except for those specificed columns:
flights %>% select(-(year:day))
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
## <int> <int> <dbl> <int> <int> <dbl>
## 1 517 515 2 830 819 11
## 2 533 529 4 850 830 20
## 3 542 540 2 923 850 33
## 4 544 545 -1 1004 1022 -18
## 5 554 600 -6 812 837 -25
## 6 554 558 -4 740 728 12
## 7 555 600 -5 913 854 19
## 8 557 600 -3 709 723 -14
## 9 557 600 -3 838 846 -8
## 10 558 600 -2 753 745 8
## # … with 336,766 more rows, and 10 more variables: carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You might not be happy about a given column name. You can use rename
to rename that column: new name = old name.
flights %>% rename(tail_num = tailnum)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Finally, we can use everything
inside select: we want time_hour and air_time to be shown first and then show the remaining columns as is:
flights %>% select(time_hour, air_time, everything())
## # A tibble: 336,776 x 19
## time_hour air_time year month day dep_time sched_dep_time
## <dttm> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 227 2013 1 1 517 515
## 2 2013-01-01 05:00:00 227 2013 1 1 533 529
## 3 2013-01-01 05:00:00 160 2013 1 1 542 540
## 4 2013-01-01 05:00:00 183 2013 1 1 544 545
## 5 2013-01-01 06:00:00 116 2013 1 1 554 600
## 6 2013-01-01 05:00:00 150 2013 1 1 554 558
## 7 2013-01-01 06:00:00 158 2013 1 1 555 600
## 8 2013-01-01 06:00:00 53 2013 1 1 557 600
## 9 2013-01-01 06:00:00 140 2013 1 1 557 600
## 10 2013-01-01 06:00:00 138 2013 1 1 558 600
## # … with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## # hour <dbl>, minute <dbl>
In mutate, we can modify or add new variables based on current variables. For example, we will add two new variables here: the flight departure time but in hours and minutes:
flights %>%
mutate(hour = dep_time %/% 100,
minutes = dep_time %% 100) %>%
select(dep_time, hour, minutes)
## # A tibble: 336,776 x 3
## dep_time hour minutes
## <int> <dbl> <dbl>
## 1 517 5 17
## 2 533 5 33
## 3 542 5 42
## 4 544 5 44
## 5 554 5 54
## 6 554 5 54
## 7 555 5 55
## 8 557 5 57
## 9 557 5 57
## 10 558 5 58
## # … with 336,766 more rows
In summarize, we compute any summary number from a set of measurements. Summarize will always collapse many rows into a single value. You alomst always want to use summarize paired with group_by so that you can print summary statistics for each group. For example, let’s print the average of delays (in minutes) for each day in the year
flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day delay
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows
We can also explore the relationship between delay time and the average distance in the air. Do longer flights have more delays than shorter ones? Lets make a scatter plot as well.
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
Let’s deal with only non cancelled flights and print the average delay again:
not_cancelled <- flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day mean
## <int> <int> <int> <dbl>
## 1 2013 1 1 11.4
## 2 2013 1 2 13.7
## 3 2013 1 3 10.9
## 4 2013 1 4 8.97
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.56
## 9 2013 1 9 2.30
## 10 2013 1 10 2.84
## # … with 355 more rows
A good practice is to use counts whenever you do a summarization so you know we are not dealing with small data.
Let’s explore the number of flights vs average delay.
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Which shows that little flights => more variation. As the number of flights increases, we have more tight variaiton. We can filter out the flights with small n
delays %>%
filter(n > 25)
## # A tibble: 2,922 x 3
## tailnum delay n
## <chr> <dbl> <int>
## 1 N0EGMQ 9.98 352
## 2 N10156 12.7 145
## 3 N102UW 2.94 48
## 4 N103US -6.93 46
## 5 N104UW 1.80 46
## 6 N10575 20.7 269
## 7 N105UW -0.267 45
## 8 N107US -5.73 41
## 9 N108UW -1.25 60
## 10 N109UW -2.52 48
## # … with 2,912 more rows
ggplot(delays, aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
We can use subsetting inside summarize. As arr_delay have negative numbers (for flights that arrived earlier than scheduled), it may be better if we only select the positive values to reflect the truly delayed flights.
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # … with 355 more rows
Why is distance to some destinations more variable than to others? (This question is for you to think about)
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance), count=n()) %>%
arrange(desc(distance_sd))
## # A tibble: 104 x 3
## dest distance_sd count
## <chr> <dbl> <int>
## 1 EGE 10.5 207
## 2 SAN 10.4 2709
## 3 SFO 10.2 13173
## 4 HNL 10.0 701
## 5 SEA 9.98 3885
## 6 LAS 9.91 5952
## 7 PDX 9.87 1342
## 8 PHX 9.86 4606
## 9 LAX 9.66 16026
## 10 IND 9.46 1981
## # … with 94 more rows
Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
How many flights in each distination ? Let’s make a descending order list.
not_cancelled %>%
count(dest) %>%
arrange(desc(n))
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ATL 16837
## 2 ORD 16566
## 3 LAX 16026
## 4 BOS 15022
## 5 MCO 13967
## 6 CLT 13674
## 7 SFO 13173
## 8 FLL 11897
## 9 MIA 11593
## 10 DCA 9111
## # … with 94 more rows
What proportion of flights are delayed by more than an hour, on a daily basis?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
## 7 2013 1 7 0.0333
## 8 2013 1 8 0.0213
## 9 2013 1 9 0.0202
## 10 2013 1 10 0.0183
## # … with 355 more rows
Find all flights that:
Is there a relationship between proportion of cancelled flights and the average delays? Remember that a missing dep_delay
or arr_delay
(i.e., is.na()
is TRUE) represents a cancelled flight. Make a scatter plot that have the proportion of cancelled flights on the x-axis, and the average delay on the y-axis. Group the observations by month. Also make sure you plot straight line using geom_smooth(method='lm')
.