In this class we’ll focus on exploratory data analysis (EDA). In short, EDA refers to what we have been doing so far: data transformation and data visualization. We are going to dive deeper into both, but within a more coherent framework, so that you become as comfortable as possible with data exploration in R. There is nothing fundamentally new in this lab; we’ll only go deeper into the tools we already know.
You should think of the goal of exploratory data analysis as follows: you want to explore as many aspects of your data as possible so that you gain a better understanding of what is already there. You do this by iterating many times over a simple cycle: ask a question, then build a visualization that answers it (question -> visualization).
First, let’s revisit some basics of the ggplot2 library. We’ll use a new dataset called mpg. The dataset comes from the US Environmental Protection Agency, which collected data on a number of car models, with the following features for each: manufacturer, model, year, engine size in liters, and transmission type, along with other variables.
library(ggplot2)
You can read more about the dataset by typing this command in the console:
?mpg
Let’s look at the first few lines:
mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
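Printing the whole table is not the only way to get a first impression. As a minimal sketch, assuming only that ggplot2 is installed, a few base R helpers give a compact overview of the same dataset:

```r
library(ggplot2)  # provides the mpg dataset

# Number of rows (cars) and columns (variables)
dim(mpg)

# First six rows only, instead of printing the whole table
head(mpg)

# Column names with their types and a few example values
str(mpg)
```

The output of dim should agree with the 234 x 11 shown in the tibble header above.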
We can ask a very simple question about fuel efficiency: what is the relationship between car engine size and fuel efficiency? We can answer this question using two variables: displ (engine size in liters) and hwy (fuel efficiency on highways, in miles per gallon). As both variables are continuous, we can use a scatter plot:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
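As you can see, there is a negative relationship between engine size and fuel efficiency: cars with bigger engines tend to get fewer miles per gallon. We can now claim that we are done with the first iteration of the question -> visualization (Q->V) routine. We’ll repeat this routine throughout the lesson.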
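If you want a number to go with the picture, one option is to compute a correlation and a simple linear fit. This is only a rough sketch to back up what the scatter plot suggests; cor and lm are base R functions that have not been part of this lesson so far, and a straight line is an assumption about the shape of the relationship:

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine size and highway mileage;
# a value below zero indicates a negative linear association
r <- cor(mpg$displ, mpg$hwy)
r

# Slope of a straight-line fit: the estimated change in highway mpg
# for each additional liter of engine size
fit <- lm(hwy ~ displ, data = mpg)
slope <- coef(fit)[["displ"]]
slope
```

Both values should come out negative, in line with the downward trend in the plot.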
ggplot2 offers many more options, which we’ll explore in this short tutorial. Let’s vary the points in the scatter plot by a third variable. For example, we can color the cars by their class:
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
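All we did was add color=class inside the aes call in the plotting command. We can use shape in the same way. Note that ggplot2’s default shape palette covers at most six values, so with the seven classes in mpg it issues a warning and leaves one class unplotted: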
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
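One thing to watch with shape is the number of categories: class has seven distinct values, while the default shape palette in ggplot2 handles at most six, so one class ends up without points. A possible workaround, sketched here with scale_shape_manual and an arbitrary choice of shape codes (0 to 6, picked only for illustration), is to supply the shapes by hand:

```r
library(ggplot2)

# class has seven distinct values, one more than the default shape palette covers
n_classes <- length(unique(mpg$class))
n_classes

# Supplying one shape code per class by hand keeps every class visible;
# the codes 0 to 6 are an arbitrary choice for illustration
p_shapes <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)
p_shapes
```

With this many categories, color is usually easier to read than shape, which is one reason the color version of the plot tends to be preferred.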
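If the third variable is continuous (made of numbers), other options become available, including size. Size does not make much sense for a categorical variable; mapping class to it, as below, still draws a plot, but ggplot2 warns that using size for a discrete variable is not advised: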
ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()
The same goes for alpha (transparency):
ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()
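Since size and alpha are meant for numeric variables, here is a sketch of the same two plots with a continuous third variable instead of class. The choice of cty (city miles per gallon) is ours, made only for illustration; any numeric column of mpg would work the same way:

```r
library(ggplot2)

# cty (city miles per gallon) is numeric, so it suits size and alpha;
# class is text, which is why ggplot2 warns when it is mapped to them
is.numeric(mpg$cty)
is.numeric(mpg$class)

# Point size grows with city mileage; a fixed alpha reduces overplotting
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point(alpha = 0.5)
p_size

# Points become more opaque as city mileage increases
p_alpha <- ggplot(mpg, aes(x = displ, y = hwy, alpha = cty)) +
  geom_point()
p_alpha
```

Note the difference between alpha = 0.5 outside aes, which sets one fixed value for all points, and alpha = cty inside aes, which maps a variable to transparency.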
Another feature of ggplot2 is faceting: splitting a plot into many subplots by a third variable. For example, we can plot the same relationship with a small subplot for each class using the function facet_wrap:
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~class, nrow = 2)
Now we can see the relationship within each class more clearly. You can also facet by a combination of two variables by writing the variables on either side of the tilde (~) symbol:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(cyl ~ class)
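Before reading a two-variable facet plot, it can help to know how many panels to expect. With its default settings, facet_wrap should only draw panels for combinations that actually occur in the data, so a quick cross-tabulation (a sketch using base R’s table) tells us which combinations of cyl and class exist:

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate cylinders against class;
# a zero means there are no cars with that combination
combos <- table(mpg$cyl, mpg$class)
combos

# Number of non-empty combinations, i.e. the number of panels to expect
n_panels <- sum(combos > 0)
n_panels
```

If some panels contain only a handful of points, that is a hint that faceting by both variables may be splitting the data too finely to say much about the relationship.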