Before you start any data analysis project, you must have a question in mind. In this post, I will talk about the main types of questions we can ask in a data science project, as laid out by Jeffrey Leek and Roger Peng in this amazing Science article.
Asking the right question is the most important task because it guides the whole pipeline of your data science project. Many of the most common mistakes in data analysis projects can be traced back to using a tool that is not appropriate for the question at hand. I will lay out a few examples toward the end of this post.
Generally, questions in data science should fall into one of the following categories: descriptive, exploratory, inferential, predictive, causal and mechanistic. Let’s talk about each of those categories in more detail.
The descriptive question
Descriptive data analysis simply summarizes the numbers without looking for further details or trends. Well-known examples are census data and government statistics. For example, one can ask questions such as:
- How many car accidents happened last year?
- How many people live in California?
The goal of those numbers is to summarize a set of measurements in a single number, leaving the interpretation to someone else.
The exploratory question
Exploratory data analysis (EDA) takes the descriptive analysis one step further by looking at the different variables and how they relate to one another. Here, we no longer deal with a single dimension but with many dimensions of the same data point. The goal of exploratory data analysis is to find trends and correlations that might help us generate hypotheses and discover new insights. For example, one can ask:
- How many car accidents happened last year per region? Which region has the most accidents, and which has the fewest?
- Where did the accidents that led to the most injuries or deaths happen?
- What is the relationship between the driver’s age and the number of accidents?
As you can see, this kind of analysis lets us spot new trends that might generate new hypotheses. We might find that younger drivers tend to be involved in more accidents than older drivers, or that intercity roads have more accidents during rush hours. Any discovery or insight at this point is still a hypothesis that has yet to be confirmed.
The main caveat of exploratory data analysis is that it does not confirm or deny any findings. In other words, any finding we get from this exploration is a hypothesis that awaits further tests to confirm its significance (or it may just be a statistical fluke; read Common Mistakes below to see why).
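To make the jump from a descriptive to an exploratory question concrete, here is a minimal sketch. The records, the field layout, and the age buckets are all made up for illustration; a real project would load its own data.

```python
from collections import Counter, defaultdict

# Hypothetical accident records: (region, driver_age, injuries).
accidents = [
    ("north", 19, 2),
    ("north", 23, 0),
    ("south", 45, 1),
    ("north", 31, 0),
    ("east", 22, 3),
    ("south", 67, 0),
]

# Descriptive: a single summary number.
total_accidents = len(accidents)

# Exploratory: break the same records down by other variables.
accidents_per_region = Counter(region for region, _, _ in accidents)

injuries_per_region = defaultdict(int)
for region, _, injuries in accidents:
    injuries_per_region[region] += injuries


def age_group(age):
    """Bucket a driver's age into a coarse, arbitrarily chosen group."""
    if age < 25:
        return "under 25"
    if age < 60:
        return "25-59"
    return "60+"


accidents_per_age_group = Counter(age_group(age) for _, age, _ in accidents)

print(total_accidents)
print(accidents_per_region.most_common())
print(dict(injuries_per_region))
print(accidents_per_age_group.most_common())
```

Whatever pattern shows up in such breakdowns is only a candidate hypothesis, for the reasons given above.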
The inferential question
In inferential data analysis, we take the exploratory analysis to the next level by investigating whether the hypotheses we collected still hold in new data (or in the population). This kind of analysis is very common in the scientific literature. To infer anything from the sample at hand, we rely on probability theory (and the associated inferential tests, such as the t-test and ANOVA). The result usually comes in the form of a p-value, which quantifies the probability of obtaining a result at least as extreme as the one observed if only chance were at work (or, more technically, if the null hypothesis is true).
For example, we can run a one-way ANOVA to test whether differences in car accidents between regions or age groups are significant. If they are, we can conclude, with some confidence (and with the effect size telling us how large the difference is), that the data support our hypothesis.
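As a rough illustration of such a test, the sketch below computes the one-way ANOVA F statistic by hand and estimates a p-value by shuffling the region labels. Note that this is a permutation version of the test, not the classical F-distribution p-value, chosen only so that it runs on the standard library. The accident counts and the function names are hypothetical.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of label shuffles whose F is at least as large as the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small tolerance so ties are not lost to floating-point rounding.
        if f_statistic(shuffled) >= observed - 1e-9:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical monthly accident counts for three regions.
regions = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(regions))
print(permutation_p_value(regions))
```

The p-value here answers exactly the question stated above: how often would a difference at least this large show up if the region labels meant nothing?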
The main caveat of inferential data analysis is that it only works at the population level. Whatever hypothesis you confirm (or fail to confirm) applies to the population and may not hold for any given individual. For example, assume that we found a significant difference between the heights of men and women. Does that mean you can reliably predict gender from a given height? Not really. The long answer is beyond the scope of this post and will be the subject of a future post.
Since inferential tests tell us little about individual samples, what should we do? We can approach the data using predictive data analysis, my next point.
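A quick simulation hints at the short answer. The means and spread below are assumed values picked for illustration, not real measurements, and the midpoint cut-off is just one simple way to guess a group from a height.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)

# Assumed, purely illustrative height distributions in centimetres.
men = [rng.gauss(175, 7) for _ in range(1000)]
women = [rng.gauss(162, 7) for _ in range(1000)]

# Population-level question: do the group means differ? (Welch's t statistic)
standard_error = (stdev(men) ** 2 / len(men) + stdev(women) ** 2 / len(women)) ** 0.5
t_statistic = (mean(men) - mean(women)) / standard_error

# Individual-level question: guess the group from one height,
# using the midpoint between the two sample means as the cut-off.
cutoff = (mean(men) + mean(women)) / 2
correct = sum(h > cutoff for h in men) + sum(h <= cutoff for h in women)
accuracy = correct / (len(men) + len(women))

print(f"t = {t_statistic:.1f}, accuracy = {accuracy:.2f}")
```

Under these assumptions the t statistic is enormous, yet a noticeable share of individuals still land on the wrong side of the cut-off because the two distributions overlap.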
The predictive question
In predictive data analysis, we usually take a different route (that may or may not depend on the inferential analysis). Here, we seek answers at the level of individual samples: we use some measurements (called features) to predict another measurement (called the outcome). The aim is to find out whether we can reliably predict an outcome from a set of measurements.
For example, you may build a model that takes the region of the accident, the age group of the driver, and the time of the day as inputs and outputs the predicted number of car accidents (as you would do in a multiple linear regression analysis). While we use p-values in the inferential data analysis to assess a given hypothesis, models in the predictive data analysis are assessed using a variety of evaluation metrics such as the mean squared error (MSE) or the accuracy, depending on the type of outcome measurement. All that matters in predictive data analysis is one thing: predictive power.
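The two metrics mentioned above are simple enough to write out. This is a minimal sketch with hypothetical predictions; the function names are mine, and real projects would typically rely on a library implementation.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference, for numeric outcomes."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual, predicted):
    """Share of exact matches, for categorical outcomes."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical predicted accident counts (numeric outcome).
print(mean_squared_error([10, 12, 9], [11, 12, 7]))

# Hypothetical predicted accident severity (categorical outcome).
print(accuracy(
    ["minor", "severe", "minor", "minor"],
    ["minor", "minor", "minor", "minor"],
))
```

Which metric applies depends on the outcome: MSE for numbers such as accident counts, accuracy for categories such as severity.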
This kind of analysis is the most common one in modern data-driven scientific or commercial applications, such as building neural network models to predict the class of an image or predicting total sales, given some other metrics. The new thing in this kind of analysis is that it allows us to use any kind of data, such as raw images, audio samples, locations, and probably anything you can think of.
One of the main caveats is the issue of generalizability. Models are interesting not because they achieve high accuracy on the samples at hand but because of their ability to predict new samples correctly. A hidden assumption here is that all future data are sampled from the same distribution as the data used to train the model. In practice, models often achieve high predictive power on the training data but fail to correctly predict new, unseen data (technically called overfitting). Although many statistical techniques can be used to mitigate this issue (e.g., regularization, cross-validation, etc.), it is very hard to prove that a given model is generalizable.
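An extreme case makes the gap between training and unseen data easy to see. In the sketch below, the labels are coin flips, so there is nothing to learn, and the model is a one-nearest-neighbour classifier that simply memorizes the training set. Both choices are mine, picked only to keep the example short.

```python
import random

rng = random.Random(0)


def make_samples(n):
    """Two random features and a coin-flip label: there is nothing to learn."""
    return [((rng.random(), rng.random()), rng.choice([0, 1])) for _ in range(n)]


def predict_1nn(train, point):
    """Return the label of the closest training sample (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, samples):
    """Share of samples whose predicted label matches the true label."""
    return sum(predict_1nn(train, x) == y for x, y in samples) / len(samples)


train = make_samples(200)
held_out = make_samples(200)

train_accuracy = accuracy(train, train)        # each point is its own nearest neighbour
held_out_accuracy = accuracy(train, held_out)  # should hover around chance level
print(train_accuracy, held_out_accuracy)
```

The training score looks perfect while the held-out score sits near a coin toss, which is why predictive power should be judged on data the model has not seen.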
Another caveat of this kind of analysis is that sometimes we do not have a clear explanation of why this model is highly predictive, especially if the model has a large number of free parameters.
All the previous types of analyses have one thing in common: they won’t tell you what the causal effect of one measurement on another measurement is. To answer this question, you need to look at causal data analysis, my next topic.
The causal question
Causal data analysis asks what happens to measurement Y if measurement X changes. Is there a causal relationship between X and Y, or is their relationship merely correlational (driven by a hidden factor)? Think of the positive relationship between ice cream sales and homicides (or drownings). A correlation between the two simply means that the two variables change together. A causal relationship, on the other hand, means that changes in variable X bring about changes in variable Y. Does ice cream cause people to murder or to die at swimming pools? Probably not. A third variable that may drive both is the season: both measurements spike during summertime, when we have a lot more sunny days and warm temperatures.
Causal data analysis aims to identify whether such causal relationships exist between different measurements. For example, a large number of studies have found that smoking, on average, increases the risk of cancer. If you smoke, your risk of cancer increases. On the other hand, if your risk of cancer is high, it doesn’t necessarily mean that you are a smoker.
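The hidden-factor story can be simulated. In this sketch, temperature drives both ice cream sales and drownings, and the two never influence each other; all coefficients and noise levels are invented. Removing a straight-line fit on temperature from both variables is one simple way to adjust for the hidden factor, assuming the relationships are linear.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


rng = random.Random(1)

# Simulated days: temperature drives both outcomes; they never affect each other.
temperature = [rng.gauss(20, 8) for _ in range(2000)]
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_correlation = pearson(ice_cream_sales, drownings)
adjusted_correlation = pearson(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)
print(f"raw = {raw_correlation:.2f}, adjusted for temperature = {adjusted_correlation:.2f}")
```

The raw correlation is strong, but it should largely vanish once temperature is accounted for. In real data the hidden factor is rarely known or measured this cleanly.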
In causal data analysis, we usually collect the data under a specific experimental design, such as randomized controlled trials (RCTs) or A/B testing. In an RCT, you randomly assign participants to two or more groups, typically a control group and a treatment group. The treatment group receives some sort of intervention while the control group does not, and we then use inferential data analysis to compare the outcome measure between the two groups. If we find a reliable difference, we conclude that the treatment caused it.
As you can see, performing a causal data analysis is most sensible when the data were collected under a dedicated experimental design (i.e., experimental studies). The majority of data, however, are collected in observational settings, which only record what is there without any dedicated manipulation. You might wonder, how do we know that smoking causes cancer? Did an evil scientist force some people to smoke and watch them die? The answer is no. There are ways of deducing causal relationships from observational studies, but they usually involve very careful and lengthy analyses (you can learn more about that in this excellent article). The bottom line is that you can’t just download a dataset from the internet and, based on some correlations, conclude that X causes Y. This point is often ignored (see Common Mistakes).
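Here is a minimal sketch of why random assignment helps, using simulated participants. The baseline distribution and the size of the treatment effect are assumptions; we only know the true effect because we built it into the simulation.

```python
import random
from statistics import mean

rng = random.Random(7)
TRUE_EFFECT = 2.0  # assumed effect of the intervention, known only because we simulate

# Hypothetical participants, each with a baseline outcome we cannot control.
baselines = [rng.gauss(50, 5) for _ in range(1000)]

# Random assignment: shuffle, then split in half.
rng.shuffle(baselines)
control = baselines[:500]
treatment = [b + TRUE_EFFECT for b in baselines[500:]]

# With random assignment, baseline differences average out between groups,
# so the difference in means estimates the effect of the intervention.
estimated_effect = mean(treatment) - mean(control)
print(f"estimated effect = {estimated_effect:.2f}")
```

The estimate will not match the true effect exactly, which is where the inferential comparison between the two groups comes in.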
The mechanistic question
The mechanistic data analysis takes a step further to show that changing one measurement always and exclusively leads to a deterministic change in another measurement (think of simple physics). This kind of analysis is only applicable in physical and deterministic systems, and it is extremely difficult to achieve in other contexts.
Common mistakes
Now that we have laid out the types of questions in data science projects, we should consider how they get confused. The core issue is that each type of analysis requires its own statistical procedures, and its results should not be read as if they came from another type. What happens when you do? Lots of bad things. Let’s take a look at a few common examples (also mentioned in the paper).
Interpreting inferential analysis as causal analysis
This is probably the most well-known mistake, and you can smell it a mile away, especially if you hear someone saying “oh... but correlation does not mean causation”. Funny examples can be found in many spurious correlations. In practice, however, it is very tricky to detect without a careful eye because we humans love simple and linear stories. There are many famous examples of such studies that you can find in this dedicated list. The problem is not with the data of those studies, but rather with the way the results are interpreted. When reporting an inferential data analysis, be very careful with expressions such as “the real reason for that is” or “as a result” and, of course, the word “cause.”
Interpreting exploratory analysis as predictive analysis (aka overfitting)
Another common mistake mentioned in the paper is interpreting exploratory analysis as predictive analysis, such as claiming that Google searches predict flu outbreaks (or take this other example). Experienced data scientists know this mistake as overfitting: a model shows great predictive accuracy on the training data but poor performance on unseen data. As I mentioned when I talked about the predictive question, generalization is a really big issue in predictive models. The predictive accuracy of any given model can only be assessed using separate data.
Interpreting an exploratory analysis as inferential analysis (aka data dredging, data fishing, data snooping, p-hacking, etc.)
There are lots of names for this mistake, which says something about its importance and implications. The root cause almost always lies in multiple comparisons. Let me explain: if you have a dataset with 50 data samples (e.g., participants) and 100 features (e.g., survey responses), and you run a correlation test between each pair of the 100 features (1 vs. 2, 1 vs. 3, … 99 vs. 100), you are running 4,950 tests and will almost always get significant correlations due to chance alone. In other words, you inflate the probability of false-positive findings. This might be okay if you are running an exploratory data analysis. However, as mentioned, the goal of exploratory data analysis is to generate hypotheses and ideas to be confirmed via inferential tests. You can’t take a single significant correlation from the exploratory data analysis and send it to your boss or publish a paper about it (this is the definition of p-hacking, a potential source of the reproducibility crisis in psychology, and it is closely related to what is famously called the garden of forking paths). There are ways to mitigate such risks, such as pre-registration. This particular mistake is very likely to happen if you, for example, run an A/B test with many metrics. Such an analysis is almost always bound to produce “significant findings” that end up being statistical flukes.
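The 50-sample, 100-feature scenario is easy to simulate. In the sketch below every feature is pure noise, so any "significant" correlation is a false positive. The critical value of about 0.279 is my approximation of the two-sided 5% threshold for |r| with 50 samples; treat it as an assumption rather than an exact constant.

```python
import random
from itertools import combinations

rng = random.Random(3)
N_SAMPLES, N_FEATURES = 50, 100

# Pure noise: no feature is truly related to any other.
features = [[rng.gauss(0, 1) for _ in range(N_SAMPLES)] for _ in range(N_FEATURES)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Approximate two-sided critical |r| for n = 50 at alpha = 0.05
# (derived from a t value of about 2.01 with 48 degrees of freedom).
R_CRITICAL = 0.279

pairs = list(combinations(range(N_FEATURES), 2))
significant = [
    (i, j) for i, j in pairs
    if abs(pearson(features[i], features[j])) > R_CRITICAL
]
print(len(pairs), len(significant))
```

With 4,950 pairs and a 5% threshold, a couple of hundred flukes are to be expected, and any one of them would look like a discovery if reported on its own.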
Interpreting descriptive analysis as inferential analysis
The main goal of a descriptive analysis is to summarize a set of measurements. There is, however, a class of descriptive analyses that do not include any summary: the N of 1, where you have a single data sample. Studies with an N of 1 are rare, and when you do see them (case reports, for example), they usually take the form of a qualitative analysis. While such an analysis is, in many cases, very informative and even ground-breaking, it does not have any inferential value. Whatever findings you report from an N of 1 cannot be generalized to other samples from the same population.