
Road accidents are a major cause of fatalities and injuries in Saudi Arabia, and reducing road fatalities is one of the main goals of Vision 2030. Recently, the Ministry of Transportation published a dataset of car accidents. Datasets like this are rare to find, as this one contains the raw records of car accidents (not aggregate counts). In this blog post, I will use it to conduct an exploratory data analysis of road accidents in Saudi Arabia and answer some basic questions that I have always wanted to ask:

- How are car accidents related to the quality of the roads?
- What are the most frequent accident types? Which region has the highest proportion of car accidents?
- Are some areas (e.g., travel roads) different from urban areas in the patterns of car accidents?

Among many others.

Although not all the questions can be answered solely through data exploration, exploratory data analysis gives us the right tools to know how to approach a problem.

The data come from a challenge organized by Thakaa Center called the Road Safety Challenge. They contain about 36K raw records of road accidents across the kingdom, with many attributes about each accident: date and time, region, road number, road type, number of deaths, number of injuries, geometric road type, latitude/longitude, weather status, and road status, among many other variables (which you can explore here and here).

The only concern I had about the data itself was its accuracy. Although the Ministry of Transportation sponsored the Road Safety Challenge, the numbers seem to diverge from those reported by the Ministry of Interior and the General Authority for Statistics (source 1, source 2), which report a total of 352,464 accidents in 2019 alone (compare that with 14,842 accidents in the dataset).

I do not have any explanation for those diverging numbers, but I would treat this dataset as a sample of all car accidents that took place, and I hope that someday we can have access to the full dataset of accidents.

At first look, about 68% of accidents report the number of deaths and 81% report the number of injuries (including 0's); the remaining accidents are missing this key information. Of the valid records, about 8.7% of accidents resulted in at least one fatality, while 48.8% resulted in at least one injury.

“Region” is one of the important independent variables in this dataset because Saudi Arabia is a big country with an underappreciated diversity between regions, both in geography and in population. The only problem we need to fix is to scale the numbers within each region, since central regions have more accidents simply due to their higher population. After scaling (using a z-score method), we can see the overall trends between regions and make a few observations.
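A minimal sketch of that within-region scaling in pandas (the column names and counts here are made up for illustration; the real dataset uses its own labels):

```python
import pandas as pd

# Hypothetical accident counts per region and accident type.
counts = pd.DataFrame({
    "region": ["Riyadh"] * 3 + ["Jazan"] * 3,
    "type":   ["crash", "coup", "deflection"] * 2,
    "n":      [5000, 2500, 900, 800, 450, 200],
})

# z-score the counts within each region so that regions with a large
# baseline number of accidents become comparable to smaller ones.
counts["z"] = counts.groupby("region")["n"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
```

After this transform, every region's counts have mean 0 and standard deviation 1, so accident-type profiles can be compared across regions.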

Looking at accident types, we see that 'crash' is the most common accident type (44%), followed by 'coup' (24%), after which the percentage falls dramatically to 8% for 'deflection'.

While looking at different accident types is informative, it is even more informative to look at how they relate to different regions. After scaling the number of accidents in each region across types (to remove the effect of the baseline number of accidents), we can notice the following:

Crash and coup still account for most accidents in all regions. However, some regions have more distinctive profiles. For example, 'overdrive' is most reported in the southern regions (Assir, Baha, and Najran) and Mecca. 'Deflection' is also common in Qassim, Mecca, Jawf, Hail, the Eastern Province, and Baha. Similarly, 'crush from behind' is more common in the Eastern Province, Hail, Madina, and Qassim.

Overall, it looks like Baha, Assir, and Mecca (all of which have mountains) share more 'overdrive' and 'deflection' accidents, while Madina, the Eastern Province, Qassim, and Hail share more 'crush from behind' and 'deflection' accidents. Aside from that, the profiles of Riyadh and Jazan are very similar: both report 'crash' accidents more than any other type.

To me, the story here is about geography: cities with mountain roads tend to have a very distinct profile, almost the opposite of cities on flat terrain with major highways.

The first idea is to look at the overall temporal distribution of those numbers: how do they change over time? When we look at the monthly trends (over two years: 2017 and 2018), we see two peaks in both the number of accidents and the total number of deaths: one around the summer (May, June, July) and the other around the end of the year (December, January).

If we look at the hourly trends, we notice an upward trend starting at 7 AM. The number of accidents peaks once, at 8 AM, while the number of deaths peaks twice: once at 8 AM and again at 8 PM. The number of injuries shows similar peaks.
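As a sketch, hourly trends like these come out of a simple groupby on the accident timestamp (the column names below are illustrative, not the dataset's actual ones):

```python
import pandas as pd

# A few made-up records standing in for the real accident data.
accidents = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-05-02 08:15", "2017-05-02 08:40",
        "2017-05-02 20:05", "2017-05-03 07:55",
    ]),
    "deaths": [0, 1, 2, 0],
})

# Count accidents and sum deaths per hour of the day.
hourly = (
    accidents
    .assign(hour=accidents["timestamp"].dt.hour)
    .groupby("hour")
    .agg(n_accidents=("deaths", "size"), n_deaths=("deaths", "sum"))
)
```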

We already know that accidents, on average, show a temporal pattern. This time, we’ll look at the temporal patterns of each accident type and also of each region. The hope is that we can connect the different pieces into a more coherent understanding of the big picture. Here, we’ll use the same type of scaling but we’ll scale either within-accident-type or within-region.

The graph on the left tells us that while most accidents peak during the morning-afternoon period (from 8 AM to 5 PM), some peak at different times. For example, 'coup and crash', 'crush from behind', and 'crush with a stationary body in the road' also peak after midnight. Some accidents only occur at night, between 8 PM and 1 AM, such as 'tread animal' and 'tread man'.

The one on the right shows that, when we look at regions instead, most accidents occur during the common hours from 7 AM to 3 PM (although in the Eastern Province and Riyadh the rise starts at 6 AM). We also notice that some regions have more night accidents than others: Tabuk, Madina, the Northern Borders, and Najran all have a higher share of night accidents, as opposed to Riyadh, where most accidents happen during the day.

The “two peaks” pattern in both analyses makes me wonder whether the night accidents in those regions are of the types that peak at night (like 'crush from behind', 'tread man', and 'tread animal'). These accident types also make it clear that we should look further into how road types interact with these observations.

Although the overall number of accidents decreased from 2017 to 2018, some accident types actually increased in some cities. In the following heat plot, red indicates an increase between 2017 and 2018 while blue indicates a decrease, and the intensity of the color indicates the magnitude of the difference. We generally see that the decrease in the number of accidents is driven by the decrease in accidents of type 'crash', which represents 44% of all observations. However, we also see increases: 'crush from behind' and 'deflection' have increased in many cities, as has 'overdrive' in Najran.

We are all here for the density maps. Those maps show the density of records at a given coordinate. I have three maps to show: the first shows the density of accidents by geometric road type, where red indicates higher counts (and potentially more deadly points); the second does the same with road type; and the third shows weather status.

Looking at the first map, we can notice a few things: injuries at U-turns are scattered throughout the kingdom, but we see dense clusters in Jazan and Tabuk. Accidents on straight link roads and at 3-leg intersections are also scattered throughout the kingdom, while accidents at 4-leg intersections and interchanges are clustered in Qassim and Riyadh. Accidents on horizontal curve roads are clustered in the southern regions.

When we look at the density map of accident types, we can make a few more observations. Coup, car tire explosion, and deflection accidents are common across the main highways of the kingdom; those are the “highway accidents”. Compare that with overdrive, which is most common in the southern regions, Mecca (and the highway linking them), and Riyadh. Collateral crush is most common within cities, while coup and crush look more common in Madina and on the Madina-Mecca highway. Finally, tread animal is scattered throughout the kingdom (and on non-highway roads).

Finally, the density map of weather status shows that this variable is missing for about half of the accidents. Within the valid half, the majority of accidents occurred in “Good” weather. Additionally, the Riyadh-Dammam and Riyadh-Qassim highways report more injuries in dusty weather.

There were lots of ways to look at this dataset. Exploratory data analysis is, by nature, a bit opinionated, as different analysts will look at the data from different angles.

I hope that my analysis will inspire many more explorations to pinpoint the major causes of car accidents and hopefully make them go away.

The data were obtained as part of the Road Safety Challenge organized by Thakaa Center. While downloading the data, I implicitly agreed not to share the dataset with anyone else (despite its availability at a public URL), so I chose to err on the side of caution and refrain from distributing the link. If you are really interested in obtaining the dataset, reach out to Thakaa Center on Twitter.

A Jupyter notebook that was used to analyze those trends (and contains much more analyses and interactive maps) will be available here soon.

I was informed that this dataset comes from the Ministry of Transportation and it only reports the inter-city accidents, not the intra-city accidents. This explains the discrepancy between the number of accidents here and the other numbers.

Before you start any data analysis project, you must have a question in mind. In this post, I will talk about the main types of questions we can ask in a data science project, as laid out by Jeffrey Leek and Roger Peng in this amazing Science article.

Asking the right question is the most important task because it guides the whole pipeline of your data science project. The most common mistakes in data analysis can be traced back to using a tool that is not appropriate for the given question. I will lay out a few examples toward the end of this post.

Generally, questions in data science fall into one of the following categories: descriptive, exploratory, inferential, predictive, causal, and mechanistic. Let's talk about each category in more detail.

Descriptive data analysis simply summarizes the numbers without looking for further details or trends. Famous examples are census data and government statistics. For instance, one can ask questions such as:

- How many car accidents happened last year?
- How many people live in California?

The goal is to summarize a set of measurements in a single number, leaving the interpretation to someone else.

Exploratory data analysis (EDA) takes the descriptive analysis one step further by looking at the different variables and how they relate to one another. Here, we no longer deal with a single dimension but with many dimensions of the same data point. The goal of exploratory data analysis is to find trends and correlations that might help us generate hypotheses and discover new insights. For example, one can ask:

- How many car accidents happened last year per region? Which region has the most accidents and which has the fewest?
- Where are the accidents that led to the most number of injuries or deaths?
- What is the relationship between the driver’s age and the number of accidents?

As you see, this kind of analysis opens the opportunity to see new trends that might generate new hypotheses. We might find that younger drivers tend to be involved in more accidents than older drivers, or that intercity roads have more accidents during rush hours. Any discovery or insight at this point is still a hypothesis that has yet to be confirmed.

The main caveat of exploratory data analysis is that *it does not confirm or deny any findings*. In other words, any finding we get from this exploration is a hypothesis that awaits further tests to confirm its significance (or it may just be a statistical fluke; read **Common Mistakes** below to see why).

In inferential data analysis, we take the exploratory analysis to the next level by investigating whether the hypotheses we collected still hold in new data (or in the population). This kind of analysis is very commonly used in the scientific literature. To infer anything from the sample at hand, we rely on probability theory (and the associated inferential tests such as the t-test and ANOVA). The result of this analysis usually comes in the form of a p-value, which quantifies the probability of obtaining a result at least as extreme as the observed one if things are random (or, more technically, if the null hypothesis is true).

For example, we can run a one-way ANOVA to test whether differences in car accidents between regions or age groups are significant. If they are, we can conclude, with some degree of confidence, that our hypothesis does indeed hold.
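As a sketch with simulated numbers (not the actual accident counts), a one-way ANOVA takes one sample of observations per group and returns an F-statistic and a p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical monthly accident counts for three regions; the first
# region is simulated with a genuinely higher mean.
region_a = rng.normal(120, 10, size=24)
region_b = rng.normal(100, 10, size=24)
region_c = rng.normal(100, 10, size=24)

f_stat, p_value = stats.f_oneway(region_a, region_b, region_c)
# A small p-value says at least one regional mean differs reliably.
```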

The main caveat of inferential data analysis is that *it only works at the population level*. Whatever hypothesis you confirm (or fail to confirm) applies only at the population level and may not hold at the individual level. For example, suppose we found a significant difference between the heights of men and women: does that mean you can reliably predict gender from a given height? Not really. The long answer is beyond the scope of this post and will be the subject of a future post.

Since the inferential test tells us little about *individual samples*, what should we do? We can approach the data using predictive data analysis, my next point.

In predictive data analysis, we usually take a different route (one that may or may not depend on the inferential analysis). Here, we seek answers at the level of individual samples: use some measurements (called features) to predict another measurement (called the outcome). The aim of predictive data analysis is to find out whether we can reliably predict an outcome from a set of measurements.

For example, you may build a model that takes the region of the accident, the age group of the driver, and the time of day as inputs and outputs the predicted number of car accidents (as you would in a multiple linear regression analysis). While we use p-values in inferential data analysis to assess a given hypothesis, models in predictive data analysis are assessed using a variety of evaluation metrics, such as the mean squared error (MSE) or accuracy, depending on the type of outcome. All that matters in predictive data analysis is one thing: *predictive power*.
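A minimal sketch of this idea, using simulated features standing in for region, age group, and time of day (none of these numbers come from the accident dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature matrix: three simulated, numerically encoded
# features (imagine region, driver age group, hour of day).
X = rng.uniform(0, 1, size=(200, 3))
# Simulated outcome with a known linear relationship plus noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0, 0.1, size=200)

# Fit a multiple linear regression by least squares (with an intercept).
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Assess the model by its predictive error (MSE), not by a p-value.
mse = np.mean((A @ coef - y) ** 2)
```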

This kind of analysis is the most common in modern data-driven scientific and commercial applications, such as building neural network models to predict the class of an image or predicting total sales from some other metrics. What is new in this kind of analysis is that it allows us to use any kind of data: raw images, audio samples, locations, and probably anything else you can think of.

One of the main caveats is *generalizability*. Models are usually interesting not because they achieve high accuracy on the samples at hand but because of their ability to predict new samples correctly. A hidden assumption here is that *all* future data are sampled from the same distribution as the data used to train the model. In reality, models often achieve high predictive power on the training data but fail to correctly predict new, unseen data (technically called overfitting). Although many statistical techniques can mitigate this issue (e.g., regularization, cross-validation), it is very hard to prove that a given model is generalizable.
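Overfitting is easy to demonstrate with a toy simulation (all numbers made up): fit polynomials of different degrees to a small noisy training set and compare errors on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small noisy training set and a separate test set drawn from the
# same underlying process.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

def errors(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# A degree-9 polynomial can pass through all 10 training points
# (near-zero training error) yet predict new points poorly.
train3, test3 = errors(3)
train9, test9 = errors(9)
```

The degree-9 model "wins" on the training data and loses where it matters, which is exactly why predictive accuracy must be assessed on separate data.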

Another caveat of this kind of analysis is that we sometimes have no clear explanation of why a model is highly predictive, especially if it has a large number of free parameters.

All the previous types of analyses have one thing in common: they won't tell you what the *causal effect* of one measurement on another is. To answer that question, you need causal data analysis, my next topic.

Causal data analysis answers what happens to measurement Y if measurement X is changed. Is there a causal relationship between X and Y, or is their relationship merely correlational (driven by a hidden factor)? Think of the positive relationship between ice cream sales and homicides (or drownings). A correlation between the two simply means that the two variables change together. A causal relationship, on the other hand, means that changes in variable X *control* changes in variable Y. Does ice cream cause people to murder or drown at swimming pools? Probably not. A third variable that may cause both is the season: both measurements spike during summertime, when we have many more sunny days and warm temperatures.

Causal data analysis aims to identify whether such causal relationships exist between different measurements. For example, a huge number of studies found that smoking, on average, increases the risk of cancer: if you smoke, your risk of cancer increases. On the other hand, a high risk of cancer doesn't necessarily mean you are a smoker.

In causal data analysis, we usually collect the data under a specific experimental design, such as randomized controlled trials (RCTs) or A/B testing. In an RCT, you *randomly* assign participants to two or more groups: a control group and a treatment group. The treatment group receives some sort of intervention while the control group does not, and we then use inferential data analysis to compare the outcome measure between the two groups. If we find a reliable difference, we conclude that the treatment *caused* that outcome.
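The logic of an RCT can be sketched with a simulation (all numbers made up): randomly assigned groups, a known treatment effect, and an inferential test on the outcome.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated RCT: participants are randomly assigned, and the
# treatment shifts the outcome by a known +3 on average.
control = rng.normal(50, 5, size=100)
treatment = rng.normal(53, 5, size=100)

t_stat, p_value = stats.ttest_ind(treatment, control)
# Because assignment was random, a small p-value here supports a
# causal effect of the treatment, not just a correlation.
```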

As you see, performing a causal data analysis makes the most sense in the context of a dedicated experimental design under which the data were collected (i.e., experimental studies). The majority of data, however, are collected in *observational* settings, which only record what is there without any dedicated manipulation. You might wonder: how did we know that smoking causes cancer? Did an evil scientist force some people to smoke and watch them die? The answer is no. There are ways of deducing causal relationships from observational studies, but they usually involve very careful and long analyses (you can learn more in this excellent article). The bottom line is that you can't just download a dataset from the internet and, based on some correlations, conclude that X causes Y, a point that is sometimes ignored by many (see **Common Mistakes**).

Mechanistic data analysis goes a step further, showing that changing one measurement *always and exclusively* leads to a *deterministic* change in another (think of simple physics). This kind of analysis is only applicable to physical and deterministic systems, and it is extremely difficult to achieve in other contexts.

Now that we have laid out the types of questions in data science projects, we should consider how they can be confused with one another. The core issue is that each of those analyses requires separate statistical procedures and should not be mistaken for another analysis. What happens when you confuse the results of one analysis with another? Lots of bad things. Let's take a look at a few common examples (also mentioned in the paper).

This is probably the most well-known mistake, and you can smell it a mile away, especially if you hear someone saying "oh.. but correlation does not mean causation". Funny examples can be found in many spurious correlations. In practice, however, it is very tricky to detect without a careful eye because we humans love simple and linear stories. There are many famous examples of such studies in this dedicated list. The problem is not with the data of those studies but with the way the results are interpreted. When reporting inferential data analysis, be very careful with expressions like "*the real reason for that is*" or "*as a result*", and certainly with the word "*cause*".

Another common mistake mentioned in the paper is interpreting exploratory analysis as predictive analysis, such as claiming that Google searches predict flu outbreaks (or take this other example). Experienced data scientists know this mistake as overfitting: a model shows great predictive accuracy on training data but poor performance on unseen data. As I mentioned when discussing the predictive question, generalization is a really big issue in predictive models. The predictive accuracy of any given model can only be assessed on separate data.

There are lots of names for this mistake, indicating its importance and implications. Its root cause almost always lies in multiple-comparison settings. Let me explain: if you have a dataset with 50 samples (e.g., participants) and 100 features (e.g., survey responses), and you run a correlation test between each pair of the 100 features (1 vs. 2, 1 vs. 3, …, 99 vs. 100), you will almost always get significant correlations due to chance. In other words, you maximize the probability of false-positive findings. This might be okay if you are running an exploratory data analysis. However, as mentioned, the goal of exploratory analysis is to generate hypotheses and ideas to be confirmed via inferential tests. You can't take a single significant correlation from the exploratory analysis and send it to your boss or publish a paper about it (this is the definition of p-hacking, a potential source of the reproducibility crisis in psychology, famously called the garden of forking paths). There are ways to mitigate such risks, such as pre-registration. This particular mistake is very likely to happen if you, for example, run an A/B test with many metrics: such an analysis is almost bound to produce "significant findings" that end up being statistical flukes.
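This effect is easy to simulate. The sketch below scales the example down to 20 purely random features (190 pairwise tests) and counts how many correlations come out "significant" by chance alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 50 samples, 20 purely random features: no true relationships exist.
data = rng.normal(size=(50, 20))

# Test every pair of features and count "significant" correlations.
false_positives = 0
n_tests = 0
for i in range(20):
    for j in range(i + 1, 20):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        n_tests += 1
        if p < 0.05:
            false_positives += 1
# With 190 tests at alpha = 0.05, roughly 10 hits are expected
# by chance alone, even though the data are pure noise.
```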

The main goal of a descriptive analysis is to summarize a set of measurements. There is, however, a class of descriptive analyses that does not include any summary: the N of 1, when you have a single data sample. It is very rare to see studies with an N of 1, but when you do (such as case reports), they take the form of qualitative analysis. While such an analysis is in many cases very informative and even ground-breaking, it does not have any inferential value: whatever findings you report from an N of 1 cannot be generalized to any other sample from the same population.

One of the most central concepts in statistics is the standard deviation and its relationship to other statistical quantities such as the variance and the mean. Students in introductory courses are told to “just remember the formula” but, believe me, this is not the best way to explain a concept. In this post, I will try to provide a visual and intuitive explanation of the standard deviation.

Let’s say you got a list of grades, which in this case would be our real-world measurements. We want to “compress” the information in those measurements into a handful of quantities that we can later use to compare, say, grades of different classes or grades of different years. Due to our limited cognitive capacity, we do not want to go over the grades, one by one, to find out which class scored higher on average. We need to summarize those numbers. This is why we have **descriptive statistics**.

There are two ways to summarize the numbers: by quantifying their similarities or their differences. Ways of quantifying their similarity to one another are formally called “measures of central tendency”. Those measures include the mean, median, and mode. Ways of quantifying their differences are called “measures of variability” and include the variance and standard deviation. **The standard deviation should tell us how a set of numbers differ from one another**, with respect to the mean.

Let’s take an actual example. Imagine that you collected those numbers for student grades (and, for the sake of simplicity, let’s assume those grades are the population).

\(2, 8, 9, 3, 2, 7, 1, 6\)

Let’s first plot those numbers in a simple scatter plot.

Now that we have all the numbers in a scatter plot, the first step to calculate the variation is to find the center of those numbers: the average (or the mean).

\(\bar{x} = \frac{\sum_{n=1}^{N} x_{n}}{N} = \frac{2+8+9+3+2+7+1+6}{8} = \frac{38}{8} = 4.75\)

Visually, we can plot a line to indicate the mean grade.

Now that we have a line for the mean, the next step is to calculate the distance between each point and the mean and then square that distance. Remember that our goal is to calculate the variation of those numbers with respect to the mean. We can do this mathematically or visually.

As you see here, “squaring” is really nothing but drawing a square. Note that we can’t just take the sum of the raw differences: since some differences are positive and some are negative, summing them would let the negative numbers cancel out the positive ones, ending up with zero (which does not mean anything). To resolve this, we take the square of the differences (and I will explain at the end why we take the square and not some other measure such as the absolute value).

Now, we calculate the sum of those squared differences (or, the sum of squares):

\(\sum_{n=1}^{N}(x_{n} - \bar{x})^2 = 67.5 \text{ points}^2\)

By calculating the **sum of squares**, we have effectively calculated the total variability (i.e., the differences) in those grades. Understanding how variability relates to differences is the key to understanding many statistical estimates and inference tests. What 67.5 means is that if we stack all those squares into one mega square, its area will equal \(67.5 \text{ points}^2\), where points refers to the unit of the grades. The total variability of any set of measurements is the area of a square.

Now that we got the total variability or the area of the mega-square, what we really want is the mean variability. To find that mean, we just divide the total area by the number of squares.

\(\frac{\sum(x_{n} - \bar{x})^2}{N}=\frac{67.5}{8} = 8.4375 \text{ points}^2\)

For most practical purposes you want to divide by \(N-1\), not \(N\), since you will usually be estimating this value from a sample, not the whole population. Here, however, we assumed we have the whole population. The point remains that you want to calculate the mean of those little squares. What we just calculated is **the variance**: the mean variability, or the mean squared difference.

Why can’t we just use the variance as an indicator of the variability in the grades? The only problem with the variance is that we can’t compare it with the raw grades: the variance is a “squared” value or, in other words, an area and not a length. Its unit is \(\text{points}^2\), which is not the unit of our raw grades (\(\text{points}\)). So what should we do to get rid of the square? Take the square root!

At last, **we now have the standard deviation**: the square root of the variance, which is \(\sqrt{8.4375} \approx 2.90 \text{ points}\).
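The whole walkthrough can be double-checked in a few lines of plain Python:

```python
import math

grades = [2, 8, 9, 3, 2, 7, 1, 6]

mean = sum(grades) / len(grades)                  # 4.75
squared_diffs = [(x - mean) ** 2 for x in grades]
sum_of_squares = sum(squared_diffs)               # 67.5
variance = sum_of_squares / len(grades)           # 8.4375 (dividing by N)
std_dev = math.sqrt(variance)                     # ~2.90
```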

This is the core idea of the standard deviation. This basic intuition should make it easier to understand why it makes sense to use units of standard deviations when dealing with z-scores, the normal distribution, the standard error, and the analysis of variance. Also, if you replace the mean with a fitted (predicted) line in the standard deviation formula, you get basic regression quantities: the mean squared error (without the square root) and the root mean squared error (with the square root, but now with respect to the fitted line). Furthermore, both correlation and regression formulas can be written in terms of sums of squares (total variability areas) of different quantities. Partitioning sums of squares is a key concept for understanding generalized linear models and the bias-variance tradeoff in machine learning.

In short: standard deviation is everywhere.

You might be wondering why we square the differences rather than just taking the absolute value. Nothing really prevents you from using the mean absolute difference instead of the mean squared difference. The mean absolute value gives the exact same weight to all differences, while squaring gives more weight to numbers that are further from the mean. That might be something you want. However, most mathematical theory uses squared differences (for reasons beyond the scope of this post, such as differentiability).

However, I will answer this question with a counterexample that is easy to understand (source). Let's say we have two sets of grades with the same mean, \(x_{1}\) and \(x_{2}\):

\(x_{1}= 2, 2, 10, 10\)

\(x_{2}= 13, 7, 0, 4\)

By looking at those grades, you can easily see that \(x_{1}\) has less variability and spread than \(x_{2}\). Let's calculate the mean absolute difference of both (knowing that their mean is 6):

\(\frac{\sum |x - \bar{x}|}{N} = \frac{|-4| + |-4| + |4| + |4|}{4} = \frac{16}{4} = 4\)

\(\frac{\sum |x - \bar{x}|}{N} = \frac{|7| + |1| + |-6| + |-2|}{4} = \frac{16}{4} = 4\)

Oops! That's bad. Both sets give exactly the same variability value, although we would want \(x_{1}\) to have a lower value than \(x_{2}\), as its numbers are less variable. If we use the squared differences, however, we get:

\( \sqrt{\frac{\sum (x - \bar{x})^2}{N}} = \sqrt{\frac{(-4)^2 + (-4)^2 + (4)^2 + (4)^2}{4}} = \sqrt{\frac{64}{4}} = \sqrt{16} = 4 \)

\( \sqrt{\frac{\sum (x - \bar{x})^2}{N}} = \sqrt{\frac{(7)^2 + (1)^2 + (-6)^2 + (-2)^2}{4}} = \sqrt{\frac{90}{4}} = \sqrt{22.5} = 4.74 \)

which, thanks to squaring the differences, gives us exactly what we hoped for: the standard deviation is bigger when the numbers are more spread out.
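The counterexample is easy to verify in a few lines of Python:

```python
import math

def mean_abs_diff(xs):
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def std_dev(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

x1 = [2, 2, 10, 10]
x2 = [13, 7, 0, 4]

# The mean absolute difference cannot tell the two sets apart...
mad1, mad2 = mean_abs_diff(x1), mean_abs_diff(x2)   # both 4.0
# ...while the standard deviation can.
sd1, sd2 = std_dev(x1), std_dev(x2)                 # 4.0 vs ~4.74
```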

This is a legit question: why should we care about communicating with the brain at all? There are many reasons, and here I will mention a few BCI applications, hoping they will convince you. The first application that comes to mind is using BCI for assistive and prosthetic purposes. Specifically, BCI tools like cochlear and retinal implants, artificial limbs, and deep brain stimulation technologies are helping millions around the world.

Restoring brain functions that have been lost is one of the most important motives behind BCI technology. There are, however, many other applications that excite most people such as augmenting brain functioning via neurofeedback, using the power of thought (alone) to control your favorite device or play video games, and many others (check the amazing brain-computer interface entry on Wikipedia if you want to see wider coverage of possible applications).

The P300 is a very salient neural response that occurs within the first second after the subject sees something they care about. It is used in many innovative ways, such as lie detection and typing (with thought alone), making it a viable tool for enabling paralyzed patients to communicate.

The design of my experiments is pretty straightforward. They all involve showing different images in random order while recording brainwaves over the occipital and temporal areas. In this post, I used grating images (see the image) that consist of multiple black bars arranged at different spatial frequencies (more or fewer bars). Those images are very popular in vision research for many reasons that are beyond the scope of this post. Each condition was repeated 50 times, resulting in 150 presentations (spatial frequencies of 3 and 12, plus a no-image control). Each image was presented for half a second, followed by a 3 s inter-trial interval (ITI) in which a '+' sign was presented.

I honestly didn’t know about the P300 until I saw this image (the plot shows only the first channel, but the exact pattern is observed in all channels). This positive ramp is present in every experiment I ran, across a wide variety of stimulus types (I will explore some of those in future posts). Notice how the control condition does not show any deflection while both experimental conditions (where actual images were shown) do. Something I did not expect is the second positive ramp around 400 ms, present only for the orange line (higher-frequency bars) but not for the blue line. This makes it much easier for machine learning algorithms to distinguish the two. Indeed, a very simple logistic regression classifier achieved about 52% accuracy (under cross-validation) in distinguishing the three classes, well above the 33% chance level.

Here, I showed that low-cost hardware (it costs $322 to get all the equipment) can get you a high-quality EEG signal. Indeed, OpenBCI keeps a running list of scientific publications that used OpenBCI for data collection. I plan to pursue further experiments and share the results on this blog.

Finally, a word of thanks to the OpenBCI team and community for their incredible effort in making neuroscience and BCI hardware and software more accessible to the general public, and to the Neurotech@Berkeley team for their amazing course and the software I used to build the experiments.

Check my GitHub repository for the code used in this post.

**More details on the technical setup**: I used the Ganglion device from OpenBCI, which offers 4 channels. The channels were attached to (approximately) O1, O2, T1, and T2 (covering both sides of the occipital and temporal areas). I used Node.js to connect to the board and stream the data (via the lab streaming layer) to a Python script that also stores the recordings in a text file. All those tools are adapted from the neurotech course labs. Alongside the recording, I used PsychoPy to design and run the experiment.

A big barrier to writing is perfectionism (and, as you might know, perfect is the enemy of done), and it has caused me a lot of pain and regret. I have wanted to start this blog for a long time (a whole year, to be exact), but my bar was so high that I didn't write a thing. To lower this barrier, I will accept that the quality of the blog posts will vary from time to time, but hopefully you won't leave empty-handed.

See you next time.
