Data Science

Data Science Bootcamp (in Arabic)

I led content development for a new online Data Science Bootcamp from barmej.com. I wrote the content of six courses covering topics such as introduction to data science, basics of data cleaning, machine learning, and model validation techniques. Each course comes with its own tutorials, labs, and projects that aim to build understanding not just from a theoretical point of view but also from a practical, hands-on perspective.

Datasets

I created several unique public datasets by scraping public-domain sources:

Hadith Dataset

Hadith (an Arabic word) refers to the words and actions of Prophet Mohammed. Hadiths were transmitted through generations of Muslim scholars until they were compiled into large written collections. The chain of narrators is the main area of study in Islamic scholarship because a single hadith may have multiple chains of narrators (that may or may not overlap). However, it has mainly remained a qualitative field where scholars of Hadith try to determine the authenticity of Hadiths by investigating and validating the chains of narrators who transmitted a given hadith. Further, the raw texts of Hadiths have not yet been used in quantitative approaches to data analysis. I hope this dataset makes further progress in this direction easier.

The Hadith dataset contains all Hadiths from the six primary hadith collections. The data was scraped from http://qaalarasulallah.com/. Note that the chain_indx column refers to the scholar_indx column in the Hadith Narrators Dataset.
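As a minimal sketch of how the two datasets could be linked, the snippet below resolves a hadith's chain_indx into narrator names. Only chain_indx and scholar_indx come from the dataset description; the comma-separated format of chain_indx, the hadith_id and name columns, and the placeholder rows are assumptions made for illustration.

```python
import csv
import io

# Hypothetical stand-ins for the two CSV files; the real files have more columns.
hadiths_csv = """hadith_id,chain_indx
1,"30418, 20005, 11"
2,"20005, 11"
"""
narrators_csv = """scholar_indx,name
11,Narrator A
20005,Narrator B
30418,Narrator C
"""

def load_narrators(text):
    """Map scholar_indx (int) to narrator name."""
    return {int(row["scholar_indx"]): row["name"]
            for row in csv.DictReader(io.StringIO(text))}

def resolve_chain(chain_indx, narrators):
    """Turn a comma-separated chain_indx string into a list of narrator names.

    Indices missing from the narrators table become None so gaps stay visible.
    """
    ids = [int(part) for part in chain_indx.split(",") if part.strip()]
    return [narrators.get(i) for i in ids]

narrators = load_narrators(narrators_csv)
chains = {row["hadith_id"]: resolve_chain(row["chain_indx"], narrators)
          for row in csv.DictReader(io.StringIO(hadiths_csv))}
```

Keeping unresolved indices as None rather than dropping them makes it easy to count how many chain entries have no match, which is useful given that the dataset has not been validated.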

Note that this is an early draft of the dataset and has not been validated. For example, the number of Hadiths in this dataset is much higher than the actual number contained in those sources, which may be due to a bug in my script. I plan to clean the dataset up further. As it stands, however, it can be used to prototype analyses in this area.

Hadith Narrators Dataset

Hadith (an Arabic word) refers to the words and actions of Prophet Mohammed. Hadiths were transmitted through generations of Muslim scholars until they were compiled into large written collections. The chain of narrators is the main area of study in Islamic scholarship because a single hadith may have multiple chains of narrators (that may or may not overlap). However, it has mainly remained a qualitative field where scholars of Hadith try to determine the authenticity of Hadiths by investigating and validating the chains of narrators who transmitted a given hadith.

This unprecedented dataset contains over 24,000 scholars and narrators along with their teachers and students (and other metadata), providing a macroscopic overview of how and where hadiths were preserved in the early days of Islam. The dataset can also help answer other questions, such as whether certain schools of scholarship were more prolific in preserving hadiths than others.

Arabic Poetry Dataset (6th – 21st century)

Arabic poetry is the oldest form of Arabic literature and remains its most prominent today. Ancient Arabic poetry is probably the primary source for describing social, political, and intellectual life in the Arab world. Modern poetry has gone through major shifts in both form and topics.

The dataset contains over 58K poems spanning the 6th century to the present day. Along with each poem, metadata such as the poet’s name and the poem’s category were also scraped. The data were scraped from adab.com.

EDAs and fun projects

Exploratory data analysis of crimes in Chicago (2005-2016)

Crime in Chicago is an interesting topic to explore for several reasons. Personally, I have been living in Chicago for a couple of months, and crime here is always a topic of conversation with friends and family. Another reason is the large amount of publicly available, high-quality crime data open for data scientists to mine and investigate, such as this dataset.

In this notebook, I am going to explore more about crime in Chicago and try to answer a few questions:

  • How has crime in Chicago changed across years? Was 2016 really the bloodiest year in two decades?
  • Are some types of crime more likely than others to happen in specific locations, at specific times of day, or on specific days of the week?
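Both questions come down to counting incidents along different dimensions. The following is a minimal sketch using only the standard library, assuming each record carries a timestamp and a crime type. The record layout and the sample rows are made up for illustration and are not taken from the Chicago dataset or from the notebook itself.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Hypothetical sample records: (timestamp, crime type).
records = [
    ("2015-03-14 23:10", "THEFT"),
    ("2015-07-04 02:30", "BATTERY"),
    ("2016-01-09 22:45", "THEFT"),
    ("2016-01-10 01:15", "BATTERY"),
    ("2016-06-18 14:00", "THEFT"),
]

def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM' timestamp string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")

# Question 1: how does the number of incidents change across years?
per_year = Counter(parse(ts).year for ts, _ in records)

# Question 2: how often does each crime type occur on each day of the week?
per_type_weekday = Counter(
    (kind, WEEKDAYS[parse(ts).weekday()]) for ts, kind in records
)

# The same pattern works for hour of day by swapping the grouping key.
per_type_hour = Counter((kind, parse(ts).hour) for ts, kind in records)
```

Raw counts like these favor the most common crime types, so comparing types fairly would usually mean normalizing each type's counts by its own total.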

Exploratory Analysis of Emoji usage in Saudi Twitter (in Arabic)

In this notebook (which became very popular on Twitter), I explore the semantic space underlying the new language of our time: emojis. These are a few of the questions I tried to answer: How are emojis being used? What kinds of emotions do they convey? How do those patterns change across cities? Can we derive a semantic analysis model based on the sentiment of the emojis? And much more. The notebook is in Arabic.

Harvard Business Reviews in 90 Years

In this project, I used a very cool multivariate technique called Correspondence Analysis to analyze the corpus of Harvard Business Review articles from 1922 to 2012. The result is summarized in a stunning 2D chart showing which words were most distinctive of which years. The chart itself says a lot about the history of the world over the last century.
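For readers unfamiliar with the technique, here is a minimal sketch of correspondence analysis on a word-by-period count table, computed with NumPy's SVD. It follows the standard textbook formulation and is not necessarily how the project was implemented; the words, periods, and counts are invented for illustration.

```python
import numpy as np

def correspondence_analysis(counts, n_components=2):
    """Return row coordinates, column coordinates, and principal inertias."""
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses (words)
    c = P.sum(axis=0)                  # column masses (periods)
    expected = np.outer(r, c)          # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates: scale singular vectors by singular values and masses.
    row_coords = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sing[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sing ** 2

# Hypothetical counts: rows are words, columns are time periods.
words = ["labor", "strategy", "digital", "manager"]
periods = ["1920s-40s", "1950s-70s", "1980s-2010s"]
counts = [
    [30, 5, 2],
    [4, 25, 6],
    [3, 8, 28],
    [10, 10, 10],
]

row_coords, col_coords, inertia = correspondence_analysis(counts)
```

Plotting row_coords and col_coords on the same two axes gives the kind of 2D chart described above: a word that lies in the same direction from the origin as a period is over-represented in that period relative to what independence would predict.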

Arabic AI Songwriter

In this project, I use an LSTM recurrent neural network (RNN) to learn the embedding space of more than 15,000 Arabic songs and then use this space to generate new lyrics.