Data Science: Statistical Analysis of Big Data

Date

2022-2

Type

Master Thesis

Thesis title

University of Tripoli

Author(s)

Soha E. Elhaddar
Dr. Abdullatif S. Tubbal

Abstract

We live in the Big Data era. As our daily interactions move from the physical world to the digital world, every action we take generates data: information pours from our mobile devices, our computers, every file we save, and every social media interaction we make. Data is generated even when we do something as simple as asking Google for directions to the nearest gas station. Data science is the key to making this flow of information useful. Simply put, it is the art of using data to predict future behavior, discover hidden patterns, and draw meaningful conclusions from these vast, untapped data resources. Such vague definitions are shared with other modern fields such as Data Mining, Machine Learning, and Artificial Intelligence. So what distinguishes these fields from Data Science, and what is Data Science in practice? As far as I know, data science is not well known among Libyan statisticians, and there is no Libyan research in this field. This thesis therefore offers additional knowledge to those interested in data science, especially Libyan scientists, and can be considered a first step toward introducing data science to researchers in the Department of Statistics at the University of Tripoli. The primary purpose of this thesis is to clarify the vague definitions of data science and to show that statistics is the foundation underlying all of its theories, while the other fields provide advanced tools for applying statistical analysis to enormous amounts of data. We focus on statistics and its contributions to data science by analyzing and discussing, in some detail, the fundamental steps of the data science process, using the European Soccer Database (2008-2016) as a case study with an SQLite database and the R programming language, version 4.1.1.
To apply the data science process to this case study, several statistical techniques are used in this thesis for different purposes: descriptive statistics, confidence intervals, and experimental design models (factorial experiments with interaction, and factorial experiments with blocks and interaction), along with data mining techniques such as clustering with the K-means algorithm and classification with the Decision Tree algorithm.
