Data Science, Statistics and Data Preparation

Word cloud from https://elitedatascience.com/learn-statistics-for-data-science

Data Science is everywhere.

Data science has grown rapidly over the last couple of years, and the number of companies adopting data-driven measures to improve their businesses keeps increasing.

Now, given the limited and basic knowledge that I have of Data Science, let’s answer some questions that will help us think and learn more when we study Statistics and Exploratory Data Analysis in FTW’s next bootcamp session. Here are the questions presented to us:

In your opinion:

Discuss the relevance of Statistics in the field of data science. Why do we need to have sufficient knowledge of Statistics? What concepts in statistics are useful in Data Science?

  • There are different kinds of data, but in data science we mostly deal with numbers. Data science is the process of creating insights and value from data, or making sense of what the numbers actually mean, in a way that everyone can understand and act on. Given a dataset, one can easily play with the data or build visualizations and models using pre-built libraries and packages in Python, or tools like Tableau. But how will you know that the right model or visualization was chosen, and how will you extract the information and arrive at actionable insights? I think this is where statistics, the practice of collecting and analyzing data using mathematical summaries, comes into play: it lets a data scientist handle the data with care and apply the right questions and techniques to understand it better. Some basic statistical concepts that are useful in data science are measures of central tendency (mean, median, mode), which describe the dataset through a central value; spread, which shows how tightly the data is clustered or how widely it is dispersed over a range of values; percentiles, which tell us the position of a value within the ordered data; skewness, which describes the shape and asymmetry of the distribution; and covariance and correlation, which describe the relationship between two variables. By working with these concepts, a data scientist is guided in evaluating results and answering the questions behind the data science problem. Statistics is a foundation of data science, and it is essential for data scientists who want to create meaning from data. We can think of it this way: just as science would be nothing without mathematics, data science would be nothing without statistics.
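To make these concepts a bit more concrete, here is a minimal sketch using only Python's built-in `statistics` module. The data (`hours_studied`, `exam_scores`) and the helper names (`describe`, `covariance`, `correlation`) are made up for illustration, and the population formulas used here are just one common convention.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # spread: population standard deviation
    # Skewness as the average cubed z-score; positive means a longer right tail.
    skew = sum(((v - mean) / stdev) ** 3 for v in values) / len(values)
    # Quartiles are the 25th, 50th and 75th percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": stdev,
        "quartiles": (q1, q2, q3),
        "skewness": skew,
    }


def covariance(xs, ys):
    """Population covariance: do two variables move together?"""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range -1 to 1."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))


# Hypothetical data, invented for this example.
hours_studied = [1, 2, 2, 3, 4, 5, 9]
exam_scores = [52, 55, 60, 61, 70, 74, 95]

summary = describe(hours_studied)
r = correlation(hours_studied, exam_scores)
print(summary)
print("correlation:", round(r, 3))
```

In this toy dataset the single large value (9 hours) pulls the mean above the median, which is the kind of asymmetry that skewness is meant to capture.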

“Data Science needs statistics. Because using data carelessly is often worse than not using it at all” — Michael Hochster, 2015

Discuss the importance of data preparation before data analysis. Why do we need to explore the data? Why do we need to clean the data?

  • Data should be prepared, explored and cleaned before data analysis, just as we need to collect, choose, clean and marinate food before it is cooked. Why are data preparation, exploration and cleaning important anyway? It all boils down to the quality of the results of your data analysis. One: finding, preparing or gathering the right data for the specific problem will save you time when you’re doing the actual data analysis. After all, without the data, you will not be able to perform any analysis. Two: exploring the data helps you get to know and understand it, so you have a better grasp of which specific aspects to focus on. This helps streamline the questions and techniques that you will use for data analysis. Lastly, cleaning the data, which may be the longest step in the data preparation stage, leaves you with only the data you need from what was collected, which makes analysis and visualization easier. Data cleaning also helps remove errors and outliers that would otherwise distort your analysis. All in all, just as in cooking, where you need to go through the preparation stage to come up with a delicious meal, these data preparation steps are essential before data analysis if you want to reach the true value and meaning of the data.
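As a rough illustration of the cleaning step, the sketch below drops missing values, exact duplicates and outliers from a small list of records. The records and the `clean` helper are hypothetical, and the 1.5 × IQR rule used to flag outliers is only one common convention; in practice, whether an outlier is an error or a real value is a judgment call.

```python
import statistics


def clean(records, key):
    """Drop records with a missing `key`, exact duplicates, and IQR outliers."""
    # 1. Remove records where the value is missing.
    present = [r for r in records if r.get(key) is not None]

    # 2. Remove exact duplicates while keeping the original order.
    seen, unique = set(), []
    for r in present:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(r)

    # 3. Remove outliers using the 1.5 * IQR rule (an assumed convention).
    q1, _, q3 = statistics.quantiles([r[key] for r in unique], n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in unique if low <= r[key] <= high]


# Hypothetical raw data with typical problems.
raw = [
    {"name": "Ana", "age": 29},
    {"name": "Ben", "age": 31},
    {"name": "Ben", "age": 31},    # exact duplicate
    {"name": "Cara", "age": None},  # missing value
    {"name": "Dan", "age": 27},
    {"name": "Eva", "age": 33},
    {"name": "Finn", "age": 30},
    {"name": "Gil", "age": 290},   # likely a typo
    {"name": "Hana", "age": 28},
    {"name": "Ivan", "age": 32},
]

cleaned = clean(raw, "age")
print([r["name"] for r in cleaned])
```

Exploring the data first (for example, by looking at its summary statistics) is what tells you which of these cleaning rules make sense for a given dataset.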