The Five Steps of Data Science

Introduction to data science

Data Science Interview Guide

Overview of the five steps

The five essential steps to perform data science are as follows:
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results
First, let’s look at the five steps with reference to the big picture.

Ask an interesting question

This is probably my favorite step. As an entrepreneur, I ask myself (and others) interesting questions every day. Treat this step as you would a brainstorming session: start writing down questions regardless of whether you think the data to answer them even exists.

Obtain the data

Explore the data

Once we have the data, we use the lessons learned in Chapter 2, Types of Data, of this book and begin to break down the types of data that we are dealing with. This is a pivotal step in the process.

Model the data

Communicate and visualize the results

Basic questions for data exploration

When looking at a new dataset, whether it is familiar to you or not, it is important to use the following questions as guidelines for your preliminary analysis:

  • Is the data organized or not?
    We are checking whether the data is presented in a row/column structure. For the most part, data will be presented in an organized fashion. In this book, over 90% of our examples will begin with organized data. Nevertheless, this is the most basic question that we can answer before diving any deeper into our analysis. A general rule of thumb is that if we have unorganized data, we want to transform it into a row/column structure. For example, earlier in this book, we looked at ways to transform text into a row/column structure by counting the number of words/phrases.
  • What does each row represent?
    Once we know how the data is organized and are looking at a nice row/column-based dataset, we should identify what each row actually represents. This step is usually very quick and helps put things in perspective.
  • What does each column represent?
    We should identify each column by the level of data, whether it is quantitative or qualitative, and so on. This categorization might change as our analysis progresses, but it is important to begin this step as early as possible.
  • Are there any missing data points?
    Data isn’t perfect. Sometimes we might be missing data because of human or mechanical error. When this happens, we, as data scientists, must make decisions about how to deal with these discrepancies.
  • Do we need to perform any transformations on the columns?
    Depending on the level/type of data in each column, we might need to perform certain types of transformations. For example, generally speaking, for the sake of statistical modeling and machine learning, we would like each column to be numerical. Of course, we will use Python to make any and all transformations.

All the while, we are asking ourselves the overall question: what can we infer from the preliminary inferential statistics? We want to be able to understand our data a bit more than when we first found it.
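To make these questions concrete, here is a minimal sketch in plain Python. It is only an illustration, not the book's own code: the toy dataset and the helper names (`missing_counts`, `column_kinds`, `one_hot`, `word_counts`) are all made up for this example, and in practice you would more likely reach for a library such as pandas. Each row is assumed to represent one customer review.

```python
from collections import Counter

# Hypothetical toy dataset (invented for illustration).
# Each row represents one customer review; None marks a missing value.
rows = [
    {"rating": 5, "plan": "free", "review": "great app great support"},
    {"rating": None, "plan": "paid", "review": "slow app"},
    {"rating": 3, "plan": "free", "review": None},
]


def missing_counts(rows):
    """Are there any missing data points? Count None values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            counts[column] += value is None  # True adds 1, False adds 0
    return dict(counts)


def column_kinds(rows):
    """What does each column represent? A rough first guess:
    quantitative if every non-missing value is a number, else qualitative."""
    kinds = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        kinds[column] = "quantitative" if numeric else "qualitative"
    return kinds


def one_hot(rows, column):
    """Do we need a transformation? Replace one qualitative column
    with 0/1 indicator columns, one per observed category."""
    categories = sorted({row[column] for row in rows if row[column] is not None})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def word_counts(text):
    """Turn unorganized text into column-like counts, one per word."""
    return Counter(text.lower().split())


print(missing_counts(rows))
print(column_kinds(rows))
print(one_hot(rows, "plan")[0])
print(word_counts(rows[0]["review"]))
```

The type check in `column_kinds` is deliberately crude: a column of numeric ID codes would be labeled quantitative even though it is really qualitative, which is why the categorization may change as the analysis progresses.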

Recap

What I have presented here are the steps that data scientists follow, in order, in a typical data science project. If it is a brand-new project, we usually spend about 60–70% of our time just on gathering and cleaning the data.

Desi Ratna Ningsih


Data Science Enthusiast, Remote Worker, Course Trainer, Archery Coach, Psychology and Philosophy Student