When we talk about data science projects, nobody seems to be able to come up with a solid explanation of how the entire process goes, from gathering the data all the way to analyzing it and presenting the results.
In this post, I break down the five steps of a data science project. We have already spent extensive time on the preliminaries of data science, including outlining the types of data and how to approach datasets depending on their type.
Introduction to data science
While one can argue that there is no difference between data science and data analytics, many will argue that there are hundreds! I believe that regardless of how many differences there are between the two terms, the biggest is that data science follows a structured, step-by-step process that, when followed, preserves the integrity of the results.
Data Science is a detailed study of the flow of information from the colossal amounts of data present in an organization’s repository. It involves obtaining meaningful insights from raw and unstructured data which is processed through analytical, programming, and business skills.
Like any other scientific endeavor, this process must be adhered to, or else the analysis and the results are open to question. On a simpler level, following a strict process can make it much easier for amateur data scientists to obtain results faster than if they were exploring data with no clear vision.
Overview of the five steps
The five essential steps to perform data science are as follows:
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results
First, let’s look at the five steps with reference to the big picture.
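Before walking through each step, the whole process can be sketched end to end. The following is a minimal, hypothetical skeleton: the function names, the toy study-hours dataset, and the simple least-squares fit are all illustrative stand-ins, not the book's method.

```python
# A hypothetical sketch of the five steps as plain functions.
# The toy dataset and simple linear fit are illustrative only.

def ask_question():
    # Step 1: frame a question the data might answer.
    return "Do study hours relate to exam scores?"

def obtain_data():
    # Step 2: a hard-coded toy dataset stands in for real collection.
    return [(1, 52), (2, 55), (3, 61), (4, 70), (5, 74)]  # (hours, score)

def explore_data(data):
    # Step 3: basic summary statistics.
    scores = [s for _, s in data]
    return {"n": len(data), "mean_score": sum(scores) / len(scores)}

def model_data(data):
    # Step 4: fit a least-squares line score = a + b * hours.
    n = len(data)
    mx = sum(h for h, _ in data) / n
    my = sum(s for _, s in data) / n
    b = sum((h - mx) * (s - my) for h, s in data) / sum((h - mx) ** 2 for h, _ in data)
    return my - b * mx, b  # (intercept, slope)

def communicate(results):
    # Step 5: turn the fitted model into a plain-language takeaway.
    _, b = results
    return f"Each extra study hour adds about {b:.1f} points."

data = obtain_data()
print(ask_question())
print(explore_data(data))
print(communicate(model_data(data)))
```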
Ask an interesting question
This is probably my favorite step. As an entrepreneur, I ask myself (and others) interesting questions every day. I would treat this step as you would treat a brainstorming session. Start writing down questions regardless of whether or not you think the data to answer these questions even exists.
The reason for this is twofold. First off, you don’t want to start biasing yourself even before searching for data. Secondly, obtaining data might involve searching in both public and private locations and, therefore, might not be very straightforward. You might ask a question and immediately tell yourself “Oh, but I bet there’s no data out there that can help me,” and cross it off your list. Don’t do that! Leave it on your list.
Obtain the data
Once you have selected the question you want to focus on, it is time to scour the world for the data that might be able to answer that question. As mentioned before, the data can come from a variety of sources; so, this step can be very creative!
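As one concrete sketch of this step, here is how a small CSV could be pulled into pandas. The inline CSV text is a stand-in for whatever file, database export, or API response your search actually turns up.

```python
import io
import pandas as pd

# Stand-in for a downloaded file or API response (contents are illustrative).
csv_text = """city,population
Austin,964000
Boston,675000
"""

# read_csv accepts any file-like object, so a web response body or a
# local path would work the same way.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```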
Explore the data
Once we have the data, we use the lessons learned in Chapter 2, Types of Data, of this book and begin to break down the types of data that we are dealing with. This is a pivotal step in the process.
Once this step is completed, the analyst generally has spent several hours learning about the domain, using code or other tools to manipulate and explore the data, and has a very good sense of what the data might be trying to tell them.
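A few pandas calls cover much of this first-look exploration. The tiny DataFrame below is illustrative; the calls themselves are the typical starting points.

```python
import pandas as pd

# Toy dataset (illustrative) to demonstrate typical first-look calls.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["NY", "LA", "NY", "SF"],
})

print(df.dtypes)                  # the type of each column
print(df.describe())              # summary stats for quantitative columns
print(df["city"].value_counts())  # distribution of a qualitative column
```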
Model the data
This step involves the use of statistical and machine learning models. In this step, we are not only fitting and choosing models, but we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.
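As one common way to combine fitting with a validation metric, here is a scikit-learn sketch. The synthetic dataset and the choice of logistic regression are illustrative assumptions, not the only option.

```python
# Fit a model and quantify it with cross-validated accuracy.
# Dataset and model choice are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()

# 5-fold cross-validation: the mean score is the validation metric
# that lets us compare this model against alternatives.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```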
Communicate and visualize the results
This is arguably the most important step. While it might seem obvious and simple, the ability to summarize your results in a digestible format is much more difficult than it seems. We will look at different examples of cases when results were communicated poorly and when they were displayed very well.
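A small habit that separates well-displayed results from poorly communicated ones is labeling everything. The matplotlib sketch below (toy data, illustrative labels) shows the minimum a chart needs to stand on its own: axis labels and a title that states the finding.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

# Toy data (illustrative).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 74]

fig, ax = plt.subplots()
ax.scatter(hours, scores)
ax.set_xlabel("Study hours")
ax.set_ylabel("Exam score")
ax.set_title("Scores rise with study time")  # state the finding, not just "Plot"
fig.savefig("scores.png")
```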
Basic questions for data exploration
When looking at a new dataset, whether it is familiar to you or not, it is important to use the following questions as guidelines for your preliminary analysis:
- Is the data organized or not?
We are checking for whether or not the data is presented in a row/column
structure. For the most part, data will be presented in an organized fashion. In this book, over 90% of our examples will begin with organized data. Nevertheless, this is the most basic question that we can answer before diving any deeper into our analysis. A general rule of thumb is that if we have unorganized data, we want to transform it into a row/column structure. For example, earlier in this book, we looked at ways to transform the text into a row/column structure by counting the number of words/phrases.
- What does each row represent?
Once we have an answer to how the data is organized and are now looking
at a nice row/column-based dataset, we should identify what each row
actually represents. This step is usually very quick and can help put things
in perspective much more quickly.
- What does each column represent?
We should identify each column by the level of data and whether or not it is quantitative/qualitative, and so on. This categorization might change as our analysis progresses, but it is important to begin this step as early as possible.
- Are there any missing data points?
Data isn’t perfect. Sometimes we might be missing data because of human
or mechanical error. When this happens, we, as data scientists, must make
decisions about how to deal with these discrepancies.
- Do we need to perform any transformations on the columns?
Depending on what level/type of data each column is at, we might need to
perform certain types of transformations. For example, generally speaking, for the sake of statistical modeling and machine learning, we would like each column to be numerical. Of course, we will use Python to make any and all transformations. All the while, we are asking ourselves the overall question, what can we infer from the preliminary inferential statistics? We want to be able to understand our data a bit more than when we first found it.
What I have presented here are the steps that data scientists follow chronologically in a typical data science project. If it is a brand new project, we usually spend about 60–70% of our time just on gathering and cleaning the data.
The true north is always the business question we defined before we even started the data science project. Always remember that a solid business question and clean, well-distributed data beat fancy models every time.
I hope you learned something today. Feel free to leave a message if you have any feedback, and share it with anyone that might find this useful.
Reference:
- Sinan Ozdemir, Principles of Data Science (Packt)