“The key chapter to master in data science”
Data science is on everyone’s mind these days. Now that we have had a basic introduction to the world of data science and understand why the field is so important, let's take a closer look at the data itself. In this chapter we will cover the following topics:
- Structured versus unstructured data (sometimes called organized vs unorganized)
- Quantitative and qualitative data
- The four levels of data
Structured versus unstructured data
The first question to ask about an entire dataset is whether it is structured or unstructured. The answer can mean the difference between a quick analysis and one that needs far more time to perform properly.
- Structured (organized) data: This is data that can be thought of as observations and characteristics. It is usually organized in a table (rows and columns).
- Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard organizational hierarchy.
Structured data is generally thought of as being much easier to work with and analyze. Most statistical and machine learning models were built with structured data in mind and can't work on the loose interpretation of unstructured data. The natural row and column structure is easy to digest for both human and machine eyes. So, why even talk about unstructured data? Because it is so common! Most estimates place unstructured data at 80–90% of the world's data.
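To make the idea concrete, here is a minimal sketch of how free text might be turned into rows and columns. The messages, the to_row helper, and the chosen characteristics are all hypothetical; real preprocessing depends on the data and the question being asked.

```python
# Hypothetical unstructured data: free text with no rows, columns, or fixed fields.
raw_messages = [
    "Loved the new update, works great!",
    "App crashes when I open settings",
]


def to_row(text):
    """Turn one free-text message into a structured observation.

    The three characteristics below are illustrative choices, not a standard.
    """
    words = text.split()
    return {
        "char_count": len(text),        # number of characters in the message
        "word_count": len(words),       # number of whitespace-separated words
        "has_exclamation": "!" in text, # whether the message contains "!"
    }


# Each dict is one row (observation); each key is one column (characteristic).
table = [to_row(message) for message in raw_messages]

for row in table:
    print(row)
```

Once the text is in this shape, each message is an observation and each key is a characteristic, which is the form most models expect.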
Quantitative versus qualitative data
When you ask a data scientist, “what type of data is this?”, they will usually assume that you are asking whether it is mostly quantitative or qualitative. It is likely the most common way of describing the specific characteristics of a dataset.
For the most part, when talking about quantitative data, you are usually talking about a structured dataset with a strict row/column structure. Unstructured data has to be given that structure before its characteristics can be described this way, which is all the more reason why the preprocessing step is so important.
These two data types can be defined as follows:
- Quantitative data: This data can be described using numbers, and basic mathematical procedures, including addition, are possible on the set.
- Qualitative data: This data can’t be described using numbers and basic mathematics. This data is generally thought of as being described using natural categories and language.
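The two definitions above can be sketched as a rough check on each column of a small table. The shops data and the column_kind helper are hypothetical, and the check only looks at Python types, so it is a heuristic: a numeric code such as a postal code would pass it even though adding such codes together means nothing.

```python
from numbers import Number

# Hypothetical structured dataset: each dict is one observation (row).
shops = [
    {"name": "Bean There", "country": "US", "revenue": 1200.0, "customers": 85},
    {"name": "Daily Grind", "country": "CA", "revenue": 950.5, "customers": 60},
]


def column_kind(rows, column):
    """Label a column by whether every value in it is a number (type-based heuristic)."""
    values = [row[column] for row in rows]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "quantitative"
    return "qualitative"


kinds = {column: column_kind(shops, column) for column in shops[0]}
print(kinds)

# Addition makes sense on a quantitative column...
total_revenue = sum(row["revenue"] for row in shops)
print(total_revenue)
# ...but there is no meaningful way to add up the "country" column.
```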
The four levels of data
It is generally understood that a specific characteristic (feature/column) of structured data can be broken down into one of four levels of data. The levels are:
- The nominal level
- The ordinal level
- The interval level
- The ratio level
As we move down the list, we gain more structure and, therefore, more returns from our analysis. Each level comes with its own accepted practice for measuring the center of the data. We usually think of the mean (average) as an acceptable measure of center; however, this is only true for specific types of data.
The nominal level
The first level of data, the nominal level (the word also sounds like "name"), consists of data that is described purely by name or category. Basic examples include gender, nationality, species, or yeast strain in a beer. They are not described by numbers and are therefore qualitative. The following are some examples:
- A type of animal is on the nominal level of data. We may also say that if you are a chimpanzee, then you belong to the mammalian class as well.
- A part of speech is also considered to be on the nominal level of data. The word "she" is a pronoun, and it is also a noun.
Of course, because this data is qualitative, we cannot perform any quantitative mathematical operations on it, such as addition or division. These would not make any sense.
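What we can do with nominal data is count how often each category occurs. A common convention is to use the mode (the most frequent category) as the center at this level. The sketch below assumes a made-up list of animal types.

```python
from collections import Counter

# Hypothetical nominal data: categories with no order and no arithmetic.
animals = ["dog", "cat", "dog", "chimpanzee", "cat", "dog"]

# Counting category membership is the main operation available at this level.
counts = Counter(animals)

# most_common(1) returns a list with the single most frequent (value, count) pair.
mode_value, mode_count = counts.most_common(1)[0]

print(counts)
print(mode_value, mode_count)
```

Note that nothing here adds or averages the values; the code only compares them for equality and counts them.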
The ordinal level
The nominal level did not provide us with much flexibility in terms of mathematical operations due to one seemingly unimportant fact: we could not order the observations in any natural way. Data at the ordinal level provides us with a rank order, or the means to place one observation before another. However, it does not provide us with relative differences between observations, meaning that while we may order the observations from first to last, we cannot add or subtract them to get any real meaning.
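Because ordinal data can be ranked, sorting becomes possible, and the median (the middle observation) is a commonly used center at this level. The sketch below assumes a hypothetical four-point survey scale; the integer ranks exist only to express the order and are not meant to be added or averaged.

```python
import statistics

# Hypothetical ordinal scale, listed from lowest to highest.
SCALE = ["poor", "fair", "good", "excellent"]
rank = {label: position for position, label in enumerate(SCALE)}

# Hypothetical survey responses.
responses = ["good", "poor", "excellent", "good", "fair"]

# Ordering is meaningful at this level, so sorting by rank is allowed.
ordered = sorted(responses, key=rank.get)

# median_low always returns an actual data point, so the result maps back
# to a real category even when the number of responses is even.
median_rank = statistics.median_low([rank[r] for r in responses])
median_label = SCALE[median_rank]

print(ordered)
print(median_label)
```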
The interval level
Now we are getting somewhere interesting. At the interval level, we are beginning to look at data that can be expressed through very quantifiable means, and where much more complicated mathematical formulas are allowed. The basic difference between the ordinal level and the interval level is, well, just that: difference. Data at the interval level allows meaningful subtraction between data points.
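Temperature in degrees Celsius is a classic example of interval data: the gap between two readings means something, so subtraction and the arithmetic mean both make sense. The readings below are invented for illustration.

```python
import statistics

# Hypothetical interval data: temperatures in degrees Celsius.
temps_c = [20.0, 25.0, 30.0, 22.0, 28.0]

# Subtraction is meaningful: the third reading is this many degrees warmer than the first.
difference = temps_c[2] - temps_c[0]

# With meaningful differences, the mean and standard deviation can be used.
mean_temp = statistics.mean(temps_c)
spread = statistics.stdev(temps_c)  # sample standard deviation

print(difference)
print(mean_temp)
print(spread)
```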
The ratio level
Finally, we will take a look at the ratio level. After moving through three different levels with differing sets of allowed mathematical operations, the ratio level proves to be the strongest of the four. Not only can we define order and difference, but the ratio level also allows us to multiply and divide. This might not seem like much to make a fuss over, but it changes almost everything about the way we view data at this level.
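One way to see why division needs the ratio level is to check whether "twice as much" survives a change of units. The sketch below uses made-up weights (ratio data, with a true zero) and Celsius temperatures (interval data, with an arbitrary zero); the values and helper names are assumptions chosen for illustration.

```python
# Hypothetical ratio data: weights in kilograms (zero means "no weight").
heavy_kg, light_kg = 10.0, 5.0

# Hypothetical interval data: temperatures in Celsius (zero is an arbitrary point).
warm_c, cool_c = 20.0, 10.0


def kg_to_g(kg):
    """Convert kilograms to grams."""
    return kg * 1000


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# The weight ratio is the same in either unit, so "twice as heavy" is a real statement.
ratio_kg = heavy_kg / light_kg
ratio_g = kg_to_g(heavy_kg) / kg_to_g(light_kg)

# The temperature ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_c = warm_c / cool_c
ratio_f = c_to_f(warm_c) / c_to_f(cool_c)

print(ratio_kg, ratio_g)
print(ratio_c, ratio_f)
```

The two weight ratios agree, while the two temperature ratios do not, which is the practical meaning of having a true zero.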
Some Parting Words
Please don’t feel overwhelmed; this is only the first, basic chapter. There is a lot to learn, but once you have reviewed this chapter and picked up these concepts, you will start to notice the data hidden in your daily routine. And that’s a big leap toward becoming an amazing data scientist.
Reference:
- Sinan Ozdemir, Principles of Data Science (Packt)