Data science

Why data science isn't magic: an introduction to data science

Data science, artificial intelligence and machine learning are terms that you might have seen before. Companies big and small are jumping on the data science bandwagon. Although the incentive is there, the fundamental understanding often lacks. In this introduction to data science you will learn all about the basics and fundamentals. Get started!

The fundamentals: data

Data without context is gibberish, just as the word ‘red’ has no inherent meaning except for being a color. But in context, e.g. “the traffic light’s color is red”, you can use that information to anticipate your next actions. You can use the DIKW (Data Information Knowledge and Wisdom) pyramid to put data in context.

DIKW pyramid data science

What is data science?

Since this blog is an introduction to data science, let’s start with the basics. With data science you use data and give meaning to the data. In other words, transform data to information. This transformation can be done by visualizing data, learning the processes in which the data is used, and/or understanding the metadata. The information will be transformed into knowledge when you provide the context. Once all the cards are on the table, we can plan for action: wisdom!

Wisdom is where we apply our knowledge to the best of our abilities. Data scientist usually report their findings with a recommendation to the management. The DIKW pyramid is the foundation of data transformation. Every data scientist or machine learning engineer goes through this process explicitly or implicitly.

Data science & data mining

Now that we know what cycle the data goes through, we can start looking at the good stuff: data mining. Data mining is a field in data science that is concerned with retrieving data and creating predictions. This definition is a nuance since it covers a wide area of things, but since this is an introduction to data science, we want to keep it simple in this blog. Also, these definitions are not set in stone and vary from time to time and from person to person. We use data mining techniques to predict what new data will do, based on old data. In other words, we train models with old data to predict what new data will do. We will be covering two techniques of data mining: supervised and unsupervised.

Supervised Learning

What is Supervised learning?

Supervised learning means we have both the characteristics and the answer to the question. Much like teaching a kid the difference between a dog and a cat, we teach the model in a similar fashion. The next figure shows the characteristics of a dog and a cat. The characteristics describing the animal are called ‘features’. These features describe an animal, namely a dog or a cat. The element that is described by the features, is called the ‘class’ or ‘target’ (variable). In almost any case, you structure the layout of the features with its class in a table. We refer to the total number of features as the dimensionality of the dataset. The more features you have, the more information you have for predicting the class. However, too many features can cloud the model, damning your dataset with the curse of dimensionality.

Model supervised learning data science

The model tells us its prediction based on the characteristics, and we can assess (supervise) whether the model has predicted the class correctly. If not, we will adjust the model and tell him which characteristics are important, and which are not. This way of working is the premise of supervised learning in a nutshell. If you are looking for a Data Science and Machine Learning solution, then Gartner’s Magic Quadrant is a valuable resource.

Unsupervised learning

What is Unsupervised learning?

Unsupervised learning means we do not have the answer to the question (i.e. we do not have a class variable), but we deduce the answer by comparing the characteristics. When we are at a social event, small groups will form. We can use the proximity (characteristic) of the people to see which group they belong to. We can then label the people with, say, colored t-shirts. The t-shirts represent the class variable. Once they all wear the t-shirts, and new characteristics arise, we are still able to tell who belongs to which group. That is the gist of unsupervised learning. There are more methods such as semi-supervised learning or reinforcement learning. Now that you’ve learned all about the basics and fundamentals in this introduction to data science, we want you to go down that rabbit hole on your own.

Related posts

How to Establish a Single Source of Truth?

Many companies collect data but struggle with fragmentation, inconsistent reports, and inefficient processes. The result? Data is scattered everywhere, leading to decisions based on conflicting figures.

Read more

Webinar: How to Choose a Future-Proof Data Platform for Your Organization?

How do you ensure that your organization always has access to the right data—without wasting time on manual processes and scattered systems? In this webinar, you’ll learn how to build a scalable and reliable data platform that fits the current maturity level of your organization.

Read more

Data Maturity Scan

With the Data Maturity Scan, you’ll discover where your organization stands on the Data Maturity Ladder and receive a concrete action plan to elevate your data strategy.

Read more
All posts