EXPERT KNOWLEDGE AT A GLANCE

Category: Machine learning (Page 2 of 3)

Data Mining vs Big Data Analytics – You need the right tools and you need to know how to use them!

Data Mining vs Big Data Analytics – Both data disciplines, but what makes them different? In this article, we introduce you to both fields and explain the key differences.

Data science is an interdisciplinary scientific field that has moved more and more into focus over the last decades. Many companies see it as the key to becoming an Industry 4.0 company. The hope is that valuable information can be found in the company’s own data and used to increase profitability substantially. Terms such as big data, data mining, data analytics and machine learning are thrown into the ring, but many people do not realize that these terms describe different disciplines. If you want to build a house, you need the right tools and you have to know how to use them.

Map of Data Disciplines

First of all, you should think of the individual disciplines as being layered into each other like an onion. So there is overlap between all the fields and when you talk about a discipline, you are also talking about lower layers.

data mining vs analytics - This diagram shows the relationships between the individual data disciplines
Map of data disciplines

Since data analytics is located above data mining in the layer model, it is already clear that mining must be a subdiscipline of analytics. Therefore, we will first describe the more comprehensive discipline.

Data Mining vs Big Data Analytics – What is Analytics?

Big data analytics, as a subfield of data analysis, describes the use of data analysis tools without special data processing. In data analytics, you use queries and data aggregation methods, but also data mining techniques and tools. The goal of this discipline is to represent various dependencies between input variables.

The following figure shows the individual overlaps in the use of the tools of the different disciplines.

scheme about overlaps in the use of the tools of the different data disciplines
Overlaps of the different data disciplines

Data Mining vs Big Data Analytics – What is Data Mining?

Data mining is a subset of data analytics. At its core, it is about exploring a large data set and discovering patterns in it through correlations. This field is especially useful if you know little about the available data.

But what does a typical data mining process look like and what are typical data mining tasks?

Data Mining Process

You can divide a typical data mining process into several sequential steps. In the preprocessing stage, your data is first cleaned. This involves integrating sources and removing inconsistencies. Then you convert the data into the right format. After that, the actual analysis step, the data mining, takes place. Finally, your results have to be evaluated. Expert knowledge is required here to check the patterns found against your own objectives.

This diagram shows the flow of a typical data mining process
Data Mining Process
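
The steps described above can be sketched in a few lines of Python. This is only a minimal illustration using NumPy: the sample values, the meaning of the two columns and the 0.8 threshold are assumptions made up for this example, and a simple correlation stands in for the actual mining step.

```python
import numpy as np

# Hypothetical raw data from two sources: (temperature, energy use).
# NaN marks a missing, i.e. inconsistent, value.
source_a = np.array([[20.0, 5.1], [22.0, 5.9], [np.nan, 6.2]])
source_b = np.array([[25.0, 7.0], [27.0, 7.8], [30.0, 9.1]])

# 1. Preprocessing: integrate the sources and remove inconsistent rows.
data = np.vstack([source_a, source_b])
clean = data[~np.isnan(data).any(axis=1)]

# 2. Transformation: bring both columns to zero mean and unit variance.
scaled = (clean - clean.mean(axis=0)) / clean.std(axis=0)

# 3. Data mining: look for a linear relationship between the two columns.
correlation = np.corrcoef(scaled[:, 0], scaled[:, 1])[0, 1]

# 4. Evaluation: an expert would judge the pattern; a threshold stands in here.
is_relevant = abs(correlation) > 0.8
print(f"rows kept: {len(clean)}, correlation: {correlation:.2f}, relevant: {is_relevant}")
```

In a real project, step 4 is where domain knowledge decides whether a pattern is meaningful or just an artifact of the data.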

The term data mining covers a variety of techniques and algorithms to analyze a data set. In the following we will show you some typical methods.

Data Mining Tasks

Besides identifying unusual data sets with outlier detection, you can also group your objects based on similarities using cluster analysis. In this article we have already summarized some popular clustering algorithms that you should know as a data scientist. While association analysis only identifies the relationships and dependencies in the data, regression analysis provides you with the relationships between dependent and independent variables. Through classification, you assign elements that were not previously assigned to classes to existing classes. You can also summarize the data to reduce the data set to a more compact description without significant loss of information.

data mining tasks
Typical Data Mining Tasks
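
Three of the tasks mentioned above (outlier detection, regression analysis and summarization) can be sketched with NumPy alone. The data is invented for this example, and the constants 0.6745 and 3.5 follow a common convention for the modified z-score; they are an assumption here, not a rule from this article.

```python
import numpy as np

# Hypothetical measurements: x is the independent variable, y depends on it.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])  # the last value looks suspicious

# Outlier detection: flag values that lie far from the median
# (modified z-score based on the median absolute deviation).
median = np.median(y)
mad = np.median(np.abs(y - median))
robust_z = 0.6745 * (y - median) / mad
outliers = np.abs(robust_z) > 3.5

# Regression analysis: fit y = slope * x + intercept on the remaining points.
slope, intercept = np.polyfit(x[~outliers], y[~outliers], deg=1)

# Summarization: describe the cleaned data in a compact form.
summary = {"count": int((~outliers).sum()), "mean": float(y[~outliers].mean())}
print(outliers, round(slope, 2), summary)
```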

Data Mining vs Big Data Analytics – Conclusion

Although the two disciplines are related, they are not the same. Data mining is more about identifying key relationships, patterns or trends in the data, while data analytics is more about deriving a data-driven model. On this path, data mining is an important step in making the data more usable. In the end, it’s not a versus: both disciplines are part of an analytics pipeline.
In this article, we will go further into the differences between the various data sciences and clarify the difference between data analysis and data science.

Apache Avro – Effective Big Data Serialization Solution for Kafka

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.


You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.
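
As a minimal sketch of this round trip, the following example serializes a nested object to JSON text and deserializes it again with Python's standard `json` module. The field names are made up for illustration.

```python
import json

# A nested Python object; the field names are hypothetical.
sensor_reading = {"sensor": {"id": 7, "location": "hall A"}, "values": [21.5, 21.7]}

# Serialize: object -> text. The nesting of the text mirrors the data hierarchy.
text = json.dumps(sensor_reading)

# Deserialize: text -> object.
restored = json.loads(text)

print(text)
print(restored == sensor_reading)
```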

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was first released back in 2009 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data cost-efficiently, Avro can now also be used independently of Hadoop.

You can access Avro via dedicated APIs from many common programming languages (Java, C#, C, C++, Python and Ruby), so you can integrate it very flexibly.

In the following figure we have summarized some of the reasons that make the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it
Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for both serialization and deserialization. The schema describes the data hierarchy, i.e. the metadata, and is stored separately from the actual values. The data types and protocols are defined in JSON. The schema can be assigned unambiguously to the values by an ID and looked up whenever it is needed for further data processing. For RPC calls, the schema is sent along with the data exchange.
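
To illustrate why keeping the schema apart from the values saves space, here is a small sketch using only the Python standard library. The schema is written in Avro's JSON schema syntax, but the record and field names are hypothetical, and the fixed-width `struct` layout is a deliberate simplification, not Avro's real binary encoding.

```python
import json
import struct

# A schema in Avro's JSON schema syntax (record and field names are made up).
schema = json.loads("""
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "temperature", "type": "double"}
  ]
}
""")

record = {"sensor_id": 7, "temperature": 21.5}

# Self-describing text format: the field names travel with every record.
as_json = json.dumps(record).encode("utf-8")

# Schema-based binary format: only the values are written, in schema order.
# NOTE: simplified fixed-width layout, NOT Avro's actual wire format.
struct_codes = {"int": "i", "double": "d"}
layout = "<" + "".join(struct_codes[f["type"]] for f in schema["fields"])
as_binary = struct.pack(layout, *(record[f["name"]] for f in schema["fields"]))

# Reading needs the same schema to map the raw values back to field names.
values = struct.unpack(layout, as_binary)
decoded = {f["name"]: v for f, v in zip(schema["fields"], values)}

print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes as schema-based binary")
```

For real Avro data you would use one of the official or community Avro libraries instead of hand-written packing; the sketch only shows the principle.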

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

avro kafka

When you let Avro manage your schemas, it supports comprehensive and flexible schema evolution. This means that you can add and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schemas can be backward and forward compatible, which means that the schema versions of the reader and the writer may differ. Other schema-based serialization solutions exist, among them Google Protocol Buffers and Apache Thrift. However, its JSON-defined schemas make Avro a popular choice.
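
The idea that reader and writer schemas may differ can be sketched in plain Python. The `resolve` function below is a hypothetical, heavily simplified stand-in for Avro's schema resolution: it only handles added fields with defaults and dropped fields, and all field names are invented.

```python
# The reader expects a newer schema than the one the data was written with.
reader_fields = [
    {"name": "sensor_id"},
    {"name": "temperature"},
    {"name": "unit", "default": "celsius"},  # added later, so it needs a default
]

def resolve(record, fields):
    """Project a written record onto the reader's fields (simplified)."""
    resolved = {}
    for field in fields:
        if field["name"] in record:
            resolved[field["name"]] = record[field["name"]]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value and no default for {field['name']!r}")
    return resolved  # fields the reader does not know are dropped

# Backward compatibility: a new reader reads data written with the old schema.
old_record = {"sensor_id": 7, "temperature": 21.5}
print(resolve(old_record, reader_fields))

# Forward compatibility: an old reader ignores a field it does not know.
new_record = {"sensor_id": 7, "temperature": 21.5, "unit": "kelvin"}
print(resolve(new_record, reader_fields[:2]))
```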

What does HCA stand for?

What does HCA stand for? What is the difference between Agglomerative and Divisive? When do I use the algorithm and what are its strengths? In this article we will clarify all these questions.

If you don’t know what clustering means, check out this article. Here we also explain four other clustering methods that you as a data scientist must know.

What is an HCA?

Hierarchical Cluster Analysis, or HCA, is a technique for grouping objects compactly based on empirical similarity measures. The two most similar objects or clusters are merged one after another until all objects are finally in one cluster. This results in a tree-like structure.

What does HCA mean - This figure shows the basic principle of an applied HCA to raw data.
What does HCA stand for? Basic principle of an applied HCA to raw data.

So how does a hierarchical cluster procedure work?

Agglomerative vs Divisive Calculation

The basic clustering can be done in two opposite ways, Agglomerative and Divisive calculation.

Agglomerative clustering:

Agglomerative Nesting, abbreviated AGNES, is also known as the bottom-up method. Each object starts as its own cluster. The method then repeatedly merges the two most similar clusters until all the data is enclosed in a single cluster.

The divisive cluster calculation follows an opposite concept.

Divisive hierarchical clustering:

Divisive Analysis, also known as DIANA, is a top-down method. All objects start in a single cluster, which is then split step by step into smaller clusters.

In the following figure, the agglomerative process is compared with the divisive process.

What does HCA stand for?  The figure compares the agglomerative and divisive calculation.
What does HCA stand for? Agglomerative vs Divisive Calculation
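
The bottom-up procedure can be sketched as a short loop. This is a naive illustration, not an efficient implementation: the five one-dimensional points are invented, and the shortest distance between members is assumed as the cluster distance.

```python
import numpy as np

# Five hypothetical one-dimensional observations.
points = np.array([1.0, 1.5, 5.0, 5.2, 9.0])

# Bottom-up: every object starts as its own cluster.
clusters = [[i] for i in range(len(points))]
merges = []

def cluster_distance(a, b):
    """Smallest distance between any member of a and any member of b."""
    return min(abs(points[i] - points[j]) for i in a for j in b)

while len(clusters) > 1:
    # Find the pair of clusters with the smallest distance ...
    pairs = [(cluster_distance(clusters[p], clusters[q]), p, q)
             for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
    _, p, q = min(pairs)
    # ... and merge them into one cluster.
    merged = clusters[p] + clusters[q]
    merges.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]

print(merges)
# [[2, 3], [0, 1], [0, 1, 2, 3], [0, 1, 2, 3, 4]]
```

Reading the list of merges backwards gives the top-down view: one cluster with all objects that is split step by step.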

Thus, the goal is to represent the common properties of multidimensional raw data in a low dimension. A strength of this machine learning method is that it includes the relationships between clusters. With k-means, by contrast, you only learn that objects within a cluster are similar to each other and dissimilar to objects in other clusters. If you want to know more about this other popular clustering method, read this article.

How to calculate the cluster distances?

As mentioned earlier, not only are similarities between data points in a cluster weighted, but also similarities between groups. These similarities are represented by distances between the clusters, which can be determined in different ways. The distance between the centroids of two clusters can be calculated. Alternatively, single linkage uses the shortest distance between any two points of the two clusters, complete linkage uses the largest such distance, and average linkage uses the average distance over all pairs of points.

The figure below contrasts each cluster distance calculation method.

Cluster distance calculation methods
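
The four distance definitions can be compared directly on a small example. The two clusters below are invented for illustration, and Euclidean distance is assumed.

```python
import numpy as np

# Two small hypothetical clusters in two dimensions.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [4.0, 3.0]])

# All pairwise Euclidean distances between members of A and members of B.
diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
pairwise = np.linalg.norm(diffs, axis=2)

single = pairwise.min()     # closest pair of points
complete = pairwise.max()   # farthest pair of points
average = pairwise.mean()   # mean over all pairs of points
centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(single, complete, round(average, 3), round(centroid, 3))
```

The choice matters: single linkage tends to produce elongated, chain-like clusters, while complete linkage favors compact ones.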

In addition to the planar representation, the HCA can also be represented in a dendrogram.

HCA represented in a Dendrogram

Since an HCA describes a tree structure, it can be well represented in a dendrogram. Here the connections between the individual data elements and the connections between the clusters become clearly visible. This diagram can help you choose the number of clusters in the data, depending on the height at which you cut the tree.

In the following figure, for example, such a dendrogram is shown for both the agglomerative and the divisive calculation.

The figure shows a HCA represented as a dendrogram in dependence to Agglomerative and Divisive Calculation.
HCA presented as dendrogram in dependence to Agglomerative and Divisive Calculation.
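
Cutting the tree at a chosen height can be sketched with SciPy's `scipy.cluster.hierarchy` module. The data points, the average linkage and the cut height of 2.0 are assumptions picked for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Five hypothetical one-dimensional observations.
points = np.array([[1.0], [1.5], [5.0], [5.2], [9.0]])

# Build the hierarchy bottom-up using average linkage.
tree = linkage(points, method="average")

# Cut the tree at a distance of 2.0: merges above that height are undone.
labels = fcluster(tree, t=2.0, criterion="distance")
print(labels)

# With no_plot=True only the dendrogram layout is computed, nothing is drawn.
layout = dendrogram(tree, no_plot=True)
print(layout["ivl"])  # leaf order along the horizontal axis
```

A lower cut height yields more, smaller clusters; a higher one yields fewer, larger clusters.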

Matplotlib vs Seaborn – Who owns the Python visualization throne?

Matplotlib vs Seaborn – Matplotlib is often the first choice when it comes to creating mathematical plots with Python. But is it always the best choice? With Seaborn there is a potent competitor.

Matplotlib was developed by John D. Hunter back in 2003 and has become indispensable. Due to the increasing importance of the Python programming language in almost all scientific areas, the importance of fully compatible visualization methods is also growing.


Due to its open source concept, Matplotlib can be used absolutely free of charge and is a basic component of many popular Python distribution platforms, such as Anaconda.


The library offers a MATLAB-like interface and can be used in combination with NumPy, Pandas and SciPy.

SciPy is a collection of mathematical algorithms and convenience functions and is mainly used by scientists, analysts and engineers for scientific computing, visualization and related activities.
NumPy allows easy handling of vectors, matrices, or large multidimensional arrays in general.
NumPy’s operators and functions are optimized for multidimensional array operations and evaluate particularly efficiently.

Pandas is also an open source Python library that can be used to perform data analysis and manipulation efficiently. Its strength lies in the processing and evaluation of tabular data and time series.

These components are compatible with each other and together offer a free yet comprehensive alternative to the commercial analysis software MATLAB.

This figure shows some Python libraries, which together form an open source MATLAB alternative.
Matplotlib vs Seaborn – Together the Python libraries form a MATLAB replacement

Python Matplotlib – What are the features?

The library offers a wide range of visualization functions. Some of them are listed in the figure below.

Matplotlib vs Seaborn - This figure shows Matplotlib features sorted by their use cases.
Matplotlib vs Seaborn – Matplotlib Features

Matplotlib is designed to effectively visualize the results of mathematical calculations. Visualization is an efficient and important data analysis tool.
The library is able to generate all the usual diagrams and figures by default. It is even possible to create animations that can be used to better understand the flow of certain algorithms.

Event Handling

Matplotlib offers an important feature with event handling. Behind the name is a UI-neutral event model. This allows the library to connect to events without knowing which UI Matplotlib will eventually plug into.


This allows you to write very flexible and portable code.
The events also carry Matplotlib-specific information, such as the data coordinates at which they occurred.
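
A minimal sketch of such an event connection is shown below. The callback name `on_click` and the plotted values are made up for this example; the non-interactive `Agg` backend is selected only so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an interactive one works the same way
import matplotlib.pyplot as plt

clicked_points = []

def on_click(event):
    """Store the data coordinates of a mouse click inside the axes."""
    if event.inaxes is not None:
        clicked_points.append((event.xdata, event.ydata))

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Register the callback. The code does not depend on a specific UI toolkit.
connection_id = fig.canvas.mpl_connect("button_press_event", on_click)

# With an interactive backend, plt.show() would now start the event loop.
# fig.canvas.mpl_disconnect(connection_id) removes the callback again.
```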

PyLab vs Pyplot

PyLab is a collection of functions that is installed together with Matplotlib and makes the library work like MATLAB.
The module brings a set of NumPy functions and classes into the namespace. This makes them accessible without having to import them.
However, this often led to conflicts between functions with the same name.
For this reason, the use of PyLab is no longer recommended.
Pyplot is a module in Matplotlib and provides the state-machine interface to the underlying plotting library.


The conflicts are avoided because Pyplot and NumPy are imported separately, each under its own namespace.
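
In practice, this looks roughly as follows. The aliases `plt` and `np` are the widely used conventions, and the sine curve is just an example; the `Agg` backend is chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt  # plotting functions live under plt
import numpy as np               # numerical functions live under np

x = np.linspace(0.0, 2.0 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.legend()
```

Because every call is prefixed with its namespace, it is always clear which library a function comes from.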

Python Matplotlib – Third party packages

If the standard library features are not enough, you can extend Matplotlib with additional external packages. In the following figure some of the possible extensions are listed and grouped by application.

Python Matplotlib - This figure shows Matplotlib Third Party Packages sorted by their use cases.
Matplotlib vs Seaborn – Matplotlib Third Party Packages

These external packages must be installed individually and extend the functionality of the plotting library, or build on existing features.
They sometimes offer more complex graphics or higher performance data analysis methods. Most of these packages are open source and are constantly updated by very active communities.

Matplotlib also has weaknesses

Matplotlib is not perfect despite the wide feature set. For example, only poor default options for the size and colors of plots are offered. Matplotlib is often considered to be a low-level technology compared to today’s requirements. Thus, very specialized code is needed to generate appealing plots.

What is Seaborn?

Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for visualizing statistical data and internally uses Matplotlib’s functionality and data structures.
It thus offers a variety of additional features besides the standard Matplotlib functions.

This scheme shows the main features of Seaborn
Matplotlib vs Seaborn – Main features of Seaborn

Among other things, Seaborn provides built-in themes for styling Matplotlib graphs and a dataset-oriented API for examining the relationship between variables. It can visualize both univariate and bivariate data and plot statistical time series. Linear regression models are estimated and plotted automatically, and Seaborn is designed to work directly with NumPy and Pandas data structures.
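
A short sketch of the theme and regression features could look like this. The column names and the synthetic data are assumptions made for the example; it requires the seaborn and pandas packages in addition to Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, hypothetical data: a noisy linear relationship.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)
df = pd.DataFrame({"hours": hours, "score": 5 * hours + rng.normal(0, 4, size=50)})

# Apply one of the built-in themes to all following plots.
sns.set_theme(style="whitegrid")

# Scatter plot plus an automatically estimated linear regression line.
ax = sns.regplot(data=df, x="hours", y="score")
```

The same plot in plain Matplotlib would need a manual fit and separate calls for the points, the line and the confidence band.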

So what should you choose?

Especially when it comes to deep statistics, Seaborn clearly has the edge. Matplotlib, however, is often the leaner solution due to its simplicity. So both have their strengths and weaknesses. Which tool you ultimately choose depends on the situation. You can’t do much wrong. With one solution, however, you have more contextual options. But now that you know the differences between the two, this decision will be easier for you.

Data Science vs Data Analysis – How to decide which one is right for you?

Data Science vs Data Analysis – What distinguishes both professions from each other? How do your tasks differ? In this article, we will discuss all of these questions.

By now, almost every company, across industries and sizes, has recognized the potential in their own data. Every company wants to access this treasure and gain valuable information in order to develop profitable business strategies.
The economy is crying out for experts who can manage and analyze the enormous volumes of data. A trend that is not expected to end in the next 10 years, but rather to grow steadily.
So if you decide to enter the industry today and start studying, or if you want to teach yourself, you should first be clear about the differences between these often confusingly named professions. Often, HR professionals don’t even know these differences and look for the wrong profiles.

What are the similarities?

Let’s start with the similarities and the main reasons why both disciplines are often confused with each other.

Both professions deal with large amounts of data from which knowledge is to be extracted for a specific purpose.
New insights are to be generated and actions are to be identified.

Map of data disciplines

In order to properly understand the relationships between the data sciences, we need to look at the following figure. The individual disciplines and their relationships to each other are shown here.

Data Science vs Data Analysis  - This scheme shows the map of all data disciplines
Data Science vs Data Analysis – Map of data disciplines

The diagram corresponds to an onion-like layering. It is important to understand that all the disciplines listed here are different. Not only are there intersections, but when you talk about a higher level discipline, it includes other, lower level disciplines.

As you can see, both data science and data analysis are ranked very high. So to understand these two disciplines you need to know the other fields as well.

What is Data Science?

When you talk about data science, you are also talking about all other data disciplines.
A data scientist is an all-rounder and can apply all interdisciplinary tools and methods. He or she can handle structured and unstructured data and perform data preprocessing in addition to analysis.

What is Data Analysis?

Data analysis is more about using the right data analysis tools. Specialized data processing is not required at this level, but a data analyst must be able to fully master and understand the tools in order to gain new insights from the data.

What is Data Analytics?

Data analytics is primarily about the use of queries and data aggregation methods. The primary question here is: How can different dependencies between input variables be represented?
Furthermore, this discipline makes use of data mining techniques and tools.

Data mining

Data mining uses the predictive power of machine learning by applying various machine learning algorithms to big data to identify new trends in the data.

If you want to know even more about how data mining differs from data analytics, check out this article we wrote on the subject.

Data Science vs Data Analysis – So what are the differences?

So we have found that all data disciplines are similar in many ways and one discipline can imply other disciplines. In order to be able to define the differences precisely, the methods used must be compared with each other. Are programming skills required, or is the business intelligence part higher?

In the following figure, the assignment to both professions is shown once.

Data Science vs Data Analysis - This diagram shows the cornerstones of the two data disciplines. Mathematics, statistics and business intelligence
Venn Diagram
by Drew Conway in 2010

Both disciplines lie at the intersection of mathematics, statistics, and development. While data science is characterized by the fact that it consists of all three cornerstones, data analysis lacks the connection to computer science. And that is the biggest difference between the two fields.

Data Science vs Data Analysis – Comparison

Data Science is a branch of Big Data, with the objective of extracting and interpreting information from a huge amount of data. To do this, a data scientist must design and implement mathematical algorithms and predictive models based on statistics, machine learning, and other methods.
Data Analysis is the specific application of Data Science. It specifically involves searching raw data sources to find trends and metrics. However, this involves working with larger data sets than in the area of Business Intelligence.

In the following diagram, these differences and the overlaps between the two professions are compared once again.

Data Science vs Data Analysis – Comparison

So what you ultimately decide to do depends on your programming interests. Do you want to develop the analyses yourself, or do you prefer to use specific analysis tools to get more value out of large data sets?

k-Means: One of the simplest Clustering Algorithms

One of the most popular unsupervised clustering methods is the k-means algorithm. It is considered one of the easiest and most cost-effective clustering algorithms to implement. It is therefore well suited to get an overview of possible patterns in data.

What is the principle behind the k-means algorithm? In this article we explain what is behind the algorithm and how it really works, because the better you know your data science tools, the better you will be able to analyze your data.

What is k-Means?

The k-means algorithm described by MacQueen (1967) goes back to the methods described by Lloyd (1957) and Forgy (1965). The unsupervised machine learning algorithm is used for vector quantization or cluster analysis. If you don’t know what the differences are between supervised, unsupervised and reinforcement methods, read this article on the main machine learning categories.

The following figure shows the basic principle of the k-Means clustering algorithm.

The figure shows the basic principle of the k-Means clustering algorithm.
Basic principle of the k-Means clustering

The main goal of unsupervised clustering is to create collections of data elements that are similar to each other, but dissimilar to elements in other clusters.

What is the principle behind the k-means algorithm?

Here, a data set is partitioned into k groups with equal variance. The number of clusters searched for must be specified in advance. Each disjoint cluster is described by the average of all samples it contains, the so-called cluster centroid.
The following figure shows the cluster center of gravity principle.


The figure shows the k-Means cluster center of gravity principle.
cluster center of gravity principle

Each centroid is updated to represent the average of its constituent instances. This is done until the assignment of instances to clusters does not change.

Applied algorithm process

But how exactly does the algorithm work?
First, initial centroids are set. The distances between data instances and centroids are measured, and each data instance is assigned to its nearest centroid. The centroids are then recalculated. Measuring, assigning and recalculating are repeated until the assignments no longer change, which yields the final centroids.


The figure shows the process of k-Means Centroid creation
Centroid creation
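
The process described above can be written down compactly with NumPy. This is a minimal sketch, not an optimized implementation: the function name `k_means`, the random choice of initial centroids and the sample points are assumptions for illustration.

```python
import numpy as np

def k_means(points, k, n_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Set initial centroids: here, k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(n_iterations):
        # 2. Measure distances and assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # 4. Stop as soon as the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Recalculate each centroid as the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two clearly separated hypothetical groups of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Note that the result can depend on the initial centroids, which is why library implementations usually run several initializations and keep the best one.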

That is why Liquid State Machines (LSM) are great

– Recently developed computational model

– does not require information to be stored in some stable state of the system

→ the inherent dynamics of the system are used by a memoryless readout function to compute the output

→ can be used for complex tasks (pattern classification, function approximation, object tracking, …)

LSMs take the temporal aspect of the input into account

Concept

The figure shows a typical structure of a liquid State Machine.
Liquid State Machine

Reservoir/ Liquid

– large accumulation of recurrent interacting nodes
→ is stimulated by the input layer
– Liquid itself is not trained, but randomly constructed with the help of heuristics
– Loops cause a short-term memory effect
– preferably a Spiking Neural Network (SNNs)
→ are closer to biological neural networks than the multilayer Perceptron
→ can be any type of network that has sufficient internal dynamics

Running State

→ is extracted by the readout function

– depends on the input streams the liquid has been presented with

Readout Function

– converts the high-dimensional state into the output

– since the readout function is separated from the liquid, several readout functions can be used with the same liquid

→ so different tasks can be performed with the same input

different types of readout functions
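
The concept can be sketched with NumPy under one loud assumption: instead of a spiking network, the liquid below is a small rate-based recurrent network with `tanh` units, which makes it closer to an echo state network, a related reservoir model. All sizes, scaling factors and the two example tasks are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_reservoir, n_steps = 1, 50, 200

# Randomly constructed, untrained liquid (rate-based stand-in, not spiking).
w_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_inputs))
w_res = rng.normal(0.0, 1.0, size=(n_reservoir, n_reservoir))
# Heuristic: rescale the recurrent weights so the dynamics fade over time.
w_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(w_res)))

# Stimulate the liquid with an input stream and record its running state.
u = np.sin(np.linspace(0.0, 8.0 * np.pi, n_steps)).reshape(-1, 1)
states = np.zeros((n_steps, n_reservoir))
state = np.zeros(n_reservoir)
for t in range(n_steps):
    state = np.tanh(w_in @ u[t] + w_res @ state)  # loops give short-term memory
    states[t] = state

# Two memoryless linear readouts that share the same liquid states.
target_delayed = np.concatenate([np.zeros(5), u[:-5, 0]])  # input 5 steps ago
target_squared = u[:, 0] ** 2                              # nonlinear function
readout_1, *_ = np.linalg.lstsq(states, target_delayed, rcond=None)
readout_2, *_ = np.linalg.lstsq(states, target_squared, rcond=None)

prediction_1 = states @ readout_1
prediction_2 = states @ readout_2
print(np.mean((prediction_1 - target_delayed) ** 2))
print(np.mean((prediction_2 - target_squared) ** 2))
```

Only the readout weights are fitted; the liquid itself stays untouched, which is why several tasks can reuse it.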

AutoEncoder – What Is It? And What Is It Used For?

AutoEncoder – In data science, we often encounter multidimensional data relationships. Understanding and representing these is often not straightforward. But how do you effectively reduce the dimension without reducing the information content?

Unsupervised dimension reduction

One possibility is offered by unsupervised machine learning algorithms, which aim to code high-dimensional data as effectively as possible in a low-dimensional way.
If you don’t know the difference between unsupervised, supervised and reinforcement learning, check out this article we wrote on the topic.

What is an AutoEncoder?

The AutoEncoder is an artificial neural network that is used for unsupervised reduction of the data dimensions.
The network usually consists of three or more layers. The gradient calculation is usually done with a backpropagation algorithm. The network thus corresponds to a feedforward network that is fully interconnected layer by layer.

Types

There are many AutoEncoder types. The following figure lists the most common variations.

The figure shows all common AutoEncoder types
AutoEncoder types

However, the basic structure is the same for all variations.

Basic Structure

Each AutoEncoder is characterized by an encoding and a decoding side, which are connected by a bottleneck, a much smaller hidden layer.

The following figure shows the basic network structure.

The figure shows the basic AutoEncoder structure.
AutoEncoder model architecture


During encoding, the dimension of the input information is reduced. The average value of the information is passed on, and the information is compressed in this way.
In the decoding part, the compressed information is to be used to reconstruct the original data. For this purpose, the weights are then adjusted via backpropagation.
In the output layer, each neuron then has the same meaning as the corresponding neuron in the input layer.
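
The encode, decode and weight-update cycle can be sketched with NumPy. To keep the example short, it assumes linear activations, no biases and plain gradient descent; the synthetic data, the layer sizes and the learning rate are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples with 8 features driven by 2 hidden factors.
latent = rng.normal(size=(200, 2))
x = latent @ rng.normal(size=(2, 8))

n_in, n_hidden = x.shape[1], 2  # the bottleneck has only 2 neurons
w_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
w_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
learning_rate = 0.01
losses = []

for _ in range(1000):
    code = x @ w_enc       # encoding: 8 dimensions -> 2 dimensions
    x_hat = code @ w_dec   # decoding: 2 dimensions -> 8 dimensions
    error = x_hat - x
    losses.append(float(np.mean(error ** 2)))
    # Backpropagation of the mean squared reconstruction error.
    grad_out = 2.0 * error / error.size
    grad_dec = code.T @ grad_out
    grad_enc = x.T @ (grad_out @ w_dec.T)
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Real autoencoders usually add nonlinear activations and more layers, but the structure of encoder, bottleneck and decoder stays the same.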

Autoencoder vs Restricted Boltzmann Machine (RBM)

Restricted Boltzmann Machines are also based on a similar idea. These are undirected graphical models useful for dimensionality reduction, classification, regression, collaborative filtering, and feature learning. However, they take a stochastic approach: stochastic units with a particular distribution are used instead of deterministic units.


RBMs are designed to find the connections between visible and hidden random variables. How does the training work?
The hidden biases generate the activations during forward traversal and the visible layer biases generate learning of the reconstruction during backward traversal.

Pretraining

Since the random initialization of weights in neural networks at the beginning of training is not always optimal, it makes sense to pre-train. The task of training is to minimize a reconstruction error in order to find the most efficient compact representation for the input data.

The figure shows the pretraining procedure of an autoencoder according to Hinton.
Training Stacked Autoencoder


The method was developed by Geoffrey Hinton and is primarily used for training complex autoencoders. Here, each pair of neighboring layers is treated as a Restricted Boltzmann Machine. This gives a good approximation, and fine-tuning is then done with backpropagation.

scikit-learn – Machine learning, Data Mining and Data Analysis in Python for free

Nowadays, you can hardly get around the programming language Python in any scientific discipline.
With it, powerful algorithms can be applied to large amounts of data in a performant way.
Open source libraries and frameworks enable the simple implementation of mathematical methods and data transports.

What is scikit-learn?

One of the most popular Python libraries is scikit-learn. It can be used to implement both supervised and unsupervised machine learning algorithms. scikit-learn primarily offers ready-made solutions for data mining, preprocessing and data analysis.
The library is based on the SciPy Toolkit (SciKit) and makes extensive use of NumPy for high performance linear algebra and array operations. If you don’t know what NumPy is, check out our article on the popular Python library.
The project was started in 2007 and has since been constantly extended and optimized by a very active community.
The library is written primarily in Python and relies on Cython only for some performance-critical operations.
This makes the library easy to integrate into Python applications.

scikit-learn Features

You can easily implement many machine learning algorithms with scikit-learn. Both supervised and unsupervised machine learning are supported. If you don’t know what the difference is between the two machine learning categories, check out this article from us on the topic.
The figure below lists all the algorithms provided by the library.

The figure lists all the supervised and unsupervised machine learning algorithms provided by scikit-learn.
Machine learning algorithms provided by scikit-learn

scikit-learn thus offers rich capabilities to recognize patterns and data relationships in a dataset. High-dimensional data can be reduced to fewer dimensions to visualize the relationships without sacrificing much information.
Features can be extracted and data clustering algorithms can be applied easily.
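
As a sketch of how these capabilities fit together, the following example chains preprocessing, dimension reduction and clustering in one pipeline. The Iris dataset that ships with scikit-learn, the two principal components and the three clusters are choices made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 150 flowers with 4 measured features each; the class labels are ignored here.
x, _ = load_iris(return_X_y=True)

# Preprocessing -> dimension reduction -> clustering in a single pipeline.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(x)

# Share of the original variance that the two components retain.
explained = pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(f"clusters found: {len(set(labels))}, variance kept: {explained:.0%}")
```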

Dependencies

scikit-learn is powerful and versatile. However, the library does not stand completely alone. Besides the obvious dependency on Python, the library requires other libraries for special operations.

NumPy allows easy handling of vectors, matrices or generally large multidimensional arrays. SciPy complements these functions with useful features like minimization, regression or the Fourier transform. With joblib, Python functions can be run as lightweight pipeline jobs, and threadpoolctl limits the number of threads used by native libraries to save resources.
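
The two helper libraries can also be used directly. The sketch below is illustrative only: the function `column_mean` and the sample array are made up, and thread-based workers are chosen to keep the example lightweight.

```python
import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits

def column_mean(column):
    """A small job that can run independently of the others."""
    return float(np.mean(column))

data = np.arange(12.0).reshape(4, 3)

# joblib: run a plain Python function as parallel jobs, one per column.
means = Parallel(n_jobs=2, prefer="threads")(
    delayed(column_mean)(data[:, j]) for j in range(data.shape[1])
)
print(means)

# threadpoolctl: cap the threads used by native libraries such as BLAS.
with threadpool_limits(limits=1):
    product = data.T @ data
print(product.shape)
```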
