EXPERT KNOWLEDGE AT A GLANCE

Tag: big data

Is Hadoop dead? Should I invest time to learn the Hadoop ecosystem?

Is Hadoop dead – In the IT sector in particular, technologies and software architectures do not have a long shelf life. As new technical insights are gained, the requirements and use cases for the systems also change. As young as the term “big data” is, it is also undergoing constant change. The increased acceptance of open source projects in the business community has led to increased diversification and thus to many mutually beneficial competitive situations.
Apache Hadoop has been considered the one all-purpose solution for over a decade. A Big data ecosystem in which Hadoop plays together with many other extensions. In recent years, however, more and more people are claiming that the demands on data processing have changed and see Hadoop as an outdated concept.

A few years ago, the primary goal was to efficiently handle ever-increasing data volumes, but today iterative real-time analyses on dynamic data sets are required. Data management systems must not be self-contained, but must remain manipulable and monitorable at all times.
So is Hadoop dead, or still indispensable?

What is Hadoop?

Hadoop is a Linux-based open source Big Data framework for scalable, distributed software. It is originally based on Google’s MapReduce algorithm and enables computationally intensive processes of large data sets by parallelizing them on computer clusters, i.e. a large number of networked computers, using multiple components working together.

Is Hadoop dead? This diagram shows the Hadoop ecosystem
Is Hadoop dead? Hadoop ecosystem

The Hadoop ecosystem is composed of the Hadoop Common, an interface for all other components. It connects Hadoop to the file system of the computers and contains the libraries.In the Hadoop Distributed File System
( HDFS ) very large amounts of data are stored. This is organized as a server cluster with master and slave nodes. The resources are controlled via the Yet Another Resource Negotiator (YARN) component. This resource manager distributes the individual tasks to the available resources, such as CPU and memory.

What is the MapReduce algorithm?

Google’s MapReduce programming model, even though it is currently being replaced by engines based on Directred-Acyclic-Graph (DAG), is still a core component of the Hadoop framework. So if we want to understand how Hadoop works, we first need to understand what MapReduce is in the first place.

Is Hadoop dead? This diagram shows the principle behind Google's MapReduce algorithm
Is Hadoop dead? Googles Map Reduce Algorithm principle

Configurable classes for Map, Reduce and Combination phases are provided via the Hadoop MapReduce framework. Map means that a set of data is transformed into another set of data, where the individual elements of the data are combined into tuples (key/value pairs). In the Reduce phase, the formed tuples are then combined into smaller sets of tuples.

How a Hadoop cluster works

As mentioned earlier, Hadoop distributes storage and processing of large amounts of data in a balanced manner across compute clusters, or interconnected hardware.
These computers are connected to a dedicated server that acts as the master
components. The master node organizes the storage of files and the metadata in the individual slave nodes. Within a cluster, data is stored on multiple computers called nodes. The files are partitioned into data blocks and distributed redundantly among the nodes.

Is Hadoop dead? This diagram shows the components of a Hadoop cluster
Is Hadoop dead? Components of a Hadoop Cluster

The NameNode and Resource Manager run on the master node. These collect data in the Hadoop Distributed File System (HDFS) and store data with parallel computations by applying MapReduce.

The client nodes are responsible for loading the data into the cluster’s
Architecture. The slave node is one responsible for collecting the data
Client nodes.

How does communication within a cluster work?

The internal communication, i.e. the process of job execution, is organized via so-called JobTrackers and TaskTrackers.
The client submits a MapReduce job to the JobTracker on the master to process a particular file.The JobTracker then determines the DataNodes that store the blocks for that file by querying the NameNode. The NameNode manages the HDFS file system metadata, so it keeps track of all the files that are divided into blocks. The DataNodes store and retrieve these blocks. Then tasks are assigned to different TaskTrackers based on the information received from the NameNode . In the process, the status of each task NameNode and DataNode is monitored.
A secondary NameNode communicates with the NameNode at a periodic interval to take the snapshot of the HDFS metadata. In other words, a backup. This information can then be used in the event of a NameNode failure.

Is Hadoop dead? This scheme the internal communication of the components of a Hadoop cluster
Is Hadoop dead? internal communication of the components of a Hadoop cluster

In principle, both single-node clusters and multi-node clusters can be implemented with Hadoop. In the case of a single node, the cluster is implemented on one machine only. All processes then run on a Java virtual machine instance.
In the case of multi-nodes, the master slave architecture already discussed is then implemented over several computers.

Is Hadoop dead?

So is Hadoop dead? Apache Hadoop has clearly lost its status as the sole Big Data solution. Many technologies have already been added that can solve smaller tasks better than the big one solution Hadoop.Today, this small-scale nature enables Big data management solutions that can be optimally tailored to specific use cases. However, Hadoop Hadoop is not dead either. The system still has its strengths and will continue to be the first choice for special use cases in the foreseeable future.

So how is Hadoop evolving?

With the Hadoop Ozone project, an alternative to the Hadoop Distributed File System (HDFS) has now been developed.
It is still to be deployed on a cluster, but corresponds to an object store for Big Data applications. This is much more scalable than than standard file systems and is intended to optimize the handling of small files, a previous Hadoop weakness. Object stores are typically used as a data storage method in the cloud. Through Ozone, they can now be managed locally.
This object store can be accessed by established Big Data solutions such as Hive or Spark without modification.If you want to know more about the hadoop compatible frameworks read our articles on Hive and Spark.


Ozone is built on a block storage layer called Hadoop Distributed Data Store (HDDS) and is designed to scale to billions of objects. The blocks are organized internally using unique namespaces in many independent volumes.
However, one disadvantage of these local object stores is that they are not yet implemented in the core, but must be separated from the traditional file systems by containerized environments such as Kubernetes and YARN. So there are always two truths.

H2O AI – That’s why it’s so great

There is a lot of Big Data software available now. One of them that you should definitely know about is the H2O AI Machine Learning solution.

With this open-source application you can implement algorithms from the fields of statistics, data mining and machine learning. The H2O AI Engine is based on the distributed file system Hadoop and is therefore more performant than other analysis tools. Your machine learning methods can thus be used as
parallelized methods.

Software Stack

They can program their algorithms in R, Python and Java and thus in the most important mathematical programming languages. H2O provides a REST interface to Python, R, JSON and Excel. Additionally, you can access H2O directly with Hadoop and Apache Spark. This makes integration into your data science workflow much easier. You already get approximate results while running the algorithms. A graphical web browser UI helps you to better analyze the processes and perform targeted optimizations.

How Clients Interacts with H2O AI

You can interact with H2O via clients using various interfaces. It is important for you to know that the data is usually not held in memory. They are localized in a H2O cluster and you only get a pointer to the data when you make a request.

How Clients Interacts with H2O AI
H2O Interaction flow

H2O Frame

The basic unit of data storage accessible to you is the H2O Frame. This corresponds to a two-dimensional, resizable and potentially heterogeneous data point. This tabular data structure also contains labeled axes.

H2O Cluster

Your H2O cluster consists of one or more nodes. A node corresponds to a JVM process and this process consists of three layers.

H2O Machine Learning Software Structure
H2O Software Stack

H2O Machine Learning Components

Language Layer

The R evaluation layer is a slave to the REST client front-end and in the Scala layer you can write native programs and algorithms. You can then use these with H2O Machine learning.

Algorithms Layer

This layer is where your algorithms are applied. You can run statistical methods, data import and machine learning here.

Core Layer

In this layer you handle the resource management. You can manage both the memory and the CPU processing capacity.

5 Clustering Algorithms Data Scientists need to know – The key is always to understand the basic approach of any algorithm you want to use

As a data scientist, you have several basic tools at your disposal, which you can also apply in combination to a data set. Here we present some clustering algorithms that you should definitely know and use

In times of Big Data, not only the sheer number of data increases, but also the relationships between them. More and more complex dependencies are formed. This makes it all the more difficult to recognize these similar properties and to assign the data to so-called clusters in a way that can be evaluated.

You have certainly heard of these algorithms and maybe used one or the other, but do you really know what clustering algorithms are?

What are clustering algorithms?

So let’s first clarify what these algorithms are in the first place. The goal is clear: You want to identify similar properties between individual data points in a data set and group them in a meaningful way. These properties are often high-dimensional.

With the help of cluster analysis, you want to reduce this high-dimensional information to a low-dimensional dependency. So, for example, a representation in 2D space. Clustering is an unsupervised machine learning technique and in the end you classify the data points by using algorithms.

The approach to clustering differs from technique to technique. All have their advantages and disadvantages, so it makes sense to try several on one set of data, or apply them in combination. Below we will introduce you to some popular clustering methods and explain their grouping approach.

This picture shows schematically popular Clustering Machine Learning Algorithms you should know as a data scientist
Clustering Machine Learning Algorithms – Popular clustering algorithms

Mean-Shift Clustering

The first algorithm we want to introduce you to is Mean-Shift Clustering. With this you can find dense areas of data points according to the concept of kernel density estimation (KDE). The basis of the clustering is a circular sliding window, which moves towards higher density at each iteration. Within the window, the centers of each class are determined, called centroids.

The movement is now created by moving the center to the average of the points within the window. The density within the sliding window is thus proportional to the number of points within it. This motion continues until there is no direction in which the motion can take more points within the kernel.

Clustering Machine Learning Algorithms - Schematic and simplified representation of the Mean-Shift principle.
Clustering Machine Learning Algorithms – Mean-Shift Clustering Priciple

Hierarchical Cluster Analysis (HCA)

With HCA, clusters are formed based on empirical similarity measures of the data points. This means that the two most similar objects are assigned one after the other until all objects are in one cluster. This results in a tree-like structure. In contrast to the K-means algorithm, which we will discuss later, similarities between the clusters play a role. These are represented by a cluster distance. With K-means, only all objects within a collection are similar to each other, while they are dissimilar to objects in other clusters.

You can create an HCA in different ways. There are two elementary procedures, the top-down and the bottom-up. If you want to know more about Hierarchical Cluster Analysis, read this article.

Schematic and simplified representation of the HCA clustering  principle.
Clustering Machine Learning Algorithms – HCA Principle

Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

GMM basically assumes that the data points are Gaussian and not circular. The clusters are described by their mean and standard deviation. Each Gaussian distribution is randomly assigned to a single cluster and found using the Expectation-Maximization (EM) optimization algorithm. The probability of belonging to a cluster is then calculated for each data point. Thus, the closer a point is to the Gaussian center, the more likely it is then to belong to that cluster. Based on these probabilities, a new set of parameters for the Gaussian distributions is iteratively calculated. That is, the probabilities within a cluster are maximized.

K-Means clustering algorithms

The k-Means algorithm described by MacQueen, 1967 goes back to the methods described by Lloyd, 1957 and Forgy, 1965. You can use the algorithm besides cluster analysis also for vector quantization. Here, a data set is partitioned into k groups with equal variance.

The number of clusters must be specified in advance. Each disjoint cluster is described by the average of all contained samples. The so-called cluster centroid.


Each centroid is updated to represent the average of its constituent instances. This is done until the assignment of instances to the clusters does not
changes any more. If you want to learn more about the K-means algorithm, check this out.

Schematic and simplified representation.of the kmeans clustering algorithm
K-Means Principle

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based cluster analysis with noise. From an arbitrary starting data point, neighborhood points are specified at a distance epsilon. Clustering then begins from a certain neighborhood data point count.

The current data point becomes the first point of the new cluster, or referred to as noise. In both cases, however, it is considered to be examined. The neighboring data points are then added to the cluster. Once all neighbors have been added, a new, unexamined point is called and processed. A new cluster is thus formed.

Schematic and simplified representation of the DBSCAN Clustering principle.
Clustering Machine Learning Algorithms – How DBSCAN works

The field of cluster algorithms is wide and everyone’s approach is different. You should be aware that there is no one solution. You have to consider each algorithm as another tool. Not every technique works equally well in every situation.

The key here is to always understand the basic approach of each algorithm you want to use. Build a small portfolio and get to know these techniques well. Once you master them, you should then add new ones. Knowing your own tools is crucial to avoid try and error and to gain control over your data. Remember: no result is a result. Your added value here is that even if an algorithm doesn’t work well on your data set, it will give you information about the data properties.

Apache Avro – Effective Big Data Serialization Solution for Kafka

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.


You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data, cost-efficiently, Avro can now be used Hadoop-independently. 

You can access Avro via special API’s with many common programming languages (Java, C#, C, C++, Python and Ruby). So you can implement it very flexible.

In the following figure we have summarized some reasons what makes the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it
Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. About that the data hierarchy, i.e. the metadata, is stored separately in a file. The data types and protocols are defined via a JSON format. These are to be assigned unambiguously by ID to the actual values and can be called for the further data processing constantly. This schema is sent along with the data exchange via RPC calls.

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

avro kafka

When you let Avro manage your schema registration, it provides you with comprehensive, flexible and automatic schema development. This means that you can add additional fields and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schema is backward and forward compatible. This means that the schema versions of the Reader and Writer can differ. Schema registration management solutions exist, with Google Protocol Buffers and Apache Thrift, among others. However, the JSON data structure makes Avro the most popular choice.

What does HCA stand for?

What does HCA stand for? What is the difference between Agglomerative and Divisive? When do I use the algorithm and what are its strengths? In this article we will clarify all these questions.

If you don’t know what clustering means, check out this article. Here we also explain four other clustering methods that you as a data scientist must know.

What is an HCA?

Hierarchical Cluster Analysis, or HCA, is a technique for optimal and compact connection of objects based on empirical similarity measures. The two most similar objects are assigned one after another until all objects are finally in one cluster. This then results in a tree-like structure.

What does HCA mean - This figure shows the basic principle of an applied HCA to raw data.
What does HCA stand for? Basic principle of an applied HCA to raw data.

So how does a hierarchical cluster procedure work?

Agglomerative vs Divisive Calculation

The basic clustering can be done in two opposite ways, Agglomerative and Divisive calculation.

Agglomerative clustering:

Agglomerative Nesting, abbreviated AGNES, is also known as the bottom-up method. This method first creates a cluster between two objects with high similarity, and then adds more clusters until all the data has been enclosed.

The divisive cluster calculation follows an opposite concept.

Divisive hierarchical clustering:

Divise Analysis, also known as DIANA, is a top-down method. All objects are directly framed into a cluster and then reduced in size.

In the following figure, the agglomerative process is compared with the divisive process.

What does HCA stand for?  The figure compares the agglomerative and divisive calculation.
What does HCA stand for? Agglomerative vs Divisive Calculation

Thus, the goal is to represent the common properties in low dimension in multidimensional raw data. A strength of this machine learning method is the inclusion of cluster relationships. With K-means, only all objects within a collection are similar to each other, while they are dissimilar to objects in other clusters. If you want to know more about this other popular clustering method, read this article.

How to calculate the cluster distances?

As mentioned earlier, not only are similarities between data points in a cluster weighted, but also similarities between groups. These similarities are represented by distances between the clusters. These distances can be determined in different ways. The distance between the centroids of two clusters can be calculated. A single linkage is the shortest distance between two clusters, a complete linkage is the largest distance between two clusters and an average linkage is the average distance between two clusters.

The figure below contrasts each cluster distance calculation method.

The figure contrasts each cluster distance calculation method. A single linkage is the shortest distance between two clusters, a complete linkage is the largest distance between two clusters and an average linkage is the average distance between two clusters
Cluster distance calculation methods

In addition to the planar representation, the HCA can also be represented in a dendrogram.

HCA represented in a Dendrogram

Since an HCA describes a tree structure, it can be well represented in a dendrogram. Here the connections between the individual data elements and the connections between the clusters become well visible. This diagram can help to choose the optimal number of clusters in the data depending on where you intersect the tree.

In the following figure, for example, such a dendrogram is shown in dependence on Agglomerative and Divisive Calculation.

The figure shows a HCA represented as a dendrogram in dependence to Agglomerative and Divisive Calculation.
HCA presented as dendrogram in dependence to Agglomerative and Divisive Calculation.

Data Science vs Data Analysis – Which of the two Professions Suits you Best?

Data Science vs Data Analysis – What distinguishes both professions from each other? How do your tasks differ? In this article, we will discuss all of these questions.

By now, almost every company, across industries and sizes, has recognized the potential in their own data. Every company wants to access this treasure and gain valuable information in order to develop profitable business strategies.
The economy is crying out for experts who can manage and analyze the enormous volumes of data. A trend that is not expected to end in the next 10 years, but rather to grow steadily.
So if you decide to enter the industry today and start studying, or if you want to teach yourself, you should first be clear about the differences between these often confusingly named professions. Often, HR professionals don’t even know these differences and look for the wrong profiles.

What are the similarities?

Let’s start with the similarities and the main reasons why both disciplines are often confused with each other.

Both professions deal with large amounts of data from which knowledge is to be extracted for a specific purpose.
New insights are to be generated and actions are to be identified.

Map of data disciplines

In order to properly understand the relationships between the data sciences, we need to look at the following figure. The individual disciplines and their relationships to each other are shown here.

Data Science vs Data Analysis  - This scheme shows the map of all data disciplines
Data Science vs Data Analysis – Map of data disciplines

The diagram corresponds to an onion-like layering. It is important to understand that all the disciplines listed here are different. Not only are there intersections, but when you talk about a higher level discipline, it includes other, lower level disciplines.

As you can see, both data science and data analysis are ranked very high. So to understand these two disciplines you need to know the other fields as well.

What is Data Science?

When you talk about data science, you are also talking about all other data disciplines.
A data scientist is an all-rounder and can apply all interdisciplinary tools and methods. He or she can handle structured and unstructured data and perform data preprocessing in addition to analysis.

What is Data Analysis?

Data analysis is more about using the right data analysis tools. Specialized data processing is not required at this level, but a data analyst must be able to fully master and understand the tools in order to gain new insights from the data.

What is Data Analytics?

Data analytics is primarily about the use of queries and data aggregation methods. The primary question here is: How can different dependencies between input variables be represented?
Furthermore, this discipline makes use of data mining techniques and tools.

Data mining

Data mining uses the predictive power of machine learning by applying various machine learning algorithms to big data to identify new trends in the data.

If you want to know even more about how data mining differs from data analytics, check out this article we wrote on the subject.

Data Science vs Data Analysis – So what are the differences?

So we have found that all data disciplines are similar in many ways and one discipline can imply other disciplines. In order to be able to define the differences precisely, the methods used must be compared with each other. Are programming skills required, or is the business intelligence part higher?

In the following figure, the assignment to both professions is shown once.

Data Science vs Data Analysis - This diagram shows the cornerstones of the two data disciplines. Mathematics, statistics and business intelligence
Venn Diagram
by Hugh Conway in 2010

Both disciplines lie at the intersection of mathematics, statistics, and development. While data science is characterized by the fact that it consists of all three cornerstones, data analysis lacks the connection to computer science. And that is the biggest difference between the two fields.

Data Science vs Data Analysis – Comparison

Data Science is a branch of Big Data, with the objective of extracting and interpreting information from a huge amount of data. To do this, a data scientist must design and implement mathematical algorithms and predictive models based on statistics, machine learning, and other methods.
Data Analysis is the specific application of Data Science. It specifically involves searching raw data sources to find trends and metrics. However, this involves working with larger data sets than in the area of Business Intelligence.

In the following diagram, these differences and the overlaps between the two professions are compared once again.

datascientist vs dataanalyst
Data Science vs Data Analysis – Comparison

So what you ultimately decide to do depends on your programming interests. Do you want to develop the analyses yourself, or do you prefer to use specific analysis tools to get more value out of large data sets?

That is why Liquid State Machines (LSM) are great

– Recently developed computational model

– does not require information to be stored in some stable state of the system

→ the inherent dynamics of the system are used by a memory less readout function to compute the output

→ can be used for complex Tasks (pattern classification, function approximation, object tracking, …)

LSMs take the temporal aspect of the input into account

Concept

The figure shows a typical structure of a liquid State Machine.
Liquid State Machine

Reservoir/ Liquid

– large accumulation of recurrent interacting nodes
→ is stimulated by the input layer
– Liquid itself is not trained, but randomly constructed with the help of heuristics
– Loops cause a short-term memory effect
– preferably a Spiking Neural Network (SNNs)
→ are closer to biological neural networks than the multilayer Perceptron
→ can be any type of network that has sufficient internal dynamics

Running State

→ will be extracted by the readout function

– depend on the input streams they’ve been presented

Readout Function

– converts the high-dimensional state into the output

– since the readout function is separated from the liquid, several readout functions can be used with the same liquid

→ so different tasks can be performed with the same input

lsm readout fcts
different types of readout functions

Apache Flink

Overview

– Open source stream processor framework developed by the Apache Software Foundation (2016)
– Data streams with high data volume can be processed and analyzed with low delay and high speed

flink analytics
Flink provides various tools for efficient real-time processing of continuous data streams and batch data

Core functions

– diverse, specialized APIs:
→ DataStream API (Stream Processing)
→ ProcessFunctions (control of states and time; event states can be saved and timers can be added for future calculations)
→ Table API
→ SQL API
→ provides a rich set of connectors to various storage systems such as Kafka, Kinesis, Kubernetes, YARN, HDFS, Elasticsearch, and JDBC database systems
→ REST API

Stream Processing

pexels pixabay 2438
How to handle this flood of data?

== Data is processed continuously with a short delay
→ without intermediate storage of the data in separate databases
– several data streams can be processed in parallel
– Each stream can be used to derive own follow-up actions and analyses

Architecture

Data can be processed as unbounded or bounded streams:

  • Unbounded stream

    • have a start but no defined end

    • must be continuously processed

  • Bounded stream

    • have a defined start and end

    • can be processed by ingesting all data before performing any computations(== batch processing)

– Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.

In case of a failure, Flink replaces the failed container by requesting new resources.

– Stateful Flink applications are optimized for local state access