Real-time Streamig with Confluent Platform: The Future of Data Processing

March 24, 2023 / RainerGewalt / 0 Comments

Data processing has become a vital component for the success of businesses in the modern world. While batch processing used to be the primary method for data processing, data streaming has emerged as a promising alternative for real-time data processing. In this article, we will delve into the Confluent Platform, one of the leading data streaming platforms in the market, and explore its features and benefits.

Stream processing

Stream processing, also known as data streaming, refers to a software paradigm where continuous data streams are captured, processed, and managed in real-time. Unlike traditional data processing, which relied on batch processing, real-time data processing enables businesses to gain insights into their data as events occur, rather than after the fact. This is particularly essential in today’s dynamic business environment, where data is rarely static.

Apache Kafka

The Confluent Platform is a comprehensive event streaming platform that builds on top of Apache Kafka, a message broker that allows data to be stored in topics as logs, enabling any number of clients to subscribe and rewrite the data. Microservices can access the data streams from multiple topics with ease, and data structure remains consistent, regardless of programming languages and technology used. Confluent Platform provides additional features such as the Schema Registry, Kafka Connect, and Control Center.

Schema Registry

The Schema Registry allows for the management of schema versions for data stored in Kafka, ensuring that data is properly structured and can be easily consumed by different systems. Kafka Connect simplifies the integration of Kafka with other systems, while Control Center provides a graphical user interface for monitoring and managing Kafka clusters.

avro kafka — Schema Registry integration with Kafka and Apache Avro

Confluent tools and services

Confluent offers additional tools and services such as Confluent Cloud, a fully-managed cloud service for event streaming, and Confluent Hub, a centralized marketplace for Kafka connectors and other Kafka-related extensions. With the Confluent Platform, businesses can leverage the power of real-time data processing to gain a competitive edge in today’s market.

ksqlDB

One of the key components of the Confluent Platform is ksqlDB, an event streaming database that allows for easy transformation of data within Kafka’s data pipelines. With ksqlDB, microservices can enrich and transform data in real-time, enabling anomaly detection, real-time monitoring, and real-time data format conversion. This is made possible by window-based query processing, which allows continuous stream queries based on window-based aggregation of events. Windows are polling intervals that are continuously executed over the data streams. Several window types are available, such as Tumbling, Bouncing, and Session, and they differ in their composition to each other. In addition to continuous queries through window-based aggregation of events, ksqlDB offers many other features that are helpful in dealing with streams.

For example, the last value of a column can be tracked when aggregating events from a stream into a table. Multiple streams can be merged by real-time joins or transformed in real-time. The database is distributed, fault-tolerant, and scalable, and Kafka Connect connectors can be executed and controlled directly.

live streaming microservice architecture with confluent — Example microservice architecture with Confluent Infrastructure: Apache Kafka – ksqlDB data stream live processing

Confluent’s event streaming database ksqlDB offers an excellent solution for real-time data stream processing with Kafka. Kafka is an ideal solution as a central element in a microservice-based software architecture. Microservices can run as separate processes and consume in parallel from the message broker, and ksqlDB ensures real-time stream processing within the services.

Conclusion

In conclusion, real-time data streaming and processing is the future of data processing, with businesses increasingly relying on this technology to gain insights from their data as events occur. Data streaming complements batch processing instead of completely replacing it. Batch processing is still used for tasks where real-time processing is not required, such as generating reports or conducting periodic data analysis. On the other hand, data streaming is used for tasks that require real-time processing, such as monitoring IoT devices or processing financial transactions in real-time. The two approaches complement each other and can be combined depending on the use case to achieve the best results. While the Confluent Platform offers a robust set of tools and services for real-time data processing, it is important to note that there are several alternatives available in the market. As technology continues to evolve, it is difficult to predict which platform or solution will emerge as the dominant player in the field. However, it is clear that real-time data processing and streaming will continue to play a crucial role in helping businesses stay competitive in today’s market.

ERP vs MES vs PLM vs ALM – What role will they play in industry 4.0?

March 14, 2021 / RainerGewalt / 3 Comments

ERP vs MES vs PLM vs ALM – These terms are being mentioned more and more often in connection with Industry 4.0. But what is behind these systems and what are the differences?

this scheme gives an Example of a business process pyramid — Example of a business process pyramid

To stay competitive in today’s world, you need to increase the efficiency of your business processes. It is important that you optimally plan, control and manage your operational resources (capital, personnel…).

Your goal should be to create high quality and continuity with high productivity and low lead time.
Many of your business processes create ever larger amounts of data and increase in complexity. You need to reduce this complexity and increase your flexibility.
Many software solutions are available to your company for the optimal use of resources.

What is an ERP?

Basically, an ERP system is an IT-supported system of software solutions that communicate with each other. Your data is stored centrally and should represent your company in its entirety through quickly available information.

This scheme gives xou an overview about the ERP systems — ERP vs MES vs PLM vs ALM – Overview ERP Systems

The information of your business processes is optimized and documented.
The trend is towards web-based applications.

This means that you access the system interface via your browser and that you can also access it beyond the boundaries of your company. Another advantage is that you don’t have to install any services, making you hardware-independent.

What are ERP Subsystems?

You can use ERP systems in all areas of your business. They provide you with complete solutions for all necessary subsystems.

This scheme shows the ERP fetures — ERP vs MES vs PLM vs ALM – ERP features

Complex systems are divided into so-called application modules, which you can combine with each other as you wish. These fulfill various tasks for the provision and further processing of information. In this way, you can put together your ERP system according to your requirements and adapt it to the size of your company.

What is Advantages Cloud ERP?

ERPs can also be purchased as a complete Software-as-a-Service (SaaS) solution.

This scheme shows the ERP Cloud Advantages — ERP vs MES vs PLM vs ALM – ERP Cloud Advantages

These are comletely industry and hardware independent. You, as a user, can access a sophisticated ERP software package online and thus from anywhere. This gives you absolute spatial flexibility. However, Cloud ERP solutions are still quite new and not yet fully mature. So you should weigh up well in advance whether you want to use a cloud application.

What is an MES?

The MES system is an operational process-related part of a multi-layer MES System. It is responsible for real-time production management and control. You can use MES data to optimize manufacturing processes and detect errors during the production process.

The MES system is assigned to the ERP system. This system accesses your MES data to plan production. It then feeds this information back to your production control system for implementation.

ERP vs MES vs PLM vs ALM – Relationship between company level

The interaction of the individual components is moving closer together in Industry 4.0.

What does the MES include?

MES is usually a multi-layer overall system. It processes your production data into Key Performance Indicators (KPI) and enforces the fulfillment of an existing production plan.

this diagram clearly shows all components of a MES System — ERP vs MES vs PLM vs ALM – MES System features

It processes your production data into Key Performance Indicators (KPI) and enforces the fulfillment of an existing production plan.

What is an PLM?

In addition to MES and ERP, the Product Life Cycle Management (PLM) system plays an elementary role in the digitization of your company.

In order for your company to remain internationally competitive in today’s world, you need to optimize your business models in order to be able to act preventively.

As a manufacturing company, you need to be able to analyze large amounts of data quickly. This way you can recognize deviations from the plan early on and make the right decisions.

Many software solutions help you in all business areas and even exchange data with each other. In this way, you can create information chains within a company and act more quickly.

PLM System is a management approach for the seamless integration of all information that accumulates during the life cycle of a product.
The core components of PLM are the data and information related to the product lifecycle.

this scheme shows your production life cycle process — ERP vs MES vs PLM vs ALM – Production life cycle process

A large amount of product-related and time-dependent data is generated along the product life cycle. The PLM enterprise concept is based on coordinated methods, processes and organizational structures and usually makes use of IT systems. PLM tools link design, implementation and production and provide feedback from manufacturing.

this scheme shows PLM main application areas — ERP vs MES vs PLM vs ALM – PLM main areas of application

The goal of a PLM system is the central management of information and corresponding user groups. One advantage here is that you can control the process of editing and distribution throughout the company.

Application Lifecycle Management (ALM) vs PLM System

More and more products and systems now contain a software component. However, since hardware and software are historically different, you must also differentiate between the management systems.

This schema shows the major differences between ALM and PLM — ALM vs PLM

With PLM you are looking at a physical product, with ALM you are looking at a software product. Basically, however, there are similarities between the two systems. Both also track a product over its entire lifecycle. However, since both product types are increasingly merging today, you can also link both systems on an IT basis at the overall product level.

ERP vs MES vs PLM vs ALM – What does the future hold?

When people talk about Industry 4.0, they are referring to a new level of technological progress. The basis of this innovation is the Internet of Things (IoT). The software solutions of various company levels are networked to form cyber-physical systems and exchange information with each other in real time. In this way, production planning can take place in management and be implemented directly in production. As production becomes more complex in the future, mastering complexity and complex technologies will come with the necessary know-how.

The software solutions presented here are systems optimized for business areas. Each software system is therefore an expert in its own field. This ensures a decisive modularity for a company’s overall solution. On the other hand, this modularity always leads to increased complexity. In the future, it will become increasingly important to create reciprocal data pipelines, so-called data streams, between the individual systems, which currently still operate very autonomously.

ERP vs MES vs PLM vs ALM - This schema shows their roe in industry 4.0 — ERP vs MES vs PLM vs ALM – And their role in Industry 4.0

A decision made at the management level should be implemented in production and at the same time remain controllable at all levels. Optimally, the system should be able to make its own analyses. AI algorithms can help here to find sensible decisions despite increasing complexity. This allows you to optimize your individual production steps and shorten life cycles.

This schema shows the role of a MES System in Industry 4.0 — ERP vs MES vs PLM vs ALM – Industry 4.0 and MES System

The MES, for example, plays an important role here due to its proximity to production. This allows you to make important decisions quickly and implement production plans.In your company of the future, software solutions from various divisions are networked with each other. So you can form information chains and the MES is part of this network.

NumPy vs Pandas – Which is used When?

March 13, 2021 / RainerGewalt / 3 Comments

NumPy vs Pandas – Since in our time in every science and economic branch ever larger amounts of data accumulate, which must be analyzed and managed performantly, the learning of a programming language has become interdisciplinary indispensable.

For many, Python is the first programming language in the classical sense, due to its beginner friendliness and mathematical focus. Python offers the possibility of accessing ready-made, optimized computational tools through the modular implementation of powerful mathematical libraries.

NumPy vs Pandas - The schme shows popular python libraries and their place in the Python ecosystem — **NumPy vs Pandas** – Their place in the Python ecosystem

However, this offer can also quickly become overwhelming. Which library, which framework is suitable for my purposes? Will I save myself work with this tool, or will I reach its limits? Here you can learn more about SciPy and why you should definitely prefer it over MATLAB and here we compared the two Python visualization methods matplotlib and seaborn. These Python libraries are absolutely compatible with each other and together they make a very interesting data science tool. NumPy and Pandas are perhaps two of the best-known python libraries. But what are the differences between them? We will get to the bottom of this question in this article.

What actually is NumPy?

NumPy stands for “Numerical Python” and is an open source Python library for array-based calculations. It was first released in 1995 as Numeric, making it the first implementation of a Python matrix package, and rereleased as NumPy in 2006. This library is intended to allow easy handling of vectors, matrices, or large multidimensional arrays in general.

The scheme shows NumPys major applications — NumPy vs Pandas – Numpys Major Applications

For performance purposes, it is written in C, a deep, machine-oriented programming language. NumPy is compatible with a wide variety of Python libraries, some of which are also based on NumPy, adding further useful functions to its power, such as: Minimization, Regression, Fourier Transform

Python and Science

As mentioned earlier, Python is the programming language most intensively used in the application domain of scientific research across all disciplines for data processing and analysis. What is very interesting here is that the solution approaches are similar across disciplines at the data level. Thus, an exchange of ideas has become indispensable and leads more and more to a fusion of the sciences.

This is only mentioned in passing, but should also emphasize the importance of this programming language and its libraries, which are so often open source and further developed by a community.

NumPy vs Pandas - The schema shows Scientific Computing with NumPy over science disciplines — NumPy vs Pandas – Scientific Computing with NumPy

NumPy was developed specifically for scientific calculations and forms the basis for many specific frameworks and libraries.

The elementary NumPy data structure

The core functionality of NumPy is based on the “ndarray” data structure.

The schema shows NumPys fundamental data structure — NumPy vs Pandas – NumPys fundamental data structure

Such an array can only hold elements of the same data type and always consists of a pointer to a contiguous memory area together with the metadata describing the data stored in it. This allows processes to access them very efficiently and manipulate them as desired.

The schema shows how NumPys fundamental data structure could be manipulate — NumPy vs Pandas – NumPys data structure is manipulable

Thus, the shape can be changed via so-called reshaping, smaller subarrays can be created within a given larger array, arrays can be split, or merged.

What is Pandas?

Pandas is an open source library for data analysis and manipulation in Python. Already released in 2008 by Wes McKinney and written in Python, Cython and in C. Pandas are used in almost all areas and find worldwide appeal in all industries.

The schema shows Pandas major applications — NumPy vs Pandas – Pandas Major Applications

The name Pandas is derived from Panel Data.
Its strength lies in the processing and analysis of tabular data and time series.

The schema shows Pandas major features — NumPy vs Pandas – Pandas Features

Especially in the pre-processing of data, pandas offers a lot of operations. In addition to high-performance filter functions, very large data volumes with over 500 thousand rows can be transformed, manipulated, aggregated and cleaned.

Pandas fundamental data structures

As a basis for the individual functions and tools that Pandas provides, the library defines its own data objects. These objects can be one, two, or even three-dimensional.

The one-dimensional series object can take up different data types in contrast to NumPys ndarrays and corresponds to a data structure with two arrays. One array as index and one array holding the actual data.

The two-dimensional DataFrame object contains an ordered collection of columns. Here, each column can consist of different data types and each value is unique by a row index and a column index.
The eponymous Panel object is then a three-dimensional dataset consisting of dataframes. These objects can be divided into major axes, which are the index rows of each DataFrame, and minor axes, which are the columns of each of the DataFrames.

NumPy vs Pandas – Conclusion

Both libraries have their similarities, which are due to the fact that Pandas is based on NumPy, but is it an either or question? No, clearly not. Pandas is based on NumPy, but adds so many individual features to its functionality that there is a clear justification for their parallel existence. They simply serve different purposes and should be used for both.

One of the main differences between the two open source libraries is the data structure used. Pandas allows analysis and manipulation on a tabular form while NumPy works mainly with numerical data in arrays whose objects can have up to n dimensions. These data forms are easily convertible among themselves via an interface.

Pandas is more performant especially with very large data sets (500K rows and more). This makes data preprocessing and reading from external data sources easier to perform with Pandas and can then be transferred as a NumPy array into complex machine learning or deep learning algorithms. If you want to know more about machine learning methods and their fields of application, take a look at this article from us.

What is data warehousing and does it still make sense?

March 12, 2021 / RainerGewalt / 3 Comments

Data Warehousing – In today’s flood of data, it is becoming increasingly difficult to maintain a clear data management system. More and more data sources are recorded via different software systems.
A unified, centralized system can facilitate analysis and ensure that only one data truth exists in an organization.

What is a Data Warehouse System?

Data warehouse systems are built by integrating data from multiple heterogeneous sources and, in addition to centralization, performs the task of structuring data, supporting analytical reporting and structuring decision-making.
The system can perform data cleansing as well as data integration and data consolidation and does not require transaction processing or recovery.

Data Warehousing - The figure shows all other names that the system has. — Multiple names for Data Warehouse

It is thus a powerful Big Data information system that can centrally handle everything related to data processing.

Data Warehousing Features

Data warehousing offers several features. Such an information system is subject oriented. It does not focus on the current operation, as these data are separated. This means that frequent changes in the operational database are not reflected in the data warehouse. Thus, the focus is on modeling and analysis of data.
The system is Time variant, which means that the collected data are identified with a certain period of time and previous data are not deleted when new data are added.

What does a Data Warehouse structure look like?

The complexity of this system increases exponentially with the complexity of the business. Many distinctive data sources, i.e. business processes, provide commutative and historical data.

This figure shows the main principle behind data warehousing simplified. — Data Warehousing main Principle

Therefore, basic approaches have been defined according to which every data warehouse system should be structured. Single Tier, Two Tier and Three Tier.

2 Tier vs Three Tier Data Warehouse Architecture

In the following we will work out the three tier architecture.
This, the most commonly used, structure is completely decoupled from the data and the user interface by moving the application logic to a middle tier.
In two-tier, the application logic resides either in the user interface on the client or in the database on the server.
Thus, without a middle tier, this system is less scalable and more flexible. Integration of other data sources is more difficult here.

Three Tier Data Warehouse Architecture

The Three Tier Data Warehouse Architecture is the design on the basis of which a data warehouse with three tiers is then built. The figure below shows this structure with common components.

However, the individual components can vary and depend on the project framework. As a rule, however, these changes do not alter the basic structure.

Bottom Tier

The lowest layer is persistence, which is usually located on a server. The data from various data sources is prepared and stored here using an ETL (extract, transform and load) process. Tools and other external resources can be used to feed the data.
This persistence can consist of a relational but also a multidimensional database system.

Load Data into Warehouse

In addition to the different components and architectures, data can also be transmitted to the information system in different ways.

etl elt warehousing 2 — Data Warehousing – ETL vs ELT

As shown in the figure, a basic distinction is made between two elementary processes.

What is ELT?

Extract, Load, and Transform, or ELT for short, is about extracting aggregate information from the source system and loading it into the target method.

The following figure shows such an example system. In this case, the Hadoop framework handles the central data management, while applications and analysis tools access the untransformed data.

warehouse elt 1 — Data Warehousing – Extract, Load, and Transform

What is ETL?

In Extract, Transform and Load, or ETL for short, the data set is first extracted from the sources into a staging area, then transformed or reformatted with business manipulation performed on it, and only then loaded into the target or destination database or data warehouse.

warehouse etl 1 — Data Warehousing – Extract, Transform, and Load

Middle tier

One or more OLAP (Online Analytical Processing) servers reside in the middle data warehouse layer. This technology can be used to create complex budget plans and perform analyses cost-effectively. So in the three tier data warehouse architecture, jobs are generated in the top tier and sent to this middle tier. Here, the data in the bottom tier is then accessed and analyses are performed. The result is then sent to the top tier and thus made available to the user, and/or forwarded to the bottom tier for storage of the analysis results in persistence.

What is an OLAP Server?

Basically, three OLAP server models are distinguished.
In Relational OLAP (ROLAP) the operations on multidimensional data are based on standard relational operations. The Multidimensional OLAP (MOLAP) directly implements the multidimensional data operations. A mixture of relational and multidimensional processing can be handled by Hybrid OLAP (HOLAP).
The choice of the server model always depends on the data composition in the lowest layer.

Top-Tier

The top tier is the top of the three tier data warehouse architecture, the front-end client layer. It contains query and reporting tools, analysis tools, and data mining tools, thus providing the interface to the user. Here he can generate analyses and take a look at the data.

Terminologies

However, some terms that often come up in connection with this system need to be clarified.
When metadata is mentioned, a kind of roadmap to the data warehouse is meant. Here the warehouse objects are defined and it acts as a directory. This means that the decision support system finds the contents via the metadata.

The metadata is stored in the metadata repository. An integral directory that manages both the business metadata, i.e. data ownership information, business definition and change policies, and the operational metadata. Operational metadata refers to the timeliness of the data is it active, archived or cleansed, and data lineage, which is the history of the data. This includes the data used to map the operational environment, source databases and their contents, data extraction, data partitioning, cleansing, transformation rules, data refreshing and cleansing rules, but also the algorithms for summation, dimensional algorithms, data for granularity, aggregation, summation, etc.

The so-called data cube represents data in multiple dimensions and the data mart contains only the data specific to a certain group.

Data Warehouse Types and How they work

All Data Warehouse systems follow the same basic structure, which we explain in this article, but can consist of different components. Accordingly, they are typified.

Host-Based mainframe warehouse

The Host-Based mainframe warehouse resides on a large-volume database.
In addition to this database, metadata is managed in a central metadata repository. Within this metadata, for example, the information for the documentation of data sources or data translation rules are stored.

In general, three phases run in this information system.
Selections and scrubbing methods take place in the unloading phase. That is, the appropriate data types and data sources are determined here and the data is subsequently error corrected.
In the following transform phase the data are translated into a suitable form. Here also already the rules for the access and the storage are specified.
In the final Load phase, the preprocessed data set is moved into tables.

Host-Based LAN data warehouse

With this type, information can be extracted from a variety of sources. Multiple LAN-based warehouses are supported. Data provisioning can be either centralized or from the workgroup environment. The size depends on the platform.

Multi-Stage Data Warehouses

Here, the data is staged several times before being loaded into the data warehouse and finally distributed into department-specific data marts.

Stationary Data Warehouse

In the case of stationary warehouse types, the data from the sources are not changed. The customer thus has direct access to the data.

Distributed Data Warehouses

In the distributed warehouse system, the data is basically distributed. For technological or business reasons, this separation can take place in local and global warehouses. The local levels are then only integrated within the local site but also contain historical data and is therefore absolutely autonomous. This is where most of the operational processing takes place, while the global part processes the data that is relevant to the company as a whole.

Are data warehouses still promising?

Nowadays, data streams are on everyone’s lips and it is precisely with iterative and highly dynamic data sources that large data warehouse systems reach their limits. Often, smaller tools that do not require a complete solution are more efficient. So are data warehouse systems dead after all? No, definitely not. Apart from the basic principles, such as the pursuit of data truth, which can be applied to any other software system. In many cases, however, a data warehouse system is still a high-performance overall solution and can coexist with data stream pipeline systems.

With Apache Hive you can access a free Apache data warehouse software. With this software you can easily implement very performant big data systems. Here we have collected the most important information about Hive.

Is Hadoop dead? Should I invest time to learn the Hadoop ecosystem?

February 28, 2021 / RainerGewalt / 5 Comments

Is Hadoop dead – In the IT sector in particular, technologies and software architectures do not have a long shelf life. As new technical insights are gained, the requirements and use cases for the systems also change. As young as the term “big data” is, it is also undergoing constant change. The increased acceptance of open source projects in the business community has led to increased diversification and thus to many mutually beneficial competitive situations.
Apache Hadoop has been considered the one all-purpose solution for over a decade. A Big data ecosystem in which Hadoop plays together with many other extensions. In recent years, however, more and more people are claiming that the demands on data processing have changed and see Hadoop as an outdated concept.

A few years ago, the primary goal was to efficiently handle ever-increasing data volumes, but today iterative real-time analyses on dynamic data sets are required. Data management systems must not be self-contained, but must remain manipulable and monitorable at all times.
So is Hadoop dead, or still indispensable?

What is Hadoop?

Hadoop is a Linux-based open source Big Data framework for scalable, distributed software. It is originally based on Google’s MapReduce algorithm and enables computationally intensive processes of large data sets by parallelizing them on computer clusters, i.e. a large number of networked computers, using multiple components working together.

Is Hadoop dead? This diagram shows the Hadoop ecosystem — Is Hadoop dead? **Hadoop** **ecosystem**

The Hadoop ecosystem is composed of the Hadoop Common, an interface for all other components. It connects Hadoop to the file system of the computers and contains the libraries.In the Hadoop Distributed File System
( HDFS ) very large amounts of data are stored. This is organized as a server cluster with master and slave nodes. The resources are controlled via the Yet Another Resource Negotiator (YARN) component. This resource manager distributes the individual tasks to the available resources, such as CPU and memory.

What is the MapReduce algorithm?

Google’s MapReduce programming model, even though it is currently being replaced by engines based on Directred-Acyclic-Graph (DAG), is still a core component of the Hadoop framework. So if we want to understand how Hadoop works, we first need to understand what MapReduce is in the first place.

Is Hadoop dead? This diagram shows the principle behind Google's MapReduce algorithm — Is Hadoop dead? **Googles Map Reduce Algorithm principle**

Configurable classes for Map, Reduce and Combination phases are provided via the Hadoop MapReduce framework. Map means that a set of data is transformed into another set of data, where the individual elements of the data are combined into tuples (key/value pairs). In the Reduce phase, the formed tuples are then combined into smaller sets of tuples.

How a Hadoop cluster works

As mentioned earlier, Hadoop distributes storage and processing of large amounts of data in a balanced manner across compute clusters, or interconnected hardware.
These computers are connected to a dedicated server that acts as the master
components. The master node organizes the storage of files and the metadata in the individual slave nodes. Within a cluster, data is stored on multiple computers called nodes. The files are partitioned into data blocks and distributed redundantly among the nodes.

Is Hadoop dead? This diagram shows the components of a Hadoop cluster — Is Hadoop dead? **Components of a Hadoop Cluster**

The NameNode and Resource Manager run on the master node. These collect data in the Hadoop Distributed File System (HDFS) and store data with parallel computations by applying MapReduce.

The client nodes are responsible for loading the data into the cluster’s
Architecture. The slave node is one responsible for collecting the data
Client nodes.

How does communication within a cluster work?

The internal communication, i.e. the process of job execution, is organized via so-called JobTrackers and TaskTrackers.
The client submits a MapReduce job to the JobTracker on the master to process a particular file.The JobTracker then determines the DataNodes that store the blocks for that file by querying the NameNode. The NameNode manages the HDFS file system metadata, so it keeps track of all the files that are divided into blocks. The DataNodes store and retrieve these blocks. Then tasks are assigned to different TaskTrackers based on the information received from the NameNode . In the process, the status of each task NameNode and DataNode is monitored.
A secondary NameNode communicates with the NameNode at a periodic interval to take the snapshot of the HDFS metadata. In other words, a backup. This information can then be used in the event of a NameNode failure.

In principle, both single-node clusters and multi-node clusters can be implemented with Hadoop. In the case of a single node, the cluster is implemented on one machine only. All processes then run on a Java virtual machine instance.
In the case of multi-nodes, the master slave architecture already discussed is then implemented over several computers.

Is Hadoop dead?

So is Hadoop dead? Apache Hadoop has clearly lost its status as the sole Big Data solution. Many technologies have already been added that can solve smaller tasks better than the big one solution Hadoop.Today, this small-scale nature enables Big data management solutions that can be optimally tailored to specific use cases. However, Hadoop Hadoop is not dead either. The system still has its strengths and will continue to be the first choice for special use cases in the foreseeable future.

So how is Hadoop evolving?

With the Hadoop Ozone project, an alternative to the Hadoop Distributed File System (HDFS) has now been developed.
It is still to be deployed on a cluster, but corresponds to an object store for Big Data applications. This is much more scalable than than standard file systems and is intended to optimize the handling of small files, a previous Hadoop weakness. Object stores are typically used as a data storage method in the cloud. Through Ozone, they can now be managed locally.
This object store can be accessed by established Big Data solutions such as Hive or Spark without modification.If you want to know more about the hadoop compatible frameworks read our articles on Hive and Spark.

Ozone is built on a block storage layer called Hadoop Distributed Data Store (HDDS) and is designed to scale to billions of objects. The blocks are organized internally using unique namespaces in many independent volumes.
However, one disadvantage of these local object stores is that they are not yet implemented in the core, but must be separated from the traditional file systems by containerized environments such as Kubernetes and YARN. So there are always two truths.

TensorFlow or Theano?

February 20, 2021 / RainerGewalt / 1 Comment

TensorFlow or Theano – TensorFlow, along with PyTorch, is currently the best known and most widely used machine learning framework. However, the choice of tool should never depend on one’s own preferences, but should be adapted to the data to be examined. Especially in the Big data area, this can prevent a decisive loss of performance. It is therefore also worthwhile to look off the beaten track and to look at other frameworks and libraries in addition to the top dogs.
Theano is one such open source Python library. In the following article, we will introduce both tools and explain the differences.

What is Tensorflow?

The open source framework TensorFlow is the direct successor of Google’s first deep learning tool DistBelief and primarily also forms the basis for neural networks in the environment of language and image processing tasks. With TensorFlow, own models can be developed and processed, but also pre-trained models can be accessed. TF runs on a variety of platforms and is implemented in Python and C++.

TensorFlow vs Theano - This figure shows the hierarchy of the TensorFlow framework. — Hierarchy of TensorFlow toolkits

TF offers low-level APIs for CPU, GPU or TPU. In this way, the hardware resources can be optimally adapted to the process through dynamic allocations.
In addition to the low level APIs, there are also various high level APIs, such as Keras, one of the best known and most frequently used. If you want to know more about Keras, check out our article on the topic.

Framework Architecture

Mainly, the TensorFlow framework can be divided into the components needed for training, where the models are prepared for field use, and for the final deployment, for example on mobile and IoT devices with TensorFlowLite. To simplify the training, TensorFlow offers the developer some useful services besides the already mentioned dynamic allocation. For example, a premade estimator offers a high-level representation of a complete model.Via the TensorFlow Hub, a kind of repository, even trained machine learning models can be other language bindings can be accessed.

TensorFlow vs Theano - This figure shows the structure of the TensorFlow framework. — TensorFlow or Theano – Structure of the TensorFlow Framework

The TensorBoard and StoredModels services act as connecting elements between training and deployment. TensorBoard is the visualization toolkit of TensorFlow with which the experiment results can be visualized. So here it is more of a monitoring solution for the human interface. With the StoredModels both deployment services and training services can share the models. This service thus forms a kind of intermediary, but contains a complete TensorFlow program, including all weights and calculations.

TensorFlow – Data Structure

Neural networks are represented by directed cycle-free graphs. These graphs can be represented and computed beyond the computer limits of training. A graph basically consists of nodes connected by edges. The extent to which the nodes are interconnected also usually determines the learning procedure and thus the structure of an artificial neural network.
The inputs and outputs of the individual calculation steps represent multidimensional data arrays, so-called tensors.

This figure shows the basic tensor structure — Tensor Principle

The mathematical term tensor corresponds to a generalization of vectors and matrices. It is thus an elementary data structure for data representation and processing. In TensorFlow the implementation is done as multidimensional arrays . A vector thus corresponds to a one-dimensional tensor.
Additional dimensions can be added to a tensor up to infinity. Common tensor types are 3-dimensional tensors for time series, images are usually 4-dimensional, and videos are 5-dimensional tensors.

pytorch training 2 — Tensors and neural networks

TensorFlow methods manipulate tensors for linear algebra operations. These processes can be executed with high performance by moving the tensor objects to the graphics card memory or tensor optimized TPUs.

TensorFlow – Training

The training itself then proceeds in such a way that training data are iteratively fed into the computers and at the same time the weights within the graph are varied. The output is then approximated to a target output value. To this end, separate test data can be used to periodically verify that the training is effective for arbitrary or different input data.

The figure shows the sequence of the training of a neural network — Training procedure

Theano – Old but Gold

Theano is an open source Python library for machine learning and neural network programming, and compiler for mathematical expression computation. It was released back in 2007 by the Montreal Institute for Learning Algorithms (MILA) at the University of Montreal.
It is particularly suitable for the definition, optimization and evaluation of mathematical expressions involving multidimensional arrays. For this purpose, Theano accesses the NumPy program library for dealing with matrices, large multidimensional arrays and vectors. First, read our article on NumPy. Here we introduce you to this elementary Python library and explain its basic data management.

Mathematical expressions are programmed and symbolized in Theano using a NumPy-like syntax.
The calculation instructions are done in C++ or CUDA code, thus very close to the machine and accordingly very efficient on CPUs or graphics processing units (GPUs).
Theano can also be used, like TensorFlow as a backend for the framework Keras. Keras thus forms an intersection for both technologies.

Graph Structure

Unlike TensorFlow, Theano focuses on supporting symbolic matrix expressions rather than tensors as a basic data type. Although all kinds of Python objects are supported, basic tensor functionality can be used with Theano, but these operations are not as optimized as with TensorFlow.

Theano performs the symbolic mathematical calculations are executed as graphs. These graphs are composed of interconnected Apply, Variable and Op nodes.

TensorFlow vs Theano - Overview structure of a Theano graph — TensorFlow or Theano – Overview structure of a Theano graph

The Op node represents a particular computation on a particular type of input that produces a particular type of output. It thus corresponds to the definition of a computation.

The centrally located Apply node represents the application of an Op to some variables, that is, the application of computations to the current data, and is used to represent a computation graph. Each op is responsible for knowing how to build an Apply node from a list of inputs and thus determines the determines the function and transformation.
An Apply node additionally consists of the input or output fields. The inputs represent the arguments of the function, and the outputs represent the return values of the function.

The Apply nodes then refer to their input and output variables, the main data structure, in the graph via their input and output fields, respectively.
These Variable Nodes are defined by various fields. The variable type, the owner, which can be None or an Apply node of which the variable is an output, the index and the variable name.

TensorFlow or Theano?

All in all, both technologies have their advantages and disadvantages. But both have their raison d’être. Here, too, the data set provides the tools.

In the table below, we have listed all the important points of difference in detail.

TensorFlow vs Theano - This table compares both tools in detail. — TensorFlow or Theano – Comparision

Especially when it comes to tensor processing, as in image processing and sound recognition, TensorFlow with its optimized operations should be the first choice. Another tensor-based alternative to the Google solution is PyTorch from Facebook. In this article we compared these two tools.
Despite its age, Theano is a high-performance and modern alternative for the calculation of matrix expressions.

PyTorch vs TensorFlow – Facebook vs Google – Duel of the Giants

February 6, 2021 / RainerGewalt / 0 Comments

In recent years, the field of data science has been able to access increasingly powerful analysis methods thanks to increasingly high-performance hardware. Google’s Tensorflow has been the benchmark for editing machine learning and modeling deep learning methods. It still has the most freedom today. But a wide range of options often creates a high barrier to entry.

PyTorch vs TensorFlow – With the 2 years younger, also Python-based, open source package PyTorch, Facebook now wants to knock Tensorflow off its throne. It has been steadily gaining popularity for years due to its simplicity and features.
In this article, we will clarify what is in the package and whether it can really compete with Tensorflow.

What is PyTorch?

Pytorch is one of the most popular open source Python packages for scientific computing and neural network development/training.
It was developed by Facebook in 2016 and is based on the Torch library written in Lua. A NumPy-like tensor library that provides rich GPU support to enable accelerated neural network learning. PyTorch is also often referred to as the library of the same name. More about this in the section “Libraries”.
Tensors form the elementary data structures for PyTorch, similar to Tensorflow.

PyTorch vs TensorFlow – Tensors form the basis for both!

The mathematical term tensor corresponds to a generalization of vectors and matrices. It is thus an elementary data structure for data representation and processing. In PyTorch the implementation is done as multidimensional arrays . A vector thus corresponds to a one-dimensional tensor.

PyTorch vs TensorFlow - the figure schematically shows the principle behind tensors. — PyTorch vs TensorFlow – Tensor Principle

More dimensions can be added to a tensor up to infinity. Common types of tensors are 3 dimensional tensors for time series, images are usually 4 dimensional and videos are five dimensional tensors.

PyTorch vs TensorFlow -The figure shows the role of tensors in the training of neural networks in PyTorch. — PyTorch vs TensorFlow – Tensors and neural networks

PyTorch methods manipulate tensors for linear algebra operations. These processes can run at high performance by moving the tensor objects into the graphics card memory.

PyTorch Libraries

Pytorch offers the possibility to include specific libraries. This way the program can be kept lean and only make references to needed code.
The PyTorch library itself is an optimized tensor library for deep learning on both GPUs and CPUs.
By including another library, PyTorch can also compute on TPUs.

Depending on the data type, different libraries can be loaded, which provide optimized methods and pre-modeled prototypes for analysis. Torchaudio offers besides the usual audio transformation methods also data sets for training. With torchtext large language packages can be accessed and with torchvision images can be analyzed.

PyTorch vs TensorFlow -The figure shows all PyTorch Libraries. — PyTorch Libraries

With TorchElastic, training jobs can be managed and elatically distributed, for example, to shared capacities.

PyTorch features

Through accelerated tensor analysis via allocation to GPUs, PyTorch achieves high flexibility and high speed in Deep Learning algorithms. Beyond this, PyTorch offers through its Python base unlimited compatibilities to powerful Python libraries, such as NumPy and SciPy and to the Cython programming language. Here we have collected the most important Python open source data management and analysis libraries.

Reverse-mode auto differentiation allows developers to modify network behavior at will, without delay or overhead. This allows for essential acceleration of research iterations.
The 8-bit quantization model ensures efficient deployment on servers and edge devices, and PyTorch Mobile can be used to develop for Android and iOS environments.
Other features include named tensor, artificial neural network pruning, and parallel training of models with remote procedure call.

PyTorch can access TorchServe, an open source server from Facebook, and is fully compatible with cloud provider Amazon Web Services (AWS). If you don’t know what AWS is, read our article on the subject.

PyTorch offers a hybrid frontend as an additional feature. This offers the possibility to choose between two modes. The Eager and the Graph mode. The eager mode primarily offers usability and flexibility, while the graph mode offers better speed, optimization and functionality in a C++ runtime environment. PyTorch also allows conversion with the Hybrid frontend. This allows models to be developed in eagermode and then transferred to graph mode for production.

PyTorch has unlimited access to ONNX (Open Neural Network Exchange) compatible platforms. ONNX is an open source project jointly developed by Microsoft, Amazon, and Facebook, among others, that enables the exchange of AI models between different tools.

PyTorch vs Tensorflow

Duel of the Giants

Just like the Facebook solution, Tensorflow works with the tensor data type. PyTorch scores with its simplicity and effective memory usage. Tensorflow, on the other hand, is much more scalable and thus better suited for production models. An essential difference was originally that with PyTorch the graph structure is defined during execution, while with Tensorflow it is first defined and then executed. Here, however, Tensorflow has now followed with its own eager mode. However, this is not yet fully developed at this stage.

PyTorch vs TensorFlow - The figure shows the main differences between Google's Tensorflow and Facebook's PyTorch — Tensorflow vs PyTorch

PyTorch vs Tensorflow – Who is ahead now?

It remains an exciting head-to-head race. Despite its recent development history, PyTorch has already made up a lot of ground and is interesting in an entrepreneurial context precisely because of its user-friendliness. As is often the case, however, it is not a question of which solution will come out on top, but rather of the principle that competition stimulates business. In the end, competitive pressure leads to great new innovations and exciting new tools.

Apache Hive Architecture – Data Warehouse System for free

February 5, 2021 / RainerGewalt / 2 Comments

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management. Thus, only one data truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project and part of Apache, an open source and free web server software. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of unstructured data. Hive does not work with the schema-on-write (SoW) approach like relational databases, but uses the so-called schema-on-read (SoR) approach.

What are the biggest advantages of Hive?

Data from relational databases is automatically converted into MapReduce or Tez or Spark jobs. Hadoopclusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

The figure shows the main Apache Hive features — Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL. A declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

The figure shows the basic three-part core architecture of Apache Hive. — Apache Hive Architecture

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. This
receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into abstract syntax and executable tasks, an optimizer that aggregates, splits, and optimizes for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Computer

The metadata is stored here in a persistence. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

Supervised vs Unsupervised vs Reinforcement Learning – The fundamental differences

January 24, 2021 / RainerGewalt / 1 Comment

Supervised vs Unsupervised vs Reinforcement Learning – The three main categories of machine learning. Why these boundaries have been drawn and what they look like will be discussed in this article. The knowledge about this is an elementary part to understand machine learning correctly and to be able to apply it to data in a meaningful way.

This figure contrasts Supervised vs Unsupervised vs Reinforcement Learning. — Supervised vs Unsupervised vs Reinforcement Learning – Overview

Supervised vs Unsupervised vs Reinforcement Learning – Machine Learning Categories

Machine learning is a branch of artificial intelligence. While AI deals with the functioning of artificial intelligence and compares it with the functioning of the human brain, machine learning is a collection of mathematical methods of pattern recognition. If you want to know more about the differences between Machine Learning, AI and Deep Learning, read our article on the subject. IT systems should be given the ability to automatically learn from experience and improve. Algorithms play a central role here. These can be classified into different learning categories.

In the following figures the three main categories of machine learning methods are shown.

This figure shows Supervised vs Unsupervised vs Reinforcement Learning in the machine learning context. — Supervised vs Unsupervised vs Reinforcement Learning – Machine Learning Context

In the meantime, there are many more categories, some of which are hybrids of the individual main categories. One example is semi-supervised learning. This is certainly also a major machine learning topic, but has been left out for the time being for the sake of simplicity.

What is supervised learning?

In supervised learning, the machine learning algorithm iteratively learns the dependencies between data points. The output to be learned is specified in advance and the learning process is supervised by matching the predictions. How the The optimized algorithm is to apply the learned patterns to unknown data to make predictions.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the basic principle of supervised learning. — Supervised vs Unsupervised vs Reinforcement Learning – Supervised Learning

Supervised learning methods can be applied to regression, i.e., prediction, or trend prediction, as well as classification problems.

What is supervised classification?

In classification, abstract classes are formed in order to delimit and order data in a meaningful way. For this purpose, objects are obtained on the basis of certain similar characteristics and structured among each other.

Decision trees can be used as prediction models to create a hierarchical structure, or the feature values can be assigned as class labels and in the form of a vector.

In the following figure the most important supervised classification algorithms are listed.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main algorithms of supervised learning. — Supervised vs Unsupervised vs Reinforcement Learning – Main Algorithms of Supervised Learning.

What is supervised regression?

On the other hand, supervised regression algorithms can be used to make predictions and infer causal relationships between independent and dependent variables.
For example, linear regression can be used to fit the data to a straight line or, conversely, to fit a line to the data object.
We have discussed the exact process of linear regression here in this article.

What is unsupervised learning?

In unsupervised learning, patterns are determined in data without initial patterns and relationships being known.
Especially in complex tasks, these methods can be useful to find solutions that would hardly be solvable by hand. An example is autonomous driving, or large biochemical systems with many interactions.
One key to success is a huge data set. The more data available, the more accurate models can be created.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the basic principle of unsupervised learning. — Supervised vs Unsupervised vs Reinforcement Learning – Unsupervised Learning

In unsupervised machine learning methods, two basic principles, which also classify the algorithms used, can be distinguished. The clustering and the dimensional reduction.

What is unsupervised clustering?

The main goal of unsupervised clustering is to create collections of data elements that are similar to each other, but dissimilar to elements in other clusters. The figure below shows some of the main clustering algorithms.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main algorithms of unsupervised learning. — Supervised vs Unsupervised vs Reinforcement Learning – Main algorithms of unsupervised learning.

The clustering algorithms differ primarily in the cluster creation process, but also in the definition of such clusters. Thus, the relationships between clusters can also be used and hierarchical relationships can be explored.

What is unsupervised dimensional reduction?

With a high number of features, high dimensional relations can be translated low dimensional with these transformation methods. The goal is to keep the loss of information as small as possible.
The reduction methods can be divided into two main categories: Methods from linear algebra and from manifold learning.

Manifold learning is an approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that they can learn the dimensionality of the data without a given classification and project it in a low-dimensional way.
For example, from the field of linear algebra, matrix factorization methods can be used for dimensionality reduction.

What is reinforcement learning?

In reinforcement learning, a program, a so-called agent, should independently develop a strategy to perform actions in an environment. For this purpose, positive or negative reinforcements are conveyed, which describe the interaction interactions of the agent with the environment. In other words, immediate feedback on an executed task. The program should maximize rewards or minimize punishments. The environment is a kind of simulation scenario that the agent has to explore.
The following figure describes the interactions of all components of a reinforcement learning process.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main principle of reinforcement learning. — Supervised vs Unsupervised vs Reinforcement Learning – Main principle of reinforcement learning.

There are two basic types of reinforcement learning.
Namely, whether the environment is model-based or not.
In model-based RL, the agent uses predictions of the environment response during learning or action.
If no model is available, the data is generated by trial and error.

Things you need to know when you start using Apache Spark

January 23, 2021 / RainerGewalt / 1 Comment

Apache Spark Streaming – Every company produces several million pieces of data every day. Properly analyzed, this information can be used to derive valuable business strategies and increase productivity.
Until now, this data was consumed and stored in a persistent. Even today, this is an important step in order to be able to perform analyses on historical data at a later date. Often, however, analysis results are desired in real time. Be it only reference values that have been exceeded.

So-called data streams, i.e. data that is continuously generated from thousands of data sources, can already be consumed before they end up in a persistence, without the flow rate being significantly reduced. It is even possible to train neural networks using such a stream.

In this article, we’ll tell you why you shouldn’t miss out on Apache Spark and Apache Spark Streaming if you’re planning to integrate stream processing in your organization.

What is Apache Spark?

Apache Spark has become one of the most important and performant unified data analytics on the market today. The framework provides a total solution of data processing and AI integration. This allows companies to easily develop performant data pipelines and train AI methods using massive data streams.

Apache Spark combines several partially interdependent components. So can be deployed in a modular fashion to a certain extent.
Spark can run in its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos or on Kubernetes.
The data here can come from streaming sources, such as Kafka, as well as static data sources. So far, the programming languages Java, Scala, Python and R are supported. These are currently the most commonly used languages across all scientific disciplines for implementing data analysis methods.

What does a Spark cluster look like?

Spark applications run as independent sets of processes on a cluster.The coordinator of a Spark program on a cluster is the so-called SparkContext object. This controls the individual Spark applications as they run as independent processes.
The Coordinator then connects to the Central Element, a Cluster Manager, which then allocates resources to the individual applications.
The figure below shows an example of a typical Spark cluster with all its components.

The figure shows an example of a typical Spark cluster with all its components. — Overview Apache Spark Cluster

The actual calculations and data storage then take place on the nodes. These processes, also called executors, then execute tasks and hold the data in memory or disk space. The cache can then be accessed by another node.

Apache sparks underlying technology – The key to high Performance

Spark Core is the underlying unified computing engine on which all Spark functions are built. It enables parallel processing even for large datasets and thus ensures very high-performance processes.
The following figure shows how the Apache Spark Core APIs are composed.

The figure shows how the Apache Spark Core APIs are composed. — Apache Spark Core APIs

The core API consists of low level APIs, where object manipulation via Resilient Distributed Datasets (RDDs) takes place and structured APIs, where all data types are manipulated and batch or streaming jobs take place.

How do the individual Apache Spark APIs work?

In order to properly understand the API structure, its components must be placed in a historical context.

The figure shows the development history of the Apache Spark APIs. — Development history of the Apache Spark APIs

What is the RDD API?

The RDD (Resilient Distributed Dataset) API has been implemented since the first Spark release and is based on the Scala collections API.
RDDs are a set of Java or Scala objects that represent data and thus are the building blocks of Spark. They excel in being compile-time type-safe and inert.

All higher level APIs can be decomposed into RDDs. Various transformations can be performed in parallel using this API. Each of them defines an operation to be executed, which is invoked by calling an action method and creates a new RDD. This then represents the transformed data.

What is the Dataframe API?

The Dataframe API introduces a higher level abstraction. Spark dataframes correspond to the Pandas dataframes structure. They are built on top of RDDs and represent two-dimensional data and a schema. It contains an ordered collection of columns and each different column can consist of different data types. Each value is unique by a row and a column index.

When data is transferred between nodes, only the data is transferred. The metadata is managed in a schema registry separate from spark. This has significantly improved the performance and scalability of Spark.
The API is suitable for creating a relational query plan. Thus, manipulation of data can now be done using a query language.

What is the Dataset API?

When working with dataframes, compile-time type safety is lost. This is a strength of the RDD API. The Dataset-API was created to combine the advantages of both APIs. It is thus the second most important Spark API next to the RDD API.

The basis of this API are integrated encoders, which are responsible for the conversion between JVM objects and the internal Spark SQL representation.

What components does Apache Spark consist of?

Spark is modularly extensible through the use of components. Spark includes libraries for various tasks ranging from SQL to streaming and machine learning. All components are based on the Spark Core, the foundation for parallel and distributed processing of large data sets. How this API looks in detail and what makes it so performant, we will explain later.
The following figure lists the individual Apache Spark components.

In the figure, the ecosystem of Apache Spark is shown with all the major components. — Apache Spark Ecosystem

Apache Spark Spark SQL

With this component RDDs are converted into the so-called data frames, i.e. provided with metadata information.
The whole thing is done by a catalyst optimizer, which executes an execution plan in the form of a tree.

Apache Spark GraphX

This framework can be used to perform high-performance calculations on graphs. These operations can run in parallel.

Apache Spark MLlib/SparkML

With the MLlib component, machine learning pipelines can be constructed very easily. For this purpose, ready-made models and common machine learning algorithms (classification, regression, clustering …) can be used. Thus, data identification, feature extraction and transformation are combined in a unified framework.

Apache Spark Streaming

Apache Spark Streaming enables and controls the processing of data streams. However, Apache Spark Streaming can also process data from static data sources.
In the case of datastreaming, input stream goes from a streaming data source, such as Kafka, Flume or HDFS, into Apache Spark Streaming.
There, it is broken into batches and fed into the Spark engine for parallel processing. The final results can then be output to HDFS databases and dashboards.
The following figure illustrates the principle of Apache Spark Streaming.

All components can consume directly from the stream via Apache Spark Streaming. This component takes a crucial role here. It coordinates the requests via sliding window operations and regulates the data flow. Since all components are based on the Spark Core API, absolute compatibility is guaranteed. Especially in the Big Data area, this can deliver a decisive performance bonus.