
ERP vs MES vs PLM vs ALM – What role will they play in Industry 4.0?

March 14, 2021 / RainerGewalt / 3 Comments

ERP vs MES vs PLM vs ALM – These terms are being mentioned more and more often in connection with Industry 4.0. But what is behind these systems and what are the differences?

Figure: Example of a business process pyramid

To stay competitive in today’s world, you need to increase the efficiency of your business processes. It is important that you optimally plan, control and manage your operational resources (capital, personnel…).

Your goal should be to create high quality and continuity with high productivity and low lead time.
Many of your business processes create ever larger amounts of data and increase in complexity. You need to reduce this complexity and increase your flexibility.
Many software solutions are available to your company for the optimal use of resources.

What is an ERP?

Basically, an Enterprise Resource Planning (ERP) system is an IT-supported system of software solutions that communicate with each other. Your data is stored centrally and should represent your company in its entirety through quickly available information.

Figure: Overview of ERP systems

The information of your business processes is optimized and documented.
The trend is towards web-based applications.

This means that you access the system interface via your browser and that you can also access it beyond the boundaries of your company. Another advantage is that you don’t have to install any services, making you hardware-independent.

What are ERP Subsystems?

You can use ERP systems in all areas of your business. They provide you with complete solutions for all necessary subsystems.

Figure: ERP features

Complex systems are divided into so-called application modules, which you can combine with each other as you wish. These fulfill various tasks for the provision and further processing of information. In this way, you can put together your ERP system according to your requirements and adapt it to the size of your company.

What are the advantages of cloud ERP?

ERPs can also be purchased as a complete Software-as-a-Service (SaaS) solution.

Figure: Advantages of cloud ERP

These are completely industry and hardware independent. You, as a user, can access a sophisticated ERP software package online and thus from anywhere. This gives you full spatial flexibility. However, cloud ERP solutions are still quite new and not yet fully mature, so you should weigh up well in advance whether you want to use a cloud application.

What is an MES?

A Manufacturing Execution System (MES) is the operational, process-related layer of a multi-layer overall system. It is responsible for real-time production management and control. You can use MES data to optimize manufacturing processes and detect errors during the production process.

The MES sits below the ERP system. The ERP system accesses your MES data to plan production and then feeds this plan back to your production control system for implementation.

Figure: Relationship between the company levels

In Industry 4.0, the individual components are becoming more closely integrated.

What does the MES include?

MES is usually a multi-layer overall system. It processes your production data into Key Performance Indicators (KPI) and enforces the fulfillment of an existing production plan.

Figure: Components of an MES

What is a PLM?

In addition to MES and ERP, the Product Lifecycle Management (PLM) system plays a fundamental role in the digitization of your company.

For your company to remain internationally competitive, you need to optimize your business models so that you can act proactively.

As a manufacturing company, you need to be able to analyze large amounts of data quickly. This way you can recognize deviations from the plan early on and make the right decisions.

Many software solutions help you in all business areas and even exchange data with each other. In this way, you can create information chains within a company and act more quickly.

PLM is a management approach for the seamless integration of all information that accumulates during the life cycle of a product.
Its core components are the data and information related to the product life cycle.

Figure: Production life cycle process

A large amount of product-related and time-dependent data is generated along the product life cycle. The PLM enterprise concept is based on coordinated methods, processes and organizational structures and usually makes use of IT systems. PLM tools link design, implementation and production and provide feedback from manufacturing.

Figure: Main application areas of PLM

The goal of a PLM system is the central management of information and corresponding user groups. One advantage here is that you can control the process of editing and distribution throughout the company.

Application Lifecycle Management (ALM) vs PLM System

More and more products and systems now contain a software component. However, since hardware and software have historically been developed differently, you must also differentiate between their management systems.

Figure: Major differences between ALM and PLM

With PLM you are looking at a physical product, with ALM you are looking at a software product. Basically, however, there are similarities between the two systems. Both also track a product over its entire lifecycle. However, since both product types are increasingly merging today, you can also link both systems on an IT basis at the overall product level.

ERP vs MES vs PLM vs ALM – What does the future hold?

When people talk about Industry 4.0, they are referring to a new level of technological progress. The basis of this innovation is the Internet of Things (IoT). The software solutions of various company levels are networked to form cyber-physical systems and exchange information with each other in real time. In this way, production planning can take place in management and be implemented directly in production. As production becomes more complex in the future, mastering this complexity and the technologies behind it will require the necessary know-how.

The software solutions presented here are systems optimized for business areas. Each software system is therefore an expert in its own field. This ensures a decisive modularity for a company’s overall solution. On the other hand, this modularity always leads to increased complexity. In the future, it will become increasingly important to create reciprocal data pipelines, so-called data streams, between the individual systems, which currently still operate very autonomously.

Figure: ERP, MES, PLM and ALM and their role in Industry 4.0

A decision made at the management level should be implemented in production and at the same time remain controllable at all levels. Optimally, the system should be able to make its own analyses. AI algorithms can help here to find sensible decisions despite increasing complexity. This allows you to optimize your individual production steps and shorten life cycles.

Figure: The role of the MES in Industry 4.0

The MES, for example, plays an important role here due to its proximity to production. This allows you to make important decisions quickly and implement production plans. In your company of the future, software solutions from various divisions are networked with each other. This lets you form information chains, and the MES is part of this network.

NumPy vs Pandas – Which is used when?

March 13, 2021 / RainerGewalt / 3 Comments

NumPy vs Pandas – Every branch of science and business now accumulates ever larger amounts of data that must be analyzed and managed efficiently, so learning a programming language has become indispensable across disciplines.

For many, Python is the first programming language in the classical sense, thanks to its beginner friendliness and mathematical focus. Its modular ecosystem of powerful mathematical libraries gives you access to ready-made, optimized computational tools.

Figure: Popular Python libraries and their place in the Python ecosystem

However, this offer can also quickly become overwhelming. Which library or framework is suitable for my purposes? Will I save myself work with this tool, or will I reach its limits? In other articles, we cover SciPy and why you should definitely prefer it over MATLAB, and compare the two Python visualization libraries matplotlib and seaborn. These Python libraries are fully compatible with each other and together they make a very interesting data science toolkit. NumPy and Pandas are perhaps two of the best-known Python libraries. But what are the differences between them? We will get to the bottom of this question in this article.

What actually is NumPy?

NumPy stands for “Numerical Python” and is an open source Python library for array-based calculations. It was first released in 1995 as Numeric, making it the first implementation of a Python matrix package, and rereleased as NumPy in 2006. This library is intended to allow easy handling of vectors, matrices, or large multidimensional arrays in general.

Figure: NumPy's major applications

For performance reasons, its core is written in C, a low-level, machine-oriented programming language. NumPy is compatible with a wide variety of Python libraries, some of which are built on NumPy and add further useful functions to it, such as minimization, regression and Fourier transforms.

Python and Science

As mentioned earlier, Python is one of the programming languages most intensively used for data processing and analysis in scientific research across all disciplines. What is very interesting here is that the solution approaches are similar across disciplines at the data level. Thus, an exchange of ideas has become indispensable and leads more and more to a fusion of the sciences.

This is only mentioned in passing, but should also emphasize the importance of this programming language and its libraries, which are so often open source and further developed by a community.

Figure: Scientific computing with NumPy across disciplines

NumPy was developed specifically for scientific calculations and forms the basis for many specific frameworks and libraries.

The elementary NumPy data structure

The core functionality of NumPy is based on the “ndarray” data structure.

Figure: NumPy's fundamental data structure

Such an array can only hold elements of the same data type and always consists of a pointer to a contiguous memory area together with the metadata describing the data stored in it. This allows processes to access them very efficiently and manipulate them as desired.
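A minimal sketch of what this looks like in practice, assuming NumPy is installed. The variable names are only illustrative. It inspects the metadata that an ndarray carries alongside its memory block and shows that mixed input values are converted to one common data type.

```python
import numpy as np

# A 2 x 3 array of 32-bit integers stored in one contiguous memory block.
a = np.arange(6, dtype=np.int32).reshape(2, 3)

print(a.dtype)                  # int32: every element has the same type
print(a.shape)                  # (2, 3): number of rows and columns
print(a.strides)                # (12, 4): bytes to step to the next row / next column
print(a.flags["C_CONTIGUOUS"])  # True: rows lie one after another in memory

# Mixed input is converted to one common type instead of being stored as-is.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)              # float64
```

The strides follow directly from the layout: each int32 element takes 4 bytes, and a row of three elements takes 12 bytes.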

Figure: How NumPy's fundamental data structure can be manipulated

Thus, the shape can be changed via so-called reshaping, smaller subarrays can be created within a given larger array, arrays can be split, or merged.
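The following short sketch, again assuming NumPy is installed, walks through these four manipulations on a small array. The variable names are chosen for illustration only.

```python
import numpy as np

a = np.arange(12)               # one-dimensional array: 0, 1, ..., 11

m = a.reshape(3, 4)             # reshaping: the same data as 3 rows x 4 columns
sub = m[1:, :2]                 # subarray: rows 1-2, columns 0-1
left, right = np.hsplit(m, 2)   # splitting: two arrays of shape (3, 2)
merged = np.concatenate([left, right], axis=1)  # merging: back to shape (3, 4)

print(sub.tolist())                 # [[4, 5], [8, 9]]
print(np.array_equal(merged, m))    # True
print(np.shares_memory(sub, a))     # True: the subarray is a view, not a copy
```

Note that slicing returns a view on the original memory, so writing to `sub` would also change `a`. Call `.copy()` if you need an independent array.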

What is Pandas?

Pandas is an open source library for data analysis and manipulation in Python. Wes McKinney started its development in 2008, and it is written in Python, Cython and C. Pandas is used in almost all areas and industries worldwide.

Figure: Pandas' major applications

The name Pandas is derived from Panel Data.
Its strength lies in the processing and analysis of tabular data and time series.

Figure: Pandas' major features

Especially in the pre-processing of data, pandas offers a lot of operations. In addition to high-performance filter functions, very large data volumes with over 500 thousand rows can be transformed, manipulated, aggregated and cleaned.

Pandas fundamental data structures

As a basis for the individual functions and tools that Pandas provides, the library defines its own data objects. These objects can be one, two, or even three-dimensional.

The one-dimensional Series object corresponds to a data structure with two arrays: one array serves as the index and one array holds the actual data. In contrast to a typical numeric NumPy ndarray, a Series can also hold values of different types, which are then stored with the generic object data type.

The two-dimensional DataFrame object contains an ordered collection of columns. Each column can have a different data type, and each value is uniquely identified by a row index and a column index.
The eponymous Panel object was a three-dimensional dataset consisting of DataFrames. Its major axis corresponded to the index rows of each DataFrame and its minor axis to the columns of each DataFrame. Panel was deprecated and then removed in pandas 0.25, so current versions only provide Series and DataFrame.
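As a small illustration of the Series and DataFrame objects, here is a sketch that assumes pandas is installed. The data, labels and variable names are invented for this example.

```python
import pandas as pd

# A Series: one array of index labels plus one array of values.
temperatures = pd.Series([20.5, 21.0, 19.8], index=["mon", "tue", "wed"])
print(temperatures["tue"])      # 21.0, looked up by label

# A DataFrame: an ordered collection of columns that share one row index.
production = pd.DataFrame(
    {"machine": ["A", "B", "A"], "units": [120, 95, 130]},
    index=["mon", "tue", "wed"],
)

# Each column keeps its own data type (text vs. integers here).
print(production.dtypes)

# A single value is addressed by its row label and column label.
tuesday_units = production.loc["tue", "units"]
print(tuesday_units)            # 95
```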

NumPy vs Pandas – Conclusion

Both libraries have their similarities, which stem from the fact that Pandas is based on NumPy. But is it an either-or question? No, clearly not. Pandas adds so many features of its own that there is a clear justification for their parallel existence. They simply serve different purposes and are best used together.

One of the main differences between the two open source libraries is the data structure used. Pandas allows analysis and manipulation on a tabular form while NumPy works mainly with numerical data in arrays whose objects can have up to n dimensions. These data forms are easily convertible among themselves via an interface.
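Converting between the two forms takes one call in each direction. The sketch below assumes NumPy and pandas are installed. The column names are made up for the example.

```python
import numpy as np
import pandas as pd

# NumPy -> Pandas: wrap an array in a DataFrame and give the columns names.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
df = pd.DataFrame(arr, columns=["x", "y"])

# Pandas -> NumPy: get the values back as a plain ndarray.
back = df.to_numpy()

print(np.array_equal(back, arr))   # True
```

This round trip is the typical hand-over point: you clean and shape the data in Pandas, then pass the resulting array to a numerical or machine learning library.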

Pandas tends to be the better choice for very large data sets (500K rows and more). Data preprocessing and reading from external data sources are easier to perform with Pandas, and the result can then be passed as a NumPy array to complex machine learning or deep learning algorithms. If you want to know more about machine learning methods and their fields of application, take a look at this article from us.

Is Hadoop dead? Should I invest time to learn the Hadoop ecosystem?

February 28, 2021 / RainerGewalt / 5 Comments

Is Hadoop dead – In the IT sector in particular, technologies and software architectures do not have a long shelf life. As new technical insights are gained, the requirements and use cases for the systems also change. As young as the term “big data” is, it is also undergoing constant change. The increased acceptance of open source projects in the business community has led to increased diversification and thus to many mutually beneficial competitive situations.
For over a decade, Apache Hadoop has been considered the one all-purpose solution: a Big Data ecosystem in which Hadoop works together with many other extensions. In recent years, however, more and more people claim that the demands on data processing have changed and see Hadoop as an outdated concept.

A few years ago, the primary goal was to efficiently handle ever-increasing data volumes, but today iterative real-time analyses on dynamic data sets are required. Data management systems must not be self-contained, but must remain manipulable and monitorable at all times.
So is Hadoop dead, or still indispensable?

What is Hadoop?

Hadoop is an open source Big Data framework for scalable, distributed software. It is written in Java and typically runs on Linux. It is originally based on Google’s MapReduce algorithm and enables computationally intensive processing of large data sets by parallelizing it across computer clusters, i.e. a large number of networked computers, using multiple components that work together.

Figure: The Hadoop ecosystem

The Hadoop ecosystem is built around Hadoop Common, an interface for all other components. It connects Hadoop to the file system of the computers and contains the libraries. The Hadoop Distributed File System (HDFS) stores very large amounts of data. It is organized as a server cluster with master and slave nodes. The resources are controlled via the Yet Another Resource Negotiator (YARN) component. This resource manager distributes the individual tasks to the available resources, such as CPU and memory.

What is the MapReduce algorithm?

Google’s MapReduce programming model, even though it is currently being replaced by engines based on directed acyclic graphs (DAGs), is still a core component of the Hadoop framework. So if we want to understand how Hadoop works, we first need to understand what MapReduce is.

Figure: The principle behind Google's MapReduce algorithm

The Hadoop MapReduce framework provides configurable classes for the Map, Combine and Reduce phases. Map means that a set of data is transformed into another set of data, in which the individual elements are broken down into tuples (key/value pairs). In the Reduce phase, the tuples produced by the Map phase are then combined into a smaller set of tuples.
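To make the two phases concrete, here is a minimal single-machine sketch of the classic word count in plain Python. It is not Hadoop's API. The function names are hypothetical, and the grouping step between Map and Reduce (often called shuffle) is done with an in-memory dictionary instead of across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line down into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all tuples of one key into a single tuple."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["Hadoop is not dead", "hadoop is evolving"]))
# {'hadoop': 2, 'is': 2, 'not': 1, 'dead': 1, 'evolving': 1}
```

In a real cluster, each call to the map function can run on a different node because the lines are independent of each other. That independence is what makes the model easy to parallelize.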

How a Hadoop cluster works

As mentioned earlier, Hadoop distributes the storage and processing of large amounts of data in a balanced manner across compute clusters, i.e. interconnected hardware.
These computers are connected to a dedicated server that acts as the master component. The master node organizes the storage of files and metadata on the individual slave nodes. Within a cluster, data is stored on multiple computers called nodes. The files are partitioned into data blocks and distributed redundantly among the nodes.
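The idea of partitioning a file into blocks and storing each block on several nodes can be sketched in a few lines of plain Python. This is a toy model under loud assumptions. The function names are hypothetical, the block size is tiny, and the simple round-robin placement is a stand-in for illustration only, not the placement policy HDFS actually uses.

```python
def split_into_blocks(data, block_size):
    """Partition a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes (round-robin, toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)   # [b'abcd', b'efgh', b'ij']

placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
print(placement)
# {0: ['n1', 'n2'], 1: ['n2', 'n3'], 2: ['n3', 'n4']}
```

Because every block lives on more than one node, the loss of a single node does not lose any data. In this example, each block can still be read from its second replica if `n2` fails.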

Figure: Components of a Hadoop cluster

The NameNode and the ResourceManager run on the master node. They manage the data stored in the Hadoop Distributed File System (HDFS) and coordinate the parallel computations that MapReduce applies to it.

The client nodes are responsible for loading the data into the cluster. The slave nodes are responsible for collecting the data from the client nodes.

How does communication within a cluster work?

The internal communication, i.e. the process of job execution, is organized via so-called JobTrackers and TaskTrackers.
The client submits a MapReduce job to the JobTracker on the master to process a particular file. The JobTracker then determines the DataNodes that store the blocks for that file by querying the NameNode. The NameNode manages the HDFS file system metadata, so it keeps track of all the files that are divided into blocks. The DataNodes store and retrieve these blocks. Tasks are then assigned to different TaskTrackers based on the information received from the NameNode. In the process, the status of each task, as well as of the NameNode and the DataNodes, is monitored.
A Secondary NameNode communicates with the NameNode at periodic intervals to take a snapshot of the HDFS metadata, in other words a backup. This information can then be used in the event of a NameNode failure.

In principle, both single-node clusters and multi-node clusters can be implemented with Hadoop. In the case of a single node, the cluster is implemented on one machine only. All processes then run on a Java virtual machine instance.
In the case of multi-nodes, the master slave architecture already discussed is then implemented over several computers.

Is Hadoop dead?

So is Hadoop dead? Apache Hadoop has clearly lost its status as the sole Big Data solution. Many technologies have been added that solve smaller tasks better than the one big solution Hadoop. Today, this small-scale approach enables Big Data management solutions that can be optimally tailored to specific use cases. However, Hadoop is not dead either. The system still has its strengths and will remain the first choice for special use cases in the foreseeable future.

So how is Hadoop evolving?

With the Hadoop Ozone project, an alternative to the Hadoop Distributed File System (HDFS) has now been developed.
It still has to be deployed on a cluster, but it acts as an object store for Big Data applications. It is much more scalable than standard file systems and is intended to optimize the handling of small files, a long-standing Hadoop weakness. Object stores are typically used as a data storage method in the cloud. With Ozone, they can now be managed locally.
This object store can be accessed by established Big Data solutions such as Hive or Spark without modification. If you want to know more about the Hadoop-compatible frameworks, read our articles on Hive and Spark.

Ozone is built on a block storage layer called Hadoop Distributed Data Store (HDDS) and is designed to scale to billions of objects. The blocks are organized internally using unique namespaces in many independent volumes.
However, one disadvantage of these local object stores is that they are not yet implemented in the core, but must be separated from the traditional file systems by containerized environments such as Kubernetes and YARN. So there are always two truths.

Apache Hive Architecture – Data Warehouse System for free

February 5, 2021 / RainerGewalt / 2 Comments

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management, so only a single source of truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive?

Hive is a data warehousing software project of the Apache Software Foundation, which hosts a large number of free, open source software projects. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

The query language used by Hive, called HiveQL, is SQL-based and allows querying, aggregation and analysis of unstructured data. Unlike relational databases, Hive does not use the schema-on-write (SoW) approach, in which data must match the schema when it is written, but the so-called schema-on-read (SoR) approach, in which the schema is only applied when the data is read.
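The difference is easiest to see in a small sketch. The following plain Python example is not Hive code. The data, the function name and the schema format are invented for illustration. The raw text is stored untouched, and a schema is only applied at the moment of reading. Values that do not fit the declared type are returned as `None` here, as a simplified stand-in for a NULL value.

```python
import csv
import io

# Raw data is stored as-is; nothing is validated at write time.
RAW = "1,drill,19.99\n2,saw,not_available\n"


def read_with_schema(raw, schema):
    """Apply a list of (column name, type) pairs while reading the raw text."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None   # value does not match the schema
        rows.append(row)
    return rows


schema = [("id", int), ("name", str), ("price", float)]
print(read_with_schema(RAW, schema))
# [{'id': 1, 'name': 'drill', 'price': 19.99}, {'id': 2, 'name': 'saw', 'price': None}]
```

The same raw data can later be read with a different schema, for example with the price treated as text, without rewriting anything that is stored.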

What are the biggest advantages of Hive?

HiveQL queries are automatically converted into MapReduce, Tez or Spark jobs. Hadoop clusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and Apache Spark makes it possible to build powerful stream-based data analysis pipelines. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

Figure: Main Apache Hive features

Another advantage of Hive is that its tables are similar to the tables in a relational database. Data is queried using HiveQL, a declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

Figure: The basic three-part core architecture of Apache Hive

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control unit of the Hive services is the so-called driver. It receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into an abstract syntax tree and executable tasks, an optimizer that aggregates, splits and optimizes these tasks for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes the tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Computing

The metadata is stored here in a persistent store. The results of queries and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

Things you need to know when you start using Apache Spark

January 23, 2021 / RainerGewalt / 1 Comment

Apache Spark Streaming – Many companies produce several million data points every day. Properly analyzed, this information can be used to derive valuable business strategies and increase productivity.
Until now, this data was consumed and written to a persistent store. Even today, this is an important step in order to be able to perform analyses on historical data at a later date. Often, however, analysis results are desired in real time, even if only to detect reference values that have been exceeded.

So-called data streams, i.e. data that is continuously generated from thousands of data sources, can already be consumed before they end up in a persistent store, without the flow rate being significantly reduced. It is even possible to train neural networks using such a stream.

In this article, we’ll tell you why you shouldn’t miss out on Apache Spark and Apache Spark Streaming if you’re planning to integrate stream processing in your organization.

What is Apache Spark?

Apache Spark has become one of the most important and performant unified data analytics engines on the market today. The framework provides a complete solution for data processing and AI integration. This allows companies to easily develop performant data pipelines and train AI methods using massive data streams.

Apache Spark combines several partially interdependent components, so it can be deployed in a modular fashion to a certain extent.
Spark can run in its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos or on Kubernetes.
The data here can come from streaming sources, such as Kafka, as well as static data sources. So far, the programming languages Java, Scala, Python and R are supported. These are currently the most commonly used languages across all scientific disciplines for implementing data analysis methods.

What does a Spark cluster look like?

Spark applications run as independent sets of processes on a cluster. The coordinator of a Spark program on a cluster is the so-called SparkContext object. It controls the individual Spark applications as they run as independent processes.
The coordinator then connects to the central element, a cluster manager, which allocates resources to the individual applications.
The figure below shows an example of a typical Spark cluster with all its components.

Figure: Overview of an Apache Spark cluster

The actual calculations and data storage then take place on the nodes. These processes, also called executors, then execute tasks and hold the data in memory or disk space. The cache can then be accessed by another node.

Apache Spark's underlying technology – the key to high performance

Spark Core is the underlying unified computing engine on which all Spark functions are built. It enables parallel processing even for large datasets and thus ensures very high-performance processes.
The following figure shows how the Apache Spark Core APIs are composed.

Figure: Composition of the Apache Spark Core APIs

The core API consists of low level APIs, where object manipulation via Resilient Distributed Datasets (RDDs) takes place and structured APIs, where all data types are manipulated and batch or streaming jobs take place.

How do the individual Apache Spark APIs work?

In order to properly understand the API structure, its components must be placed in a historical context.

Figure: Development history of the Apache Spark APIs

What is the RDD API?

The RDD (Resilient Distributed Dataset) API has been implemented since the first Spark release and is based on the Scala collections API.
RDDs are a set of Java or Scala objects that represent data and are thus the building blocks of Spark. Their strengths are compile-time type safety and lazy evaluation.

All higher level APIs can be decomposed into RDDs. Various transformations can be performed in parallel using this API. Each transformation defines an operation and returns a new RDD that represents the transformed data. The operations are only executed once an action method is called.
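The interplay of transformations and actions can be imitated in a few lines of plain Python. The `ToyRDD` class below is a hypothetical stand-in written for this article, not Spark's API, and it runs on a single machine. It only shows the principle: transformations are recorded and return a new object, and nothing is computed until an action such as `collect` is called.

```python
class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD (illustration only)."""

    def __init__(self, source, operations=()):
        self._source = source
        self._operations = tuple(operations)

    def map(self, function):
        # Transformation: record the step and return a NEW ToyRDD.
        return ToyRDD(self._source, self._operations + (("map", function),))

    def filter(self, predicate):
        # Transformation: record the step and return a NEW ToyRDD.
        return ToyRDD(self._source, self._operations + (("filter", predicate),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        results = []
        for item in self._source:
            keep = True
            for kind, function in self._operations:
                if kind == "map":
                    item = function(item)
                elif not function(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


numbers = ToyRDD([1, 2, 3, 4])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())   # [4, 16]
print(numbers.collect())        # [1, 2, 3, 4]: the original is unchanged
```

Deferring the work until an action is requested gives the engine the chance to look at the whole chain of steps first and to distribute it sensibly across the cluster.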

What is the Dataframe API?

The Dataframe API introduces a higher level abstraction. Spark dataframes are conceptually similar to Pandas dataframes. They are built on top of RDDs and represent two-dimensional data together with a schema. A dataframe contains an ordered collection of columns, and each column can have a different data type. Unlike a Pandas dataframe, however, a Spark dataframe has no row index.

When data is transferred between nodes, only the data itself is transferred. The metadata is managed separately as a schema. This has significantly improved the performance and scalability of Spark.
The API is suitable for creating a relational query plan. Thus, manipulation of data can now be done using a query language.

What is the Dataset API?

When working with dataframes, compile-time type safety is lost, which is a strength of the RDD API. The Dataset API was created to combine the advantages of both APIs. It is thus the second most important Spark API next to the RDD API.

The basis of this API are integrated encoders, which are responsible for the conversion between JVM objects and the internal Spark SQL representation.

What components does Apache Spark consist of?

Spark is modularly extensible through the use of components. Spark includes libraries for various tasks ranging from SQL to streaming and machine learning. All components are based on the Spark Core, the foundation for parallel and distributed processing of large data sets. How this API looks in detail and what makes it so performant is explained above.
The following figure lists the individual Apache Spark components.

Figure: The Apache Spark ecosystem with its major components

Apache Spark SQL

With this component, RDDs are converted into so-called dataframes, i.e. provided with metadata information.
This is done by the Catalyst optimizer, which represents the execution plan as a tree.

Apache Spark GraphX

This framework can be used to perform high-performance calculations on graphs. These operations can run in parallel.

Apache Spark MLlib/SparkML

With the MLlib component, machine learning pipelines can be constructed very easily. For this purpose, ready-made models and common machine learning algorithms (classification, regression, clustering …) can be used. Thus, data identification, feature extraction and transformation are combined in a unified framework.

Apache Spark Streaming

Apache Spark Streaming enables and controls the processing of data streams. However, it can also process data from static data sources.
In the case of data streaming, the input stream flows from a streaming data source, such as Kafka, Flume or HDFS, into Apache Spark Streaming.
There, it is broken into batches and fed into the Spark engine for parallel processing. The final results can then be output to HDFS, databases and dashboards.
The following figure illustrates the principle of Apache Spark Streaming.

All components can consume directly from the stream via Apache Spark Streaming. This component plays a crucial role here. It coordinates the requests via sliding window operations and regulates the data flow. Since all components are based on the Spark Core API, compatibility between them is guaranteed. Especially in the Big Data area, this can deliver a decisive performance bonus.
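The two ideas mentioned above, breaking a stream into batches and evaluating sliding windows over them, can be sketched in plain Python. This is not Spark Streaming code: the function names, the batch size and the window parameters are assumptions chosen for illustration, and a simple `range` stands in for a real streaming source.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut an (in principle endless) iterator into lists of at most batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def sliding_windows(batches, window_length, slide):
    """Group the last `window_length` batches, emitting a window every `slide` batches."""
    recent = []
    for i, batch in enumerate(batches, start=1):
        recent.append(batch)
        recent = recent[-window_length:]
        if i % slide == 0:
            # Flatten the batches that currently fall into the window.
            yield [item for b in recent for item in b]


events = range(1, 11)   # hypothetical stand-in for a stream source
batches = list(micro_batches(events, 2))
window_sums = [sum(w) for w in sliding_windows(batches, window_length=3, slide=2)]

print(batches)       # [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(window_sums)   # [10, 33]
```

The first window only contains two batches because the stream has just started; the second one covers the three most recent batches.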

Messaging Patterns – It is not enough to decide to use a message

January 17, 2021 / RainerGewalt / 1 Comment

Messaging Patterns – What are they? What are their strengths and why should they only be used with caution? We clarify these questions in this article.

What are Design Patterns?

Technology-independent designs can provide proven pattern solutions in software development, ensuring standardized and robust architecture.
If you’ve never heard of software design patterns, check out this article from us on the subject first.

Design patterns allow a developer to draw on the experience of others. They offer proven solutions for recurring tasks. A one-to-one implementation is not advisable. The patterns should rather be used as a guide.

What is a message?

A basic design pattern is the message. It is a term that everyone uses as a matter of course, but what is behind it?

Data is packaged in messages and then transmitted from the sender to the receiver via a message channel. The following figure shows such a messaging system.

Messaging Patterns - This scheme shows the basic concept of a message — Messaging Patterns – Basic Concept of a Message

The communication is asynchronous, which means that both applications are decoupled from each other and therefore do not have to run simultaneously. The sender must build and send the message, while the receiver must read and unpack it.
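A minimal sketch of this principle, using only the Python standard library: an in-memory `queue.Queue` plays the role of the message channel, and JSON is used as the message format. Both choices, as well as the `None` sentinel that stops the receiver, are assumptions made for this illustration.

```python
import json
import queue
import threading

channel = queue.Queue()   # hypothetical in-memory message channel
received = []


def send(payload):
    """Sender: package the data into a message and put it on the channel."""
    message = json.dumps({"body": payload})
    channel.put(message)


def receiver():
    """Receiver: read messages from the channel and unpack them."""
    while True:
        message = channel.get()
        if message is None:   # sentinel used here to stop the receiver
            break
        received.append(json.loads(message)["body"])


# The sender can finish its work before the receiver even starts.
send({"order_id": 1})
send({"order_id": 2})
channel.put(None)

worker = threading.Thread(target=receiver)
worker.start()
worker.join()

print(received)   # [{'order_id': 1}, {'order_id': 2}]
```

Because the channel buffers the messages, sender and receiver never call each other directly and do not have to be active at the same time.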

What are Messaging Patterns?

However, this form of message transmission is only one way of transferring information. The following figure shows the basic concepts of messaging design patterns.

This diagram shows all the basic components of the messaging design patterns — Basic Components of the Messaging Patterns

What is Message Construction?

It is not enough to decide to use a message. A message can be constructed according to different architectural patterns, depending on the functions to be performed.

The following figure shows some of these patterns.

Messaging Design Patterns - This diagram shows the different Patterns of message construction. — Messaging Design Patterns – Message Construction Patterns

Message Construction – When do I use it?

Messaging can be used not only to send data between a sender and receiver, but also to call a procedure or request a response in another application.

With the right message architecture a certain flexibility can be guaranteed. This makes the message much more robust against possible future changes.
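The three uses named above (sending data, calling a procedure, requesting a response) can be sketched with one small message type. The field names `kind`, `reply_to` and `correlation_id`, as well as the `add` procedure, are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Any, Optional

_ids = count(1)   # simple id generator for this sketch


@dataclass
class Message:
    kind: str                              # "document", "command" or "reply"
    body: Any
    message_id: int = field(default_factory=lambda: next(_ids))
    reply_to: Optional[str] = None         # channel name for the answer
    correlation_id: Optional[int] = None   # id of the request a reply belongs to


def handle(message):
    """Receiver side: run the requested procedure and build a correlated reply."""
    if message.kind == "command" and message.body["procedure"] == "add":
        result = sum(message.body["args"])
        return Message("reply", result, correlation_id=message.message_id)
    return None   # plain document messages need no answer


document = Message("document", {"temperature": 21.5})
command = Message("command", {"procedure": "add", "args": [2, 3]}, reply_to="replies")
reply = handle(command)

print(reply.body, reply.correlation_id == command.message_id)   # 5 True
```

Keeping the correlation id in the message itself lets the requester match a reply to its request even if several requests are in flight.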

What is Message Routing?

A message router connects the message channels in a messaging system. We will come back to this topic later. This router corresponds to a filter, which regulates the message forwarding, but does not change the message. A message is only forwarded to another channel if all predefined conditions are met.

The following figure lists some specific message router types.

Messaging Design Patterns - This diagram shows the different patterns of message routing. — Messaging Patterns – Message Routing Patterns

When do I use message routing and how?

For example, messages can be forwarded to dynamically defined recipients, or message parts can be processed or combined in a differentiated manner.
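As a rough illustration of a router that forwards messages based on their content without changing them, consider the following sketch. The channel names and rules are made up for this example, and messages that match no rule are simply not forwarded.

```python
from collections import defaultdict

channels = defaultdict(list)   # hypothetical named output channels


def route(message, rules):
    """Forward the unchanged message to the first channel whose condition matches."""
    for condition, channel_name in rules:
        if condition(message):
            channels[channel_name].append(message)
            return channel_name
    return None   # no condition met: the message is not forwarded


rules = [
    (lambda m: m["type"] == "order" and m["amount"] >= 1000, "large_orders"),
    (lambda m: m["type"] == "order", "orders"),
    (lambda m: m["type"] == "invoice", "invoices"),
]

targets = [route(m, rules) for m in (
    {"type": "order", "amount": 50},
    {"type": "order", "amount": 5000},
    {"type": "invoice", "amount": 50},
    {"type": "unknown", "amount": 0},
)]

print(targets)   # ['orders', 'large_orders', 'invoices', None]
```

The order of the rules matters here: the more specific condition for large orders has to come before the general order rule.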

What are Messaging Channels?

In a messaging system, the exchange of information does not just happen unregulated. The sender transfers the message to a so-called messaging channel and the receiver reads from a specific message channel.
In this way, the sender and receiver are decoupled. However, by selecting a specific messaging channel, the sender can determine which application receives the data without knowing that application itself.

However, the right choice of message channel depends on your architecture. Which channel should be addressed and when?

The following figure lists some such channel types.

Messaging Design Patterns - This diagram shows the different patterns of message channels. — Messaging Patterns – Message Channel Patterns

What are the basic differences between the channel types?

Basically, the channel types can be divided into two main types.

A distinction can be made between a point-to-point channel, i.e. one sender and exactly one receiver, and a publish-subscribe channel, one sender and several receivers.
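The difference between the two channel types can be sketched in a few lines of Python. The class names follow the pattern names, but the implementation is only a simplified in-memory illustration, not the API of any particular messaging product.

```python
from collections import deque


class PointToPointChannel:
    """Each message is consumed by exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whoever asks first gets the message; it is then gone from the channel.
        return self._messages.popleft() if self._messages else None


class PublishSubscribeChannel:
    """Each message is delivered to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def send(self, message):
        for callback in self._subscribers:
            callback(message)


p2p = PointToPointChannel()
p2p.send("job-1")
first, second = p2p.receive(), p2p.receive()

inbox_a, inbox_b = [], []
pubsub = PublishSubscribeChannel()
pubsub.subscribe(inbox_a.append)
pubsub.subscribe(inbox_b.append)
pubsub.send("price-update")

print(first, second)      # job-1 None
print(inbox_a, inbox_b)   # ['price-update'] ['price-update']
```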

What is a Messaging Endpoint?

In order for a sender or receiver application to connect to the messaging channel, an intermediary must be used. This client is called a messaging endpoint.

The following figure shows the principle of communication via messaging endpoints.

Messaging Design Patterns - This diagram shows the basic principle of a message endpoint — Basic principle of a Message Endpoint

On the sender side, the endpoint accepts the data to be sent, builds a message from it and sends it via a specific message channel. On the receiver side, this message is also received via an endpoint and unpacked again. An application can use several endpoints here. However, an endpoint can only implement one alternative.

The following figure lists some endpoint types.

Messaging Design Patterns - This diagram shows the different patterns of message endpoints. — Messaging Design Patterns – Message Endpoint Patterns

When do I choose which endpoint?

Receiving messages in particular can become difficult and lead to server overload. Therefore, control and possible throttling of the processing of client requests is crucial. A proven means is, for example, the formation of processing queues or a dynamic adjustment of consumers, depending on the volume of requests.
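A processing queue with a fixed capacity and a configurable number of consumers is one simple way to sketch this kind of throttling. The queue size, the number of consumers and the doubling "work" step below are arbitrary assumptions for the example.

```python
import queue
import threading

requests = queue.Queue(maxsize=5)   # bounded queue: producers block when it is full
processed = []
lock = threading.Lock()


def consumer():
    while True:
        item = requests.get()
        if item is None:             # sentinel: no more work for this consumer
            break
        with lock:
            processed.append(item * 2)   # placeholder for real processing


def run(num_consumers, items):
    workers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for w in workers:
        w.start()
    for item in items:
        requests.put(item)           # blocks while 5 requests are already waiting
    for _ in workers:
        requests.put(None)           # one sentinel per consumer
    for w in workers:
        w.join()


run(num_consumers=3, items=range(10))
print(sorted(processed))   # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The bounded queue slows the producer down when the consumers cannot keep up, and `num_consumers` is the knob you would adjust depending on the volume of requests.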

What is Message Transformation?

If the data format has to be changed when data is exchanged between two applications, a so-called message transformation ensures that the message channel is formally decoupled.

This translation process can be understood as two systems running in parallel. The actual message data is separated from the metadata.
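A very small translator can illustrate this. In the sketch below, the message is assumed to consist of `headers` (metadata) and a `body` (the actual data); only the body is translated into the format the receiver expects. The field names and the mapping are invented for the example.

```python
def translate(message, field_map):
    """Rename payload fields for the receiver; the metadata (headers) stays untouched."""
    body = {field_map.get(key, key): value for key, value in message["body"].items()}
    return {"headers": dict(message["headers"]), "body": body}


# Hypothetical formats: the sender uses German field names, the receiver English ones.
sender_message = {
    "headers": {"message_id": 42, "sent_by": "erp"},
    "body": {"kunde": "ACME", "menge": 3},
}
receiver_message = translate(sender_message, {"kunde": "customer", "menge": "quantity"})

print(receiver_message["body"])   # {'customer': 'ACME', 'quantity': 3}
```

Neither application has to know the other's format; only the translator in between knows both.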

The following figure shows some message transformation types.

Messaging Design Patterns - This diagram shows the different patterns of message transformation. — Messaging Design Patterns – Message Transformation Patterns

How do I monitor my messaging system and keep it running?

A flexible messaging architecture unfortunately leads to a certain degree of complexity on the other side. Especially when it comes to integrating many message producers and consumers decoupled from each other in a messaging system, with partly asynchronous messaging, monitoring during operation can become difficult.

For this purpose, system management patterns have been developed to provide the right monitoring tools. The main goal is to prevent bottlenecks and hardware overloads in order to guarantee the smooth flow of messages.

The following figure shows some test and monitoring patterns.

Messaging Design Patterns - This diagram shows the different patterns of message monitoring. — Messaging Design Patterns – Message Monitoring Patterns

What are the basic systems?

With a typical system management solution, for example, the data flow can be controlled by checking the number of messages sent and received, or the processing time.

In contrast, other approaches check the actual information contained in the messages.
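The first kind of check, which only looks at the flow and not at the content, could be sketched like this. `MonitoredChannel` is a hypothetical wrapper around a simple list-based channel that counts messages and measures how long they wait.

```python
import time


class MonitoredChannel:
    """Wraps a simple list-based channel and records flow statistics."""

    def __init__(self):
        self._messages = []
        self.sent = 0
        self.received = 0
        self.total_wait = 0.0   # summed time messages spent in the channel

    def send(self, body):
        self.sent += 1
        self._messages.append((time.monotonic(), body))

    def receive(self):
        enqueued_at, body = self._messages.pop(0)
        self.received += 1
        self.total_wait += time.monotonic() - enqueued_at
        return body

    def backlog(self):
        return self.sent - self.received


channel = MonitoredChannel()
for n in range(3):
    channel.send({"n": n})
channel.receive()

print(channel.sent, channel.received, channel.backlog())   # 3 1 2
```

A growing backlog or a rising average wait time would be an early sign of a bottleneck, without the monitor ever inspecting the message bodies.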

t-SNE – Great Machine Learning Algorithm for Visualization of High-Dimensional Datasets

January 14, 2021 / RainerGewalt / 0 Comments

The machine learning algorithm t-Distributed Stochastic Neighborhood Embedding, also abbreviated as t-SNE, can be used to visualize high-dimensional datasets. The high-dimensional information of each data point is reduced to a low-dimensional representation. However, the information about existing neighborhoods should be preserved.

So this technique is another tool you can use to create meaningful groups in unordered data collections based on the unifying data properties. If you don’t know what cluster algorithms are, check out this article. Here we present 5 machine learning methods that you should know.
As shown in the following figure, the data should be represented grouped in 2-dimensional space.

The figure shows the data clusters generated by t-Distributed Stochastic Neighborhood Embedding (T-SNE) in 2-dimensional space. — Data clusters generated by t-Distributed Stochastic Neighborhood Embedding (T-SNE)

But how does the algorithm work and what are its strengths? In order to understand its function, we need to look at the origin of the technology.

What is the Stochastic Neighbor Embedding (SNE) Algorithm?

The basis of the t-Distributed Stochastic Neighborhood Embedding algorithm is originally the Stochastic Neighbor Embedding (SNE) algorithm. This converts high-dimensional Euclidean distances into similarity probabilities between individual data points.
The probability with which an object occurs next to a potential neighbor must be calculated.
The dissimilarities between two high-dimensional data points can be explained with a distance matrix, corresponding to the squared Euclidean distance.
A conditional probability is calculated for the low-dimensional correspondence.
This determines the similarity of the two data points on the low-dimensional map.
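Written out, the two conditional probabilities are commonly given in the t-SNE literature as follows. The notation is an assumption, since the text above does not fix one: x_i are the high-dimensional points, y_i their low-dimensional counterparts, and sigma_i is the bandwidth of the Gaussian centered on x_i.

```latex
\[
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
q_{j|i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}
\]
```

Both expressions are normalized over all other points k, so for a fixed i they sum to one over j.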

In order to achieve the closest possible correspondence between the two distributions p_ij and q_ij, a Kullback-Leibler divergence (KL) over all neighbors of each data point is computed as a cost function C. Large costs are incurred when data points that are close together in the high-dimensional space are mapped far apart.

t-Distributed Stochastic Neighborhood Embedding: minimized cost function: sum of the Kullback-Leibler divergences between the original and the induced distribution over the neighbors of an object. — Minimized Cost function: sum of the Kullback-Leibler divergences between the original and the induced distribution over the neighbors of an object.
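In formula form, this cost function is usually written as shown below. The notation is assumed for illustration: P_i and Q_i denote the conditional distributions over the neighbors j of point i in the high-dimensional and the low-dimensional space, with entries p_{j|i} and q_{j|i}.

```latex
\[
C = \sum_i \mathrm{KL}\left(P_i \,\Vert\, Q_i\right)
  = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}
\]
```

Because each term is weighted by p_{j|i}, the cost is large when a large p_{j|i} is modeled by a small q_{j|i}, and comparatively small in the opposite case.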

A gradient descent method is used to optimize the cost function. However, this optimization method converges very slowly. In addition, a so-called crowding problem arises.

If a high-dimensional data set can be approximated linearly only on a small scale, then it cannot be reduced to a lower dimension with a local scaling algorithm.

What makes the t-Distributed Stochastic Neighborhood Embedding (t-SNE) Algorithm work?

The t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm starts here. On the one hand, a simplified symmetric cost function is used.

The figure shows the simplified symmetric cost function used in t-Distributed Stochastic Neighborhood Embedding. — t-SNE: simplified symmetric cost function

Here, only a single KL divergence is minimized, namely the one between the joint probability distribution of all high-dimensional data and that of all low-dimensional data.

On the other hand, the similarity of the low-dimensional data points is computed with a Student’s t-distribution with one degree of freedom. This can be optimized quickly and is stable against the crowding problem.
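For reference, the two ingredients described in this section are commonly written as follows in the t-SNE literature. The symbols are assumptions for this illustration: p_{ij} and q_{ij} are the joint probabilities in the high- and low-dimensional space, y_i are the low-dimensional points, and n is the number of data points.

```latex
\[
C = \mathrm{KL}\left(P \,\Vert\, Q\right) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
\]
\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
\]
```

The heavy tails of the Student's t-distribution allow moderately distant points to be placed further apart in the low-dimensional map, which is what relieves the crowding problem.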

Software Design Patterns – A COMPLETE GUIDE

January 13, 2021 / RainerGewalt / 1 Comment

Software Design Patterns – This article is intended to explain the concept of design patterns in a simplified way and to give you an overview of the individual major groups.

Software architecture can be compared to the architecture of a house. Application development likewise requires planning, i.e. the design and construction of a meaningful, stable structure.

During implementation, it is really only about problem definition and solution with the tools given to you. Many of the steps are repetitive and follow routine patterns. The experience of the user or architect plays a major role here.
What do I apply when and how?

What are Software Design Patterns?

For many processes, there are already very optimized, proven templates that can be reused. Through these so-called design patterns, it is therefore possible to indirectly access the experience of others. The concept goes back to the architect Christopher Alexander and was subsequently used by computer scientists as a basis for conceptual design in software architecture.

These patterns are categorized on the basis of their characteristics in so-called design pattern catalogs and grouped logically in order to create a certain clarity. These characteristics can be, for example, similarities between patterns, their applicability, or their consequences. Much of the literature deals with this classification topic. The categorizations shown in the following figure may therefore differ depending on the point of view.

This diagram shows the 4 Important software design patterns. — 4 Important Software Design Patterns.

Creational Patterns

The Creational Design Patterns deal with object and class creation. How can object creations be inherited from other objects and to what extent can classes be instantiated by subclasses? How are these instantiations created and linked?

These patterns provide object creation mechanisms that control how objects are created, so that each object is created in a way that suits the respective situation. Flexibility and reusability are the intended goals here.
In this way, the construction is separated from the concrete implementation.
The following scheme shows some patterns that belong to the creational patterns.

Software Design Patterns - This scheme shows some Creational Patterns examples — Software Design Patterns – Creational Patterns examples
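As one possible illustration of separating construction from concrete implementation, here is a simple factory function in Python. The exporter classes and format names are invented for this sketch; it is a simplified relative of the factory patterns, not a full catalog implementation.

```python
from abc import ABC, abstractmethod


class Exporter(ABC):
    @abstractmethod
    def export(self, rows):
        """Turn rows (tuples) into a text representation."""


class CsvExporter(Exporter):
    def export(self, rows):
        return "\n".join(",".join(str(v) for v in row) for row in rows)


class TsvExporter(Exporter):
    def export(self, rows):
        return "\n".join("\t".join(str(v) for v in row) for row in rows)


def create_exporter(fmt):
    """Factory: callers ask for a format and never name a concrete class."""
    exporters = {"csv": CsvExporter, "tsv": TsvExporter}
    try:
        return exporters[fmt]()
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None


exporter = create_exporter("csv")
print(exporter.export([(1, "a"), (2, "b")]))
```

Adding a new format only requires a new class and one entry in the factory; the calling code stays unchanged.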

Structural Design Patterns

How do I create large, cohesive, yet efficient structures? How do I properly optimize the interaction of my entities? Structural Design Patterns should help with these questions and standardize the composition of objects and classes. So the focus here is on establishing individual relationships.
The following figure shows some of the patterns assigned here.

Software Design Patterns - This scheme shows some Structural Patterns examples — Software Design Patterns – Structural Patterns examples

It is often a matter of optimizing and saving inheritance processes. For example, objects can be enclosed in a tree structure, which then all use the same interface, or general properties can be moved to a single object, which is then shared by all other objects. Pipelines can be built and process chains can be formed.
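The tree structure in which all objects share the same interface is known as the composite pattern. A minimal sketch, with a made-up file and folder example:

```python
class File:
    def __init__(self, name, size):
        self.name = name
        self._size = size

    def size(self):
        return self._size


class Folder:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def size(self):
        # Same interface as File: a folder's size is the sum of its children.
        return sum(child.size() for child in self.children)


tree = Folder("root", [
    File("a.txt", 10),
    Folder("docs", [File("b.txt", 20), File("c.txt", 5)]),
])

print(tree.size())   # 35
```

The caller does not need to know whether it is dealing with a single file or a whole subtree, because both answer the same `size()` call.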

Behavioral Patterns

In addition to the efficient assignment and allocation of entities, communication must also be optimized. At this level, the different transfers among them also describe a structural flow of control. These behavioral patterns can be very complex and difficult to grasp, but are determined by how the individual objects are connected to each other.

So how are responsibilities distributed? Behavioral patterns are intended to help increase the flexibility of the software in terms of its behavior in carrying out this communication.
The following diagram shows some patterns that belong to the behavioral patterns.

Software Design Patterns - This scheme shows some Behavioral Patterns examples — Software Design Patterns – Behavioral Patterns examples

For example, inheritance can be used to distribute behavior between classes: a base class defines the order in which the steps of an algorithm are called, and the subclasses implement the individual steps.
Alternatively, the behavior of objects can be encapsulated instead of being distributed across classes. Another behavioral approach is the observer pattern, in which dependent objects are notified when the object they observe changes.
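The observer pattern mentioned above can be sketched as follows. The sensor example, the threshold of 30 and the callback-based design are assumptions chosen to keep the example short.

```python
class Subject:
    """Keeps a list of observers and notifies them about every state change."""

    def __init__(self):
        self._observers = []
        self.state = None

    def attach(self, observer):
        self._observers.append(observer)

    def set_state(self, state):
        self.state = state
        for observer in self._observers:
            observer(state)


log = []


def display(value):
    log.append(f"display: {value}")


def alarm(value):
    if value > 30:   # hypothetical threshold
        log.append(f"alarm: {value}")


sensor = Subject()
sensor.attach(display)
sensor.attach(alarm)
sensor.set_state(25)
sensor.set_state(35)

print(log)   # ['display: 25', 'display: 35', 'alarm: 35']
```

The subject knows nothing about what its observers do with the value, so new reactions can be added without touching the `Subject` class.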

Concurrency Patterns

Just as computations can be executed at the same time, i.e. in parallel, program structures can also be designed for parallel execution.
Whole program instances can be encapsulated as processes and run in isolation, or a program can be divided into several threads, which all access the same memory area but can still work in parallel.
Which pattern can be used where depends on the workload conditions present and must be carefully coordinated to effectively avoid overload peaks. The following diagram shows some examples of concurrency patterns.
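The thread-based variant, in which several threads access the same memory area, can be sketched with Python's standard `threading` module. The shared counter and the number of threads are arbitrary choices for this example; the lock is what coordinates access to the shared memory.

```python
import threading

state = {"counter": 0}   # memory shared by all threads
lock = threading.Lock()


def work(increments):
    for _ in range(increments):
        with lock:        # serialize access to the shared counter
            state["counter"] += 1


threads = [threading.Thread(target=work, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state["counter"])   # 40000
```

With separate processes instead of threads, each instance would have its own isolated memory and the results would have to be exchanged explicitly, for example via messages.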

Conclusion

Since not every problem solution has to be developed by oneself, strategically applied design patterns can save time and resources. They can ensure that programs run effectively. A certain standardization is created. This is especially important for cross-team development. A software product is thereby uniformly and coherently conceived and implemented.

Nevertheless, these templates are often criticized. Why is that?
A decisive factor is that design patterns must not be seen as an all-purpose solution. The individual templates must be understood by the developer in order to use them efficiently. Does the template fit my problem 100 percent, or am I creating extra work again?

Design patterns allow you to access the experience of others, but require your own experience in working with these solutions.

If you are interested in more architectural thinking, we have put together an article on another interesting software design approach: Domain Driven Design.

AI vs Machine Learning vs Deep Learning – It’s almost harder to understand all the acronyms around AI than the technology itself.

January 5, 2021 / RainerGewalt / 5 Comments

It’s almost harder to understand all the acronyms around Artificial Intelligence (AI) than the technology itself.
AI vs Machine Learning vs Deep Learning – These terms are often carelessly mixed together. But what are the actual differences? In this article, we will introduce you to all three fields, because even though there is overlap, they differ.
It should be important for you to know these differences, as each discipline describes different stages of a data analysis pipeline.

AI vs Machine Learning vs Deep Learning

In the following figure, we have schematically shown you the individual fields in their context. As you can see, the individual disciplines surround each other and form an onion-like layered model.

The figure clearly shows that there are relationships between individual disciplines. AI is to be understood as a generic term and thus includes the other fields. The deeper you go in the model, the more specific the tasks become. In the following, we will follow this representation and work our way from the outside to the inside.

Artificial intelligence

All disciplines are encompassed by the term AI. It is a science that explores ways to build intelligent programs and machines that can perceive, reason, act, and solve problems creatively. To this end, it attempts to model how the human brain works.
The following figure shows that AI can basically be divided into two categories.

Classification is about measuring the performance of AI based on how well it is able to replicate the human brain. In the Based on Functionality category, AI is classified based on how well it matches the human way of thinking. In the second category, it is evaluated based on human intelligence. Within these categories, there are further subgroups that correspond to an index.

AI vs Machine Learning

So what is the first subcategory Machine Learning and how does it differ from AI?
While AI deals with the functioning of artificial intelligence and compares it with the functioning of the human brain, machine learning is a collection of mathematical methods of pattern recognition. It is about how a system is given the ability to automatically learn and improve from experience. Various algorithms (e.g., neural networks) are used for this purpose. In the following scheme, the broad machine learning field is presented in a categorized way.

In machine learning, algorithms are used to build statistical models based on training data. Roughly, these algorithms can be divided into three main learning techniques. While in supervised learning the result is predetermined by a cleanly labeled data set, unsupervised learning is completely self-organized. Here the patterns are to be explored independently.
In reinforcement learning, utility functions are to be independently approximated based on rewards received.
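The difference between supervised and unsupervised learning can be shown on a tiny one-dimensional data set. Both functions below are deliberately simplified sketches (a nearest-centroid classifier and a two-cluster k-means); the data and labels are made up for the example.

```python
def nearest_centroid_fit(points, labels):
    """Supervised: labels are given, learn one mean value per label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(points, iterations=10):
    """Unsupervised: no labels, find two groups with a tiny 1-D k-means."""
    low, high = min(points), max(points)
    for _ in range(iterations):
        group_low = [x for x in points if abs(x - low) <= abs(x - high)]
        group_high = [x for x in points if abs(x - low) > abs(x - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return low, high


data = [1, 2, 3, 11, 12, 13]
model = nearest_centroid_fit(data, ["small"] * 3 + ["large"] * 3)

print(predict(model, 2.5), predict(model, 10))   # small large
print(two_means(data))                           # (2.0, 12.0)
```

In the supervised case the group names come from the labeled data, while the unsupervised function discovers the same two groups on its own but cannot name them.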

Machine Learning vs Deep Learning/ Deep Neural Learning

Deep learning is a subfield of machine learning, just as machine learning is a subfield of AI. Here, multilayer neural networks are used to analyze various factors in large amounts of data. These networks are similar to the human neural system. If you want to know more about this structure, read our article on perceptrons, the smallest unit of a neural network.
Unlike in classical machine learning, the optimization of the neural weights is typically done on powerful GPUs. Pure machine learning is best used on structured data sets, while for unstructured data you should opt for deep learning. In the following graphic, we have summarized the main factors that make up deep learning. For the network types autoencoder and CNN we provide more detailed articles.

Representation of all basic deep learning components
AI vs Machine Learning vs Deep Learning — Definition Deep Learning

4 Index Data Structures a Data Engineer Must Know

January 2, 2021 / RainerGewalt / 1 Comment

In this article we will explain what index data structures are and introduce you to some popular structures.

In today’s world, ever-increasing amounts of data are being processed. The data can be used to derive business strategies in a commercial context, but also to gain valuable information about all scientific disciplines. The data obtained must be saved, ideally as raw data, and stored for future analysis.

At the time of creation, it is not yet possible to estimate what information might be valuable at some point. So any reduction in data ultimately represents a loss. Huge amounts of data accumulate every second, and managing them is an immense task for today’s hardware and software. Mathematical tricks have to be used to optimize search mechanisms and storage functions.

Index data structures allow you to access searched data in a large data collection immensely faster. Instead of executing a search query sequentially, a so-called index data structure is used to search for a specific data record in this data set based on a search criterion.

What are Index Data Structures in Databases?

You have probably heard about indexing in connection with databases. Here, too, an index structure is formed, independent of the data structure, which accelerates the search for certain fields. This structure consists of references, which define an order relation to the table columns. Based on these pointers, the database management system can then find the data using a search algorithm.

schematic representation of index data structures in databases — Index Data Structures in databases
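The idea of an ordered structure of references that lives separately from the table can be sketched with a sorted list and binary search. The table, the column name and the helper functions are hypothetical; real database indexes are far more elaborate.

```python
from bisect import bisect_left, bisect_right

rows = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Munich"},
    {"id": 3, "city": "Berlin"},
]


def sequential_search(table, column, value):
    """Look at every row, one after the other."""
    return [row for row in table if row[column] == value]


def build_index(table, column):
    """Sorted keys plus references (row positions), kept separate from the table."""
    pairs = sorted((row[column], pos) for pos, row in enumerate(table))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions


def indexed_search(table, index, value):
    keys, positions = index
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)   # binary search
    return [table[pos] for pos in positions[lo:hi]]


city_index = build_index(rows, "city")
print(indexed_search(rows, city_index, "Berlin"))
# [{'id': 1, 'city': 'Berlin'}, {'id': 3, 'city': 'Berlin'}]
```

Both searches return the same rows, but the indexed variant only inspects a logarithmic number of keys instead of every row.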

However, indexing is a very complex scientific field. Queries are constantly being made more efficient and optimized. Thus, the approaches are diverse and very mathematical. This article will give you an overview of popular index data structures and help you to optimize your data pipelines.

Index data structure types

There are many different indexing methods. They are all based on different mathematical assumptions. You should understand these assumptions and choose a suitable system according to your data properties.
In the following scheme you can see some structure types you have to distinguish between, depending on the data you want to index.

index structures 1 — index data structure types

The most important distinction, however, is whether you want to index one-dimensional or multidimensional data relationships. This means that you have to differentiate whether there is a common feature or several related but independent features.
In the following figure, we have classified the individual index structures according to their dimension coverage.

we have classified the individual index structures according to their dimension coverage. — individual index structures according to their dimension coverage

Which index structure you ultimately choose depends on many factors and should be weighed up well in advance, especially with large data sets.

Popular index data structures you should know

In the following, we will introduce you to some of the most popular indexing methods in detail. Because here, too, the key to success lies in understanding your tools and using them correctly at the right moment.

What is Hashing?

If you want to search for a value in an unsorted array, a linear search is too time-consuming.
With the so-called hashing method, a hash value is used to identify an object. It is calculated by a hash function from the key and determines the storage location in an array of indices, the so-called hash table. This means that you use this function to compute a storage location in the table from a key.
The following figure shows the hash function flow again.

schematic representation of the hash function sequence in detail — hash function sequence

Important basic assumptions are, however, that the function always returns a number for an object, that two identical objects always have the same number, and that two unequal objects do not necessarily have different numbers, so collisions can occur.
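A small hash table makes these assumptions visible. The sketch below uses Python's built-in `hash` as the hash function and resolves collisions by chaining, i.e. by keeping a short list per storage location; the collision strategy is an assumption of this example, since the text does not prescribe one.

```python
class HashTable:
    def __init__(self, size=8):
        self._buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # The hash function maps the key to a storage location in the table.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # same key: overwrite
                return
        bucket.append((key, value))        # new key, possibly a collision

    def get(self, key, default=None):
        for existing_key, value in self._buckets[self._slot(key)]:
            if existing_key == key:
                return value
        return default


table = HashTable(size=4)
for n in range(10):      # 10 keys in 4 slots: collisions are unavoidable
    table.put(n, n * n)

print(table.get(7), table.get(99))   # 49 None
```

Unequal keys such as 3 and 7 end up in the same slot here, yet both can still be found because the keys themselves are compared inside the bucket.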

What is a Binary tree?

A so-called binary tree is a data structure in which each element, also called node, has a maximum of two successors. The addresses of the subordinate nodes are kept track of by pointers. It is often used when data is to be stored in RAM.
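For use as an index, a binary tree is usually kept ordered, which turns it into a binary search tree. The following sketch assumes this ordering rule (smaller keys to the left, larger keys to the right); it is an addition for illustration and does not rebalance the tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # pointer to the subtree with smaller keys
        self.right = None   # pointer to the subtree with larger keys


def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node             # duplicates are ignored


def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)


root = None
for key in (50, 30, 70, 20, 40, 60):
    root = insert(root, key)

print(in_order(root))                          # [20, 30, 40, 50, 60, 70]
print(contains(root, 40), contains(root, 45))  # True False
```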

What is a B-tree?

The B-tree is often used in databases and file systems, i.e. for storage on the hard disk. The tree is sorted and completely balanced. The data is stored sorted by keys. The keys are stored in its internal nodes, but need not be stored in the records at the leaves. CRUD functions run in amortized logarithmic time.

The B-tree is classified into different types according to its properties.
In the B+ tree, only copies of the keys are stored in the internal nodes. The keys are stored with the data in the leaves. To speed up sequential access, these also contain pointers to the next leaf node and are thus concatenated.
In the following scheme you see a basic B+ tree structure.

Basic representation of a b+ tree and its components — Basic b+ tree structure
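The benefit of the chained leaves can be shown with a strongly simplified, static sketch: one root level with copies of keys and a chain of leaves holding the data. Real B+ trees have several internal levels and support inserts and splits, which this example leaves out. All names and the leaf size are assumptions, and the tree is assumed to be non-empty.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, items):
        self.items = items   # sorted (key, value) pairs: the data lives here
        self.next = None     # pointer to the following leaf


def build(items, leaf_size=2):
    """Build a static two-level tree: one root over a chain of leaves."""
    items = sorted(items)
    leaves = [Leaf(items[i:i + leaf_size]) for i in range(0, len(items), leaf_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # The root only holds copies of keys: the first key of every leaf but the first.
    separators = [leaf.items[0][0] for leaf in leaves[1:]]
    return separators, leaves


def range_scan(tree, low, high):
    separators, leaves = tree
    leaf = leaves[bisect_right(separators, low)]   # descend from root to start leaf
    result = []
    while leaf is not None:
        for key, value in leaf.items:
            if key > high:
                return result
            if key >= low:
                result.append((key, value))
        leaf = leaf.next                            # follow the leaf chain
    return result


tree = build([(k, f"row-{k}") for k in (5, 1, 9, 3, 7, 11)])
print(range_scan(tree, 3, 9))
# [(3, 'row-3'), (5, 'row-5'), (7, 'row-7'), (9, 'row-9')]
```

After the start leaf has been found once, the range query never goes back to the root; it simply walks along the `next` pointers.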

The B* tree is an index structure where non-root nodes must be at least 2/3 filled. This is achieved by a modified split strategy.
In addition to indexing, partitioning also offers you the possibility of strongly optimizing the data search within a database. In this article we introduce you to this technique.

What is a SkipList?

The SkipList resembles in its structure a linked list consisting of containers, which contain the data with a unique key and a pointer to the following container. In a SkipList, however, the containers have different heights and can contain pointers to containers that do not follow directly. The idea is to speed up the search by additional pointers.

schematic representation of an index structure of the SkipList — Schematic representation of a SkipList

Calculation of the container height

All nodes have pointers on different levels, which allow keys to be skipped. The height of the list elements is either assigned regularly or chosen at random according to mathematical rules. Accordingly, the search performance depends either on how the list was built or on how evenly the random heights are distributed over the list.
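A randomized SkipList can be sketched as follows. The maximum height of 8, the coin-flip probability of 1/2 and the fixed random seed are assumptions of this example; other height rules are possible.

```python
import random


class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height   # one pointer per level


class SkipList:
    MAX_HEIGHT = 8

    def __init__(self, seed=0):
        self._head = SkipNode(None, self.MAX_HEIGHT)
        self._random = random.Random(seed)   # fixed seed keeps the sketch repeatable

    def _random_height(self):
        # Flip a coin: each extra level is reached with probability 1/2.
        height = 1
        while height < self.MAX_HEIGHT and self._random.random() < 0.5:
            height += 1
        return height

    def insert(self, key):
        update = [self._head] * self.MAX_HEIGHT
        node = self._head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node   # last node before `key` on this level
        new_node = SkipNode(key, self._random_height())
        for level in range(len(new_node.forward)):
            new_node.forward[level] = update[level].forward[level]
            update[level].forward[level] = new_node

    def contains(self, key):
        node = self._head
        for level in reversed(range(self.MAX_HEIGHT)):
            # Higher levels skip many keys, lower levels refine the position.
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        return node is not None and node.key == key

    def keys(self):
        node, result = self._head.forward[0], []
        while node is not None:
            result.append(node.key)
            node = node.forward[0]
        return result


skip_list = SkipList()
for key in (30, 10, 50, 20, 40):
    skip_list.insert(key)

print(skip_list.keys())                               # [10, 20, 30, 40, 50]
print(skip_list.contains(20), skip_list.contains(25)) # True False
```

The lowest level is an ordinary sorted linked list; the randomly assigned higher levels act as express lanes that let the search skip over many containers at once.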