Category: Software Architecture Principles

ERP vs MES vs PLM vs ALM – What role will they play in industry 4.0?

March 14, 2021 / RainerGewalt / 3 Comments

ERP vs MES vs PLM vs ALM – These terms are being mentioned more and more often in connection with Industry 4.0. But what is behind these systems and what are the differences?

this scheme gives an Example of a business process pyramid — Example of a business process pyramid

To stay competitive in today’s world, you need to increase the efficiency of your business processes. It is important that you optimally plan, control and manage your operational resources (capital, personnel…).

Your goal should be to create high quality and continuity with high productivity and low lead time.
Many of your business processes create ever larger amounts of data and increase in complexity. You need to reduce this complexity and increase your flexibility.
Many software solutions are available to your company for the optimal use of resources.

What is an ERP?

Basically, an Enterprise Resource Planning (ERP) system is an IT-supported system of software solutions that communicate with each other. Your data is stored centrally and should represent your company in its entirety through quickly available information.

ERP vs MES vs PLM vs ALM – Overview ERP Systems

The information of your business processes is optimized and documented.
The trend is towards web-based applications.

This means that you access the system interface via your browser and that you can also access it beyond the boundaries of your company. Another advantage is that you don’t have to install any services, making you hardware-independent.

What are ERP Subsystems?

You can use ERP systems in all areas of your business. They provide you with complete solutions for all necessary subsystems.

ERP vs MES vs PLM vs ALM – ERP features

Complex systems are divided into so-called application modules, which you can combine with each other as you wish. These fulfill various tasks for the provision and further processing of information. In this way, you can put together your ERP system according to your requirements and adapt it to the size of your company.

What are the advantages of Cloud ERP?

ERPs can also be purchased as a complete Software-as-a-Service (SaaS) solution.

This scheme shows the ERP Cloud Advantages — ERP vs MES vs PLM vs ALM – ERP Cloud Advantages

These are completely industry and hardware independent. You, as a user, can access a sophisticated ERP software package online and thus from anywhere. This gives you absolute spatial flexibility. However, Cloud ERP solutions are still quite new and not yet fully mature. So you should weigh up well in advance whether you want to use a cloud application.

What is an MES?

A Manufacturing Execution System (MES) is the operational, process-related part of a multi-layer overall system. It is responsible for real-time production management and control. You can use MES data to optimize manufacturing processes and detect errors during the production process.

The MES is assigned to the ERP system. The ERP system accesses your MES data to plan production. It then feeds this plan back to your production control system for implementation.

ERP vs MES vs PLM vs ALM – Relationship between company level

The interaction of the individual components is moving closer together in Industry 4.0.

What does the MES include?

MES is usually a multi-layer overall system. It processes your production data into Key Performance Indicators (KPI) and enforces the fulfillment of an existing production plan.
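As a rough illustration of how production data might be condensed into KPIs, the following sketch computes two simple figures from a list of production records. The record fields (`planned_units`, `produced_units`, `good_units`) and the two KPI definitions are assumptions made for this example and are not taken from any particular MES product.

```python
from dataclasses import dataclass


@dataclass
class ProductionRecord:
    """One shift of production data (hypothetical fields for illustration)."""
    planned_units: int
    produced_units: int
    good_units: int


def plan_fulfillment(records):
    """Share of the planned quantity that was actually produced."""
    planned = sum(r.planned_units for r in records)
    produced = sum(r.produced_units for r in records)
    return produced / planned if planned else 0.0


def quality_rate(records):
    """Share of the produced units that passed quality checks."""
    produced = sum(r.produced_units for r in records)
    good = sum(r.good_units for r in records)
    return good / produced if produced else 0.0


records = [
    ProductionRecord(planned_units=100, produced_units=90, good_units=81),
    ProductionRecord(planned_units=100, produced_units=110, good_units=99),
]
print(f"plan fulfillment: {plan_fulfillment(records):.0%}")
print(f"quality rate: {quality_rate(records):.0%}")
```

A real MES would typically derive such figures continuously from machine and order data; the point here is only that a KPI is an aggregate over raw production records.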

this diagram clearly shows all components of a MES System — ERP vs MES vs PLM vs ALM – MES System features

What is a PLM?

In addition to MES and ERP, the Product Life Cycle Management (PLM) system plays an elementary role in the digitization of your company.

In order for your company to remain internationally competitive in today’s world, you need to optimize your business models in order to be able to act preventively.

As a manufacturing company, you need to be able to analyze large amounts of data quickly. This way you can recognize deviations from the plan early on and make the right decisions.

Many software solutions help you in all business areas and even exchange data with each other. In this way, you can create information chains within a company and act more quickly.

A PLM system is a management approach for the seamless integration of all information that accumulates during the life cycle of a product.
The core components of PLM are the data and information related to the product lifecycle.

this scheme shows your production life cycle process — ERP vs MES vs PLM vs ALM – Production life cycle process

A large amount of product-related and time-dependent data is generated along the product life cycle. The PLM enterprise concept is based on coordinated methods, processes and organizational structures and usually makes use of IT systems. PLM tools link design, implementation and production and provide feedback from manufacturing.

this scheme shows PLM main application areas — ERP vs MES vs PLM vs ALM – PLM main areas of application

The goal of a PLM system is the central management of information and corresponding user groups. One advantage here is that you can control the process of editing and distribution throughout the company.

Application Lifecycle Management (ALM) vs PLM System

More and more products and systems now contain a software component. However, since hardware and software are historically different, you must also differentiate between the management systems.

This schema shows the major differences between ALM and PLM — ALM vs PLM

With PLM you are looking at a physical product, with ALM you are looking at a software product. Basically, however, there are similarities between the two systems. Both also track a product over its entire lifecycle. However, since both product types are increasingly merging today, you can also link both systems on an IT basis at the overall product level.

ERP vs MES vs PLM vs ALM – What does the future hold?

When people talk about Industry 4.0, they are referring to a new level of technological progress. The basis of this innovation is the Internet of Things (IoT). The software solutions of various company levels are networked to form cyber-physical systems and exchange information with each other in real time. In this way, production planning can take place in management and be implemented directly in production. As production becomes more complex in the future, mastering this complexity and the underlying technologies will require the necessary know-how.

The software solutions presented here are systems optimized for business areas. Each software system is therefore an expert in its own field. This ensures a decisive modularity for a company’s overall solution. On the other hand, this modularity always leads to increased complexity. In the future, it will become increasingly important to create reciprocal data pipelines, so-called data streams, between the individual systems, which currently still operate very autonomously.

ERP vs MES vs PLM vs ALM – And their role in Industry 4.0

A decision made at the management level should be implemented in production and at the same time remain controllable at all levels. Optimally, the system should be able to make its own analyses. AI algorithms can help here to find sensible decisions despite increasing complexity. This allows you to optimize your individual production steps and shorten life cycles.

This schema shows the role of a MES System in Industry 4.0 — ERP vs MES vs PLM vs ALM – Industry 4.0 and MES System

The MES, for example, plays an important role here due to its proximity to production. This allows you to make important decisions quickly and implement production plans. In your company of the future, software solutions from various divisions are networked with each other. So you can form information chains and the MES is part of this network.

What is data warehousing and does it still make sense?

March 12, 2021 / RainerGewalt / 3 Comments

Data Warehousing – In today’s flood of data, it is becoming increasingly difficult to maintain a clear data management system. More and more data sources are recorded via different software systems.
A unified, centralized system can facilitate analysis and ensure that only one data truth exists in an organization.

What is a Data Warehouse System?

A data warehouse system is built by integrating data from multiple heterogeneous sources. In addition to centralization, it performs the tasks of structuring data, supporting analytical reporting and supporting structured decision-making.
The system can perform data cleansing as well as data integration and data consolidation and does not require transaction processing or recovery.

Data Warehousing - The figure shows all other names that the system has. — Multiple names for Data Warehouse

It is thus a powerful Big Data information system that can centrally handle everything related to data processing.

Data Warehousing Features

Data warehousing offers several features. Such an information system is subject oriented. It does not focus on the current operation, as these data are separated. This means that frequent changes in the operational database are not reflected in the data warehouse. Thus, the focus is on modeling and analysis of data.
The system is time variant, which means that the collected data is associated with a specific period of time and previous data is not deleted when new data is added.
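The time-variant property can be sketched as an append-only store in which every value carries the period it belongs to. The class and method names below (`TimeVariantTable`, `load`, `as_of`, `history`) are hypothetical and only illustrate the idea; a real warehouse would implement this with dated fact tables.

```python
from datetime import date


class TimeVariantTable:
    """Append-only store: every value is tagged with the period it belongs to."""

    def __init__(self):
        self._rows = []  # list of (period, key, value)

    def load(self, period, key, value):
        # New data is appended; values from earlier periods are never overwritten.
        self._rows.append((period, key, value))

    def as_of(self, key, period):
        """Return the most recent value for `key` at or before `period`."""
        matches = [(p, v) for p, k, v in self._rows if k == key and p <= period]
        return max(matches, key=lambda m: m[0])[1] if matches else None

    def history(self, key):
        """Return all recorded (period, value) pairs for `key`, oldest first."""
        return sorted((p, v) for p, k, v in self._rows if k == key)


stock = TimeVariantTable()
stock.load(date(2021, 1, 31), "pump", 120)
stock.load(date(2021, 2, 28), "pump", 95)
print(stock.as_of("pump", date(2021, 2, 1)))
print(stock.history("pump"))
```

Because nothing is deleted, the same question can be answered for any point in the past, which is what makes historical analysis possible.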

What does a Data Warehouse structure look like?

The complexity of this system increases sharply with the complexity of the business. Many distinct data sources, i.e. business processes, provide cumulative and historical data.

This figure shows the main principle behind data warehousing simplified. — Data Warehousing main Principle

Therefore, basic approaches have been defined according to which every data warehouse system should be structured: Single Tier, Two Tier and Three Tier.

Two Tier vs Three Tier Data Warehouse Architecture

In the following we will work out the three tier architecture.
This structure, the most commonly used, completely decouples the data from the user interface by moving the application logic to a middle tier.
In two-tier, the application logic resides either in the user interface on the client or in the database on the server.
Thus, without a middle tier, this system is less scalable and less flexible. Integration of other data sources is more difficult here.

Three Tier Data Warehouse Architecture

The Three Tier Data Warehouse Architecture is the design on the basis of which a data warehouse with three tiers is then built. The figure below shows this structure with common components.

However, the individual components can vary and depend on the project framework. As a rule, however, these changes do not alter the basic structure.

Bottom Tier

The lowest layer is persistence, which is usually located on a server. The data from various data sources is prepared and stored here using an ETL (extract, transform and load) process. Tools and other external resources can be used to feed the data.
This persistence can consist of a relational but also a multidimensional database system.

Load Data into Warehouse

In addition to the different components and architectures, data can also be transmitted to the information system in different ways.

Data Warehousing – ETL vs ELT

As shown in the figure, a basic distinction is made between two elementary processes.

What is ELT?

Extract, Load, and Transform, or ELT for short, is about extracting information from the source system and loading it into the target system without transforming it first. The transformation only takes place later, inside the target system.

The following figure shows such an example system. In this case, the Hadoop framework handles the central data management, while applications and analysis tools access the untransformed data.

Data Warehousing – Extract, Load, and Transform
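The ELT order of operations can be sketched with Python's built-in `sqlite3` module. SQLite only stands in for the target system here (the article's example uses Hadoop), and the table names and sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
source_rows = [
    ("2021-03-01", " 19.90 "),
    ("2021-03-01", "5.10"),
    ("2021-03-02", "7.00"),
]

target = sqlite3.connect(":memory:")

# Extract + Load: copy the source rows unchanged into the target system.
target.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
target.executemany("INSERT INTO raw_sales VALUES (?, ?)", source_rows)

# Transform: performed afterwards, inside the target, when the data is needed.
target.execute(
    """
    CREATE TABLE daily_sales AS
    SELECT day, ROUND(SUM(CAST(TRIM(amount) AS REAL)), 2) AS total
    FROM raw_sales
    GROUP BY day
    """
)

daily = target.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
raw_count = target.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
print(daily, raw_count)
```

The untransformed `raw_sales` table stays available, so other tools can apply their own transformations to the same loaded data.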

What is ETL?

In Extract, Transform and Load, or ETL for short, the data set is first extracted from the sources into a staging area, then transformed or reformatted with business manipulation performed on it, and only then loaded into the target or destination database or data warehouse.

Data Warehousing – Extract, Transform, and Load
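A minimal ETL sketch, again using `sqlite3` as a stand-in for the warehouse, could look as follows. The sample rows, the cleaning rules and the function names `extract`, `transform` and `load` are assumptions chosen for this example.

```python
import sqlite3


def extract():
    """Stand-in for reading from the source systems (hypothetical sample rows)."""
    return [
        {"customer": " alice ", "amount": "19.90"},
        {"customer": "BOB", "amount": "5.10"},
        {"customer": "", "amount": "3.00"},  # incomplete record
    ]


def transform(staged_rows):
    """Clean and reformat the staged rows before anything reaches the warehouse."""
    cleaned = []
    for row in staged_rows:
        name = row["customer"].strip().title()
        if not name:
            continue  # assumed business rule: drop rows without a customer
        cents = round(float(row["amount"]) * 100)  # store amounts as integer cents
        cleaned.append((name, cents))
    return cleaned


def load(conn, rows):
    """Write the already transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT customer, amount_cents FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

In contrast to ELT, the warehouse only ever sees cleaned data; the raw input exists solely in the staging step.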

Middle tier

One or more OLAP (Online Analytical Processing) servers reside in the middle data warehouse layer. This technology can be used to create complex budget plans and perform analyses cost-effectively. So in the three tier data warehouse architecture, jobs are generated in the top tier and sent to this middle tier. Here, the data in the bottom tier is then accessed and analyses are performed. The result is then sent to the top tier and thus made available to the user, and/or forwarded to the bottom tier for storage of the analysis results in persistence.

What is an OLAP Server?

Basically, three OLAP server models are distinguished.
In Relational OLAP (ROLAP) the operations on multidimensional data are based on standard relational operations. The Multidimensional OLAP (MOLAP) directly implements the multidimensional data operations. A mixture of relational and multidimensional processing can be handled by Hybrid OLAP (HOLAP).
The choice of the server model always depends on the data composition in the lowest layer.
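To make the ROLAP idea more concrete, the following sketch expresses two typical multidimensional operations, roll-up and slice, as ordinary relational queries. It uses `sqlite3` with an invented `sales` table; the helper names `roll_up` and `slice_by_year` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("North", "Pump", 2020, 100),
        ("North", "Valve", 2020, 40),
        ("South", "Pump", 2020, 70),
        ("South", "Pump", 2021, 90),
    ],
)

DIMENSIONS = {"region", "product", "year"}


def roll_up(dimension):
    """Aggregate revenue along one dimension using a plain GROUP BY."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(revenue) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query).fetchall()


def slice_by_year(year):
    """Fix one dimension value using a plain WHERE clause."""
    return conn.execute(
        "SELECT region, product, revenue FROM sales "
        "WHERE year = ? ORDER BY region, product",
        (year,),
    ).fetchall()


print(roll_up("region"))
print(slice_by_year(2021))
```

A MOLAP server would instead keep the data in a dedicated multidimensional structure and would not need to translate these operations into SQL.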

Top-Tier

The top tier is the top of the three tier data warehouse architecture, the front-end client layer. It contains query and reporting tools, analysis tools, and data mining tools, thus providing the interface to the user. Here the user can generate analyses and take a look at the data.

Terminologies

However, some terms that often come up in connection with this system need to be clarified.
When metadata is mentioned, a kind of roadmap to the data warehouse is meant. Here the warehouse objects are defined and it acts as a directory. This means that the decision support system finds the contents via the metadata.

The metadata is stored in the metadata repository, an integral directory that manages both the business metadata, i.e. data ownership information, business definitions and change policies, and the operational metadata. Operational metadata refers to the currency of the data (whether it is active, archived or cleansed) and to data lineage, which is the history of the data. This includes the data used to map the operational environment, source databases and their contents, data extraction, data partitioning, cleansing, transformation rules, data refreshing and cleansing rules, but also the algorithms for summation, dimensional algorithms, and data on granularity and aggregation.

The so-called data cube represents data in multiple dimensions and the data mart contains only the data specific to a certain group.
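The two terms can be illustrated with a very small in-memory model. The sketch below assumes three dimensions (department, product, quarter) and invented sample facts; `build_cube`, `aggregate` and `data_mart` are hypothetical helper names.

```python
from collections import defaultdict

# Invented facts: (department, product, quarter, units sold).
facts = [
    ("sales", "Pump", "Q1", 10),
    ("sales", "Valve", "Q1", 4),
    ("service", "Pump", "Q1", 2),
    ("sales", "Pump", "Q2", 7),
]


def build_cube(rows):
    """Store one measure per combination of dimension values."""
    cube = defaultdict(int)
    for department, product, quarter, units in rows:
        cube[(department, product, quarter)] += units
    return dict(cube)


def aggregate(cube, keep):
    """Collapse the cube onto the dimension positions listed in `keep`."""
    result = defaultdict(int)
    for dims, units in cube.items():
        result[tuple(dims[i] for i in keep)] += units
    return dict(result)


def data_mart(cube, department):
    """Return only the cells that belong to one department."""
    return {dims: units for dims, units in cube.items() if dims[0] == department}


cube = build_cube(facts)
print(aggregate(cube, keep=(1,)))  # units per product
print(data_mart(cube, "service"))  # subset for a single group
```

The cube answers questions along any combination of dimensions, while the data mart is simply the part of the data that one group needs.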

Data Warehouse Types and How they work

All Data Warehouse systems follow the same basic structure, which we explain in this article, but can consist of different components. Accordingly, they are typified.

Host-Based mainframe warehouse

The Host-Based mainframe warehouse resides on a large-volume database.
In addition to this database, metadata is managed in a central metadata repository. Within this metadata, for example, the documentation of data sources and the data translation rules are stored.

In general, three phases run in this information system.
Selection and scrubbing take place in the unloading phase. That is, the appropriate data types and data sources are determined here and errors in the data are subsequently corrected.
In the following transform phase, the data is translated into a suitable form. The rules for access and storage are also specified here.
In the final Load phase, the preprocessed data set is moved into tables.

Host-Based LAN data warehouse

With this type, information can be extracted from a variety of sources. Multiple LAN-based warehouses are supported. Data provisioning can be either centralized or from the workgroup environment. The size depends on the platform.

Multi-Stage Data Warehouses

Here, the data is staged several times before being loaded into the data warehouse and finally distributed into department-specific data marts.

Stationary Data Warehouse

In the case of stationary warehouse types, the data from the sources are not changed. The customer thus has direct access to the data.

Distributed Data Warehouses

In the distributed warehouse system, the data is basically distributed. For technological or business reasons, this separation can take place in local and global warehouses. The local levels are then only integrated within the local site but also contain historical data and are therefore fully autonomous. This is where most of the operational processing takes place, while the global part processes the data that is relevant to the company as a whole.

Are data warehouses still promising?

Nowadays, data streams are on everyone’s lips and it is precisely with iterative and highly dynamic data sources that large data warehouse systems reach their limits. Often, smaller tools that do not require a complete solution are more efficient. So are data warehouse systems dead after all? No, definitely not. For one thing, their basic principles, such as the pursuit of data truth, can be applied to any other software system. For another, in many cases a data warehouse system is still a high-performance overall solution and can coexist with data stream pipeline systems.

Apache Hive is free data warehouse software from the Apache ecosystem. With this software you can implement very performant big data systems. Here we have collected the most important information about Hive.

Apache Hive Architecture – Data Warehouse System for free

February 5, 2021 / RainerGewalt / 2 Comments

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management. Thus, only one data truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project of the Apache Software Foundation, which publishes free and open source software. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of unstructured data. Hive does not work with the schema-on-write (SoW) approach like relational databases, but uses the so-called schema-on-read (SoR) approach.
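The difference between the two approaches can be sketched without Hive itself. In the following plain-Python example the raw file content is stored untouched and a schema is only applied when the data is read. The sample data, the `SCHEMA` list and the function name are assumptions for illustration, and mapping unparsable values to `None` is a simplification chosen for this sketch.

```python
import csv
import io

# Raw data is stored exactly as it arrived; nothing is validated on write.
RAW_FILE = "1,pump,19.90\n2,valve,not-a-number\n3,motor,7.50\n"

# The schema is only declared by whoever reads the data.
SCHEMA = [("id", int), ("name", str), ("price", float)]


def read_with_schema(raw, schema):
    """Apply column names and types while reading (schema-on-read)."""
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (column, cast), value in zip(schema, fields):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None  # value does not fit the declared type
        yield row


rows = list(read_with_schema(RAW_FILE, SCHEMA))
for row in rows:
    print(row)
```

With schema-on-write, the second line would have been rejected when it was stored; with schema-on-read, loading is fast and the mismatch only becomes visible at query time.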

What are the biggest advantages of Hive?

HiveQL queries are automatically converted into MapReduce, Tez or Spark jobs. Hadoop clusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

The figure shows the main Apache Hive features — Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL, a declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

The figure shows the basic three-part core architecture of Apache Hive. — Apache Hive Architecture

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. It receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into an abstract syntax tree and executable tasks, an optimizer that aggregates, splits, and optimizes these tasks for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Compute

The metadata is stored here in a persistent store. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

Messaging Patterns – It is not enough to decide to use a message

January 17, 2021 / RainerGewalt / 1 Comment

Messaging Patterns – What are they? What are their strengths and why should they only be used with caution? We clarify these questions in this article.

What are Design Patterns?

Technology-independent designs can provide proven pattern solutions in software development, ensuring standardized and robust architecture.
If you’ve never heard of software design patterns, check out this article from us on the subject first.

Design patterns allow a developer to draw on the experience of others. They offer proven solutions for recurring tasks. A one-to-one implementation is not advisable. The patterns should rather be used as a guide.

What is a message?

A basic design pattern is the message. The term is used by everyone as a matter of course, but what is behind it?

Data is packaged in messages and then transmitted from the sender to the receiver via a message channel. The following figure shows such a messaging system.

Messaging Patterns - This scheme shows the basic concept of a message — Messaging Patterns – Basic Concept of a Message

The communication is asynchronous, which means that both applications are decoupled from each other and therefore do not have to run simultaneously. The sender must build and send the message, while the receiver must read and unpack it.
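The basic exchange can be sketched in a few lines. Here a `queue.Queue` from the Python standard library stands in for the message channel and JSON is used as the message format; both choices, like the header and body layout, are assumptions for this example.

```python
import json
import queue

channel = queue.Queue()  # stands in for the message channel


def send(channel, payload):
    """Sender side: build a message from the data and put it on the channel."""
    message = json.dumps({"header": {"type": "order"}, "body": payload})
    channel.put(message)


def receive(channel):
    """Receiver side: read a message from the channel and unpack its data."""
    message = channel.get()
    return json.loads(message)["body"]


send(channel, {"order_id": 42, "quantity": 3})
# The receiver does not have to run at the same time as the sender;
# the message waits in the channel until it is picked up.
order = receive(channel)
print(order)
```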

What are Messaging Patterns?

However, this form of message transmission is only one way of transferring information. The following figure shows the basic concepts of messaging design patterns.

This diagram shows all the basic components of the messaging design patterns — Basic Components of the Messaging Patterns

What is Message Construction?

It is not enough to decide to use a message. A message can be constructed according to different architectural patterns, depending on the functions to be performed.

The following figure shows some of these patterns.

Messaging Design Patterns - This diagram shows the different Patterns of message construction. — Messaging Design Patterns – Message Construction Patterns

Message Construction – When do I use it?

Messaging can be used not only to send data between a sender and receiver, but also to call a procedure or request a response in another application.

With the right message architecture a certain flexibility can be guaranteed. This makes the message much more robust against possible future changes.

What is Message Routing?

A message router connects the message channels in a messaging system. We will come back to this topic later. This router corresponds to a filter, which regulates the message forwarding, but does not change the message. A message is only forwarded to another channel if all predefined conditions are met.
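A router of this kind might be sketched as follows. The class `MessageRouter`, its methods and the dictionary-based message format are hypothetical; the important properties are that the message is forwarded unchanged and only when a predefined condition matches.

```python
import queue


class MessageRouter:
    """Forwards each message, unchanged, to the first channel whose condition matches."""

    def __init__(self):
        self._routes = []  # list of (condition, output channel)

    def add_route(self, condition, channel):
        self._routes.append((condition, channel))

    def route(self, message):
        for condition, channel in self._routes:
            if condition(message):
                channel.put(message)  # the router never modifies the message
                return True
        return False  # no condition met: the message is not forwarded


orders = queue.Queue()
invoices = queue.Queue()

router = MessageRouter()
router.add_route(lambda m: m["type"] == "order", orders)
router.add_route(lambda m: m["type"] == "invoice", invoices)

router.route({"type": "order", "id": 1})
router.route({"type": "invoice", "id": 2})
was_forwarded = router.route({"type": "unknown", "id": 3})
print(orders.qsize(), invoices.qsize(), was_forwarded)
```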

The following figure lists some specific message router types.

Messaging Design Patterns - This diagram shows the different patterns of message routing. — Messaging Patterns – Message Routing Patterns

When do I use message routing and how?

For example, messages can be forwarded to dynamically defined recipients, or message parts can be processed or combined in a differentiated manner.

What are Messaging Channels?

In a messaging system, the exchange of information does not just happen unregulated. The sender transfers the message to a so-called messaging channel and the receiver requests a specific message channel.
In this way, the sender and receiver are decoupled. However, by selecting a specific messaging channel, the sender can determine which application receives the data without knowing that application.

However, the right choice of message channel depends on your architecture. Which channel should be addressed and when?

The following figure lists some such channel types.

Messaging Design Patterns - This diagram shows the different patterns of message channels. — Messaging Patterns – Message Channel Patterns

What are the basic differences between the channel types?

Basically, the channel types can be divided into two main types.

A distinction can be made between a point-to-point channel, i.e. one sender and exactly one receiver, and a publish-subscribe channel, one sender and several receivers.
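The difference between the two channel types can be sketched with two small classes. Both are simplified in-memory stand-ins with hypothetical names, not the API of a real message broker.

```python
import queue


class PointToPointChannel:
    """Each message is consumed by exactly one receiver."""

    def __init__(self):
        self._queue = queue.Queue()

    def send(self, message):
        self._queue.put(message)

    def receive(self):
        # Raises queue.Empty if another receiver already took the message.
        return self._queue.get_nowait()


class PublishSubscribeChannel:
    """Each message is copied to every subscriber."""

    def __init__(self):
        self._inboxes = []

    def subscribe(self):
        inbox = queue.Queue()
        self._inboxes.append(inbox)
        return inbox

    def send(self, message):
        for inbox in self._inboxes:
            inbox.put(message)


p2p = PointToPointChannel()
p2p.send("job-1")
first = p2p.receive()

pubsub = PublishSubscribeChannel()
inbox_a = pubsub.subscribe()
inbox_b = pubsub.subscribe()
pubsub.send("price-update")
print(first, inbox_a.get(), inbox_b.get())
```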

What is a Messaging Endpoint?

In order for a sender or receiver application to connect to the messaging channel, an intermediary must be used. This client is called a messaging endpoint.

The following figure shows the principle of communication via messaging endpoints.

Basic principle of a Message Endpoint

On the sender side, the endpoint accepts the data to be sent, builds a message from it and sends it via a specific message channel. On the receiver side, this message is also received via an endpoint and unpacked again. An application can access several endpoints here. However, an endpoint can only implement one alternative.

The following figure lists some endpoint types.

Messaging Design Patterns - This diagram shows the different patterns of message endpoints. — Messaging Design Patterns – Message Endpoint Patterns

When do I choose which endpoint?

Receiving messages in particular can become difficult and lead to server overload. Therefore, control and possible throttling of the processing of client requests is crucial. A proven means is, for example, the formation of processing queues or a dynamic adjustment of consumers, depending on the volume of requests.
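One way to sketch such throttling is a bounded processing queue that is drained by a fixed number of consumers. The function name, the default limits and the doubling used as placeholder work are assumptions for this example.

```python
import queue
import threading


def process_with_throttling(requests, workers=2, max_pending=3):
    """Process requests through a bounded queue drained by a few consumers."""
    pending = queue.Queue(maxsize=max_pending)
    results = []
    lock = threading.Lock()

    def consume():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work for this consumer
                return
            with lock:
                results.append(item * 2)  # placeholder for the real processing

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for request in requests:
        pending.put(request)  # blocks while the queue is full (back pressure)
    for _ in threads:
        pending.put(None)  # one sentinel per consumer
    for thread in threads:
        thread.join()
    return sorted(results)


print(process_with_throttling(range(10)))
```

The `maxsize` limit slows the producer down instead of letting requests pile up, and the `workers` parameter is the place where the number of consumers could be adjusted to the request volume.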

What is Message Transformation?

If the data format has to be changed when data is exchanged between two applications, a so-called message transformation ensures that the message channel is formally decoupled.

This translation process can be understood as two systems running in parallel. The actual message data is separated from the metadata.
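A message translator could be sketched like this. The two body formats (a single `name` field and an amount in euros on the sender side, split names and an amount in cents on the receiver side) are invented for the example; the header with the metadata is passed through untouched.

```python
def translate(message):
    """Convert the sender's body format into the format the receiver expects."""
    body = message["body"]
    first_name, _, last_name = body["name"].partition(" ")
    return {
        "header": dict(message["header"]),  # metadata is copied unchanged
        "body": {
            "first_name": first_name,
            "last_name": last_name,
            "total_cents": round(body["total_eur"] * 100),
        },
    }


incoming = {
    "header": {"message_id": "m-1"},
    "body": {"name": "Ada Lovelace", "total_eur": 12.5},
}
outgoing = translate(incoming)
print(outgoing)
```

Because the translation sits between the two applications, neither of them has to know the other's data format.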

The following figure shows some message transformation types.

Messaging Design Patterns - This diagram shows the different patterns of message transformation. — Messaging Design Patterns – Message Transformation Patterns

How do I monitor my messaging system and keep it running?

A flexible messaging architecture unfortunately leads to a certain degree of complexity on the other side. Especially when it comes to integrating many message producers and consumers decoupled from each other in a messaging system, with partly asynchronous messaging, monitoring during operation can become difficult.

For this purpose, system management patterns have been developed to provide the right monitoring tools. The main goal is to prevent bottlenecks and hardware overloads in order to guarantee the smooth flow of messages.

The following figure shows some test and monitoring patterns.

Messaging Design Patterns - This diagram shows the different patterns of message monitoring. — Messaging Design Patterns – Message Monitoring Patterns

What are the basic systems?

With a typical system management solution, for example, the data flow can be controlled by checking the number of data sent and received, or the processing time.

In contrast, other patterns check the actual message content.
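Checking the data flow by counting messages and measuring time can be sketched with a channel wrapper that records a few statistics. `MonitoredChannel` and its attributes are hypothetical names, and the measured time covers waiting plus handling in this simplified version.

```python
import time
from collections import deque


class MonitoredChannel:
    """In-memory channel that counts messages and tracks time spent per message."""

    def __init__(self):
        self._messages = deque()
        self.sent = 0
        self.received = 0
        self.total_seconds = 0.0

    def send(self, message):
        self._messages.append((time.perf_counter(), message))
        self.sent += 1

    def receive(self, handler):
        enqueued_at, message = self._messages.popleft()
        result = handler(message)
        # Time from enqueueing until the handler has finished.
        self.total_seconds += time.perf_counter() - enqueued_at
        self.received += 1
        return result

    @property
    def backlog(self):
        """Messages sent but not yet processed; a growing value hints at a bottleneck."""
        return self.sent - self.received


monitored = MonitoredChannel()
monitored.send("a")
monitored.send("b")
first_result = monitored.receive(str.upper)
print(first_result, monitored.backlog, monitored.total_seconds)
```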

Software Design Patterns – A COMPLETE GUIDE

January 13, 2021 / RainerGewalt / 1 Comment

Software Design Patterns – This article is intended to explain the concept of design patterns in a simplified way and to give you an overview of the individual major groups.

Software architecture can be compared to the architecture of a house. Application development, too, consists of planning, designing and constructing a meaningful, stable structure.

During implementation, it is really only about problem definition and solution with the tools given to you. Many of the steps are repetitive and follow routine patterns. The experience of the user or architect plays a major role here.
What do I apply when and how?

What are Software Design Patterns?

For many processes, there are already very optimized, proven templates that can be reused. Through these so-called design patterns, it is therefore possible to indirectly access the experience of others. The concept goes back to the architect Christopher Alexander and was subsequently used by computer scientists as a basis for conceptual design in software architecture.

These patterns are categorized on the basis of their characteristics in so-called design pattern catalogs and logically grouped to create a certain clarity. These characteristics can be, for example, similarities between patterns, their applicability, or their consequences. Much of the literature deals with this classification topic. The categorizations shown in the following figure may therefore differ depending on the point of view.

This diagram shows the 4 Important software design patterns. — 4 Important Software Design Patterns.

Creational Patterns

The Creational Design Patterns deal with object and class creation. How can object creations be inherited from other objects and to what extent can classes be instantiated by subclasses? How are these instantiations created and linked?

These patterns provide object creation mechanisms that control how objects are created, so that each object is created in a way that suits the respective situation. Flexibility and reusability are the intended goals here.
Thereby the construction is separated from the concrete implementation.
In the following scheme some patterns, which are to be assigned to the creational patterns, are represented.

Software Design Patterns – Creational Patterns examples
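As one possible illustration of a creational pattern, the following sketch uses a simple factory function. The exporter classes and the `create_exporter` function are invented for this example; the calling code asks for an object by kind and never names a concrete class.

```python
import json
from abc import ABC, abstractmethod


class Exporter(ABC):
    """Common interface that all created objects share."""

    @abstractmethod
    def export(self, data):
        ...


class JsonExporter(Exporter):
    def export(self, data):
        return json.dumps(data, sort_keys=True)


class KeyValueExporter(Exporter):
    def export(self, data):
        return ",".join(f"{key}={value}" for key, value in sorted(data.items()))


_EXPORTERS = {"json": JsonExporter, "kv": KeyValueExporter}


def create_exporter(kind):
    """Factory: callers choose a kind, not a concrete class."""
    try:
        return _EXPORTERS[kind]()
    except KeyError:
        raise ValueError(f"unknown exporter kind: {kind}") from None


data = {"b": 2, "a": 1}
print(create_exporter("json").export(data))
print(create_exporter("kv").export(data))
```

New exporter types can be registered in one place without touching the code that uses them, which is the flexibility and reusability the pattern group aims for.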

Structural Design Patterns

How do I create large, cohesive, yet efficient structures? How do I properly optimize the interaction of my entities? Structural Design Patterns should help with these questions and standardize the composition of objects and classes. So the focus here is on establishing individual relationships.
The following figure shows some of the patterns assigned here.

Software Design Patterns - This scheme shows some Structural Patterns examples — Software Design Patterns **-Structural Patterns examples**

Often the goal is to simplify inheritance hierarchies or to avoid them altogether. For example, objects can be arranged in a tree structure in which they all use the same interface, or general properties can be moved to a single object that is then shared by all other objects. Pipelines can be built and process chains can be formed.
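
The tree structure with a shared interface mentioned above is commonly known as the Composite pattern. The following is a minimal sketch under that assumption. The names `Node`, `File` and `Folder` are made up for this example.

```python
from abc import ABC, abstractmethod


class Node(ABC):
    """Common interface for single objects and for containers."""

    @abstractmethod
    def size(self):
        ...


class File(Node):
    def __init__(self, name, size_bytes):
        self.name = name
        self._size = size_bytes

    def size(self):
        return self._size


class Folder(Node):
    def __init__(self, name):
        self.name = name
        self.children = []

    def add(self, child):
        self.children.append(child)
        return self  # returning self allows chained calls

    def size(self):
        # A folder answers the same question by delegating to its children.
        return sum(child.size() for child in self.children)
```

Because `File` and `Folder` share one interface, client code can call `size()` on any node without knowing whether it is a leaf or a whole subtree.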

Behavioral Patterns

In addition to the efficient assignment and allocation of entities, communication must also be optimized. At this level, the messages exchanged between objects also describe a flow of control. This behavior can be very complex and difficult to grasp, and it is determined by how the individual objects are connected to each other.

So how are responsibilities distributed? Behavioral patterns are intended to help increase the flexibility of the software in terms of its behavior in carrying out this communication.
The following diagram shows some of the patterns that belong to the behavioral patterns.

Software Design Patterns - This scheme shows some Behavioral Patterns examples — Software Design Patterns – **Behavioral Patterns examples**

For example, inheritance can be used to distribute behavior between classes: a base class defines the skeleton of an algorithm that calls operations in a predefined order, and subclasses implement the individual steps.
The behavior of objects can also be encapsulated instead of being distributed across classes. Another behavioral approach is the observer pattern, in which dependent objects are notified automatically when the object they observe changes.
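
As an illustration of the observer idea, here is a small sketch in Python. The class name `Subject` and its methods are assumptions made for this example. Observers are plain callables here, which is only one of several common ways to model them.

```python
class Subject:
    """Keeps a list of observers and informs them about every event."""

    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def unsubscribe(self, callback):
        self._observers.remove(callback)

    def notify(self, event):
        # Iterate over a copy so observers may unsubscribe while being notified.
        for callback in list(self._observers):
            callback(event)
```

The subject does not need to know what its observers do with an event. It only knows that they want to be called.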

Concurrency Patterns

Just as computations can be executed at the same time, i.e. in parallel, programs can also be structured for parallel execution.
Whole program instances can be encapsulated as processes and run in isolation, or a program can be divided into several threads, which all access the same memory area but can still work in parallel.
Which pattern can be used where depends on the workload conditions and must be carefully coordinated to avoid load peaks effectively. The following diagram shows some examples of concurrency patterns.
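
Threads that share one memory area need coordination. The following sketch uses Python's standard `threading` module and a lock to protect a shared counter. The function name `count_in_threads` and its parameters are hypothetical and chosen for this example only.

```python
import threading


def count_in_threads(n_threads=4, increments=10_000):
    """Increment one shared counter from several threads."""
    counter = {"value": 0}      # shared memory, visible to all threads
    lock = threading.Lock()     # guards the read-modify-write step

    def worker():
        for _ in range(increments):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]
```

The lock makes sure that no update is lost, at the price of serializing the critical section. Isolated processes avoid shared state entirely, but then data has to be passed between them explicitly.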

Conclusion

Since not every problem solution has to be developed by oneself, strategically applied design patterns can save time and resources. They can ensure that programs run effectively. A certain standardization is created. This is especially important for cross-team development. A software product is thereby uniformly and coherently conceived and implemented.

Nevertheless, these templates are often criticized. Why is that?
A decisive factor is that design patterns must not be seen as an all-purpose solution. The individual templates must be understood by the developer in order to use them efficiently. Does the template fit my problem 100 percent, or am I creating extra work again?

Design patterns allow you to access the experience of others, but require your own experience in working with these solutions.

If you are interested in more architectural thinking, we have put together an article on another interesting software design approach, Domain-Driven Design.

4 Index Data Structures a Data Engineer Must Know

January 2, 2021 / RainerGewalt / 1 Comment

In this article we will explain what index data structures are and introduce you to some popular structures.

In today’s world, ever-increasing amounts of data are being processed. The data can be used to derive business strategies in a commercial context, but also to gain valuable information about all scientific disciplines. The data obtained must be saved, ideally as raw data, and stored for future analysis.

At the time of creation, it is not yet possible to estimate what information might be valuable at some point. So any reduction in data ultimately represents a loss. Huge amounts of data accumulate every second, and managing them is an immense task for today’s hardware and software. Mathematical tricks have to be used to optimize search mechanisms and storage functions.

Index data structures allow you to find data in a large data collection much faster. Instead of scanning the data sequentially, a so-called index data structure is used to look up a specific data record in the data set based on a search criterion.

What are Index Data Structures in Databases?

You have probably heard about indexing in connection with databases. Here, too, an index structure is formed, independent of the data structure, which accelerates the search for certain fields. This structure consists of references, which define an order relation to the table columns. Based on these pointers, the database management system can then find the data using a search algorithm.

schematic representation of index data structures in databases — Index Data Structures in databases

However, indexing is a very complex scientific field. Queries are constantly being made more efficient and optimized. Thus, the approaches are diverse and very mathematical. This article will give you an overview of popular index data structures and help you to optimize your data pipelines.

Index data structure types

There are many different indexing methods. They are all based on different mathematical assumptions. You should understand these assumptions and choose a suitable system according to your data properties.
In the following scheme you can see some structure types you have to distinguish between, depending on the data you want to index.

index structures 1 — index data structure types

The most important distinction, however, is whether you want to index one-dimensional or multidimensional data relationships. This means that you have to differentiate whether there is a common feature or several related but independent features.
In the following figure, we have classified the individual index structures according to their dimension coverage.

we have classified the individual index structures according to their dimension coverage. — individual index structures according to their dimension coverage

Which index structure you ultimately choose depends on many factors and should be weighed up well in advance, especially with large data sets.

Popular index data structures you should know

In the following, we will introduce you to some of the most popular indexing methods in detail. Because here, too, the key to success lies in understanding your tools and using them correctly at the right moment.

What is Hashing?

If you want to search for a value in an unsorted array, a linear search is too time-consuming.
With the so-called hashing method, a hash value is used to identify an object. It is calculated from the key by a hash function and determines the storage location in an array of indices, the so-called hash table. This means that you use this function to compute a storage location in the table from a key.
The following figure shows the hash function flow again.

schematic representation of the hash function sequence in detail — hash function sequence

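The important basic assumptions, however, are that the function always returns a number for an object, that two identical objects always have the same number, and that two unequal objects do not necessarily have different numbers. Two different keys can therefore land in the same storage location, a so-called collision, which the hash table has to resolve.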
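
The following sketch shows how these assumptions play out in a small hash table with separate chaining, one common way to handle collisions. The class `HashTable` is written for this example and relies on Python's built-in `hash()`; it is not meant as a replacement for `dict`.

```python
class HashTable:
    """Hash table that resolves collisions by chaining entries in buckets."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # The hash function maps any key to a bucket number.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # same key: overwrite
                return
        bucket.append((key, value))        # new key, possibly a collision

    def get(self, key, default=None):
        for existing_key, value in self._buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default
```

With only a few buckets, several keys share one bucket. The lookup still works because the keys inside a bucket are compared directly.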

What is a Binary tree?

A so-called binary tree is a data structure in which each element, also called node, has a maximum of two successors. The addresses of the subordinate nodes are kept track of by pointers. It is often used when data is to be stored in RAM.
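
For indexing, binary trees are usually kept ordered as binary search trees. The sketch below assumes this ordering (smaller keys to the left, larger keys to the right) and does not rebalance the tree. All names are chosen for this example.

```python
class TreeNode:
    """A node with at most two successors, referenced by left and right."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert a key and return the (possibly new) root of the subtree."""
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored


def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


def in_order(root):
    """Return all keys in ascending order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)
```

Each comparison discards one subtree, so a lookup in a reasonably balanced tree needs far fewer steps than a linear scan.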

What is a B-tree?

The B-tree is often used in databases and file systems, i.e. for storage on the hard disk. The tree is sorted and completely balanced. The data is stored sorted by keys. The keys are stored in its internal nodes, but need not be stored in the records at the leaves. Create, read, update and delete (CRUD) operations run in logarithmic time.

The B-tree is classified into different types according to its properties.
In the B+ tree, only copies of the keys are stored in the internal nodes. The keys are stored together with the data in the leaves. To speed up sequential access, the leaves also contain pointers to the next leaf node and are thus chained together.
In the following scheme you see a basic B+ tree structure.

Basic representation of a b+ tree and its components — Basic b+ tree structure
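
To show why the chained leaves are useful, here is a strongly simplified, static sketch with a single internal level. It is built once from sorted data and supports neither inserts nor node splits, so it is not a complete B+ tree. The names `Leaf`, `build`, `find_leaf` and `range_scan` are assumptions for this example.

```python
from bisect import bisect_right


class Leaf:
    """Leaf node: holds keys with their data and a pointer to the next leaf."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next = None


def build(sorted_items, leaf_size=3):
    """Build the leaves and one internal node from non-empty, sorted (key, value) pairs."""
    leaves = []
    for i in range(0, len(sorted_items), leaf_size):
        chunk = sorted_items[i:i + leaf_size]
        leaves.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # The internal node only stores copies of keys as separators.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return separators, leaves


def find_leaf(separators, leaves, key):
    return leaves[bisect_right(separators, key)]


def range_scan(separators, leaves, low, high):
    """Collect all pairs with low <= key <= high by following the leaf chain."""
    leaf = find_leaf(separators, leaves, low)
    result = []
    while leaf is not None:
        for key, value in zip(leaf.keys, leaf.values):
            if key > high:
                return result
            if key >= low:
                result.append((key, value))
        leaf = leaf.next
    return result
```

A range query descends the tree only once and then walks from leaf to leaf, which is the sequential access the leaf pointers are meant to speed up.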

The B* tree is an index structure where non-root nodes must be at least 2/3 filled. This is achieved by a modified split strategy.
In addition to indexing, partitioning also offers you the possibility of strongly optimizing the data search within a database. In this article we introduce you to this technique.

What is a SkipList?

The SkipList resembles in its structure a linked list consisting of containers, which contain the data with a unique key and a pointer to the following container. In a SkipList, however, the containers have different heights and can contain pointers to containers that do not follow directly. The idea is to speed up the search by additional pointers.

schematic representation of an index structure of the SkipList — Schematic representation of a SkipList

Calculation of the container height

Nodes carry pointers on several levels, which makes it possible to skip keys during a search. The height of each list element is either assigned according to a regular scheme or chosen at random according to mathematical rules. Search performance therefore depends either on how the list was built or, in the randomized case, evens out on average over the list.
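
The randomized variant can be sketched as follows. The container height is drawn by repeated coin flips with probability `p`, which is a common choice but an assumption here, as are the class names and the `MAX_HEIGHT` limit.

```python
import random


class SkipNode:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height   # one pointer per level


class SkipList:
    MAX_HEIGHT = 16

    def __init__(self, p=0.5, seed=None):
        self._p = p
        self._rng = random.Random(seed)
        self._head = SkipNode(None, None, self.MAX_HEIGHT)
        self._height = 1

    def _random_height(self):
        # Each additional level is reached with probability p.
        height = 1
        while height < self.MAX_HEIGHT and self._rng.random() < self._p:
            height += 1
        return height

    def insert(self, key, value):
        update = [self._head] * self.MAX_HEIGHT
        node = self._head
        for level in range(self._height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node          # last node before key on this level
        candidate = node.forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value       # existing key: overwrite
            return
        height = self._random_height()
        self._height = max(self._height, height)
        new_node = SkipNode(key, value, height)
        for level in range(height):
            new_node.forward[level] = update[level].forward[level]
            update[level].forward[level] = new_node

    def get(self, key, default=None):
        node = self._head
        for level in range(self._height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]
        if node is not None and node.key == key:
            return node.value
        return default
```

The search starts on the highest level, where pointers skip many containers, and drops down one level whenever the next key would be too large.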


Apache Avro – Effective Big Data Serialization Solution for Kafka

November 15, 2020 / RainerGewalt / 0 Comments

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.

You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data cost-efficiently, Avro can now be used independently of Hadoop.

You can access Avro via dedicated APIs from many common programming languages (Java, C#, C, C++, Python and Ruby), so you can integrate it very flexibly.

In the following figure we have summarized some of the reasons that make the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it — Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. The data hierarchy, i.e. the metadata, is stored separately from the values. The data types and protocols are defined in JSON. The schema can be assigned unambiguously to the actual values by ID and retrieved whenever it is needed for further processing. For RPC calls, the schema is exchanged along with the data.
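
The effect of keeping the metadata out of the payload can be illustrated with a toy example that uses only the Python standard library. The record schema below follows the JSON style of Avro schemas, but the record name, the field names and the `struct`-based byte layout are assumptions for this sketch. This is not Avro's actual binary encoding and does not use the Avro library.

```python
import json
import struct

# Avro-style record schema written as JSON (hypothetical record and fields).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "temperature", "type": "double"}
  ]
}
"""
SCHEMA = json.loads(SCHEMA_JSON)

# Toy mapping from schema types to struct format codes (NOT Avro's encoding).
_STRUCT_CODES = {"int": "i", "long": "q", "double": "d"}


def _layout(schema):
    return "<" + "".join(_STRUCT_CODES[f["type"]] for f in schema["fields"])


def encode(record, schema):
    """Write only the values, in schema order. Field names stay in the schema."""
    values = [record[f["name"]] for f in schema["fields"]]
    return struct.pack(_layout(schema), *values)


def decode(payload, schema):
    """Use the schema to give the raw values their names back."""
    values = struct.unpack(_layout(schema), payload)
    return {f["name"]: v for f, v in zip(schema["fields"], values)}
```

For a record such as `{"sensor_id": 7, "temperature": 21.5}`, the encoded payload contains no field names at all, so it is considerably smaller than the same record written as JSON text. The reader can only interpret it with the schema at hand.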

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

When you let Avro manage your schemas, it provides you with comprehensive, flexible and automatic schema evolution. This means that you can add and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schemas can be backward and forward compatible. This means that the schema versions of the reader and writer can differ. Other schema-based serialization solutions exist, including Google Protocol Buffers and Apache Thrift. However, its JSON-defined schemas make Avro a popular choice.
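
How differing reader and writer schemas can still fit together is sketched below in plain Python. The function `resolve` is hypothetical and only mimics the basic rules: fields unknown to the reader are dropped, and fields missing in the data are filled from the reader's default. The Avro libraries perform this resolution for you and cover many more cases.

```python
def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]   # field added in a newer schema
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Two versions of a hypothetical schema: v2 adds a field with a default.
SCHEMA_V1 = {"type": "record", "name": "SensorReading", "fields": [
    {"name": "sensor_id", "type": "int"},
]}
SCHEMA_V2 = {"type": "record", "name": "SensorReading", "fields": [
    {"name": "sensor_id", "type": "int"},
    {"name": "unit", "type": "string", "default": "C"},
]}
```

Reading old data with the new schema fills in `unit` from the default, and reading new data with the old schema simply ignores `unit`. This is why added fields should come with a default value.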

IaaS vs PaaS vs SaaS – The Various Facets of Cloud Computing

November 11, 2020 / RainerGewalt / 1 Comment

IaaS vs PaaS vs SaaS – terms that categorize clouds, but what exactly do they mean? In this article, we contrast all three and explain the differences.

In almost all areas, the cloud is becoming more and more important. Increasingly, the cloud is also becoming interesting for business processes. Everyone is talking about it, but what is it actually?

What is the cloud anyway?

The cloud basically means the use of remote servers. This means that your data can be hosted online, i.e. stored, managed and processed.
So you don’t have to provide the appropriate hardware on site, but can rent these resources from a cloud provider. Read our article about the cloud computing provider AWS.
Besides Amazon, other global players such as Google (Google Cloud) and Microsoft (Azure) also offer profitable cloud resources.
But which ones are suitable for me or my company? To meaningfully compare the individual solutions, you need to understand the differences between them.
Basically, you need to distinguish between the three categories already mentioned.

IaaS vs PaaS vs SaaS - This figure shows the 3 cloud categories — IaaS vs PaaS vs SaaS

IaaS vs PaaS vs SaaS – What are the Differences?

First and foremost, all three terms are used to describe a resource provided by a cloud service provider for a short period of time.
The following figure shows this “as-a-service”, or flexible consumption, model and the management components.

IaaS vs PaaS vs SaaS - This diagram shows the distribution of tasks between providers and customers in the individual cloud categories depending on the service layer model. — **Red**: managed by others; **Green**: managed by your organization

You can see very clearly here that the cloud provider manages more and more layers, ascending from IaaS to SaaS.

Software as a Service (SaaS)

The abbreviation SaaS refers to cloud-based software. This is hosted online by a company and provided via the Internet. It is easy to use and manage. Additionally, it is highly scalable, meaning it can be used for an entire organization.

Platform as a Service (PaaS)

PaaS is used to describe a cloud-based platform service. This offers developers an online platform for application development. Data is provided, stored and managed online.

Infrastructure as a Service (IaaS)

IaaS refers to cloud-based infrastructure resources provided via virtualization technologies. These services are designed to help companies build and manage their servers, networks, operating systems and data storage. This is where the highest administrative share lies with the customer. Access to the servers for data management takes place via a dashboard or API.

IaaS vs PaaS vs SaaS – For whom is which category suitable?

So who should choose which service model? The following figure shows that the more tasks are taken over by the provider, the more control is relinquished. This is especially detrimental in organizations where a lot of control is needed.

IaaS vs PaaS vs SaaS - Presentation of the individual services depending on the control and for whom they are suitable. — Services depending on the control

IaaS gives administrators the direct control over operating systems that they need. However, more control always comes with more complicated administration tasks. PaaS therefore offers users a certain compromise between flexibility and ease of use. This model is particularly appealing to developers.
The SaaS model offers the highest level of usability and is accordingly interesting for customers who want to take over few or no administrative tasks.

IaaS vs PaaS vs SaaS – Technology of the future?

Cloud resources can be a valuable alternative to expensive, in-house hardware solutions. Of course, with external administration, a company loses control over its own data. However, the different types of service mean that compromises can be made that are tailored to the company’s own needs.

The advantages are obvious. Individual services can be accessed from virtually anywhere at any time, and high-performance computing can be operated cost-effectively. As network technologies become faster and faster, these solutions are increasingly coming into focus and will certainly become more and more important for companies and private individuals in the coming years.

ksqlDB – Efficient real-time stream transformation of data within Kafka’s data pipelines

November 1, 2020 / RainerGewalt / 0 Comments

ksqlDB vs Kafka streams – Data streams are all the rage right now: a technique to move and process huge amounts of data simultaneously without caching it.

What is Apache Kafka?

With the message broker Kafka, data can be stored resource-efficiently as logs in so-called topics. Any number of clients, primarily microservices, can then subscribe to these topics and write to them.
The metadata is stored externally in a schema registry and reassigned to the data via an ID when it is read. In this way, each microservice can be developed independently of technology and programming language. The data structure remains the same.

However, if a microservice wants to access the data streams from two or more topics and these arrive with different frequencies, then the correct allocation of the data is often difficult. The so-called data stream position can be controlled with event streaming databases.

What is ksqlDB?

Especially for Apache Kafka, ksqlDB allows easy transformation of data within Kafka’s data pipelines.

The following figure shows what a software architecture with Apache Kafka and ksqlDB could look like. It is still possible to subscribe to the data streams directly from the message broker, or indirectly via ksqlDB using pull and push queries. The communication between the tables and Kafka is handled directly by Confluent’s event streaming platform.

The figure shows how a software architecture with Apache Kafka and ksqlDB could look like. — software architecture with Apache Kafka and ksqlDB

It can be used to materialize views asynchronously using interactive SQL queries.
So with this, microservices can enrich the data and transform it in real time.
This enables anomaly detection, real-time monitoring, and real-time data format conversion.

Event Streaming

ksqlDB is an event streaming database. Thus, it is based on continuous streams of structured event data that can be published to multiple applications in real time. The following figure shows such an event stream schematically.

ksqlDB vs Kafka streams- The figure shows such an event stream schematically. — event stream

Each individual record always consists of an event and a unique key for identification.
These event streams can be combined with streaming analytics and are a way to offload work to back-end processing applications. If you want to know more about messaging patterns and how a message is transmitted between sender and receiver, read our article.

Window-based Query Processing

ksqlDB allows continuous stream queries. These are based on window-based aggregation of events.

Windows are polling intervals that are continuously executed over the data streams. These windows can be expanded and moved as needed to handle new incoming data items.
Several window types are shown in the figure below. They differ in their composition to each other.

ksqlDB - Several window types are shown in the figure. They differ in their composition to each other. — window types

The “Tumbling” type repeats a non-overlapping interval, while the “Hopping” type allows overlaps. In a “Session” window, the elements are grouped by periods of activity without overlaps. A session is terminated when no elements are received for a certain time.
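
The three window types can be imitated in a few lines of plain Python to see how events are assigned to them. This sketch does not use ksqlDB. The function names, the integer timestamps and the rule that windows are aligned to timestamp 0 are assumptions made for this example. The overlapping type is called a hopping window here.

```python
from collections import Counter


def tumbling_window(ts, size):
    """Return the single window [start, end) that contains the timestamp."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts, size, advance):
    """Return every window of the given size, started every `advance`, containing ts."""
    windows = []
    start = ts - ts % advance
    while start > ts - size:
        if start >= 0:
            windows.append((start, start + size))
        start -= advance
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions that end after `gap` without events."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def count_per_tumbling_window(timestamps, size):
    """Window-based aggregation: number of events per tumbling window."""
    return Counter(tumbling_window(ts, size) for ts in timestamps)
```

An event belongs to exactly one tumbling window, but to several hopping windows whenever the advance is smaller than the window size. Session boundaries depend only on the pauses in the data.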

ksqlDB Features

In addition to continuous queries through window-based aggregation of events, ksqlDB offers many other features that are helpful in dealing with streams. For example, the last value of a column can be tracked when aggregating events from a stream into a table.

Multiple streams can be merged by real-time joins or transformed in real time. In doing so, the database is distributed, fault-tolerant and scalable.
The Kafka Connect connectors can be executed and controlled directly.
Push and pull queries can be applied to the streams. With a push query, subscribers receive the continuously updated results of a query; with a pull query, they retrieve the current data at a specific time in a request/response flow.
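
The difference between the two query styles, and the tracking of the last value per key, can be sketched in plain Python. The class `MaterializedView` and its methods are hypothetical and only model the idea. They are not part of the ksqlDB API.

```python
class MaterializedView:
    """Keeps the latest value per key from a stream of (key, value) events."""

    def __init__(self):
        self._latest = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Push style: the callback is invoked on every future update."""
        self._subscribers.append(callback)

    def apply(self, key, value):
        """Consume one event from the stream and update the table."""
        self._latest[key] = value
        for callback in self._subscribers:
            callback(key, value)

    def pull(self, key):
        """Pull style: return the current value once, like a request/response."""
        return self._latest.get(key)
```

A subscriber sees every change as it happens, while a pull only returns the state at the moment of the request.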

Conclusion

With Confluent’s event streaming database ksqlDB, a service is provided that offers an absolutely compatible solution for real-time data stream processing with Kafka. Kafka in particular lends itself as a central element in a microservice-based software architecture. Microservices run as separate processes and consume in parallel from the message broker. Aligning these processes remains a challenge. However, ksqlDB ensures real-time stream processing within the services.

Apache Mahout – A Powerful Open Source Machine Learning Project

October 18, 2020 / RainerGewalt / 0 Comments

Apache Mahout is a powerful machine learning tool that comes with a seamless compatibility to the strong big data management frameworks from the Apache universe. In this article, we will explain the functionalities and show you the possibilities that the Apache environment offers.

What is Machine Learning?

Machine learning algorithms provide lots of tools for analyzing large unknown data sets.
The art of data science is to extract the maximum amount of information depending on the data set by using the right method. Are there patterns in the high-dimensional data relationships, and how can they be represented in a low-dimensional way without much loss of information?

scikitLearn ml — Fields of machine learning

A failure often contains as much information as a case in which the algorithm was able to create groupings successfully.
It is important to understand the mathematical approaches behind the tools in order to draw conclusions about why an algorithm did not work.
If you don’t know the basic machine learning categories, it’s best to read our article on the subject first.

Machine Learning and Linear Algebra

Most machine learning methods are based on linear algebra.
This mathematical subfield deals with vector spaces and the linear mappings between them.
Knowing these rules is the key to understanding machine learning algorithms correctly.

What is Apache Mahout

Apache Mahout is an open source machine learning project that builds implementations of scalable machine learning algorithms with a focus on linear algebra. If you’re not sure what Apache is, check out this article, where we introduce you to Apache and its main projects.

Mahout was first released in 2009 and has since been continuously extended and kept up to date by a very active community.
Originally, it contained scalable algorithms closely related to Apache Hadoop and MapReduce.
However, Mahout has since evolved into a backend independent environment. That is, it operates on non-Hadoop clusters or single nodes.

Features

The math library is based on Scala and provides an R-like Domain Specific Language (DSL). Mahout is usable for Big Data applications and statistical computing. The figure below lists all machine learning algorithms currently offered by Mahout.

The figure below lists all machine learning algorithms currently offered by Apache Mahout. — Implemented mathematical functions and algorithms

The algorithms are scalable and cover both supervised and unsupervised machine learning methods, such as clustering algorithms.

Apache Mahout covers a large part of the usual machine learning tools. This means that data can be analyzed without having to change frameworks. This is a big plus for maintaining compatibility in the application.

Apache Ecosystem

The framework integrates seamlessly into the Apache Ecosystem. This means that an application can access the entire power of the data processing platforms and build very high-performance big data pipelines. The following figure shows the Apache data management ecosystem.

Through connectivity to Apache Flink, stream data analysis pipelines can be built, or, with Hive, SQL-like queries can be automatically converted into MapReduce, Tez or Spark jobs.