EXPERT KNOWLEDGE AT A GLANCE

ERP vs MES vs PLM vs ALM – What role will they play in industry 4.0?

ERP vs MES vs PLM vs ALM – These terms are being mentioned more and more often in connection with Industry 4.0. But what is behind these systems and what are the differences?

This scheme gives an example of a business process pyramid
Example of a business process pyramid

To stay competitive in today’s world, you need to increase the efficiency of your business processes. It is important that you optimally plan, control and manage your operational resources (capital, personnel…).

Your goal should be to deliver consistently high quality with high productivity and short lead times.
Many of your business processes create ever larger amounts of data and increase in complexity. You need to reduce this complexity and increase your flexibility.
Many software solutions are available to your company for the optimal use of resources.

What is an ERP?

Basically, an ERP system is an IT-supported system of software solutions that communicate with each other. Your data is stored centrally and should represent your company in its entirety through quickly available information.

This scheme gives you an overview of ERP systems
ERP vs MES vs PLM vs ALM – Overview ERP Systems

The information of your business processes is optimized and documented.
The trend is towards web-based applications.

This means that you access the system interface via your browser and that you can also access it beyond the boundaries of your company. Another advantage is that you don’t have to install any services, making you hardware-independent.

What are ERP Subsystems?

You can use ERP systems in all areas of your business. They provide you with complete solutions for all necessary subsystems.

This scheme shows the ERP features
ERP vs MES vs PLM vs ALM – ERP features

Complex systems are divided into so-called application modules, which you can combine with each other as you wish. These fulfill various tasks for the provision and further processing of information. In this way, you can put together your ERP system according to your requirements and adapt it to the size of your company.

What are the Advantages of Cloud ERP?

ERPs can also be purchased as a complete Software-as-a-Service (SaaS) solution.

This scheme shows the ERP Cloud Advantages
ERP vs MES vs PLM vs ALM – ERP Cloud Advantages

These are completely industry- and hardware-independent. As a user, you can access a sophisticated ERP software package online and thus from anywhere. This gives you absolute spatial flexibility. However, cloud ERP solutions are still quite new and not yet fully mature. So you should weigh up well in advance whether you want to use a cloud application.

What is an MES?

A Manufacturing Execution System (MES) is the operational, process-related layer of a multi-level overall system. It is responsible for real-time production management and control. You can use MES data to optimize manufacturing processes and detect errors during production.

The MES is subordinate to the ERP system, which accesses your MES data to plan production and then feeds this information back to your production control for implementation.

Relationship between company levels
ERP vs MES vs PLM vs ALM – Relationship between company levels


In Industry 4.0, the individual components interact ever more closely.

What does the MES include?

MES is usually a multi-layer overall system. It processes your production data into Key Performance Indicators (KPI) and enforces the fulfillment of an existing production plan.

This diagram shows all components of an MES system
ERP vs MES vs PLM vs ALM – MES System features

What is a PLM?

In addition to MES and ERP, the Product Life Cycle Management (PLM) system plays an elementary role in the digitization of your company.

For your company to remain internationally competitive today, you need to optimize your business models so that you can act preventively.

As a manufacturing company, you need to be able to analyze large amounts of data quickly. This way you can recognize deviations from the plan early on and make the right decisions.

Many software solutions help you in all business areas and even exchange data with each other. In this way, you can create information chains within a company and act more quickly. 

A PLM system is a management approach for the seamless integration of all information that accumulates during the life cycle of a product.
The core components of PLM are the data and information related to the product lifecycle.

This scheme shows the production life cycle process
ERP vs MES vs PLM vs ALM – Production life cycle process

A large amount of product-related and time-dependent data is generated along the product life cycle. The PLM enterprise concept is based on coordinated methods, processes and organizational structures and usually makes use of IT systems. PLM tools link design, implementation and production and provide feedback from manufacturing.

this scheme shows PLM main application areas
ERP vs MES vs PLM vs ALM – PLM main areas of application

The goal of a PLM system is the central management of information and corresponding user groups. One advantage here is that you can control the process of editing and distribution throughout the company.

Application Lifecycle Management (ALM) vs PLM System

More and more products and systems now contain a software component. However, since hardware and software development have historically been separate disciplines, you must also differentiate between the management systems.

This schema shows the major differences between ALM and PLM
ALM vs PLM


With PLM you are looking at a physical product, with ALM you are looking at a software product. Basically, however, there are similarities between the two systems. Both also track a product over its entire lifecycle. However, since both product types are increasingly merging today, you can also link both systems on an IT basis at the overall product level.

ERP vs MES vs PLM vs ALM – What does the future hold?

When people talk about Industry 4.0, they are referring to a new level of technological progress. The basis of this innovation is the Internet of Things (IoT). The software solutions of the various company levels are networked to form cyber-physical systems and exchange information with each other in real time. In this way, production planning can take place in management and be implemented directly in production. As production becomes more complex in the future, mastering this complexity and the underlying technologies will require the corresponding know-how.


The software solutions presented here are systems optimized for business areas. Each software system is therefore an expert in its own field. This ensures a decisive modularity for a company’s overall solution. On the other hand, this modularity always leads to increased complexity. In the future, it will become increasingly important to create reciprocal data pipelines, so-called data streams, between the individual systems, which currently still operate very autonomously.

ERP vs MES vs PLM vs ALM - This schema shows their role in Industry 4.0
ERP vs MES vs PLM vs ALM – And their role in Industry 4.0

A decision made at the management level should be implemented in production and at the same time remain controllable at all levels. Ideally, the system should be able to perform its own analyses. AI algorithms can help here to reach sound decisions despite increasing complexity. This allows you to optimize your individual production steps and shorten life cycles.

This schema shows the role of a MES System in Industry 4.0
ERP vs MES vs PLM vs ALM – Industry 4.0 and MES System

The MES, for example, plays an important role here due to its proximity to production. This allows you to make important decisions quickly and implement production plans. In the company of the future, software solutions from various divisions are networked with each other, so you can form information chains, and the MES is part of this network.

Supervised vs Unsupervised vs Reinforcement Learning – The fundamental differences

Supervised vs Unsupervised vs Reinforcement Learning – the three main categories of machine learning. Why these boundaries have been drawn and what they look like is discussed in this article. This knowledge is essential for understanding machine learning correctly and applying it to data in a meaningful way.

This figure contrasts Supervised vs Unsupervised vs Reinforcement Learning.
Supervised vs Unsupervised vs Reinforcement Learning – Overview

Supervised vs Unsupervised vs Reinforcement Learning – Machine Learning Categories

Machine learning is a branch of artificial intelligence. While AI deals with modeling intelligent behavior on the functioning of the human brain, machine learning is a collection of mathematical methods of pattern recognition. If you want to know more about the differences between machine learning, AI and deep learning, read our article on the subject. IT systems should be given the ability to learn automatically from experience and improve. Algorithms play a central role here and can be classified into different learning categories.

The following figures show the three main categories of machine learning methods.

This figure shows Supervised vs Unsupervised vs Reinforcement Learning in the machine learning context.
Supervised vs Unsupervised vs Reinforcement Learning – Machine Learning Context

By now there are many more categories, some of which are hybrids of the main categories. One example is semi-supervised learning. It is certainly a major machine learning topic in its own right, but it is left out here for the sake of simplicity.

What is supervised learning?

In supervised learning, the machine learning algorithm iteratively learns the dependencies between data points. The output to be learned is specified in advance, and the learning process is supervised by matching the predictions against it. The optimized algorithm is then meant to apply the learned patterns to unknown data to make predictions.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the basic principle of supervised learning.
Supervised vs Unsupervised vs Reinforcement Learning – Supervised Learning

Supervised learning methods can be applied to regression problems, such as prediction and trend estimation, as well as to classification problems.

What is supervised classification?

In classification, abstract classes are formed in order to delimit and order data in a meaningful way. For this purpose, objects are grouped on the basis of similar characteristics and structured relative to one another.

Decision trees can be used as prediction models to create a hierarchical structure, or feature values can be assigned as class labels in the form of a vector.

In the following figure the most important supervised classification algorithms are listed.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main algorithms of supervised learning.
Supervised vs Unsupervised vs Reinforcement Learning – Main Algorithms of Supervised Learning.
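To make the supervised idea concrete, here is a minimal sketch of a decision-tree classification, using scikit-learn and its bundled iris data purely as our own illustration (the article itself does not prescribe a library):

    # Minimal supervised classification sketch (our own illustration).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Labeled data: features X and known class labels y supervise the learning.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a decision tree, then apply the learned patterns to unseen data.
    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("accuracy on unseen data:", clf.score(X_test, y_test))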

What is supervised regression?

Supervised regression algorithms, on the other hand, can be used to make predictions and infer causal relationships between independent and dependent variables.
For example, linear regression can be used to fit a straight line to the data.
We have discussed the exact process of linear regression here in this article.
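As a small illustration of supervised regression, the following sketch fits such a straight line with scikit-learn; the synthetic data and the library choice are our own assumptions:

    # Minimal supervised regression sketch (our own illustration).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Independent variable X and dependent variable y with a linear trend plus noise.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(50, 1))
    y = 2.5 * X.ravel() + 1.0 + rng.normal(scale=1.0, size=50)

    # Fit a straight line to the data and predict for new inputs.
    model = LinearRegression().fit(X, y)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("prediction at x=4:", model.predict([[4.0]])[0])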

What is unsupervised learning?

In unsupervised learning, patterns are determined in data without initial patterns or relationships being known.
Especially for complex tasks, these methods can be useful to find solutions that would hardly be attainable by hand; examples are autonomous driving or large biochemical systems with many interactions.
One key to success is a huge data set: the more data is available, the more accurate the resulting models.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the basic principle of unsupervised learning.
Supervised vs Unsupervised vs Reinforcement Learning – Unsupervised Learning

In unsupervised machine learning, two basic principles can be distinguished, which also classify the algorithms used: clustering and dimensionality reduction.

What is unsupervised clustering?

The main goal of unsupervised clustering is to create collections of data elements that are similar to each other, but dissimilar to elements in other clusters. The figure below shows some of the main clustering algorithms.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main algorithms of unsupervised learning.
Supervised vs Unsupervised vs Reinforcement Learning – Main algorithms of unsupervised learning.

The clustering algorithms differ primarily in the cluster creation process, but also in the definition of such clusters. Thus, the relationships between clusters can also be used and hierarchical relationships can be explored.
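As a hands-on illustration of clustering, the following minimal sketch runs k-means on synthetic blob data; the library and data are our own choices, not requirements from the article:

    # Minimal unsupervised clustering sketch with k-means (our own illustration).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unlabeled data: no target values are given to the algorithm.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # k-means groups points so that members of a cluster are similar to each
    # other and dissimilar to points in other clusters.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])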

What is unsupervised dimensional reduction?

With a high number of features, these transformation methods translate high-dimensional relationships into a low-dimensional representation. The goal is to keep the loss of information as small as possible.
The reduction methods can be divided into two main categories: methods from linear algebra and methods from manifold learning.

Manifold learning is an approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that they can learn the dimensionality of the data without a given classification and project it in a low-dimensional way.
For example, from the field of linear algebra, matrix factorization methods can be used for dimensionality reduction.
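The following minimal sketch contrasts the two categories on the same data set, using PCA as a linear-algebra method and Isomap as a manifold-learning method; both choices are our own illustration:

    # Minimal dimensionality-reduction sketch (our own illustration).
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import Isomap

    X, _ = load_digits(return_X_y=True)               # 64 features per sample

    X_pca = PCA(n_components=2).fit_transform(X)      # linear algebra (matrix factorization)
    X_iso = Isomap(n_components=2).fit_transform(X)   # nonlinear manifold learning
    print(X.shape, "->", X_pca.shape, "and", X_iso.shape)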

What is reinforcement learning?

In reinforcement learning, a program, a so-called agent, independently develops a strategy to perform actions in an environment. For this purpose, positive or negative reinforcements are conveyed, which describe the agent's interactions with the environment; in other words, immediate feedback on an executed task. The program should maximize rewards or minimize punishments. The environment is a kind of simulation scenario that the agent has to explore.
The following figure describes the interactions of all components of a reinforcement learning process.

Supervised vs Unsupervised vs Reinforcement Learning - This figure shows the main principle of reinforcement learning.
Supervised vs Unsupervised vs Reinforcement Learning – Main principle of reinforcement learning.

There are two basic types of reinforcement learning, depending on whether a model of the environment is available.
In model-based RL, the agent uses predictions of the environment's response during learning or acting.
If no model is available, the data is generated by trial and error.
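To make the trial-and-error case concrete, here is a minimal tabular Q-learning sketch on a toy five-state chain; the environment, the parameters and the choice of Q-learning itself are our own illustrative assumptions:

    # Minimal model-free RL sketch: tabular Q-learning on a 5-state chain
    # (entirely our own illustration). Reaching the last state yields a reward.
    import random

    n_states, actions = 5, [0, 1]             # 0 = step left, 1 = step right
    Q = [[0.0, 0.0] for _ in range(n_states)]
    alpha, gamma, eps = 0.5, 0.9, 0.2         # learning rate, discount, exploration

    for _ in range(500):                      # episodes of trial and error
        s = 0
        while s < n_states - 1:
            a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[s][x])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
            # Q-learning update: immediate feedback plus discounted future value
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next

    print("learned preference for 'right' in state 0:", Q[0][1] > Q[0][0])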

Messaging Patterns – It is not enough to decide to use a message

Messaging Patterns – What are they? What are their strengths, and why should they only be used with caution? We clarify these questions in this article.

What are Design Patterns?


Technology-independent designs can provide proven pattern solutions in software development, ensuring standardized and robust architecture.
If you’ve never heard of software design patterns, check out this article from us on the subject first.


Design patterns allow a developer to draw on the experience of others. They offer proven solutions for recurring tasks. A one-to-one implementation is not advisable. The patterns should rather be used as a guide.

What is a message?

A basic design pattern is the message. Actually a term that is used by everyone as a matter of course, but what is behind it?


Data is packaged in messages and then transmitted from the sender to the receiver via a message channel. The following figure shows such a messaging system.

Messaging Patterns - This scheme shows the basic concept of a message
Messaging Patterns – Basic Concept of a Message

The communication is asynchronous, which means that both applications are decoupled from each other and therefore do not have to run simultaneously. The sender must build and send the message, while the receiver must read and unpack it.

What are Messaging Patterns?

However, this form of message transmission is only one way of transferring information. The following figure shows the basic concepts of messaging design patterns.

This diagram shows all the basic components of the messaging design patterns
Basic Components of the Messaging Patterns

What is Message Construction?

It is not enough to decide to use a message. A message can be constructed according to different architectural patterns, depending on the functions to be performed.


The following figure shows some of these patterns.

Messaging Design Patterns - This diagram shows the different Patterns of message construction.
Messaging Design Patterns – Message Construction Patterns

Message Construction – When do I use it?

Messaging can be used not only to send data between a sender and receiver, but also to call a procedure or request a response in another application.


With the right message architecture a certain flexibility can be guaranteed. This makes the message much more robust against possible future changes.
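As a small illustration, the following sketch builds messages with a shared envelope of metadata; the field names and the "command"/"document" distinction are our own assumptions, loosely inspired by common construction patterns:

    # Minimal message-construction sketch (our own illustration): the same
    # envelope can carry a command message or a document message.
    import json, time, uuid

    def build_message(kind: str, body: dict) -> str:
        # Header metadata travels with the payload so receivers stay decoupled.
        return json.dumps({
            "id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "kind": kind,              # e.g. "command" or "document"
            "body": body,
        })

    command = build_message("command", {"action": "create_order", "qty": 3})
    document = build_message("document", {"order": {"id": 7, "status": "open"}})
    print(json.loads(command)["kind"], json.loads(document)["kind"])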

What is Message Routing?

A message router connects the message channels in a messaging system. We will come back to this topic later. This router corresponds to a filter, which regulates the message forwarding, but does not change the message. A message is only forwarded to another channel if all predefined conditions are met.


The following figure lists some specific message router types.

Messaging Design Patterns - This diagram shows the different patterns of message routing.
Messaging Patterns – Message Routing Patterns

When do I use message routing and how?

For example, messages can be forwarded to dynamically defined recipients, or message parts can be processed or combined in a differentiated manner.
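A minimal content-based router might look like the following sketch; the channel names and message fields are our own illustrative assumptions:

    # Minimal content-based router sketch (our own illustration): the router
    # inspects a message and forwards it unchanged to a matching channel.
    from queue import Queue

    channels = {"orders": Queue(), "invoices": Queue(), "dead_letter": Queue()}

    def route(message: dict) -> None:
        # Forward only; the router never modifies the message itself.
        target = message.get("type", "dead_letter")
        channels.get(target, channels["dead_letter"]).put(message)

    route({"type": "orders", "payload": {"id": 42}})
    route({"type": "unknown", "payload": {}})
    print(channels["orders"].qsize(), channels["dead_letter"].qsize())  # 1 1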

What are Messaging Channels?

In a messaging system, the exchange of information does not just happen unregulated. The sender transfers the message to a so-called messaging channel and the receiver requests a specific message channel.
In this way, the sender and receiver are decoupled. By selecting a specific messaging channel, the sender can determine which application receives the data without knowing the receiver directly.


However, the right choice of message channel depends on your architecture. Which channel should be addressed and when?

The following figure lists some such channel types.

Messaging Design Patterns - This diagram shows the different patterns of message channels.
Messaging Patterns – Message Channel Patterns

What are the basic differences between the channel types?

Basically, the channel types can be divided into two main types.

A distinction can be made between a point-to-point channel, i.e. one sender and exactly one receiver, and a publish-subscribe channel, one sender and several receivers.
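The following minimal sketch contrasts the two channel types with plain Python queues; it is our own illustration, not a prescribed implementation:

    # Minimal sketch of the two channel types (our own illustration).
    from queue import Queue

    # Point-to-point: one sender, exactly one receiver consumes each message.
    p2p = Queue()
    p2p.put("order #1")
    print("p2p receiver got:", p2p.get())

    # Publish-subscribe: one sender, every subscriber gets its own copy.
    subscribers = [Queue(), Queue()]
    def publish(msg):
        for sub in subscribers:
            sub.put(msg)

    publish("price update")
    print("subscribers got:", [sub.get() for sub in subscribers])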

What is a Messaging Endpoint?

In order for a sender or receiver application to connect to the messaging channel, an intermediary must be used. This client is called a messaging endpoint.


The following figure shows the principle of communication via messaging endpoints.

Messaging Design Patterns - This diagram shows the basic principle of a message endpoint
Basic principle of a Message Endpoint

On the sender side, the endpoint accepts the data to be sent, builds a message from it and sends it via a specific message channel. On the receiver side, the message is likewise received via an endpoint and extracted again. An application can access several endpoints here. However, an endpoint can only implement one alternative.

The following figure lists some endpoint types.

Messaging Design Patterns - This diagram shows the different patterns of message endpoints.
Messaging Design Patterns – Message Endpoint Patterns

When do I choose which endpoint?

Receiving messages in particular can become difficult and lead to server overload. Therefore, control and possible throttling of the processing of client requests is crucial. A proven means is, for example, the formation of processing queues or a dynamic adjustment of consumers, depending on the volume of requests.

What is Message Transformation?

If the data format has to be changed when data is exchanged between two applications, a so-called message transformation ensures that the message channel is formally decoupled.


This translation process can be understood as two systems running in parallel. The actual message data is separated from the metadata.


The following figure shows some message transformation types.

Messaging Design Patterns - This diagram shows the different patterns of message transformation.
Messaging Design Patterns – Message Transformation Patterns

How do I monitor my messaging system and keep it running?

A flexible messaging architecture unfortunately leads to a certain degree of complexity on the other side. Especially when it comes to integrating many message producers and consumers decoupled from each other in a messaging system, with partly asynchronous messaging, monitoring during operation can become difficult.


For this purpose, system management patterns have been developed to provide the right monitoring tools. The main goal is to prevent bottlenecks and hardware overloads in order to guarantee the smooth flow of messages.


The following figure shows some test and monitoring patterns.

Messaging Design Patterns - This diagram shows the different patterns of message monitoring.
Messaging Design Patterns – Message Monitoring Patterns

What are the basic systems?

With a typical system management solution, for example, the data flow can be controlled by checking the number of data sent and received, or the processing time.


This is contrasted with checking the actual message content itself.

PCA vs Linear Regression – Why you should know the differences

PCA vs Linear Regression – two statistical methods that proceed very similarly but differ in one important respect. What the two methods actually are and what this difference is, we explain in the following article.

What is a PCA?

Principal Component Analysis (PCA) is a multivariate statistical method for structuring or simplifying a large data set. The main goal is the discovery of relationships in a two- or three-dimensional domain.
This method enjoys great popularity in almost all scientific disciplines and is mostly used when variables are highly correlated.


However, PCA is only a reliable method if the data are at least interval scaled and approximately normally distributed.
Although the variables are adjusted to avoid redundant effects, the error and residual variance of the data are not taken into account.

The following figure shows the basic principle of a PCA. High dimensional data relationships should be represented in a low dimensional way, with as little loss of information as possible.

PCA vs Linear Regression - Figure shows the basic principle of a PCA. High dimensional data relationships should be represented in a low dimensional way, with as little loss of information as possible.
PCA vs Linear Regression – Basic principle of a PCA

The key point of PCA is dimensionality reduction. The aim is to extract the most important features of a data set by reducing the total number of measured variables while retaining a large proportion of the overall variance.
This reduction is done mathematically using linear combinations.

What are linear combinations?

PCA works in a purely exploratory way, searching the data for a linear pattern that best describes the data set.
These linear combinations can best be thought of as straight lines between variable values.
In the figure below, the linear combinations have been applied to a data set.

PCA vs Linear Regression - In this scheme the linear combinations have been applied to a data set
Linear combinations

How does the algorithm work?

In the principal component analysis procedure, a set of fully uncorrelated principal components is first generated.
These contain the main changes in the data and are also known as latent variables, factors or eigenvectors.
The number of extracted components is given here by the data.

The first principal component is formed by minimizing the sum of squared distances of all variables to it; during extraction, the variance explained across all variables is maximized.
Then the remaining variance is gradually accounted for by the second and subsequent components, until the total variance of all data is explained by the principal components.

The first factor always points in the direction of the maximum variance in the data.
The second factor must be perpendicular to it and explain the next largest variance.

PCA vs Linear Regression – How do they Differ?

We have studied the PCA and how it works in great detail. But what are the differences to linear regression?

The following illustration contrasts the main difference between the two methods.

PCA vs Linear Regression -  The figure shows the main difference between the two methods. The minimization of the error squares to the straight line.
PCA vs Linear Regression – Minimization of the Error Squares to the Straight Line

With PCA, the error squares are minimized perpendicular to the straight line, so it is an orthogonal regression. In linear regression, the error squares are minimized in the y-direction.

Thus, linear regression is more about finding a straight line that best fits the data, depending on the internal data relationships.
Principal component analysis uses an orthogonal transformation to form the principal components, or linear combinations of the variables.

This difference between the two techniques therefore only becomes apparent when the variables are not completely independent but correlated.
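The following numpy sketch makes the difference tangible on correlated synthetic data: the regression slope follows from the vertical least-squares fit, while the PCA slope follows from the leading eigenvector of the covariance matrix. Data and parameters are our own illustration:

    # Minimal numpy sketch (our own illustration) contrasting the two fits:
    # linear regression minimizes vertical (y-direction) squared errors, while
    # PCA's first component minimizes squared distances orthogonal to the line.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 1.5 * x + rng.normal(scale=0.5, size=200)    # correlated variables

    # Linear regression slope: cov(x, y) / var(x)
    slope_lr = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

    # PCA slope: direction of the leading eigenvector of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
    v = eigvecs[:, np.argmax(eigvals)]
    slope_pca = v[1] / v[0]

    print(f"regression slope: {slope_lr:.3f}, PCA slope: {slope_pca:.3f}")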

If you want to know more about machine learning methods and how they work, check out our article on the t-SNE algorithm.

t-SNE – Great Machine Learning Algorithm for Visualization of High-Dimensional Datasets

The machine learning algorithm t-Distributed Stochastic Neighbor Embedding, abbreviated t-SNE, can be used to visualize high-dimensional datasets. The high-dimensional information of each data point is reduced to a low-dimensional representation, while the information about existing neighborhoods is preserved.

So this technique is another tool you can use to create meaningful groups in unordered data collections based on the unifying data properties. If you don’t know what cluster algorithms are, check out this article. Here we present 5 machine learning methods that you should know.
As shown in the following figure, the data should be represented grouped in 2-dimensional space.

The figure shows the data clusters generated by t-Distributed Stochastic Neighbor Embedding (t-SNE) in 2-dimensional space.
Data clusters generated by t-Distributed Stochastic Neighbor Embedding (t-SNE)

But how does the algorithm work and what are its strengths? In order to understand its function, we need to look at the origin of the technology.

What is the Stochastic Neighbor Embedding (SNE) Algorithm?

The basis of the t-Distributed Stochastic Neighbor Embedding algorithm is the earlier Stochastic Neighbor Embedding (SNE) algorithm. This converts high-dimensional Euclidean distances into similarity probabilities between individual data points.
The probability with which an object occurs next to a potential neighbor must be calculated.
The dissimilarities between two high-dimensional data points can be explained with a distance matrix, corresponding to the squared Euclidean distance.
A conditional probability is calculated for the low-dimensional correspondence.
This determines the similarity of the two data points on the low-dimensional map.

In order to achieve the closest possible correspondence between the two distributions p_ij and q_ij, a Kullback-Leibler (KL) divergence over all neighbors of each data point is computed as a cost function C. Large costs are incurred for distant data points.

t-Distributed Stochastic Neighbor Embedding: minimized cost function: sum of the Kullback-Leibler divergences between the original and the induced distribution over the neighbors of an object.
Minimized cost function: sum of the Kullback-Leibler divergences between the original and the induced distribution over the neighbors of an object.
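Written out in LaTeX notation, the cost function described in the caption corresponds to the standard SNE formula (our reconstruction, not a formula quoted from the article):

    C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}

Here p_{j|i} is the similarity of points i and j in the high-dimensional space and q_{j|i} the corresponding similarity on the low-dimensional map.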

A gradient descent method is used to optimize the cost function. However, this optimization method converges very slowly. In addition, a so-called crowding problem arises.

If a high-dimensional data set is linearly approximated on a small scale, it cannot be reduced to a lower dimension with a local scaling algorithm.

What makes the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm work?

This is where the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm comes in. On the one hand, a simplified symmetric cost function is used.

The figure shows the simplified symmetric cost function used in t-Distributed Stochastic Neighbor Embedding.
t-SNE: simplified symmetric cost function

Here, only one KL divergence over a joint probability distribution of all high- and low-dimensional data is minimized.
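In LaTeX notation, this simplified symmetric cost function is commonly written as follows (again our reconstruction of the standard t-SNE formula):

    C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}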

On the other hand, the similarity of the low-dimensional data points is computed with a Student's t-distribution with one degree of freedom. This can be optimized quickly and is stable against the crowding problem.
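In practice, you can apply t-SNE in a few lines, for example with scikit-learn; the data set and parameter values below are our own illustrative choices:

    # Minimal t-SNE usage sketch with scikit-learn (our own illustration).
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)          # 64-dimensional samples

    # Perplexity balances local vs. global neighborhood structure; the values
    # here are illustrative defaults, not recommendations from the article.
    X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    print(X.shape, "->", X_2d.shape)             # (1797, 64) -> (1797, 2)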

4 Index Data Structures a Data Engineer Must Know

In this article we will explain what index data structures are and introduce you to some popular structures.

In today’s world, ever-increasing amounts of data are being processed. The data can be used to derive business strategies in a commercial context, but also to gain valuable information about all scientific disciplines. The data obtained must be saved, ideally as raw data, and stored for future analysis.

At the time of creation, it is not yet possible to estimate what information might be valuable at some point. So any reduction in data ultimately represents a loss. Huge amounts of data accumulate every second, and managing them is an immense task for today’s hardware and software. Mathematical tricks have to be used to optimize search mechanisms and storage functions.

Index data structures allow you to access specific data in a large data collection far faster. Instead of executing a search query sequentially, a so-called index data structure is used to find a specific data record in the data set based on a search criterion.

What are Index Data Structures in Databases?

You have probably heard about indexing in connection with databases. Here, too, an index structure is formed, independent of the data structure, which accelerates the search for certain fields. This structure consists of references, which define an order relation to the table columns. Based on these pointers, the database management system can then find the data using a search algorithm.

schematic representation of index data structures in databases
Index Data Structures in databases

However, indexing is a very complex scientific field. Queries are constantly being made more efficient and optimized. Thus, the approaches are diverse and very mathematical. This article will give you an overview of popular index data structures and help you to optimize your data pipelines.

Index data structure types

There are many different indexing methods. They are all based on different mathematical assumptions. You should understand these assumptions and choose a suitable system according to your data properties.
In the following scheme you can see some structure types you have to distinguish between, depending on the data you want to index.

This scheme shows different index data structure types
index data structure types

The most important distinction, however, is whether you want to index one-dimensional or multidimensional data relationships. This means that you have to differentiate whether there is a common feature or several related but independent features.
In the following figure, we have classified the individual index structures according to their dimension coverage.

we have classified the individual index structures according to their dimension coverage.
individual index structures according to their dimension coverage

Which index structure you ultimately choose depends on many factors and should be weighed up well in advance, especially with large data sets.

Popular index data structures you should know

In the following, we will introduce you to some of the most popular indexing methods in detail. Because here, too, the key to success lies in understanding your tools and using them correctly at the right moment.

What is Hashing?

If you want to search for a value in an unsorted array, a linear search is not optimal and too time-consuming.
With the so-called hashing method, a hash value is used for unique object identification. It is calculated from the key by a hash function and determines the storage location in an array of indices, the so-called hash table. In other words, you use this function to generate a unique storage location in the table from a key.
The following figure shows the hash function flow.

schematic representation of the hash function sequence in detail
hash function sequence

Important basic assumptions, however, are that the function always returns the same number for a given object, that two identical objects always have the same number, and that two unequal objects do not always have different numbers.
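The following minimal sketch shows the idea of a hash table with bucket lists to absorb exactly those collisions; it is our own illustration:

    # Minimal hashing sketch (our own illustration): a hash function maps a key
    # to a slot in a fixed-size table, giving near-constant-time lookup.
    table_size = 8
    table = [[] for _ in range(table_size)]      # buckets absorb collisions

    def put(key, value):
        slot = hash(key) % table_size            # key -> storage location
        table[slot].append((key, value))

    def get(key):
        slot = hash(key) % table_size
        return next((v for k, v in table[slot] if k == key), None)

    put("alice", 1); put("bob", 2)
    print(get("alice"), get("bob"))              # 1 2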

What is a Binary tree?

A so-called binary tree is a data structure in which each element, also called a node, has a maximum of two successors. The addresses of the subordinate nodes are tracked by pointers. It is often used when data is to be stored in RAM.
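A minimal binary search tree sketch might look as follows; it is our own illustration of the node-and-pointer structure described above:

    # Minimal binary search tree sketch (our own illustration): each node has
    # at most two children, tracked via references to the subordinate nodes.
    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        return root

    def contains(root, key):
        while root is not None and root.key != key:
            root = root.left if key < root.key else root.right
        return root is not None

    root = None
    for k in [8, 3, 10, 1, 6]:
        root = insert(root, k)
    print(contains(root, 6), contains(root, 7))  # True False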

What is a B-tree?

The B-tree is often used in databases and file systems, i.e. for storage on the hard disk. The tree is sorted and completely balanced. The data is stored sorted by keys. The keys are stored in its internal nodes, but need not be stored in the records at the leaves. CRUD functions run in amortized logarithmic time.


The B-tree is classified into different types according to its properties.
In the B+ tree, only copies of the keys are stored in the internal nodes. The keys are stored with the data in the leaves. To speed up sequential access, these also contain pointers to the next leaf node and are thus concatenated.
In the following scheme you see a basic B+ tree structure.

Basic representation of a b+ tree and its components
Basic b+ tree structure

The B* tree is an index structure where non-root nodes must be at least 2/3 filled. This is achieved by a modified split strategy.
In addition to indexing, partitioning also offers you a way to greatly optimize data searches within a database. In this article we introduce you to this technique.

What is a SkipList?

In its structure, the SkipList resembles a linked list consisting of containers, which hold the data with a unique key and a pointer to the following container. In a SkipList, however, the containers have different heights and can contain pointers to containers that do not follow directly. The idea is to speed up the search with these additional pointers.

schematic representation of an index structure of the SkipList
Schematic representation of a SkipList

Calculation of the container height

All nodes have pointers on different levels, with which keys can be skipped. The height of the list elements is calculated either regularly or unbalanced according to mathematical rules. Search performance therefore depends on how the list was created, or on the uniformly random distribution of heights across the list.

H2O AI – That’s why it’s so great

There is a lot of Big Data software available now. One of them that you should definitely know about is the H2O AI Machine Learning solution.

With this open-source application you can implement algorithms from the fields of statistics, data mining and machine learning. The H2O AI engine builds on the Hadoop distributed file system and is therefore more performant than other analysis tools. Your machine learning methods can thus be run as parallelized methods.

Software Stack

You can program your algorithms in R, Python and Java, i.e. in the most important mathematical programming languages. H2O provides a REST interface to Python, R, JSON and Excel. Additionally, you can access H2O directly with Hadoop and Apache Spark. This makes integration into your data science workflow much easier. You already get approximate results while the algorithms are running. A graphical web browser UI helps you to analyze the processes better and perform targeted optimizations.

How Clients Interact with H2O AI

You can interact with H2O via clients using various interfaces. It is important to know that the data is usually not held in your local memory: it resides in an H2O cluster, and you only get a pointer to the data when you make a request.

How clients interact with H2O AI
H2O Interaction flow

H2O Frame

The basic unit of data storage accessible to you is the H2O Frame. It corresponds to a two-dimensional, resizable and potentially heterogeneous tabular data structure with labeled axes.
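A minimal sketch of creating such a frame, assuming the official h2o Python package is installed and a local cluster can be started (our own illustration):

    # Minimal H2OFrame sketch (our own illustration, assuming the 'h2o'
    # Python package and a locally startable cluster).
    import h2o

    h2o.init()                                   # connect to / start a cluster

    # The frame itself lives in the cluster; this client holds a reference.
    frame = h2o.H2OFrame({"x": [1, 2, 3], "label": ["a", "b", "c"]})
    print(frame.dim)                             # [3, 2]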

H2O Cluster

Your H2O cluster consists of one or more nodes. A node corresponds to a JVM process and this process consists of three layers.

H2O Machine Learning Software Structure
H2O Software Stack

H2O Machine Learning Components

Language Layer

The R evaluation layer is a slave to the REST client front-end and in the Scala layer you can write native programs and algorithms. You can then use these with H2O Machine learning.

Algorithms Layer

This layer is where your algorithms are applied. You can run statistical methods, data import and machine learning here.

Core Layer

In this layer you handle the resource management. You can manage both the memory and the CPU processing capacity.

Microsoft Power Platform – To turn your company into an Industry 4.0 enterprise, you can no longer avoid cloud solutions

In this article, we will show you everything about the cloud-based web tool Microsoft Power Platform and why you shouldn’t do without it.

Cloud-based web development has gained popularity in the web development industry in recent years. The globalization of the workforce and the diversification of the work process have significantly driven the development of cloud-based services.

What is Microsoft Power Platform?

With Power Platform you get an integrated application platform consisting of a group of different Microsoft products with which you can develop complex business solutions. This way you can make your business processes more efficient and productive. The platform can also take care of data storage, entry and processing. Data analysis via complex visualizations and predictions can also be handled by various services.

Microsoft Power Platform Services

Schematic representation of the Microsoft Power Platform Services
Power Platform Services

Since the Power Platform is a collection of different Microsoft services, we want to give you an overview of the individual parts.

Build your own Apps

With Power Apps you get a suite for building custom apps. All apps are independent of data sources and can be extended as you wish via drag & drop, so you can adapt them to your needs.

Automate your tasks

Power Automate was still called Microsoft Flow until 2019. This web tool lets you automate recurring tasks and simple cross-platform workflows. You can connect to over 100 third-party systems via connectors. This allows you to automate processes outside the Microsoft environment and across applications.

Schematic representation of the Principle Microsoft Power Automate
Principle Microsoft Power Automate

Analyze your business data

With Power BI you get a business intelligence tool that can access different data sources. One advantage over other BI tools is the deep integration with Excel, which lets you create user-friendly data connections and visualizations.

Should I choose Microsoft Power Platform?

You need more and more scalable and secure solutions at low cost. Device-independent access to important planning and evaluation software will increasingly become the focus of attention and change global corporate structures. To turn your company into an Industry 4.0 enterprise, you can no longer avoid cloud solutions.

The question is not if, but when you will choose a corresponding service. Microsoft, as one of the largest IT companies, now presents you with a solution. Whether it fits your needs, however, you must ultimately decide for yourself.

Array vs Object – The creation of a JSON structure follows some rules you should know

Array vs Object – JSON is one of the most popular data formats. However, the creation of such an object is done according to some rules. These rules depend on the original data type. In this article we will introduce you to the conversion of some JSON data types (Array vs Object).

What is JSON anyway?

With JavaScript Object Notation, JSON for short, you can structure data compactly and independently of programming languages. The data format is therefore particularly well suited for exchange between your applications, for general data storage (file extension “.json”) and for configuration files. The data is human-readable and encoded in a standardized text format. The data format is defined by RFC 8259 and the JSON syntax by the ECMA-404 standard. Due to its easy integration with JavaScript, you can use it well for transferring data in web applications.

You can best compare the JSON data structure to XML and YAML, only it’s simpler and more compact.

What are the basic rules?

This code snippet shows a simple json object structure
Simple JSON Object

The JSON text structure is based on the JavaScript object syntax, so hierarchical data structures are possible. It contains only properties and no methods. The basis is formed by name-value pairs and ordered lists of values. Objects are enclosed in curly braces, and the data is transferred as text, which is especially advantageous if you want to send it over the network. If you want to access the data, you have to convert the text structure into a native JavaScript object.

Data Formats – JSON Array vs Object

Basically, you can have different data types included in JSON.

Value:

Your JSON value can take one of the following allowed types.

Schematic representation of the data types that a JSON value can assume
JSON value data types

Object:

A JSON object represents the basic form of a JSON text. It can contain any data type that is suitable for inclusion in JSON.

JSON Array vs Object - Schematic representation of the creation of a JSON object
Creation of a JSON object

Array:

JSON Array vs Object – It is possible to include an array. Arrays can contain objects, strings, numbers, arrays and booleans. You can include arrays as shown schematically below, enclosed in square brackets.

JSON Array vs Object - Schematic representation of the creation of a JSON array
Creation of a JSON array

In this way, you can further and further nest the individual data types with each other and thus easily create any number of hierarchy levels. For example, object attributes can consist of arrays, or arrays can contain multiple objects.
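The following minimal Python sketch shows such nesting in practice, serializing a native structure to JSON text and back; the field names are our own illustrative choices:

    # Minimal sketch of nested JSON (our own illustration): an object whose
    # attribute is an array of objects, serialized with Python's json module.
    import json

    data = {
        "team": "datascience",
        "members": [                      # array nested inside an object
            {"name": "Ada", "skills": ["python", "sql"]},
            {"name": "Alan", "skills": ["scala"]},
        ],
    }

    text = json.dumps(data, indent=2)     # native structure -> JSON text
    print(json.loads(text)["members"][0]["name"])   # Ada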
