AutoEncoder – In data science, we often encounter multidimensional data relationships. Understanding and representing these is often not straightforward. But how can you effectively reduce the dimensionality of data without losing information content?
Unsupervised dimension reduction
One possibility is offered by unsupervised machine learning algorithms, which aim to code high-dimensional data as effectively as possible in a low-dimensional way. If you don’t know the difference between unsupervised, supervised and reinforcement learning, check out this article we wrote on the topic.
What is an AutoEncoder?
The AutoEncoder is an artificial neural network that is used to reduce data dimensionality in an unsupervised way. The network usually consists of three or more layers. The gradients are usually calculated with a backpropagation algorithm. The network thus corresponds to a feedforward network that is fully connected layer by layer.
Types
There are many AutoEncoder variants. The following table lists the most common ones.
However, the basic structure is the same for all variants.
Basic Structure
Each AutoEncoder is characterized by an encoding and a decoding side, which are connected by a bottleneck, a much smaller hidden layer.
The following figure shows the basic network structure.
During encoding, the dimensionality of the input information is reduced: the encoder compresses the information into the much smaller bottleneck representation. In the decoding part, the compressed information is used to reconstruct the original data. For this purpose, the weights are adjusted via backpropagation so that the reconstruction error is minimized. In the output layer, each neuron then has the same meaning as the corresponding neuron in the input layer.
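To make the encode–bottleneck–decode principle concrete, here is a minimal sketch of a linear autoencoder in plain NumPy. The data, layer sizes and learning rate are illustrative choices, not part of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4-dimensional points that actually live on a 2-dimensional subspace
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4))

# Encoder and decoder weights (a linear autoencoder with a 2-unit bottleneck)
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

lr = 0.01
initial = loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc            # encoding: compress to the bottleneck
    R = Z @ W_dec            # decoding: reconstruct the input
    err = R - X              # reconstruction error
    # Gradient descent on the reconstruction error for both weight matrices
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final = loss(X, W_enc, W_dec)
print(final < initial)  # reconstruction error decreased during training
```

Real autoencoders use nonlinear activations and deep-learning frameworks, but the pattern is the same: compress, reconstruct, and adjust the weights to shrink the reconstruction error.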
Autoencoder vs Restricted Boltzmann Machine (RBM)
Restricted Boltzmann Machines are based on a similar idea. These are undirected graphical models useful for dimensionality reduction, classification, regression, collaborative filtering, and feature learning. However, they take a stochastic approach: stochastic units with a particular distribution are used instead of deterministic units.
RBMs are designed to find the connections between visible and hidden random variables. How does the training work? During the forward pass, the hidden biases help generate the activations; during the backward pass, the visible layer biases help reconstruct the input.
Pretraining
Since the random initialization of weights in neural networks at the beginning of training is not always optimal, it makes sense to pre-train. The task of training is to minimize the reconstruction error in order to find the most efficient compact representation of the input data.
The method was developed by Geoffrey Hinton and is used primarily for training complex autoencoders. Here, neighboring layers are treated as a Restricted Boltzmann Machine. Thus, a good approximation is achieved, and fine-tuning is done with backpropagation.
A perceptron is a simple binary classification algorithm, modeled after the biological neuron, and thus a very simple learning machine. Its output is determined by the weighting of the inputs and by a threshold. Perceptrons are used in machine learning as well as in artificial intelligence (AI) applications. If you don’t know the difference between AI, neural networks and machine learning, you should read our article on the subject.
What does the learning process look like?
A set of input signals is mapped to a binary output decision, i.e. zero or one. By training with certain input patterns, similar patterns can then be found in a data set to be analyzed. The following figure shows this learning process schematically.
If the weighted sum of all inputs exceeds or falls below a set threshold, the state of the neuron output changes. If one now trains a perceptron with given data patterns, the weighting of the inputs changes. The perceptron thus has the ability to learn and, by adjusting its weights, to solve classification problems.
However, a basic requirement to obtain valid results is that the data must be linearly separable.
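The learning rule described above can be sketched in a few lines of Python. The logical AND function serves as a linearly separable toy example; the learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

# Training data for the logical AND function (linearly separable)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # input weights
b = 0.0           # bias (the negative threshold)
lr = 0.1

def predict(x):
    # Output is 1 if the weighted sum exceeds the threshold, else 0
    return int(x @ w + b > 0)

# Perceptron learning rule: adjust weights on every misclassification
for _ in range(20):
    for xi, target in zip(X, y):
        error = target - predict(xi)
        w += lr * error * xi
        b += lr * error

print([predict(xi) for xi in X])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees that this loop finds a separating weight vector; on data like XOR it would never converge.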
What are Multilayer Perceptrons (MLP)?
A multilayer perceptron corresponds to what is known as a neural network. Perceptrons thus form the neuronal base, which are interconnected in different layers.
The figure below shows a simple three-layer MLP. Each line here represents a weighted connection.
However, neurons of the same layer have no connections to each other. Each connection carries its own weight, and the output of a neuron becomes part of the input vector of the neurons in the next layer. The diversity of classification possibilities increases with the number of layers.
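As a small illustration (assuming scikit-learn is available), the XOR function, which a single perceptron cannot learn because it is not linearly separable, can be tackled by a multilayer perceptron with one hidden layer; hidden-layer size and solver here are illustrative choices:

```python
from sklearn.neural_network import MLPClassifier

# XOR is not linearly separable, so a single perceptron cannot learn it,
# but an MLP with one hidden layer can model it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

mlp = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                    random_state=0, max_iter=1000)
mlp.fit(X, y)
print(mlp.predict(X))  # one binary decision per input pattern
```

Whether the network recovers XOR perfectly depends on the random initialization; the point is that the hidden layer makes a non-linear decision boundary possible at all.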
Recurrent Neural Networks vs Feed-Forward Networks
Basically, neural networks are distinguished according to the recurrent and the feed-forward principle.
Recurrent Neural Networks
In a recurrent neural network, neurons are connected to neurons of the same or a preceding layer. A basic distinction is made between three types of feedback. With direct feedback, a neuron’s own output is used as a further input. In indirect feedback, the output of a neuron is connected to a neuron of a preceding layer. In the last feedback principle, lateral feedback, the output of a neuron is connected to another neuron of the same layer.
Feed-Forward Networks
In feed-forward networks, on the other hand, the outputs are connected only to the inputs of a subsequent layer. Such a network can be fully connected, in which case the neurons of a layer are connected to all neurons of the directly following layer. Alternatively, short-cuts are formed, and some neurons are not connected to all neurons of the next layer.
Hardly any scientific discipline today can get around the programming language Python. With it, powerful algorithms can be applied to large amounts of data performantly. Open-source libraries and frameworks enable the simple implementation of mathematical methods and data transports.
What is scikit-learn?
One of the most popular Python libraries is scikit-learn. It can be used to implement both supervised and unsupervised machine learning algorithms. scikit-learn primarily offers ready-made solutions for data mining, preprocessing and data analysis. The library is based on the SciPy Toolkit (SciKit) and makes extensive use of NumPy for high-performance linear algebra and array operations. If you don’t know what NumPy is, check out our article on the popular Python library. The library was first released in 2007 and has since been constantly extended and optimized by a very active community. It was written primarily in Python and relies on Cython only for some performance-critical operations. This makes the library easy to integrate into Python applications.
scikit-learn Features
Many machine learning algorithms can be implemented easily with scikit-learn. Both supervised and unsupervised machine learning are supported. If you don’t know what the difference is between the two machine learning categories, check out this article from us on the topic. The figure below lists all the algorithms provided by the library.
scikit-learn thus offers rich capabilities to recognize patterns and data relationships in a dataset. Thus, high dimensions can be reduced to visualize the relationships without sacrificing much information. Features can be extracted and data clustering algorithms can be easily created.
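As a brief example of such a dimension reduction, the four-dimensional Iris dataset bundled with scikit-learn can be projected to two dimensions with PCA (the choice of dataset and of two components is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Reduce the 4-dimensional Iris measurements to 2 dimensions
X = load_iris().data          # shape (150, 4)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```

The two-dimensional projection can then be scatter-plotted or fed into a clustering algorithm, exactly the "reduce, visualize, cluster" workflow described above.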
Dependencies
scikit-learn is powerful and versatile. However, the library does not stand entirely alone. Besides the obvious dependency on Python, the library requires other libraries for special operations.
NumPy allows easy handling of vectors, matrices and, more generally, large multidimensional arrays. SciPy complements these functions with useful features such as minimization, regression or the Fourier transform. With joblib, Python functions can be built as lightweight pipeline jobs, and with threadpoolctl, the thread pools used by native libraries can be controlled to save resources.
== Open source Python library – a collection of mathematical algorithms and convenience functions
– is mainly used by scientists, analysts and engineers for scientific computing, visualization and related activities
– Initial Release: 2006; Stable Release: 2020 – depends on the NumPy module → the basic data structure used by SciPy is an N-dimensional array provided by NumPy
Benefits
Features
– SciPy library provides many user-friendly and efficient numerical routines:
SciPy ecosystem
– scientific computing in Python builds upon a small core of open-source software for mathematics, science and engineering
More relevant Packages
– the SciPy ecosystem includes, based on the core properties, other specialized tools
Apache Mahout is a powerful machine learning tool that comes with a seamless compatibility to the strong big data management frameworks from the Apache universe. In this article, we will explain the functionalities and show you the possibilities that the Apache environment offers.
What is Machine Learning?
Machine learning algorithms provide lots of tools for analyzing large unknown data sets. The art of data science is to extract the maximum amount of information depending on the data set by using the right method. Are there patterns in the high-dimensional data relationships, and how can they be represented in a low-dimensional way without much loss of information?
A failed run often carries as much information as a successful grouping. It is important to understand the mathematical approaches behind the tools in order to draw conclusions about why an algorithm did not work. If you don’t know the basic machine learning categories, it’s best to read our article on the subject first.
Machine Learning and Linear Algebra
Most machine learning methods are based on linear algebra. This mathematical subfield deals with linear transformations, vector spaces and linear mappings between them. The knowledge of the regularities is the key to the correct understanding of machine learning algorithms.
What is Apache Mahout
Apache Mahout is an open source machine learning project that builds implementations of scalable machine learning algorithms with a focus on linear algebra. If you’re not sure what Apache is, check out this article. Here we introduce the project and its main components.
Mahout was first released in 2009 and has since been constantly extended and kept up to date by a very active community. Originally, it contained scalable algorithms closely tied to Apache Hadoop and MapReduce. However, Mahout has since evolved into a backend-independent environment. That is, it also runs on non-Hadoop clusters or single nodes.
Features
The math library is based on Scala and provides an R-like Domain Specific Language (DSL). Mahout is usable for Big Data applications and statistical computing. The figure below lists all machine learning algorithms currently offered by Mahout.
The algorithms are scalable and cover both supervised and unsupervised machine learning methods, such as clustering algorithms.
Apache Mahout covers a large part of the usual machine learning tools. This means that data can be analyzed without having to change frameworks. This is a big plus for maintaining compatibility in the application.
Apache Ecosystem
The framework integrates seamlessly into the Apache Ecosystem. This means that an application can access the entire power of the data processing platforms and build very high-performance big data pipelines. The following figure shows the Apache data management ecosystem.
Through connectivity to Apache Flink, stream data analysis pipelines can be built, or with Hive data from relational databases can be automatically converted into MapReduce or Tez or Spark jobs.
What role does XML play in Industry 4.0? – XML is one of the most popular and widely used data formats. Its widespread use is also its most important advantage. XML is interpretable by both humans and machines and is therefore widely used to import and export application data. XML stands for Extensible Markup Language and is a markup language for representing hierarchically structured data in text file format. It was first published in 1998 and is primarily a meta language.
This means that application-specific languages, for example RSS, MathML, GraphML, but also Scalable Vector Graphics (SVG), are defined on its basis through structural and content restrictions. All web browsers are able to visualize XML documents using their built-in XML parser.
What is the XML Document Structure?
An XML document can always be described as the interaction of its main components. In addition to the data itself, these are the layout, i.e. the description of the relationships between individual containers, and the structure.
An XML structure can be interpreted as a tree. Thus, each XML document has a root element and texts or attributes as sub-elements.
An XML document can have an optional header in addition to the actual data. It can contain the XML declaration, e.g. the XML version or the encoding, as well as document type declarations, i.e. references to an external document type definition (DTD) or an internal DTD.
Classification of the XML format
The XML format can be further classified; which class is appropriate is determined by the use case. The main distinction is between document-centric and data-centric. The document-centric XML format is based on a text document and is difficult to process by machine due to its weak structure. In the data-centric format, the schema describes entities of a data model and their relationships; this format is optimized for efficient processing by machines. The semi-structured format represents a hybrid of both.
Processing
The XML format allows both sequential and random access. Processing can be done either by a “push”, where the program flow is controlled by the parser, or by a “pull”, where the flow is implemented in the code that calls the parser. The tree structure can be managed hierarchically as well as nested.
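The “pull” principle can be illustrated with Python’s standard library, where `xml.etree.ElementTree.iterparse` lets the calling code consume parser events one at a time (the document content is made up for the example):

```python
import io
import xml.etree.ElementTree as ET

# A small data-centric XML document with a root element and sub-elements
xml_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<orders>
    <order id="1"><item>valve</item></order>
    <order id="2"><item>sensor</item></order>
</orders>"""

# "Pull" processing: the calling code drives the parser and
# consumes events one at a time instead of being called back
items = []
for event, elem in ET.iterparse(io.BytesIO(xml_doc), events=("end",)):
    if elem.tag == "item":
        items.append(elem.text)

print(items)  # ['valve', 'sensor']
```

A “push” parser such as `xml.sax` inverts this control flow: the parser calls handler methods on your object as it encounters each element.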
XML-Schema vs. Database-Schema
Besides XML, JSON is also a very popular markup language. In this article we have recorded the most important information about this format.
Another large field of computer languages, i.e. formal languages developed for interaction between humans and computers, is occupied by database languages. They describe the structure of a database. Here, too, the data is organized as a plan.
If you want to know more about database language, read our article on SQL and NoSQL. Here we explain the most important differences.
But how does this schema differ from an XML schema?
XML contains nested elements with an unlimited nesting depth. To transfer this nesting to a database schema, the nested elements must be decomposed and linked by foreign key relationships. In the XML format, elements within an element can be repeated as often as desired, and elements of a given type do not always have to contain the same child elements. The order of elements, however, is an integral part of the document structure. In a database schema, each column exists only once and contains simple values. Therefore, if multiple values are to be stored, another table must be created. Unlike in XML, the order in which the values are stored does not matter.
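A small sketch (with made-up order data) shows how nested XML elements can be decomposed into two flat tables linked by a foreign key, exactly the transformation described above:

```python
import xml.etree.ElementTree as ET

# Nested XML: each <order> may contain several <item> child elements
xml_doc = """<orders>
    <order id="1"><item>valve</item><item>sensor</item></order>
    <order id="2"><item>pump</item></order>
</orders>"""

root = ET.fromstring(xml_doc)

# Decompose the nesting into two flat "tables" linked by a foreign key
orders, items = [], []
for order in root:
    order_id = int(order.get("id"))
    orders.append((order_id,))
    for item in order:
        items.append((order_id, item.text))  # order_id acts as foreign key

print(orders)  # [(1,), (2,)]
print(items)   # [(1, 'valve'), (1, 'sensor'), (2, 'pump')]
```

Note that the items table needs the extra `order_id` column precisely because a relational column cannot hold a repeating group the way an XML element can hold repeated children.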
– Open source stream processor framework developed by the Apache Software Foundation (2016) – Data streams with high data volume can be processed and analyzed with low delay and high speed
Core functions
– diverse, specialized APIs: → DataStream API (Stream Processing) → ProcessFunctions (control of states and time; event states can be saved and timers can be added for future calculations) → Table API → SQL API → provides a rich set of connectors to various storage systems such as Kafka, Kinesis, Kubernetes, YARN, HDFS, Elasticsearch, and JDBC database systems → REST API
Stream Processing
== Data is processed continuously with a short delay → without intermediate storage of the data in separate databases – several data streams can be processed in parallel – Each stream can be used to derive own follow-up actions and analyses
Architecture
Data can be processed as unbounded or bounded streams:
Unbounded stream
have a start but no defined end
must be continuously processed
Bounded stream
have a defined start and end
can be processed by ingesting all data before performing any computations (== batch processing)
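Flink applications are typically written against Flink’s own APIs; the following plain-Python sketch only illustrates the conceptual difference between processing an unbounded stream incrementally and a bounded stream as a batch:

```python
import itertools

def unbounded_stream():
    """An unbounded stream: has a start but no defined end."""
    n = 0
    while True:
        yield n
        n += 1

# Unbounded: must be processed continuously, element by element
running_sum = 0
for value in itertools.islice(unbounded_stream(), 5):
    running_sum += value      # incremental computation per event
print(running_sum)            # 0+1+2+3+4 = 10

# Bounded: all data can be ingested first, then processed (batch processing)
bounded_stream = list(range(5))
print(sum(bounded_stream))    # 10
```

The unbounded case can never call `sum()` on the whole stream; state (here `running_sum`) must be maintained across events, which is what Flink’s stateful ProcessFunctions provide.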
– Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.
– In case of a failure, Flink replaces the failed container by requesting new resources.
– Stateful Flink applications are optimized for local state access
PyTorch BigGraph – The graph is a data structure that can be used to clearly represent relationships between data objects as nodes and edges. These structures can contain billions of nodes and edges in an industrial context.
So how can multidimensional data relationships be accessed in a meaningful way? Graph embedding offers one possibility for dimension reduction. This is a family of algorithms with the goal of mapping the graph’s property relations into vector spaces. These embedding methods usually run unsupervised: if two nodes have a large property similarity, their points should also lie close to each other in the vector space.
The reduced feature information can then be further processed with additional machine learning algorithms.
What is PyTorch BigGraph?
Facebook offers PyTorch BigGraph (PBG), an open source library that can be used to create very performant graph embeddings for extremely large graphs.
It is a distributed system that can learn graph embeddings in an unsupervised way for graphs with billions of nodes and trillions of edges. It was launched in 2019 and is written entirely in Python. This ensures compatibility with common Python data processing libraries, such as NumPy, Pandas, and scikit-learn. All calculations are performed on the CPU, which should play a decisive role in the hardware selection; a lot of memory is mandatory. It should also be noted that PBG can process very large graphs with high performance, but is not optimized for small graphs, i.e. structures with fewer than 100,000 nodes.
Facebook extends the ecosystem of its popular Python scientific computing package PyTorch with a very performant Big Graph solution. If you want to know more about PyTorch, you should read this article from us. Here we will show you the most important features and compare it with the industry’s top performer Google Tensorflow.
Fundamental building blocks
PBG provides some basic building blocks to handle the complexity of the graph. Graph partitioning splits the graph into equal parts that can be processed in parallel. PBG also supports multithreaded computation: a process is divided into several threads, which run independently but can access the same memory. In addition to this distribution of tasks, PyTorch BigGraph can also distribute execution across hardware resources.
PyTorch BigGraph – How does the training work?
The PBG graph processing algorithms can process the graph in parallel using the fundamental building blocks already described. This allows the training to run in a distributed manner and thus with high performance.
Once the nodes and edges are partitioned, the training can be performed for one bucket at a time.
The training runs unsupervised on an input graph by reading its edge list. A feature vector is then output for each entity. Here, neighboring entities in the vector space are placed close to each other, while unconnected entities are pushed apart. Thus, the dimensions are iteratively reduced. It is also possible to configure and optimize this calculation using parameters learned during training.
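The following is not PBG’s actual API but a pure-NumPy toy sketch of the underlying idea: connected entities are pulled together in the vector space while unconnected entities are pushed apart (graph, embedding dimension and step sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny graph: nodes 0-1 and 2-3 are connected; the two groups are not
edges = [(0, 1), (2, 3)]
non_edges = [(0, 2), (0, 3), (1, 2), (1, 3)]

emb = rng.normal(size=(4, 2))   # one 2-dimensional vector per node

for _ in range(200):
    for i, j in edges:           # pull connected nodes together
        delta = emb[j] - emb[i]
        emb[i] += 0.05 * delta
        emb[j] -= 0.05 * delta
    for i, j in non_edges:       # push unconnected nodes apart
        delta = emb[j] - emb[i]
        emb[i] -= 0.01 * delta
        emb[j] += 0.01 * delta

dist = lambda a, b: np.linalg.norm(emb[a] - emb[b])
print(dist(0, 1) < dist(0, 2))  # connected pair ends up closer: True
```

PBG does this at scale with sampled negative edges, partitioned buckets and learned comparator parameters, but the pull/push contrast is the core of the objective.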
PBG and Machine Learning
The graph structure is a very information-rich but unfortunately often neglected data structure. With tools like PBG, even very large structures become manageable through high parallelism.
A very interesting concept is the use of PBG for machine learning on large graph structures. Here, graph structures with nodes, edges and properties could be used for semantic queries to represent and store data, and could replace a labeled data structure. Certain relations can be derived from the connections between the nodes. With PBG, the graph can be processed in a highly parallelized way. This would allow individual machines to train a model in parallel with different buckets, using a lock server.
In times of Big Data, the graph has become a popular data structure due to its flexible and clear relationship-based structure. Even entire database systems are now designed according to the graph principle. For more on this, read our article on NoSQL databases. Libraries, like PyGraph, allow you to perform fast queries and optimized graph manipulations. With its full Python implementation, it offers you a user-friendly and powerful tool.
What is a graph?
In a graph, objects are represented according to their relationships with each other. The objects are called vertices and the relations are called edges of the graph. An edge always connects exactly two nodes. Graphs are often used to represent traffic networks, entity-relationship diagrams, syntax trees for programming languages, finite automata and proof or decision trees.
PyGraph supports different graph types
Basically, a distinction must be made between directed and undirected graphs. If a graph is directed, the edges may only be traversed in one direction; these edges are also called directed edges. If it is undirected, there are no directional constraints, so each edge connects an unordered pair of vertices. In the following figure we have contrasted both categories.
You can use PyGraph regardless of these properties, because both types are supported.
PyGraph supports several algorithms
PyGraph supports the use of many well-known graph operations. For example, searching or traversing a graph, where all nodes of a graph must be visited, can be done in different ways. In the Depth-First Search (DFS) algorithm, the successors of a successor of the current node are visited before the remaining neighbors of the current node.
The depth of the search can also be limited accordingly. Breadth-First Search (BFS), on the other hand, first visits all neighbors of the current node and only then the successors of those neighbors.
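Both traversal strategies can be sketched in plain Python on a small adjacency-list graph (the graph and node names are invented; PyGraph’s own API may differ):

```python
from collections import deque

# Undirected graph as an adjacency list
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def dfs(start):
    """Depth-first: follow successors of successors before siblings."""
    visited, stack = [], [start]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.append(node)
            # push neighbors in reverse so they are visited in list order
            stack.extend(reversed(graph[node]))
    return visited

def bfs(start):
    """Breadth-first: visit all neighbors before their successors."""
    visited, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.append(node)
            queue.extend(graph[node])
    return visited

print(dfs("A"))  # ['A', 'B', 'D', 'C']
print(bfs("A"))  # ['A', 'B', 'C', 'D']
```

Note how DFS dives from A through B to D before returning to C, while BFS finishes both of A’s neighbors first.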
In addition to the algorithm-based search of a graph, other operations can be performed with PyGraph, such as the calculation of minimum spanning trees. This tree describes the best possible path to traverse all available nodes in a weighted graph. In the following figure we have shown you all currently supported algorithms.
Modbus Overview – In this article we introduce you to the industrial communication protocol, its function and its individual characteristics.
What is Modbus and how does it work?
Due to its simple usability, Modbus is a standard in many automation areas for coupling intelligent machines in a client/server architecture, also called master/slave. Each bus participant is assigned a unique address. Address zero is reserved for broadcast messages, which are sent from one point to all participants. Usually, however, the master initiates a message and the addressed slave responds.
Data can be exchanged either via a serial interface, i.e. bit by bit (RTU, ASCII), or via Ethernet using data frames (TCP). The bus types are distinguished accordingly by their data format.
Modbus Overview – RTU-Modbus
The Remote Terminal Unit (RTU) is best described as a remote control system. Transmission is in binary form and is therefore very fast. However, to be able to read the data, it must be translated back again.
The length of the transmission pause depends on the transmission speed. The data field specifies which registers the slave is to read. The slave then inserts the read data here and sends it back to the master. The master then performs an error check, either via a cyclic redundancy check (CRC) or by calculating a checksum byte.
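The cyclic redundancy check used by Modbus RTU is the well-known CRC-16/MODBUS variant (initial value 0xFFFF, reflected polynomial 0xA001); it can be sketched in Python:

```python
def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: init 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

# Standard check value for the ASCII digits "123456789"
print(hex(crc16_modbus(b"123456789")))  # 0x4b37
```

In a real RTU frame the two CRC bytes are appended low byte first; the receiver recomputes the CRC over the received payload and compares.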
Modbus Overview – ASCII-Modbus
Instead of a binary sequence, an ASCII code, i.e. a 7-bit character encoding, can also be transmitted. This can be read immediately, but has a lower data throughput in direct comparison to RTU.
Error checking is done by a longitudinal redundancy check (LRC). The error case is usually triggered at a frame transmission pause of more than one second. However, this period is configurable.
Modbus Overview – Modbus/TCP
Data transmission can also take place via Transmission Control Protocol/Internet Protocol (TCP/IP) packets. Here, identification takes place via IP addresses.
Transmission security can be ensured by authenticating server and client with digital certificates via Transport Layer Security (TLS).
Modbus Overview – Client/Server Model
What does the client/server model look like?
The following figure shows the individual steps of both bus participants.
A client sends a request to the network to initiate a transaction. This request is then received on the server side; this step is called the indication. The server then processes the request, creates a response and returns it to the network. The client side then receives the response; this step is called the confirmation.
General MODBUS frame
The MODBUS protocol defines a Protocol Data Unit (PDU) that is independent of the underlying communication layers
The mapping of the protocol onto specific buses or networks can introduce additional fields in the Application Data Unit (ADU), the combined command/data block
MODBUS on TCP/IP Application Data Unit
A dedicated header, the MBAP header, is used to identify the Data Unit -> it contains fields for several identification codes
MODBUS vs OPC UA
OPC UA may become one of the most important unified data protocols. For years, umbrella organizations have been driving the project worldwide. Originally developed for the injection molding and rubber processing industries, OPC UA is gradually being extended to other industries. Thanks to its standardized tree structure, OPC can represent data very flexibly as hierarchical objects. This means that a lot of relationship and structure information is also transmitted.
How can OPC UA devices now be interconnected via Modbus? The first problem is hardware based. OPC UA is usually transmitted via Ethernet. Modbus mostly via RS485. Here a first conversion is necessary. The second issue deals with how to represent the registers and coils of the Modbus device in OPC UA.
OPC UA Native Representation
The first option is to remap the entire Modbus data space to objects. Each register and coil are thus represented as a variable attribute and are given descriptive names. Subsequently, metadata can be added. For example the data type, a maximum value or time of origin.
Modbus Native Representation
Instead of building an OPC UA address space, individual objects can be represented with Modbus registers and coils as attributes. Each register and coil can be mapped by the current register and coil number and no metadata is added. A client can then access the new data space using UA Read and UA Write requests.
Modbus Data Transport
How can Modbus use OPC UA for data transport? The actual Modbus message packet is sent over the network, embedded in the OPC UA transport. OPC UA wraps the standard message and adds a standard encoding and security layer.
However, this requires that the OPC UA server recognizes that the content of a message is not a standard read or write of OPC UA attributes but a Modbus message. An OPC UA server must therefore be attached to the front end of the device. This can be done in different ways:
1.
The OPC UA Server function acts as a mechanism to establish a secure and reliable connection between a Modbus client and a Modbus server. A string attribute is exposed which then contains the entire Modbus message. For a Modbus RTU client device, an OPC UA client device can be used to write the send attribute to the target device. The receive attribute is read back. However, nothing really changes for the Modbus devices except that the messages are passed to OPC UA instead of being put on a line.
2.
Another possibility is to create a data provider/manager for processing the Modbus message. A message is processed through the low-level transport and the application services look at the attribute. The application service manager would notice that the namespace index points to a dedicated application process.
Where else does Modbus make sense?
Modbus is still the best solution for certain applications. It is a relatively inexpensive way to build a reliable information flow. It is a “slow response network” which is especially advantageous for temperature recording. For complex data sets, however, there are far better solutions with EtherNet/IP, PROFINET IO and EtherCAT. OPC UA, for example, can be transferred here without major transformations and detours.
Increasingly large volumes of data can now be processed faster and faster thanks to ever more efficient hardware. Large information networks for monitoring and analyzing almost all business processes are becoming standard. Uniform, fast and resource-saving data protocols are the key to Industry 4.0. If you want to know more about Industry 4.0, take a look at our article here.