EXPERT KNOWLEDGE AT A GLANCE


H2O AI – That’s why it’s so great

There is a lot of Big Data software available today. One solution you should definitely know about is the H2O AI machine learning platform.

With this open-source application you can implement algorithms from the fields of statistics, data mining and machine learning. The H2O AI engine builds on the Hadoop distributed file system and can run your machine learning methods as parallelized jobs across a cluster, which makes it faster than many single-node analysis tools.

Software Stack

You can program your algorithms in R, Python and Java, i.e. in the most important mathematical programming languages. H2O provides a REST interface to Python, R, JSON and Excel. Additionally, you can access H2O directly from Hadoop and Apache Spark, which makes integration into your data science workflow much easier. You already get approximate results while the algorithms are still running, and a graphical web UI helps you analyze the processes and perform targeted optimizations.
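For illustration, here is a minimal sketch (assuming the official h2o Python package and a locally reachable H2O instance) of how a Python client typically connects to a cluster, uploads a small frame and trains a model:

```python
# Minimal sketch: connecting to H2O from Python and training a model.
# Assumes the official `h2o` package is installed and a local cluster can be started.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # starts or connects to a local H2O cluster (JVM process)

# Upload a small in-memory table; in practice you would use h2o.import_file(...)
frame = h2o.H2OFrame({
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 1.5, 2.5, 3.5],
    "y":  [0, 0, 1, 1],
})
frame["y"] = frame["y"].asfactor()  # treat the target as categorical

model = H2OGradientBoostingEstimator(ntrees=10)
model.train(x=["x1", "x2"], y="y", training_frame=frame)
print(model.model_performance(frame))
```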

How Clients Interact with H2O AI

You can interact with H2O via clients using various interfaces. It is important to know that the data is usually not held in the client's memory: it resides in an H2O cluster, and you only receive a pointer to the data when you make a request.

Figure: H2O interaction flow between clients and the cluster.

H2O Frame

The basic unit of data storage accessible to you is the H2O Frame. It corresponds to a two-dimensional, resizable and potentially heterogeneous tabular data structure with labeled axes.

H2O Cluster

Your H2O cluster consists of one or more nodes. A node corresponds to a JVM process and this process consists of three layers.

Figure: H2O software stack.

H2O Machine Learning Components

Language Layer

The R evaluation layer acts as a slave to the REST client front end, while in the Scala layer you can write native programs and algorithms and then use them with H2O machine learning.

Algorithms Layer

This layer is where your algorithms are applied. You can run statistical methods, data import and machine learning here.

Core Layer

In this layer you handle the resource management. You can manage both the memory and the CPU processing capacity.

Array vs Object – The creation of a JSON structure follows some rules you should know

Array vs Object – JSON is one of the most popular data formats. However, such an object is created according to some rules, and these rules depend on the original data type. In this article we introduce you to the conversion of some JSON data types (array vs object).

What is JSON anyway?

With the JavaScript Object Notation, JSON for short, you can structure data compactly and independently of programming languages. The data format is therefore particularly well suited for exchange between your applications, for general data storage (file extension “.json”) and for configuration files. The data is also human-readable and encoded in a standardized text format. The use of the data format is defined by RFC 8259, and the JSON syntax by the ECMA-404 standard. Due to its easy integration with JavaScript, it is well suited for transferring data in web applications.

You can best compare the JSON data structure to XML and YAML, only it’s simpler and more compact.

What are the basic rules?

Figure: a simple JSON object.

The JSON text structure is based on JavaScript object syntax, so hierarchical data structures are possible. It contains only properties and no methods. The basis is formed by name-value pairs and ordered lists of values. JSON objects are written with curly braces and transmitted as strings, which is especially advantageous if you want to transfer the data over the network. If you want to access the data, you have to convert the text structure back into a native JavaScript object.
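As a small illustration (sketched here in Python rather than JavaScript, using the standard json module), serializing an object to JSON text and parsing it back corresponds to the conversion described above:

```python
import json

# A native data structure with name-value pairs (curly braces in JSON)
person = {"name": "Ada", "age": 36, "languages": ["Python", "R"]}

text = json.dumps(person)      # serialize to a JSON string, e.g. for transfer over the network
print(text)                    # {"name": "Ada", "age": 36, "languages": ["Python", "R"]}

restored = json.loads(text)    # parse the text back into a native object to access the data
print(restored["languages"][0])  # Python
```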

Data Formats – JSON Array vs Object

Basically, you can have different data types included in JSON.

Value:

Your JSON value can take one of the following allowed types.

Figure: data types a JSON value can assume.

Object:

A JSON object represents the basic form of a JSON text. It can hold any data type that is suitable for inclusion in JSON.

Figure: creation of a JSON object.

Array:

JSON Array vs Object – It is also possible to include an array. Arrays can contain objects, strings, numbers, arrays and booleans. You include arrays as shown schematically below, enclosed in square brackets.

Figure: creation of a JSON array.

In this way you can nest the individual data types within each other and thus easily create any number of hierarchy levels. For example, object attributes can consist of arrays, or arrays can contain multiple objects, as sketched below.
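A short sketch of such nesting (again in Python; the structure itself is plain JSON): an object whose attribute is an array of further objects.

```python
import json

# An object attribute ("members") that is an array containing further objects
team = {
    "team": "data-science",
    "members": [
        {"name": "Ada", "skills": ["clustering", "visualization"]},
        {"name": "Alan", "skills": ["serialization"]},
    ],
}

print(json.dumps(team, indent=2))  # arbitrarily deep hierarchy levels are possible
```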

5 Clustering Algorithms Data Scientists need to know – The key is always to understand the basic approach of any algorithm you want to use

As a data scientist, you have several basic tools at your disposal, which you can also apply in combination to a data set. Here we present some clustering algorithms that you should definitely know and use.

In times of Big Data, not only does the sheer amount of data increase, but also the relationships between data points; more and more complex dependencies are formed. This makes it all the more difficult to recognize similar properties and to assign the data to so-called clusters in a way that can be evaluated.

You have certainly heard of these algorithms and maybe used one or the other, but do you really know what clustering algorithms are?

What are clustering algorithms?

So let’s first clarify what these algorithms are in the first place. The goal is clear: You want to identify similar properties between individual data points in a data set and group them in a meaningful way. These properties are often high-dimensional.

With the help of cluster analysis, you want to reduce this high-dimensional information to a low-dimensional dependency. So, for example, a representation in 2D space. Clustering is an unsupervised machine learning technique and in the end you classify the data points by using algorithms.

The approach to clustering differs from technique to technique. All have their advantages and disadvantages, so it makes sense to try several on one set of data, or apply them in combination. Below we will introduce you to some popular clustering methods and explain their grouping approach.

Figure: popular clustering algorithms every data scientist should know.

Mean-Shift Clustering

The first algorithm we want to introduce you to is Mean-Shift Clustering. With this you can find dense areas of data points according to the concept of kernel density estimation (KDE). The basis of the clustering is a circular sliding window, which moves towards higher density at each iteration. Within the window, the centers of each class are determined, called centroids.

The shift is created by moving the window center to the mean of the points within the window; the density within the sliding window is thus proportional to the number of points it contains. This motion continues until there is no direction in which the window would capture more points within the kernel.
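A minimal sketch of this procedure, assuming scikit-learn's MeanShift implementation and some toy data:

```python
# Minimal Mean-Shift sketch using scikit-learn (assumed to be installed).
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

bandwidth = estimate_bandwidth(X, quantile=0.2)   # radius of the sliding window
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("found clusters:", len(np.unique(labels)))
print("centroids:\n", ms.cluster_centers_)
```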

Figure: the Mean-Shift clustering principle (schematic, simplified).

Hierarchical Cluster Analysis (HCA)

With HCA, clusters are formed based on empirical similarity measures of the data points. The two most similar objects are merged one after the other until all objects are in one cluster, which results in a tree-like structure. In contrast to the K-means algorithm, which we will discuss later, similarities between the clusters play a role; these are represented by a cluster distance. With K-means, all objects within a cluster are merely similar to each other while being dissimilar to objects in other clusters.

You can create an HCA in different ways. There are two elementary procedures, the top-down and the bottom-up. If you want to know more about Hierarchical Cluster Analysis, read this article.
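A short sketch of the bottom-up variant, assuming scikit-learn's AgglomerativeClustering:

```python
# Bottom-up (agglomerative) hierarchical clustering sketch with scikit-learn.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# linkage controls how the cluster distance is measured (single/complete/average/ward)
hca = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hca.fit_predict(X)
print(labels[:10])
```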

Figure: the HCA clustering principle (schematic, simplified).

Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

GMM assumes that the data points are Gaussian distributed rather than circular (as k-means implicitly assumes). The clusters are described by their mean and standard deviation. Each Gaussian distribution is randomly initialized and assigned to a cluster, then fitted using the Expectation-Maximization (EM) optimization algorithm. For each data point, the probability of belonging to each cluster is calculated: the closer a point lies to a Gaussian's center, the more likely it belongs to that cluster. Based on these probabilities, a new set of parameters for the Gaussian distributions is computed iteratively, so that the probabilities within a cluster are maximized.
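A compact sketch with scikit-learn's GaussianMixture, which fits the Gaussians via EM and exposes per-cluster membership probabilities:

```python
# EM clustering with Gaussian Mixture Models (scikit-learn sketch).
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)  # fitted with Expectation-Maximization
gmm.fit(X)

labels = gmm.predict(X)            # hard assignment to the most likely Gaussian
probs = gmm.predict_proba(X[:3])   # soft membership probabilities per cluster
print(probs.round(3))
```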

K-Means Clustering

The k-means algorithm described by MacQueen (1967) goes back to the methods described by Lloyd (1957) and Forgy (1965). Besides cluster analysis, you can also use the algorithm for vector quantization. Here, a data set is partitioned into k groups with equal variance.

The number of clusters must be specified in advance. Each disjoint cluster is described by the average of all contained samples, the so-called cluster centroid.


Each centroid is updated to represent the average of its constituent instances. This is repeated until the assignment of instances to clusters no longer changes. If you want to learn more about the K-means algorithm, check this out.
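A minimal sketch with scikit-learn's KMeans; the number of clusters k must be given up front:

```python
# k-means sketch with scikit-learn; k must be specified in advance.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print("cluster centroids:\n", km.cluster_centers_)
print("inertia (within-cluster variance):", km.inertia_)
```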

Figure: the k-means clustering principle (schematic, simplified).

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based cluster analysis method that explicitly handles noise. Starting from an arbitrary data point, all points within a distance epsilon are considered its neighborhood. Clustering only begins once a minimum number of neighborhood points is reached.

The current data point then becomes the first point of a new cluster, or it is labeled as noise; in both cases it is marked as visited. The neighboring data points are then added to the cluster. Once all neighbors have been added, a new, unvisited point is selected and processed, which may form a new cluster.
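A minimal DBSCAN sketch, again assuming scikit-learn; eps is the neighborhood radius and min_samples the point count required to start a cluster:

```python
# DBSCAN sketch: eps = neighborhood radius, min_samples = required neighbor count.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", int(np.sum(labels == -1)))  # label -1 marks noise
```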

Figure: how DBSCAN works (schematic, simplified).

The field of clustering algorithms is wide and each algorithm takes a different approach. You should be aware that there is no single best solution; you have to treat each algorithm as another tool. Not every technique works equally well in every situation.

The key is to always understand the basic approach of each algorithm you want to use. Build a small portfolio and get to know these techniques well; once you master them, add new ones. Knowing your own tools is crucial to avoid trial and error and to gain control over your data. Remember: no result is also a result. Even if an algorithm does not work well on your data set, it still gives you information about the data's properties.

Apache Avro – Effective Big Data Serialization Solution for Kafka

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.


You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independently of the programming language. The text structure reflects your data hierarchy. Well-known serialization formats are, for example, XML and JSON; if you want to know more about both formats, read our articles on those topics. To read the data, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can add up to long waits at a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible storage format. Avro is developed by the Apache community and is optimized for Big Data use: it offers you a fast and space-saving open-source solution. If you don't know what Apache means, look here; we have summarized everything you need to know about it and introduce some other Apache open-source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as part of Apache Hadoop. There, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide this functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Because it can serialize large amounts of data cost-efficiently, Avro can now also be used independently of Hadoop.

You can access Avro via APIs in many common programming languages (Java, C#, C, C++, Python and Ruby), so you can integrate it very flexibly.

In the following figure we have summarized some of the reasons that make the framework so useful. But what really makes Avro so fast?

Figure: features of Apache Avro.

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. The data hierarchy, i.e. the metadata, is stored separately in a file. Data types and protocols are defined in a JSON format; they are assigned unambiguously to the actual values via an ID and can be looked up at any time during further processing. For data exchange via RPC calls, this schema is sent along with the data.
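As an illustration (a sketch using the third-party fastavro package; the JSON-defined schema is the important part), the same schema is used both for writing and for reading the binary data:

```python
# Sketch: serializing records with an Avro schema (using the fastavro package).
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age",  "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

with open("users.avro", "wb") as out:      # compact binary file
    writer(out, schema, records)

with open("users.avro", "rb") as inp:      # deserialization uses the same schema
    for user in reader(inp):
        print(user)
```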

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

You can save a lot of overhead if you store the metadata separately and retrieve it only when you really need it. In the following figure we have shown this process schematically.

Figure: Avro used with Kafka.

When you let Avro manage your schema registration, it provides you with comprehensive, flexible and automatic schema evolution. This means that you can add and delete fields; even renaming is allowed within certain limits. At the same time, an Avro schema is backward and forward compatible, which means that the schema versions of the reader and writer can differ. Other schema-based serialization solutions exist, among them Google Protocol Buffers and Apache Thrift. However, its JSON-defined schemas make Avro the most popular choice.

What does HCA stand for?

What does HCA stand for? What is the difference between Agglomerative and Divisive? When do I use the algorithm and what are its strengths? In this article we will clarify all these questions.

If you don’t know what clustering means, check out this article. Here we also explain four other clustering methods that you as a data scientist must know.

What is an HCA?

Hierarchical Cluster Analysis, or HCA, is a technique for optimal and compact connection of objects based on empirical similarity measures. The two most similar objects are assigned one after another until all objects are finally in one cluster. This then results in a tree-like structure.

Figure: basic principle of an HCA applied to raw data.

So how does a hierarchical cluster procedure work?

Agglomerative vs Divisive Calculation

The basic clustering can be done in two opposite ways, Agglomerative and Divisive calculation.

Agglomerative clustering:

Agglomerative Nesting, abbreviated AGNES, is also known as the bottom-up method. This method first creates a cluster between two objects with high similarity, and then adds more clusters until all the data has been enclosed.

The divisive cluster calculation follows an opposite concept.

Divisive hierarchical clustering:

Divisive Analysis, also known as DIANA, is a top-down method. All objects are first placed in one cluster, which is then successively split into smaller clusters.

In the following figure, the agglomerative process is compared with the divisive process.

Figure: agglomerative vs divisive calculation.

The goal is thus to represent the common properties of multidimensional raw data in a low dimension. A strength of this machine learning method is that it takes relationships between clusters into account. With K-means, all objects within a cluster are merely similar to each other while being dissimilar to objects in other clusters. If you want to know more about this other popular clustering method, read this article.

How to calculate the cluster distances?

As mentioned earlier, not only are similarities between data points in a cluster weighted, but also similarities between groups. These similarities are represented by distances between the clusters. These distances can be determined in different ways. The distance between the centroids of two clusters can be calculated. A single linkage is the shortest distance between two clusters, a complete linkage is the largest distance between two clusters and an average linkage is the average distance between two clusters.

The figure below contrasts each cluster distance calculation method.

Figure: cluster distance calculation methods (centroid, single, complete and average linkage).

In addition to the planar representation, the HCA can also be represented in a dendrogram.

HCA represented in a Dendrogram

Since an HCA describes a tree structure, it can be well represented in a dendrogram. Here the connections between the individual data elements and the connections between the clusters become well visible. This diagram can help to choose the optimal number of clusters in the data depending on where you intersect the tree.

In the following figure, for example, such a dendrogram is shown in dependence on Agglomerative and Divisive Calculation.

Figure: an HCA represented as a dendrogram for agglomerative and divisive calculation.
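A small sketch of how such a dendrogram can be produced, assuming SciPy and Matplotlib; the method argument selects the cluster distance described above (single, complete or average linkage):

```python
# Agglomerative HCA with SciPy: linkage matrix + dendrogram plot.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="average")   # also possible: "single", "complete", "centroid"
dendrogram(Z)                      # tree structure; cutting it yields the cluster count
plt.xlabel("data points")
plt.ylabel("cluster distance")
plt.show()
```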

IaaS vs PaaS vs SaaS – The Various Facets of Cloud Computing

IaaS vs PaaS vs SaaS – terms that categorize clouds, but what exactly do they mean? In this article, we contrast all three and explain the differences.

In almost all areas, the cloud is becoming more and more important. Increasingly, the cloud is also becoming interesting for business processes. Everyone is talking about it, but what is it actually?

What is the cloud anyway?

The cloud basically means the use of different servers. This means that your data can be hosted online, i.e. stored, managed and processed.
So you don’t have to provide the appropriate hardware on site, but can rent these resources from a cloud provider. Read our article about the cloud computing provider AWS.
Besides Amazon, other global players such as Google (Google Cloud) and Microsoft (Azure) also offer powerful cloud resources.
But which ones are suitable for me or my company? To meaningfully compare the individual solutions, you need to understand the differences between them.
Basically, you need to distinguish between the three categories already mentioned.

Figure: the three cloud categories (IaaS vs PaaS vs SaaS).

IaaS vs PaaS vs SaaS – What are the Differences?

First and foremost, all three terms are used to describe a resource provided by a cloud service provider for a short period of time.
The following figure shows this “as-a-service”, or flexible consumption, model and which components are managed by whom.

Figure: distribution of tasks between provider and customer for each cloud category (red: managed by others; green: managed by your organization).

You can see very clearly here that the cloud provider manages more and more layers, ascending from IaaS to SaaS.

Software as a Service (SaaS)

The abbreviation SaaS refers to cloud-based software. This is hosted online by a company and provided via the Internet. It is easy to use and manage. Additionally, it is highly scalable, meaning it can be used for an entire organization.

Platform as a Service (PaaS)

PaaS describes a cloud-based platform service. It offers developers an online platform for application development, where data is provided, stored and managed online.

Infrastructure as a Service (IaaS)

IaaS refers to cloud-based infrastructure resources provided via virtualization technologies. These services are designed to help companies build and manage their servers, networks, operating systems and data storage. This is where the highest administrative share lies with the customer. Access to the servers for data management takes place via a dashboard or API.

IaaS vs PaaS vs SaaS – For whom is which category suitable?

So who should choose which service model? The following figure shows that the more tasks are taken over by the provider, the more control is relinquished. This is especially detrimental in organizations where a lot of control is needed.

Figure: the individual services arranged by how much control they leave to the customer and who they are suitable for.

IaaS gives administrators the most direct control over operating systems where it is needed. However, more control always comes with more complex administration tasks. PaaS therefore offers users a compromise between flexibility and ease of use; this model is particularly appealing to developers. The SaaS model offers the highest level of usability and is accordingly interesting for customers who want to take on few or no administrative tasks.

IaaS vs PaaS vs SaaS – Technology of the future?

Cloud resources can be a valuable alternative to expensive, in-house hardware solutions. Of course, with external administration, a company loses control over its own data. However, the different types of service mean that compromises can be made that are tailored to the company’s own needs.

The advantages are obvious. Individual services can be accessed from virtually anywhere at any time, and high-performance computing can be operated cost-effectively. As network technologies become faster and faster, these solutions are increasingly coming into focus and will certainly become more and more important for companies and private individuals in the coming years.

Matplotlib vs Seaborn – Who owns the Python visualization throne?

Matplotlib vs Seaborn – Matplotlib is often the first choice when it comes to creating mathematical plots with Python. But is it always the best choice? With Seaborn there is a potent competitor.

Matplotlib was developed by John D. Hunter back in 2003 and has become indispensable. Due to the increasing importance of the Python programming language in almost all scientific areas, the importance of fully compatible visualization methods is also growing.


Due to its open source concept, Matplotlib can be used absolutely free of charge and is a basic component of many popular Python distribution platforms, such as Anaconda.


The library offers a MATLAB-like interface and can be used in combination with NumPy, Pandas and Scipy, just like MATLAB.

SciPy is a collection of mathematical algorithms and convenience functions and is mainly used by scientists, analysts and engineers for scientific computing, visualization and related activities.
NumPy allows easy handling of vectors, matrices, or large multidimensional arrays in general.
NumPy’s operators and functions are optimized for multidimensional array operations and evaluate particularly efficiently.

Pandas is also an open source Python library that can be used to perform data analysis and manipulation efficiently. Its strength lies in the processing and evaluation of tabular data and time series.

These fully compatible components together offer a free yet comprehensive alternative to the commercial analysis software MATLAB.

Figure: Python libraries that together form an open-source MATLAB alternative.

Python Matplotlib – What are the features?

The library offers a wide range of visualization functions. Some of them are listed in the figure below.

Figure: Matplotlib features sorted by use case.

Matplotlib is designed to effectively visualize the results of mathematical calculations. Visualization is an efficient and important data analysis tool.
The library is able to generate all the usual diagrams and figures by default. It is even possible to create animations that can be used to better understand the flow of certain algorithms.

Event Handling

Matplotlib offers an important feature with event handling. Behind the name is a UI-neutral event model. This allows the library to connect to events without knowing which UI Matplotlib will eventually plug into.


This allows you to write very flexible and portable code. The events can then be used to pass along information such as the data coordinates of a click.
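A small sketch of this event handling using the standard Matplotlib API: a callback is registered on the canvas and receives the data coordinates of a mouse click, independent of the GUI backend.

```python
# Sketch: UI-neutral event handling in Matplotlib via mpl_connect.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

def on_click(event):
    # event.xdata / event.ydata are the data coordinates of the click (None outside the axes)
    if event.inaxes is not None:
        print(f"clicked at x={event.xdata:.2f}, y={event.ydata:.2f}")

fig.canvas.mpl_connect("button_press_event", on_click)
plt.show()
```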

PyLab vs Pyplot

PyLab is a collection of functions that is installed together with Matplotlib and makes the library work like MATLAB. The module pulls a large set of NumPy functions and classes into the global namespace so that they are accessible without explicit imports. However, this often led to conflicts between individual Matplotlib functions, which is why the use of PyLab is no longer recommended. Pyplot is a module in Matplotlib that provides the state-machine interface to the underlying plotting library.


These conflicts are avoided by importing Pyplot explicitly together with a separate NumPy import.
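In practice that means importing Pyplot and NumPy explicitly instead of relying on PyLab's implicit namespace, for example:

```python
# Recommended style: explicit imports instead of the PyLab namespace dump.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.legend()
plt.show()
```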

Python Matplotlib – Third party packages

If the standard library features are not enough, you can extend Matplotlib with additional external packages. In the following figure some of the possible extensions are listed and grouped by application.

Figure: Matplotlib third-party packages sorted by use case.

These external packages must be installed individually and extend the functionality of the plotting library, or build on existing features.
They sometimes offer more complex graphics or higher performance data analysis methods. Most of these packages are open source and are constantly updated by very active communities.

Matplotlib also has weaknesses

Matplotlib is not perfect despite the wide feature set. For example, only poor default options for the size and colors of plots are offered. Matplotlib is often considered to be a low-level technology compared to today’s requirements. Thus, very specialized code is needed to generate appealing plots.

What is Seaborn?

Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for the visualization of statistical data; it not only has its own graphics functions but also internally uses Matplotlib's functionality and data structures. It thus offers a variety of additional features on top of the standard Matplotlib functions.

Figure: the main features of Seaborn.

Among other things, Seaborn provides built-in themes for designing matplotlib graphs and a dataset-oriented API for determining the relationship between variables. It can visualize both univariate and bivariate data and plot statistical time series. Estimation and plotting of linear regression models run automatically and Seaborn, unlike Matplotlib, offers optimization when processing NumPy and Pandas data structures.
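A short sketch of the dataset-oriented API, assuming Seaborn's bundled example data: a scatter plot with an automatically fitted regression line in one call.

```python
# Seaborn sketch: dataset-oriented API with automatic linear regression fit.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")              # small example dataset bundled with Seaborn
sns.lmplot(data=tips, x="total_bill", y="tip", hue="time")  # scatter + regression line
plt.show()
```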

So what should you choose?

Especially when it comes to deep statistics, Seaborn clearly has the edge. Matplotlib, however, is often the leaner solution due to its simplicity. So both have their strengths and weaknesses. Which tool you ultimately choose depends on the situation. You can’t do much wrong. With one solution, however, you have more contextual options. But now that you know the differences between the two, this decision will be easier for you.

k-Means: One of the simplest Clustering Algorithms

One of the most popular unsupervised clustering methods is the k-means algorithm. It is considered one of the simplest and computationally cheapest clustering algorithms. It is therefore well suited for getting an overview of possible patterns in data.

What is the principle behind the k-means algorithm? In this article we explain what is behind the algorithm and how it really works, because the better you know your data science tools, the better you will be able to analyze your data.

What is k-Means?

The k-means algorithm described by MacQueen (1967) goes back to methods described by Lloyd (1957) and Forgy (1965). The unsupervised machine learning algorithm is used for vector quantization or cluster analysis. If you don't know what the differences are between supervised, unsupervised and reinforcement methods, read this article on the main machine learning categories.

The following figure shows the basic principle of the k-Means clustering algorithm.

Figure: basic principle of the k-means clustering algorithm.

The main goal of unsupervised clustering is to create collections of data elements that are similar to each other, but dissimilar to elements in other clusters.

What is the principle behind the k-means algorithm?

Here, a data set is partitioned into k groups of equal variance. The number of clusters to search for must be specified in advance. Each disjoint cluster is described by the average of all contained samples, the so-called cluster centroid. The following figure shows this centroid principle.


Figure: the k-means cluster centroid principle.

Each centroid is updated to represent the average of its constituent instances. This is done until the assignment of instances to clusters does not change.

Applied algorithm process

But how exactly does the algorithm work?
First, initial centroids are set. The distances between the data instances and the centroids are measured, and each instance is assigned to its nearest centroid. The centroids are then recalculated. If assignments change, the instances are re-measured, re-clustered and the centroids re-calculated until the result stabilizes.
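To make these steps concrete, here is a small from-scratch sketch in plain NumPy (illustrative only, not an optimized implementation):

```python
# Illustrative k-means loop: assign to nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centroids
    for _ in range(n_iter):
        # distance of every point to every centroid, assignment to the nearest one
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when nothing changes any more
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(X, k=2)
print(centroids)
```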


Figure: the k-means centroid creation process.

ksqlDB – Efficient real-time stream transformation of data within Kafka’s data pipelines

ksqlDB vs Kafka Streams – Data streams are all the rage right now: a technique for moving and processing huge amounts of data simultaneously without having to store it first.

What is Apache Kafka?

With the message broker Kafka, data can be stored resource-efficiently in so-called topics as logs. These topics can then be subscribed to and written to by any number of clients, primarily microservices. The metadata is stored externally in a schema registry and re-attached to the data via an ID when it is read. In this way, each microservice can be developed independently of technology and programming language, while the data structure remains the same.

However, if a microservice wants to access the data streams from two or more topics and these arrive with different frequencies, then the correct allocation of the data is often difficult. The so-called data stream position can be controlled with event streaming databases.

What is ksqlDB?

Especially for Apache Kafka, ksqlDB allows easy transformation of data within Kafka’s data pipelines.

The following figure shows what a software architecture with Apache Kafka and ksqlDB could look like. It is still possible to subscribe to the data streams directly from the message broker, or indirectly via ksqlDB using pulls and pushes. The communication between the tables and Kafka is handled directly via the event streaming platform Confluent.

Figure: a possible software architecture with Apache Kafka and ksqlDB.

ksqlDB can be used to materialize views asynchronously using interactive SQL queries. Microservices can thus enrich and transform the data in real time, which enables anomaly detection, real-time monitoring and real-time data format conversion.

Event Streaming

ksqlDB is an event streaming database. Thus, it is based on continuous streams of structured event data that can be published to multiple applications in real time. The following figure shows such an event stream schematically.

Figure: an event stream (schematic).

Each individual record always consists of an event and a unique key for identification. These event streams can be combined with streaming analytics and are a way to offload work to back-end processing applications. If you want to know more about messaging patterns and how a message is transmitted between sender and receiver, read our article.

Window-based Query Processing

ksqlDB allows continuous stream queries. These are based on window-based aggregation of events.

Windows are polling intervals that are continuously executed over the data streams. These windows can be expanded and moved as needed to handle new incoming data items. Several window types are shown in the figure below; they differ in how the windows are composed.

Figure: ksqlDB window types.

The “Tumbling” type repeats a fixed, non-overlapping interval, while the “Hopping” type allows overlaps. In a “Session”, the elements are grouped by activity sessions without allowing overlaps; the session is terminated when no elements are received for a certain time.
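As a sketch (assuming a ksqlDB server reachable on its default REST port 8088 and an existing stream named pageviews, both of which are illustrative assumptions), such a tumbling-window aggregation could be submitted like this:

```python
# Sketch: submitting a tumbling-window aggregation to ksqlDB's REST endpoint.
# Assumptions: ksqlDB server at localhost:8088, an existing stream called `pageviews`.
import json
import requests

statement = """
CREATE TABLE pageviews_per_minute AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY user_id
  EMIT CHANGES;
"""

resp = requests.post(
    "http://localhost:8088/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    data=json.dumps({"ksql": statement, "streamsProperties": {}}),
)
print(resp.status_code, resp.json())
```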

ksqlDB Features

In addition to continuous queries through window-based aggregation of events, ksqlDB offers many other features that are helpful in dealing with streams. For example, the last value of a column can be tracked when aggregating events from a stream into a table.


Multiple streams can be merged by real-time joins or transformed in real time. The database itself is distributed, fault-tolerant and scalable. Kafka Connect connectors can be executed and controlled directly. Push and pull queries can be applied to the streams: subscribers either get the constantly updated results of a query, or they retrieve data at a specific point in time in a request/response fashion.

Conclusion

With Confluent’s event streaming database ksqlDB, a service is provided that offers an absolutely compatible solution for real-time data stream processing with Kafka. Kafka in particular lends itself as a central element in a microservice-based software architecture. Microservices run as separate processes and consume in parallel from the message broker. Aligning these processes remains a challenge. However, ksqlDB ensures real-time stream processing within the services.

That is why Liquid State Machines (LSM) are great

– Recently developed computational model

– does not require information to be stored in some stable state of the system

→ the inherent dynamics of the system are used by a memoryless readout function to compute the output

→ can be used for complex tasks (pattern classification, function approximation, object tracking, …)

LSMs take the temporal aspect of the input into account

Concept

Figure: typical structure of a Liquid State Machine.

Reservoir/ Liquid

– large accumulation of recurrent interacting nodes
→ is stimulated by the input layer
– Liquid itself is not trained, but randomly constructed with the help of heuristics
– Loops cause a short-term memory effect
– preferably a Spiking Neural Network (SNN)
→ are closer to biological neural networks than the multilayer Perceptron
→ can be any type of network that has sufficient internal dynamics

Running State

→ will be extracted by the readout function

– depend on the input streams they have been presented with

Readout Function

– converts the high-dimensional state into the output

– since the readout function is separated from the liquid, several readout functions can be used with the same liquid

→ so different tasks can be performed with the same input

Figure: different types of readout functions.