Tag: Continuous transformations

Data Mining vs Big Data Analytics – You need the right tools and you need to know how to use them!

November 16, 2020 / RainerGewalt

Data Mining vs Big Data Analytics – Both data disciplines, but what makes them different? In this article, we introduce you to both fields and explain the key differences.

Data science is an interdisciplinary scientific field that has come more and more into focus in recent decades. Many companies see it as the key to becoming an Industry 4.0 company. The hope is that valuable information can be found in the company’s own data and used to massively increase profitability. Terms such as big data, data mining, data analytics and machine learning get thrown into the ring. Many people do not realize that these terms describe different disciplines. If you want to build a house, you need the right tools and you have to know how to use them.

Map of Data Disciplines

First of all, you should think of the individual disciplines as nested inside each other like the layers of an onion. There is overlap between all the fields, and when you talk about one discipline, you are also talking about the layers beneath it.

Figure: Map of data disciplines, showing the relationships between the individual data disciplines.

Since data analytics sits above data mining in the layer model, it is already clear that mining must be a subdiscipline of analytics. Therefore, we will describe the broader discipline first.

Data Mining vs Big Data Analytics – What is Analytics?

Big data analytics, as a subfield of data analysis, describes the use of data analysis tools without special data processing. In data analytics, you use queries and data aggregation methods, but also data mining techniques and tools.

The goal of this discipline is to represent various dependencies between input variables. The following figure shows the individual overlaps in the use of the tools of the different disciplines.

Figure: Overlaps in the use of tools across the different data disciplines.

Data Mining vs Big Data Analytics – What is Data Mining?

Data mining is a subset of data analytics. At its core, it is about exploring a large data set and discovering correlations in it. It is especially useful when you know little about the available data.

But what does a typical data mining process look like and what are typical data mining tasks?

Data Mining Process

You can divide a typical data mining process into several sequential steps. In the preprocessing stage, your data is first cleaned. This involves integrating sources and removing inconsistencies. Then you convert the data into the right format. After that, the actual analysis step, the data mining, takes place. Finally, your results have to be evaluated. Expert knowledge is required here to check the patterns found and whether your own objectives have been met.
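The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the field names `temp` and `load`, the four function names, and the threshold of 0.8 are hypothetical, and a simple Pearson correlation stands in for the actual mining step.

```python
from statistics import mean

def clean(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records
            if r.get("temp") is not None and r.get("load") is not None]

def transform(records):
    """Convert the cleaned records into numeric pairs for the analysis."""
    return [(float(r["temp"]), float(r["load"])) for r in records]

def mine(pairs):
    """Analysis step: Pearson correlation between the two columns."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def evaluate(correlation, threshold=0.8):
    """Evaluation: a stand-in for the expert check of the pattern found."""
    return abs(correlation) >= threshold

# Hypothetical raw data; the third record is incomplete.
raw = [
    {"temp": "20", "load": "40"},
    {"temp": "25", "load": "50"},
    {"temp": None, "load": "55"},
    {"temp": "30", "load": "61"},
]

correlation = mine(transform(clean(raw)))
is_relevant = evaluate(correlation)
```

In practice the evaluation step is the one that cannot be automated this easily, since it depends on domain knowledge rather than a fixed threshold.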

The term data mining covers a variety of techniques and algorithms to analyze a data set. In the following we will show you some typical methods.

Data Mining Tasks

Besides identifying unusual data records with outlier detection, you can also group your objects based on similarities using cluster analysis. In a separate article we have already summarized some popular clustering algorithms that you should know as a data scientist. While association analysis only identifies the relationships and dependencies in the data, regression analysis describes the relationships between dependent and independent variables. Through classification, you assign elements that do not yet belong to a class to one of the existing classes. You can also summarize the data to reduce the data set to a more compact description without significant loss of information.
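Two of these tasks can be illustrated with a few lines of code. The sketch below assumes a z-score rule for outlier detection and a nearest-center rule for classification; both are deliberately simple stand-ins, and the function names, sample readings and class centers are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, z_limit=2.0):
    """Outlier detection: flag values more than z_limit standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_limit]

def classify(value, class_centers):
    """Classification: assign a new value to the class with the closest center."""
    return min(class_centers, key=lambda label: abs(value - class_centers[label]))

# Hypothetical sensor readings with one obviously unusual value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
outliers = find_outliers(readings)

# Hypothetical existing classes, each described by a single center value.
label = classify(9.5, {"normal": 10.0, "hot": 25.0})
```

Real data mining tools use more robust methods, but the division of labor is the same: outlier detection looks for records that do not fit, classification assigns new records to classes that already exist.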

Data Mining vs Big Data Analytics – Conclusion

The two disciplines are related, but they are not the same. Data mining is more about identifying key relationships, patterns or trends in the data, while data analytics is more about deriving a data-driven model. On this path, data mining is an important step in making the data more usable. In the end, it’s not a versus: both disciplines are part of an analytics pipeline.
In another article, we go further into the differences between the various data sciences and clarify the difference between data analysis and data science.

ksqlDB – Efficient real-time stream transformation of data within Kafka’s data pipelines

November 1, 2020 / RainerGewalt

ksqlDB vs Kafka Streams – Data streams are all the rage right now. Streaming is a technique for moving and processing huge amounts of data at the same time, without caching it first.

What is Apache Kafka?

With the message broker Kafka, data can be stored resource-efficiently as logs in so-called topics. Any number of clients, primarily microservices, can subscribe to these topics and write to them.
The metadata is stored externally in a schema registry and reassigned to the data via an ID when it is read. In this way, each microservice can be developed independently of technology and programming language. The data structure remains the same.
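The idea of topics as logs and of schemas referenced by ID can be sketched in memory. The classes `SchemaRegistry` and `Topic` below are hypothetical toy stand-ins written for this illustration; they are not the Kafka client API or the Confluent Schema Registry API.

```python
import json

class SchemaRegistry:
    """Toy stand-in for an external schema registry (hypothetical class)."""
    def __init__(self):
        self._schemas = {}

    def register(self, field_names):
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = field_names
        return schema_id

    def lookup(self, schema_id):
        return self._schemas[schema_id]

class Topic:
    """Toy append-only log; every consumer keeps its own read position."""
    def __init__(self):
        self._log = []
        self._offsets = {}

    def produce(self, schema_id, values):
        # Only the schema ID travels with the payload, not the field names.
        self._log.append(json.dumps({"schema_id": schema_id, "values": values}))

    def consume(self, consumer, registry):
        offset = self._offsets.get(consumer, 0)
        records = []
        for raw in self._log[offset:]:
            message = json.loads(raw)
            fields = registry.lookup(message["schema_id"])
            records.append(dict(zip(fields, message["values"])))
        self._offsets[consumer] = len(self._log)
        return records

registry = SchemaRegistry()
topic = Topic()
sensor_schema = registry.register(["sensor", "temperature"])
topic.produce(sensor_schema, ["s1", 21.5])
topic.produce(sensor_schema, ["s2", 19.0])

first_read = topic.consume("service-a", registry)    # both records so far
topic.produce(sensor_schema, ["s1", 22.0])
second_read = topic.consume("service-a", registry)   # only the new record
other_service = topic.consume("service-b", registry) # starts from the beginning
```

Because the log is never modified and each consumer only moves its own offset, any number of services can read the same topic without interfering with each other.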

However, if a microservice wants to access the data streams from two or more topics and these arrive at different frequencies, then matching up the data correctly is often difficult. The so-called data stream position can be controlled with event streaming databases.

What is ksqlDB?

Especially for Apache Kafka, ksqlDB allows easy transformation of data within Kafka’s data pipelines.

The following figure shows what a software architecture with Apache Kafka and ksqlDB could look like. It is still possible to subscribe to the data streams directly from the message broker, or indirectly via ksqlDB using pull and push queries. The communication between the table and Kafka is handled directly by the event streaming platform Confluent.

Figure: Software architecture with Apache Kafka and ksqlDB.

ksqlDB can be used to materialize views asynchronously using interactive SQL queries.
This lets microservices enrich the data and transform it in real time.
This enables anomaly detection, real-time monitoring, and real-time data format conversion.

Event Streaming

ksqlDB is an event streaming database. Thus, it is based on continuous streams of structured event data that can be published to multiple applications in real time. The following figure shows such an event stream schematically.

Figure: Schematic view of an event stream.

Each individual record always consists of an event and a unique key for identification.
These event streams can be combined with streaming analytics, which is a way to offload work to back-end processing applications. If you want to know more about messaging patterns and how a message is transmitted between sender and receiver, read our article on that topic.

Window-based Query Processing

ksqlDB allows continuous stream queries. These are based on window-based aggregation of events.

Windows are polling intervals that are continuously evaluated over the data streams. These windows can be extended and moved as needed to handle newly arriving data items.
Several window types are shown in the figure below. They differ in how their intervals are arranged relative to each other.

Figure: Window types in ksqlDB.

The “Tumbling” type repeats a non-overlapping interval, while the “Hopping” type allows overlaps. In a “Session” window, the elements are grouped by activity sessions without allowing overlaps. A session ends when no elements are received for a certain time.
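The difference between the three window types becomes clearer when you look at which windows a given timestamp falls into. The following is a minimal sketch with integer timestamps; the function names and parameters are hypothetical and this is not how ksqlDB implements windowing internally.

```python
def tumbling_windows(timestamp, size):
    """Tumbling: fixed size, no overlap, so each event is in exactly one window."""
    start = (timestamp // size) * size
    return [(start, start + size)]

def hopping_windows(timestamp, size, advance):
    """Hopping: fixed size, a new window starts every `advance` units, so windows overlap."""
    windows = []
    start = (timestamp // advance) * advance
    while start > timestamp - size and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)

def session_windows(timestamps, gap):
    """Session: events are grouped until no event arrives for longer than `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

one_window = tumbling_windows(7, size=5)
two_windows = hopping_windows(7, size=10, advance=5)
sessions = session_windows([1, 2, 3, 10, 11, 20], gap=3)
```

With these inputs, the event at time 7 lands in a single tumbling window, in two overlapping hopping windows, and the six session timestamps split into three activity sessions.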

ksqlDB Features

In addition to continuous queries through window-based aggregation of events, ksqlDB offers many other features that are helpful in dealing with streams. For example, the last value of a column can be tracked when aggregating events from a stream into a table.

Multiple streams can be merged by real-time joins or transformed in real time. Throughout, the database is distributed, fault tolerant and scalable.
Kafka Connect connectors can be executed and controlled directly.
Push and pull queries can be applied to the streams. Subscribers thus receive the continuously updated results of a query, or can retrieve data at a specific point in time in request/response flows.
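The interplay of a table that tracks the last value per key with push and pull queries can be sketched as follows. The class `MaterializedTable` and its methods are hypothetical names chosen for this illustration; ksqlDB itself exposes these concepts through SQL statements, not through this interface.

```python
class MaterializedTable:
    """Toy table that keeps the latest value per key from a stream of events."""
    def __init__(self):
        self._latest = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Push query: the callback receives every future update."""
        self._subscribers.append(callback)

    def apply(self, key, value):
        """Fold one stream event into the table and notify subscribers."""
        self._latest[key] = value
        for callback in self._subscribers:
            callback(key, value)

    def get(self, key):
        """Pull query: one-off lookup of the current value."""
        return self._latest.get(key)

table = MaterializedTable()
pushed = []
table.subscribe(lambda key, value: pushed.append((key, value)))

# Hypothetical stream of (key, value) events.
for key, value in [("s1", 21.5), ("s2", 19.0), ("s1", 22.0)]:
    table.apply(key, value)

current = table.get("s1")
```

The push subscriber sees all three events as they arrive, while the pull lookup only returns the state at the moment it is asked, here the most recent value for `s1`.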

Conclusion

With Confluent’s event streaming database ksqlDB, a service is available that offers a fully compatible solution for real-time data stream processing with Kafka. Kafka in particular lends itself as a central element in a microservice-based software architecture. Microservices run as separate processes and consume in parallel from the message broker. Aligning these processes remains a challenge. ksqlDB, however, provides real-time stream processing within the services.