EXPERT KNOWLEDGE AT A GLANCE

Category: api

Apache Hive Architecture – Data Warehouse System for free

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management. Thus, only one data truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project and part of Apache, an open source and free web server software. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

hive
Apache Hive Architecture – Built on top of Hadoop

The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of unstructured data. Hive does not work with the schema-on-write (SoW) approach like relational databases, but uses the so-called schema-on-read (SoR) approach.

What are the biggest advantages of Hive?

Data from relational databases is automatically converted into MapReduce or Tez or Spark jobs. Hadoopclusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

The figure shows the main Apache Hive features
Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL. A declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

The figure shows the basic three-part core architecture of Apache Hive.
Apache Hive Architecture

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. This
receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into abstract syntax and executable tasks, an optimizer that aggregates, splits, and optimizes for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Computer

The metadata is stored here in a persistence. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

Things you need to know when you start using Apache Spark

Apache Spark Streaming – Every company produces several million pieces of data every day. Properly analyzed, this information can be used to derive valuable business strategies and increase productivity.
Until now, this data was consumed and stored in a persistent. Even today, this is an important step in order to be able to perform analyses on historical data at a later date. Often, however, analysis results are desired in real time. Be it only reference values that have been exceeded.


So-called data streams, i.e. data that is continuously generated from thousands of data sources, can already be consumed before they end up in a persistence, without the flow rate being significantly reduced. It is even possible to train neural networks using such a stream.


In this article, we’ll tell you why you shouldn’t miss out on Apache Spark and Apache Spark Streaming if you’re planning to integrate stream processing in your organization.

What is Apache Spark?


Apache Spark has become one of the most important and performant unified data analytics on the market today. The framework provides a total solution of data processing and AI integration. This allows companies to easily develop performant data pipelines and train AI methods using massive data streams.


Apache Spark combines several partially interdependent components. So can be deployed in a modular fashion to a certain extent.
Spark can run in its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos or on Kubernetes.
The data here can come from streaming sources, such as Kafka, as well as static data sources. So far, the programming languages Java, Scala, Python and R are supported. These are currently the most commonly used languages across all scientific disciplines for implementing data analysis methods.

What does a Spark cluster look like?

The coordinator of a Spark program on a cluster is the so-called SparkContext object. This controls the individual Spark applications as they run as independent processes.
The Coordinator then connects to the Central Element, a Cluster Manager, which then allocates resources to the individual applications.
The figure below shows an example of a typical Spark cluster with all its components.

The figure  shows an example of a typical Spark cluster with all its components.
Overview Apache Spark Cluster

The actual calculations and data storage then take place on the nodes. These processes, also called executors, then execute tasks and hold the data in memory or disk space. The cache can then be accessed by another node.

Apache sparks underlying technology – The key to high Performance

Spark Core is the underlying unified computing engine on which all Spark functions are built. It enables parallel processing even for large datasets and thus ensures very high-performance processes.
The following figure shows how the Apache Spark Core APIs are composed.

The  figure shows how the Apache Spark Core APIs are composed.
Apache Spark Core APIs

The core API consists of low level APIs, where object manipulation via Resilient Distributed Datasets (RDDs) takes place and structured APIs, where all data types are manipulated and batch or streaming jobs take place.

How do the individual Apache Spark APIs work?

In order to properly understand the API structure, its components must be placed in a historical context.

The figure shows the development history of the Apache Spark APIs.
Development history of the Apache Spark APIs

What is the RDD API?

The RDD (Resilient Distributed Dataset) API has been implemented since the first Spark release and is based on the Scala collections API.
RDDs are a set of Java or Scala objects that represent data and thus are the building blocks of Spark. They excel in being compile-time type-safe and inert.

All higher level APIs can be decomposed into RDDs. Various transformations can be performed in parallel using this API. Each of them defines an operation to be executed, which is invoked by calling an action method and creates a new RDD. This then represents the transformed data.

What is the Dataframe API?

The Dataframe API introduces a higher level abstraction. Spark dataframes correspond to the Pandas dataframes structure. They are built on top of RDDs and represent two-dimensional data and a schema. It contains an ordered collection of columns and each different column can consist of different data types. Each value is unique by a row and a column index.


When data is transferred between nodes, only the data is transferred. The metadata is managed in a schema registry separate from spark. This has significantly improved the performance and scalability of Spark.
The API is suitable for creating a relational query plan. Thus, manipulation of data can now be done using a query language.

What is the Dataset API?

When working with dataframes, compile-time type safety is lost. This is a strength of the RDD API. The Dataset-API was created to combine the advantages of both APIs. It is thus the second most important Spark API next to the RDD API.


The basis of this API are integrated encoders, which are responsible for the conversion between JVM objects and the internal Spark SQL representation.

What components does Apache Spark consist of?

Spark is modularly extensible through the use of components. Spark includes libraries for various tasks ranging from SQL to streaming and machine learning. All components are based on the Spark Core, the foundation for parallel and distributed processing of large data sets. How this API looks in detail and what makes it so performant, we will explain later.
The following figure lists the individual Apache Spark components.

In the figure, the ecosystem of Apache Spark is shown with all the major components.
Apache Spark Ecosystem

Apache Spark Spark SQL

With this component RDDs are converted into the so-called data frames, i.e. provided with metadata information.
The whole thing is done by a catalyst optimizer, which executes an execution plan in the form of a tree.

Apache Spark GraphX

This framework can be used to perform high-performance calculations on graphs. These operations can run in parallel.

Apache Spark MLlib/SparkML

With the MLlib component, machine learning pipelines can be constructed very easily. For this purpose, ready-made models and common machine learning algorithms (classification, regression, clustering …) can be used. Thus, data identification, feature extraction and transformation are combined in a unified framework.

Apache Spark Streaming

Apache Spark Streaming enables and controls the processing of data streams. However, Apache Spark Streaming can also process data from static data sources.
In the case of datastreaming, input stream goes from a streaming data source, such as Kafka, Flume or HDFS, into Apache Spark Streaming.
There, it is broken into batches and fed into the Spark engine for parallel processing. The final results can then be output to HDFS databases and dashboards.
The following figure illustrates the principle of Apache Spark Streaming.

The figure illustrates the principle of Apache Spark Streaming.
Principle of Apache Spark Streaming

All components can consume directly from the stream via Apache Spark Streaming. This component takes a crucial role here. It coordinates the requests via sliding window operations and regulates the data flow. Since all components are based on the Spark Core API, absolute compatibility is guaranteed. Especially in the Big Data area, this can deliver a decisive performance bonus.

Apache Avro – Effective Big Data Serialization Solution for Kafka

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.


You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data, cost-efficiently, Avro can now be used Hadoop-independently. 

You can access Avro via special API’s with many common programming languages (Java, C#, C, C++, Python and Ruby). So you can implement it very flexible.

In the following figure we have summarized some reasons what makes the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it
Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. About that the data hierarchy, i.e. the metadata, is stored separately in a file. The data types and protocols are defined via a JSON format. These are to be assigned unambiguously by ID to the actual values and can be called for the further data processing constantly. This schema is sent along with the data exchange via RPC calls.

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

avro kafka

When you let Avro manage your schema registration, it provides you with comprehensive, flexible and automatic schema development. This means that you can add additional fields and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schema is backward and forward compatible. This means that the schema versions of the Reader and Writer can differ. Schema registration management solutions exist, with Google Protocol Buffers and Apache Thrift, among others. However, the JSON data structure makes Avro the most popular choice.

IaaS vs PaaS vs SaaS – What are the differences?

IaaS vs PaaS vs SaaS – terms that categorize clouds, but what exactly do they mean? In this article, we contrast all three and explain the differences.

In almost all areas, the cloud is becoming more and more important. Increasingly, the cloud is also becoming interesting for business processes. Everyone is talking about it, but what is it actually?

What is the cloud anyway?

The cloud basically means the use of different servers. This means that your data can be hosted online, i.e. stored, managed and processed.
So you don’t have to provide the appropriate hardware on site, but can rent these resources from a cloud provider. Read our article about the cloud computing provider AWS.
Besides Amazon, other global players such as Google (Google Cloud) and Microsoft (Azure) also offer profitable cloud resources.
But which ones are suitable for me or my company? To meaningfully compare the individual solutions, you need to understand the differences between them.
Basically, you need to distinguish between the three categories already mentioned.

IaaS vs PaaS vs SaaS - Diese Abbildung zeigt die Die 3 Cloud Kategorien
IaaS vs PaaS vs SaaS

IaaS vs PaaS vs SaaS – What are the Differences?

First and foremost, all three terms are used to describe a resource provided by a cloud service provider for a short period of time.
The following figure shows this “as-a-service”, or Flexible consumption model, and the management components..

IaaS vs PaaS vs SaaS - This diagram shows the distribution of tasks between providers and customers in the individual cloud categories depending on the service layer model.
Red: managed by others; Green: managed by your organization

You can see very clearly here that the cloud provider manages more and more layers, ascending from IaaS to SaaS.

Software as a Service (SaaS)

The abbreviation SaaS refers to cloud-based software. This is hosted online by a company and provided via the Internet. It is easy to use and manage. Additionally, it is highly scalable, meaning it can be used for an entire organization.

Platform as a Service (PaaS)

PaaS is used to describe a cloud-based platform service. This offers developers an online platform for application development. Data is provided, stored and managed online.s

Infrastructure as a Service (IaaS)

IaaS refers to cloud-based infrastructure resources provided via virtualization technologies. These services are designed to help companies build and manage their servers, networks, operating systems and data storage. This is where the highest administrative share lies with the customer. Access to the servers for data management takes place via a dashboard or API.

IaaS vs PaaS vs SaaS – For whom is which category suitable?

So who should choose which service model? The following figure shows that the more tasks are taken over by the provider, the more control is relinquished. This is especially detrimental in organizations where a lot of control is needed.

IaaS vs PaaS vs SaaS - Presentation of the individual services depending on the control and for whom they are suitable.
Services depending on the control

IaaS gives administrators more direct needed, control over operating systems. However, more control always comes with more complicated administration tasks. PaaS therefore offers users a certain compromise between flexibility and ease of use. This model is particularly appealing to developers.
The SaaS model offers the highest level of usability and is accordingly interesting for customers who want to take over no to few administrative tasks.

IaaS vs PaaS vs SaaS – Technology of the future?

Cloud resources can be a valuable alternative to expensive, in-house hardware solutions. Of course, with external administration, a company loses control over its own data. However, the different types of service mean that compromises can be made that are tailored to the company’s own needs.

The advantages are obvious. Individual services can be accessed from virtually anywhere at any time, and high-performance computing can be operated cost-effectively. As network technologies become faster and faster, these solutions are increasingly coming into focus and will certainly become more and more important for companies and private individuals in the coming years.

Apache Flink

Overview

– Open source stream processor framework developed by the Apache Software Foundation (2016)
– Data streams with high data volume can be processed and analyzed with low delay and high speed

flink analytics
Flink provides various tools for efficient real-time processing of continuous data streams and batch data

Core functions

– diverse, specialized APIs:
→ DataStream API (Stream Processing)
→ ProcessFunctions (control of states and time; event states can be saved and timers can be added for future calculations)
→ Table API
→ SQL API
→ provides a rich set of connectors to various storage systems such as Kafka, Kinesis, Kubernetes, YARN, HDFS, Elasticsearch, and JDBC database systems
→ REST API

Stream Processing

pexels pixabay 2438
How to handle this flood of data?

== Data is processed continuously with a short delay
→ without intermediate storage of the data in separate databases
– several data streams can be processed in parallel
– Each stream can be used to derive own follow-up actions and analyses

Architecture

Data can be processed as unbounded or bounded streams:

  • Unbounded stream

    • have a start but no defined end

    • must be continuously processed

  • Bounded stream

    • have a defined start and end

    • can be processed by ingesting all data before performing any computations(== batch processing)

– Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.

In case of a failure, Flink replaces the failed container by requesting new resources.

– Stateful Flink applications are optimized for local state access