Tag: hadoop

Is Hadoop dead? Should I invest time to learn the Hadoop ecosystem?

February 28, 2021 / RainerGewalt / 5 Comments

Is Hadoop dead – In the IT sector in particular, technologies and software architectures do not have a long shelf life. As new technical insights are gained, the requirements and use cases for the systems also change. As young as the term “big data” is, it is also undergoing constant change. The increased acceptance of open source projects in the business community has led to increased diversification and thus to many mutually beneficial competitive situations.
Apache Hadoop has been considered the one all-purpose solution for over a decade. A Big data ecosystem in which Hadoop plays together with many other extensions. In recent years, however, more and more people are claiming that the demands on data processing have changed and see Hadoop as an outdated concept.

A few years ago, the primary goal was to efficiently handle ever-increasing data volumes, but today iterative real-time analyses on dynamic data sets are required. Data management systems must not be self-contained, but must remain manipulable and monitorable at all times.
So is Hadoop dead, or still indispensable?

What is Hadoop?

Hadoop is a Linux-based open source Big Data framework for scalable, distributed software. It is originally based on Google’s MapReduce algorithm and enables computationally intensive processes of large data sets by parallelizing them on computer clusters, i.e. a large number of networked computers, using multiple components working together.

Is Hadoop dead? This diagram shows the Hadoop ecosystem — Is Hadoop dead? **Hadoop** **ecosystem**

The Hadoop ecosystem is composed of the Hadoop Common, an interface for all other components. It connects Hadoop to the file system of the computers and contains the libraries.In the Hadoop Distributed File System
( HDFS ) very large amounts of data are stored. This is organized as a server cluster with master and slave nodes. The resources are controlled via the Yet Another Resource Negotiator (YARN) component. This resource manager distributes the individual tasks to the available resources, such as CPU and memory.

What is the MapReduce algorithm?

Google’s MapReduce programming model, even though it is currently being replaced by engines based on Directred-Acyclic-Graph (DAG), is still a core component of the Hadoop framework. So if we want to understand how Hadoop works, we first need to understand what MapReduce is in the first place.

Is Hadoop dead? This diagram shows the principle behind Google's MapReduce algorithm — Is Hadoop dead? **Googles Map Reduce Algorithm principle**

Configurable classes for Map, Reduce and Combination phases are provided via the Hadoop MapReduce framework. Map means that a set of data is transformed into another set of data, where the individual elements of the data are combined into tuples (key/value pairs). In the Reduce phase, the formed tuples are then combined into smaller sets of tuples.

How a Hadoop cluster works

As mentioned earlier, Hadoop distributes storage and processing of large amounts of data in a balanced manner across compute clusters, or interconnected hardware.
These computers are connected to a dedicated server that acts as the master
components. The master node organizes the storage of files and the metadata in the individual slave nodes. Within a cluster, data is stored on multiple computers called nodes. The files are partitioned into data blocks and distributed redundantly among the nodes.

Is Hadoop dead? This diagram shows the components of a Hadoop cluster — Is Hadoop dead? **Components of a Hadoop Cluster**

The NameNode and Resource Manager run on the master node. These collect data in the Hadoop Distributed File System (HDFS) and store data with parallel computations by applying MapReduce.

The client nodes are responsible for loading the data into the cluster’s
Architecture. The slave node is one responsible for collecting the data
Client nodes.

How does communication within a cluster work?

The internal communication, i.e. the process of job execution, is organized via so-called JobTrackers and TaskTrackers.
The client submits a MapReduce job to the JobTracker on the master to process a particular file.The JobTracker then determines the DataNodes that store the blocks for that file by querying the NameNode. The NameNode manages the HDFS file system metadata, so it keeps track of all the files that are divided into blocks. The DataNodes store and retrieve these blocks. Then tasks are assigned to different TaskTrackers based on the information received from the NameNode . In the process, the status of each task NameNode and DataNode is monitored.
A secondary NameNode communicates with the NameNode at a periodic interval to take the snapshot of the HDFS metadata. In other words, a backup. This information can then be used in the event of a NameNode failure.

In principle, both single-node clusters and multi-node clusters can be implemented with Hadoop. In the case of a single node, the cluster is implemented on one machine only. All processes then run on a Java virtual machine instance.
In the case of multi-nodes, the master slave architecture already discussed is then implemented over several computers.

Is Hadoop dead?

So is Hadoop dead? Apache Hadoop has clearly lost its status as the sole Big Data solution. Many technologies have already been added that can solve smaller tasks better than the big one solution Hadoop.Today, this small-scale nature enables Big data management solutions that can be optimally tailored to specific use cases. However, Hadoop Hadoop is not dead either. The system still has its strengths and will continue to be the first choice for special use cases in the foreseeable future.

So how is Hadoop evolving?

With the Hadoop Ozone project, an alternative to the Hadoop Distributed File System (HDFS) has now been developed.
It is still to be deployed on a cluster, but corresponds to an object store for Big Data applications. This is much more scalable than than standard file systems and is intended to optimize the handling of small files, a previous Hadoop weakness. Object stores are typically used as a data storage method in the cloud. Through Ozone, they can now be managed locally.
This object store can be accessed by established Big Data solutions such as Hive or Spark without modification.If you want to know more about the hadoop compatible frameworks read our articles on Hive and Spark.

Ozone is built on a block storage layer called Hadoop Distributed Data Store (HDDS) and is designed to scale to billions of objects. The blocks are organized internally using unique namespaces in many independent volumes.
However, one disadvantage of these local object stores is that they are not yet implemented in the core, but must be separated from the traditional file systems by containerized environments such as Kubernetes and YARN. So there are always two truths.

Apache Hive Architecture – Data Warehouse System for free

February 5, 2021 / RainerGewalt / 2 Comments

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management. Thus, only one data truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project and part of Apache, an open source and free web server software. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of unstructured data. Hive does not work with the schema-on-write (SoW) approach like relational databases, but uses the so-called schema-on-read (SoR) approach.

What are the biggest advantages of Hive?

Data from relational databases is automatically converted into MapReduce or Tez or Spark jobs. Hadoopclusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

The figure shows the main Apache Hive features — Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL. A declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

The figure shows the basic three-part core architecture of Apache Hive. — Apache Hive Architecture

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. This
receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into abstract syntax and executable tasks, an optimizer that aggregates, splits, and optimizes for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Computer

The metadata is stored here in a persistence. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

ricardo gomez angel j5gCOKZdm6I unsplash

Apache Avro – Effective Big Data Serialization Solution for Kafka

November 15, 2020 / RainerGewalt / 0 Comments

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.

You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data, cost-efficiently, Avro can now be used Hadoop-independently.

You can access Avro via special API’s with many common programming languages (Java, C#, C, C++, Python and Ruby). So you can implement it very flexible.

In the following figure we have summarized some reasons what makes the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it — Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. About that the data hierarchy, i.e. the metadata, is stored separately in a file. The data types and protocols are defined via a JSON format. These are to be assigned unambiguously by ID to the actual values and can be called for the further data processing constantly. This schema is sent along with the data exchange via RPC calls.

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

When you let Avro manage your schema registration, it provides you with comprehensive, flexible and automatic schema development. This means that you can add additional fields and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schema is backward and forward compatible. This means that the schema versions of the Reader and Writer can differ. Schema registration management solutions exist, with Google Protocol Buffers and Apache Thrift, among others. However, the JSON data structure makes Avro the most popular choice.