Apache Hive Architecture – Data Warehouse System for free

On the way to Industry 4.0, companies are trying to record as many business processes as possible in order to optimize them through subsequent analysis.
Data warehouse systems provide central data management, so that only a single version of the truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project of the Apache Software Foundation and is open source and free to use. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

Figure: Apache Hive Architecture – Built on top of Hadoop

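The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of large data sets stored in Hadoop. Hive does not use the schema-on-write (SoW) approach of relational databases, in which data is checked against the schema when it is written. Instead, it uses the so-called schema-on-read (SoR) approach, in which the schema is applied only when the data is queried.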
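The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Hive code: the raw content, the column names and the `read_with_schema` helper are hypothetical.

```python
import csv
import io

# Raw data as it might sit in a file: no types, no column names.
RAW = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12.0\n"

# The schema lives apart from the data and is applied only when reading.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw, schema):
    """Parse each raw line against the schema at read time.

    A value that does not fit its column type becomes None at read time;
    nothing was rejected when the data was written.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
```

The malformed second line is stored without complaint and only surfaces as a missing value when it is read, which is the trade-off this approach makes.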

What are the biggest advantages of Hive?

HiveQL queries are automatically translated into MapReduce, Tez or Spark jobs. Hadoop clusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

Figure: Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL, a declarative SQL-like language.
Hive allows multiple users to query data simultaneously. It supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.
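As a rough illustration of the MapReduce model, the following Python sketch counts records per key in three steps, roughly what a `GROUP BY` with `COUNT(*)` boils down to. It runs in a single process; the input records and function names are hypothetical and do not use any Hadoop API.

```python
from collections import defaultdict

# Hypothetical input records: (page, user) click events.
records = [("home", "u1"), ("cart", "u2"), ("home", "u3"), ("home", "u1")]


def map_phase(record):
    """Emit a (key, 1) pair per record, like a mapper would."""
    page, _user = record
    yield page, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the map and reduce calls run in parallel on different machines; the structure of the computation stays the same.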

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata and processes the data via Hadoop.

Figure: Apache Hive Architecture (the basic three-part core architecture)

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. It receives HiveQL commands and is responsible for executing them against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into an abstract syntax tree and executable tasks, an optimizer that aggregates, splits, and optimizes these tasks for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the command line interface (CLI) and the user interface (UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions, is managed by the metastore.

Hive Storage and Computing

The metadata is stored here in a persistent store. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.

4 Index Data Structures a Data Engineer Must Know

In this article we will explain what index data structures are and introduce you to some popular structures.

In today’s world, ever-increasing amounts of data are being processed. The data can be used to derive business strategies in a commercial context, but also to gain valuable information about all scientific disciplines. The data obtained must be saved, ideally as raw data, and stored for future analysis.

At the time of creation, it is not yet possible to estimate what information might be valuable at some point. So any reduction in data ultimately represents a loss. Huge amounts of data accumulate every second, and managing them is an immense task for today’s hardware and software. Mathematical tricks have to be used to optimize search mechanisms and storage functions.

Index data structures give you much faster access to the data you are looking for in a large data collection. Instead of scanning the data set sequentially, the search uses a so-called index data structure to locate a specific data record based on a search criterion.

What are Index Data Structures in Databases?

You have probably heard about indexing in connection with databases. Here, too, an index structure is formed, independent of the data structure, which accelerates the search for certain fields. This structure consists of references, which define an order relation to the table columns. Based on these pointers, the database management system can then find the data using a search algorithm.

Figure: Index Data Structures in databases
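The difference between a sequential search and a search via an index can be sketched as follows. The table contents and function names are hypothetical; the index here is simply a sorted list of key and row position pairs that is searched with a binary search.

```python
import bisect

# Hypothetical table: a list of rows in arbitrary (insertion) order.
table = [
    {"id": 42, "name": "carol"},
    {"id": 7, "name": "alice"},
    {"id": 19, "name": "bob"},
    {"id": 88, "name": "dave"},
]

# The index is kept apart from the table: sorted (key, row position) pairs.
index = sorted((row["id"], pos) for pos, row in enumerate(table))
index_keys = [key for key, _ in index]


def sequential_search(key):
    """Check every row until the key is found: O(n)."""
    for row in table:
        if row["id"] == key:
            return row
    return None


def indexed_search(key):
    """Binary search in the sorted index, then follow the reference: O(log n)."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return table[index[i][1]]
    return None
```

Both functions return the same row; the index only pays off because it has to be built once and can then be reused for many lookups.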

However, indexing is a very complex scientific field. Queries are constantly being made more efficient and optimized. Thus, the approaches are diverse and very mathematical. This article will give you an overview of popular index data structures and help you to optimize your data pipelines.

Index data structure types

There are many different indexing methods. They are all based on different mathematical assumptions. You should understand these assumptions and choose a suitable system according to your data properties.
In the following scheme you can see some structure types you have to distinguish between, depending on the data you want to index.

Figure: Index data structure types

The most important distinction, however, is whether you want to index one-dimensional or multidimensional data relationships. This means that you have to differentiate whether there is a common feature or several related but independent features.
In the following figure, we have classified the individual index structures according to their dimension coverage.

Figure: Individual index structures according to their dimension coverage

Which index structure you ultimately choose depends on many factors and should be weighed up well in advance, especially with large data sets.

Popular index data structures you should know

In the following, we will introduce you to some of the most popular indexing methods in detail. Because here, too, the key to success lies in understanding your tools and using them correctly at the right moment.

What is Hashing?

If you want to search for a value in an unsorted array, a linear search is too time-consuming for large data sets.
With the so-called hashing method, a hash value is used to identify an object. It is calculated from the key by a hash function and determines the storage location in an array of indices, the so-called hash table. This means that you use this function to derive a storage location in the table from a key.
The following figure shows the hash function flow.

Figure: Hash function sequence

Important basic assumptions are, however, that the function always returns a number for an object, that two identical objects always have the same number, and that two unequal objects do not necessarily have different numbers, i.e. collisions are possible.
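A minimal sketch of a hash table that tolerates such collisions is shown below. It resolves them by separate chaining, which is one common strategy among several and is chosen here as an assumption; the class and the sample keys are hypothetical.

```python
class HashTable:
    """Minimal hash table with separate chaining (a sketch, not production code)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # The hash function maps the key to a number; modulo picks the slot.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding keys share the bucket

    def get(self, key, default=None):
        for existing, value in self.buckets[self._slot(key)]:
            if existing == key:
                return value
        return default


table = HashTable(size=4)
for n in (1, 5, 9):  # with four slots these keys end up in the same bucket
    table.put(n, f"record-{n}")
```

Even though the three keys collide, each one can still be retrieved, because the lookup compares the stored key and not just the slot number.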

What is a Binary tree?

A so-called binary tree is a data structure in which each element, also called node, has a maximum of two successors. The addresses of the subordinate nodes are kept track of by pointers. It is often used when data is to be stored in RAM.
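The following sketch shows such a node with two pointers. It additionally assumes a search-tree ordering (smaller keys to the left, larger keys to the right), which the definition above does not require but which is what makes the tree useful as an index.

```python
class Node:
    """A tree node with at most two successors, referenced by pointers."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    """Insert a key and return the (possibly new) subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node


def contains(node, key):
    """Walk down from the root, choosing one successor per step."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def in_order(node):
    """Yield keys in ascending order."""
    if node is not None:
        yield from in_order(node.left)
        yield node.key
        yield from in_order(node.right)


root = None
for k in (50, 30, 70, 20, 40):
    root = insert(root, k)
```

Note that this simple version does not rebalance itself, so inserting already sorted keys degrades it to a linked list.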

What is a B-tree?

The B-tree is often used in databases and file systems, i.e. for storage on the hard disk. The tree is sorted and completely balanced. The data is stored sorted by keys. The keys are stored in its internal nodes, but need not be stored in the records at the leaves. CRUD functions run in logarithmic time.


The B-tree is classified into different types according to its properties.
In the B+ tree, only copies of the keys are stored in the internal nodes. The keys are stored with the data in the leaves. To speed up sequential access, these also contain pointers to the next leaf node and are thus concatenated.
In the following scheme you see a basic B+ tree structure.

Figure: Basic B+ tree structure
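The role of the chained leaves can be illustrated with a strongly simplified B+ tree. The sketch below bulk-loads sorted data into leaves and uses a single internal node; node splitting on insert and multiple internal levels are deliberately left out, and all names are hypothetical.

```python
import bisect


class Leaf:
    """Leaf node: holds keys with their data and a pointer to the next leaf."""

    def __init__(self, items):
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]
        self.next = None


def build(sorted_items, leaf_size=2):
    """Bulk-load sorted (key, value) pairs into chained leaves plus one
    internal node."""
    leaves = [
        Leaf(sorted_items[i:i + leaf_size])
        for i in range(0, len(sorted_items), leaf_size)
    ]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # The internal node stores only copies of keys as separators.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return separators, leaves


def find_leaf(separators, leaves, key):
    """Use the separator keys to pick the leaf that may contain the key."""
    return leaves[bisect.bisect_right(separators, key)]


def range_scan(separators, leaves, low, high):
    """Descend once, then follow the leaf chain for sequential access."""
    leaf = find_leaf(separators, leaves, low)
    result = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return result
            if k >= low:
                result.append((k, v))
        leaf = leaf.next
    return result


items = [(k, f"row-{k}") for k in (3, 8, 12, 15, 21, 30)]
separators, leaves = build(items)
```

A range query only needs one descent from the internal node; after that it walks from leaf to leaf without going back up the tree.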

The B* tree is an index structure where non-root nodes must be at least 2/3 filled. This is achieved by a modified split strategy.
In addition to indexing, partitioning also offers you the possibility of strongly optimizing the data search within a database. In this article we introduce you to this technique.

What is a SkipList?

The SkipList resembles in its structure a linked list consisting of containers, which contain the data with a unique key and a pointer to the following container. In a SkipList, however, the containers have different heights and can contain pointers to containers that do not follow directly. The idea is to speed up the search by additional pointers.

schematic representation of an index structure of the SkipList
Schematic representation of a SkipList

Calculation of the container height

All nodes have pointers on different levels, which make it possible to skip keys. The height of each list element is either assigned regularly or chosen at random according to a probability rule. With regular heights, search performance depends on how the list was built; with random heights, the taller elements are spread evenly at random over the list.
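A minimal SkipList with randomly chosen container heights could look like the sketch below. The maximum height of 8 and the probability of 1/2 per additional level are assumptions made for this example.

```python
import random


class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height  # one pointer per level


class SkipList:
    """Skip list with randomly chosen container heights (a sketch)."""

    MAX_HEIGHT = 8

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = SkipNode(None, self.MAX_HEIGHT)

    def _random_height(self):
        # Each extra level is added with probability 1/2.
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        """Return, per level, the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        # Start at the top level and drop down when the next key is too large.
        for level in reversed(range(self.MAX_HEIGHT)):
            while (node.forward[level] is not None
                   and node.forward[level].key < key):
                node = node.forward[level]
            update[level] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        new = SkipNode(key, self._random_height())
        for level in range(len(new.forward)):
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def contains(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node is not None and node.key == key

    def keys(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


skip_list = SkipList(seed=1)
for k in (30, 10, 20, 40):
    skip_list.insert(k)
```

The lowest level is an ordinary sorted linked list; the higher levels only serve as shortcuts, so the result of a search never depends on the random heights, only its speed does.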

When to use NoSQL vs SQL and why?

When to use NoSQL vs SQL – In this article we explain the important differences.
With the right choice of storage medium, you can build considerably more performant architectures in times of Big Data. Streaming platforms can now process huge streams of data in real time. But this technology is not a panacea. The database, for example, still occupies an important place in today’s data handling.
Often, however, it is crucial that you choose the right system for your data and in relation to the overall infrastructure.

When to use NoSQL vs SQL – Spoiled for choice

Database vendors abound. Here is just a small selection of popular databases.

Figure: Popular SQL and NoSQL Databases

But before you get into the differences between the databases, you should basically know the differences between the systems.

SQL is relational

Structured Query Language (SQL) databases consist of a fixed defined schema structure. All schemas contain tables with columns. Each table row (tuple) represents a data set (record). In addition, each row consists of a set of attributes (characteristics).

You can use the query language to manipulate and retrieve tables. You can also control the relationships between these structured data formats. Tables in a database can be linked to one another.
These relationships can take many forms. A row can be related to exactly one other row, or to many rows.

Figure: SQL table relationships
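A one-to-many relationship between two tables can be tried out with Python's built-in `sqlite3` module. The `customer` and `purchase` tables and their columns are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces them only on request

# Hypothetical schema: one customer can have many purchases (one-to-many).
conn.executescript("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total_cents INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO purchase (customer_id, total_cents) VALUES (?, ?)",
    [(1, 1999), (1, 500)],
)

# The query language follows the relationship with a join.
rows = conn.execute("""
    SELECT c.name, COUNT(p.id), SUM(p.total_cents)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.id
""").fetchall()

# The fixed schema rejects a purchase that points to a missing customer.
try:
    conn.execute(
        "INSERT INTO purchase (customer_id, total_cents) VALUES (99, 100)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The foreign key is what turns two independent tables into related ones: the database itself guarantees that every purchase belongs to an existing customer.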

NoSQL is not relational

Not only SQL (NoSQL) databases allow you to store and retrieve unstructured data using a dynamic schema. For example, your data is stored in the form of n collections, each containing m documents. Other forms are key-value stores or graph databases. Thus, there is no single standardized query language here.
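What a dynamic schema means in practice can be sketched with plain Python dictionaries standing in for documents. The collection, its fields and the `find` helper are hypothetical and do not correspond to the API of any particular NoSQL product.

```python
# A hypothetical in-memory "collection": a list of documents (dicts).
users = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["admin", "beta"]},        # no email
    {"_id": 3, "name": "Carol", "address": {"city": "Berlin"}},  # nested
]


def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria.

    Documents lacking a field simply do not match; no schema is enforced.
    """
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# A new field can be added to one document without altering the others.
users[0]["newsletter"] = True
```

Each document carries its own structure, so the application, not the database, has to cope with fields that are missing or shaped differently.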

When to use NoSQL vs SQL – Both in direct comparison


The term NoSQL has existed since 1998, so these systems are relatively young compared to SQL, which was developed in the 1970s. Besides the actual structure, databases of both categories differ in the way they scale. In contrast to NoSQL databases, which are designed to scale horizontally across many machines, SQL databases are typically scaled vertically.
Furthermore, it is important for you to know that an SQL database isolates concurrent transactions, so a read may have to wait until a running write has been committed. In many NoSQL databases, you simply read whatever data is available at that moment.

Figure: SQL vs NoSQL

When to use NoSQL vs SQL

Which one suits me?


As you might have guessed, the answer here is: it depends! The differences are there and can have an important impact on the performance of your services. So the choice always depends on the application purpose. Especially for Big Data use cases you should choose a NoSQL database, because here you don’t have to wait for the transaction to complete. Where you need high flexibility, due to frequently changing data structures, or real-time processing, you should also go for NoSQL databases. However, if you want ACID guarantees, you will have to go for an SQL solution. It is important for you to understand that both systems coexist, complement each other and do not replace each other.

If you want to know how to partition a database, check out this article.

Array vs Object – The creation of a JSON structure follows some rules you should know

Array vs Object – JSON is one of the most popular data formats. However, the creation of such an object is done according to some rules. These rules depend on the original data type. In this article we will introduce you to the conversion of some JSON data types (Array vs Object).

What is JSON anyway?

With the JavaScript Object Notation, JSON for short, you can structure data compactly and independently of programming languages. The data format is therefore particularly well suited for exchange between your applications, for general data storage (file extension “.json”) and for configuration files. The data is also human-readable and encoded in a standardized text format. The data interchange format is specified in RFC 8259, and the JSON syntax in the standard ECMA-404. Due to its easy integration with JavaScript, you can use it well for transferring data in web applications.

You can best compare the JSON data structure to XML and YAML, only it’s simpler and more compact.

What are the basic rules?

Figure: Simple JSON Object

The JSON text structure is based on JavaScript object syntax, so hierarchical data structures are possible. It contains only properties and no methods. The basis is formed by name-value pairs and ordered lists of values. Objects are enclosed in curly braces, and the whole structure is written as text, i.e. as a string. This is especially advantageous if you want to transfer the data over the network. If you want to access the data, you first have to convert the text into a native JavaScript object.
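The same round trip between text and native objects exists in other languages as well. The sketch below uses Python's standard `json` module instead of JavaScript; the sample values are hypothetical.

```python
import json

# JSON text as it would travel over the network: a plain string.
text = '{"name": "Hive", "released": 2010, "open_source": true}'

# Parsing turns the text into native objects (a dict in Python).
data = json.loads(text)
year = data["released"]

# Serializing turns native objects back into JSON text.
data["tags"] = ["sql", "hadoop"]
encoded = json.dumps(data, sort_keys=True)
```

Before parsing, `text` is just a sequence of characters; only after `json.loads` can individual properties such as `released` be accessed.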

Data Formats – JSON Array vs Object

Basically, you can have different data types included in JSON.

Value:

Your JSON value can take one of the following allowed types.

Figure: JSON value data types

Object:

A JSON object represents the basic form of a JSON text. It can contain any data type that is valid in JSON.

Figure: Creation of a JSON object

Array:

It is also possible to include an array. Arrays can contain objects, strings, numbers, booleans, null and further arrays. You include arrays as shown schematically below, enclosed in a pair of square brackets.

Figure: Creation of a JSON array

In this way, you can further and further nest the individual data types with each other and thus easily create any number of hierarchy levels. For example, object attributes can consist of arrays, or arrays can contain multiple objects.
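The following sketch nests arrays inside objects and objects inside arrays, again using Python's `json` module. The document content and the `depth` helper are hypothetical and only serve to make the hierarchy levels visible.

```python
import json

# Hypothetical document: an object whose attributes hold arrays and objects.
document = {
    "warehouse": "main",
    "tables": [  # an array of objects
        {"name": "orders", "columns": ["id", "total"]},
        {"name": "users", "columns": ["id", "email"], "partitioned": False},
    ],
    "owner": None,  # serialized as null
}

text = json.dumps(document, indent=2)
restored = json.loads(text)


def depth(value):
    """Count the nesting levels of objects and arrays."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0
```

The deepest path in this document runs from the outer object through the `tables` array and one table object down to its `columns` array.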

What is XML format?

XML format is one of the most popular and widely used data formats. Its widespread use is also its most important advantage. XML is interpretable by both humans and machines and is therefore widely used to import and export application data. XML stands for Extensible Markup Language and is a markup language for representing hierarchically structured data in text file format. It was already published in 1998 and is primarily a meta language.

That means that application-specific languages are defined on its basis by structural and content restrictions. Examples are RSS, MathML, GraphML and Scalable Vector Graphics (SVG). Most web browsers are able to display XML documents using their built-in XML parser.

XML Document Structure

An XML document can always be described as the interaction of its main components. In addition to the data itself, these are the layout, i.e. the description of the relationships between individual containers, and the structure.

Figure: XML Document Components

An XML structure can be interpreted as a tree. Thus, each XML document has exactly one root element, with further elements, text or attributes beneath it.

An XML document can have an optional header in addition to the actual data. The XML declaration, which states for example the XML version and the encoding, can be placed here, as can a document type declaration that references an external document type definition (DTD) or contains an internal one.
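A small example makes the tree view concrete. The sketch below parses a hypothetical document with Python's standard `xml.etree.ElementTree` module and reads the root element, its attributes and its child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declaration in the header, then the element tree.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<library name="city">
  <book id="b1"><title>Hive Basics</title></book>
  <book id="b2"><title>XML in Practice</title></book>
</library>"""

# Passing bytes lets the parser apply the encoding named in the declaration.
root = ET.fromstring(xml_text.encode("utf-8"))

# Child elements are reached from the root; attributes belong to an element.
titles = [book.findtext("title") for book in root.findall("book")]
ids = [book.get("id") for book in root]
```

Here `library` is the root element, `name` and `id` are attributes, and the titles are text contained in child elements.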

Classification of the XML format

The XML format can be further classified. Which class is appropriate is determined by the use case. Mainly we distinguish between document-centric and data-centric XML. The document-centric XML format is based on a text document and is difficult to process by machine due to its weak structure. In data-centric XML, the schema describes entities of a data model and their relationships. This format is optimized for efficient processing by machines. The semi-structured format represents a hybrid of both.

Processing

The XML format allows both sequential and random access. Sequential access can be done either by a “push”, where the program flow is controlled by the parser, or by a “pull”, where the flow is implemented in the code that calls the parser.
Management of the tree structure can be hierarchical as well as nested.
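Both sequential styles are available in Python's standard library, which the following sketch uses for illustration: `xml.sax` works in push style, `xml.etree.ElementTree.XMLPullParser` in pull style. The sample document and the `LevelCollector` class are hypothetical.

```python
import xml.sax
import xml.etree.ElementTree as ET

XML = b"<log><entry level='info'/><entry level='error'/></log>"


class LevelCollector(xml.sax.ContentHandler):
    """Push style: the parser drives and calls back into this handler."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def startElement(self, name, attrs):
        if name == "entry":
            self.levels.append(attrs.get("level"))


handler = LevelCollector()
xml.sax.parseString(XML, handler)
pushed = handler.levels

# Pull style: the calling code asks the parser for the next events.
parser = ET.XMLPullParser(events=("start",))
parser.feed(XML)
parser.close()
pulled = [
    element.get("level")
    for _event, element in parser.read_events()
    if element.tag == "entry"
]
```

Both variants read the document from front to back and produce the same result; they differ only in who controls the loop.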

XML-Schema vs. Database-Schema

Besides XML, JSON is also a very popular data format. In this article we have recorded the most important information about this format.

Another large field of computer languages, i.e. formal languages developed for interaction between humans and computers, is occupied by database languages. They describe the structure of a database. Here, too, the data is organized according to a schema.

If you want to know more about database language, read our article on SQL and NoSQL. Here we explain the most important differences.


But how does this schema differ from an XML schema?

Figure: Differences between an XML schema and a database schema

XML contains nested elements with an unlimited nesting depth. To transfer this nesting to a database schema, the nested elements must be decomposed and linked by foreign key relationships.
In XML format, the elements within an element can be repeated as often as desired. Elements of a given type do not always have to contain the same child elements. However, the order of elements is an integral part of the document structure.
In a database schema, each column is always present only once and contains simple values. Therefore, if multiple elements are to be stored, another table must then be created. The order in which the values are stored is not important, unlike in XML.
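The decomposition described above can be sketched with Python's `xml.etree.ElementTree` and `sqlite3` modules. The document, the table names and the extra `position` column, which preserves the element order explicitly, are assumptions made for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical nested document: each order repeats <item> any number of times.
root = ET.fromstring("""
<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2"><item>paper</item></order>
</orders>
""")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE item (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        position INTEGER NOT NULL,  -- keeps the document order explicitly
        name     TEXT NOT NULL
    );
""")

for order in root.findall("order"):
    order_id = int(order.get("id"))
    conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    # Repeated child elements go to a second table, linked by foreign key.
    for position, item in enumerate(order.findall("item")):
        conn.execute(
            "INSERT INTO item (order_id, position, name) VALUES (?, ?, ?)",
            (order_id, position, item.text),
        )

items = conn.execute(
    "SELECT order_id, name FROM item ORDER BY order_id, position"
).fetchall()
```

Because a table has no inherent row order, the order of the `item` elements survives only if it is stored as data, here in the `position` column.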

Data Warehouse Types and How they work

Data warehouse systems offer a way to create data truth in a company. In such an information system, data is not only stored and sorted, but also cleansed and analyzed.
If you haven’t heard of these information systems, check out our article on the subject. Here we explain the features of such a system and how to provide it with data.
All systems follow the same basic structure, which we explain in this article, but can consist of different components. Accordingly, they are typified. This article is about this classification. In the following, we will introduce you to the functionalities of the most popular data warehouse systems.

Host-Based mainframe warehouse

The Host-Based mainframe warehouse resides on a large-volume database.
In addition to this database, metadata is managed in a central metadata repository. Within this metadata, for example, the information for the documentation of data sources or data translation rules are stored.

Figure: Data Warehouse Types – Host-Based mainframe warehouse principle


In general, three phases run in this information system.
Selection and scrubbing take place in the Unload phase. That is, the appropriate data types and data sources are determined here, and errors in the data are subsequently corrected.
In the following Transform phase, the data is translated into a suitable form. The rules for access and storage are also specified here.
In the final Load phase, the preprocessed data set is moved into tables.
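The three phases can be sketched as three small functions. The source records and field names are hypothetical, and the scrubbing step here simply drops faulty records instead of correcting them, which is a simplification.

```python
# Hypothetical source records; field names are illustrative only.
SOURCE = [
    {"id": "1", "amount": "19.90", "country": "de"},
    {"id": "2", "amount": "n/a", "country": "fr"},  # faulty amount
    {"id": "3", "amount": "5.00", "country": " DE "},
]


def unload(source):
    """Unload phase: select the relevant records and scrub out faulty ones."""
    clean = []
    for record in source:
        try:
            float(record["amount"])
        except ValueError:
            continue
        clean.append(record)
    return clean


def transform(records):
    """Transform phase: translate values into the target form."""
    return [
        (
            int(r["id"]),
            round(float(r["amount"]) * 100),  # amount in cents
            r["country"].strip().upper(),
        )
        for r in records
    ]


def load(rows, table):
    """Load phase: move the preprocessed rows into the target table."""
    table.extend(rows)
    return table


sales_table = load(transform(unload(SOURCE)), [])
```

Keeping the phases separate means that a new source only requires a new unload step, while the transformation rules and the target tables stay the same.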

Host-Based LAN data warehouse

• extracts information from a variety of sources and supports multiple LAN-based warehouses
• data delivery can be handled either centrally or from the workgroup environment
• size depends on the platform

Figure: Host-Based LAN data warehouses

Multi-Stage Data Warehouses

• data is staged multiple times before the loading operation into the data warehouse and is finally passed on to departmentalized data marts

Figure: Multi-Stage Data Warehouses

Stationary Data Warehouse

• data is not changed from the sources
• customer is given direct access to the data

Figure: Stationary Data Warehouse

Distributed Data Warehouses

• there are two types of distributed data warehouses, along with variations of each: local enterprise warehouses, which are distributed throughout the enterprise, and a global warehouse
• activity appears at the local level
• the bulk of the operational processing takes place at the local level
• the local site is autonomous
• local warehouses also include historical data and are integrated only within the local site

Figure: Distributed Data Warehouses

Virtual Data Warehouses

Created in the following steps

• Installing a set of data access, data dictionary, and process management facilities
• Training end users
• Monitoring how the data warehouse facilities are used
• Based upon actual usage, creating a physical data warehouse to provide the high-frequency results

Four kinds of data need to be defined

• A data dictionary including the definitions of the various databases
• A description of the relationship between the data components
• A description of how the user will interface with the system
• The algorithms and business rules that describe what to do and how to do it

What is Data Warehousing?

Here you can find out everything about the Three-Tier Architecture

What is a Three Tier Data Warehouse Architecture?

Three Tier Data Warehouse Architecture – In this article we will introduce you to the most common data warehouse architecture.

Nowadays, business processes are increasingly supported by digital assistance systems and recorded for further analysis and optimization.
This generates a lot of structured, unstructured and semi-structured data from many different sources.

What is a Data Warehouse System?

In order to create a unified view of the data for improved BI and thus enable comprehensive evaluations, all information from the diverse data sets must be centralized.

This integration is the first basic function of a data warehouse system. However, this information system also assumes the task of data separation. In this way, data that is used for operational business, i.e. that is regularly queried, can be separated from data that is only used for analyzing business processes in controlling.
You can read more about the different types of data warehouses here.

Data centralization ensures that there is only one version of the truth for a company to use for decision making and forecasting.

What does the Typical Data Warehousing Architecture look like?

The complexity of this system grows with the complexity of the business. Many distinct data sources, i.e. business processes, provide cumulative and historical data. Therefore, basic approaches have been defined according to which every data warehouse system should be structured: single-tier, two-tier and three-tier.

2 Tier vs Three Tier Data Warehouse Architecture

In the following we will work out the three tier architecture.
This structure, the most commonly used one, fully decouples the data from the user interface by moving the application logic to a middle tier.
In two-tier, the application logic resides either in the user interface on the client or in the database on the server.
Thus, without a middle tier, this system is less scalable and less flexible. Integration of other data sources is more difficult here.

Three Tier Data Warehouse Architecture

The Three Tier Data Warehouse Architecture is the design on the basis of which a data warehouse with three tiers is then built. The figure below shows this structure with common components.

Figure: Three Tier Data Warehouse Architecture

However, the individual components can vary and depend on the project framework. As a rule, however, these changes do not alter the basic structure.

Bottom Tier

The lowest layer is persistence, which is usually located on a server. The data from various data sources is prepared and stored here using an ETL (extract, transform and load) process. Tools and other external resources can be used to feed the data.
This persistence can consist of a relational but also a multidimensional database system.

Middle Tier

One or more OLAP (Online Analytical Processing) servers reside in the middle data warehouse layer. This technology can be used to create complex budget plans and perform analyses cost-effectively. So in the three tier data warehouse architecture, jobs are generated in the top tier and sent to this middle tier. Here, the data in the bottom tier is then accessed and analyses are performed. The result is then sent to the top tier and thus made available to the user, and/or forwarded to the bottom tier for storage of the analysis results in persistence.

What is an OLAP Server?

Basically, three OLAP server models are distinguished.
In Relational OLAP (ROLAP) the operations on multidimensional data are based on standard relational operations. The Multidimensional OLAP (MOLAP) directly implements the multidimensional data operations. A mixture of relational and multidimensional processing can be handled by Hybrid OLAP (HOLAP).
The choice of the server model always depends on the data composition in the lowest layer.
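The difference between the relational and the multidimensional approach can be sketched in a few lines. The fact rows are hypothetical; the ROLAP part uses Python's `sqlite3` module, and the MOLAP part uses a plain dictionary as a stand-in for a precomputed cube.

```python
import sqlite3
from collections import defaultdict

# Hypothetical fact rows: (region, year, revenue).
FACTS = [
    ("north", 2023, 100), ("north", 2024, 150),
    ("south", 2023, 80), ("south", 2024, 120),
]

# ROLAP style: the multidimensional question becomes a relational query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", FACTS)
rolap = dict(conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region"
).fetchall())

# MOLAP style: cells are aggregated ahead of time in a multidimensional
# structure, so a query is a direct lookup. "*" stands for "all values".
cube = defaultdict(int)
for region, year, revenue in FACTS:
    for cell in ((region, year), (region, "*"), ("*", year), ("*", "*")):
        cube[cell] += revenue

molap = {region: cube[(region, "*")] for region in ("north", "south")}
```

Both approaches answer the same question. The relational variant computes the aggregate at query time, while the cube trades storage and load time for fast lookups.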

Top Tier

The top tier is the top of the three tier data warehouse architecture, the front-end client layer. It contains query and reporting tools, analysis tools, and data mining tools, thus providing the interface to the user. Here users can generate analyses and take a look at the data.