NumPy vs Pandas – Which is used When?

March 13, 2021 / RainerGewalt

In every branch of science and business, ever larger amounts of data accumulate that must be analyzed and managed efficiently. As a result, learning a programming language has become indispensable across disciplines.

For many, Python is their first programming language in the classical sense, thanks to its beginner friendliness and its strong support for mathematics. Through its modular libraries, Python gives access to ready-made, optimized computational tools.

Figure: Popular Python libraries and their place in the Python ecosystem.

However, this range of options can also quickly become overwhelming. Which library or framework is suitable for my purposes? Will this tool save me work, or will I reach its limits? In other articles we introduce SciPy and explain why you should prefer it over MATLAB, and we compare the two Python visualization libraries matplotlib and seaborn. These Python libraries are compatible with each other and together make a very capable data science toolkit. NumPy and Pandas are perhaps two of the best-known Python libraries. But what are the differences between them? We will get to the bottom of this question in this article.

What actually is NumPy?

NumPy stands for “Numerical Python” and is an open source Python library for array-based calculations. It was first released in 1995 as Numeric, making it the first implementation of a Python matrix package, and rereleased as NumPy in 2006. This library is intended to allow easy handling of vectors, matrices, or large multidimensional arrays in general.

Figure: NumPy's major applications.

For performance reasons, its core is written in C, a low-level, machine-oriented programming language. NumPy is compatible with a wide variety of Python libraries, some of which are built on top of it and add further useful functions such as minimization, regression, and Fourier transforms.

Python and Science

As mentioned earlier, Python is one of the most intensively used programming languages for data processing and analysis in scientific research across disciplines. Interestingly, the solution approaches at the data level are similar from one discipline to the next. An exchange of ideas has therefore become indispensable and is increasingly bringing the sciences together.

We mention this only in passing, but it underlines the importance of this programming language and its libraries, many of which are open source and developed by a community.

Figure: Scientific computing with NumPy across scientific disciplines.

NumPy was developed specifically for scientific calculations and forms the basis for many specific frameworks and libraries.

The elementary NumPy data structure

The core functionality of NumPy is based on the “ndarray” data structure.

Figure: NumPy's fundamental data structure.

Such an array can only hold elements of the same data type. It always consists of a pointer to a contiguous memory area together with the metadata describing the data stored there. This allows the data to be accessed very efficiently and manipulated as desired.
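
The metadata mentioned above can be inspected directly on an array. The following is a minimal sketch; the variable names and sample values are chosen purely for illustration.

```python
import numpy as np

# A 2 x 3 array of 64-bit integers; every element shares one dtype.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(a.dtype)                   # element type shared by all entries
print(a.shape)                   # extent along each dimension
print(a.strides)                 # bytes to step per dimension
print(a.flags["C_CONTIGUOUS"])   # True: one contiguous, row-major block

# Mixing types forces NumPy to pick one common dtype for all elements.
mixed = np.array([1, 2.5])
print(mixed.dtype)
```

The strides tell NumPy how many bytes to skip to reach the next row or the next column, which is why no per-element pointers are needed.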

Figure: How NumPy's fundamental data structure can be manipulated.

For example, the shape can be changed via so-called reshaping, smaller subarrays can be selected from a larger array, and arrays can be split or merged.
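
These manipulations could look roughly as follows. This is only a sketch with made-up sample data; NumPy offers several other functions for splitting and joining arrays.

```python
import numpy as np

a = np.arange(12)                 # 0, 1, ..., 11

grid = a.reshape(3, 4)            # same data viewed as 3 rows x 4 columns
sub = grid[1:, :2]                # subarray: rows 1-2, columns 0-1
left, right = np.hsplit(grid, 2)  # split into two 3 x 2 halves
merged = np.concatenate([left, right], axis=1)  # join them again

print(sub)
print(np.array_equal(merged, grid))
```

Note that `sub` is a view: it refers to the same memory as `a` rather than holding a copy.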

What is Pandas?

Pandas is an open source library for data analysis and manipulation in Python. Its development was started in 2008 by Wes McKinney, and it is written in Python, Cython, and C. Pandas is used in almost all fields and is popular across industries worldwide.

Figure: Pandas' major applications.

The name Pandas is derived from Panel Data.
Its strength lies in the processing and analysis of tabular data and time series.

Figure: Pandas' major features.

Pandas offers many operations, especially for data preprocessing. In addition to high-performance filter functions, it can transform, manipulate, aggregate, and clean very large data volumes with over 500 thousand rows.
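
A typical preprocessing chain might be sketched like this. The data set, the column names, and the threshold are hypothetical and only serve to show one cleaning, filtering, transforming, and aggregating step each.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "city": ["Berlin", "Berlin", "Munich", "Munich", "Hamburg"],
    "temp_c": [21.0, np.nan, 25.0, 23.0, 19.0],
})

cleaned = df.dropna(subset=["temp_c"])                    # clean: drop missing values
warm = cleaned[cleaned["temp_c"] > 20]                    # filter rows
warm = warm.assign(temp_f=warm["temp_c"] * 9 / 5 + 32)    # transform: add a column
mean_by_city = warm.groupby("city")["temp_c"].mean()      # aggregate per group

print(mean_by_city)
```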

Pandas fundamental data structures

As a basis for the individual functions and tools that Pandas provides, the library defines its own data objects. These objects are one- or two-dimensional; older Pandas versions also offered a three-dimensional object.

The one-dimensional Series object can hold data of any type and, unlike a plain NumPy ndarray, carries a label for every element. It corresponds to a data structure with two arrays: one array as the index and one array holding the actual data.
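
The two arrays behind a Series can be made visible with a small sketch like the one below; the labels and values are arbitrary examples.

```python
import pandas as pd

s = pd.Series([10.5, 20.0, 30.25], index=["a", "b", "c"])

print(s.index)        # the label array
print(s.to_numpy())   # the data array
print(s["b"])         # label-based lookup
```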

The two-dimensional DataFrame object contains an ordered collection of columns. Each column can have a different data type, and each value is uniquely addressed by a row index and a column index.
The eponymous Panel object was a three-dimensional dataset consisting of DataFrames. Its major axis corresponded to the index rows of each DataFrame and its minor axis to the columns of each DataFrame. Panel was deprecated in Pandas 0.20 and removed in version 0.25; three-dimensional data is now typically represented by a DataFrame with a MultiIndex.
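
To illustrate the DataFrame described above, here is a minimal sketch with invented records. It shows that each column keeps its own data type and that a single value is addressed by a row label and a column label.

```python
import pandas as pd

# Hypothetical records; names and values are made up for illustration.
df = pd.DataFrame(
    {"name": ["Ada", "Linus"], "age": [36, 28], "score": [9.5, 8.0]},
    index=["r1", "r2"],
)

print(df.dtypes)               # one dtype per column
print(df.loc["r2", "score"])   # value addressed by row label and column label
```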

NumPy vs Pandas – Conclusion

The two libraries have similarities, which stem from the fact that Pandas is built on NumPy. But is it an either-or question? Clearly not. Pandas builds on NumPy but adds so many features of its own that both clearly deserve to exist side by side. They simply serve different purposes, and each should be used for what it does best.

One of the main differences between the two open source libraries is the data structure used. Pandas allows analysis and manipulation of data in tabular form, while NumPy works mainly with numerical data in arrays that can have any number of dimensions. The two forms can easily be converted into each other.
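
The conversion in both directions can be sketched as follows; the column names `x` and `y` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

df = pd.DataFrame(arr, columns=["x", "y"])   # ndarray -> DataFrame
back = df.to_numpy()                          # DataFrame -> ndarray

print(type(back).__name__, back.shape)
print(np.array_equal(arr, back))
```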

Pandas is often reported to perform better with very large data sets (500K rows and more). Data preprocessing and reading from external data sources are also easier to perform with Pandas; the result can then be passed as a NumPy array into complex machine learning or deep learning algorithms. If you want to know more about machine learning methods and their fields of application, take a look at our article on the topic.

4 Index Data Structures a Data Engineer Must Know

January 2, 2021 / RainerGewalt

In this article we will explain what index data structures are and introduce you to some popular structures.

In today’s world, ever-increasing amounts of data are being processed. The data can be used to derive business strategies in a commercial context, but also to gain valuable insights in all scientific disciplines. The data obtained must be saved, ideally as raw data, and stored for future analysis.

At the time of creation, it is not yet possible to estimate what information might be valuable at some point. So any reduction in data ultimately represents a loss. Huge amounts of data accumulate every second, and managing them is an immense task for today’s hardware and software. Mathematical tricks have to be used to optimize search mechanisms and storage functions.

Index data structures let you find data in a large collection far faster. Instead of scanning the records sequentially, the search uses a so-called index data structure to locate a specific record directly, based on a search criterion.
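
The difference between a sequential search and an index lookup can be sketched in a few lines. The records and the helper name `sequential_search` are hypothetical, and a plain Python dictionary stands in for the index.

```python
# Hypothetical records: (customer_id, name) pairs in no particular order.
records = [(42, "Ada"), (7, "Linus"), (19, "Grace"), (3, "Guido")]


def sequential_search(rows, key):
    """Check every record in turn until the key matches: O(n)."""
    for position, (row_key, _) in enumerate(rows):
        if row_key == key:
            return position
    return None


# Build the index once: key -> position of the record in the collection.
index = {row_key: position for position, (row_key, _) in enumerate(records)}

position = index.get(19)            # one dictionary lookup, no scan
print(records[position])
print(sequential_search(records, 19) == position)
```

The index costs extra memory and has to be kept up to date when records change, which is the usual trade-off for the faster lookup.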

What are Index Data Structures in Databases?

You have probably heard about indexing in connection with databases. Here, too, an index structure is built separately from the stored data to accelerate searches on certain fields. This structure consists of references (pointers) to the table rows, kept in an order defined by the indexed columns. Based on these pointers, the database management system can then find the data using a search algorithm.
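
As a small, hedged illustration, the sketch below uses SQLite through Python's standard `sqlite3` module. The table `customers`, its columns, and the index name `idx_customers_city` are invented for this example, and other database systems expose their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(i, f"name{i}", f"city{i % 100}") for i in range(10_000)],
)

# The index is a separate, ordered structure that points back at table rows.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# Ask SQLite how it would run the query; the last column describes each step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM customers WHERE city = ?", ("city7",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

matches = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE city = ?", ("city7",)
).fetchone()[0]
print(matches)
conn.close()
```

The printed plan should mention the index name, which indicates that SQLite searches the index instead of scanning the whole table.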

Figure: Schematic representation of index data structures in databases.

However, indexing is a very complex scientific field. Queries are constantly being made more efficient and optimized. Thus, the approaches are diverse and very mathematical. This article will give you an overview of popular index data structures and help you to optimize your data pipelines.

Index data structure types

There are many different indexing methods. They are all based on different mathematical assumptions. You should understand these assumptions and choose a suitable system according to your data properties.
In the following scheme you can see some structure types you have to distinguish between, depending on the data you want to index.

Figure: Index data structure types.

The most important distinction, however, is whether you want to index one-dimensional or multidimensional data relationships. This means that you have to differentiate whether there is a common feature or several related but independent features.
In the following figure, we have classified the individual index structures according to their dimension coverage.

Figure: Index structures classified by their dimension coverage.

Which index structure you ultimately choose depends on many factors and should be weighed up well in advance, especially with large data sets.

Popular index data structures you should know

In the following, we will introduce you to some of the most popular indexing methods in detail. Here, too, the key to success lies in understanding your tools and using them correctly at the right moment.

What is Hashing?

If you want to search for a value in an unsorted array, a linear search is too time consuming.
With the so-called hashing method, a hash value is used to identify an object. A hash function calculates this value from the key, and the value determines the storage location in an array of indices, the so-called hash table. In other words, you use this function to derive a storage location in the table from a key.
The following figure shows the flow of the hash function.

Figure: Schematic representation of the hash function sequence.

The important basic assumptions are that the function always returns a number for an object and that two identical objects always get the same number. Two unequal objects, however, do not always get different numbers; such collisions must be handled by the hash table.
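
A minimal sketch of such a table follows. It relies on Python's built-in `hash` function and resolves collisions by separate chaining, which is only one of several possible strategies and is an assumption of this example; the class name `HashTable` is hypothetical.

```python
class HashTable:
    """Minimal hash table with separate chaining (illustrative only)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # The hash function maps the key to a number; the modulo turns
        # that number into a position in the table.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite existing entry
                return
        bucket.append((key, value))        # new key, possibly a collision

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._slot(key)]:
            if existing_key == key:
                return value
        return default


table = HashTable(size=8)
table.put(1, "one")
table.put(9, "nine")     # 1 and 9 land in the same slot of an 8-slot table
print(table.get(1), table.get(9), table.get(5))
```

Both keys end up in the same bucket, yet each lookup still returns the right value because the bucket stores the full key next to the data.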

What is a Binary tree?

A so-called binary tree is a data structure in which each element, also called a node, has at most two successors (children). Pointers keep track of the addresses of the child nodes. It is often used when data is to be stored in RAM.
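
For indexing, a binary tree is usually kept ordered. The sketch below therefore shows a binary search tree, a common ordered variant that the text above does not name explicitly; the names `Node`, `insert`, and `contains` are chosen for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # reference to the left child, if any
        self.right = None   # reference to the right child, if any


def insert(node, key):
    """Insert key into the subtree rooted at node and return its root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node              # duplicates are ignored


def contains(node, key):
    """Follow one child reference per step until the key is found."""
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


root = None
for k in [50, 30, 70, 20, 40]:
    root = insert(root, k)

print(contains(root, 40), contains(root, 60))
```

This simple version does not rebalance itself, so inserting already sorted keys would degrade it into a linked list.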

What is a B-tree?

The B-tree is often used in databases and file systems, that is, for data stored on disk. The tree is always completely balanced and keeps its data sorted by key. The keys are stored in the internal nodes and need not be repeated in the records at the leaves. CRUD operations run in amortized logarithmic time.

There are several variants of the B-tree, distinguished by their properties.
In the B+ tree, the internal nodes store only copies of the keys. The keys are stored together with the data in the leaves. To speed up sequential access, the leaves also contain pointers to the next leaf node and are thus chained together.
The following scheme shows a basic B+ tree structure.

Figure: Basic B+ tree structure and its components.
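
The two B+ tree properties described here, key copies in the internal node and chained leaves, can be sketched with a heavily simplified structure. The example is static and has only two levels: it is built once from sorted data and supports neither insertion nor node splitting, so it is not a full B+ tree. All names are hypothetical.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys        # sorted keys
        self.values = values    # data stored alongside the keys
        self.next = None        # pointer to the next leaf


def build(items, leaf_size=3):
    """Build a static two-level index from sorted (key, value) pairs."""
    leaves = []
    for start in range(0, len(items), leaf_size):
        chunk = items[start:start + leaf_size]
        leaves.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    for current, following in zip(leaves, leaves[1:]):
        current.next = following             # chain the leaves
    # The internal (root) node only holds copies of keys as separators.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return separators, leaves


def find_leaf(separators, leaves, key):
    return leaves[bisect_right(separators, key)]


def lookup(separators, leaves, key):
    leaf = find_leaf(separators, leaves, key)
    if key in leaf.keys:
        return leaf.values[leaf.keys.index(key)]
    return None


def range_scan(separators, leaves, low, high):
    """Collect all values with low <= key <= high by walking the leaf chain."""
    leaf = find_leaf(separators, leaves, low)
    result = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return result
            if k >= low:
                result.append(v)
        leaf = leaf.next
    return result


items = [(k, f"row{k}") for k in range(1, 10)]   # keys 1..9
separators, leaves = build(items)
print(separators)
print(lookup(separators, leaves, 5))
print(range_scan(separators, leaves, 3, 7))
```

The range scan descends the tree only once and then follows the `next` pointers, which is exactly what the leaf chaining is meant to enable.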

The B* tree is an index structure in which non-root nodes must be at least 2/3 full. This is achieved by a modified split strategy.
In addition to indexing, partitioning also lets you strongly optimize data searches within a database. We introduce this technique in a separate article.

What is a SkipList?

The SkipList resembles in its structure a linked list consisting of containers, which contain the data with a unique key and a pointer to the following container. In a SkipList, however, the containers have different heights and can contain pointers to containers that do not follow directly. The idea is to speed up the search by additional pointers.

Figure: Schematic representation of a SkipList.

Calculation of the container height

Nodes carry pointers on different levels, and the higher-level pointers allow keys to be skipped. The height of a list element is assigned either regularly, following a fixed pattern, or randomly according to a probability rule. With regular heights, search performance depends on how the list was built up; with random heights, the taller elements are spread evenly over the list on average.
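
The randomized variant could be sketched as follows. The coin-flip probability of 1/2 and the maximum height of 8 are assumptions made for this example, and the class and method names are hypothetical.

```python
import random


class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height   # one pointer per level


class SkipList:
    MAX_HEIGHT = 8

    def __init__(self, seed=None):
        self.head = SkipNode(None, self.MAX_HEIGHT)   # sentinel, never matched
        self.rng = random.Random(seed)

    def _random_height(self):
        # Flip a fair coin: each extra level is added with probability 1/2.
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        """For every level, find the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            nxt = node.forward[level]
            while nxt is not None and nxt.key < key:
                node = nxt                      # skip ahead on this level
                nxt = node.forward[level]
            update[level] = node                # then drop down one level
        return update

    def insert(self, key):
        update = self._predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            return                              # key already present
        node = SkipNode(key, self._random_height())
        for level in range(len(node.forward)):
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node

    def contains(self, key):
        candidate = self._predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def keys(self):
        node, out = self.head.forward[0], []
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out


sl = SkipList(seed=0)
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)

print(sl.keys())
print(sl.contains(40), sl.contains(35))
```

Because the heights are drawn at random, the expected search cost does not depend on the order in which the keys were inserted.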