Tag: 2020 (Page 1 of 2)

What is data warehousing and does it still make sense?

March 12, 2021 / RainerGewalt / 3 Comments

Data Warehousing – In today’s flood of data, it is becoming increasingly difficult to maintain a clear data management system. More and more data sources are recorded via different software systems.
A unified, centralized system can facilitate analysis and ensure that only one data truth exists in an organization.

What is a Data Warehouse System?

Data warehouse systems are built by integrating data from multiple heterogeneous sources and, in addition to centralization, performs the task of structuring data, supporting analytical reporting and structuring decision-making.
The system can perform data cleansing as well as data integration and data consolidation and does not require transaction processing or recovery.

Data Warehousing - The figure shows all other names that the system has. — Multiple names for Data Warehouse

It is thus a powerful Big Data information system that can centrally handle everything related to data processing.

Data Warehousing Features

Data warehousing offers several features. Such an information system is subject oriented. It does not focus on the current operation, as these data are separated. This means that frequent changes in the operational database are not reflected in the data warehouse. Thus, the focus is on modeling and analysis of data.
The system is Time variant, which means that the collected data are identified with a certain period of time and previous data are not deleted when new data are added.

What does a Data Warehouse structure look like?

The complexity of this system increases exponentially with the complexity of the business. Many distinctive data sources, i.e. business processes, provide commutative and historical data.

This figure shows the main principle behind data warehousing simplified. — Data Warehousing main Principle

Therefore, basic approaches have been defined according to which every data warehouse system should be structured. Single Tier, Two Tier and Three Tier.

2 Tier vs Three Tier Data Warehouse Architecture

In the following we will work out the three tier architecture.
This, the most commonly used, structure is completely decoupled from the data and the user interface by moving the application logic to a middle tier.
In two-tier, the application logic resides either in the user interface on the client or in the database on the server.
Thus, without a middle tier, this system is less scalable and more flexible. Integration of other data sources is more difficult here.

Three Tier Data Warehouse Architecture

The Three Tier Data Warehouse Architecture is the design on the basis of which a data warehouse with three tiers is then built. The figure below shows this structure with common components.

However, the individual components can vary and depend on the project framework. As a rule, however, these changes do not alter the basic structure.

Bottom Tier

The lowest layer is persistence, which is usually located on a server. The data from various data sources is prepared and stored here using an ETL (extract, transform and load) process. Tools and other external resources can be used to feed the data.
This persistence can consist of a relational but also a multidimensional database system.

Load Data into Warehouse

In addition to the different components and architectures, data can also be transmitted to the information system in different ways.

etl elt warehousing 2 — Data Warehousing – ETL vs ELT

As shown in the figure, a basic distinction is made between two elementary processes.

What is ELT?

Extract, Load, and Transform, or ELT for short, is about extracting aggregate information from the source system and loading it into the target method.

The following figure shows such an example system. In this case, the Hadoop framework handles the central data management, while applications and analysis tools access the untransformed data.

warehouse elt 1 — Data Warehousing – Extract, Load, and Transform

What is ETL?

In Extract, Transform and Load, or ETL for short, the data set is first extracted from the sources into a staging area, then transformed or reformatted with business manipulation performed on it, and only then loaded into the target or destination database or data warehouse.

warehouse etl 1 — Data Warehousing – Extract, Transform, and Load

Middle tier

One or more OLAP (Online Analytical Processing) servers reside in the middle data warehouse layer. This technology can be used to create complex budget plans and perform analyses cost-effectively. So in the three tier data warehouse architecture, jobs are generated in the top tier and sent to this middle tier. Here, the data in the bottom tier is then accessed and analyses are performed. The result is then sent to the top tier and thus made available to the user, and/or forwarded to the bottom tier for storage of the analysis results in persistence.

What is an OLAP Server?

Basically, three OLAP server models are distinguished.
In Relational OLAP (ROLAP) the operations on multidimensional data are based on standard relational operations. The Multidimensional OLAP (MOLAP) directly implements the multidimensional data operations. A mixture of relational and multidimensional processing can be handled by Hybrid OLAP (HOLAP).
The choice of the server model always depends on the data composition in the lowest layer.

Top-Tier

The top tier is the top of the three tier data warehouse architecture, the front-end client layer. It contains query and reporting tools, analysis tools, and data mining tools, thus providing the interface to the user. Here he can generate analyses and take a look at the data.

Terminologies

However, some terms that often come up in connection with this system need to be clarified.
When metadata is mentioned, a kind of roadmap to the data warehouse is meant. Here the warehouse objects are defined and it acts as a directory. This means that the decision support system finds the contents via the metadata.

The metadata is stored in the metadata repository. An integral directory that manages both the business metadata, i.e. data ownership information, business definition and change policies, and the operational metadata. Operational metadata refers to the timeliness of the data is it active, archived or cleansed, and data lineage, which is the history of the data. This includes the data used to map the operational environment, source databases and their contents, data extraction, data partitioning, cleansing, transformation rules, data refreshing and cleansing rules, but also the algorithms for summation, dimensional algorithms, data for granularity, aggregation, summation, etc.

The so-called data cube represents data in multiple dimensions and the data mart contains only the data specific to a certain group.

Data Warehouse Types and How they work

All Data Warehouse systems follow the same basic structure, which we explain in this article, but can consist of different components. Accordingly, they are typified.

Host-Based mainframe warehouse

The Host-Based mainframe warehouse resides on a large-volume database.
In addition to this database, metadata is managed in a central metadata repository. Within this metadata, for example, the information for the documentation of data sources or data translation rules are stored.

In general, three phases run in this information system.
Selections and scrubbing methods take place in the unloading phase. That is, the appropriate data types and data sources are determined here and the data is subsequently error corrected.
In the following transform phase the data are translated into a suitable form. Here also already the rules for the access and the storage are specified.
In the final Load phase, the preprocessed data set is moved into tables.

Host-Based LAN data warehouse

With this type, information can be extracted from a variety of sources. Multiple LAN-based warehouses are supported. Data provisioning can be either centralized or from the workgroup environment. The size depends on the platform.

Multi-Stage Data Warehouses

Here, the data is staged several times before being loaded into the data warehouse and finally distributed into department-specific data marts.

Stationary Data Warehouse

In the case of stationary warehouse types, the data from the sources are not changed. The customer thus has direct access to the data.

Distributed Data Warehouses

In the distributed warehouse system, the data is basically distributed. For technological or business reasons, this separation can take place in local and global warehouses. The local levels are then only integrated within the local site but also contain historical data and is therefore absolutely autonomous. This is where most of the operational processing takes place, while the global part processes the data that is relevant to the company as a whole.

Are data warehouses still promising?

Nowadays, data streams are on everyone’s lips and it is precisely with iterative and highly dynamic data sources that large data warehouse systems reach their limits. Often, smaller tools that do not require a complete solution are more efficient. So are data warehouse systems dead after all? No, definitely not. Apart from the basic principles, such as the pursuit of data truth, which can be applied to any other software system. In many cases, however, a data warehouse system is still a high-performance overall solution and can coexist with data stream pipeline systems.

With Apache Hive you can access a free Apache data warehouse software. With this software you can easily implement very performant big data systems. Here we have collected the most important information about Hive.

AI vs Machine Learning vs Deep Learning – It’s almost harder to understand all the acronyms around AI than the technology itself.

January 5, 2021 / RainerGewalt / 5 Comments

It’s almost harder to understand all the acronyms around Artificial Intelligence (AI) than the technology itself.
AI vs Machine Learning vs Deep Learning – These terms are often carelessly mixed together. But what are actually the differences? In this article, we will introduce you to all Three fields, because even though there is overlap, they differ.
It should be important for you to know these differences, as each discipline describes different stages of a data analysis pipeline.

AI vs Machine Learning vs Deep Learning

In the following figure, we have schematically shown you the individual fields in their context. As you can see, the individual disciplines surround each other and form an onion-like layered model.

The figure clearly shows that there are relationships between individual disciplines. AI is to be understood as a generic term and thus includes the other fields. The deeper you go in the model, the more specific the tasks become. In the following, we will follow this representation and work our way from the outside to the inside.

Artificial intelligence

All disciplines are encompassed by the term AI. It is a science that explores ways to build intelligent programs and machines that can perceive, reason, act, and solve problems creatively. To this end, it attempts to model how the human brain works.
The following figure shows that AI can basically be divided into two categories.

Classification is about measuring the performance of AI based on how well it is able to replicate the human-like brain. In the Based on Functionality category, AI is classified based on how well it matches the human way of thinking. In the second category, it is evaluated based on human intelligence. Within these categories, there are still some subgroups that correspond to an index.

AI vs Machine Learning

So what is the first subcategory Machine Learning and how does it differ from AI?
While AI deals with the functioning of artificial intelligence and compares them with the functioning of the human brain, machine learning is a collection of mathematical methods of pattern recognition. It is about how a system is given the ability to automatically learn and improve from experience. Various algorithms (e.g., neural networks) are used for this purpose. In the following scheme, the broad machine learning field is presented in a categorized way.

In machine learning, algorithms are used to build statistical models based on training data. Roughly, these algorithms can be divided into three main learning techniques. While in supervised learning the result is predetermined by a cleanly labeled data set, unsupervised learning is completely self-organized. Here the patterns are to be explored independently.
In reinforcement learning, utility functions are to be independently approximated based on rewards received.

Machine Learning vs Deep Learning/ Deep Neural Learning

Deep learning is a subfield of machine learning similar to machine learning in Ai. Here, multilayer neural networks are used to analyze various factors in large amounts of data. These networks are similar to the human neural system. If you want to know more about this structure, read our article on perceptrons, the smallest unit of a neural network.
Optimization of neural weights, unlike machine learning, can be done using powerful GPUs. Pure machine learning is best used on structured data sets, while for unstructured data you should opt for deep learning. In the following graphic, we have summarized the main factors that make up deep learning. For the network types autoencoder and CNN we provide more detailed articles.

Representation of all basic deep learning components
AI vs Machine Learning vs Deep Learning — Definition Deep Learning

H2O AI – That’s why it’s so great

December 5, 2020 / RainerGewalt / 0 Comments

There is a lot of Big Data software available now. One of them that you should definitely know about is the H2O AI Machine Learning solution.

With this open-source application you can implement algorithms from the fields of statistics, data mining and machine learning. The H2O AI Engine is based on the distributed file system Hadoop and is therefore more performant than other analysis tools. Your machine learning methods can thus be used as
parallelized methods.

Software Stack

They can program their algorithms in R, Python and Java and thus in the most important mathematical programming languages. H2O provides a REST interface to Python, R, JSON and Excel. Additionally, you can access H2O directly with Hadoop and Apache Spark. This makes integration into your data science workflow much easier. You already get approximate results while running the algorithms. A graphical web browser UI helps you to better analyze the processes and perform targeted optimizations.

How Clients Interacts with H2O AI

You can interact with H2O via clients using various interfaces. It is important for you to know that the data is usually not held in memory. They are localized in a H2O cluster and you only get a pointer to the data when you make a request.

H2O Frame

The basic unit of data storage accessible to you is the H2O Frame. This corresponds to a two-dimensional, resizable and potentially heterogeneous data point. This tabular data structure also contains labeled axes.

H2O Cluster

Your H2O cluster consists of one or more nodes. A node corresponds to a JVM process and this process consists of three layers.

H2O Machine Learning Software Structure — H2O Software Stack

H2O Machine Learning Components

Language Layer

The R evaluation layer is a slave to the REST client front-end and in the Scala layer you can write native programs and algorithms. You can then use these with H2O Machine learning.

Algorithms Layer

This layer is where your algorithms are applied. You can run statistical methods, data import and machine learning here.

Core Layer

In this layer you handle the resource management. You can manage both the memory and the CPU processing capacity.

Microsoft Power Platform – To turn your company into an Industry 4.0 enterprise, you can no longer avoid cloud solutions

November 28, 2020 / RainerGewalt / 0 Comments

In this article, we will show you everything about the cloud-based web tool Microsoft Power Platform and why you shouldn’t do without it.

Cloud-based web development has gained popularity in the web development industry in recent years. The globalization of the workforce and the diversification of the work process have significantly driven the development of cloud-based services.

What is Microsoft Power Platform?

With Power Platform you get an integrated application platform consisting of a group of different Microsoft products with which you can develop complex business solutions. This way you can make your business processes more efficient and productive. The platform can also take care of data storage, entry and processing. Data analysis via complex visualizations and predictions can also be handled by various services.

Microsoft Power Platform Services

Since the Power Platform is a collection of different Microsoft services, we want to give you an overview of the individual parts.

Build your own Apps

With Power Apps you get a user suite for mapping custom apps. All apps are independent of data sources and can be extended by you as you wish via drag & drop. This way you can adapt them to your needs.

Automate your tasks

Power Automate was still called Microsoft Flow until 2019. This web tool lets you automate recurring tasks and simple cross-platform workflows. You can connect to over 100 third-party systems via connectors. This allows you to automate processes outside the Microsoft environment and across applications.

Schematic representation of the Principle Microsoft Power Automate — Principle Microsoft Power Automate

Analyze your business data

With Power BI you get a business intelligence tool. With this you can access different data sources. An advantage to other BI tools is the deep integration with Excel. So you can create user-friendly, data connections and visualize.

Should I choose Microsoft Power Platform?

You need more and more scalable and secure solutions at low cost. Device-independent access to important planning and evaluation software will increasingly become the focus of attention and change global corporate structures. To turn your company into an Industry 4.0 enterprise, you can no longer avoid cloud solutions.

The question is not: if, but when you will choose a corresponding service. The question is not: if, but when you will choose a corresponding service. Microsoft, as one of the largest IT companies, now presents you with a solution. Whether it fits your needs, however, you must ultimately decide for yourself.

Array vs Object – The creation of a JSON structure follows some rules you should know

November 26, 2020 / RainerGewalt / 3 Comments

Array vs Object – JSON is one of the most popular data formats. However, the creation of such an object is done according to some rules. These rules depend on the original data type. In this article we will introduce you to the conversion of some JSON data types (Array vs Object).

What is JSON anyway?

With the JavaScript Object Notation, JSON for short, you can structure data compactly and independently of programming languages. The data format is therefore particularly well suited for exchange between your applications, for general data storage (file extension “.json”) and for configuration files. The data is also readable for you and coded in the standardized text format. The application notes of the data format are defined by the standards – RFC 8259 and the JSON syntax by the standards ECMA-404. Due to its easy integration with JavaScript, you can use it well for transferring data in web applications.

You can best compare the JSON data structure to XML and YAML, only it’s simpler and more compact.

What are the basic rules?

This code snippet shows a simple json object structure — Simple JSON Object

The JSON text structure is based on the JavaScript Object Syntax. Hierarchical data structures are thus possible. It contains only properties and no methods. The basis is formed by name-value pairs and ordered list of values. Basically, they are formatted with curly braces and as strings. This is especially advantageous if you want to transfer the data over the network. If you want to access the data you have to convert the text structure into a native JavaScript object.

Data Formats – JSON Array vs Object

Basically, you can have different data types included in JSON.

Value:

Your JSON value can take one of the following allowed types.

Object:

A JSON object represents the basic form of a JSON text. With this you can accept any data type that is suitable for inclusion in JSON.

Array:

JSON Array vs Object – It is possible to include an array. Arrays can contain objects, strings, numbers, arrays and boolean. You can include arrays as shown schematically below, enclosed with two square brackets.

In this way, you can further and further nest the individual data types with each other and thus easily create any number of hierarchy levels. For example, object attributes can consist of arrays, or arrays can contain multiple objects.

Data Mining vs Big Data Analytics – You need the right tools and you need to know how to use them!

November 16, 2020 / RainerGewalt / 0 Comments

Data Mining vs Big Data Analytics – Both data disciplines, but what makes them different? In this article, we introduce you to both fields and explain the key differences.

Data Science is an interdisciplinary scientific field, as it has become more and more in focus in the last decades. Many companies see this as the key to an Industry 4.0 company. The hope is that valuable information can be found in the company’s own data, which can be used to massively increase its own profitability. Terms such as big data, data mining, data analytics and machine learning are being thrown into the ring. Many people do not realize that these terms describe other disciplines. If you want to build a house, you need the right tools and you have to know how to use them.

Map of Data Disciplines

First of all, you should think of the individual disciplines as being layered into each other like an onion. So there is overlap between all the fields and when you talk about a discipline, you are also talking about lower layers.

data mining vs analytics - This diagram shows the relationships between the individual data disciplines — Map of data disciplines

Since data analytics is located above data mining in the layer model, it is already clear that mining must be a sub discipline of analytics. Therefore, we will first describe the comprehensive discipline.

Data Mining vs Big Data Analytics – What is Analytics?

Big data analytics, as a sub field of data analysis, describes the use of data analysis tools and without special data processing. in data analytics, you use queries and data aggregation methods, but also data mining techniques and tools. The goal of this discipline is to represent various dependencies between input variables.

The goal of this discipline is to represent various dependencies between input variables. The following figure shows the individual overlaps in the use of the tools of the different disciplines.

scheme about overlaps in the use of the tools of the different data disciplines — Overlaps of the different data disciplines

Data Mining vs Big Data Analytics – What is Data Mining?

Data mining is a subset of data analytics. At its core, it is about identifying and discovering a large data set through correlations. Especially if you know little about the available data this field should be used.

But what does a typical data mining process look like and what are typical data mining tasks?

Data Mining Process

You can divide a typical data mining process into several sequential steps. In the preprocessing stage, your data is first cleaned. This involves integrating sources and removing inconsistencies. Then you can convert the data into the right format. After that, the actual analysis step, the data mining, takes place.Finally, your results have to be evaluated. Expert knowledge is required here to control the patterns found and the fulfillment of your own objectives.

The term data mining covers a variety of techniques and algorithms to analyze a data set. In the following we will show you some typical methods.

Data Mining Tasks

Besides identifying unusual data sets with outlier detection, you can also group your objects based on similarities using cluster analysis. In this article we have already summarized some popular clustering algorithms that you should know as a data scientist. While association analysis only identifies the relationships and dependencies in the data, regression analysis provides you with the relationships between dependent and independent variables. Through classification, you assign elements that were not previously assigned to classes to existing classes. You can also summarize the data to reduce the data set to a more compact description without significant loss of information.

Data Mining vs Big Data Analytics – Conclusion

Although the two disciplines are related, they are two different disciplines. Data mining is more about identifying key data relationships, patterns or trends in the data, while data analytics is more about deriving a data-driven model. On this path, data mining is an important step in making the data more usable. In the end, it’s not a versus, but both disciplines are part of an analytics pipeline.
In this article, we will go further into the differences between the various data sciences and clarify the difference between data analysis and data science.

ricardo gomez angel j5gCOKZdm6I unsplash

Apache Avro – Effective Big Data Serialization Solution for Kafka

November 15, 2020 / RainerGewalt / 0 Comments

In this article we will explain everything you need to know about Apache Avro, an open source big data serialization solution and why you should not do without it.

You can serialize data objects, i.e. put them into a sequential representation, in order to store or send them independent of the programming language. The text structure reflects your data hierarchy. Known serialization formats are for example XML and JSON. If you want to know more about both formats, read our articles on the topics. To read, you have to deserialize the text, i.e. convert it back into an object.

In times of Big Data, every computing process must be optimized. Even small computing delays can lead to long delays with a correspondingly large data throughput, and large data formats can block too many resources. The decisive factors are therefore speed and the smallest possible data formats that are stored. Avro is developed by the Apache community and is optimized for Big Data use. It offers you a fast and space-saving open source solution. If you don’t know what Apache means, look here. Here we have summarized everything you need to know about it and introduce you to some other Apache open source projects you should know about.

Apache Avro – Open Source Big Data Serialization Solution

With Apache Avro, you get not only a remote procedure call framework, but also a data serialization framework. So on the one hand you can call functions in other address spaces and on the other hand you can convert data into a more compact binary or text format. This duality gives you some advantages when you have cross-network data pipelines and is justified by its development history.

Avro was released back in 2011 as a part of Apache Hadoop. Here, Avro was supposed to provide a serialization format for data persistence as well as a data transfer format for communication between Hadoop nodes. To provide functionality in a Hadoop cluster, Avro needed to be able to access other address spaces. Due to its ability to serialize large amounts of data, cost-efficiently, Avro can now be used Hadoop-independently.

You can access Avro via special API’s with many common programming languages (Java, C#, C, C++, Python and Ruby). So you can implement it very flexible.

In the following figure we have summarized some reasons what makes the framework so ingenious. But what really makes Avro so fast?

The schema clearly shows all the features that Apache Avro offers the user and why he should use it — Features Apache Avro

What makes Avro so fast?

The trick is that a schema is used for serialization and deserialization. About that the data hierarchy, i.e. the metadata, is stored separately in a file. The data types and protocols are defined via a JSON format. These are to be assigned unambiguously by ID to the actual values and can be called for the further data processing constantly. This schema is sent along with the data exchange via RPC calls.

Creating a schema registry is especially useful when processing data streams with Apache Kafka.

Apache Avro and Apache Kafka

Here you can save a lot of performance if you store the metadata separately and call it only when you really need it. In the following figure we have shown you this process schematically.

When you let Avro manage your schema registration, it provides you with comprehensive, flexible and automatic schema development. This means that you can add additional fields and delete fields. Even renaming is allowed within certain limits. At the same time, Avro schema is backward and forward compatible. This means that the schema versions of the Reader and Writer can differ. Schema registration management solutions exist, with Google Protocol Buffers and Apache Thrift, among others. However, the JSON data structure makes Avro the most popular choice.

IaaS vs PaaS vs SaaS – The Various Facets of Cloud Computing

November 11, 2020 / RainerGewalt / 1 Comment

IaaS vs PaaS vs SaaS – terms that categorize clouds, but what exactly do they mean? In this article, we contrast all three and explain the differences.

In almost all areas, the cloud is becoming more and more important. Increasingly, the cloud is also becoming interesting for business processes. Everyone is talking about it, but what is it actually?

What is the cloud anyway?

The cloud basically means the use of different servers. This means that your data can be hosted online, i.e. stored, managed and processed.
So you don’t have to provide the appropriate hardware on site, but can rent these resources from a cloud provider. Read our article about the cloud computing provider AWS.
Besides Amazon, other global players such as Google (Google Cloud) and Microsoft (Azure) also offer profitable cloud resources.
But which ones are suitable for me or my company? To meaningfully compare the individual solutions, you need to understand the differences between them.
Basically, you need to distinguish between the three categories already mentioned.

IaaS vs PaaS vs SaaS - Diese Abbildung zeigt die Die 3 Cloud Kategorien — IaaS vs PaaS vs SaaS

IaaS vs PaaS vs SaaS – What are the Differences?

First and foremost, all three terms are used to describe a resource provided by a cloud service provider for a short period of time.
The following figure shows this “as-a-service”, or Flexible consumption model, and the management components..

IaaS vs PaaS vs SaaS - This diagram shows the distribution of tasks between providers and customers in the individual cloud categories depending on the service layer model. — **Red**: managed by others; **Green**: managed by your organization

You can see very clearly here that the cloud provider manages more and more layers, ascending from IaaS to SaaS.

Software as a Service (SaaS)

The abbreviation SaaS refers to cloud-based software. This is hosted online by a company and provided via the Internet. It is easy to use and manage. Additionally, it is highly scalable, meaning it can be used for an entire organization.

Platform as a Service (PaaS)

PaaS is used to describe a cloud-based platform service. This offers developers an online platform for application development. Data is provided, stored and managed online.s

Infrastructure as a Service (IaaS)

IaaS refers to cloud-based infrastructure resources provided via virtualization technologies. These services are designed to help companies build and manage their servers, networks, operating systems and data storage. This is where the highest administrative share lies with the customer. Access to the servers for data management takes place via a dashboard or API.

IaaS vs PaaS vs SaaS – For whom is which category suitable?

So who should choose which service model? The following figure shows that the more tasks are taken over by the provider, the more control is relinquished. This is especially detrimental in organizations where a lot of control is needed.

IaaS vs PaaS vs SaaS - Presentation of the individual services depending on the control and for whom they are suitable. — Services depending on the control

IaaS gives administrators more direct needed, control over operating systems. However, more control always comes with more complicated administration tasks. PaaS therefore offers users a certain compromise between flexibility and ease of use. This model is particularly appealing to developers.
The SaaS model offers the highest level of usability and is accordingly interesting for customers who want to take over no to few administrative tasks.

IaaS vs PaaS vs SaaS – Technology of the future?

Cloud resources can be a valuable alternative to expensive, in-house hardware solutions. Of course, with external administration, a company loses control over its own data. However, the different types of service mean that compromises can be made that are tailored to the company’s own needs.

The advantages are obvious. Individual services can be accessed from virtually anywhere at any time, and high-performance computing can be operated cost-effectively. As network technologies become faster and faster, these solutions are increasingly coming into focus and will certainly become more and more important for companies and private individuals in the coming years.

Matplotlib vs Seaborn – Who owns the Python visualization throne?

November 7, 2020 / RainerGewalt / 0 Comments

Matplotlib vs Seaborn – Matplotlib is often the first choice when it comes to creating mathematical plots with Python. But is it always the best choice? With Seaborn there is a potent competitor.

Matplotlib was developed by John D. Hunter back in 2003 and has become indispensable. Due to the increasing importance of the Python programming language in almost all scientific areas, the importance of fully compatible visualization methods is also growing.

Due to its open source concept, Matplotlib can be used absolutely free of charge and is a basic component of many popular Python distribution platforms, such as Anaconda.

The library offers a MATLAB-like interface and can be used in combination with NumPy, Pandas and Scipy, just like MATLAB.

SciPy is a collection of mathematical algorithms and convenience functions and is mainly used by scientists, analysts and engineers for scientific computing, visualization and related activities.
NumPy allows easy handling of vectors, matrices, or large multidimensional arrays in general.
NumPy’s operators and functions are optimized for multidimensional array operations and evaluate particularly efficiently.

Pandas is also an open source Python library that can be used to perform data analysis and manipulation efficiently. Its strength lies in the processing and evaluation of tabular data and time series.

These components, which are absolutely compatible with each other, offer in their entirety an absolutely free, but fully comprehensive alternative to the commercial analysis software MATLAB.

This figure shows some Python libraries, which together form an open source MATLAB alternative. — Matplotlib vs Seaborn – Together the Python libraries form a MATLAB replacement

Python Matplotlib – What are the features?

The library offers a wide range of visualization functions. Some of them are listed in the figure below.

Matplotlib vs Seaborn - This figure shows Matplotlib features sorted by their use cases. — Matplotlib vs Seaborn – Matplotlib Features

Matplotlib is designed to effectively visualize the results of mathematical calculations. Visualization is an efficient and important data analysis tool.
The library is able to generate all the usual diagrams and figures by default. It is even possible to create animations that can be used to better understand the flow of certain algorithms.

Event Handling

Matplotlib offers an important feature with event handling. Behind the name is a UI-neutral event model. This allows the library to connect to events without knowing which UI Matplotlib will eventually plug into.

This allows me to develop a very flexible and portable code.
However, the events can then be used to transfer things like the data coordinate.

PyLab vs Pyplot

PyLab is a collection of functions that is installed together with Matplotlib and make the library work like MATLAB.
The module brings a set of NumPy functions and classes into the namespace. This makes them accessible without having to import them.
However, this often led to conflicts between individual Matplotlib functions.
For this reason, the use of PyLab is now no longer recommended.
Pyplot is a module in Matplotlib and provides the state-machine interface to the underlying plotting library.

The conflicts are prevented because an import is done with Pyplot and a separate NumPy import.

Python Matplotlib – Third party packages

If the standard library features are not enough, you can extend Matplotlib with additional external packages. In the following figure some of the possible extensions are listed and grouped by application.

Python Matplotlib - This figure shows Matplotlib Third Party Packages sorted by their use cases. — Matplotlib vs Seaborn – Matplotlib Third Party Packages

These external packages must be installed individually and extend the functionality of the plotting library, or build on existing features.
They sometimes offer more complex graphics or higher performance data analysis methods. Most of these packages are open source and are constantly updated by very active communities.

Matplotlib also has weaknesses

Matplotlib is not perfect despite the wide feature set. For example, only poor default options for the size and colors of plots are offered. Matplotlib is often considered to be a low-level technology compared to today’s requirements. Thus, very specialized code is needed to generate appealing plots.

What is Seaborn?

Seaborn is a Python visualization library, but based on Matplotlib. This library provides a high-level interface for visualization of statistical data and not only has its own graphics library, but internally uses Matplotlib’s functionalities and data structures.
It thus offers a variety of additional features besides the śtandard Matplotlib functions.

This scheme shows the main features of Seaborn — Matplotlib vs Seaborn – Main features of Seaborn

Among other things, Seaborn provides built-in themes for designing matplotlib graphs and a dataset-oriented API for determining the relationship between variables. It can visualize both univariate and bivariate data and plot statistical time series. Estimation and plotting of linear regression models run automatically and Seaborn, unlike Matplotlib, offers optimization when processing NumPy and Pandas data structures.

So what should you choose?

Especially when it comes to deep statistics, Seaborn clearly has the edge. Matplotlib, however, is often the leaner solution due to its simplicity. So both have their strengths and weaknesses. Which tool you ultimately choose depends on the situation. You can’t do much wrong. With one solution, however, you have more contextual options. But now that you know the differences between the two, this decision will be easier for you.

Data Science vs Data Analysis – How to decide which one is right for you?

November 5, 2020 / RainerGewalt / 0 Comments

Data Science vs Data Analysis – What distinguishes both professions from each other? How do your tasks differ? In this article, we will discuss all of these questions.

By now, almost every company, across industries and sizes, has recognized the potential in their own data. Every company wants to access this treasure and gain valuable information in order to develop profitable business strategies.
The economy is crying out for experts who can manage and analyze the enormous volumes of data. A trend that is not expected to end in the next 10 years, but rather to grow steadily.
So if you decide to enter the industry today and start studying, or if you want to teach yourself, you should first be clear about the differences between these often confusingly named professions. Often, HR professionals don’t even know these differences and look for the wrong profiles.

What are the similarities?

Let’s start with the similarities and the main reasons why both disciplines are often confused with each other.

Both professions deal with large amounts of data from which knowledge is to be extracted for a specific purpose.
New insights are to be generated and actions are to be identified.

Map of data disciplines

In order to properly understand the relationships between the data sciences, we need to look at the following figure. The individual disciplines and their relationships to each other are shown here.

Data Science vs Data Analysis - This scheme shows the map of all data disciplines — Data Science vs Data Analysis – Map of data disciplines

The diagram corresponds to an onion-like layering. It is important to understand that all the disciplines listed here are different. Not only are there intersections, but when you talk about a higher level discipline, it includes other, lower level disciplines.

As you can see, both data science and data analysis are ranked very high. So to understand these two disciplines you need to know the other fields as well.

What is Data Science?

When you talk about data science, you are also talking about all other data disciplines.
A data scientist is an all-rounder and can apply all interdisciplinary tools and methods. He or she can handle structured and unstructured data and perform data preprocessing in addition to analysis.

What is Data Analysis?

Data analysis is more about using the right data analysis tools. Specialized data processing is not required at this level, but a data analyst must be able to fully master and understand the tools in order to gain new insights from the data.

What is Data Analytics?

Data analytics is primarily about the use of queries and data aggregation methods. The primary question here is: How can different dependencies between input variables be represented?
Furthermore, this discipline makes use of data mining techniques and tools.

Data mining

Data mining uses the predictive power of machine learning by applying various machine learning algorithms to big data to identify new trends in the data.

If you want to know even more about how data mining differs from data analytics, check out this article we wrote on the subject.

Data Science vs Data Analysis – So what are the differences?

So we have found that all data disciplines are similar in many ways and one discipline can imply other disciplines. In order to be able to define the differences precisely, the methods used must be compared with each other. Are programming skills required, or is the business intelligence part higher?

In the following figure, the assignment to both professions is shown once.

Data Science vs Data Analysis - This diagram shows the cornerstones of the two data disciplines. Mathematics, statistics and business intelligence — Venn Diagram
by Hugh Conway in 2010

Both disciplines lie at the intersection of mathematics, statistics, and development. While data science is characterized by the fact that it consists of all three cornerstones, data analysis lacks the connection to computer science. And that is the biggest difference between the two fields.

Data Science vs Data Analysis – Comparison

Data Science is a branch of Big Data, with the objective of extracting and interpreting information from a huge amount of data. To do this, a data scientist must design and implement mathematical algorithms and predictive models based on statistics, machine learning, and other methods.
Data Analysis is the specific application of Data Science. It specifically involves searching raw data sources to find trends and metrics. However, this involves working with larger data sets than in the area of Business Intelligence.

In the following diagram, these differences and the overlaps between the two professions are compared once again.

datascientist vs dataanalyst — Data Science vs Data Analysis – Comparison

So what you ultimately decide to do depends on your programming interests. Do you want to develop the analyses yourself, or do you prefer to use specific analysis tools to get more value out of large data sets?