Month: February 2021

TensorFlow vs Theano – Which One Should You Choose

TensorFlow vs Theano – TensorFlow, along with PyTorch, is currently the best known and most widely used machine learning framework. However, the choice of tool should never depend on one’s own preferences, but should be adapted to the data to be examined. Especially in the Big data area, this can prevent a decisive loss of performance. It is therefore also worthwhile to look off the beaten track and to look at other frameworks and libraries in addition to the top dogs.
Theano is one such open source Python library. In the following article, we will introduce both tools and explain the differences.

What is Tensorflow?

The open source framework TensorFlow is the direct successor of Google’s first deep learning tool DistBelief and primarily also forms the basis for neural networks in the environment of language and image processing tasks. With TensorFlow, own models can be developed and processed, but also pre-trained models can be accessed. TF runs on a variety of platforms and is implemented in Python and C++.

TensorFlow vs Theano - This figure shows the hierarchy of the TensorFlow framework.
Hierarchy of TensorFlow toolkits

TF offers low-level APIs for CPU, GPU or TPU. In this way, the hardware resources can be optimally adapted to the process through dynamic allocations.
In addition to the low level APIs, there are also various high level APIs, such as Keras, one of the best known and most frequently used. If you want to know more about Keras, check out our article on the topic.

Framework Architecture

Mainly, the TensorFlow framework can be divided into the components needed for training, where the models are prepared for field use, and for the final deployment, for example on mobile and IoT devices with TensorFlowLite. To simplify the training, TensorFlow offers the developer some useful services besides the already mentioned dynamic allocation. For example, a premade estimator offers a high-level representation of a complete model.Via the TensorFlow Hub, a kind of repository, even trained machine learning models can be other language bindings can be accessed.

TensorFlow vs Theano - This figure shows the structure of the TensorFlow framework.
TensorFlow vs Theano – Structure of the TensorFlow Framework

The TensorBoard and StoredModels services act as connecting elements between training and deployment. TensorBoard is the visualization toolkit of TensorFlow with which the experiment results can be visualized. So here it is more of a monitoring solution for the human interface. With the StoredModels both deployment services and training services can share the models. This service thus forms a kind of intermediary, but contains a complete TensorFlow program, including all weights and calculations.

TensorFlow – Data Structure

Neural networks are represented by directed cycle-free graphs. These graphs can be represented and computed beyond the computer limits of training. A graph basically consists of nodes connected by edges. The extent to which the nodes are interconnected also usually determines the learning procedure and thus the structure of an artificial neural network.
The inputs and outputs of the individual calculation steps represent multidimensional data arrays, so-called tensors.

This figure shows the basic tensor structure
Tensor Principle

The mathematical term tensor corresponds to a generalization of vectors and matrices. It is thus an elementary data structure for data representation and processing. In TensorFlow the implementation is done as multidimensional arrays . A vector thus corresponds to a one-dimensional tensor.
Additional dimensions can be added to a tensor up to infinity. Common tensor types are 3-dimensional tensors for time series, images are usually 4-dimensional, and videos are 5-dimensional tensors.

pytorch training 2
Tensors and neural networks

TensorFlow methods manipulate tensors for linear algebra operations. These processes can be executed with high performance by moving the tensor objects to the graphics card memory or tensor optimized TPUs.

TensorFlow – Training

The training itself then proceeds in such a way that training data are iteratively fed into the computers and at the same time the weights within the graph are varied. The output is then approximated to a target output value. To this end, separate test data can be used to periodically verify that the training is effective for arbitrary or different input data.

 The figure shows the sequence of the training of a neural network
Training procedure

Theano – Old but Gold

Theano is an open source Python library for machine learning and neural network programming, and compiler for mathematical expression computation. It was released back in 2007 by the Montreal Institute for Learning Algorithms (MILA) at the University of Montreal.
It is particularly suitable for the definition, optimization and evaluation of mathematical expressions involving multidimensional arrays. For this purpose, Theano accesses the NumPy program library for dealing with matrices, large multidimensional arrays and vectors. First, read our article on NumPy. Here we introduce you to this elementary Python library and explain its basic data management.

Mathematical expressions are programmed and symbolized in Theano using a NumPy-like syntax.
The calculation instructions are done in C++ or CUDA code, thus very close to the machine and accordingly very efficient on CPUs or graphics processing units (GPUs).
Theano can also be used, like TensorFlow as a backend for the framework Keras. Keras thus forms an intersection for both technologies.

Graph Structure

Unlike TensorFlow, Theano focuses on supporting symbolic matrix expressions rather than tensors as a basic data type. Although all kinds of Python objects are supported, basic tensor functionality can be used with Theano, but these operations are not as optimized as with TensorFlow.

Theano performs the symbolic mathematical calculations are executed as graphs. These graphs are composed of interconnected Apply, Variable and Op nodes.

TensorFlow vs Theano - Overview structure of a Theano graph
TensorFlow vs Theano – Overview structure of a Theano graph

The Op node represents a particular computation on a particular type of input that produces a particular type of output. It thus corresponds to the definition of a computation.

The centrally located Apply node represents the application of an Op to some variables, that is, the application of computations to the current data, and is used to represent a computation graph. Each op is responsible for knowing how to build an Apply node from a list of inputs and thus determines the determines the function and transformation.
An Apply node additionally consists of the input or output fields. The inputs represent the arguments of the function, and the outputs represent the return values of the function.

The Apply nodes then refer to their input and output variables, the main data structure, in the graph via their input and output fields, respectively.
These Variable Nodes are defined by various fields. The variable type, the owner, which can be None or an Apply node of which the variable is an output, the index and the variable name.

TensorFlow vs Theano

All in all, both technologies have their advantages and disadvantages. But both have their raison d’être. Here, too, the data set provides the tools.

In the table below, we have listed all the important points of difference in detail.

TensorFlow vs Theano - This table compares both tools in detail.
TensorFlow vs Theano – Comparision

Especially when it comes to tensor processing, as in image processing and sound recognition, TensorFlow with its optimized operations should be the first choice. Another tensor-based alternative to the Google solution is PyTorch from Facebook. In this article we compared these two tools.
Despite its age, Theano is a high-performance and modern alternative for the calculation of matrix expressions.

PyTorch vs TensorFlow – Facebook vs Google – Understanding the Most Popular Deep Learning Frameworks

In recent years, the field of data science has been able to access increasingly powerful analysis methods thanks to increasingly high-performance hardware. Google’s Tensorflow has been the benchmark for editing machine learning and modeling deep learning methods. It still has the most freedom today. But a wide range of options often creates a high barrier to entry.

PyTorch vs TensorFlow – With the 2 years younger, also Python-based, open source package PyTorch, Facebook now wants to knock Tensorflow off its throne. It has been steadily gaining popularity for years due to its simplicity and features.
In this article, we will clarify what is in the package and whether it can really compete with Tensorflow.

What is PyTorch?

Pytorch is one of the most popular open source Python packages for scientific computing and neural network development/training.
It was developed by Facebook in 2016 and is based on the Torch library written in Lua. A NumPy-like tensor library that provides rich GPU support to enable accelerated neural network learning. PyTorch is also often referred to as the library of the same name. More about this in the section “Libraries”.
Tensors form the elementary data structures for PyTorch, similar to Tensorflow.

PyTorch vs TensorFlow – Tensors form the basis for both!

The mathematical term tensor corresponds to a generalization of vectors and matrices. It is thus an elementary data structure for data representation and processing. In PyTorch the implementation is done as multidimensional arrays . A vector thus corresponds to a one-dimensional tensor.

the figure schematically shows the principle behind tensors.
PyTorch vs TensorFlow – Tensor Principle

More dimensions can be added to a tensor up to infinity. Common types of tensors are 3 dimensional tensors for time series, images are usually 4 dimensional and videos are five dimensional tensors.

The figure shows the role of tensors in the training of neural networks in PyTorch.
PyTorch vs TensorFlow – Tensors and neural networks

PyTorch methods manipulate tensors for linear algebra operations. These processes can run at high performance by moving the tensor objects into the graphics card memory.

PyTorch Libraries

Pytorch offers the possibility to include specific libraries. This way the program can be kept lean and only make references to needed code.
The PyTorch library itself is an optimized tensor library for deep learning on both GPUs and CPUs.
By including another library, PyTorch can also compute on TPUs.

Depending on the data type, different libraries can be loaded, which provide optimized methods and pre-modeled prototypes for analysis. Torchaudio offers besides the usual audio transformation methods also data sets for training. With torchtext large language packages can be accessed and with torchvision images can be analyzed.

The figure shows all PyTorch Libraries.
PyTorch Libraries

With TorchElastic, training jobs can be managed and elatically distributed, for example, to shared capacities.

PyTorch features

Through accelerated tensor analysis via allocation to GPUs, PyTorch achieves high flexibility and high speed in Deep Learning algorithms. Beyond this, PyTorch offers through its Python base unlimited compatibilities to powerful Python libraries, such as NumPy and SciPy and to the Cython programming language. Here we have collected the most important Python open source data management and analysis libraries.

Reverse-mode auto differentiation allows developers to modify network behavior at will, without delay or overhead. This allows for essential acceleration of research iterations.
The 8-bit quantization model ensures efficient deployment on servers and edge devices, and PyTorch Mobile can be used to develop for Android and iOS environments.
Other features include named tensor, artificial neural network pruning, and parallel training of models with remote procedure call.

PyTorch can access TorchServe, an open source server from Facebook, and is fully compatible with cloud provider Amazon Web Services (AWS). If you don’t know what AWS is, read our article on the subject.

PyTorch offers a hybrid frontend as an additional feature. This offers the possibility to choose between two modes. The Eager and the Graph mode. The eager mode primarily offers usability and flexibility, while the graph mode offers better speed, optimization and functionality in a C++ runtime environment. PyTorch also allows conversion with the Hybrid frontend. This allows models to be developed in eagermode and then transferred to graph mode for production.

PyTorch has unlimited access to ONNX (Open Neural Network Exchange) compatible platforms. ONNX is an open source project jointly developed by Microsoft, Amazon, and Facebook, among others, that enables the exchange of AI models between different tools.

PyTorch vs Tensorflow

Duell of the Giants

Just like the Facebook solution, Tensorflow works with the tensor data type. PyTorch scores with its simplicity and effective memory usage. Tensorflow, on the other hand, is much more scalable and thus better suited for production models. An essential difference was originally that with PyTorch the graph structure is defined during execution, while with Tensorflow it is first defined and then executed. Here, however, Tensorflow has now followed with its own eager mode. However, this is not yet fully developed at this stage.

PyTorch vs TensorFlow - The figure shows the main differences between Google's Tensorflow and Facebook's PyTorch
Tensorflow vs PyTorch

PyTorch vs Tensorflow – Who is ahead now?

It remains an exciting head-to-head race. Despite its recent development history, PyTorch has already made up a lot of ground and is interesting in an entrepreneurial context precisely because of its user-friendliness. As is often the case, however, it is not a question of which solution will come out on top, but rather of the principle that competition stimulates business. In the end, competitive pressure leads to great new innovations and exciting new tools.

Apache Hive Architecture – Data Warehouse System for free

Apache Hive Architecture – On the way to Industry 4.0, companies are trying to record all business processes as far as possible in order to subsequently optimize them through analysis.
Data warehouse systems provide central data management. Thus, only one data truth exists. In addition to persistence, these information systems take care of sorting, preprocessing, translation and data analysis.
If you want to know more about what a data warehouse system is, check out our article on the subject.

What is Apache Hive

Hive is a data warehousing software project and part of Apache, an open source and free web server software. Learn more about Apache here.
It is built on the Big Data framework Apache Hadoop and was released in 2010. Since then it has been continuously improved and extended by an industrious community.

Apache Hive Architecture – Built on top of Hadoop

The query language used by Hive, called HiveQL, is SQL based and allows querying, aggregation and analysis of unstructured data. Hive does not work with the schema-on-write (SoW) approach like relational databases, but uses the so-called schema-on-read (SoR) approach.

What are the biggest advantages of Hive?

Data from relational databases is automatically converted into MapReduce or Tez or Spark jobs. Hadoopclusters are based on MapReduce, a Google programming model for concurrent computation on computer clusters, and powerful stream-based data analysis pipelines can be created with Apache Spark. This ensures full compatibility with the Apache ecosystem, which can be modularly tailored to the needs of an application.

The figure shows the main Apache Hive features
Apache Hive Features

Another advantage of Hive is that the tables are similar to the tables in a relational database. Data is queried using HiveQL. A declarative SQL-like language.
HiveQL allows multiple users to query data simultaneously. Hive supports a variety of data formats and provides a lightweight but powerful translation feature.
For data analysis, custom MapReduce processes can be written and run on clusters in parallel for high performance.

Apache Hive Architecture

Basically, the architecture of Hive can be divided into three core areas. Hive communicates with other applications via the client area. The integration is then executed via the service area. In the last layer, Hive stores the metadata, for example, or computes the data via Hadoop.

The figure shows the basic three-part core architecture of Apache Hive.
Apache Hive Architecture

Hive Clients

Apache Hive can be accessed via different clients. In addition to Open Database Connectivity (ODBC), an SQL-based application programming interface (API) created by Microsoft, there is Java Database Connectivity (JDBC), an SQL-based API developed by Sun Microsystems to allow Java applications to use SQL for database access. Hive also provides a high-performance Apache Thrift connection.

Hive Services

The core and central control of the Hive Services is the so-called driver. This
receives HiveQL commands and is responsible for their execution against the Hadoop system. It typically consists of a compiler that translates HiveQL requests into abstract syntax and executable tasks, an optimizer that aggregates, splits, and optimizes for better performance and scalability, and an executor that interacts with Hadoop’s job tracker and passes tasks to the system for execution.

Apache Hive also provides the ability to submit these tasks directly to the driver. Using the Command Line and User Interface (CLI + UI), it is possible to directly influence the process.

Metadata about persistent relational entities, i.e. databases, tables, columns and partitions are managed by the metastore.

Hive Storage and Computer

The metadata is stored here in a persistence. The results of the query and the data loaded into the tables are stored on HDFS in the Hadoop cluster.