EXPERT KNOWLEDGE AT A GLANCE

Category: Machine learning (Page 3 of 3)

SciPy turns Python into an ingenious free MATLAB alternative

Python vs MATLAB

== Open source Python library
– a collection of mathematical algorithms and convenience functions

– is mainly used by scientists, analysts and engineers for scientific computing, visualization and related activities

– Initial Realease: 2006; Stable Release: 2020
– depends on the NumPy module
→ basic data structure used by SciPy is a N-dimensional array provided by NumPy

Benefits

scipy benefits

Features

– SciPy library provides many user-friendly and efficient numerical routines:

scipy subpackages

Available sub-packages

SciPy ecosystem

– scienitific computing in Python builds upon a small core of open-source software for mathematics, science and engineering

scipy ecosystem
SciPy Core Software

More relevant Packages

– the SciPy ecosystem includes, based on the core properties, other specialized tools

scipy eco sidepackages

The product and further information can be found here:

https://www.scipy.org/

Apache Mahout – A Powerful Open Source Machine Learning Project

Apache Mahout is a powerful machine learning tool that comes with a seamless compatibility to the strong big data management frameworks from the Apache universe. In this article, we will explain the functionalities and show you the possibilities that the Apache environment offers.

What is Machine Learning?

Machine learning algorithms provide lots of tools for analyzing large unknown data sets.
The art of data science is to extract the maximum amount of information depending on the data set by using the right method. Are there patterns in the high-dimensional data relationships, and how can they be represented in a low-dimensional way without much loss of information?

scikitLearn ml
Fields of machine learning


There is often a similar amount of information in the failure as when an algorithm was able to successfully create groupings.
It is important to understand the mathematical approaches behind the tools in order to draw conclusions about why an algorithm did not work.
If you don’t know the basic machine learning categories, it’s best to read our article on the subject first.

Machine Learning and Linear Algebra

Most machine learning methods are based on linear algebra.
This mathematical subfield deals with linear transformations, vector spaces and linear mappings between them.
The knowledge of the regularities is the key to the correct understanding of machine learning algorithms.

What is Apache Mahout

Apache Mahout is an open source machine learning project that builds implementations of scalable machine learning algorithms with a focus on linear algebra. If you’re not sure what Apache is, check out this article. Here we introduce you to the project and its main projects once.


Mahout was already released in 2009 and since then it is constantly extended and kept up-to-date by a very active community.
Originally, it contained scalable algorithms closely related to Apache Hadoop and MapReduce.
However, Mahout has since evolved into a backend independent environment. That is, it operates on non-Hadoop clusters or single nodes.

Features

The math library is based on Scala and provides an R-like Domain Specific Language (DSL). Mahout is usable for Big Data applications and statistical computing. The figure below lists all machine learning algorithms currently offered by Mahout.

The figure below lists all machine learning algorithms currently offered by Apache Mahout.
Implemented mathematical functions and algorithms

The algorithms are scalable and cover both supervised and unsupervised machine learning methods, such as clustering algorithms.

Apache Mahout covers a large part of the usual machine learning tools. This means that data can be analyzed without having to change frameworks. This is a big plus for maintaining compatibility in the application.

Apache Ecosystem

The framework integrates seamlessly into the Apache Ecosystem. This means that an application can access the entire power of the data processing platforms and build very high-performance big data pipelines. The following figure shows the Apache data management ecosystem.

Apache Mahout ecosystem
Apache Mahout ecosystem

Through connectivity to Apache Flink, stream data analysis pipelines can be built, or with Hive data from relational databases can be automatically converted into MapReduce or Tez or Spark jobs.

PyTorch BigGraph (PBG) – Facebook’s open source library for process embedding on large graphs for free

PyTorch BigGraph – The graph is a data structure that can be used to clearly represent relationships between data objects as nodes and edges.
These structures can contain billions of nodes and edges in an industrial context.

pygraph graph
Typical Graph structure

So how can the multidimensional data relationships be accessed in a meaningful way?
Graph embedding offers one possibility for dimension reduction.
This is a sequence of different algorithms with the goal of reducing the graph’s property relations to vector spaces. These embedding methods usually run unsupervised.
If there is a large property similarity, two points should also be close to each other in the vector space.

The reduced feature information can then be further processed with additional machine learning algorithms.

What is PyTorch BigGraph?

Facebook offers PyTorch BigGraph, an open source library that can be used to create very performant graph embeddings for extremely large graphs.

The figure shows the main principle of PyTorch BigGraph graph embedding.
PyTorch BigGraph Principle

It is a distributed system that can unsupervised learn graph embeddings for graphs with billions of nodes and trillions of edges. It was launched in 2019 and is written entirely in Python. This ensures absolute compatibility with common Python data processing libraries, such as NumPy, Pandas, and scikit-learn.
All calculations are performed on the CPU, which should play a decisive role in the hardware selection. A lot of memory is mandatory. It should also be noted that PBG can process very performant large graphs, but is not optimized for small graphs, i.e. structures with less than 100.000 nodes.

Facebook extends the ecosystem of its popular Python scientific computing package PyTorch with a very performant Big Graph solution. If you want to know more about PyTorch, you should read this article from us. Here we will show you the most important features and compare it with the industry’s top performer Google Tensorflow.

Fundamental building blocks

PGB provides some basic building blocks to handle the complexity of the graph. The graph partitioning splits the graph into equal parts and can be processed in parallel. PGB also supports multithreading computations. A process is divided into several threads, which run independently, but can access the same memory. In addition to the distribution of tasks, PyTorch BigGraph can also be used intelligently by distributed execution of hardware resources.

PyTorch BigGraph- How does the training work?

The PGB graph processing algorithms can process the graph in parallel using the fundamental building blocks already described. This allows the training mechanisms to run in a distributed manner and thus with high performance.

Once the nodes and edges are partitioned, the training can be performed for one bucket at a time.

The figure shows schematically the parallel training of PyTorch BigGraph which is enabled by graph partitioning.
PyTorch BigGraph – Parallel Training through Graph Partitioning

The training runs unsupervised on an input graph by reading its edge list.
A feature vector is then output for each entity. Here, neighboring entities in the vector space are placed close to each other, while unconnected entities are pushed apart. Thus, the dimensions are iteratively reduced. It is also possible to configure and optimize this calculation using parameters learned during training.

PGB and Machine Learning

The graph structure is a very information-rich and so far unfortunately too much neglected data structure. With tools like PGB the large structure is more and more equalized by high parallelism.

A very interesting concept is the use of PGB in machine learning large graph structures. Here, the graph structures could be used for semantic queries with nodes, edges and properties to represent and store data and could replace a labeled data structure. Through the connections between the nodes certain relations can be derived. By PGB the graph can be processed enormously parallelized. This would allow individual machines to train a model in parallel with different buckets, using a lock server.

PyGraph – A Great Open Source Graph Manipulation Library in Python

In times of Big Data, the graph has become a popular data structure due to its flexible and clear relationship-based structure. Even entire database systems are now designed according to the graph principle. For more on this, read our article on NoSQL databases. Libraries, like PyGraph, allow you to perform fast queries and optimized graph manipulations. With its full Python implementation, it offers you a user-friendly and powerful tool.

What is a graph?

In a graph, objects are represented according to their relationships with each other. The objects are called vertices and the relations are called edges of the graph. An edge always connects exactly two nodes.
Graphs are often used to represent traffic networks, entity-relationship diagrams, syntax trees for programming languages, finite automata and proof or decision trees.

PyGraph - Schematic representation of a graph structure and its components
Schematic representation of a graph structure and its components

PyGraph supports different graph types

Basically, graphs must be differentiated between directed and undirected.
If a graph is directed, the edges may only be used in one direction. These edges are also called directed edges. If it is undirected, there are no directional constraints. So each edge is connected to an undirected pair of vertices. In the following figure we have contrasted both categories.

Schematic comparison of undirected and directed graphs
Comparison of undirected and directed graphs

You can use PyGraph regardless of these properties, because both types are supported.

PyGraph supports several algorithms

PyGraph supports the use of many well-known graph operations. For example, searching or traversing a graph, where all nodes of a graph must be visited, can be done in different ways. In the Depth-First Search (DFS) search algorithm, for example, the successors of a successor of the current node are visited first and only then the neighbors of the current node.

The depth of the search can also be limited accordingly. Breadth-First Search (BFS), on the other hand, first visits its own neighboring nodes and only then the successors of the neighboring nodes.


In addition to the algorithm-based search of a graph, other operations can be performed with PyGraph, such as the calculation of minimum spanning trees. This tree describes the best possible path to traverse all available nodes in a weighted graph. In the following figure we have shown you all currently supported algorithms.

Representation of all algorithms currently supported by PyGraph
All algorithms currently supported
Newer posts »