EXPERT KNOWLEDGE AT A GLANCE

Month: September 2020

Kafka alternative – RabbitMQ

Overview

== Open source message broker software (a service that handles the distribution of messages)
– released in 2007
– implements the Advanced Message Queuing Protocol (AMQP)
– now also supports STOMP and MQTT

Advanced Message Queuing Protocol (AMQP)

== binary network protocol
– independent of any programming language (sender and receiver do not have to be written in the same language)

AMQP

Principle

– There is a queue between the producer and the consumer of a message; messages are temporarily stored in this queue
– Messages == can be instructions to other programs or actual text messages
– The producer does not have to deliver the message itself and does not have to wait until the recipient has received it
→ asynchronous procedure

Stations in the message transmission

Producer: creates messages
Exchange: part of RabbitMQ, forwards messages
Queue: part of RabbitMQ, stores messages
Consumer: processes the message

Message transmission process

– The Producer publishes a message and gives it a routing key (== address)
→ passes it to the Exchange
→ the Exchange distributes the messages to the different queues using the routing key

Binding

– There is a so-called binding between Exchange and Queue
– it connects each individual queue to the Exchange
– and defines the criteria by which a message is forwarded

Direct Exchange

== direct connection between sender and receiver (the routing key must match the queue's binding key exactly)
– typically one queue + one consumer

Topic Exchange

– addresses multiple queues via wildcard patterns in the routing key

Fanout Exchange

== broadcast (distribution of a message to all bound queues, without sorting by routing key)

Header Exchange

– corresponds to the Topic Exchange
– but routing is done via message header attributes instead of the routing key
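
A minimal sketch of the stations above, using the Python client pika (assuming pip install pika and a local RabbitMQ broker); the exchange, queue, routing-key, and message contents are invented for illustration:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Producer side: declare a direct exchange and a queue.
channel.exchange_declare(exchange="orders", exchange_type="direct")
channel.queue_declare(queue="invoices")
# The binding: connects the queue to the exchange under a routing key.
channel.queue_bind(queue="invoices", exchange="orders",
                   routing_key="invoice.created")
# Publish a message addressed by its routing key.
channel.basic_publish(exchange="orders", routing_key="invoice.created",
                      body=b"order #42 confirmed")

# Consumer side: process messages arriving in the queue.
def on_message(ch, method, properties, body):
    print("received:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="invoices", on_message_callback=on_message)
# channel.start_consuming()  # blocks; commented out for this sketch
connection.close()
```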

Redux is Awesome

Overview

== Open source JavaScript library
– released in 2015
– state container for JavaScript web applications
– commonly used together with React or Angular
– all state information is stored centrally and is therefore accessible to all components of the web application

Patterns

– Command Query Responsibility Segregation (CQRS)
→ state changes are separated from reads and expressed purely via commands

– Event Sourcing
→ sequences of commands/events are applied to the state and can be replayed at any time from the base state

Core Components


Store

– contains the entire state of the application (a single data object)
→ not directly mutable, only readable

Action

– plain objects
– are dispatched by the web components and evaluated by reducers
– instead of mutating the state directly → the desired mutations are described with Actions

Reducer

– pure function
– computes the new global state of the web application
– returns a new state object based on the type of an action
– In large apps, the root reducer can be split into smaller reducers that operate independently on different parts of the state tree (the pattern is sketched below)
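
Redux itself is a JavaScript library; to keep all code in this issue in one language, the following Python sketch merely mirrors the store/action/reducer pattern described above. The counter example and all names are illustrative:

```python
def counter_reducer(state, action):
    """Pure function: returns a new state object, never mutates the old one."""
    if action["type"] == "INCREMENT":
        return {**state, "count": state["count"] + 1}
    if action["type"] == "DECREMENT":
        return {**state, "count": state["count"] - 1}
    return state  # unknown actions leave the state unchanged

class Store:
    def __init__(self, reducer, initial_state):
        self._reducer = reducer
        self._state = initial_state

    def get_state(self):
        return self._state  # read-only access for components

    def dispatch(self, action):
        # The only way to change the state: describe the change as an action.
        self._state = self._reducer(self._state, action)

store = Store(counter_reducer, {"count": 0})
store.dispatch({"type": "INCREMENT"})
print(store.get_state())  # {'count': 1}
```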

What is meant by Domain Driven Design?

Overview

== an approach to modeling complex software in a robust (functions reliably even under unfavorable conditions), flexible (can easily be adapted to changing requirements), and transparent way

– one of the conceptual foundations of microservices architectures
– The focus of the software design is on the domain and its business logic.
– The design of complex domain-oriented contexts is based on a model of the application domain (== domain model)
– not worthwhile for simple CRUD (Create, Read, Update, Delete) systems

Components of a domain

Modules: technical components of the domain
Entities: objects defined by their unique identity, whose properties may change (see the sketch after this list)
Domain events: special objects that register domain-relevant events and make them visible to other parts of the domain
Service objects: business-relevant functionality that is important for several objects in the domain
Value objects: objects that are uniquely defined by their properties and typically remain immutable
Associations: relationships between objects of the model
Aggregates: units of objects and their relationships
Factories: for complex construction scenarios, different creation patterns (mostly factory or builder patterns) can be used
Repositories: cleanly separate the domain layer from the data layer to abstract the persistence system
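
A hedged Python sketch of two of these building blocks, entities and value objects; the Customer/Address domain is invented for illustration:

```python
from dataclasses import dataclass, field
import uuid

@dataclass(frozen=True)
class Address:
    """Value object: defined purely by its properties, immutable."""
    street: str
    city: str

@dataclass
class Customer:
    """Entity: defined by its unique identity; its properties may change."""
    name: str
    address: Address
    id: uuid.UUID = field(default_factory=uuid.uuid4)

home = Address("Main St 1", "Berlin")
alice = Customer("Alice", home)
# The entity's state changes, but its identity stays the same.
alice.address = Address("Oak St 7", "Hamburg")
```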

Techniques

Techniques and approaches for implementing the domain model

Microservices

== architecture pattern in which complex application software is composed of independent processes that communicate with each other via language-independent programming interfaces
– services are largely decoupled and each performs a small task

Microservices Core Features

Independent deployability (each development team works within its own deployment pipeline (Continuous Integration/Continuous Delivery))
Independent technology stacks (the technology decisions (programming language, frameworks, database, operating system, …) are up to the respective development team)
Decentralized data management (each service manages its own data necessary for its functional scope)
Loose coupling (microservices are executed separately in their own processes and are coupled via the network)
Bounded Context (the functional scope of an application is cut into functionally delimitable contexts (Bounded Contexts); a minimal sketch of one such service follows below)
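
A minimal sketch of a single microservice in its own bounded context; Flask is an arbitrary stack choice here (per the independent-technology-stacks point above), and the order domain is invented:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Decentralized data management: this service alone owns its order data.
ORDERS = {1: {"id": 1, "status": "shipped"}}

@app.route("/orders/<int:order_id>")
def get_order(order_id):
    # Language-independent HTTP/JSON interface to the outside world.
    order = ORDERS.get(order_id)
    return (jsonify(order), 200) if order else (jsonify(error="not found"), 404)

if __name__ == "__main__":
    # Runs in its own process; other services couple to it only via the network.
    app.run(port=5000)
```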

Apache Flink

Overview

– Open source stream-processing framework developed under the Apache Software Foundation (version 1.0 released in 2016)
– data streams with high data volume can be processed and analyzed with low delay and high speed

Flink provides various tools for efficient real-time processing of continuous data streams and batch data

Core functions

– diverse, specialized APIs:
→ DataStream API (stream processing; a minimal sketch follows below)
→ ProcessFunctions (control of state and time; event state can be saved and timers can be registered for future computations)
→ Table API
→ SQL API
→ a rich set of connectors to storage systems such as Kafka, Kinesis, HDFS, Elasticsearch, and JDBC database systems (deployment is possible on resource managers such as YARN and Kubernetes)
→ REST API
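
A hedged sketch of the DataStream API, assuming PyFlink (pip install apache-flink) is available; the word-count pipeline and the in-memory source are illustrative, and API details can vary between Flink versions:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)  # Flink derives its resource needs from the parallelism

# In production the source would be a connector (e.g. Kafka); a small
# in-memory collection keeps the sketch self-contained.
stream = env.from_collection(["flink", "kafka", "flink"])

counts = (stream
          .map(lambda word: (word, 1))
          .key_by(lambda pair: pair[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()
env.execute("word_count_sketch")
```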

Stream Processing

How to handle this flood of data?

== data is processed continuously with a short delay
→ without intermediate storage of the data in separate databases
– several data streams can be processed in parallel
– from each stream, separate follow-up actions and analyses can be derived

Architecture

Data can be processed as unbounded or bounded streams:

  • Unbounded streams

    • have a start but no defined end

    • must be processed continuously

  • Bounded streams

    • have a defined start and end

    • can be processed by ingesting all data before performing any computations (== batch processing)

– Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.

– In case of a failure, Flink replaces the failed container by requesting new resources.

– Stateful Flink applications are optimized for local state access

PyTorch BigGraph (PBG) – Facebook's free open source library for embedding large graphs

A graph is a data structure that can clearly represent relationships between data objects as nodes and edges.
In an industrial context, these structures can contain billions of nodes and edges.

Typical Graph structure

So how can these multidimensional data relationships be accessed in a meaningful way?
Graph embedding offers one possibility for dimension reduction.
It is a sequence of different algorithms with the goal of reducing the graph's property relations to vector spaces. These embedding methods usually run unsupervised.
If two entities have a large property similarity, their points should also be close to each other in the vector space.

The reduced feature information can then be further processed with additional machine learning algorithms.
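
To illustrate what "close to each other in the vector space" means, here is a toy NumPy example; the vectors are invented for illustration and not produced by any embedding algorithm:

```python
import numpy as np

# Invented 3-dimensional embeddings for three entities.
king = np.array([0.90, 0.10, 0.40])
queen = np.array([0.85, 0.15, 0.42])
apple = np.array([-0.30, 0.80, 0.10])

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(king, queen))  # close to 1: related entities
print(cosine_similarity(king, apple))  # much smaller: unrelated entities
```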

What is PyTorch BigGraph?

Facebook offers PyTorch BigGraph, an open source library that can be used to create high-performance graph embeddings for extremely large graphs.

The figure shows the main principle of PyTorch BigGraph graph embedding.
PyTorch BigGraph Principle

It is a distributed system that can learn graph embeddings unsupervised for graphs with billions of nodes and trillions of edges. It was launched in 2019 and is written entirely in Python. This ensures full compatibility with common Python data-processing libraries such as NumPy, Pandas, and scikit-learn.
All calculations are performed on the CPU, which should play a decisive role in hardware selection: plenty of memory is mandatory. Note also that PBG can process very large graphs with high performance, but is not optimized for small graphs, i.e. structures with fewer than 100,000 nodes.

Facebook thus extends the ecosystem of its popular Python scientific computing package PyTorch with a high-performance big-graph solution. If you want to know more about PyTorch, you should read our article on it, where we show the most important features and compare it with the industry's top performer, Google TensorFlow.

Fundamental building blocks

PBG provides some basic building blocks to handle the complexity of the graph. Graph partitioning splits the graph into equal parts that can be processed in parallel. PBG also supports multithreaded computation: a process is divided into several threads that run independently but can access the same memory. In addition to distributing tasks, PyTorch BigGraph can also make intelligent use of hardware resources through distributed execution. A hedged sketch of the bucket idea follows below.
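
This is not PBG's actual implementation; the hash-based partitioning below is an illustrative assumption to show how edges fall into independently processable buckets:

```python
from collections import defaultdict

P = 4  # number of node partitions
edges = [("a", "b"), ("a", "c"), ("d", "e"), ("b", "e")]

def part(node):
    # Illustrative partition function; PBG uses its own partitioning scheme.
    return hash(node) % P

# Every edge falls into a (source partition, destination partition) bucket.
buckets = defaultdict(list)
for src, dst in edges:
    buckets[(part(src), part(dst))].append((src, dst))

for bucket, bucket_edges in buckets.items():
    print(bucket, bucket_edges)  # each bucket could be trained in parallel
```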

PyTorch BigGraph – How does the training work?

The PBG graph-processing algorithms can process the graph in parallel using the fundamental building blocks described above. This allows the training to run in a distributed manner and thus with high performance.

Once the nodes and edges are partitioned, the training can be performed for one bucket at a time.

The figure schematically shows the parallel training of PyTorch BigGraph, which is enabled by graph partitioning.
PyTorch BigGraph – Parallel Training through Graph Partitioning

The training runs unsupervised on an input graph by reading its edge list.
A feature vector is then output for each entity. Neighboring entities are placed close to each other in the vector space, while unconnected entities are pushed apart. In this way, the dimensions are iteratively reduced. The calculation can also be configured and optimized via parameters learned during training.
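
For orientation, a PBG training run is driven by a configuration function; the sketch below is loosely modeled on the torchbiggraph example configs, and all paths, names, and values are assumptions that may differ between PBG versions:

```python
# Hedged sketch of a PBG-style configuration; all paths and values
# are illustrative assumptions, not a verbatim torchbiggraph config.
def get_torchbiggraph_config():
    return dict(
        # Where preprocessed entities and edge lists live (illustrative paths)
        entity_path="data/example",
        edge_paths=["data/example/train_partitioned"],
        checkpoint_path="model/example",
        # One entity type split into 2 partitions -> 2 x 2 = 4 edge buckets
        entities={"all": {"num_partitions": 2}},
        relations=[{"name": "all_edges", "lhs": "all", "rhs": "all",
                    "operator": "none"}],
        dimension=100,   # size of the learned feature vector per entity
        num_epochs=10,
    )
```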

PBG and Machine Learning

The graph is a very information-rich but so far unfortunately much-neglected data structure. With tools like PBG, the sheer size of such structures is increasingly offset by high parallelism.

A very interesting concept is the use of PBG for machine learning on large graph structures. Here, graph structures with nodes, edges, and properties could be used for semantic queries to represent and store data, and could replace a labeled data structure. Certain relations can be derived from the connections between the nodes. With PBG, the graph can be processed in a massively parallel way. This would allow individual machines to train a model in parallel on different buckets, coordinated by a lock server.

PyGraph – A Great Open Source Graph Manipulation Library in Python

In times of Big Data, the graph has become a popular data structure due to its flexible, clear, relationship-based structure. Even entire database systems are now designed according to the graph principle; for more on this, read our article on NoSQL databases. Libraries like PyGraph allow you to perform fast queries and optimized graph manipulations. With its pure Python implementation, it offers a user-friendly and powerful tool.

What is a graph?

In a graph, objects are represented according to their relationships with each other. The objects are called vertices (nodes) and the relationships are called edges of the graph. An edge always connects exactly two nodes.
Graphs are often used to represent traffic networks, entity-relationship diagrams, syntax trees for programming languages, finite automata, and proof or decision trees.

Schematic representation of a graph structure and its components

PyGraph supports different graph types

Fundamentally, a distinction must be made between directed and undirected graphs.
If a graph is directed, its edges may only be traversed in one direction; these are called directed edges. If it is undirected, there are no directional constraints, so each edge connects an unordered pair of vertices. The following figure contrasts both categories.

Comparison of undirected and directed graphs

You can use PyGraph regardless of these properties, because both types are supported.
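
A tiny plain-Python sketch (independent of PyGraph's own API) of how the two categories are typically represented as adjacency structures; nodes and edges are invented:

```python
# Undirected: every edge is stored in both directions (A—B, B—C).
undirected = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
# Directed: each edge may only be used in one direction (A→B, B→C).
directed = {"A": {"B"}, "B": {"C"}, "C": set()}
```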

PyGraph supports several algorithms

PyGraph supports many well-known graph operations. For example, searching or traversing a graph, where all nodes of the graph must be visited, can be done in different ways. In the Depth-First Search (DFS) algorithm, the successors of a successor of the current node are visited first, and only then the remaining neighbors of the current node.

The depth of the search can also be limited accordingly. Breadth-First Search (BFS), on the other hand, first visits the current node's own neighbors and only then the successors of those neighbors. Both traversals are sketched below.
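
A plain-Python sketch of the two traversals (not PyGraph's own API; the graph and node names are invented):

```python
from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def dfs(start):
    visited, stack = [], [start]
    while stack:
        node = stack.pop()           # LIFO: go deep before going wide
        if node not in visited:
            visited.append(node)
            stack.extend(reversed(graph[node]))
    return visited

def bfs(start):
    visited, queue = [], deque([start])
    while queue:
        node = queue.popleft()       # FIFO: finish one level before the next
        if node not in visited:
            visited.append(node)
            queue.extend(graph[node])
    return visited

print(dfs("A"))  # ['A', 'B', 'D', 'C']
print(bfs("A"))  # ['A', 'B', 'C', 'D']
```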


In addition to algorithm-based search, other operations can be performed with PyGraph, such as the calculation of minimum spanning trees. Such a tree connects all nodes of a weighted graph with the minimum possible total edge weight. The following figure shows all currently supported algorithms.

All algorithms currently supported

Keras vs TensorFlow

Overview

== Open source Python deep learning library
– released in 2015
– code is hosted on GitHub
– originally a uniform interface to various backend libraries (TensorFlow, Microsoft Cognitive Toolkit, Theano, PlaidML)
– focuses on being user-friendly, modular, and extensible, enabling fast and easy prototyping of neural networks
– part of the TensorFlow core API, but was also continued independently
– since version 2.4, Keras refers directly to the implementation of TensorFlow 2
– contains numerous implementations of commonly used neural-network building blocks (layers, activation functions, objectives, optimizers, and tools to make working with image and text data easier)


Features

– supports standard, convolutional, and recurrent neural networks
– supports common utility layers (dropout, batch normalization, pooling)
– supports multi-input and multi-output training
– modular design allows the creation of new models by combining cost functions, activation functions, or initialization schemes
– enables deep learning models on iOS and Android, on the web, on the Java Virtual Machine with the DL4J model import from Skymind, on clusters of graphics processing units (GPUs) and tensor processing units (TPUs), on Google Cloud with TensorFlow Serving, and on the Raspberry Pi (a minimal model definition follows below)
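
A minimal Keras sketch (tf.keras, TensorFlow 2.x) of a small feed-forward classifier; the layer sizes, input shape, and the commented-out training call are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),   # one of the utility layers named above
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5)  # with real training data
```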

Keras vs TensorFlow

(Figure: Keras vs TensorFlow comparison)