== Open-source message broker software (a service that handles the distribution of messages) – initially released in 2007 – implements the Advanced Message Queuing Protocol (AMQP) – meanwhile also supports STOMP and MQTT
Advanced Message Queuing Protocol (AMQP)
== binary network protocol – independent of any programming language (sender and receiver do not have to understand the same programming language)
– There is a queue between the producer and the consumer of a message; messages are temporarily stored in this queue – messages can be instructions to other programs or actual text messages – the producer does not have to deliver the message itself and does not have to wait until the recipient has received it → asynchronous procedure
Stations in the message transmission
– Producer: creates messages – Exchange: part of RabbitMQ, forwards messages – Queue: part of RabbitMQ, stores messages – Consumer: processes the message
Message transmission process
– Producer publishes a message and gives it a routing key (== address) → passes it to the Exchange, which distributes the message to the different queues using the routing key
– There is a so-called binding between Exchange and Queue – it connects each individual queue to the Exchange – defines according to which criteria a message is forwarded
Direct Exchange == direct connection between sender and receiver – one queue + one consumer
Topic Exchange – addresses multiple queues via pattern matching on the routing key
Fanout Exchange == broadcast (distributes a message to all bound queues, without sorting)
Headers Exchange – corresponds to the Topic Exchange, but routing is done via header attributes instead of the routing key
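The producer → exchange → binding → queue path can be sketched as a toy in-memory model. This is not the real RabbitMQ API (a real client would use a library such as pika); the class and variable names here are purely illustrative.

```python
# Toy model of AMQP-style routing: a binding connects an exchange to a
# queue, and the exchange type decides which bound queues receive a
# published message. Not the real broker, just the routing idea.

class Exchange:
    def __init__(self, kind):
        self.kind = kind          # "direct" or "fanout"
        self.bindings = []        # list of (binding_key, queue) pairs

    def bind(self, queue, binding_key=""):
        self.bindings.append((binding_key, queue))

    def publish(self, message, routing_key=""):
        for binding_key, queue in self.bindings:
            # fanout: every bound queue; direct: only matching binding keys
            if self.kind == "fanout" or binding_key == routing_key:
                queue.append(message)

errors, all_logs = [], []
ex = Exchange("direct")
ex.bind(errors, "error")
ex.bind(all_logs, "info")
ex.publish("disk full", routing_key="error")  # only the "error" queue gets it
```

A fanout exchange would deliver the same message to both queues regardless of the routing key.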
== Splitting the data into several parts, which are stored separately
– Method to deal with very large amounts of data
– Partitions can be implemented as separate database objects or tables; some systems instead organize partitions inside the storage format (e.g. ORC stripes) or keep partition metadata in an index (e.g. PostgreSQL's BRIN)
Types of partitioning
Horizontal partitioning: the rows of a table are split up and stored in a distributed manner
Vertical partitioning: logically related attributes (columns) are stored separately
Range (e.g. ranges of timestamps)
List (e.g. text strings matched against explicit value lists)
Hash (a hash of the partition key determines the partition; the distribution is even, but not truly random)
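The three partitioning criteria can be sketched as plain functions that map a row's key to a partition name. This is only illustrative; real databases apply these rules declaratively, and the partition names and groupings below are made up.

```python
# Sketch: assigning rows to partitions by range, list, and hash criteria.
from datetime import date

def range_partition(ts):
    # range partitioning on a timestamp: one partition per year
    return f"p_{ts.year}"

def list_partition(country):
    # list partitioning: explicit value lists per partition
    groups = {"eu": {"DE", "FR", "IT"}, "na": {"US", "CA"}}
    for name, values in groups.items():
        if country in values:
            return name
    return "other"

def hash_partition(key, n_partitions=4):
    # hash partitioning: the hash of the key picks the partition,
    # spreading rows evenly (deterministic, not truly random)
    return hash(key) % n_partitions

p1 = range_partition(date(2021, 5, 1))   # "p_2021"
p2 = list_partition("DE")                # "eu"
p3 = hash_partition(10)                  # always the same partition for key 10
```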
Sorting can be avoided when partitions are referenced in order
e.g. SELECT … ORDER BY timestamp → scan the partitions in time order
Expressions are allowed as partition bounds, but they do not define a sort order, which cancels out this optimization
Fillfactor for a table
– is a percentage between 10 and 100.
– 100 (complete packing) is the default
– When a smaller fillfactor is specified, INSERT operations pack table pages only to the indicated percentage; the remaining space on each page is reserved for updating rows on that page
– For a table whose entries are never updated, complete packing is the best choice, but in heavily updated tables smaller fillfactors are appropriate.
Fillfactor for an index
– is a percentage that determines how full the index method will try to pack index pages.
– For B-trees, leaf pages are filled to this percentage during initial index build, and also when extending the index at the right (largest key values). If pages subsequently become completely full, they will be split, leading to gradual degradation in the index’s efficiency.
– The default fillfactor for B-tree indexes is 90, but for heavily updated tables a smaller fillfactor is better to minimize the need for page splits.
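The effect of fillfactor on packing can be shown with back-of-the-envelope arithmetic. The row size below is an illustrative assumption, and the calculation ignores page headers and item pointers, so the numbers are only approximate.

```python
# Rough sketch: how fillfactor limits the initial packing of an 8 kB page.
PAGE_SIZE = 8192  # PostgreSQL's default page size in bytes

def rows_packed_per_page(row_size, fillfactor):
    # only fillfactor% of the page is filled by INSERTs;
    # the rest stays reserved for future UPDATEs of rows on that page
    usable = PAGE_SIZE * fillfactor // 100
    return usable // row_size

full = rows_packed_per_page(100, 100)   # complete packing
loose = rows_packed_per_page(100, 70)   # 30% of each page kept free
```

With a fillfactor of 70, roughly 30% fewer rows are packed per page up front, trading storage for cheaper same-page updates.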
– keeps your tables and indexes bloat-free
– reclaims storage occupied by dead tuples
– In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done.
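The dead-tuple mechanism can be illustrated with a toy model: an update leaves the old row version behind as a dead tuple, and a vacuum pass reclaims it. This only mimics the idea, not PostgreSQL's actual storage layout.

```python
# Toy model of dead tuples and vacuum.
table = []  # each physical tuple: {"data": ..., "dead": bool}

def insert(value):
    table.append({"data": value, "dead": False})

def update(old_value, new_value):
    for tup in table:
        if not tup["dead"] and tup["data"] == old_value:
            tup["dead"] = True        # old version stays behind as a dead tuple
    insert(new_value)                 # new version is a fresh tuple

def vacuum():
    global table
    table = [t for t in table if not t["dead"]]  # reclaim dead tuples

insert("a"); insert("b")
update("a", "a2")
before = len(table)   # 3 physical tuples: a (dead), b, a2
vacuum()
after = len(table)    # 2 live tuples remain
```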
Heap-Only Tuple (HOT) updates avoid updating the index when a row is updated (possible when no indexed column changes and the new row version fits on the same page)
REINDEX == rebuilds an index using the data stored in the index's table, replacing the old copy of the index.
– Command Query Responsibility Segregation (CQRS) → separation of reads (queries) from state changes, which happen purely via commands
– Event Sourcing → state changes are stored as a sequence of events; starting from a base state, the current state can be reconstructed at any time by replaying them
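The replay idea can be sketched in a few lines. The event types and the bank-account domain are illustrative assumptions, not part of the pattern itself.

```python
# Minimal event-sourcing sketch: state changes are recorded as events,
# and the current state is rebuilt by replaying them over a base state.

def apply(state, event):
    kind, amount = event
    if kind == "deposit":
        return state + amount
    if kind == "withdraw":
        return state - amount
    return state

base_state = 0
events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]

def replay(base, log):
    state = base
    for e in log:
        state = apply(state, e)
    return state

current = replay(base_state, events)          # full history → current state
earlier = replay(base_state, events[:1])      # replay a prefix → past state
```

Because the event log is the source of truth, any past state can be reconstructed by replaying a prefix of the log.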
State == contains all status information (a data object) → not mutable, only readable
Actions == plain objects – are dispatched by the web components and evaluated by reducers – instead of mutating the state directly → the intended mutations are described by actions
Reducers == special functions – change the global state of the web application – return a new state object based on the type of an action – in large apps, the root reducer can be split into smaller reducers independently operating on different parts of the state tree
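The reducer pattern is usually shown in JavaScript (Redux); here is a Python sketch of the same idea, with made-up action types and state shape. Each reducer is a pure function returning a new state object instead of mutating the old one, and the root reducer delegates to smaller reducers per state slice.

```python
# Redux-style reducers in Python (illustrative names and action types).

def counter_reducer(state, action):
    if action["type"] == "INCREMENT":
        return state + 1          # new value, old state untouched
    return state

def user_reducer(state, action):
    if action["type"] == "SET_NAME":
        return {**state, "name": action["payload"]}  # new dict, no mutation
    return state

def root_reducer(state, action):
    # the root reducer is split into smaller reducers, each operating
    # independently on its own part of the state tree
    return {
        "counter": counter_reducer(state["counter"], action),
        "user": user_reducer(state["user"], action),
    }

state = {"counter": 0, "user": {"name": ""}}
state = root_reducer(state, {"type": "INCREMENT"})
state = root_reducer(state, {"type": "SET_NAME", "payload": "Ada"})
```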
== Approach to modeling complex software in a way that is robust (the software functions reliably even under unfavorable conditions), flexible (it can easily be adapted to changing requirements), and transparent
– one of the foundational concepts for microservice architectures – the focus of the software design is on the domain and its business logic – the design of complex domain-oriented contexts is based on a model of the application domain (== domain model) – not worthwhile for pure CRUD (Create, Read, Update, Delete) systems
Components of a domain
– Modules: technical components of the domain
– Entities: objects defined by their unique identity rather than by their (possibly changing) properties
– Domain events: special objects that record domain-relevant events and make them visible to other parts of the domain
– Service objects: business-relevant functionality that is important for several objects in the domain
– Value objects: objects uniquely defined by their properties, typically immutable
– Associations: relationships between objects of the model
– Aggregates: units of objects and their relations
– Factories: for complex construction scenarios, different creation patterns (mostly factory or builder patterns) can be used
– Repositories: clean separation of domain and data layer for abstracting the underlying storage system
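The entity/value-object distinction from the list above can be sketched in Python. The `Money` and `Customer` classes are illustrative assumptions, not canonical DDD code: a value object is immutable and compared by its values, while an entity is compared by its identity even when its properties change.

```python
# DDD building blocks sketch: value object vs. entity.
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    # value object: immutable, equality is defined by the values alone
    amount: int
    currency: str

@dataclass
class Customer:
    # entity: defined by its identity; properties may change over time
    customer_id: str
    name: str

    def __eq__(self, other):
        # same identity == same entity, regardless of current properties
        return isinstance(other, Customer) and self.customer_id == other.customer_id

a = Money(10, "EUR")
b = Money(10, "EUR")                 # equal by value
c1 = Customer("42", "Alice")
c2 = Customer("42", "Alice Smith")   # same entity despite different name
```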
== Architecture pattern of information technology, in which complex application software is composed of independent processes that communicate with each other using language-independent programming interfaces – Services are largely decoupled and perform small tasks
Microservices Core Features
– Independent deployability (development teams work within their own deployment pipeline, Continuous Integration/Continuous Delivery)
– Independent technology stacks (technology decisions such as programming language, frameworks, database, and operating system are up to the respective development team)
– Decentralized data management (each service manages its own data necessary for its functional scope)
– Loose coupling (microservices are executed separately in their own processes and are coupled via the network)
– Bounded Context (the functional scope of an application is cut into functionally delimitable contexts)
– Open source stream processor framework developed by the Apache Software Foundation (2016) – Data streams with high data volume can be processed and analyzed with low delay and high speed
– diverse, specialized APIs:
→ DataStream API (stream processing)
→ ProcessFunctions (control of state and time; event state can be saved and timers can be registered for future computations)
→ Table API
→ SQL API
→ a rich set of connectors to various storage systems such as Kafka, Kinesis, Kubernetes, YARN, HDFS, Elasticsearch, and JDBC database systems
→ REST API
== Data is processed continuously with a short delay → without intermediate storage of the data in separate databases – several data streams can be processed in parallel – each stream can be used to derive its own follow-up actions and analyses
Data can be processed as unbounded or bounded streams:
have a start but no defined end
must be continuously processed
have a defined start and end
can be processed by ingesting all data before performing any computations (== batch processing)
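The bounded/unbounded distinction can be sketched with plain Python iterators rather than Flink itself: a bounded stream can be fully ingested before computing (batch), while an unbounded stream must be processed incrementally because it never ends.

```python
# Sketch: batch processing of a bounded stream vs. incremental
# processing of an unbounded one (toy example, not the Flink API).
import itertools

def unbounded_counter():
    n = 0
    while True:          # a start, but no defined end
        yield n
        n += 1

# bounded: ingest all data first, then compute in one go
bounded = [1, 2, 3, 4]
batch_sum = sum(bounded)

# unbounded: maintain a continuously updated running result;
# here we artificially stop after 5 events to inspect it
running_sum = 0
for value in itertools.islice(unbounded_counter(), 5):
    running_sum += value
```

In a real stream processor, `running_sum` would be managed as operator state and updated as each event arrives.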
– Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager.
– In case of a failure, Flink replaces the failed container by requesting new resources.
– Stateful Flink applications are optimized for local state access
PyTorch BigGraph – A graph is a data structure that can clearly represent relationships between data objects as nodes and edges. In an industrial context, these structures can contain billions of nodes and edges.
So how can such multidimensional data relationships be accessed in a meaningful way? Graph embedding offers one possibility for dimension reduction: a family of algorithms that map the graph's relational structure into a vector space. These embedding methods usually run unsupervised. If two nodes have highly similar properties, their points should also be close to each other in the vector space.
The reduced feature information can then be further processed with additional machine learning algorithms.
What is PyTorch BigGraph?
With PyTorch BigGraph (PBG), Facebook offers an open-source library that can be used to create very performant graph embeddings for extremely large graphs.
It is a distributed system that can learn graph embeddings unsupervised for graphs with billions of nodes and trillions of edges. It was launched in 2019 and is written entirely in Python, which ensures compatibility with common Python data-processing libraries such as NumPy, pandas, and scikit-learn. All calculations are performed on the CPU, which should play a decisive role in hardware selection: a lot of memory is mandatory. Note also that PBG can process very large graphs performantly but is not optimized for small graphs, i.e. structures with fewer than 100,000 nodes.
Facebook thus extends the ecosystem of its popular Python scientific-computing package PyTorch with a very performant big-graph solution. If you want to know more about PyTorch, you should read this article from us. There we show you the most important features and compare it with the industry's top performer, Google TensorFlow.
Fundamental building blocks
PBG provides some basic building blocks to handle the complexity of the graph. Graph partitioning splits the graph into parts of equal size that can be processed in parallel. PBG also supports multithreaded computation: a process is divided into several threads, which run independently but can access the same memory. In addition to distributing tasks, PyTorch BigGraph can also distribute execution across hardware resources.
PyTorch BigGraph – how does the training work?
The PBG graph-processing algorithms can process the graph in parallel using the fundamental building blocks already described. This allows the training to run in a distributed manner and thus with high performance.
Once the nodes and edges are partitioned, the training can be performed for one bucket at a time.
The training runs unsupervised on an input graph by reading its edge list. A feature vector is then output for each entity. Here, neighboring entities in the vector space are placed close to each other, while unconnected entities are pushed apart. Thus, the dimensions are iteratively reduced. It is also possible to configure and optimize this calculation using parameters learned during training.
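The push/pull intuition behind this training can be sketched in a few lines of pure Python. This is a toy version of the idea only, not the PBG algorithm: embeddings start random, connected pairs are pulled together, and a sampled unconnected pair is pushed apart. All names, the learning rate, and the tiny graph are made-up illustrations.

```python
# Toy embedding training: attract connected nodes, repel an unconnected pair.
import random

random.seed(0)
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "d")]   # a~b and c~d are connected
emb = {n: [random.uniform(-1, 1), random.uniform(-1, 1)] for n in nodes}

def step(u, v, sign, lr=0.05):
    # sign=+1 pulls the pair together, sign=-1 pushes it apart
    for i in range(2):
        emb[u][i] += sign * lr * (emb[v][i] - emb[u][i])
        emb[v][i] += sign * lr * (emb[u][i] - emb[v][i])

for _ in range(50):
    for u, v in edges:
        step(u, v, +1)             # neighbors move close together
    step("a", "c", -1)             # an unconnected pair is pushed apart

def dist(u, v):
    return sum((emb[u][i] - emb[v][i]) ** 2 for i in range(2)) ** 0.5
```

After training, connected pairs end up closer to each other than the repelled unconnected pair, which is exactly the geometric property embeddings are meant to have.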
PBG and Machine Learning
The graph is a very information-rich but so far unfortunately much-neglected data structure. With tools like PBG, even very large structures become tractable thanks to high parallelism.
A very interesting concept is the use of PBG for machine learning on large graph structures. Here, graph structures with nodes, edges, and properties could be used for semantic queries, representing and storing data in place of a labeled data structure. Certain relations can be derived from the connections between the nodes. With PBG, the graph can be processed in a massively parallel way: individual machines can train a model in parallel on different buckets, coordinated via a lock server.
In times of Big Data, the graph has become a popular data structure due to its flexible and clear relationship-based structure. Entire database systems are now designed according to the graph principle; for more on this, read our article on NoSQL databases. Libraries like PyGraph allow you to perform fast queries and optimized graph manipulations. With its pure-Python implementation, PyGraph offers you a user-friendly and powerful tool.
What is a graph?
In a graph, objects are represented according to their relationships with each other. The objects are called vertices (or nodes) and the relations are called edges of the graph. An edge always connects exactly two vertices. Graphs are often used to represent traffic networks, entity-relationship diagrams, syntax trees for programming languages, finite automata, and proof or decision trees.
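A common way to represent such a graph in code is an adjacency list, here sketched in plain Python (independent of any specific graph library):

```python
# Undirected graph as an adjacency list: each vertex maps to the set of
# vertices it shares an edge with; both directions are stored.
from collections import defaultdict

def build_undirected(edge_list):
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)   # an edge always connects exactly two vertices
        adj[v].add(u)
    return adj

g = build_undirected([("A", "B"), ("B", "C")])
```

For a directed graph, only `adj[u].add(v)` would be stored, so the edge can be used in one direction only.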
PyGraph supports different graph types
Fundamentally, a distinction is made between directed and undirected graphs. If a graph is directed, its edges may only be traversed in one direction; these edges are also called directed edges. If it is undirected, there are no directional constraints, so each edge connects an unordered pair of vertices. In the following figure we have contrasted both categories.
You can use PyGraph regardless of these properties, because both types are supported.
PyGraph supports several algorithms
PyGraph supports the use of many well-known graph operations. For example, searching or traversing a graph, where all nodes of a graph must be visited, can be done in different ways. In the Depth-First Search (DFS) search algorithm, for example, the successors of a successor of the current node are visited first and only then the neighbors of the current node.
The depth of the search can also be limited accordingly. Breadth-First Search (BFS), on the other hand, first visits its own neighboring nodes and only then the successors of the neighboring nodes.
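The two traversal orders can be sketched side by side on a small adjacency-list graph (a generic sketch, not PyGraph's own API): DFS descends to a successor's successors first, while BFS visits all neighbours before going deeper.

```python
# DFS vs. BFS on a tiny graph.
from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def dfs(start):
    visited, stack, order = set(), [start], []
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            order.append(node)
            stack.extend(reversed(graph[node]))  # keep left-to-right order
    return order

def bfs(start):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order
```

From "A", DFS reaches the deeper node "D" before the sibling "C", while BFS visits both neighbours "B" and "C" first.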
In addition to the algorithm-based search of a graph, other operations can be performed with PyGraph, such as the calculation of minimum spanning trees. Such a tree connects all nodes of a weighted graph at the lowest possible total edge weight. In the following figure we have shown you all currently supported algorithms.
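The minimum-spanning-tree computation can be sketched with Kruskal's algorithm (a generic implementation, not PyGraph's own code; the example graph and weights are made up): repeatedly take the cheapest edge that does not close a cycle until all nodes are connected.

```python
# Kruskal's algorithm with a simple union-find over node names.

def kruskal(nodes, edges):
    parent = {n: n for n in nodes}

    def find(n):
        # follow parents up to the representative of n's component
        while parent[n] != n:
            n = parent[n]
        return n

    mst, total = [], 0
    for w, u, v in sorted(edges):        # edges as (weight, u, v), cheapest first
        ru, rv = find(u), find(v)
        if ru != rv:                     # skip edges that would form a cycle
            parent[ru] = rv              # merge the two components
            mst.append((u, v, w))
            total += w
    return mst, total

edges = [(1, "A", "B"), (4, "A", "C"), (2, "B", "C"), (5, "C", "D")]
mst, total = kruskal(["A", "B", "C", "D"], edges)
```

For n nodes, the resulting tree always has n−1 edges; here the edge A–C (weight 4) is skipped because A and C are already connected via B.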
== Open-source Python deep-learning library – published in 2015 – code is hosted on GitHub – originally a uniform interface for various backend libraries (TensorFlow, Microsoft Cognitive Toolkit, Theano, PlaidML) – focuses on being user-friendly, modular, and extensible, enabling fast and easy prototyping of neural networks – part of the TensorFlow core API, but was also continued independently – since version 2.4, Keras refers directly to the TensorFlow 2 implementation – contains numerous implementations of commonly used neural-network building blocks (layers, activation functions, objectives, optimizers, and tools to make working with image and text data easier)
– supports standard, convolutional, and recurrent neural networks – supports common utility layers (dropout, batch normalization, pooling) – supports multi-input and multi-output training – the modular design allows the creation of new models by combining cost functions, activation functions, or initialization schemes – enables deep-learning models on iOS and Android, on the web, on the Java Virtual Machine (via the DL4J model import from Skymind), on clusters of graphics processing units (GPUs) and tensor processing units (TPUs), on Google Cloud with TensorFlow Serving, and on the Raspberry Pi