– LSMs take the temporal aspect of the input into account
– large accumulation of recurrent interacting nodes
→ is stimulated by the input layer
– Liquid itself is not trained, but randomly constructed with the help of heuristics
– Loops cause a short-term memory effect
– preferably a Spiking Neural Network (SNN)
→ SNNs are closer to biological neural networks than the multilayer perceptron
→ can be any type of network that has sufficient internal dynamics
→ will be extracted by the readout function
– depend on the input streams they have been presented with
– converts the high-dimensional state into the output
– since the readout function is separated from the liquid, several readout functions can be used with the same liquid
→ so different tasks can be performed with the same input
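The split between a fixed random liquid and several trainable readouts can be sketched in a few lines. This is only a minimal sketch under stated assumptions: it uses a rate-based tanh network (in the style of an echo state network) instead of a spiking network, and the names `run_liquid` and `train_readout`, the liquid size, the washout length and the ridge parameter are illustrative choices, not part of the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_liquid, n_steps, washout = 50, 300, 20

# The liquid: random, fixed weights that are never trained.
W_in = rng.uniform(-0.5, 0.5, size=n_liquid)
W = rng.normal(size=(n_liquid, n_liquid))
# Heuristic (assumption): rescale to spectral radius 0.9 so that the
# recurrent loops give a fading short-term memory instead of blowing up.
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))


def run_liquid(u):
    """Drive the liquid with the input sequence u and record every state."""
    x = np.zeros(n_liquid)
    states = np.empty((len(u), n_liquid))
    for t, u_t in enumerate(u):
        x = np.tanh(W_in * u_t + W @ x)
        states[t] = x
    return states


def train_readout(states, target, ridge=1e-6):
    """Fit a linear readout on the liquid states with ridge regression."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ target)


u = np.sin(np.linspace(0, 12 * np.pi, n_steps))
states = run_liquid(u)[washout:]          # drop the initial transient

# Two different tasks read from the same liquid states.
target_memory = np.roll(u, 5)[washout:]   # input from 5 steps ago
target_cube = (u ** 3)[washout:]          # nonlinear function of the input

w_memory = train_readout(states, target_memory)
w_cube = train_readout(states, target_cube)

err_memory = np.mean((states @ w_memory - target_memory) ** 2)
err_cube = np.mean((states @ w_cube - target_cube) ** 2)
print(err_memory, err_cube)
```

Only the two readout weight vectors are fitted; the liquid weights `W_in` and `W` stay untouched, which is why a second task costs no more than one additional linear regression.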
AutoEncoder
In data science, we often encounter multidimensional data relationships. Understanding and representing these is often not straightforward. But how do you effectively reduce the dimension without reducing the information content?
Unsupervised dimension reduction
One possibility is offered by unsupervised machine learning algorithms, which aim to code high-dimensional data as effectively as possible in a low-dimensional way. If you don’t know the difference between unsupervised, supervised and reinforcement learning, check out this article we wrote on the topic.
What is an AutoEncoder?
The AutoEncoder is an artificial neural network that is used to reduce the dimensionality of data in an unsupervised way. The network usually consists of three or more layers, and the gradients are usually computed with the backpropagation algorithm. Structurally, the network corresponds to a feedforward network whose layers are fully connected to one another.
There are many types of AutoEncoder. The following table lists the most common variations.
However, all variations share the same basic structure.
Each AutoEncoder is characterized by an encoding and a decoding side, which are connected by a bottleneck, a much smaller hidden layer.
The following figure shows the basic network structure.
During encoding, the dimension of the input information is reduced. Only a condensed summary of the information is passed through the bottleneck, which is how the information is compressed. In the decoding part, the compressed information is used to reconstruct the original data. For this purpose, the weights are adjusted via backpropagation so that the reconstruction error becomes small. In the output layer, each neuron then has the same meaning as the corresponding neuron in the input layer.
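The encode, decode and backpropagation steps can be illustrated with a very small example. This is a minimal sketch, assuming purely linear layers, synthetic toy data and a plain gradient descent loop; the layer sizes, the learning rate and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples with 8 features that really live on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing

n_in, n_code = X.shape[1], 2                        # bottleneck of size 2
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))  # decoder weights
lr = 0.05

losses = []
for epoch in range(500):
    code = X @ W_enc        # encoding: 8 -> 2
    X_hat = code @ W_dec    # decoding: 2 -> 8
    err = X_hat - X
    losses.append(np.mean(err ** 2))

    # Backpropagation of the mean squared reconstruction error.
    grad_out = 2 * err / err.size
    grad_W_dec = code.T @ grad_out
    grad_W_enc = X.T @ (grad_out @ W_dec.T)

    W_dec -= lr * grad_W_dec
    W_enc -= lr * grad_W_enc

print(losses[0], losses[-1])
```

Each output column of `X_hat` is compared with the input column of the same index, which is the sense in which an output neuron has the same meaning as its input counterpart.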
Autoencoder vs Restricted Boltzmann Machine (RBM)
Restricted Boltzmann Machines are based on a similar idea. They are undirected graphical models that are useful for dimensionality reduction, classification, regression, collaborative filtering, and feature learning. However, they take a stochastic approach: stochastic units with a particular distribution are used instead of deterministic units.
RBMs are designed to find the connections between visible and hidden random variables. How does the training work? During the forward pass, the hidden biases help to generate the activations; during the backward pass, the visible-layer biases help to learn the reconstruction.
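One forward and backward traversal of an RBM can be sketched as follows. The text does not name a training rule, so this sketch assumes binary units and a single step of contrastive divergence (CD-1), a common choice for RBMs; the layer sizes, the learning rate and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b_visible = np.zeros(n_visible)
b_hidden = np.zeros(n_hidden)

# A small batch of binary input vectors.
v0 = rng.integers(0, 2, size=(4, n_visible)).astype(float)

# Forward pass: hidden units are stochastic, so we sample them
# from their activation probabilities instead of thresholding.
p_h0 = sigmoid(v0 @ W + b_hidden)
h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

# Backward pass: reconstruct the visible layer from the hidden sample.
p_v1 = sigmoid(h0 @ W.T + b_visible)
p_h1 = sigmoid(p_v1 @ W + b_hidden)

# CD-1 update: difference between data-driven and reconstruction-driven statistics.
lr = 0.1
W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
b_visible += lr * np.mean(v0 - p_v1, axis=0)
b_hidden += lr * np.mean(p_h0 - p_h1, axis=0)
```

The same weight matrix `W` is used in both directions, which reflects the undirected connections between the visible and the hidden layer.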
Since the random initialization of weights in neural networks at the beginning of training is not always optimal, it makes sense to pre-train. The task of training is to minimize the reconstruction error in order to find the most efficient compact representation of the input data.
The method was developed by Geoffrey Hinton and is primarily used for training complex autoencoders. Here, each pair of neighboring layers is treated as a Restricted Boltzmann Machine. This yields a good initial approximation, which is then fine-tuned with backpropagation.
A perceptron is a simple binary classification algorithm modeled after the biological neuron and is thus a very simple learning machine. The output function here is determined by the weighting of the inputs and by the thresholds. Perceptrons are used for machine learning as well as for artificial intelligence (AI) applications. If you don’t know the difference between AI, neural networks and machine learning you should read our article on the subject.
What does the learning process look like?
A set of input signals is mapped to a binary output decision, i.e. zero or one. By training with certain input patterns, similar patterns can thus be found in a data set to be analyzed. The following figure shows this learning process schematically.
If the weighted sum of all inputs exceeds or falls below a set threshold, the state of the neuron output changes. If one now trains a perceptron with given data patterns, the weighting of the inputs changes. The perceptron thus has the ability to learn to solve classification problems by adjusting its weights.
However, a basic requirement to obtain valid results is that the data must be linearly separable.
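The learning process can be made concrete with the classic perceptron learning rule on a linearly separable toy problem, the logical AND. This is a minimal sketch; the function names `predict` and `train_perceptron`, the learning rate and the number of epochs are illustrative assumptions.

```python
import numpy as np


def predict(X, weights, bias):
    """Output 1 where the weighted sum exceeds the threshold 0, else 0."""
    return (X @ weights + bias > 0).astype(int)


def train_perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron learning rule: shift the weights after every mistake."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - predict(x_i, weights, bias)  # -1, 0 or +1
            weights += lr * error * x_i
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

weights, bias = train_perceptron(X, y)
predictions = predict(X, weights, bias)
print(predictions)  # [0 0 0 1]
```

Replacing the targets with XOR (`[0, 1, 1, 0]`) would never converge with this rule, because no single line separates the two classes.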
What are Multilayer Perceptrons (MLP)?
A multilayer perceptron corresponds to what is commonly known as a neural network. Perceptrons form its basic building blocks, the neurons, which are interconnected in different layers.
The figure below shows a simple three-layer MLP. Each line here represents a different output.
However, neurons of the same layer have no connections to each other. Each neuron uses its own weight for every incoming signal, and the output of a neuron becomes part of the input vector of the neurons in the next layer. The diversity of classification possibilities increases with the number of layers.
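A small worked example shows how an additional layer extends what can be classified. The sketch below uses hand-set weights rather than trained ones, which is an assumption made for illustration: two hidden neurons compute OR and AND, and the output neuron combines them into XOR, a problem that a single perceptron cannot solve.

```python
import numpy as np


def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return (z > 0).astype(int)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hand-set weights): neuron 0 computes OR, neuron 1 computes AND.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])      # shape: (inputs, hidden neurons)
b_hidden = np.array([-0.5, -1.5])

# Output layer: fires for "OR but not AND", which is XOR.
W_out = np.array([[1.0], [-2.0]])      # shape: (hidden neurons, outputs)
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)         # outputs of layer 1 ...
output = step(hidden @ W_out + b_out).ravel()  # ... are the inputs of layer 2
print(output)  # [0 1 1 0]
```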
Recurrent Neural Networks vs Feed-Forward Networks
Basically, neural networks are distinguished according to the recurrent and the feed-forward principle.
Recurrent Neural Networks
In a recurrent neural network, neurons are connected to neurons of the same or a preceding layer. Here, a basic distinction is made between three types of feedback. With direct feedback, a neuron’s own output is used as a further input. With indirect feedback, the output of a neuron is connected to a neuron of a preceding layer. With lateral feedback, the output of a neuron is connected to another neuron of the same layer.
Feed-Forward Networks
In feed-forward networks, on the other hand, the outputs are connected only to the inputs of a subsequent layer. These networks can be fully connected, in which case the neurons of a layer are connected to all neurons of the directly following layer. Alternatively, shortcuts are formed, in which case some neurons are not connected to all neurons of the next layer.
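The practical difference between the two principles is that a recurrent layer carries state from one step to the next. The following minimal sketch compares one feed-forward layer with one recurrent layer; the matrix `U`, the tanh activation and the layer sizes are illustrative assumptions. In `U`, the diagonal entries play the role of direct feedback and the off-diagonal entries the role of lateral feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4

W = rng.normal(size=(n_in, n_hidden))                 # input -> layer
U = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # layer -> same layer


def feed_forward_step(x):
    """The output depends on the current input only."""
    return np.tanh(x @ W)


def recurrent_step(x, h_prev):
    """The output also depends on the layer's own previous output."""
    return np.tanh(x @ W + h_prev @ U)


x = np.array([1.0, 0.5, -0.5])

# Presenting the same input twice to the feed-forward layer gives the same output.
ff_first = feed_forward_step(x)
ff_second = feed_forward_step(x)

# The recurrent layer answers differently the second time,
# because its previous output is fed back in.
rec_first = recurrent_step(x, np.zeros(n_hidden))
rec_second = recurrent_step(x, rec_first)
```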
Nowadays, hardly any scientific discipline can get around the Python programming language. With it, powerful algorithms can be applied to large amounts of data efficiently. Open source libraries and frameworks enable the simple implementation of mathematical methods and data transports.
What is scikit-learn?
One of the most popular Python libraries is scikit-learn. It can be used to implement both supervised and unsupervised machine learning algorithms. scikit-learn primarily offers ready-made solutions for data mining, preprocessing and data analysis. The library is based on the SciPy Toolkit (SciKit) and makes extensive use of NumPy for high-performance linear algebra and array operations. If you don’t know what NumPy is, check out our article on the popular Python library. The project was started in 2007, and since then it has been constantly extended and optimized by a very active community. The library is written primarily in Python and relies on Cython only for some performance-critical operations. This makes the library easy to integrate into Python applications.
Easily implement many machine learning algorithms with scikit-learn. Both supervised and unsupervised machine learning are supported. If you don’t know what the difference is between the two machine learning categories, check out this article from us on the topic. The figure below lists all the algorithms provided by the library.
scikit-learn thus offers rich capabilities to recognize patterns and data relationships in a dataset. Thus, high dimensions can be reduced to visualize the relationships without sacrificing much information. Features can be extracted and data clustering algorithms can be easily created.
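As a small illustration of dimension reduction followed by clustering, the sketch below projects the Iris data set that ships with scikit-learn onto two principal components and then groups the projected points. The choice of data set, of two components and of three clusters are assumptions made for this example only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# 150 samples with 4 features each; the labels are not used (unsupervised).
X, _ = load_iris(return_X_y=True)

# Reduce 4 dimensions to 2 so the data could be shown in a scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

# Group the projected points into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(f"variance kept by 2 components: {explained:.2f}")
```

The value of `explained` indicates how much of the original variance survives the projection, which is one way to check that little information was sacrificed.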
scikit-learn is powerful and versatile. However, the library does not stand entirely on its own. Besides the obvious dependency on Python, it relies on other libraries for special operations.
NumPy allows easy handling of vectors, matrices or generally large multidimensional arrays. SciPy complements these functions with useful features like minimization, regression or the Fourier transform. With joblib, Python functions can be run as lightweight pipeline jobs, and threadpoolctl limits the number of threads used by the underlying native libraries in order to save resources.
NumPy == “Numeric Python”
– Open source Python library for array-based calculations
– First released in 1995 as Numeric (first implementation of a Python matrix package); released in 2006 as NumPy
– allows easy handling of vectors, matrices or generally large multidimensional arrays
– NumPy’s operators and functions are optimized for multidimensional array operations and are evaluated particularly efficiently
– written in C
– compatible with various Python libraries (Matplotlib, Pandas, SciPy)
– SciPy extends the power of NumPy with other useful features, such as minimization, regression and the Fourier transform
Python and Science
– The programming language Python is used very intensively in the application area of scientific research
– NumPy was designed for scientific calculations
The ndarray data structure
– Core functionality of NumPy is based on the data structure “ndarray”
– Components: a pointer to a contiguous storage area together with metadata describing the data stored in it
– All elements of an array must be of the same data type
– shape == Defines the number of axes of the array and the size along each axis
– strides == Describe for each axis how many bytes have to be skipped in linear memory when the index belonging to this axis is increased by 1
– reshaping == Altering the shape of a provided array
– slicing == Setting up smaller subarrays within a given larger array
– splitting + joining == Splitting one array into many and combining multiple arrays into one single array
– indexing == Getting and setting the value of individual array elements
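The terms above can be tried out directly on a small array. This is a minimal sketch; the array contents and the variable names are illustrative, and the data type is fixed to 64-bit integers so that the stride values are the same on every platform.

```python
import numpy as np

a = np.arange(12, dtype=np.int64)   # 12 elements, 8 bytes each
b = a.reshape(3, 4)                 # reshaping: same data, new shape

shape = b.shape       # (3, 4): two axes with 3 and 4 entries
strides = b.strides   # (32, 8): next row = 4 * 8 bytes, next column = 8 bytes

# Slicing: rows 1 and 2, columns 0 and 1 -> [[4, 5], [8, 9]]
sub = b[1:, :2]

# Splitting into two halves along the columns and joining them again.
left, right = np.hsplit(b, 2)
joined = np.concatenate([left, right], axis=1)

# Indexing: set a single element. Because b is a view on the
# same memory as a, the change is visible in a as well.
b[0, 0] = 100
first_of_a = a[0]
```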
Seaborn == Python visualization library based on Matplotlib (Python’s core 2D plotting library)
– provides a high-level interface for the visualization of statistical data
– does not have its own graphics library, but uses the functionalities and data structures of Matplotlib internally
Weaknesses of plain Matplotlib:
– poor default options for the size and color of plots
– low-level technology compared to today’s requirements, requiring very specialized code to generate appealing plots
– not developed for Pandas DataFrames
Features:
– Built-in themes for styling Matplotlib graphics
– Dataset-oriented API for determining the relationship between variables
– Visualization of univariate and bivariate data
– Automatic estimation and display of linear regression models
– Plotting of statistical time series data
– works well with NumPy and Pandas data structures
Apache Mahout is a powerful machine learning tool that comes with a seamless compatibility to the strong big data management frameworks from the Apache universe. In this article, we will explain the functionalities and show you the possibilities that the Apache environment offers.
What is Machine Learning?
Machine learning algorithms provide lots of tools for analyzing large unknown data sets. The art of data science is to extract the maximum amount of information depending on the data set by using the right method. Are there patterns in the high-dimensional data relationships, and how can they be represented in a low-dimensional way without much loss of information?
A failure often carries a similar amount of information as a case in which an algorithm successfully created groupings. It is important to understand the mathematical approaches behind the tools in order to draw conclusions about why an algorithm did not work. If you don’t know the basic machine learning categories, it’s best to read our article on the subject first.
Machine Learning and Linear Algebra
Most machine learning methods are based on linear algebra. This mathematical subfield deals with vector spaces and the linear mappings between them. Knowledge of its rules is the key to correctly understanding machine learning algorithms.
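Linear regression is a simple example of a machine learning method that reduces to linear algebra. The sketch below is written in Python with NumPy for illustration, not in Mahout’s own DSL, and uses made-up noise-free data on the line y = 2x + 1; fitting the line means solving a linear least-squares problem.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Solve min ||A @ coeffs - y||^2 with a standard least-squares routine.
coeffs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coeffs
```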
What is Apache Mahout?
Apache Mahout is an open source machine learning project that builds implementations of scalable machine learning algorithms with a focus on linear algebra. If you’re not sure what Apache is, check out this article, in which we introduce the foundation and its main projects.
Mahout was first released in 2009, and since then it has been constantly extended and kept up to date by a very active community. Originally, it contained scalable algorithms closely related to Apache Hadoop and MapReduce. However, Mahout has since evolved into a backend-independent environment. That is, it also operates on non-Hadoop clusters or single nodes.
The math library is based on Scala and provides an R-like Domain Specific Language (DSL). Mahout is usable for Big Data applications and statistical computing. The figure below lists all machine learning algorithms currently offered by Mahout.
The algorithms are scalable and cover both supervised and unsupervised machine learning methods, such as clustering algorithms.
Apache Mahout covers a large part of the usual machine learning tools. This means that data can be analyzed without having to change frameworks. This is a big plus for maintaining compatibility in the application.
The framework integrates seamlessly into the Apache Ecosystem. This means that an application can access the entire power of the data processing platforms and build very high-performance big data pipelines. The following figure shows the Apache data management ecosystem.
Through connectivity to Apache Flink, stream data analysis pipelines can be built, and with Hive, SQL-like queries can be automatically converted into MapReduce, Tez or Spark jobs.
Amazon Web Services (AWS) == internationally leading platform for cloud computing
– founded in 2006 by Amazon
– Services go far beyond hosting files
→ Services include: virtual servers, storage solutions, networks, databases, development interfaces
– Customers include: Dropbox, NASA, Netflix
Cloud computing == Access to virtual computing capacity / access to platforms via the Internet
– All services are connected via REST architecture and SOAP protocol → accessible via HTTP/HTTPS
– EC2 (Elastic Compute Cloud)
→ virtual server (simulated unit of a server farm running separately from others)
→ Operating systems: Linux distributions or Microsoft Windows Server
→ fully scalable
Storage == Webspace for file hosting
– theoretically any amount of data
– S3 (Simple Storage Service)
→ file hosting service, virtual network drives, archiving systems
→ Access via web interface (HTTP/HTTPS)
– Elastic Block Store (EBS)
→ Storage at block level
→ can be attached to Amazon EC2 instances
– Snowball
→ rentable storage device
→ large amounts of data can be copied onto it, and it is then returned by parcel service
– CloudFront
→ Content Delivery Network (CDN)
→ makes content (files, domains) from other AWS services available globally, including SSL encryption
→ Reduction of access time
Databases == store dynamic content in tables or matrices
– SimpleDB
→ Storage of non-relational information (structured as objects and properties)
→ Storage of small and medium-sized data volumes in a high-performance environment
– Relational Database Service (RDS)
→ virtual database
→ is based on MySQL, Microsoft SQL Server or Oracle
Elastic Beanstalk – Platform as a Service (PaaS) == Service to deploy and scale web applications and services
– Development, Analysis, Debugging, Testing
– Platforms: Java, .NET, PHP, Node.js, Python, Ruby, Go and Docker
– Runs on: Apache, Nginx, Passenger and IIS
Further services: Simple Workflow Service (SWF), Simple Email Service (SES), Simple Queue Service (SQS), Simple Notification Service (SNS)
Pandas == Open source Python library for data analysis and manipulation
– released in 2008 by Wes McKinney
– written in Python, Cython and C
– Name is derived from “Panel Data”
– is, next to NumPy, SciPy and Matplotlib, one of the most important data manipulation and analysis tools
→ All are compatible with each other
→ Its strength lies in the processing and evaluation of tabular data and time series
– Pandas defines its own data objects for data processing
→ these form the basis for functions and tools
Series
– 1-dimensional
– Data structure with two arrays (one array as index + one array with data)
– can accept different types of data (ints, strings …)
– When adding several Series, the indices are combined
DataFrame
– 2-dimensional
– contains an ordered collection of columns
– different columns can consist of different data types
– Each value is uniquely identified by a row and a column index
Panel (deprecated, no longer part of current Pandas versions)
– 3-dimensional data sets
– consisting of DataFrames
– Axes:
→ items: each item corresponds to a DataFrame contained inside
→ major axis: index (rows) of each of the DataFrames
→ minor axis: columns of each of the DataFrames
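The behavior of the one-dimensional and two-dimensional data objects can be checked with a short example. This is a minimal sketch with made-up values and labels; the three-dimensional structure is left out because it is no longer available in current Pandas versions.

```python
import pandas as pd

# Series: one array of data plus one array of index labels.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Adding Series combines the indices; labels that exist in only
# one of the two Series get a missing value (NaN).
total = s1 + s2

# DataFrame: ordered collection of columns with different data types.
df = pd.DataFrame(
    {"name": ["x", "y"], "value": [1.5, 2.5]},
    index=["r1", "r2"],
)

# Each value is addressed by a row label and a column label.
cell = df.loc["r2", "value"]
```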