Traditionally, neural networks treated all inputs and outputs as independent of one another, but for tasks such as predicting the next word of a sentence, information about the previous words is essential. The recurrent neural network (RNN) arose to address this: a type of neural network in which the output of the preceding step is fed back as input to the current step. The most important feature of an RNN is its hidden state, which retains information about the sequence seen so far.
The article is an excerpt from the book IoT and Edge Computing for Architects, Second Edition by Perry Lea. In this article, we will explore recurrent neural networks in the cloud and at the edge. The book provides a complete package of executable insights and real-world scenarios that will help the reader gain comprehensive knowledge of edge computing and become proficient in building efficient enterprise-scale IoT applications.
RNNs, or recurrent neural networks, are a field of machine learning in their own right, and they are extremely important and relevant to IoT data. The big difference between an RNN and a CNN is that a CNN processes fixed-size vectors of input data. Think of those as two-dimensional images: input of a known size. An RNN, by contrast, takes a vector as input and produces another vector as output, and at its heart, the output vector is influenced not only by the single input we just fed it but by the entire history of inputs it has been fed. That implies an RNN understands the temporal nature of the data, or, put another way, maintains state. There is information to be inferred not just from the data itself but also from the order in which it arrives.
RNNs are of particular value in the IoT space, especially for time-correlated series of data, such as describing a scene in an image, determining the sentiment of a sequence of text or values, and classifying video streams. Data may be fed to an RNN from an array of sensors as (time, value) tuples; that is the input data sent to the RNN. In particular, such RNN models can be used in predictive analytics to find faults in factory automation systems, evaluate sensor data for abnormalities, evaluate timestamped data from electric meters, and even detect patterns in audio data. Signal data from industrial devices is another great example: an RNN could be used to find patterns in an electrical signal or wave, a use case a CNN would struggle with. An RNN can run ahead and predict what the next value in a sequence will be; if the actual value falls outside the predicted range, that could indicate a failure or significant event:
If you were to examine a neuron in an RNN, it would look as if it were looping back on itself. Essentially, the RNN is a collection of states going back in time. This is clear if you think of unrolling the RNN at each neuron:
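The unrolling described above can be sketched as a minimal vanilla RNN step in NumPy. The dimensions, random weights, and input sequence below are hypothetical illustrations, not a trained model; the point is that the same step function is applied repeatedly, with the hidden state carrying the history forward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 3 input features, 4 hidden units.
n_in, n_hid = 3, 4
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden-to-hidden: the "loop back"
b_h = np.zeros(n_hid)

def rnn_step(x, h_prev):
    """One unrolled step: the new state depends on the input AND the prior state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Feed a 5-step sequence; the final state summarizes the entire history.
h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h = rnn_step(x_t, h)

print(h.shape)  # (4,)
```

Unrolling the loop over the five time steps yields five copies of `rnn_step` chained together, which is exactly the "collection of states going back in time" described above.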
The challenge with RNN systems is that they are more difficult to train than a CNN or other models. Recall that CNN systems use backpropagation to train and reinforce the model. An RNN cannot use plain backpropagation: because every input carries a unique timestamp, gradients must be propagated back through every time step, a procedure called backpropagation through time. This leads to the vanishing gradient problem discussed earlier, which can reduce the learning rate of the network until it is useless. A CNN is also exposed to vanishing gradients, but the difference with an RNN is that its depth can extend back many iterations, whereas a CNN traditionally has only a few hidden layers. For example, an RNN resolving a sentence structure like A quick brown fox jumped over the lazy dog will extend back nine levels. The vanishing gradient problem can be understood intuitively: if the weights in the network are small, the gradient shrinks exponentially across the time steps, leading to a vanishing gradient. If the weights are large, the gradient grows exponentially and may explode, producing NaN (not a number) errors. Exploding gradients lead to an obvious crash, but the gradient is usually truncated or capped before that occurs; a vanishing gradient is harder for a computer to deal with.
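The shrink-or-explode behavior can be shown with a toy calculation. Backpropagating through many time steps repeatedly multiplies the gradient by the recurrent weight; the weight values 0.5 and 1.5 and the 50-step depth below are arbitrary choices for illustration:

```python
# Repeatedly multiplying by a recurrent weight mimics backpropagating
# a gradient through many time steps.
small, large = 0.5, 1.5
grad_small = grad_large = 1.0
for _ in range(50):        # 50 time steps back
    grad_small *= small    # shrinks toward zero -> vanishing gradient
    grad_large *= large    # blows up            -> exploding gradient

print(f"{grad_small:.2e}")  # ~8.88e-16: effectively no learning signal left
print(f"{grad_large:.2e}")  # ~6.37e+08: would be clipped in practice
```

At 50 steps the small-weight gradient is already below machine noise, which is why a deep unrolled RNN stops learning from distant inputs long before any numerical error occurs.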
One method to mitigate this effect is to use the ReLU activation function mentioned in the CNN section. The derivative of ReLU is either 0 or 1, so repeated multiplication does not progressively shrink the gradient, making it less prone to vanishing gradients. Another option is long short-term memory (LSTM), which was proposed by the researchers Sepp Hochreiter and Juergen Schmidhuber (Long Short-Term Memory, Neural Computation, 9(8):1735-1780, 1997). LSTM addresses the vanishing gradient issue and allows an RNN to be trained over long sequences. Here, the RNN neuron consists of three or four gates. These gates allow the neuron to hold state information and are controlled by logistic (sigmoid) functions with values between 0 and 1:
- Keep gate K: Controls how much a value will remain in memory
- Write gate W: Controls how much a new value will affect memory
- Read gate R: Controls how much a value in memory is used to create the output activation function
You can see that these gates are somewhat analog in nature: rather than switching fully on or off, they vary how much information is retained. The LSTM cell traps errors in the cell's memory; this is called the error carousel, and it allows the LSTM cell to backpropagate errors over long time periods. The LSTM cell has the following logical structure: outwardly the neuron appears much the same as one in a CNN, but internally it maintains state and memory. The LSTM cell of the RNN is illustrated as follows:
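The three gates listed above correspond to what most modern texts call the forget, input, and output gates. Below is a minimal NumPy sketch of one LSTM step using the keep/write/read naming from this article; the dimensions and random weights are hypothetical, not a trained cell:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes each gate value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# One weight matrix per gate, plus one for the candidate value (hypothetical sizes).
Wk, Ww, Wr, Wc = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    k = sigmoid(Wk @ z)                   # keep gate: how much memory to retain
    w = sigmoid(Ww @ z)                   # write gate: how much new value to store
    r = sigmoid(Wr @ z)                   # read gate: how much memory to expose
    c = k * c_prev + w * np.tanh(Wc @ z)  # updated cell memory
    h = r * np.tanh(c)                    # output activation
    return h, c

h = c = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)
```

Note the cell-memory update `c = k * c_prev + ...`: when the keep gate is near 1, the memory (and the gradient flowing through it) passes through nearly unchanged, which is the mechanism behind the error carousel.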
An RNN builds up memory during the training process. This is seen in the diagram as the state layer beneath the hidden layer. An RNN is not searching for the same pattern across an image or bitmap like a CNN; rather, it searches for a pattern across multiple sequential steps (which could be time steps). The complementary hidden and state layers are shown in the diagram:
One can see that the amount of computation in training, with the LSTM's logistic-function math on top of regular backpropagation, is heavier than for a CNN. The training process involves backpropagating gradients through the network all the way back to time zero. However, the contribution of a gradient from far in the past (say, time zero) approaches zero and will not contribute to the learning.
A good use case to illustrate an RNN is a signal analysis problem. In an industrial setting, you can collect historical signal data and attempt to infer from it whether a machine was faulty or whether there were runaway thermals in some component. A sensor device would be attached to a sampling tool and a Fourier analysis performed on the data. The frequency components could then be inspected to see if a particular aberration was present. In the following graph, we have a simple sine wave that indicates normal behavior, perhaps of a machine using cast rollers and bearings. We also see two aberrations introduced (the anomaly). A fast Fourier transform (FFT) is typically used to find aberrations in a signal based on its harmonics. Here, the defect is a high-frequency spike similar to a Dirac delta, or impulse, function.
We see that the following FFT registers only the carrier frequency and does not reveal the aberration:
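This blindness is easy to reproduce with NumPy's FFT. In the sketch below, the sample rate, carrier frequency, and spike location are made-up values for illustration: a single-sample impulse spreads its energy thinly across all frequency bins, so the spectrum is still dominated by the carrier:

```python
import numpy as np

fs = 1000                           # sample rate in Hz (hypothetical)
t = np.arange(0, 1, 1 / fs)         # one second of samples
carrier_hz = 50
signal = np.sin(2 * np.pi * carrier_hz * t)
signal[300] += 3.0                  # brief impulse-like aberration at t = 0.3 s

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / fs)

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # 50.0 -> the FFT sees only the carrier; the spike's
                # energy is smeared across every bin at low amplitude
```

The carrier concentrates roughly half the signal's samples worth of energy into one bin, while the impulse contributes only a small constant magnitude to every bin, so no single frequency betrays the fault.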
Training an RNN specifically to identify the time-series signature of a particular tone or audio sequence is a straightforward application. In this case, an RNN could replace an FFT, especially when multiple sequences of frequencies or states are used to classify a system, which makes the approach well suited to sound or speech recognition.
Industrial predictive maintenance tools rely on this type of signal analysis to find thermal and vibration-based failures in different machines. This traditional approach has limits, as we have seen. A machine learning model (especially an RNN) can be used to inspect the incoming stream of data for particular feature (frequency) components and could find point failures like the one shown in the preceding graph. Raw data is arguably never as clean as a sine wave; usually the data is quite noisy, with periods of loss.
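The predict-then-compare idea described earlier (run ahead, predict the next value, and flag values outside the predicted range) can be sketched without a trained network. Below, a simple linear extrapolation of the last two samples stands in for the RNN's prediction; the sample rate, fault location, and threshold are made-up values for illustration:

```python
import numpy as np

fs = 200                                # sample rate in Hz (hypothetical)
t = np.arange(0, 2, 1 / fs)             # 400 samples over two seconds
signal = np.sin(2 * np.pi * 5 * t)      # healthy 5 Hz vibration signature
signal[250] += 1.5                      # injected point failure

# Stand-in for a trained RNN predictor: extrapolate the next sample
# linearly from the previous two, then flag large prediction residuals.
pred = 2 * signal[1:-1] - signal[:-2]   # one-step linear extrapolation
residual = np.abs(signal[2:] - pred)

threshold = 0.5                         # arbitrary residual threshold
anomalies = np.where(residual > threshold)[0] + 2  # back to signal indices
print(anomalies)  # flags the samples at and just after index 250
```

For a smooth sine at this sample rate, the extrapolation residual stays tiny (it is essentially the second difference of the signal), so only the injected spike crosses the threshold. A real RNN would learn a far richer predictor, but the anomaly-flagging logic is the same.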
Another use case centers on sensor fusion in healthcare. Healthcare products such as glucose monitors, heart rate monitors, fall indicators, respiratory meters, and infusion pumps send periodic readings or continuous streams of data. All these sensors are independent of each other, but together they compose a picture of patient health, and they are time-correlated. An RNN can bridge this unstructured data in aggregate and predict patient health as it varies with the patient's activity throughout the day. This can be useful for home health monitoring, sports training, rehabilitation, and geriatric care.
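Before independent streams can feed one RNN, they must be aligned onto a common clock. One simple approach, sketched below with made-up heart-rate and respiration readings, is to interpolate each (time, value) stream onto a shared grid so every RNN step receives one fused vector:

```python
import numpy as np

# Two independent streams of (time, value) tuples with different clocks
# (hypothetical sample data).
hr_t = np.array([0.0, 1.1, 2.3, 3.0, 4.2])   # heart-rate timestamps (s)
hr_v = np.array([72.0, 75.0, 74.0, 80.0, 78.0])
resp_t = np.array([0.5, 2.0, 3.5])           # respiration timestamps (s)
resp_v = np.array([16.0, 18.0, 17.0])

# Resample both streams onto a shared 1 Hz grid so a single RNN can
# consume one fused (time, hr, resp) vector per step.
grid = np.arange(0.0, 4.0, 1.0)
fused = np.column_stack([
    grid,
    np.interp(grid, hr_t, hr_v),
    np.interp(grid, resp_t, resp_v),
])
print(fused.shape)  # (4, 3): four time steps, three features each
```

Linear interpolation is only one of several alignment choices; in practice the resampling rate and interpolation method would depend on how quickly each vital sign can change.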
You must be careful with RNNs. While they can make good inferences on time-series data and predict oscillations and wave behaviors, they may behave chaotically and are very difficult to train. This article introduced RNN data analysis models and showed how RNNs, with proper training, fit these use cases. For a holistic and expert review of implementing cloud and edge computing to build commercial IoT systems, check out the book IoT and Edge Computing for Architects, Second Edition by Perry Lea.