Deep learning is well known for its applicability in image recognition, but another key use of the technology is speech recognition, as found in voice assistants such as Amazon’s Alexa or in voice-to-text messaging. The advantage of deep learning for speech recognition stems from the flexibility and predictive power of deep neural networks, which have recently become more accessible. Pranjal Daga, Machine Learning Scientist at Cisco Innovation Labs, gave a compelling talk at ODSC West 2018 on the specifics of applying deep learning to solve challenging speech recognition problems.
At the most basic level, speech recognition converts sound waves to individual letters and ultimately sentences. Pranjal Daga explained that a key difficulty in transcribing the correct words is the variability in the sound produced for the same word due to accent or cadence (e.g. hello versus hellllooo). Given an audible sentence, modern speech recognition begins by splitting the sound wave into short windows, applying a Fast Fourier Transform to each window (shown below), and concatenating the resulting frames from adjacent windows to form a spectrogram. The purpose is to represent the one-dimensional sound signal as a compact sequence of frequency frames from which specific letters can be predicted.
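The windowing and transform step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pipeline from the talk: the function name `spectrogram`, the frame length, the hop size, the Hann window, and the synthetic 440 Hz test tone are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: one FFT per windowed frame of the signal."""
    window = np.hanning(frame_len)  # taper each frame to limit spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 440 Hz tone sampled at 8 kHz
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)  # rows are time frames, columns are frequency bins
```

Each row of `spec` is one frame of the spectrogram. Production systems typically add further steps such as log scaling or mel filter banks, which are omitted here.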
The mapping between frames of the spectrogram and the letters being predicted is best modeled using recurrent neural networks. Previously, separate models for acoustics, pronunciation, and language were used in conjunction; recurrent neural networks instead enable more accurate transcriptions by allowing greater flexibility in predicting words with varying sounds. Pranjal Daga indicated that long short-term memory networks (LSTMs) are widely applied and effective for this purpose.
Each frame of the spectrogram (illustrated as “O” below) is then classified as one of the characters A to Z, “space”, or “blank” (the set “c” below). For each character, the output layer holds t softmax activation values, one for each of the t frames in the spectrogram. A high activation value indicates a high probability that the spectrogram frame corresponds to the sound of the given letter. In the diagram below, the fourth spectrogram frame appears to be strongly associated with the letter A.
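To make the per-frame softmax concrete, here is a small sketch that turns raw network scores into one probability distribution per frame. The scores are made up (random numbers standing in for the output of a trained network), and the names `ALPHABET`, `softmax`, `logits`, and the `_` symbol for blank are assumptions for illustration only.

```python
import numpy as np

# Character set: blank ("_"), the letters A to Z, and space
ALPHABET = ["_"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" "]

def softmax(logits):
    """Row-wise softmax: each frame's scores become probabilities summing to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up scores for 6 frames; a real model would produce these from the spectrogram
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, len(ALPHABET)))
logits[3, ALPHABET.index("A")] = 10.0  # force the fourth frame to favor "A"

probs = softmax(logits)                             # shape: (frames, characters)
best = [ALPHABET[i] for i in probs.argmax(axis=1)]  # most likely character per frame
```

Here `best[3]` is `"A"`, mirroring the case where the fourth frame is strongly associated with the letter A.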
The sequence of characters predicted by the model may look similar to the sequence below. The classification algorithm then maps the sequence of characters to a word by merging consecutive duplicate characters and then removing blanks.
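The duplicate-and-blank removal can be sketched as follows. The function name `ctc_collapse`, the `_` blank symbol, and the example frame sequence are assumptions for illustration. The order of the two steps matters: consecutive repeats are merged first, so a blank between two identical letters is what allows a genuine double letter to survive.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the "blank" output

def ctc_collapse(frame_chars):
    """Merge consecutive repeated characters, then drop blanks."""
    merged = [ch for ch, _ in groupby(frame_chars)]
    return "".join(ch for ch in merged if ch != BLANK)

word = ctc_collapse("HH_EE_LL_LLL_OO")  # "HELLO"
```

Without the blank between the two runs of L, the sequence would collapse to "HELO".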
Pranjal Daga explained that although the recurrent neural network is capable of predicting characters, a major problem with the approach is that the model tends to make spelling and linguistic errors. The solution is to fuse the acoustic model (described previously) with a language model that is capable of understanding context. Instead of simply translating sounds into characters, the combined models infer the most likely word from a given vocabulary based on the sound input.
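One common way to combine the two models is to add a weighted language model score to the acoustic score of each candidate and keep the best total. The sketch below illustrates that idea only; it is not necessarily the fusion method described in the talk. The candidate words, their probabilities, the toy vocabulary, the `lm_weight` value, and the names `fuse` and `toy_lm` are all hypothetical.

```python
import math

def fuse(candidates, lm_logprob, lm_weight=0.5):
    """Return the candidate with the best acoustic + weighted language model score."""
    return max(candidates, key=lambda w: candidates[w] + lm_weight * lm_logprob(w))

# Hypothetical acoustic log-probabilities for three candidate transcriptions
acoustic = {
    "helo": math.log(0.5),   # favored by the acoustic model, but misspelled
    "hello": math.log(0.4),
    "hallo": math.log(0.1),
}

# Toy language model: a tiny vocabulary with made-up word probabilities
vocab = {"hello": 0.9, "hallo": 0.1}

def toy_lm(word):
    # Words outside the vocabulary get a very small probability
    return math.log(vocab.get(word, 1e-6))

acoustic_only = max(acoustic, key=acoustic.get)  # "helo"
fused = fuse(acoustic, toy_lm)                   # "hello"
```

The acoustic model alone picks the misspelled "helo", while the fused score prefers "hello" because the language model assigns almost no probability to a word outside its vocabulary.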
Another key challenge in speech recognition is latency: to transcribe in real time, the model needs to predict words correctly without seeing the whole sentence. Some deep learning models, such as bidirectional recurrent neural networks, benefit greatly from seeing the whole sentence because of the added context. The solution for reducing latency is to include limited context in the model structure by giving the neural network access to a short amount of information after a specific word. Pranjal Daga explained that it is best practice to handle context in the final layers so that outputs can be computed and recomputed easily.
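The idea of limited future context can be illustrated with a small lookahead operation in which each frame is combined with only a fixed number of frames that follow it. This is a simplified sketch, not the layer used in the talk: the function name `lookahead_context`, the plain averaging, and the `future_frames` parameter are assumptions (a trained model would learn the combination weights).

```python
import numpy as np

def lookahead_context(features, future_frames=2):
    """Average each frame with up to `future_frames` frames that follow it."""
    n = len(features)
    out = np.empty_like(features, dtype=float)
    for t in range(n):
        # Only frames t .. t + future_frames are needed, so the output for
        # frame t is available after a short, fixed delay
        out[t] = features[t : min(t + future_frames + 1, n)].mean(axis=0)
    return out

# Hypothetical input: 5 frames with a single feature each
frames = np.arange(5, dtype=float).reshape(5, 1)
context = lookahead_context(frames, future_frames=2)
```

Because the output for frame t depends on nothing beyond frame t + `future_frames`, the delay stays constant regardless of sentence length, unlike a bidirectional network that must wait for the end of the sentence.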
Ultimately, deep learning is still in its infancy, but it is quickly approaching state-of-the-art capability in speech recognition. Pranjal Daga indicated that there is still a lot of room for improvement in model engineering to reduce latency and increase accuracy. In part, many advances made in research have been difficult to realize in production, so improvements in speech recognition also need to address production maturity.