Amazon Alexa Scientists’ Speech Recognizer Reduces Speech Recognition Errors By 20 Percent With Semi-Supervised Learning

A team of researchers at Amazon’s Alexa division has introduced a speech recognizer that is trained in a semi-supervised fashion. In a paper titled ‘Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-Student Learning’, Minhua Wu, an applied scientist in the Alexa Speech group, and colleagues report on an experimental model trained on 800 hours of annotated data and 7,200 hours of “softly” labeled unannotated data, with a second speech system fed the same data samples but with artificially generated noise added. The design achieved a 20 percent reduction in word error rate compared with the baseline.

Wu said the team hopes to improve the noise robustness of the speech recognition system. Wu and colleagues explained that automatic speech recognition systems comprise three core components: an acoustic model, a pronunciation model, and a language model. The acoustic model takes short audio samples, or frames, as input and, for every frame, outputs thousands of probabilities. Each probability indicates the likelihood that the frame belongs to a low-level phonetic representation called a senone. In the proposed approach, the acoustic model’s output is fed into the pronunciation model, which transforms the senone sequences into possible words and passes those to the language model, which encodes the probabilities of word sequences. In the end, all three systems work together to find the most probable word sequence given the audio input.
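The way the three components combine can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four senone names, the probabilities, the two candidate words, and the simplification that each frame maps to exactly one senone. A production recognizer works with thousands of senones and searches over alignments and word sequences, which this sketch does not attempt.

```python
import math

# Acoustic model output (made-up numbers): for each of three frames,
# a probability for every senone in a tiny hypothetical inventory.
FRAME_POSTERIORS = [
    {"k": 0.70, "b": 0.20, "ae": 0.05, "t": 0.05},
    {"k": 0.10, "b": 0.10, "ae": 0.70, "t": 0.10},
    {"k": 0.05, "b": 0.05, "ae": 0.10, "t": 0.80},
]

# Pronunciation model: candidate words and their senone sequences
# (one senone per frame, a deliberate simplification).
PRONUNCIATIONS = {
    "cat": ["k", "ae", "t"],
    "bat": ["b", "ae", "t"],
}

# Language model: prior probability of each candidate word.
LANGUAGE_MODEL = {"cat": 0.4, "bat": 0.6}


def score(word):
    """Log-probability of a word: acoustic evidence plus language-model prior."""
    acoustic = sum(
        math.log(FRAME_POSTERIORS[t][senone])
        for t, senone in enumerate(PRONUNCIATIONS[word])
    )
    return acoustic + math.log(LANGUAGE_MODEL[word])


def decode():
    """Return the candidate word with the highest combined score."""
    return max(PRONUNCIATIONS, key=score)


if __name__ == "__main__":
    print(decode())
```

In this example the language model alone prefers "bat", but the acoustic evidence for the first frame is strong enough that the combined score selects "cat".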

Wu and colleagues reportedly forced the student model to train only on the highest-probability targets, keeping from five to 40 of them. This allowed the model to devote more resources to differentiating among the likely candidates, and it reduced errors even on a noise-free test data set. According to reports, the research is slated to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton this year.
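One way to picture training on only the highest-probability targets is the sketch below. It is not the paper's implementation. The function names are hypothetical, and renormalizing the kept probabilities and using cross-entropy as the loss are assumptions; the article only says that between five and 40 of the highest probabilities were used.

```python
import math


def top_k_targets(teacher_probs, k):
    """Keep the k largest teacher probabilities, renormalized to sum to 1.

    Renormalization is an assumption of this sketch, not a detail
    confirmed by the article.
    """
    ranked = sorted(
        range(len(teacher_probs)),
        key=lambda i: teacher_probs[i],
        reverse=True,
    )
    kept = ranked[:k]
    total = sum(teacher_probs[i] for i in kept)
    return {i: teacher_probs[i] / total for i in kept}


def student_loss(teacher_probs, student_probs, k):
    """Cross-entropy between truncated teacher targets and student outputs.

    student_probs must be strictly positive at the kept indices,
    otherwise math.log raises ValueError.
    """
    targets = top_k_targets(teacher_probs, k)
    return -sum(p * math.log(student_probs[i]) for i, p in targets.items())


if __name__ == "__main__":
    teacher = [0.50, 0.30, 0.15, 0.05]   # made-up teacher output for one frame
    student = [0.60, 0.30, 0.05, 0.05]   # made-up student output for the same frame
    print(top_k_targets(teacher, k=2))
    print(student_loss(teacher, student, k=2))
```

Because only the kept entries contribute to the loss, the student is never penalized for how it spreads probability among the unlikely senones, which matches the article's description of the model concentrating on the plausible candidates.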