A team of scientists at Google and the University of California has proposed a speech recognition method that promises to reduce word error rate by up to 29 percent. According to reports, in experiments with the LibriSpeech dataset, which comprises 960 hours of audio and an 800-million-word language modeling corpus, the method showed an 18.6 percent relative improvement in word error rate over the baseline, and in some cases it achieved a 29 percent error reduction.
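For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate (word-level edit distance divided by the number of reference words) and what a "relative improvement" means arithmetically. The function names and the 10 percent baseline figure are illustrative assumptions, not values taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev is the previous row of the edit-distance table; row 0 is the
    # cost of inserting the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Hypothetical baseline of 10% WER reduced to 8.14% WER.
    print(round(relative_improvement(0.10, 0.0814), 3))
```

Under this definition, an 18.6 percent relative improvement means the new error rate is 81.4 percent of the baseline error rate, not that the error rate fell by 18.6 percentage points.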
The approach taps a spelling correction model trained on text-only data. The paper's authors explained that the aim is to incorporate a module trained on text data into an end-to-end framework, with the objective of correcting errors made by the system. They further noted that they explore using unpaired text data to generate audio signals with a text-to-speech system, a process akin to back translation in machine translation.
Many automatic speech recognition (ASR) systems jointly train three components: an acoustic model that learns the relationship between audio signals and the linguistic units that make up speech; a language model that assigns probabilities to sequences of words; and a mechanism that aligns acoustic frames with recognized symbols. All three components use a single neural network (layered mathematical functions modeled after biological neurons) and are trained on transcribed audio-text pairs. As a consequence, the language model typically suffers degraded performance when it encounters words that occur infrequently in the corpus.
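To make the rare-word problem concrete, here is a minimal sketch of a language model that assigns probabilities to word sequences. It swaps in a simple count-based bigram model with add-one smoothing rather than the neural network the article describes, and the tiny corpus and function names are made up for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over a small training corpus."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words[1:])
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Log-probability of a sentence under add-one smoothing."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot for any unseen word
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(words, words[1:])
    )


if __name__ == "__main__":
    # Toy corpus standing in for the transcripts of paired audio-text data.
    model = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
    print(log_prob("the cat sat", model))  # frequent words
    print(log_prob("the yak sat", model))  # "yak" never appears in training
```

The sentence containing the unseen word receives a lower score, which is the kind of degradation that extra text-only training data is meant to reduce.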
The researchers then set out to integrate the spelling correction model into the ASR framework, which maps its inputs to higher-level representations. They used text-only data and corresponding synthetic audio signals, generated with a text-to-speech system, to train a Listen, Attend and Spell (LAS) speech recognizer, an end-to-end model first described by researchers at Google Brain in 2017, and subsequently to create a set of text-to-speech pairs. To validate the model, the researchers trained a language model, generated a text-to-speech dataset to train the LAS model, and produced error hypotheses to train the correction model, all using 40 million text sequences from the LibriSpeech dataset. They found that the spelling correction model, by correcting hypotheses from the LAS recognizer, could produce an expanded output list with a significantly lower word error rate.
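The data flow described above (text, then synthetic audio, then recognizer hypotheses, then correction training pairs) can be sketched roughly as follows. Everything here is a hypothetical stand-in: `toy_tts` and `toy_recognizer` replace the real text-to-speech system and LAS recognizer, and the "corrector" is a word-level lookup table rather than the neural sequence-to-sequence model the researchers trained.

```python
from collections import Counter, defaultdict

# Hypothetical confusions a recognizer might make; not taken from the paper.
CONFUSIONS = {"their": "there", "two": "to"}


def toy_tts(text):
    """Stand-in for a TTS system: the 'audio' is just the list of words."""
    return text.split()


def toy_recognizer(audio):
    """Stand-in for the speech recognizer: applies the fixed confusions."""
    return " ".join(CONFUSIONS.get(word, word) for word in audio)


def build_correction_pairs(texts):
    """Turn text-only data into (error hypothesis, reference text) pairs."""
    return [(toy_recognizer(toy_tts(text)), text) for text in texts]


def train_toy_corrector(pairs):
    """Learn, per hypothesis word, the most frequent reference word."""
    votes = defaultdict(Counter)
    for hypothesis, reference in pairs:
        hyp_words, ref_words = hypothesis.split(), reference.split()
        if len(hyp_words) != len(ref_words):
            continue  # this toy version only handles one-to-one substitutions
        for hyp_word, ref_word in zip(hyp_words, ref_words):
            votes[hyp_word][ref_word] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


def correct(hypothesis, table):
    """Apply the learned word-level corrections to a new hypothesis."""
    return " ".join(table.get(word, word) for word in hypothesis.split())


if __name__ == "__main__":
    pairs = build_correction_pairs(["their dog ran", "two cats sat"])
    table = train_toy_corrector(pairs)
    print(pairs)
    print(correct("there dog sat", table))
```

A lookup table ignores context, so it would also "correct" words that were recognized properly. A model that conditions on the whole hypothesis, as described in the article, is meant to avoid that limitation.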