Google’s Gboard Now Uses An On-Device Neural Network For Speech Recognition On Pixel Smartphones

Search engine giant Google has announced that its cross-platform virtual keyboard application, Gboard, now uses an end-to-end recognizer to power American English speech input on Pixel smartphones. According to the company, the new recognizer runs on the device and is always available, even when users are offline. The model works at the character level, so as a user speaks, it outputs words character by character.

Johan Schalkwyk, a Google Fellow on the Speech Team, explained that speech recognition systems previously consisted of several independently optimized components, such as an acoustic model that maps short segments of audio to phonemes and a language model that expresses the likelihood of given phrases. Around 2014, a new kind of sequence-to-sequence model emerged: a single neural network able to map an input audio waveform directly to an output sentence. It laid the foundation for more sophisticated systems with state-of-the-art accuracy, but it had a major architectural limitation: it could not support real-time voice transcription. Gboard’s new model is a Recurrent Neural Network Transducer (RNN-T), trained on second-generation Tensor Processing Units (TPUs) in Google Cloud. It supports real-time transcription because it processes input sequences and generates outputs continuously. It recognizes spoken characters one at a time, using a feedback loop that feeds the symbols predicted by the model back into the model to predict the next symbols.
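The feedback loop described above can be illustrated with a toy sketch in Python. This is not Google's model or code. The names `stream_decode`, `toy_step`, `START` and `BLANK` are hypothetical, and each "audio frame" is just a lookup table standing in for a neural network. The sketch shows only the control flow: the decoder consumes frames one at a time, emits characters as soon as they are predicted, and feeds the last predicted character back in to choose the next one.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

START = "<s>"   # hypothetical marker meaning "nothing emitted yet"
BLANK = None    # hypothetical marker meaning "no more output for this frame"

Frame = Dict[str, str]  # toy stand-in for one frame of audio features


def toy_step(frame: Frame, prev_symbol: str) -> Optional[str]:
    """Toy 'model': pick the next character given the frame and the previous symbol."""
    return frame.get(prev_symbol, BLANK)


def stream_decode(
    frames: Iterable[Frame],
    step: Callable[[Frame, str], Optional[str]],
    max_symbols_per_frame: int = 4,
) -> Iterator[str]:
    """Greedy streaming decode: emit characters while the frames are still arriving."""
    prev_symbol = START
    for frame in frames:
        # A single frame may yield zero, one, or several characters.
        for _ in range(max_symbols_per_frame):
            symbol = step(frame, prev_symbol)
            if symbol is BLANK:
                break  # nothing more to say for this frame; wait for the next one
            yield symbol
            prev_symbol = symbol  # feedback: the prediction becomes the next input


# Three toy frames: the first yields "o", the second is silence, the third yields "k".
frames = [{START: "o"}, {}, {"o": "k"}]
transcript = "".join(stream_decode(frames, toy_step))
print(transcript)
```

Because `stream_decode` is a generator, a caller can display each character as soon as it is yielded instead of waiting for the whole utterance, which is the property that makes real-time transcription possible.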

The trained RNN-T was fairly small to begin with, at 450MB, whereas traditional speech recognition engines compose acoustic, pronunciation, and language models into decoder graphs that can span multiple gigabytes. Schalkwyk and colleagues still wanted to shrink it further, which proved to be a challenge. Using quantization and other techniques, the Speech Team achieved a fourfold compression relative to the trained floating-point model and a fourfold speedup at runtime. The resulting model is 80MB and runs faster than real-time speech on a single processor core.
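To see where a fourfold size reduction can come from, consider a minimal Python sketch of 8-bit quantization. The article does not describe Google's exact scheme, so this uses generic symmetric linear quantization as an assumption, and the function names and weight values are made up for illustration. Storing each weight as an 8-bit integer instead of a 32-bit float cuts storage by a factor of four, at the cost of a small rounding error.

```python
from array import array
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[array, float]:
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Made-up example weights, stored as 32-bit floats.
weights = array("f", [0.12, -0.5, 0.33, 0.9, -0.07, 0.0, 0.25, -0.81])
quantized, scale = quantize_int8(weights)

float_bytes = len(weights) * weights.itemsize
int8_bytes = len(quantized) * quantized.itemsize
ratio = float_bytes / int8_bytes

# Rounding to the nearest step means the error is at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, dequantize(quantized, scale)))

print(f"{float_bytes} bytes -> {int8_bytes} bytes (x{ratio:.0f} smaller)")
print(f"largest rounding error: {max_error:.5f}")
```

Integer arithmetic on 8-bit values is also typically cheaper than 32-bit floating-point arithmetic on mobile processors, which is one reason quantization tends to improve speed as well as size.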