Google Improves Voice Search With New Acoustic Models

Written by Chris Crum
Google announced new improvements to voice search to make it faster and more accurate. The company has built new neural network acoustic models, which it says are a special extension of its recurrent neural networks.

These RNNs are more accurate, particularly in noisy environments, Google says, adding that they are “blazingly fast”. The company explains in a post on the Google Research blog:

    In a traditional speech recognizer, the waveform spoken by a user is split into small consecutive slices or “frames” of 10 milliseconds of audio. Each frame is analyzed for its frequency content, and the resulting feature vector is passed through an acoustic model such as a DNN that outputs a probability distribution over all the phonemes (sounds) in the model. A Hidden Markov Model (HMM) helps to impose some temporal structure on this sequence of probability distributions. This is then combined with other knowledge sources such as a Pronunciation Model that links sequences of sounds to valid words in the target language and a Language Model that expresses how likely given word sequences are in that language. The recognizer then reconciles all this information to determine the sentence the user is speaking. If the user speaks the word “museum” for example – /m j u z i @ m/ in phonetic notation – it may be hard to tell where the /j/ sound ends and where the /u/ starts, but in truth the recognizer doesn’t care where exactly that transition happens: All it cares about is that these sounds were spoken.

    Our improved acoustic models rely on Recurrent Neural Networks (RNN). RNNs have feedback loops in their topology, allowing them to model temporal dependencies: when the user speaks /u/ in the previous example, their articulatory apparatus is coming from a /j/ sound and from an /m/ sound before. Try saying it out loud – “museum” – it flows very naturally in one breath, and RNNs can capture that. The type of RNN used here is a Long Short-Term Memory (LSTM) RNN which, through memory cells and a sophisticated gating mechanism, memorizes information better than other RNNs. Adopting such models already improved the quality of our recognizer significantly.
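
To make the traditional pipeline quoted above concrete, here is a minimal sketch in Python. Nearly everything in it is an illustrative assumption rather than Google's implementation: the toy phoneme set, the log-spectrum features, and the random-weight stand-in for the DNN acoustic model. Only the 10 millisecond framing follows the post.

```python
import numpy as np

# Toy phoneme inventory (illustrative; a real model covers a full language).
PHONEMES = ["m", "j", "u", "z", "i", "@", "sil"]

SAMPLE_RATE = 16000                          # samples per second
FRAME_LEN = int(0.010 * SAMPLE_RATE)         # 10 ms frames, as in the post

def split_into_frames(waveform):
    """Split the waveform into consecutive 10 ms slices ("frames")."""
    n = len(waveform) // FRAME_LEN
    return waveform[: n * FRAME_LEN].reshape(n, FRAME_LEN)

def feature_vector(frame):
    """Analyze a frame's frequency content (stand-in for e.g. log filterbanks)."""
    return np.log1p(np.abs(np.fft.rfft(frame)))

def acoustic_model(features, weights):
    """Stand-in DNN: map a feature vector to a distribution over phonemes."""
    logits = weights @ features
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                   # softmax: probabilities sum to 1

rng = np.random.default_rng(0)
n_features = FRAME_LEN // 2 + 1              # size of the rfft output
weights = rng.standard_normal((len(PHONEMES), n_features)) * 0.01

waveform = rng.standard_normal(SAMPLE_RATE)  # 1 second of fake audio
for frame in split_into_frames(waveform):
    probs = acoustic_model(feature_vector(frame), weights)
    # A Hidden Markov Model plus the pronunciation and language models would
    # consume this per-frame sequence of distributions to decode the sentence.
```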

The technical explanation continues in the post.
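
For intuition about the LSTM gating described in the second quoted paragraph, here is a single LSTM cell step using the standard textbook equations; the tiny dimensions, random weights, and fake frame features are illustrative assumptions, not the production model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: gates control what the memory cell keeps."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)                       # gate pre-activations + candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g) # updated memory cell
    h = sigmoid(o) * np.tanh(c)                       # output, fed back next step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4                                    # toy sizes, purely illustrative
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
# Pretend these are feature vectors for frames covering /m/, /j/, /u/ ...
for frame_features in rng.standard_normal((3, n_in)):
    h, c = lstm_step(frame_features, h, c, W, b)
    # h summarizes everything heard so far, which is how the network can use
    # the preceding /m/ and /j/ context when it scores the /u/ frame.
```

The forget and input gates are the "sophisticated gating mechanism" the post mentions: they let the cell carry the /m/ and /j/ context across frames instead of overwriting it at every step.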

Google says the new acoustic models are already in use for voice searches in the Google app on both Android and iOS.

This appears to be the biggest update Google has made to voice search since 2012, when the company announced the adoption of deep neural networks in the first place, replacing a 30-year-old standard, the Gaussian Mixture Model.

Image via Google
