Speech (or voice) recognition (SR) is the ability of a computer to "understand" and interpret spoken words. With recent advances in both software and hardware, it offers an efficient and affordable alternative to traditional input devices. Researchers are also interested in natural language processing techniques as an extension of speech recognition, providing a more natural and intuitive interface. The accuracy of SR software has reached well over 90%, but don't throw your keyboard away yet: an average of ten misrecognized words out of every hundred still leaves it far from perfect. This article will give a brief overview of the technology and its practical applications. As usual, we'll start with a bit of theory and continue with practical examples.
The first attempts to build a machine that could understand human speech were made in the late 1940s at the US Department of Defense, with the obvious goal of interpreting and translating intercepted Russian transmissions. These early experiments typically used a top-down approach, trying to perform a literal word-for-word dictionary lookup. However, imagine how much time and how many computing resources were needed to record and store a representation of each word in a specific language. Even then, the mapping from symbols to speech is not one-to-one, since different underlying symbols can result in very similar speech sounds.

As it turned out, human speech recognition operates at a much lower level: that of phonemes. Phonemes are the smallest units of speech that distinguish one utterance from another. But the greatest problem lies in the fact that individual phonemes aren't particularly "well-behaved": individual speech sounds may vary depending on the sounds preceding and following the specific phoneme.

In a modern speech recognition system, the digitized stream of amplitudes of a speech signal captured by a sound card is first converted into its dominant frequency components. Each of these components is mapped to a specific phoneme, so the system can recognize words in a dictionary from the phoneme sequences that produce them. The key computation, estimating the probability of one phoneme combination following another, is based on a technique known as a Hidden Markov Model (HMM). The vast majority of commercial speech recognition algorithms are currently based on the HMM, with slight differences in probability calculations, endpoint detection schemes for continuous dictation, etc. The Hidden Markov Model Toolkit (HTK) from Cambridge University is a portable toolkit for building and manipulating Hidden Markov models. If you are interested in a more "hands-on" approach, the HTK Book provides an in-depth tutorial on building such systems.
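To give a feel for how an HMM picks the most likely phoneme sequence, here is a toy Viterbi decoder. Everything in it is invented for illustration: the two "phoneme" states, the quantized acoustic features ('low'/'high'), and all the probabilities are made-up values, not parameters of any real recognizer, which would use thousands of context-dependent states and continuous feature vectors.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    # V[s] = probability of the best path ending in state s at the current time
    V = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    path = {s: [s] for s in states}

    for obs in observations[1:]:
        new_V, new_path = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path probability
            prob, prev = max(
                (V[p] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            new_V[s] = prob
            new_path[s] = path[prev] + [s]
        V, path = new_V, new_path

    best = max(V, key=V.get)           # best final state
    return path[best], V[best]

# Toy model: two phoneme states, acoustic features quantized to 'low'/'high'
states = ("k", "ae")
start_p = {"k": 0.6, "ae": 0.4}
trans_p = {"k": {"k": 0.3, "ae": 0.7}, "ae": {"k": 0.4, "ae": 0.6}}
emit_p = {"k": {"low": 0.8, "high": 0.2}, "ae": {"low": 0.3, "high": 0.7}}

sequence, prob = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
print(sequence)  # -> ['k', 'ae', 'ae']
```

The decoder keeps, for each state, only the single best path reaching it so far, which is why it runs in time linear in the length of the observation sequence rather than exploring every possible phoneme combination.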
The final chapter of the tutorial describes the construction of a recognizer for simple voice dialing applications (14 complex steps described on 23 pages). You'll quickly see that creating a new speech recognizer from scratch is an extremely difficult and time-consuming process, not to mention that the result is entirely language dependent. OK, so you have built a recognizer for the English language, but what about the hundreds of other languages?