Writing a voice-enabled application has recently become a fairly straightforward task, thanks to advances in computer science, linguistics, signal processing and even psychology. One of the key elements of such applications is the Text-to-Speech (TTS), or speech synthesis, engine. The ability to convert text into understandable and intelligible spoken words and sentences is essential for every application that requires spontaneous, human-centered interaction. On the other hand, navigating thousands of rules of pronunciation and inflection requires a lot of processing power: to put it simply, the entire human vocal tract must be modeled and mimicked for a TTS engine to match the quality of a real human speaker.
There are two basic approaches to speech synthesis: formant synthesis, which creates entirely digitized, synthetic speech from scratch; and concatenation, where actual prerecorded human voice segments are stored and combined to convert text into speech. The first method has a relatively small CPU and memory footprint, and has the advantage of being adaptable to different languages, since the pitch and duration of words may be easily varied. The sound quality produced with this approach is generally inferior, however, and the generated speech sounds a bit robotic. The concatenative approach stores prerecorded fragments of actual human speech in databases, joining them as needed to produce full words and sentences. The length of the individual fragments varies: the smallest unit of speech that distinguishes one utterance from another is called a phoneme. However, individual speech sounds may vary depending on the sounds preceding and following a specific phoneme. Longer speech units also decrease the density of concatenation points, thus providing better speech quality. Diphones, units that begin in the middle of the stable state of one phoneme and end in the middle of the following one, are often chosen as a solution to this problem. Even larger units of speech, such as triphones, tetraphones and even whole words, are used in the newest generation of TTS engines, requiring larger databases and more efficient storage and retrieval methods.
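The idea behind diphone selection can be sketched in a few lines of code. This is a simplified illustration, not any engine's actual unit-selection logic: the phoneme symbols and silence markers are hypothetical, and a real system would index recorded audio for each diphone rather than just name the units.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor. Each join between
    units then falls mid-phoneme, in the acoustically stable
    region, rather than at a volatile phoneme boundary."""
    return [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]

# "hello" as a rough phoneme sequence, padded with silence markers
print(to_diphones(["sil", "hh", "eh", "l", "ow", "sil"]))
# → ['sil-hh', 'hh-eh', 'eh-l', 'l-ow', 'ow-sil']
```

Note that a sequence of n phonemes yields n-1 diphones, so the database must cover every phoneme pair that can occur in the language, which is why diphone inventories are much larger than phoneme inventories.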
No matter which method is used, every TTS engine generally contains two basic modules: a Natural Language Processing (NLP) module that produces a phonetic transcription of the written text, and a Digital Signal Processing (DSP) module that converts the output of the NLP section into spoken words. The process of speech synthesis starts with a step called text normalization, which defines how each word is to be spoken. Remember that words that appear identical in plain written text don't necessarily share the same pronunciation (like read, which can be pronounced red or reed depending on the context). Some words have to be expanded or even replaced: numbers, abbreviations, dates, times, acronyms, etc. Numbers are especially good candidates for context analysis: a good TTS engine for the US market will "know" that 556-9872 is probably a phone number and won't read it as five hundred fifty-six... Once we have an unambiguous set of words, control is passed to the phoneme converter, which tries to look up each word in a pronunciation database or apply letter-to-sound rules. But even the best engines with elaborate databases have to rely on exception dictionaries, which store words that defy all other rules of pronunciation. And that's not all - the hardest part, prosody generation, still lies ahead. The term prosody refers to the rises and falls of pitch, loudness and syllable length. Generated speech preserves much of its perceived naturalness only if this step is performed correctly; otherwise we are left with a monotone, boring sound that can be very tiring over longer periods.
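The front half of that pipeline, normalization followed by phoneme conversion, can be sketched as below. All the dictionary entries and phoneme symbols here are toy placeholders invented for illustration; a production normalizer would use context analysis (recognizing phone numbers, dates and so on), and real letter-to-sound rules are far more elaborate than the one-symbol-per-letter fallback used here.

```python
import re

# Toy pronunciation lexicon and exception dictionary (hypothetical entries)
LEXICON = {"read": ["r", "iy", "d"], "the": ["dh", "ah"]}
EXCEPTIONS = {"colonel": ["k", "er", "n", "ah", "l"]}

def normalize(text):
    """Expand digit strings into words, one step of text
    normalization. Reading every number digit by digit is the
    crude version of what a context analyzer decides per case."""
    digits = "zero one two three four five six seven eight nine".split()
    expand = lambda m: " ".join(digits[int(d)] for d in m.group())
    return re.sub(r"\d+", expand, text.lower())

def to_phonemes(word):
    """Exception dictionary first, then the pronunciation
    database, then a naive letter-to-sound fallback."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return LEXICON[word]
    return list(word)  # stand-in for real letter-to-sound rules

print(normalize("call 911"))  # → "call nine one one"
```

The lookup order matters: exceptions must shadow both the general database and the rules, since words like colonel defy any regular spelling-to-sound mapping.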
Advances in both hardware and speech generation algorithms have recently resulted in several TTS engines capable of generating speech that is almost indistinguishable from a human speaker. Lernout & Hauspie RealSpeak is based on concatenation algorithms, and is ideal for industries, such as telephony, that require a high-quality voice. L&H RealSpeak joins the company's family of TTS products, which includes the TTS3000 and L&H TruVoice engines, each with different CPU and memory requirements and feature sets. The new (5.0) version of the L&H Voice Xpress SDK gives software developers the ability to easily voice-enable their applications using Microsoft's Component Object Model (COM) and ActiveX technology, along with the speech recognition and generation functionality defined by the Speech Application Programming Interface (SAPI). The new transcription feature allows users to record their dictation to a wave file and have the text, alternatives and wave file stored together. Users can select parts of the text and play back the selection in their own recorded voice, make corrections by speech, or pass the files to a third party for correction. The list of currently supported languages includes US and UK English, French, Dutch and Spanish, with vocabularies for different application domains.