Voice recognition: a surge in audio electronics technologies

Date: 08/10/2019

Voice is a natural and easy way to communicate, not only with fellow humans but also with the physical things around us, including gadgets. Controlling electronics by voice has long been an attractive idea. Ever since electronic computing systems were born, efforts have been made to control them by speaking to them, but the results were rarely acceptable or satisfying. It is only in the last 10 years that there have been significant improvements in communicating reliably with computers and electronic systems through voice, and much of that can be attributed to the success of Apple's Siri. Although the touch interface widely used in electronic systems is a more human way of interacting than a keyboard, speaking to a system is still far more natural than either. Earlier, with the computing and communication resources then available, it was extremely complex to design speech recognition technologies with acceptable accuracy. Since the successful takeoff of Apple's Siri, a good number of speech recognition software applications have become available, working either standalone or dependent on the cloud. Standalone packages consume significantly more computing power and memory than cloud-powered voice recognition systems.

Market: Audio has been a big driver of the electronics market, from early electrical recording to today's smart speakers. Voice recognition technology is now giving audio electronics another surge in market growth. Market researcher Yole reports that the market for MEMS microphones and ECMs (electret condenser microphones), micro speakers and audio ICs will be worth US$20 billion in 2022. Biometric passwords that combine the camera and the microphone are set to proliferate everywhere. Audio electronics was earlier limited to the entertainment and communication market segments, but it is now also used for control and authentication.
The trend now is to design customized silicon chips that run speech recognition algorithms in hardware-accelerated mode. That is about processing digital audio, but a lot is also happening even before the audio is digitized. Voice recognition is basically a real-time search of a database for patterns in the time, frequency and other such variables of the acoustic signal. There has been significant growth in the technologies available to design speech recognition systems cost-effectively at the edge itself.
Let's look at some of the technologies driving speech recognition and the challenges it faces. Noise is present in nearly every voice communication: inside a home, inside a car, on a factory floor, there is noise everywhere at varying levels. Often several people are also speaking at the same time. We humans still listen better than machines in noisy situations. All these challenges are opportunities for engineers working in this market, and voice recognition developers are trying to imitate how the human brain and ears work.

1. Using multiple microphones: Each person has two ears, and our brain processes the input from both ears at the same time, focusing on one particular conversation while filtering out noise and other conversations. By using more than one microphone, a technique called beamforming can be employed to achieve higher hearing accuracy. In a delay-and-sum beamformer, the microphone signals are time-aligned to compensate for the delays introduced by the different paths the sound waves take to reach each microphone. The aligned signals are then summed, so that sound from the target direction adds up coherently while noise from other directions is attenuated. On top of this, a whole range of DSP techniques is used to filter out unwanted signals.

2. Enhancing the quality of the analog front end: In far-field voice recognition, the microphone preamplifier must be sensitive enough to amplify the signal so that no signature or pattern is lost. The bandwidth and frequency response of the amplifiers should be such that neither the low-frequency nor the high-frequency content of the voice signal is lost, and the gain stays the same across the whole bandwidth. Most important is how well the amplifier and other analog front-end circuits keep out different types of noise. The preprocessing of audio before the DSP involves adaptive spectral noise reduction, multiple source selection and dynamic range control.

3. Speech detection: Circuits are required that can determine whether a signal is human speech. Such a circuit needs to reject microphone signals generated by breathing, by air from a fan and also by music. It is a combination of analog and digital filters that eliminate a set of frequencies.

4. Wake word engine: Most circuits and systems need to be in sleep mode when no action is required. When an action is required, they need to be woken from sleep by making them sensitive to a particular word or words. For that, a wake word engine is required, which monitors the microphone output for the particular electrical signature of those words. Cloud-based wake word verification is the most popular approach now, but an offline wake word engine is also worthwhile, and it is more reliable when the network is not available. Leading cloud service providers and some open-source programming environments provide easy options to integrate a wake word feature into voice recognition systems.

5. DSP and readily available algorithms: DSP is central to speech recognition. A continuous stream of data is digitally processed to detect the presence of speech, and once speech is detected, it is stored. A set of processing algorithms then compares the speech stored in the buffer memory with the database. Leading DSP chip and IP vendors provide ready-to-use algorithms and code to implement voice recognition quickly. Algorithms are also required to identify spikes in the audio channel, continuous repetitive noise, a sneeze from the speaker or a sudden loud noise, so that they don't pass through the speech detection algorithm.
The trend now is to use customized silicon with artificial intelligence (AI) capabilities to process voice in a more natural way. AI voice processing chips also process faster and consume less power. Combining a little analog computing with neural-network-style computing makes these systems behave much more like human listeners.

6. Voice biometrics: Your voice is the latest signature or password. Everyone has a unique voice, and its amplitude-versus-time and frequency-energy representations are much like a handwritten signature: no two people have the same one. For this to be effective, voice recognition needs to be done in a very detailed manner that leaves no scope for cheating. Voice biometrics is already in use in banking and financial interactions and is fast expanding to other areas. In terms of market growth, it is quite a hot area.
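Item 1 describes delay-and-sum beamforming. The following is a minimal Python sketch, not a production implementation, assuming a far-field source (a plane wave), microphones on a straight line at known positions, and delays rounded to whole samples; real designs typically use fractional-delay filters or frequency-domain steering. The function names `steering_delays` and `delay_and_sum` are hypothetical. The steering delay for a microphone at position x, for a wave arriving at angle θ from broadside, follows from the extra path length x·sin θ divided by the speed of sound.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 °C


def steering_delays(mic_x: Sequence[float], angle_deg: float,
                    sample_rate: int) -> List[int]:
    """Per-microphone delays (whole samples) that time-align a plane wave
    arriving from angle_deg (0 = broadside) on a linear array whose
    microphones sit at positions mic_x (metres) along one axis.
    Assumption: far-field source, integer-sample delays."""
    s = math.sin(math.radians(angle_deg))
    # A larger x * sin(theta) means the microphone hears the wave earlier.
    lead = [x * s / SPEED_OF_SOUND_M_S for x in mic_x]
    latest = min(lead)
    # Delay each channel by how much earlier it heard the wave than the
    # last microphone to hear it, so all copies line up in time.
    return [round((t - latest) * sample_rate) for t in lead]


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Delay channel m by delays[m] samples (zero-filling the start), then
    average all channels sample by sample. Channels must be equal length."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for i in range(d, n):
            out[i] += ch[i - d]
    return [v / len(channels) for v in out]
```

For example, with two microphones 10 cm apart sampled at 16 kHz and a source along the array axis (90°), the nearer microphone is delayed by 5 samples before the channels are averaged. Sound from the steered direction adds coherently, while sound from other directions stays misaligned and partially cancels.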
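Item 2 lists dynamic range control as one of the preprocessing steps. One common form is a compressor, which turns down loud passages so that, for instance, a nearby loud talker and a quieter distant one reach the recognizer at more similar levels. The sketch below is a simple feed-forward peak compressor in Python; the function name `compress`, the threshold, the ratio and the attack/release times are illustrative assumptions, not values from any particular product.

```python
import math
from typing import List, Sequence


def compress(samples: Sequence[float], sample_rate: int = 16000,
             threshold_db: float = -20.0, ratio: float = 4.0,
             attack_s: float = 0.01, release_s: float = 0.1) -> List[float]:
    """Feed-forward peak compressor (illustrative parameters): above
    threshold_db, every `ratio` dB of input level yields 1 dB of output."""
    a_att = math.exp(-1.0 / (attack_s * sample_rate))
    a_rel = math.exp(-1.0 / (release_s * sample_rate))
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)
        # One-pole envelope follower: fast when the level rises, slow when it falls.
        coeff = a_att if level > env else a_rel
        env = coeff * env + (1.0 - coeff) * level
        level_db = 20.0 * math.log10(max(env, 1e-9))
        if level_db > threshold_db:
            # Static compression curve applied to the part above the threshold.
            gain_db = threshold_db + (level_db - threshold_db) / ratio - level_db
        else:
            gain_db = 0.0  # quiet signals pass through unchanged
        out.append(x * 10.0 ** (gain_db / 20.0))
    return out
```

With the defaults, a steady full-scale input settles to about -15 dB (the -20 dB threshold plus 20 dB of overshoot divided by the ratio of 4), while signals below -20 dB pass unchanged. The envelope needs a few milliseconds to react, so the very first samples of a sudden loud sound get through; production designs often add a look-ahead delay to catch them.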
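Item 3 describes deciding whether a signal is human speech. A classic and much simpler stand-in for the filter-based approach mentioned there is a frame-level detector using short-time energy and zero-crossing rate: voiced speech is reasonably loud and dominated by low frequencies, while hiss-like noise such as breath or fan air crosses zero far more often. The Python sketch below assumes 16 kHz audio split into 20 ms frames (320 samples); the function names and thresholds are hypothetical and would need tuning. It does not reject music, which needs richer spectral features.

```python
from typing import List, Sequence, Tuple


def frame_features(frame: Sequence[float]) -> Tuple[float, float]:
    """Mean energy and zero-crossing rate (sign changes per sample) of one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)


def detect_speech(samples: Sequence[float], frame_len: int = 320,
                  energy_threshold: float = 1e-3,
                  max_zcr: float = 0.25) -> List[bool]:
    """Label each non-overlapping frame True if it looks like voiced speech:
    loud enough, and not dominated by hiss-like high-frequency content.
    Thresholds are hypothetical placeholders."""
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy, zcr = frame_features(samples[start:start + frame_len])
        labels.append(energy > energy_threshold and zcr < max_zcr)
    return labels
```

Silence fails the energy test, and broadband hiss fails the zero-crossing test, so only loud, low-frequency frames are passed on as candidate speech.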
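Item 4 describes a wake word engine that watches the microphone output for the signature of a particular word. Modern engines usually run a small neural network, but the underlying idea of matching an incoming pattern against a stored one can be illustrated with dynamic time warping (DTW), a classic template-matching technique that tolerates the word being spoken faster or slower. In this Python sketch, each frame is assumed to be a feature vector (for example, MFCCs computed elsewhere), and `is_wake_word` with its `threshold` of 0.2 is a hypothetical decision rule.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one feature vector per audio frame (assumed, e.g. MFCCs)


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Dynamic time warping distance between two feature sequences,
    normalised by their combined length."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            # Cheapest way to reach (i, j): stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def is_wake_word(features: Sequence[Frame], template: Sequence[Frame],
                 threshold: float = 0.2) -> bool:
    """Hypothetical rule: wake only if the frames match the stored template."""
    return dtw_distance(features, template) <= threshold
```

A system would run this check on a sliding window of recent frames while the rest of the device sleeps, and wake the main processor only when the distance falls below the threshold.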
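Item 5 mentions algorithms that stop spikes, sneezes and sudden loud noises from reaching the speech detector while tolerating continuous background noise. One simple heuristic, sketched below in Python under stated assumptions, compares each frame's energy with the median energy of recent frames: a steady hum raises the median and is ignored, while a brief burst far above it is flagged. The function names and the `history`, `ratio` and `max_len` parameters (counted in frames) are hypothetical; a sustained rise, as when someone starts talking, is deliberately left unflagged.

```python
import statistics
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean energy of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def flag_transients(energies: Sequence[float], history: int = 8,
                    ratio: float = 10.0, max_len: int = 2) -> List[bool]:
    """Flag short bursts (at most max_len frames) whose energy exceeds
    `ratio` times the median energy of the preceding `history` frames."""
    flags = [False] * len(energies)
    i = 0
    while i < len(energies):
        past = energies[max(0, i - history):i]
        if not past:
            i += 1
            continue
        # The median tracks steady background noise (hum, fan) but ignores spikes.
        limit = ratio * max(statistics.median(past), 1e-12)
        if energies[i] <= limit:
            i += 1
            continue
        # Measure how long the energy stays above the limit.
        j = i
        while j < len(energies) and energies[j] > limit:
            j += 1
        # Brief bursts (click, bang, sneeze) are flagged; long ones may be speech.
        if j - i <= max_len:
            for k in range(i, j):
                flags[k] = True
        i = j
    return flags
```

Flagged frames would then be skipped or muted before the speech detection stage.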
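Item 6 describes the voice as a signature. A common way to implement verification, assumed here rather than taken from the article, is to convert each utterance into a fixed-length speaker embedding with a trained model (not shown), average a few enrollment embeddings into a stored voiceprint, and accept a new utterance only if its cosine similarity to the voiceprint reaches a threshold. The Python sketch below covers only the comparison step; the function names are hypothetical and the 0.8 threshold is an arbitrary placeholder.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average several enrollment embeddings into one stored voiceprint."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def verify(voiceprint: Sequence[float], embedding: Sequence[float],
           threshold: float = 0.8) -> bool:
    """Accept the speaker if the new embedding points the same way as the voiceprint."""
    return cosine_similarity(voiceprint, embedding) >= threshold
```

Resisting cheating, as the article stresses, also requires anti-spoofing checks such as detecting replayed recordings, which this sketch does not attempt.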