What is Automated Speech Recognition?
Automatic Speech Recognition, or ASR for short, is the technology that allows people to use their voices to speak with a computer interface in a way that, in its most sophisticated variants, resembles normal human conversation.
The most advanced ASR technologies currently in development revolve around Natural Language Processing, or NLP. This variant of ASR comes closest to allowing real conversation between people and machines. Although it still has a long way to go before reaching its apex of development, we are already seeing remarkable results in the form of intelligent smartphone interfaces such as Siri on the iPhone, as well as systems used in business and advanced technology contexts.
How does Automated Speech Recognition work?
The process of automatic speech recognition looks like this:
- The person speaks, and the ASR solution detects speech.
- The machine then creates an audio file containing the words it detected.
- Because the file also contains unnecessary data such as background noise, it is cleaned up.
- At the next stage, the system breaks the speech into phonemes and sequences of these phonemes.
- Finally, the software analyzes these sequences, tries to determine the word, and then combines multiple words in sentences.
This is the process in layman's terms. The actual process is quite tricky, as it consists of a series of subtasks such as speech segmentation, acoustic modeling, and language modeling, which together form a prediction (of sequences of labels) from noisy, unsegmented input data.
Luckily, the introduction of Connectionist Temporal Classification (CTC) removed the need for pre-segmented data and allowed the network to be trained end-to-end directly for sequence labeling tasks like ASR.
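At the core of CTC is computing the total probability of every frame-level alignment that collapses to the target label sequence, which the forward algorithm does efficiently without pre-segmented data. Here is a minimal NumPy sketch of that forward recursion; the toy vocabulary, frame probabilities, and the convention of putting the blank symbol at index 0 are illustrative assumptions, not part of any particular library's API:

```python
import numpy as np

def ctc_forward(probs, target, blank=0):
    """Total probability of all frame-level paths that collapse to `target`.

    probs:  (T, C) array of per-frame symbol probabilities.
    target: label indices without blanks, e.g. [1, 2].
    """
    T = len(probs)
    ext = [blank]
    for c in target:                  # interleave blanks: _ c1 _ c2 _
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))          # alpha[t, s]: prob of prefixes ending at ext[s]
    alpha[0, 0] = probs[0][ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                          # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]                 # advance by one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                 # skip the blank between labels
            alpha[t, s] = a * probs[t][ext[s]]
    # valid paths end on the last label or on the trailing blank
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# Toy example: 4 frames, vocabulary {blank, 'a', 'b'}, target "ab"
probs = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.2, 0.2]])
print(ctc_forward(probs, [1, 2]))
```

Training maximizes this probability (in practice, its log) with respect to the network outputs; frameworks expose the same computation as a loss function.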
As a result, a CTC-based ASR pipeline consists of the following blocks:
1. Feature extraction
Audio signal preprocessing using normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC).
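A minimal feature-extraction sketch in NumPy: frame the signal, apply a Hann window, and take the log-magnitude spectrum of each frame. The 400-sample (25 ms) frames and 160-sample (10 ms) hop at 16 kHz are common choices assumed here for illustration:

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Frame the signal, apply a Hann window, return log-magnitude spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft yields frame_len // 2 + 1 frequency bins per frame
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectra + eps)      # eps avoids log(0)

# One second of fake 16 kHz audio -> a (98, 201) time-frequency matrix
audio = np.random.randn(16000)
feats = log_spectrogram(audio)
print(feats.shape)
```

A mel-scale spectrogram or MFCCs would add a mel filter bank (and, for MFCCs, a discrete cosine transform) on top of this same framing step.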
2. Acoustic Model
A CTC-based network that predicts a probability distribution P_t(c) over vocabulary characters c at each time step t.
3. Decoder
- Greedy (argmax): This is the simplest decoding strategy. The letter with the highest probability (from the temporal softmax output layer) is chosen at each time step, without regard to any semantic understanding of what was being communicated. Then, repeated characters are collapsed, and blank tokens are discarded.
- Language model: A language model can be used to add context and thereby correct mistakes made by the acoustic model. A beam search decoder weighs the relative probabilities of the softmax output against the likelihood of certain words appearing in context, and determines what was spoken by combining what the acoustic model thinks it heard with what is a likely next word.
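The greedy strategy above fits in a few lines: take the argmax at each time step, collapse repeats, and drop blanks. The toy vocabulary and probability matrix below are made up for illustration (index 0 is assumed to be the CTC blank):

```python
import numpy as np

def ctc_greedy_decode(logits, vocab, blank=0):
    """Greedy CTC decoding: argmax per step, collapse repeats, drop blanks."""
    best = np.argmax(logits, axis=1)      # most likely symbol at each time step
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # collapse repeats, skip blanks
            decoded.append(vocab[idx])
        prev = idx
    return "".join(decoded)

# Toy example: 5 time steps over the vocabulary {blank, a, c, t}
vocab = ["_", "a", "c", "t"]
logits = np.array([
    [0.1, 0.1, 0.7, 0.1],    # c
    [0.1, 0.1, 0.6, 0.2],    # c (repeat -> collapsed)
    [0.8, 0.1, 0.05, 0.05],  # blank (separates symbols)
    [0.1, 0.7, 0.1, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
])
print(ctc_greedy_decode(logits, vocab))  # -> "cat"
```

A beam search decoder with a language model keeps several candidate prefixes per step instead of just the argmax, rescoring each by the product of acoustic and language-model probabilities.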
What are the different types of Automated Speech Recognition systems?
Speech Recognition Systems can be categorized into different groups depending on the constraints imposed on the nature of the input speech.
1. Number of speakers
A system is said to be speaker-independent if it can recognize speech from any speaker; such a system has learned the characteristics of a large number of speakers. A speaker-dependent system, in contrast, requires a large amount of a single user's speech data for training, and does not recognize other speakers' speech well. Speaker-adaptive systems start out as speaker-independent systems but can adapt to the voice of a new speaker, provided a sufficient amount of that speaker's speech is available for training. Popular dictation systems are speaker-adaptive.
2. Nature of the utterance
A user is required to utter words with clear pauses between words in an Isolated Word Recognition system. A Connected Word Recognition system can recognize words, drawn from a small set, spoken without the need for a pause between words. On the other hand, Continuous Speech Recognition systems recognize sentences spoken continuously. A spontaneous speech recognition system can handle speech disfluencies such as “ah,” “um,” false starts, or grammatical errors present in a conversational speech. A Keyword Spotting System keeps looking for a prespecified set of words and detects the presence of any one of them in the input speech.
3. Vocabulary size
An ASR system that can recognize a small number of words (say, 10 digits) is called a small vocabulary system. Medium vocabulary systems can recognize a few hundred words. Large and very large ASR systems are trained with several thousand and tens of thousands of words, respectively. Examples of application domains for small, medium, and very large vocabulary systems are telephone/credit card number recognition, command and control, and dictation, respectively.
4. Spectral bandwidth
The bandwidth of telephone/mobile channels is limited to 300-3400 Hz, so these channels attenuate frequency components outside this passband. Speech of this kind is called narrowband speech. In contrast, normal speech that does not pass through such a channel is called wideband speech; it contains a wider spectrum, limited only by the sampling frequency. As a result, the recognition accuracy of ASR systems trained with wideband speech is better. Moreover, an ASR system trained with narrowband speech performs poorly on wideband speech, and vice versa.
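The effect of the telephone passband can be simulated by zeroing frequency components outside 300-3400 Hz. The sketch below uses an ideal (brick-wall) FFT filter rather than a realistic channel model, and the test tones are made up for illustration:

```python
import numpy as np

def to_narrowband(wideband, sr=16000, low=300.0, high=3400.0):
    """Simulate a telephone channel with an ideal band-pass filter (FFT zeroing)."""
    spectrum = np.fft.rfft(wideband)
    freqs = np.fft.rfftfreq(len(wideband), d=1.0 / sr)
    spectrum[(freqs < low) | (freqs > high)] = 0.0    # kill out-of-band components
    return np.fft.irfft(spectrum, n=len(wideband))

sr = 16000
t = np.arange(sr) / sr
tone_out = np.sin(2 * np.pi * 100 * t)    # 100 Hz: outside the telephone passband
tone_in = np.sin(2 * np.pi * 1000 * t)    # 1 kHz: inside it

energy = lambda x: float(np.sum(x ** 2))
print(energy(to_narrowband(tone_out)) / energy(tone_out))  # ~0: removed
print(energy(to_narrowband(tone_in)) / energy(tone_in))    # ~1: preserved
```

Training data for a narrowband ASR system is often produced exactly this way, by band-limiting (and downsampling) wideband recordings.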
What is Automated Speech Recognition used for?
ASR is being used in a variety of industries, such as higher education, legal, finance, government, health care, and media where conversations are continuous and often need to be tracked or recorded word for word.
- Legal: In legal proceedings, it's crucial to capture every word, and there's currently a shortage of court reporters. Digital transcription and the ability to scale are key solutions provided by ASR technology.
- Higher education: ASR allows universities to provide captions and transcriptions to students navigating hearing loss or other disabilities in classrooms. It can also serve the needs of students who are non-native speakers, commuters, or who have varying learning needs.
- Health care: Doctors are utilizing ASR to transcribe notes from meetings with patients or document steps during surgeries.
- Media: Media production companies use ASR to provide live captions and transcriptions for the content they produce, as required by the FCC and other guidelines.
- Corporate: Companies are utilizing ASR for captioning and transcription to provide more accessible training materials and create inclusive environments for employees with differing needs.