Building an automatic speech recognition (ASR) system that works in adverse conditions is challenging. ASR performance is high in clean environments, but variability due to the speaker, the transmission channel, and environmental conditions degrades recognition. One way to make an ASR system more robust is to use multiple sources of information about speech. In this work, two additional sources of information are used to build a multimodal ASR system: throat microphone speech and visual lip reading, both of which are less susceptible to noise, act as alternate sources of information. Mel-frequency cepstral features are extracted from the throat signal and modeled by an HMM. Pixel-based transformation methods (DCT and DWT) are used to extract features from the visemes in the video data, which are also modeled by an HMM. Throat and visual features are combined at the feature level. An English digit database is used for the study, and experiments are carried out for both the unimodal and the combined systems. The combined features of the normal and throat microphones give 86.5% recognition accuracy. Visual speech features combined with the normal microphone give 84% accuracy. The proposed system, which combines normal, throat, and visual features, reaches 94% recognition accuracy, which is better than the unimodal and bimodal ASR systems. © Springer Nature Singapore Pte Ltd. 2019.
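Feature-level combination of two streams can be illustrated with a minimal Python sketch. The abstract does not say how the audio and video frame rates are aligned, so the nearest-frame repetition used here is an assumption, and the function names (`upsample_nearest`, `fuse_feature_level`) and the toy numbers are hypothetical rather than taken from the paper.

```python
from typing import List

Frame = List[float]


def upsample_nearest(frames: List[Frame], target_len: int) -> List[Frame]:
    """Repeat frames of the slower stream so that it has target_len frames.

    Assumption: nearest-frame repetition; the paper may align streams differently.
    """
    if not frames:
        raise ValueError("need at least one frame to upsample")
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def fuse_feature_level(audio: List[Frame], visual: List[Frame]) -> List[Frame]:
    """Concatenate each audio frame with the time-aligned visual frame."""
    aligned = upsample_nearest(visual, len(audio))
    return [a + v for a, v in zip(audio, aligned)]


# Toy data (hypothetical): 4 throat-microphone frames with 3 coefficients each,
# and 2 video frames with 2 transform coefficients each.
throat_mfcc = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.3, 0.4, 0.5], [0.4, 0.5, 0.6]]
lip_dct = [[1.0, 2.0], [3.0, 4.0]]

fused = fuse_feature_level(throat_mfcc, lip_dct)
print(len(fused), len(fused[0]))  # 4 fused frames, 5 coefficients per frame
```

The fused vectors would then be modeled by a single HMM per word, in the same way as the unimodal features.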

Nayeemulla Khan

Computer Science

School of Computer Science and Engineering

Chennai Campus

N. Radha

A. Shahina

Vellore Institute of Technology (VIT) is a private university located in Tamil Nadu, India. Founded in 1984 as Vellore Engineering College, the institution offers 20 undergraduate, 34 postgraduate, four integrated and four research programs. It has campuses in Vellore, Amravati, Bhopal and Chennai.

VIT is one of the top-ranked private universities in India according to the NIRF, THE and QS rankings. The Government of India has recognized VIT, Vellore as an Institution of Eminence. This has allowed VIT to take independent quality initiatives and move up in world rankings.

VIT University

Improving Recognition of Speech System Using Multimodal Approach

International Conference on Innovative Computing and Communications Lecture Notes in Networks and Systems

Pattern Recognition Letters

An analysis of the effect of combining standard and alternate sensor signals on recognition of syllabic units for multimodal speech recognition

Objectives: This paper proposes a method to improve the performance of a Visual Speech Recognition (VSR) system by combining pixel-based and geometry-based features, so as to augment the performance of audio-based Automatic Speech Recognition (ASR) systems in adverse conditions. Methods/Statistical Analysis: A video database comprising 11000 utterances of isolated words, collected from 20 speakers, is used in this study. Pixel-based features (DCT and DWT) and geometric features (Active Shape Model, or ASM) are fused at two levels: the feature level and the decision level. A simple Gaussian mixture HMM word model is built for feature-level fusion, while a two-stream HMM model is built for decision-level fusion. Findings: The VSR system built using the combined features shows a significant improvement in performance compared to the individual VSR systems built using pixel-based or geometric features. The accuracy of the individual systems is 76% for geometric features, 64% for DCT and 72% for DWT pixel-based features. For combined features, accuracy improves to 80% for ASM+DCT and 84.7% for ASM+DWT. Weighted decision-level fusion results in further improvement, with an accuracy of 84% for ASM+DCT and 92% for ASM+DWT. Application/Improvements: The combined VSR system could be preferred over individual pixel-based or geometric-feature-based systems to augment the performance of audio-based ASR systems in adverse conditions. Further studies on improving the VSR system, which could be used in lieu of audio-based ASR systems in adverse situations, are being carried out.
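Weighted decision-level fusion can be sketched as a weighted sum of per-word scores from two independently trained streams. This is only an illustration under stated assumptions: the abstract gives neither the weighting formula nor the weights, so the convex combination of log-likelihoods, the function name `fuse_decision_level`, and all scores below are hypothetical.

```python
from typing import Dict


def fuse_decision_level(
    stream_a: Dict[str, float],
    stream_b: Dict[str, float],
    weight_a: float,
) -> str:
    """Return the word with the highest weighted sum of log-likelihoods.

    weight_a is the weight of stream A; stream B receives 1 - weight_a.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if stream_a.keys() != stream_b.keys():
        raise ValueError("both streams must score the same vocabulary")
    combined = {
        word: weight_a * stream_a[word] + (1.0 - weight_a) * stream_b[word]
        for word in stream_a
    }
    return max(combined, key=combined.get)


# Hypothetical per-word log-likelihoods from a geometric (ASM) stream
# and a pixel-based (DWT) stream for one utterance.
asm_scores = {"one": -120.0, "two": -110.0, "three": -135.0}
dwt_scores = {"one": -95.0, "two": -118.0, "three": -130.0}

print(fuse_decision_level(asm_scores, dwt_scores, weight_a=0.5))
print(fuse_decision_level(asm_scores, dwt_scores, weight_a=1.0))
```

With equal weights the DWT stream's strong preference for "one" wins; with all weight on the ASM stream the decision falls back to that stream's own best word. In practice the weight would be tuned on held-out data.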

Indian Journal of Science and Technology

An Improved Visual Speech Recognition of Isolated Words using Combined Pixel and Geometric Features

This paper presents a person identification system that combines recognition of facial features with recognition of a spoken word using visual features alone. It incorporates a face recognition algorithm to identify the person, followed by spoken word recognition of a 'lip-read' password. For face recognition, PCA is used for feature extraction, followed by KNN-based classification on the reduced-dimensionality features. Spoken word recognition of passwords is performed using visual lip reading (visual ASR). The visual features corresponding to the spoken word are extracted using DWT and then recognized using an HMM-based approach. Since the evidence from face recognition and visual lip reading could be complementary in nature, the scores from the two modalities are combined. Person identification is then decided on the basis of the combined evidence. The performance for face identification is 90%, while the accuracy for visual speech recognition is 72%. By combining these evidences, an improved accuracy of 98% is achieved. © 2015 IEEE.
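The KNN step on reduced-dimensionality features can be sketched as follows. The sketch assumes the PCA projection has already been applied upstream; the Euclidean distance, the value of k, the function name `knn_classify`, and the toy gallery are hypothetical choices, not details given in the abstract.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple


def knn_classify(
    query: Sequence[float],
    gallery: List[Tuple[Sequence[float], str]],
    k: int = 3,
) -> str:
    """Label a query vector by majority vote among its k nearest gallery vectors."""
    if k < 1 or k > len(gallery):
        raise ValueError("k must be between 1 and the gallery size")
    # Sort gallery entries by Euclidean distance to the query and keep the k closest.
    nearest = sorted(gallery, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D projections of face images for two enrolled people.
gallery = [
    ([0.0, 0.0], "alice"), ([0.1, 0.2], "alice"), ([0.2, 0.1], "alice"),
    ([1.0, 1.0], "bob"), ([0.9, 1.1], "bob"), ([1.1, 0.9], "bob"),
]

print(knn_classify([0.15, 0.1], gallery))
print(knn_classify([1.0, 0.95], gallery))
```

An odd k avoids tied votes when only two identities are enrolled; with more identities a tie-breaking rule would need to be chosen explicitly.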

2015 International Conference on Computing and Network Communications (CoCoNet)

A person identification system combining recognition of face and lip-read passwords

2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT)

Improving recognition of syllabic units of Hindi language using combined features of Throat Microphone and Normal Microphone speech

Interspeech 2016

Feature-Level Decision Fusion for Audio-Visual Word Prominence Detection

Journal	International Conference on Innovative Computing and Communications Lecture Notes in Networks and Systems
Publisher	Springer Singapore
ISSN	2367-3370
Open Access	0