TY - GEN
T1 - A real-time prototype for small-vocabulary audio-visual ASR
AU - Connell, J. H.
AU - Haas, N.
AU - Marcheret, E.
AU - Neti, C.
AU - Potamianos, G.
AU - Velipasalar, S.
N1 - Publisher Copyright:
© 2003 IEEE.
PY - 2003
Y1 - 2003
N2 - We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal face, full frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On the average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
AB - We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal face, full frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On the average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
UR - http://www.scopus.com/inward/record.url?scp=84908605747&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84908605747&partnerID=8YFLogxK
U2 - 10.1109/ICME.2003.1221655
DO - 10.1109/ICME.2003.1221655
M3 - Conference contribution
AN - SCOPUS:84908605747
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - II469-II472
BT - Proceedings - 2003 International Conference on Multimedia and Expo, ICME
PB - IEEE Computer Society
T2 - 2003 International Conference on Multimedia and Expo, ICME 2003
Y2 - 6 July 2003 through 9 July 2003
ER -