An effective conversion of visemes to words for high-performance automatic lip-reading
Fenghour, S., Chen, D., Guo, K., Li, B. and Xiao, P. An effective conversion of visemes to words for high-performance automatic lip-reading. MDPI Sensors.
| Authors | Fenghour, S., Chen, D., Guo, K., Li, B. and Xiao, P. |
As an alternative approach, viseme-based lip-reading systems have demonstrated promising results in decoding videos of people uttering entire sentences. However, the overall performance of such systems is significantly affected by the efficiency of the viseme-to-word conversion step. As shown in the literature, this step has become a bottleneck: a system's performance can drop dramatically from a high viseme classification accuracy (e.g., over 90%) to a much lower word classification accuracy (e.g., just over 60%). The underlying cause is that roughly half of the words in the English language are homopheme words, i.e., a single viseme sequence can map to multiple words, e.g., "time" and "some". In this paper, aiming to tackle this issue, a deep learning model with an attention-based Gated Recurrent Unit (GRU) is proposed for efficient viseme-to-word conversion and compared against three other approaches. The proposed approach offers strong robustness, high efficiency, and a short execution time, and has been verified through analysis and practical experiments on predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are: (1) a model that is effective at converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) a model that uses few parameters and therefore requires little overhead and time to train and execute; and (3) an improved performance in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, an improvement of 15.0% over state-of-the-art approaches.
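To make the architecture described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of an attention-based GRU sequence-to-sequence model for viseme-to-word conversion. All class names, layer sizes, and vocabulary sizes are illustrative assumptions, not taken from the paper; the attention here is a simple dot-product (Luong-style) attention over encoder states.

```python
import torch
import torch.nn as nn

class VisemeToWord(nn.Module):
    """Hypothetical attention-based GRU viseme-to-word converter (sketch).

    A GRU encodes a sequence of viseme class indices; a GRU decoder,
    conditioned on the previous word tokens, attends over the encoder
    states with dot-product attention and emits word-class logits.
    """
    def __init__(self, n_visemes=14, n_words=500, hidden=128):
        super().__init__()
        self.enc_emb = nn.Embedding(n_visemes, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_emb = nn.Embedding(n_words, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_words)

    def forward(self, visemes, words):
        enc_out, h = self.encoder(self.enc_emb(visemes))   # (B, Tv, H)
        dec_out, _ = self.decoder(self.dec_emb(words), h)  # (B, Tw, H)
        # Dot-product attention: each decoder state scores all encoder states.
        scores = dec_out @ enc_out.transpose(1, 2)         # (B, Tw, Tv)
        context = scores.softmax(dim=-1) @ enc_out         # (B, Tw, H)
        # Combine decoder state and attended context, then predict words.
        return self.out(torch.cat([dec_out, context], dim=-1))

model = VisemeToWord()
visemes = torch.randint(0, 14, (2, 10))   # batch of 2 viseme sequences
words = torch.randint(0, 500, (2, 6))     # shifted target word tokens
logits = model(visemes, words)
print(logits.shape)  # torch.Size([2, 6, 500])
```

Trained with cross-entropy against the target word sequence, such a decoder can use sentence context to disambiguate homopheme words that share the same viseme sequence, which is the role the abstract attributes to the proposed model.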
| Keywords | lip reading, neural networks, speech recognition, robustness, augmentation, visemes, gated recurrent unit, recurrent neural networks |
Publication process dates

| Accepted | 20 Nov 2021 |
| Deposited | 22 Nov 2021 |
Accepted author manuscript