Visual Speech Enhancement Without A Real Visual Stream

Published in Winter Conference on Applications of Computer Vision (WACV ’21), 2021

Recommended citation: Hegde, Sindhu B., K. Prajwal, R. Mukhopadhyay, Vinay Namboodiri and C. Jawahar. “Visual Speech Enhancement Without A Real Visual Stream.” Winter Conference on Applications of Computer Vision (WACV ’21). https://arxiv.org/abs/2012.10852

Download the paper here: https://arxiv.org/abs/2012.10852

In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance across a wide range of real-world noises. Recent works that use lip movements as additional cues improve the quality of the generated speech over "audio-only" methods. However, these methods cannot be used in the many applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement that exploits recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is close (< 3% difference) to that obtained with real lips, meaning we can exploit the advantages of lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as qualitative human evaluations. Additional ablation studies and a demo video in the supplementary material, containing qualitative comparisons and results, clearly illustrate the effectiveness of our approach.
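To make the teacher-student idea concrete, below is a minimal PyTorch sketch of the distillation step: a frozen lip-synthesis teacher supervises a student that predicts lip frames directly from noisy speech. All module names, layer choices, tensor shapes, and the L1 loss here are illustrative assumptions for exposition, not the paper's actual architecture or training objective.

```python
# Sketch of pseudo-lip distillation (all shapes/modules are assumptions,
# not the paper's exact design): a frozen teacher lip-synthesis model
# supervises a student that predicts lip frames from *noisy* speech, so
# at test time no real video stream is needed.
import torch
import torch.nn as nn

class LipGenerator(nn.Module):
    """Maps a window of mel-spectrogram frames to a lip-region image.

    Stands in for both the teacher (pre-trained on clean speech, frozen)
    and the student (trained on noisy speech). Shapes are illustrative.
    """
    def __init__(self, n_mels=80, n_frames=16, img_size=48):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 512),
            nn.ReLU(),
            nn.Linear(512, img_size * img_size),
            nn.Sigmoid(),
        )

    def forward(self, mel):  # mel: (batch, n_mels, n_frames)
        out = self.net(mel)
        return out.view(-1, 1, self.img_size, self.img_size)

# The teacher is frozen; only the student is updated.
teacher = LipGenerator().eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student = LipGenerator()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

# Dummy batch: clean speech for the teacher, its noisy version for the student.
clean_mel = torch.randn(8, 80, 16)
noisy_mel = clean_mel + 0.5 * torch.randn_like(clean_mel)

# Distillation step: the student must reproduce the teacher's lip movements
# from the noisy audio alone.
with torch.no_grad():
    target_lips = teacher(clean_mel)
pseudo_lips = student(noisy_mel)
loss = nn.functional.l1_loss(pseudo_lips, target_lips)
opt.zero_grad()
loss.backward()
opt.step()
```

At inference time, the student's pseudo-lips would be paired with the noisy audio and fed to a standard audio-visual enhancement network in place of a real video stream; that downstream enhancer is omitted from the sketch for brevity.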
