Abstract

Visual speech recognition (VSR) aims to recognise the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than the model design. In this work, we demonstrate that designing better models is as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentations. We show that such a model works for different languages (English, Mandarin, Spanish, French, Portuguese and Italian) and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. Furthermore, we show that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.

Visual Speech Recognition for Multiple Languages in the Wild

Pingchuan Ma¹ Stavros Petridis¹,² Maja Pantic¹,²

¹Imperial College London ²Meta AI

[Paper] [Code] [Model]
