Abstract

In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video in adversarial examples will be lower than in benign examples because of the added adversarial noise. We use the synchronisation confidence score as a proxy for audio-visual correlation and use it to detect adversarial attacks.
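The detection idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not say how the synchronisation confidence score is computed, so the sketch assumes a median-minus-minimum score over temporal offsets, as used in common lip-sync models. The names `sync_confidence`, `is_adversarial`, `max_offset`, and `threshold` are hypothetical, and the toy one-dimensional features stand in for real audio and video embeddings.

```python
import math
from statistics import median


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sync_confidence(audio_feats, video_feats, max_offset=2):
    """Median minus minimum of the mean audio-video distance over offsets.

    Assumption: this definition is borrowed from common lip-sync models;
    the source does not state how its confidence score is computed.
    """
    mean_dists = []
    for offset in range(-max_offset, max_offset + 1):
        # Pair video frame t with audio frame t + offset; skip pairs that
        # fall outside the audio sequence.
        pairs = [
            (video_feats[t], audio_feats[t + offset])
            for t in range(len(video_feats))
            if 0 <= t + offset < len(audio_feats)
        ]
        if pairs:
            total = sum(euclidean(v, a) for v, a in pairs)
            mean_dists.append(total / len(pairs))
    return median(mean_dists) - min(mean_dists)


def is_adversarial(audio_feats, video_feats, threshold):
    """Flag an input whose confidence falls below a hypothetical threshold."""
    return sync_confidence(audio_feats, video_feats) < threshold


# Toy data: a video track, an audio track that follows it exactly, and an
# audio track that carries no timing information at all.
video = [[float(t)] for t in range(10)]
audio_in_sync = [[float(t)] for t in range(10)]
audio_flat = [[0.0] for _ in range(10)]
video_flat = [[1.0] for _ in range(10)]

synced_score = sync_confidence(audio_in_sync, video)
flat_score = sync_confidence(audio_flat, video_flat)
```

With a well-synchronised pair the distance has a sharp minimum at one offset, so the score is high. When the streams carry no shared timing the distances are flat across offsets and the score drops toward zero. In practice the threshold would be chosen on held-out benign data.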

Examples

1. Word Level Attacks (Results on LRW)

No.	ε^A	ε^V (one adversarial example per value, alongside the real example)
0	256	4, 8, 16
1	512	4, 8, 16
2	1024	4, 8, 16
3	256	4, 8, 16
4	512	4, 8, 16
5	1024	4, 8, 16

2. Partial Sentence Attacks (Results on GRID)

The word error rate (WER) between the transcribed phrases and the target phrase “BIN BLUE AT A ZERO PLEASE” is up to 50%.
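For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference. The sketch below assumes the target phrase is the reference. The function name `word_error_rate` and the example transcription are hypothetical and are not taken from the experiments.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


target = "BIN BLUE AT A ZERO PLEASE"
# Hypothetical transcription in which three of the six words differ.
transcribed = "BIN BLUE AT C EIGHT SOON"
wer = word_error_rate(target, transcribed)  # 3 substitutions / 6 words = 0.5
```

For a six-word target phrase, a WER of 50% corresponds to three word-level errors.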

No.	real example	ε^A=256, ε^V=4	ε^A=256, ε^V=8	ε^A=512, ε^V=4	ε^A=512, ε^V=8	ε^A=1024, ε^V=4	ε^A=1024, ε^V=8
0
1

3. Full Sentence Attacks (Results on GRID)

The following videos are transcribed to “LAY RED AT C EIGHT SOON”.

No.	real example	ε^A=256, ε^V=4	ε^A=256, ε^V=8	ε^A=512, ε^V=4	ε^A=512, ε^V=8	ε^A=1024, ε^V=4	ε^A=1024, ε^V=8
0
1