Defeating Hidden Audio Channel Attacks on Voice Assistants via Audio-Induced Surface Vibrations

Voice access technologies are widely adopted in mobile devices and voice assistant systems as a convenient means of user interaction. Recent studies have demonstrated that the existing voice interfaces on these systems have a potentially serious vulnerability to “hidden voice commands”. This attack uses synthetically rendered adversarial sounds embedded within a voice command to trick the speech recognition process into executing malicious commands without being noticed by legitimate users.

In this paper, we employ low-cost motion sensors in a novel way to detect these hidden voice commands. In particular, our proposed system extracts and examines the unique audio signatures of issued voice commands in the vibration domain. We show that the signatures of normal commands and of synthetic hidden voice commands are distinctive, which enables detection of the attacks. The proposed system, which relies on a speaker-motion sensor setup, can be easily deployed on smartphones by reusing existing on-board motion sensors, or by utilizing a cloud service that provides the relevant setup environment. The system is based on the premise that while the crafted audio features of hidden voice commands may fool an authentication system in the audio domain, their unique audio-induced surface vibrations captured by the motion sensor are hard to forge. Our system therefore poses a harder challenge for the attacker, who must now forge the acoustic features in both the audio and vibration domains simultaneously. We extract time- and frequency-domain statistical features and acoustic features (e.g., chroma vectors and MFCCs) from the motion sensor data and use learning-based methods to distinguish normal commands from hidden voice commands. The results show that our system can distinguish hidden voice commands from normal commands with 99.9% accuracy using only low-cost motion sensors that have very low sampling frequencies.
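As a rough illustration of the feature-extraction step described above, the following minimal sketch computes a few time- and frequency-domain statistics from a single axis of motion sensor readings. It is not the authors' implementation: the function names, the choice of features (mean, standard deviation, zero-crossing rate, spectral centroid, peak frequency), and the 200 Hz sampling rate in the usage example are all assumptions made for illustration. A naive DFT stands in for an optimized FFT, and the chroma and MFCC features are omitted.

```python
import cmath
import math


def time_features(samples):
    """Time-domain statistics of one accelerometer axis (hypothetical feature set)."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((v - mean) ** 2 for v in samples) / n
    # Remove the constant offset (e.g., gravity) before counting sign changes.
    centered = [v - mean for v in samples]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "zero_crossing_rate": crossings / (n - 1),
    }


def magnitude_spectrum(samples):
    """One-sided magnitude spectrum via a naive O(n^2) DFT of the mean-removed signal."""
    n = len(samples)
    mean = sum(samples) / n
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            (samples[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        spectrum.append(abs(acc))
    return spectrum


def frequency_features(samples, sampling_rate_hz):
    """Frequency-domain statistics: spectral centroid and dominant frequency."""
    mags = magnitude_spectrum(samples)
    n = len(samples)
    freqs = [k * sampling_rate_hz / n for k in range(len(mags))]
    total = sum(mags)
    if total == 0:
        return {"spectral_centroid_hz": 0.0, "peak_frequency_hz": 0.0}
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    peak_index = max(range(len(mags)), key=mags.__getitem__)
    return {"spectral_centroid_hz": centroid, "peak_frequency_hz": freqs[peak_index]}


def extract_features(samples, sampling_rate_hz):
    """Concatenate both feature groups into one dictionary for a classifier."""
    features = time_features(samples)
    features.update(frequency_features(samples, sampling_rate_hz))
    return features


if __name__ == "__main__":
    # Synthetic stand-in for a vibration trace: a 20 Hz tone on a gravity offset,
    # sampled at an assumed 200 Hz for one second.
    fs = 200
    trace = [9.8 + math.sin(2 * math.pi * 20 * t / fs + 0.3) for t in range(fs)]
    for name, value in extract_features(trace, fs).items():
        print(f"{name}: {value:.4f}")
```

In a full pipeline, feature vectors of this kind, computed per command, would be passed to a trained classifier that labels each command as normal or hidden; the choice of classifier is left open here.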

Chen Wang
WINLAB, Rutgers University

S Abhishek Anand
University of Alabama at Birmingham

Jian Liu
WINLAB, Rutgers University

Payton R. Walker
University of Alabama at Birmingham

Yingying (Jennifer) Chen
WINLAB, Rutgers University

Nitesh Saxena
University of Alabama at Birmingham