Voicefox: Leveraging Inbuilt Transcription to Enhance the Security of Machine-Human Speaker Verification against Voice Synthesis Attacks
In this paper, we propose Voicefox, a defense against automated voice synthesis attacks in machine-based and human-based speaker verification applications. Voicefox builds on a previously untapped capability of speech-to-text transcription, which is already built into these applications. It relies on the premise that while synthesized samples might be falsely accepted by speaker verification systems and human listeners, speech-to-text techniques cannot transcribe them as accurately as a natural human voice. Voicefox is not a speaker verification system, but rather an independent module that can be integrated with any speaker verification system to enhance its security against voice synthesis attacks.
To test our premise, and as an essential prerequisite for building Voicefox, we ran an extensive study that measures the accuracy of off-the-shelf speech-to-text techniques when confronted with synthesized samples generated by state-of-the-art speech synthesis techniques. Our results show that the transcription error rate for synthesized voices is significantly higher, on average 2-3x, than the error rate for natural voices. This study quantitatively supports our hypothesis that human voices are transcribed more accurately than synthesized voices. In designing Voicefox, we further propose several post-transcription rules, including accepting a transcribed text even if up to a certain number of words are not transcribed correctly, and ignoring words not available in the reference dictionary. By applying such rules, Voicefox can reduce the false rejection rates to as low as 1.20-4.69%, depending on the application and the transcriber used, and reduce the false acceptance rates to 0% for dictionaries with phonetically distinct words.
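
The two post-transcription rules named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_errors` threshold, the use of word-level edit distance as the error count, and the reading of "ignoring" as dropping out-of-dictionary words from the transcript are all assumptions made for illustration.

```python
from typing import List, Set


def normalize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?;:\"'") for w in text.lower().split())
    return [w for w in words if w]


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def accept_transcript(challenge: str, transcript: str,
                      dictionary: Set[str], max_errors: int = 1) -> bool:
    """Hypothetical acceptance check combining both rules.

    Rule 1 (assumed form): tolerate up to `max_errors` word errors.
    Rule 2 (assumed form): drop transcribed words absent from the dictionary.
    """
    expected = normalize(challenge)
    heard = [w for w in normalize(transcript) if w in dictionary]
    return word_errors(expected, heard) <= max_errors


if __name__ == "__main__":
    vocab = {"seven", "apple", "river", "green", "table"}
    print(accept_transcript("seven apple river green",
                            "seven um apple river green", vocab))
    print(accept_transcript("seven apple river green",
                            "heaven maple liver green", vocab))
```

In this sketch, a larger `max_errors` would lower false rejections of natural speech at the cost of tolerating more transcription errors from synthesized speech, which is the trade-off the rules are meant to balance.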