35th Annual Computer Security Applications Conference (ACSAC 2019)


STRIP: A Defence Against Trojan Attacks on Deep Neural Networks

Recent trojan attacks on deep neural network (DNN) models are one insidious variant of data poisoning attacks. Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty of interpreting the learned model to misclassify any input stamped with the attacker's chosen trojan trigger. Since the trojan trigger is a secret guarded and exploited by the attacker, detecting such trojan inputs is a challenge, especially at run-time when models are in active operation. This work builds a STRong Intentional Perturbation (STRIP) based run-time trojan attack detection system, focusing on vision systems. We intentionally perturb the incoming input, for instance by superimposing various image patterns, and observe the randomness of the predicted classes for the perturbed inputs from a given deployed model---malicious or benign. A low entropy in the predicted classes violates the input-dependence property of a benign model and implies the presence of a malicious input---a characteristic of a trojaned input. The high efficacy of our method is validated through case studies on two popular and contrasting datasets: MNIST and CIFAR10. We achieve an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers. In particular, on the dataset of natural images in CIFAR10, we have empirically achieved the desired result of 0% for both FRR and FAR even under a number of variants of trojan attacks.
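The detection idea described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes a `model` callable returning a softmax probability vector, blends the incoming input with held-out clean images (the "intentional perturbation"), and flags the input as trojaned when the average Shannon entropy of the perturbed predictions falls below a threshold (a trojaned input keeps being forced to the attacker's target class, so its predictions barely change under perturbation). The blending weight and threshold are hypothetical placeholders.

```python
import numpy as np

def superimpose(x, overlay, alpha=0.5):
    # Blend the incoming input with a held-out clean image (hypothetical alpha).
    return alpha * x + (1.0 - alpha) * overlay

def strip_entropy(model, x, overlays):
    # Average Shannon entropy of the model's predictions over all perturbed copies.
    entropies = []
    for overlay in overlays:
        probs = model(superimpose(x, overlay))
        probs = np.clip(probs, 1e-12, 1.0)          # guard log(0)
        entropies.append(-np.sum(probs * np.log2(probs)))
    return float(np.mean(entropies))

def is_trojaned(model, x, overlays, threshold):
    # Low entropy => predictions stay fixed under perturbation => suspicious input.
    return strip_entropy(model, x, overlays) < threshold
```

In practice the threshold would be calibrated on clean inputs to meet a preset false rejection rate, mirroring the FRR/FAR trade-off reported in the abstract.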

Garrison Gao
NJUST, China and Data61, Australia

Chang Xu
Data61, CSIRO, Sydney, Australia

Derui Wang
Swinburne University of Technology, Australia

Shiping Chen
Data61, CSIRO, Sydney, Australia

Damith C. Ranasinghe
Auto-ID Lab, The School of Computer Science, The University of Adelaide

Surya Nepal
Data61, CSIRO, Australia

