ACSAC2012 Program

Owing to their versatile functionality and size, PDF documents have
become a popular avenue for user exploitation ranging from large-scale
phishing attacks to targeted attacks. In this paper, we present a
framework for robust detection of malicious documents through machine learning
based on features extracted from document metadata and structure. Using real-world
datasets, we demonstrate the adequacy of these document properties
for malware detection and the durability of these features across new
malware variants. Indeed, using multiple datasets containing an aggregate of over
5,000 unique malicious documents and over 100,000 benign ones, our
classification rates are well above 99% while maintaining low
false positive rates of 0.2% or less for different classification
parameters and scenarios. Furthermore, we demonstrate the ability
to detect documents used in targeted attacks and separate
them from broad-based threats.
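
As a rough illustration of what features extracted from document structure could look like, the following sketch counts a handful of structural tokens in the raw bytes of a PDF. The feature names and the choice of tokens are assumptions made for this example; they are not the feature set used in the paper, and a real extractor would also need to handle metadata fields and obfuscated or compressed content.

```python
import re

# Hypothetical feature set: these names and patterns are illustrative
# assumptions, not the features used by the authors.
TOKEN_PATTERNS = {
    "count_obj": rb"\bobj\b",                          # "obj" but not "endobj"
    "count_stream": rb"\bstream\b",                    # "stream" but not "endstream"
    "count_page": rb"/Page(?![A-Za-z])",               # "/Page" but not "/Pages"
    "count_font": rb"/Font(?![A-Za-z])",               # "/Font" but not "/FontDescriptor"
    "count_javascript": rb"/JavaScript(?![A-Za-z])",
}


def extract_features(pdf_bytes: bytes) -> dict:
    """Return simple structural counts plus the file size in bytes."""
    features = {
        name: len(re.findall(pattern, pdf_bytes))
        for name, pattern in TOKEN_PATTERNS.items()
    }
    features["size"] = len(pdf_bytes)
    return features


# A tiny hand-written fragment, used only to exercise the function.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Pages /Kids [2 0 R] >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page /Font 3 0 R >>\nendobj\n"
    b"3 0 obj\n<< /JavaScript (x) >>\nstream\nabc\nendstream\nendobj\n"
)
features = extract_features(sample)
```

Each document then becomes a fixed-length vector of such counts, which is the form of input a classifier needs.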

We use the Random Forests classification method, which is an ensemble
classifier employing randomly selected features in each individual
classification tree. These properties combined with a large number of
features yield strong detection rates, even on previously unseen
malware variants. Remarkably, we also discovered that by artificially reducing the
influence of the top features in the classifier, we can still achieve a high rate of
detection in an adversarial setting where the attacker is aware
of both the top features utilized in the classifier and our normality model.
Thus, we show strong resilience to mimicry attacks, even when the attacker knows the document features,
classification method, and training set.
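
The ensemble idea described above can be sketched in a few lines of plain Python: each tree is trained on a bootstrap sample and considers only a random subset of features at each split, and the forest predicts by majority vote. This is a minimal sketch, not the authors' implementation. The data is synthetic and separable by construction, and simply excluding one feature from training stands in for the paper's artificial reduction of top-feature influence, so the example shows only the mechanics, not the reported results.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, feats, rng, depth=0, max_depth=4, k=2):
    """Grow one tree; each split looks at k randomly chosen features."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in rng.sample(feats, min(k, len(feats))):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       feats, rng, depth + 1, max_depth, k),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       feats, rng, depth + 1, max_depth, k))


def predict_tree(node, row):
    """Follow splits until a leaf (an integer label) is reached."""
    while isinstance(node, tuple):
        f, t, low, high = node
        node = low if row[f] <= t else high
    return node


def train_forest(rows, labels, feats, n_trees=25, seed=0):
    """Train n_trees trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats, rng))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


def accuracy(forest, rows, labels):
    return sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)


def make_data(n=40):
    """Synthetic toy data (an assumption): benign values 0..4, malicious 6..10."""
    rows, labels = [], []
    for i in range(n):
        rows.append([(i + j) % 5 for j in range(4)])
        labels.append(0)
        rows.append([6 + (i + j) % 5 for j in range(4)])
        labels.append(1)
    return rows, labels


rows, labels = make_data()
full_forest = train_forest(rows, labels, feats=[0, 1, 2, 3])
# Stand-in for reducing a top feature's influence: drop feature 0 entirely.
reduced_forest = train_forest(rows, labels, feats=[1, 2, 3])
acc_full = accuracy(full_forest, rows, labels)
acc_reduced = accuracy(reduced_forest, rows, labels)
```

Because every remaining feature in this toy set still carries signal, the reduced forest classifies the training rows about as well as the full one. On real data, the size of the drop depends on how much information the remaining features retain.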

Author(s):

Charles Smutz
George Mason University
United States

Angelos Stavrou
George Mason University
United States