Automatic malware clustering plays a vital role in combating the rapidly growing number of malware variants. Most existing malware clustering algorithms operate on either static instruction features or dynamic behavior features to partition malware into families. However, these two distinct approaches have their own strengths and weaknesses in handling different types of malware. Moreover, different clustering algorithms, and even multiple runs of the same algorithm, may produce inconsistent or even contradictory results. To remedy this heterogeneity and the lack of robustness of any single clustering algorithm, we propose DUET, a novel system that exploits the complementary nature of static and dynamic clustering algorithms and optimally integrates their results. Using the concept of a clustering ensemble, DUET combines partitions from individual clustering algorithms into a single consensus partition with better quality and robustness. DUET improves on existing ensemble algorithms by incorporating cluster-quality measures to effectively reconcile differences and contradictions between base malware clusterings. Using real-world malware samples, we compare the performance of DUET (in terms of clustering precision, recall, and coverage) with that of individual state-of-the-art static and dynamic clustering components.
The comprehensive experiments demonstrate DUET's capability of improving the coverage of malware samples by 20–40% while keeping the precision near the optimum achievable by any individual clustering algorithm.
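
To make the clustering-ensemble idea concrete, the following is a minimal sketch of a quality-weighted co-association consensus. It is not DUET's actual algorithm: the function name `consensus_partition`, the per-partition `weights`, the `threshold`, the use of `-1` for a sample that a base clustering leaves unclustered, and the toy labels are all assumptions made for illustration.

```python
from itertools import combinations


def consensus_partition(partitions, weights, threshold=0.5):
    """Merge base clusterings into one consensus partition (illustrative only).

    partitions: list of label lists, one per base clustering; -1 marks a
                sample that the base clustering did not cover (assumption).
    weights:    one hypothetical quality weight per base clustering.
    """
    n = len(partitions[0])
    parent = list(range(n))  # union-find forest over samples

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        agree = seen = 0.0
        for labels, w in zip(partitions, weights):
            if labels[i] == -1 or labels[j] == -1:
                continue  # this base clustering has no opinion on the pair
            seen += w
            if labels[i] == labels[j]:
                agree += w
        # Link the pair when the weighted vote for "same family" wins.
        if seen > 0 and agree / seen > threshold:
            parent[find(i)] = find(j)

    # Relabel union-find roots as consecutive cluster ids; a sample that no
    # base clustering covers ends up in its own singleton cluster.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy input: each base clustering leaves some samples uncovered.
static_labels = [0, 0, 0, -1, -1, 1]
dynamic_labels = [0, 0, -1, 1, 1, 1]
consensus = consensus_partition([static_labels, dynamic_labels], [0.9, 0.8])
print(consensus)
```

In this toy input the static clustering covers four of six samples and the dynamic one covers five, while the consensus assigns all six, which is the kind of coverage gain the abstract refers to. When two base clusterings contradict each other on a pair, the one with the larger hypothetical quality weight decides.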
IBM T.J. Watson Research
Kang G. Shin
University of Michigan, Ann Arbor