Enhancing Signature-based Collaborative Spam Detection with Bloom Filters

Jeff Yan
University of Newcastle upon Tyne
UK

Pook Leong Cho
University of Newcastle upon Tyne
UK

To date, statistical spam filters are probably the most heavily studied and most widely adopted technology for detecting junk email. However, among other disadvantages, they fail to detect spam that cannot be predicted by the machine learning algorithms on which they are based. Nor do they identify spam that is sent as an image attachment. In addition, these filters need to be retrained regularly, particularly when false positives or false negatives occur. Signature-based collaborative spam detection (SCSD) systems provide a promising solution that addresses all of these problems. In particular, some SCSD systems can identify previously unseen spam messages as such, although intuitively this would appear to be impossible. However, SCSD approaches usually rely on huge databases of email signatures (or checksums), demanding substantial resources for signature lookup as well as for signature database storage, transmission and merging. In this paper, we report our enhancements to two pioneering SCSD systems, Razor and DCC. In our enhancements, signature lookups can be performed in O(1) time, i.e. constant time, independent of the number of signatures in the database. A space-efficient representation can significantly reduce signature database size (e.g. by a factor of 16 or more for Razor), even before any data compression algorithm is applied. A simple but fast algorithm for merging different signature databases is also supported. We use Bloom filters and a novel variant to achieve all this.
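
To illustrate the properties claimed above, the following is a minimal sketch of a standard Bloom filter in Python. It is not the implementation used in Razor, DCC, or the paper, and it does not attempt to reproduce the paper's novel variant. The class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, the use of SHA-256 with double hashing, and the sample signature strings are all assumptions made for illustration. The sketch shows that a lookup touches a fixed number of bit positions regardless of how many signatures are stored, and that two filters built with identical parameters can be merged with a bitwise OR.

```python
import hashlib


class BloomFilter:
    """Illustrative standard Bloom filter (hypothetical names and parameters)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # fixed-size bit array

    def _positions(self, item: bytes):
        # Double hashing (an assumed choice): derive num_hashes positions
        # from two 64-bit values taken from one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # Checks exactly num_hashes bits, independent of how many items
        # have been added. May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Merging is a bitwise OR; both filters must share parameters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must use the same size and hash count")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
        return merged


# Usage with made-up signature strings standing in for email checksums.
site_a = BloomFilter()
site_b = BloomFilter()
site_a.add(b"signature-of-spam-1")
site_b.add(b"signature-of-spam-2")
combined = site_a.merge(site_b)
print(b"signature-of-spam-1" in combined)  # True
print(b"signature-of-spam-2" in combined)  # True
```

The space saving comes from storing only a few bits per signature rather than the signature itself, at the cost of a tunable false-positive rate; the bit-array size and hash count above are arbitrary and would need to be chosen for the expected number of signatures.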

Keywords: Spam detection, Bloom filters
