Using clustering to detect and mitigate spam distributions

Andrey Bakhmutov, Kaspersky Lab

In recent years, a number of clustering techniques that group emails by content similarity have been used intensively to help identify spam. Clustering is used to reduce the dataset before applying machine learning algorithms, to detect spam campaigns, or to categorize emails.

In this work, clustering is used to detect and mitigate spam distributions in real time. Even advanced anti-spam systems may suffer from a time lag between the start of a spam distribution and the moment the system is able to block it. It is therefore important to spot a new distribution as early as possible and to react quickly. To achieve this, a large-scale, highly distributed system that gathers email information has been deployed. Email data is represented using shingling: each text message is sliced into chunks, and the chunks are hashed. Shingles from numerous sources are collected over UDP and stored in a clustering database that performs on-the-fly hierarchical clustering of the incoming data. Because of its large scale and its ability to cluster in real time, the system can quickly detect spam distributions by observing fast-growing clusters, and can then instruct mail hubs and servers to block emails that fall into those clusters.
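
To make the shingling and fast-growing-cluster ideas concrete, the following is a minimal Python sketch, not the deployed system. Everything specific in it is an assumption made for illustration: word-level shingles of width 4, BLAKE2b hashes truncated to 64 bits, Jaccard resemblance as the similarity measure, and a hypothetical `GrowthMonitor` class that flags a cluster once it receives a given number of messages within a sliding time window. The abstract does not specify any of these choices, and the hierarchical clustering step itself is not shown.

```python
import hashlib
import re
from collections import deque


def shingles(text, width=4):
    """Return the set of hashed word shingles for a message body.

    Assumption: shingles are overlapping runs of `width` lowercase words.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < width:
        chunks = [" ".join(words)]  # short message: one shingle
    else:
        chunks = [" ".join(words[i:i + width])
                  for i in range(len(words) - width + 1)]
    # Hash each chunk to a 64-bit integer (illustrative choice of hash).
    return {
        int.from_bytes(
            hashlib.blake2b(c.encode("utf-8"), digest_size=8).digest(), "big")
        for c in chunks
    }


def resemblance(a, b):
    """Jaccard similarity of two shingle sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class GrowthMonitor:
    """Hypothetical helper: flags clusters that grow quickly.

    A cluster is flagged when it has received at least `threshold`
    messages within the last `window_seconds`.
    """

    def __init__(self, window_seconds=60.0, threshold=100):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._hits = {}  # cluster id -> deque of arrival timestamps

    def record(self, cluster_id, timestamp):
        """Register one message; return True if the cluster is fast-growing."""
        hits = self._hits.setdefault(cluster_id, deque())
        hits.append(timestamp)
        # Drop arrivals that have fallen out of the sliding window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold
```

Under these assumptions, two six-word messages that differ only in their last word share two of four distinct shingles, so `resemblance` returns 0.5. A real deployment would tune the shingle width, similarity threshold, and growth threshold against live traffic.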

The system has proved to be effective, robust and open to incorporating future technologies. It operates as part of an anti-spam product, contributing to its detection rate and considerably reducing the response time to new spam distributions.