Claudiu Cristian Musat BitDefender
Andra Miloiu BitDefender
Carmen Mitrica BitDefender
Spam keeps changing, yet so far no quantitative study has measured the rate at which novelty appears, or determined what proportion of received messages is genuinely new and what proportion is merely a variation of older spam.
In this work we present a means of determining whether a newly received message is similar to previously seen ones or substantially different. Furthermore, we provide a mechanism that focuses on the spam samples that differ most from those already known.
We use a wave-oriented k-means engine to cluster messages that have similar descriptions in the chosen feature space. We then use a second instance of the engine to cluster the spam clusters obtained in the first step and single out the most different ones. Finally, by expanding the timeframe, we detect long-term cluster similarities. The result of this process is a stream of clusters composed of the messages that least resemble older ones.
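The two clustering stages can be illustrated with a minimal sketch. It is not the engine described here: it substitutes plain Lloyd's k-means with deterministic farthest-point seeding for the wave-oriented variant, does not model the timeframe expansion, and assumes messages are already encoded as numeric feature vectors. All function names, parameters, and sample values are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means (assumption: stands in for the wave-oriented engine)."""
    # Deterministic seeding: start at the first point, then repeatedly
    # add the point farthest from every centroid chosen so far.
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        centroids.append(tuple(max(
            points, key=lambda p: min(dist(p, c) for c in centroids))))
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

def novel_clusters(known_centroids, new_messages, k_messages, k_groups):
    """Return centroids of new clusters that share no group with known ones."""
    # Stage 1: cluster the newly received messages.
    new_centroids, _ = kmeans(new_messages, k_messages)
    # Stage 2: cluster the clusters (known and new centroids together).
    combined = [tuple(c) for c in known_centroids] + new_centroids
    _, groups = kmeans(combined, k_groups)
    n_known = len(known_centroids)
    known_groups = set(groups[:n_known])
    return [c for c, g in zip(new_centroids, groups[n_known:])
            if g not in known_groups]

# Hypothetical 2-D feature vectors: one batch resembles known spam,
# the other lies far from every known centroid.
known = [(0.0, 0.0), (10.0, 0.0)]
batch = [(0.1, 0.2), (-0.1, 0.1), (0.2, -0.1),
         (10.0, 10.2), (9.9, 10.0), (10.1, 9.8)]
print(novel_clusters(known, batch, k_messages=2, k_groups=3))
```

In this sketch a new cluster counts as novel when the second-stage clustering places its centroid in a group that contains no previously known centroid. The group count `k_groups` controls how strict that test is and would need tuning on real data.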
Isolating the genuinely novel spam is important because it lets analysts concentrate on those messages and thus further reduce the false negative rate.