Detecting spam pictures using statistical features

Sándor Antal (VirusBuster)


The problem we want to solve is the detection of spam messages whose essential information is carried in an attached picture.

Unfortunately, spammers nowadays usually vary the pictures randomly (e.g. by adding little dots or lines), so the images in two instances of the same spam differ. Their aim is to prevent the spam pictures from being detected by hash-based methods. Our goal was to eliminate the problems caused by this trick and to develop a fast method which is not as sensitive to small differences between pictures as hash-based methods are.
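The sensitivity of exact hashing can be illustrated with a minimal sketch. The byte string below is only a stand-in for image data, and the choice of SHA-256 and the helper name exact_hash are assumptions made for this illustration, not details taken from the talk.

```python
import hashlib


def exact_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the bytes of a spam picture (hypothetical data).
original = bytes([200] * 1024)

# Simulate one "random dot": change a single byte.
tampered = bytearray(original)
tampered[100] = 0
modified = bytes(tampered)

# An exact hash treats the two as unrelated, although they differ in one byte.
digests_match = exact_hash(original) == exact_hash(modified)
```

Here digests_match is False, which is why a signature built on an exact hash has to be regenerated for every randomized copy of the same picture.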

The methods we have developed and use are:

- to calculate statistical parameters of the image file (size, average, standard deviation, etc.) without rendering the image;
- to smooth the image using different IF methods (for example Gaussian blur or various types of granulation filters) to remove disturbances such as random dots;
- to calculate global parameters of the image (e.g. brightness, contrast);
- to use these parameters in a hash function which produces similar hash values for similar pictures, so that if the hash values of two pictures differ only slightly, the pictures are the same or almost the same with respect to these parameters;
- to treat these parameters as spam/ham features and apply the Bayesian method.

This means that it is enough to train the filter on only a few spam instances (maybe only one) and, unless the pictures are varied significantly, it can detect the modified variations as well.
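The idea of comparing pictures by file-level statistics rather than by exact hash can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names byte_stats and similar, the 5% relative tolerance, and the use of only size, mean and standard deviation are all hypothetical choices. Real compressed image formats may react to small edits less predictably than the synthetic bytes used here.

```python
import math


def byte_stats(data: bytes) -> tuple:
    """Return (size, mean, std) of the raw bytes; the image is never decoded."""
    n = len(data)
    if n == 0:
        return (0.0, 0.0, 0.0)
    mean = sum(data) / n
    variance = sum((b - mean) ** 2 for b in data) / n
    return (float(n), mean, math.sqrt(variance))


def similar(a: bytes, b: bytes, tol: float = 0.05) -> bool:
    """True if each statistic differs by at most `tol` relative to the larger value.

    `tol` is an assumed threshold chosen for this sketch.
    """
    for x, y in zip(byte_stats(a), byte_stats(b)):
        scale = max(abs(x), abs(y), 1.0)
        if abs(x - y) / scale > tol:
            return False
    return True


# Hypothetical stand-in for a spam picture and a slightly randomized copy.
base = bytes((i * 7) % 256 for i in range(4096))
noisy = bytearray(base)
for pos in (10, 500, 1200, 2500, 4000):
    noisy[pos] = 255  # a few "random dots"
variant = bytes(noisy)

# A clearly different file: much smaller and constant-valued.
unrelated = bytes([0] * 100)

variant_is_similar = similar(base, variant)
unrelated_is_similar = similar(base, unrelated)
```

With these inputs the randomized copy stays within the tolerance while the unrelated file does not. A statistics vector of this kind could then be fed to a Bayesian classifier as spam/ham features.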



