Automatically detecting spam at the cloud level using text fingerprints

2012-06-01

Marius Nicolae Tibeica

Bitdefender, Romania

Adrian Toma

Bitdefender, Romania
Editor: Helen Martin

Abstract

With content-based anti-spam technologies decreasing in efficiency, Marius Tibeica and Adrian Toma propose a fingerprinting algorithm that maps similar text inputs to similar signatures.


Due to increases in spam volume, as well as language diversity, content-based anti-spam technologies have decreased in efficiency. Alternative methods of similarity/outbreak detection are much needed, and by taking advantage of technological advances in the cloud infrastructure, we can reduce the impact on clients’ resources.

To address the similarity problem, we propose a fingerprinting algorithm that maps similar text inputs to similar signatures. There are two steps. The first creates an element of the fingerprint from each word or group of words, chosen by certain heuristics. The second adapts the fingerprint to the size of the text, which is very important: too little information can generate false positives, and too much information can make the matching process costly. Our approach is either to zoom in (increasing the number of fingerprint elements each word generates) if the text is too short, or to zoom out (gradually reducing the length by eliminating certain groups) if the text is too long. We tested the method by using a clean stream of spam to train a matching filter, with the Levenshtein distance as the indicator of similarity.

1. Introduction

Spammers constantly adapt their techniques in order to avoid detection filters. Signature-based anti-spam filters require frequent updates in order to remain effective, especially given the speed with which spam changes. Bayesian filters need constant training and can also miss spam with malicious attachments. IP address blocking is also problematic, as most spammers now rapidly change their IP addresses. Furthermore, a legitimate server that has been compromised for a short period of time cannot be blacklisted, as it also sends legitimate email messages. Spammers try to decrease the efficiency of URI blacklisting by registering a large number of domains or by using URL-shortening services. The need for an automatic similarity/outbreak detection method is clear.

The increasing popularity of portable devices and recent technological advances in the cloud infrastructure make moving processing away from the client an obvious choice – both to reduce the impact on the client’s resources and to significantly decrease update times. This shift in perspective calls for the use of a reliable fingerprint generation algorithm.

2. The fingerprint

There is an existing algorithm that generates fingerprints: context-triggered piecewise hashing (also known as fuzzy hashing) [4]. Unfortunately, on short texts the signature it generates is too short to be usable. Short texts make up a large portion of spam messages, and they are also the hardest to detect using other content-based filters.

Our approach to creating a fingerprint for text is to focus on the actual words contained in the message, as this gives a good separation of entities in most languages. By generating a character from each word we obtain a basic fingerprint, but this has several limitations, which we address as follows.

2.1 The basic fingerprint

The creation of the basic fingerprint involves several steps:

  1. The input string is separated into different entities by delimiters [2]. These entities can be considered words.

  2. Entities larger than a certain threshold value can be further separated.

  3. We apply a hash function to each entity.

  4. A base64-encoded value of the six least significant bits of the hash is appended to the final fingerprint.

This will produce a fingerprint with a length equal to the number of entities found. If a fingerprint with a length within a certain range is needed, further processing is required.
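
The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `rs_hash`, `split_entities` and `basic_fingerprint` are hypothetical, the RSHash variant is the commonly published one with 32-bit wrap-around, the delimiter set is abbreviated from note [2], and step 2 (splitting over-long entities) is left out because no threshold is given.

```python
import re

# Standard base64 alphabet: one character per 6-bit value.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Abbreviated delimiter set (assumption; note [2] has the authors' full list).
DELIMITER_CHARS = " \n\t\r\0.,:;(){}[]\\/^\"!?`'+*$|"
DELIMITERS = re.compile("[" + re.escape(DELIMITER_CHARS) + "]+")


def rs_hash(word: str) -> int:
    """RSHash as commonly published, wrapped to 32 bits."""
    a, b, h = 63689, 378551, 0
    for byte in word.encode("utf-8"):
        h = (h * a + byte) & 0xFFFFFFFF
        a = (a * b) & 0xFFFFFFFF
    return h


def split_entities(text: str) -> list[str]:
    """Step 1: cut the text at delimiters and drop empty pieces."""
    return [e for e in DELIMITERS.split(text) if e]


def basic_fingerprint(text: str) -> str:
    """Steps 3 and 4: hash each entity, keep its six least significant bits."""
    return "".join(B64[rs_hash(e) & 0x3F] for e in split_entities(text))


print(hex(rs_hash("on")))            # 0x1777af4f
print(basic_fingerprint("on a of"))  # PhH
```

Each entity contributes exactly one character, so the fingerprint length always equals the entity count.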

2.2 The zoom in

Too little information from a fingerprint can generate false positives. To avoid this we can increase precision by gradually increasing the number of encoded values each hash appends to the final fingerprint. The number of encoded values represents the zoom level.

In this case, a 30-bit hash can offer up to five levels of zooming (for a 64-letter alphabet), and the possibility to increase the length of the fingerprints up to five times.
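
The arithmetic behind the five levels is simply 30 / 6 = 5: a 30-bit hash holds five 6-bit groups, and each group maps to one character of a 64-letter alphabet. The sketch below illustrates this reading. The function names, and the choice to emit the higher groups first so that every level ends with the basic character, are assumptions made for illustration; the text does not state which bits each level uses.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

HASH_BITS = 30
GROUP_BITS = 6
MAX_ZOOM = HASH_BITS // GROUP_BITS  # 5 zoom levels


def zoom_in_chars(h: int, level: int) -> str:
    """Return `level` base64 characters taken from the low bits of a hash."""
    if not 1 <= level <= MAX_ZOOM:
        raise ValueError(f"zoom level must be between 1 and {MAX_ZOOM}")
    h &= (1 << HASH_BITS) - 1  # keep only the low 30 bits
    groups = [(h >> (GROUP_BITS * i)) & 0x3F for i in range(level)]
    # Higher groups first, so the last character is always the basic one.
    return "".join(B64[g] for g in reversed(groups))


def zoom_in_fingerprint(hashes: list[int], level: int) -> str:
    """Concatenate `level` characters for every entity hash."""
    return "".join(zoom_in_chars(h, level) for h in hashes)


print(zoom_in_chars(0x61, 1))  # h
print(zoom_in_chars(0x61, 2))  # Bh
```

With this scheme a text of n entities yields a fingerprint of exactly n × level characters, so the zoom-in level needed for a target length can be computed up front.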

2.3 The zoom out

Too much information can make the matching process costly, especially when using a time-consuming [3] edit distance. To decrease the length of the fingerprint, we try to eliminate some of the entities in a way that gives two similar texts similar fingerprints:

  1. To avoid losing too much information, we create new hashes from groups of entities.

  2. For an X zoom-out level, a base64-encoded value of the six least significant bits of the hash is appended to the final fingerprint if hash % X = 0.

There is no way of finding the length of a fingerprint at a given zoom-out level without calculating it, so zooming out is done gradually until an acceptable length is found.
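
The filtering rule and the gradual search can be sketched as below. The helper names are hypothetical, the grouping into overlapping runs of three entities is inferred from the entity groups shown in Table 3, and the way a group is hashed is left open; the sketch works directly on the group hashes listed in that table.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def entity_groups(entities, size=3):
    """Overlapping runs of `size` consecutive entities (assumed grouping)."""
    return [" ".join(entities[i:i + size])
            for i in range(len(entities) - size + 1)]


def zoom_out_fingerprint(group_hashes, x):
    """Keep one base64 character for every group hash divisible by x."""
    return "".join(B64[h & 0x3F] for h in group_hashes if h % x == 0)


def fit_zoom_out(group_hashes, max_len, max_x=64):
    """Raise x step by step until the fingerprint is short enough."""
    for x in range(2, max_x + 1):
        fp = zoom_out_fingerprint(group_hashes, x)
        if len(fp) <= max_len:
            return x, fp
    return None  # no level up to max_x was short enough


# Group hashes copied from Table 3, in order.
TABLE3_HASHES = [
    0x54206878, 0x63514ba5, 0x60acfb49, 0x73062bc7, 0x71486e1c, 0x5c776d2e,
    0x5e63573c, 0x4fe3444a, 0x7ca1596c, 0x77849334, 0x52cc87bd, 0x4f8398fe,
    0x4f8398f6, 0x743ba46d, 0x7b451587, 0x89d97a75, 0x6ff630c6,
]

print(zoom_out_fingerprint(TABLE3_HASHES, 2))  # 4cu8Ks0+2G
print(zoom_out_fingerprint(TABLE3_HASHES, 6))  # u0G
print(fit_zoom_out(TABLE3_HASHES, max_len=5))  # (4, '4c8s0')
```

Because the surviving groups depend only on the hash values, two texts that share most of their entity groups keep most of the same characters at any zoom-out level.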

2.4 Choosing the hash function

We checked several hash functions to see which produced the fewest collisions on words from emails in various languages. The best choice was RSHash.

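| Hash function | Collisions (32 bits) | Collisions (30 bits) |
|---|---|---|
| RSHash | 0 | 4 |
| BKDRHash | 1 | 6 |
| SDBMHash | 2 | 7 |
| OneAtATimeHash | 2 | 6 |
| APHash | 4 | 6 |
| FNVHash | 7 | 10 |
| FNV1aHash | 7 | 10 |
| JSHash | 266 | 277 |
| DJBHash | 266 | 268 |
| DEKHash | 435 | 720 |
| PJWHash | 1687 | 1687 |
| ELFHash | 1687 | 1687 |
| BPHash | 61907 | 70909 |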

Table 1. Analysis of hash functions on over 122,000 words from emails in various languages.
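
A comparison of this kind could be run with a small harness such as the following. It is only a sketch: the collision definition (distinct words minus distinct hash values), the truncation to the low bits, the helper names and the toy word list are all assumptions, and only RSHash (in its commonly published form) is implemented here.

```python
def rs_hash(word: str) -> int:
    """RSHash as commonly published, wrapped to 32 bits."""
    a, b, h = 63689, 378551, 0
    for byte in word.encode("utf-8"):
        h = (h * a + byte) & 0xFFFFFFFF
        a = (a * b) & 0xFFFFFFFF
    return h


def count_collisions(words, hash_fn, bits):
    """Distinct words minus distinct hash values, keeping the low `bits` bits."""
    mask = (1 << bits) - 1
    unique_words = set(words)
    unique_hashes = {hash_fn(w) & mask for w in unique_words}
    return len(unique_words) - len(unique_hashes)


# Toy word list; a real run would use a large multilingual corpus.
words = ["high", "end", "designer", "watch", "and", "handbag", "end"]
for bits in (32, 30):
    print(bits, count_collisions(words, rs_hash, bits))
```

Any other hash function can be passed as `hash_fn`, which makes it straightforward to tabulate several candidates over the same word list.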

2.5 Example of fingerprint generation

Table 2 and Table 3 show how fingerprints with different zoom levels are generated from the text:

High end designer watch and handbag replica sale. Compare our price on a handful of our high end replicas!

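| Entity | Hash in hex | Basic fingerprint | 2x zoom-in fingerprint | 4x zoom-in fingerprint |
|---|---|---|---|---|
| high | 25c4f948 | I | EI | lE5I |
| end | 260c1435 | 1 | M1 | mMU1 |
| designer | 84f5afb | 7 | P7 | IPa7 |
| watch | 34f5dc75 | 1 | 11 | 01c1 |
| and | 2367c3d9 | Z | nZ | jnDZ |
| handbag | 1aa88b79 | 5 | o5 | aoL5 |
| replica | 33381eca | K | 4K | z4eK |
| sale | e96c2eb | r | Wr | OWCr |
| compare | 1c947587 | H | UH | cU1H |
| our | 24b80bd8 | Y | 4Y | k4LY |
| price | 3b54d80d | N | UN | 7UYN |
| on | 1777af4f | P | 3P | X3vP |
| a | 61 | h | Ah | AAAh |
| handful | 380be94e | O | LO | 4LpO |
| of | 1777af47 | H | 3H | X3vH |
| our | 24b80bd8 | Y | 4Y | k4LY |
| high | 3f155a68 | o | Vo | /Vao |
| end | 260c1435 | 1 | M1 | mMU1 |
| replicas | ad4c229 | p | Up | KUCp |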

Table 2. Basic fingerprint and zooming in example.

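| Entity group | Hash in hex | 1/2x zoom out | 1/3x zoom out | 1/4x zoom out | 1/5x zoom out | 1/6x zoom out |
|---|---|---|---|---|---|---|
| high end designer | 54206878 | 4 | | 4 | 4 | |
| end designer watch | 63514ba5 | | l | | l | |
| designer watch and | 60acfb49 | | | | | |
| watch and handbag | 73062bc7 | | H | | | |
| and handbag replica | 71486e1c | c | | c | | |
| handbag replica sale | 5c776d2e | u | u | | | u |
| replica sale compare | 5e63573c | 8 | | 8 | 8 | |
| sale compare our | 4fe3444a | K | | | | |
| compare our price | 7ca1596c | s | | s | | |
| our price on | 77849334 | 0 | 0 | 0 | 0 | 0 |
| price on a | 52cc87bd | | | | 9 | |
| on a handful | 4f8398fe | + | | | | |
| a handful of | 4f8398f6 | 2 | | | | |
| handful of our | 743ba46d | | | | | |
| of our high | 7b451587 | | H | | | |
| our high end | 89d97a75 | | | | | |
| high end replicas | 6ff630c6 | G | G | | | G |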

Table 3. Zooming out example.

The basic and zoom-in fingerprints are generated from the same hashes, with the following results:

  • Basic fingerprint: I171Z5KrHYNPhOHYo1p

  • Fingerprint with 2x zoom in: EIM1P711nZo54KWrUH4YUN3PAhLO3H4YVoM1Up

  • Fingerprint with 4x zoom in: lE5ImMU1IPa701c1jnDZaoL5z4eKOWCrcU1Hk4LY7UYNX3vPAAAh4LpOX3vHk4LY/VaomMU1KUCp

The zoom-out fingerprints are:

  • 1/2x zoom out: 4cu8Ks0+2G

  • 1/3x zoom out: lHu0HG

  • 1/4x zoom out: 4c8s0

  • 1/5x zoom out: 4l809

  • 1/6x zoom out: u0G
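
As a cross-check of the example, the sketch below recomputes the basic, 2x and 4x zoom-in fingerprints directly from the hash values in Table 2. The basic character is the six least significant bits of each hash, as described in section 2.1. The byte selection used for the 2x and 4x levels is inferred from the table values and is not stated in the text, so it should be read as an assumption; the names `BYTES_PER_LEVEL` and `fingerprint` are hypothetical.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Hash values copied from Table 2, in entity order.
TABLE2_HASHES = [
    0x25c4f948, 0x260c1435, 0x84f5afb, 0x34f5dc75, 0x2367c3d9,
    0x1aa88b79, 0x33381eca, 0xe96c2eb, 0x1c947587, 0x24b80bd8,
    0x3b54d80d, 0x1777af4f, 0x61, 0x380be94e, 0x1777af47,
    0x24b80bd8, 0x3f155a68, 0x260c1435, 0xad4c229,
]

# Which bytes of the hash feed each level (0 = least significant byte).
# Inferred from the table values; an assumption, not the stated method.
BYTES_PER_LEVEL = {1: (0,), 2: (2, 0), 4: (3, 2, 1, 0)}


def fingerprint(hashes, level):
    """Emit the low six bits of each selected byte, hash by hash."""
    picks = BYTES_PER_LEVEL[level]
    return "".join(B64[(h >> (8 * i)) & 0x3F] for h in hashes for i in picks)


print(fingerprint(TABLE2_HASHES, 1))  # I171Z5KrHYNPhOHYo1p
print(fingerprint(TABLE2_HASHES, 2))
print(fingerprint(TABLE2_HASHES, 4))
```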

2.6 Zoom levels on spam & legitimate emails

We analysed the spam stream and legitimate email messages over the course of two weeks. With a desired fingerprint length of 127 to 256 characters, we obtained the following results:


Figure 1. Zoom levels on spam emails.

The cumulative results are:

  • Legitimate emails zoom in: 21.84%

  • Legitimate emails no zoom: 19.88%

  • Legitimate emails zoom out: 57.23%

  • Legitimate emails no suitable text: 0.99%

  • Spam emails zoom in: 36.26%

  • Spam emails no zoom: 20.82%

  • Spam emails zoom out: 42.15%

  • Spam emails no suitable text: 0.72%

3. Comparing fingerprints

Two fingerprints can be compared to determine whether the texts from which they were derived are similar.

Because the method of creating a fingerprint differs with each zoom level, only fingerprints with an identical zoom level can be compared. The comparison first checks the zoom level, then computes a Levenshtein distance, which is scaled to produce a match score. For two fingerprints, f1 and f2, we denote this score S(f1, f2).

By choosing a threshold T in the range 0 to 1 (1 meaning that a perfect match is required), we can say that the two fingerprints match if T ≤ S(f1, f2).
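
A minimal sketch of such a comparison is shown below. The exact scaling used by the authors is not reproduced in this text, so the normalisation here (one minus the distance divided by the length of the longer fingerprint) is an assumption, as are the function names and the representation of a fingerprint as a (zoom level, string) pair. The default threshold of 0.75 is the value used in the detection tests.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def score(f1: str, f2: str) -> float:
    """Assumed scaling: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(f1), len(f2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(f1, f2) / longest


def fingerprints_match(fp1, fp2, threshold=0.75):
    """fp1 and fp2 are (zoom_level, fingerprint_string) pairs."""
    (zoom1, s1), (zoom2, s2) = fp1, fp2
    if zoom1 != zoom2:
        return False  # different zoom levels are not comparable
    return score(s1, s2) >= threshold


original = (1, "I171Z5KrHYNPhOHYo1p")
variant = (1, "I171Z5KrHYNPhOHYo1q")  # one character changed
print(fingerprints_match(original, variant))  # True
```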

4. Detection and FP rates

We took a continuous stream of spam (15 hours, 865,000 emails) and divided it into 10-minute intervals. For each interval, we trained the filter with all the emails from the previous intervals and measured the detection rate, at a similarity threshold of 0.75, for both fuzzy hashing (with variable block size) and the proposed fingerprinting algorithm. The results are presented in Figure 2.


Figure 2. Detection rates on spam emails.
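
The interval-based evaluation can be sketched as follows. The function name, the `(timestamp, fingerprint)` input shape and the pluggable `is_match` predicate are assumptions for illustration; a real run would plug in the fingerprint comparison with a 0.75 threshold in place of the exact-match predicate used in the usage lines.

```python
from collections import defaultdict

INTERVAL_SECONDS = 10 * 60  # 10-minute intervals


def detection_rates(emails, is_match, interval=INTERVAL_SECONDS):
    """Detection rate per interval, training on all earlier intervals.

    emails: iterable of (timestamp_in_seconds, fingerprint), all spam.
    Returns {interval_index: fraction of emails matching an earlier one}.
    """
    buckets = defaultdict(list)
    for ts, fp in emails:
        buckets[int(ts // interval)].append(fp)

    known = []  # fingerprints seen in earlier intervals
    rates = {}
    for idx in sorted(buckets):
        current = buckets[idx]
        if known:  # the first interval has nothing to be tested against
            hits = sum(any(is_match(fp, k) for k in known) for fp in current)
            rates[idx] = hits / len(current)
        known.extend(current)
    return rates


sample = [(0, "abc"), (30, "xyz"), (600, "abc"), (650, "new"), (1300, "xyz")]
print(detection_rates(sample, lambda a, b: a == b))  # {1: 0.5, 2: 1.0}
```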

We then trained the filters with all the spam emails and ran a check on a corpus of 500,000 legitimate emails and newsletters. No false positives were registered.

The fingerprinting technology was also used in between official tests in VBSpam comparative testing and the zero false-positive rate was confirmed.

5. Further study and limitations

The fingerprint is based on content, specifically words. If an email message contains no words (for example, an email that contains only images or URLs), a fingerprint cannot be generated.

Bibliography

[1] Levenshtein, V. I. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 4, 163, 845–848, 1965.

[2] The delimiters that we chose are: {‘ ’, ‘\n’, ‘\t’, ‘\r’, ‘\0’, ‘.’, ‘,’, ‘:’, ‘;’, ‘(’, ‘)’, ‘{’, ‘}’, ‘[’, ‘]’, ‘\\’, ‘/’, ‘^’, ‘\”’, ‘!’, ‘?’, ‘`’, ‘\’’, ‘+’, ‘*’, ‘^’, ‘$’, ‘|’, ‘?’, ‘”’}

[3] The Levenshtein edit distance [1] is found in O(mn) time (where m and n are the length of the measured strings).

[4] Kornblum, J. Identifying almost identical files using context triggered piecewise hashing. DFRWS conference 2006.
