Vipul Sharma, Steve Lewis Proofpoint Inc.
Spam filters that rely on machine learning often use the content of emails to generate features for the classification model. A well-known trick for fooling such filters is to introduce random text or noise into the email text: for example, 'Viagra' is written with look-alike symbols ('\/' for 'V', '|' for 'i', '@' for 'a'), and 'mortgage' is spelled 'm_o_r_t_g-a-g-e'.
Obfuscation is cumbersome to handle because there are endless ways to obfuscate a given word, so the feature space of the spam classification model has to be updated frequently with every such variant seen in spam emails. This also introduces a lag in countering spam, since the feature space can only be updated after the variants have been seen in spam emails.
There are at least two possible methods to counter the text obfuscation problem. The first is to de-obfuscate the spam message as a preprocessing step before classification. Previous research has shown that de-obfuscating spam emails gives the best classification accuracy, but the approach is slow and computationally expensive. These drawbacks are especially damaging for enterprise-class spam solutions, where the number of emails is extremely large, on the order of tens of millions per day. Any such slow and computationally expensive preprocessing technique will increase email delivery time and hardware requirements. This not only makes the solutions more expensive for end users but also creates severe performance issues for service providers.
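As a rough illustration of what a de-obfuscation preprocessing step involves, the sketch below maps a few look-alike symbols back to letters and strips separator characters from a single token. It is not the method evaluated in the research mentioned above; the `LOOKALIKES` table, the `SEPARATORS` set, and the function name `deobfuscate_token` are assumptions chosen for this example.

```python
import re

# Hypothetical look-alike table (an assumption for this sketch);
# a production filter would need a far larger one.
LOOKALIKES = [
    ("\\/", "v"),  # multi-character substitutions go first
    ("|", "i"),
    ("@", "a"),
    ("0", "o"),
    ("$", "s"),
]

# Separator characters inserted between letters (assumed set).
SEPARATORS = re.compile(r"[_\-.*]")


def deobfuscate_token(token: str) -> str:
    """Map look-alike symbols to letters, then drop separators."""
    text = token.lower()
    for symbol, letter in LOOKALIKES:
        text = text.replace(symbol, letter)
    return SEPARATORS.sub("", text)


print(deobfuscate_token("m_o_r_t_g-a-g-e"))
print(deobfuscate_token("\\/|@gr@"))
```

Even this toy version has to touch every token of every message, and applying it blindly would also corrupt legitimate tokens such as email addresses or prices, which hints at why full de-obfuscation becomes costly at enterprise volumes.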
Taking the above constraints into consideration, another technique to counter obfuscation is to identify the obfuscated words in an email and use them as an indicator of spam. Previous research reports a success rate of 75% in catching spam emails using such techniques.
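The idea of flagging obfuscated words, rather than decoding the whole message, can be sketched with a simple rule-based stand-in. The code below is not the learned detector discussed in this abstract; it only builds a tolerant regular expression for each target word and reports matches that differ from the plain spelling. The `VARIANTS` table, the `SEP` pattern, and both function names are assumptions made for illustration.

```python
import re

# Hypothetical per-letter alternatives (an assumption for this sketch).
VARIANTS = {
    "a": "[a@4]",
    "i": "[i1!|]",
    "o": "[o0]",
    "e": "[e3]",
    "s": "[s$5]",
    "v": r"(?:v|\\/)",
}

# Allow up to two separator characters between letters (assumed set).
SEP = r"[_\-.*]{0,2}"


def build_pattern(word):
    """Build a regex matching the word and its simple obfuscated variants."""
    parts = [VARIANTS.get(ch, re.escape(ch)) for ch in word]
    return re.compile(SEP.join(parts), re.IGNORECASE)


def find_obfuscated(text, targets):
    """Return (target, matched_text) pairs where the match is not the plain word."""
    hits = []
    for word in targets:
        for match in build_pattern(word).finditer(text):
            if match.group(0).lower() != word:
                hits.append((word, match.group(0)))
    return hits


print(find_obfuscated("cheap m_o_r_t_g-a-g-e rates", ["mortgage"]))
```

The number of hits per message could then serve as one spam indicator among others; a plain occurrence of 'mortgage' produces no hit, so legitimate mail using the word is not flagged by this rule.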
To find a better trade-off between performance and classification accuracy, we compared several classification algorithms applied to this technique. We report an empirical comparison of various multivariate classification techniques (e.g. random forests, Bayesian classification, C4.5) for obfuscation detection.
Our study also shows that by restricting obfuscation detection to certain 'frequently obfuscated words' and using preprocessing techniques such as discretization for feature generation, the detection accuracy can be increased to around 96% while keeping the computational and timing cost to a minimum.
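The abstract does not say which features are discretized or how, so the following is only a minimal sketch under stated assumptions: a token is scored against each word in a small, hypothetical list of frequently obfuscated words using `difflib.SequenceMatcher`, and the continuous score is mapped to one of a few equal-width bins. The word list, the similarity measure, and the bin count are all illustrative choices.

```python
from difflib import SequenceMatcher

# Hypothetical list of frequently obfuscated words (an assumption).
TARGET_WORDS = ["viagra", "mortgage"]


def similarity(token, target):
    """Continuous similarity in [0, 1] between a token and a target word."""
    return SequenceMatcher(None, token.lower(), target).ratio()


def discretize(value, n_bins=4, low=0.0, high=1.0):
    """Equal-width binning: map a value in [low, high] to a bin index 0..n_bins-1."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    width = (high - low) / n_bins
    return int((value - low) / width)


def token_features(token, targets=TARGET_WORDS):
    """One discretized similarity feature per target word."""
    return {target: discretize(similarity(token, target)) for target in targets}


print(token_features("m_o_r_t_g-a-g-e"))
```

Restricting the comparison to a short target list keeps the per-token cost small, and the binned values give a classifier a handful of categorical features instead of an open-ended vocabulary of obfuscated spellings.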
We also report a significant average increase of 0.2% in enterprise-level spam filtering effectiveness due to auxiliary classification models such as the obfuscation detector.