Introducing VB anti-spam testing

2009-01-01

Martijn Grooten

Virus Bulletin, UK
Editor: Helen Martin

Abstract

Virus Bulletin has decided to use the experience gained during more than a decade of anti-malware comparative testing to develop a regular anti-spam product test. Martijn Grooten outlines the proposed test set-up.


According to my mother, dolphins are the best spam filter. She told me that ever since her email program was configured so that an image of dolphins was added to the bottom of every email she sent, she had stopped receiving emails about Viagra.

Unfortunately, the dolphins did not really eat the spam. As it turned out, the addition of the image to her email footers coincided with her starting to use a new email address, which of course was the real reason for the reduction in spam. And, while the dolphins are still at the bottom of each email she sends, the spam has now returned.

So what is the best spam filter? End-users have little factual information upon which to base their choice of filter, and even the vendors themselves have little idea of their products’ performance compared to that of their competitors.

It seems that there is a need for independent and regular spam filter performance testing, and Virus Bulletin has decided to use the experience gained during more than a decade of anti-malware comparative testing to develop a regular anti-spam product test. Following months of internal discussion, culminating in an informal, yet fruitful meeting with members of the anti-spam industry at the last VB conference, a test methodology has been drawn up. Testing will start during the early months of 2009; this article outlines the proposed test set-up.

Who are the tests for?

While writing this article, I have received many emails from representatives of anti-spam companies politely, yet impatiently enquiring as to when we will be ready to start anti-spam testing. There certainly is big demand for product testing from within the industry: companies want to know how well they are performing compared to their competitors. If they are doing well, they want to boast about it to potential customers, while if they fail to live up to their promises, they want to know what parts of their product they need to improve. Just as Virus Bulletin has been running anti-malware tests for over a decade and helping AV developers improve their products, the results of VB’s anti-spam tests will help anti-spam vendors improve their filters.

More obviously, anti-spam customers want to know how well their chosen product, or one they are interested in purchasing, has performed in tests: whether independent sources back up the claims made by the vendors and whether those claims are based on long-term good performance.

However, a comparative test of anti-spam filters is more than just the comparison of various products: as different filters use different filtering methods, the test will implicitly compare these methods as well. This information should be valuable for the anti-spam industry as a whole.

Problems with anti-spam testing

In theory, setting up a comparative anti-spam test is easy: one sets up a forwarder that sends incoming mail to n mailboxes, one for each product to be tested. After a fixed period of time, one counts the number of misclassified ham messages and misclassified spam messages and thus ends up with a two-dimensional score for each product.
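
To make that two-dimensional score concrete, the short sketch below (in Python, with invented counts) computes a spam catch rate and a false positive rate from the four possible outcomes. It is an illustration only, not part of our test harness.

```python
# Minimal sketch of the two-dimensional score described above.
# The counts are hypothetical; a real test derives them from the
# manually verified classification database.

def spam_scores(spam_caught, spam_missed, ham_passed, ham_blocked):
    """Return (spam catch rate, false positive rate) as fractions."""
    catch_rate = spam_caught / (spam_caught + spam_missed)
    false_positive_rate = ham_blocked / (ham_passed + ham_blocked)
    return catch_rate, false_positive_rate

if __name__ == "__main__":
    catch, fp = spam_scores(spam_caught=9850, spam_missed=150,
                            ham_passed=1990, ham_blocked=10)
    print(f"catch rate: {catch:.2%}, false positive rate: {fp:.2%}")
```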

In practice, a multitude of problems arise, especially if one wants the anti-spam test to reflect a real-life situation. The first of these is the actual definition of spam, over which there is no more than vague general agreement. But even assuming that one does have a satisfactory definition of spam, it is not straightforward even to sort all incoming email manually according to this definition. If we at Virus Bulletin were able to do this with 100% accuracy in an automated way, we would likely quit our spam testing and start selling such a classifier commercially.

But even sidestepping the classification and definition issues, it is possible that the set-up described above will confuse spam filters. A filter might see an email claiming to be from [email protected] that, from the filter’s point of view, is sent from the IP address of the forwarder, and on that basis it might (incorrectly) classify the email as spam. To foil spammers, whose sending software rarely retries a delivery, a filter might also temporarily block an email and ask the sender to resend it after a given amount of time (a method known as greylisting); while the forwarder could be programmed to obey this request, the fact that different filters send different ‘blocking messages’ means that it would never be able to do exactly what the original sender would have done.
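
For readers unfamiliar with greylisting, the sketch below illustrates the basic idea: a temporary failure keyed on the (sending IP, sender, recipient) triplet, with delivery allowed only once the sender retries after a minimum delay. The triplet key and the 300-second delay are assumptions made for illustration, not a description of any tested product.

```python
# Illustrative greylisting decision, not part of the VB test set-up.
# A first delivery attempt for a (client IP, MAIL FROM, RCPT TO) triplet
# is deferred with a temporary failure; a retry after the minimum delay
# is accepted.
import time

GREYLIST_DELAY = 300          # seconds a sender must wait before retrying
first_seen = {}               # triplet -> timestamp of first attempt

def greylist(client_ip, mail_from, rcpt_to):
    """Return an SMTP-style response for this delivery attempt."""
    key = (client_ip, mail_from.lower(), rcpt_to.lower())
    now = time.time()
    if key not in first_seen:
        first_seen[key] = now
        return "451 4.7.1 Greylisted, please try again later"
    if now - first_seen[key] < GREYLIST_DELAY:
        return "451 4.7.1 Greylisted, please try again later"
    return "250 OK"
```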

Testing guidelines

The list of problems with possible anti-spam test set-ups is much longer than the few highlighted above. However, we do believe that it is possible to run good, representative and reliable anti-spam tests. To this end, we have set six conditions that our tests will fulfil:

  • Comparative: in order to be truly comparative all products will be tested in parallel and will be sent the same or a statistically comparable email stream.

  • Real-time: the emails will be sent to the products in real time, with no delay.

  • Real email: all the emails used in the test will have actually been sent from external sources; no extra ham or spam will be generated to increase the test sets.

  • Unbiased: when classifying emails, the testers (or anyone else involved in the classifying of emails) will have no information on how the email has been labelled by any of the test products.

  • Statistically valid: there is no ‘wildlist’ of all the spam sent out in a given period of time and thus any test set of emails will be nothing but a sample of the billions of spam and ham emails sent during the test period. This sample should be large enough for the testers to make claims about the products’ spam catch rates and their false positive rates.

  • Non-isolated environment: the products being tested will be able to connect to the Internet during testing.

In our opinion, any good anti-spam test should fulfil the six conditions listed above.

However, it would be an audacious claim to state that following these guidelines will guarantee that a test will be faultless. The reason for this is deeper than the fact that any test based on statistics is bound to incorporate some error. Many spam filters block traffic at the SMTP level, based on the IP address of the sender, the content of the MAIL FROM or RCPT TO commands or a combination of these. When this happens, the email is never received and while this generally boosts the filter’s performance, it is impossible for the tester to decide whether the unreceived email was spam or ham.

Anti-spam test set-up

Virus Bulletin intends to test spam filters using two test set-ups per product.

Measuring the false positive ratio

The first set-up uses the existing Virus Bulletin email stream. An incoming SMTP server will be built to do four things:

  1. Accept all email, regardless of whether the address in the RCPT TO command exists on our mail system.

  2. Store the full email (including all headers), as well as the full SMTP transaction, in a database.

  3. Forward the email to all the products taking part in the test (see below).

  4. Forward the email to Virus Bulletin’s internal network.

The forwarding described in step 3 should leave the headers intact, but add an extra Received: header. Moreover, the MAIL FROM address should be changed using the Sender Rewriting Scheme (SRS) to reflect the fact that the email is now sent from the forwarding SMTP server. Doing this should ensure that filters do not incorrectly classify an email as spam because it failed the SPF or DKIM test. (Of course, it may well be that the original email would have failed the SPF test, while the forwarded email will not; this is one of the reasons for testing using two parallel set-ups.) To ensure there is no bias towards any product, the forwarding will happen in a random order.
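
As an illustration of step 3, the sketch below forwards a stored message to each product in random order, prepending a Received: header and rewriting the envelope sender. The host names are placeholders and srs_rewrite() is a deliberately simplified stand-in for real SRS (which also encodes a hash and a timestamp); this is a sketch of the idea rather than our production forwarder.

```python
# Simplified sketch of step 3: forward one stored message to every product
# under test, in random order, with the original headers kept intact, an
# extra Received: header prepended and an SRS-rewritten envelope sender.
import random
import smtplib
from email.utils import formatdate

PRODUCT_HOSTS = ["product-a.test.example", "product-b.test.example"]  # placeholders

def srs_rewrite(mail_from, forwarder_domain="forwarder.test.example"):
    """Toy stand-in for SRS; real SRS also encodes a hash and timestamp."""
    local, _, domain = mail_from.partition("@")
    return f"SRS0={domain}={local}@{forwarder_domain}"

def forward(raw_message, mail_from, rcpt_to):
    # Prepend an extra Received: header as raw text; the original headers
    # and body are left untouched.
    received = (f"Received: from forwarder.test.example; "
                f"{formatdate(localtime=True)}\r\n").encode("ascii")
    forwarded = received + raw_message
    envelope_from = srs_rewrite(mail_from)
    hosts = PRODUCT_HOSTS[:]
    random.shuffle(hosts)              # random order, so no product is favoured
    for host in hosts:
        with smtplib.SMTP(host, 25, timeout=30) as smtp:
            smtp.sendmail(envelope_from, [rcpt_to], forwarded)
```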

For each product, a script will run regularly on the server to read the product’s log files and store the filter’s classifications in the aforementioned database. For each incoming email, the database will contain a list of ham/spam classifications, one for each product.
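
The shape of such a script might be as follows; the one-verdict-per-line log format and the SQLite schema are assumptions made purely for illustration, since the real log formats will differ from product to product.

```python
# Hypothetical sketch of the per-product log reader: each log line is
# assumed to hold a message identifier and a verdict ("ham" or "spam"),
# which is stored against that product in the classification database.
import sqlite3

def import_verdicts(product_name, log_path, db_path="classifications.db"):
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS verdicts
                  (message_id TEXT, product TEXT, verdict TEXT,
                   PRIMARY KEY (message_id, product))""")
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            parts = line.split()
            if len(parts) < 2:
                continue               # skip malformed lines
            message_id, verdict = parts[0], parts[1].lower()
            db.execute("INSERT OR REPLACE INTO verdicts VALUES (?, ?, ?)",
                       (message_id, product_name, verdict))
    db.commit()
    db.close()
```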

For many incoming emails, all filters will agree on the classification (all will classify the message as spam, or all will classify it as ham). Where this is the case we will assume the classification to be correct – while this is no guarantee that all the filters have got it right, it will greatly reduce the amount of time end-users need to spend manually classifying messages. Furthermore, given that our tests are comparative, it will not bias any of the products.

To classify the remaining emails, each end-user (a member of the Virus Bulletin team) will be presented with a web interface displaying all the currently unclassified emails addressed to them and will be required to classify each manually as ‘ham’, ‘spam’ or ‘unclassifiable’.

(It is possible that one or more of the filters taking part in the test perform so poorly that they disagree with the majority of the filters. If this proves to be the case, we will ignore the output of the poorly performing filters for the purposes of preliminary classification. This will be reported with the test results.)
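
In essence, the preliminary classification reduces to the consensus rule sketched below: a unanimous verdict (ignoring any excluded, poorly performing filters) is accepted, while anything else is queued for manual review. The function and its arguments are illustrative only.

```python
# Sketch of the consensus-based preliminary classification: if every
# filter (other than any excluded, poorly performing ones) returns the
# same verdict, that verdict is accepted; otherwise the message is
# queued for manual classification via the web interface.

def preliminary_verdict(verdicts, excluded_products=()):
    """verdicts: dict mapping product name -> 'ham' or 'spam'."""
    considered = {p: v for p, v in verdicts.items()
                  if p not in excluded_products}
    distinct = set(considered.values())
    if len(distinct) == 1:
        return distinct.pop()      # unanimous: assume correct
    return None                    # disagreement: needs manual review
```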

Measuring spam catch rates

While the first set-up will mainly be used to derive a metric for the false positive ratio (i.e. the relative occurrence of incorrectly classified ham), a second set-up will be used to measure the products’ spam catch rates.

To this end, we will use one or more large spam traps: domain names for which the incoming email is almost guaranteed to be spam. (At least in theory, there may be a legitimate reason for any address to be sent email, but if the email stream is large enough, the amount of legitimate email will be negligible.) Then, using round-robin DNS, we will distribute this mail stream equally among the products to be tested.

The products will thus communicate directly with the sending SMTP server. They will be free to do anything to check the credibility of the sender and to determine whether the email is spam: from slowing down the connection to discourage spammers from continuing with it (a method known as tarpitting), to checking the IP address against a DNS blacklist. However, they will not be allowed to block the connection temporarily and ask the sender to try again after a certain period of time (greylisting); round-robin DNS makes it unlikely that the second attempt arrives at the same SMTP server, thus potentially causing a number of problems, from an unbalanced stream to a long sequence of temporary failures that could ultimately cause the sending SMTP server to give up trying.
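
By way of illustration, a DNSBL lookup of the kind mentioned above simply reverses the connecting IPv4 address and queries it under the blacklist zone; the sketch below uses Spamhaus’s ZEN zone as an example, but any DNS blacklist works in the same way.

```python
# Illustration of a DNS blacklist (DNSBL) lookup: the connecting IPv4
# address is reversed and looked up under the blacklist zone; any answer
# means the address is listed.
import socket

def is_listed(ip_address, dnsbl_zone="zen.spamhaus.org"):
    reversed_ip = ".".join(reversed(ip_address.split(".")))
    query = f"{reversed_ip}.{dnsbl_zone}"
    try:
        socket.gethostbyname(query)   # any A record means "listed"
        return True
    except socket.gaierror:
        return False                  # NXDOMAIN: not listed

# e.g. is_listed("192.0.2.1") -> False for this documentation address
```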

To measure the products’ performance, all SMTP transactions will be logged and these logs will be compared against the emails that each product ultimately delivers as ham: all other emails will be considered to have been blocked as spam.
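
In other words, only two numbers per product are needed: the number of delivery attempts recorded in the SMTP logs and the number of messages ultimately delivered as ham. A one-line calculation, with invented figures, is shown below.

```python
# Catch rate on the spam-trap stream: anything that was offered to the
# product but not delivered as ham is counted as blocked spam.
# The figures below are invented for illustration.
messages_offered = 120_000          # from the SMTP transaction logs
delivered_as_ham = 360              # messages the product let through
catch_rate = (messages_offered - delivered_as_ham) / messages_offered
print(f"spam catch rate: {catch_rate:.2%}")   # 99.70%
```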

Speed and performance tests

Although the spam catch rate and the false positive rate are the most obvious metrics one would look at to compare anti-spam products, they are far from the only ways to describe a product’s performance.

Therefore, although our initial focus will be on measuring spam catch and false positive rates, we hope in the future to look into other metrics too. These include both speed (how long does it take for a legitimate email to reach the user’s inbox?) and performance (how much CPU does the spam filter use?), but may also include metrics such as the relative amount of spam blocked during the SMTP transaction.

Other metrics that may be published with the test results include the standard deviation of a product’s spam catch rate from its daily or hourly average or, in the event of a particularly large spam outbreak coinciding with a test, the products’ response times to that outbreak.
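
Assuming a list of daily catch rates, that consistency metric amounts to the short sketch below (the daily figures are invented).

```python
# Sketch of the consistency metric mentioned above: the standard
# deviation of a product's daily catch rates around their average.
import statistics

daily_catch_rates = [0.986, 0.991, 0.975, 0.994, 0.988]   # hypothetical
average = statistics.mean(daily_catch_rates)
deviation = statistics.pstdev(daily_catch_rates)
print(f"average {average:.1%}, standard deviation {deviation:.2%}")
```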

Definition of spam

The general definition of spam upon which the VB staff will base classification decisions is that of unsolicited bulk email: email sent in large quantities to users that have not explicitly given their consent to receive such email. This does not mean that everything else will be automatically classified as ham: we allow for the existence of a third category of ‘unclassifiable’ emails: ones which the recipient is unable to label as definitely ham or definitely spam. For instance, these may be messages sent to a predecessor’s email address, and the recipient may have no way of knowing whether or not their predecessor had consented to receive mail from the sender. This category of emails will be removed entirely from the computation of the products’ performance.

Even with this third category, we are aware that no end-user will be able to classify all the email they receive in a wholly consistent and accurate way. We believe that any inaccuracy introduced in this way will in fact reflect a real-life situation, and that the end-user’s perception of a filter’s performance is as important as its performance compared to any formal definition of spam.

Requirements and settings

To take part in Virus Bulletin’s anti-spam tests, products must be able to accept SMTP transfers and classify email into two categories: ham and spam. Products might have additional categories, such as ‘possibly spam’, and it will be decided on a product-by-product basis as to whether such categories are taken to mean ham or spam; this will always be done keeping the end-user in mind and will be reported with the test results.

Products must not send temporary failures to the sending SMTP server or, more generally, do anything that requires the sending server to reinitiate the connection.

Products must log their classifications in such a way that the testers can easily run a script to read the relevant data. Storing the email in two folders in a standard mail folder format, such as mbox, also counts as logging.
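
Where a product ‘logs’ by sorting messages into a ham folder and a spam folder, the reader script can be as simple as the sketch below, which uses Python’s standard mbox support; the folder paths are hypothetical.

```python
# Sketch of reading a product's verdicts from two mbox folders
# (one for ham, one for spam); the folder paths are hypothetical.
import mailbox

def verdicts_from_mbox(ham_path="ham.mbox", spam_path="spam.mbox"):
    """Return a dict mapping Message-ID -> 'ham' or 'spam'."""
    verdicts = {}
    for path, label in ((ham_path, "ham"), (spam_path, "spam")):
        for message in mailbox.mbox(path):
            message_id = message.get("Message-ID", "").strip()
            if message_id:
                verdicts[message_id] = label
    return verdicts
```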

The products will be installed and configured using their default settings.

Products will be allowed to connect to the Internet at any time, to allow them to update themselves and to be able to test incoming email against live blacklists and whitelists.

Every test will start with a clean install of both the product and the operating system.

Neither end-users nor testers will report the results of the filters back to the products during testing.

Awards and pricing

Results of the tests, which we anticipate running six times per year, will be published in the Spam Supplement section of Virus Bulletin magazine (available only to Virus Bulletin subscribers), with a basic summary of the results available free of charge to all registered users on the VB website.

The best-performing products will be awarded a certification, similar to the current VB100 awards for anti-malware products. The precise criteria for obtaining these awards will be decided once we have started testing properly, but our current aim is for the best-performing 50 per cent to achieve the award.

Given the cost of testing, which needs to be performed in parallel and will require the use of a separate machine for each product, we will be charging companies to take part in our tests. Of course, included in the fee will be the right to display the award (if achieved) on the company’s website and product literature. Moreover, companies taking part will be provided with feedback on their products’ performance, which may include a full overview of the spam their product has missed and, after anonymizing, ham emails that were wrongly classified as spam.

We are well aware of the existence of free, open-source anti-spam products and do not wish our tests to exclude these. Therefore, developers of products that are available entirely free of charge, open-source and that contain no in-product advertising will not be charged to enter their products in our tests. (However, VB reserves the right to limit the number of such products in each test on a first-come-first-served basis.)

Looking ahead

We will be the first to admit that our tests will not be 100% accurate. However, we intend for our tests to be as close to reality as possible, and of course will continue to look for ways in which the methodology can be tweaked and improved.

VB welcomes readers’ feedback on the proposed test methodology, as well as enquiries from developers interested in submitting their products for testing. Please direct all comments and enquiries to [email protected].
