Measuring and marketing spam filter accuracy

2005-11-01

John Graham-Cumming

The POPFile Project
Editor: Helen Martin

Abstract

'Over 99% accurate!' 'Zero critical false positives!' '10 times more effective than a human!' Claims about the accuracy of spam filters abound in marketing literature and on company websites. Yet even the term 'accuracy' isn't accurate.


Introduction

'Over 99% accurate!' 'Zero critical false positives!' '10 times more effective than a human!' Claims about the accuracy of spam filters abound in marketing literature and on company websites. Yet even the term 'accuracy' isn't accurate. The phrase '99% accurate' is almost meaningless; 'critical false positives' are subjective; and claims about being better than humans are hard to interpret when based on an unreliable calculation of accuracy.

Before explaining what's wrong with the figures that are published for spam filter accuracy, and describing some figures that actually do make sense, let's get some terminology clear.

Popular terminology

The two critical terms are 'spam' and 'ham'. The first problem with measuring a spam filter is deciding what spam is. There are varying formal definitions of spam, including unsolicited commercial email (UCE) and unsolicited bulk email (UBE). But to be frank, no formal definition captures people's common perception of spam; like pornography, the only definition that does work is 'I know it when I see it'.

That may be unsatisfactory, but all that matters in measuring a spam filter's accuracy is to divide a set of email messages into two groups: messages that are believed to be spam and those that are not (i.e. legitimate messages, commonly referred to as 'ham').

With spam and ham defined, it is possible to define two critical numbers: the false positive rate and the false negative rate. In the spam filtering world these terms have specific meanings: the false positive rate is the percentage of ham messages that were misidentified (i.e. the filter thought that they were spam messages); the false negative rate is the percentage of spam messages misidentified (i.e. the filter thought that they were legitimate).

To be formal, imagine a filter under test that receives S spam messages and H ham messages. Of the S spam messages, it correctly identifies a subset of them with size s; of the ham messages it correctly identifies h of them as being ham. The false positive rate of the filter is:

(H - h) / H

The false negative rate is:

(S - s) / S

An example filter might receive 7,000 spams and 3,000 hams in the course of a test. If it correctly identifies 6,930 of the spams then it has a false negative rate of 1%; if it misses three of the ham messages then its false positive rate is 0.1%.

How accurate is that filter? The most common definition of accuracy used in marketing anti-spam products is the total number of correctly identified messages divided by the total number of messages. Formally, that is:

(s + h) / (S + H)

or, in this case 99.27%.

99.27% sounds pretty good when marketing, but this figure is meaningless. A product that identified all 7,000 spams correctly, but missed 73 hams (i.e. has a false positive rate of 2.43%) is also 99.27% accurate.
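The arithmetic for the two example filters can be checked with a short Python sketch. The function name `rates` and the labels A and B are illustrative only; the counts are the ones used in the example above.

```python
def rates(S, H, s, h):
    """Return (false positive rate, false negative rate, accuracy).

    S, H: total spam and ham messages received.
    s, h: spam and ham messages correctly identified.
    """
    false_positive_rate = (H - h) / H
    false_negative_rate = (S - s) / S
    accuracy = (s + h) / (S + H)
    return false_positive_rate, false_negative_rate, accuracy


# Filter A misses 70 spams and 3 hams; filter B misses no spam but 73 hams.
filter_a = rates(S=7000, H=3000, s=6930, h=2997)
filter_b = rates(S=7000, H=3000, s=7000, h=2927)

for name, (fp, fn, acc) in (("A", filter_a), ("B", filter_b)):
    print(f"Filter {name}: FP rate {fp:.2%}, FN rate {fn:.2%}, accuracy {acc:.2%}")
```

Both filters come out at the same accuracy even though their false positive rates differ by a factor of more than 24.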

And therein lies the reason why 'accuracy' is useless. Since spam filters quarantine or delete messages they believe to be spam, a false positive is unseen by the end user. And a false positive is a legitimate (often business-related) email that has been lost. If you had to choose between a filter that loses 1 in 1,000 hams or one that loses nearly 1 in 40, you'd surely choose the former. The difference in importance between missed spam and missed ham reflects a skew in the cost of errors. (For a longer discussion of methods of calculating a spam filter's performance numbers see VB, May 2005, p.S1.)

While I'm on the subject of meaningless marketing words, take a look at 'critical false positives' (CFPs). A critical false positive is apparently a false positive that you care about. Anti-spam filter vendors like to divide ham messages into two groups: messages that you really don't want to lose, and those that it would be OK to lose. The handwaving definition of these two groups tends to be 'business messages' and 'personal messages and opt-in mailing lists'. Given that it's impossible to define a critical false positive, spam filter vendors have incredible latitude in defining what is and is not a CFP, and hence CFP percentages are close to useless.

Two numbers

In my anti-spam tool league table (ASTLT, see http://www.jgc.org/astlt/) – which summarizes published reports of spam filter accuracy – I use two numbers: the spam hit rate (which is the percentage of spam caught: 100% – false negative rate, or s/S) and the ham strike rate (the percentage of ham missed, i.e. the false positive rate).

A typical entry in the ASTLT looks like this:

Tool          Spam hit rate   Ham strike rate
MegaFilterX   .9956           .0010

This means that MegaFilterX caught 99.56% of spam and misidentified 0.1% of ham. The table is published in three forms: sorted by spam hit rate (best to worst, i.e. descending); sorted by ham strike rate (best to worst, i.e. ascending); and grouped by test. (Entries in the ASTLT are created from published reports of spam filter tests in reputable publications. The full details are provided on the ASTLT website. It is important to note that it's difficult to compare the numbers from different tests because of different test methodologies.)

The top five solutions from the current ASTLT figures (where top is defined by maximal spam hit rate and minimal ham strike rate) are:

Tool           Spam hit rate   Ham strike rate
GateDefender   .9954           .0000
IronMail       .9880           .0000
SpamNet        .9820           .0160
CRM114         .9756           .0039
SpamProbe      .9657           .0014

Here, the 'best' filter is the one with the highest spam hit rate and lowest ham strike rate. In the sample of entries above GateDefender is overall best, with IronMail close behind.

The use of two numbers also means that charts can easily be drawn where the upper right-hand corner indicates the best performance. All that is necessary is to plot the spam hit rate along the X axis and the ham strike rate along the Y axis (albeit in reverse order). Figure 1 shows the position of the top five solutions in the ASTLT.

Figure 1. Spam hit rate and ham strike rate for the top five solutions from the current ASTLT. The upper right-hand corner of the chart indicates the best performance.

However, testing organizations such as VeriTest (http://www.veritest.com/) wish to publish a single figure giving the overall performance of a spam filter. The simplest way to do this is to combine the spam hit rate and ham strike rate by weighting the contribution that those two numbers make to an overall 'performance' score for the filter. Clearly, the way in which the weights are created needs to reflect how much importance an end user gives to missed ham vs. delivered spam.

In VeriTest's case the spam hit rate contributes 40% of the overall score and the ham strike rate contributes 60%. To achieve the final score, the first thing they do is to translate each of the percentages into a score on the scale 2 to 5.

For the spam hit rate the top score, 5, comes at .9500 or above:

Spam hit rate             VeriTest points
At least .9500            5
Between .9000 and .9500   4
Between .8500 and .9000   3
Less than .8500           2

For the ham strike rate the top score, 5, comes at less than .0050:

Ham strike rate           VeriTest points
Less than .0050           5
Between .0050 and .0100   4
Between .0100 and .0150   3
Greater than .0150        2

VeriTest then takes the two 'VeriTest points' for a filter and combines them to obtain a final score (between 2 and 5), with 40% contributed by the spam hit rate and 60% by the ham strike rate.

Score = (spam hit rate points * 0.4) + (ham strike rate points * 0.6)

(For more on VeriTest's methodology see: http://www.veritest.com/downloads/services/antispam/VeriTest_AntiSpam_Benchmark_Service_Program_Description.pdf).

Using that scheme it's possible to score the top five tools in the ASTLT:

Tool           Spam hit rate   Ham strike rate   SHR points   HSR points   Score
GateDefender   .9954           .0000             5            5            5
IronMail       .9880           .0000             5            5            5
SpamNet        .9820           .0160             5            2            3.2
CRM114         .9756           .0039             5            5            5
SpamProbe      .9657           .0014             5            5            5

The combined scores put four of the tools on the same footing, and only SpamNet is scored lower because of its poor ham strike rate.
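The scoring scheme described above can be sketched in a few lines of Python. This is my reading of the published bands, not VeriTest's own implementation: the function names are made up for illustration, and the treatment of a rate that falls exactly on a band edge is an assumption.

```python
def shr_points(shr):
    """Points for a spam hit rate on the 2-to-5 scale."""
    if shr >= 0.95:  # "at least .9500"
        return 5
    if shr >= 0.90:
        return 4
    if shr >= 0.85:
        return 3
    return 2


def hsr_points(hsr):
    """Points for a ham strike rate on the 2-to-5 scale."""
    # Assumption: a rate exactly on a band edge gets the lower score,
    # in line with "less than .0050" for the top band.
    if hsr < 0.005:
        return 5
    if hsr < 0.010:
        return 4
    if hsr < 0.015:
        return 3
    return 2


def combined_score(shr, hsr):
    """Weighted score: 40% spam hit rate points, 60% ham strike rate points."""
    return round(0.4 * shr_points(shr) + 0.6 * hsr_points(hsr), 2)


TOOLS = {
    "GateDefender": (0.9954, 0.0000),
    "IronMail": (0.9880, 0.0000),
    "SpamNet": (0.9820, 0.0160),
    "CRM114": (0.9756, 0.0039),
    "SpamProbe": (0.9657, 0.0014),
}

for name, (shr, hsr) in TOOLS.items():
    print(f"{name:<13} {combined_score(shr, hsr)}")
```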

Part of the problem here is that there is no discrimination between spam filters once they reach a spam hit rate of .9500, or a ham strike rate of .0050. Better discrimination occurs if the scale is extended to 10 points, with the spam hit rate and ham strike rate broken down further.

The top score of 10 is given if the spam filter gives a perfect performance and misses no spam. Between .9500 and perfection each percentage point change (.0100) adds a point:

Spam hit rate             Points
Perfect (i.e. 1)          10
Between .9900 and 1       9
Between .9800 and .9900   8
Between .9700 and .9800   7
Between .9600 and .9700   6
Between .9500 and .9600   5
Between .9000 and .9500   4
Between .8500 and .9000   3
Less than .8500           2

Similarly, points for the ham strike rate can be extended to 10, breaking down ham strike rates below .0050 every tenth of a percentage (.0010):

Ham strike rate           Points
Perfect (i.e. 0)          10
Less than .0010           9
Between .0010 and .0020   8
Between .0020 and .0030   7
Between .0030 and .0040   6
Between .0040 and .0050   5
Between .0050 and .0100   4
Between .0100 and .0150   3
Greater than .0150        2

Now rescoring the top five tools using the same weighting (40% for spam catching ability and 60% for correct ham identification) a distinction emerges:

Tool           Spam hit rate   Ham strike rate   SHR points   HSR points   Score
GateDefender   .9954           .0000             9            10           9.6
IronMail       .9880           .0000             8            10           9.2
SpamNet        .9820           .0160             8            2            4.4
CRM114         .9756           .0039             7            6            6.4
SpamProbe      .9657           .0014             6            8            7.2

As spam filters improve, such discrimination between small changes in spam hit rate and ham strike rate is vital in determining which spam filter is the best.
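The extended 10-point scale can be sketched the same way. Again the function names are illustrative, and which band a rate exactly on an edge belongs to is an assumption; none of the five tools sits on an edge, so the scores are unaffected.

```python
# (lower bound, points): a spam hit rate at or above the bound earns the points.
SHR_BANDS = ((0.99, 9), (0.98, 8), (0.97, 7), (0.96, 6),
             (0.95, 5), (0.90, 4), (0.85, 3))

# (upper bound, points): a ham strike rate below the bound earns the points.
HSR_BANDS = ((0.001, 9), (0.002, 8), (0.003, 7), (0.004, 6),
             (0.005, 5), (0.010, 4), (0.015, 3))


def shr_points_10(shr):
    """Points for a spam hit rate on the 2-to-10 scale."""
    if shr >= 1.0:  # perfect: no spam missed
        return 10
    for lower, points in SHR_BANDS:
        if shr >= lower:
            return points
    return 2


def hsr_points_10(hsr):
    """Points for a ham strike rate on the 2-to-10 scale."""
    if hsr <= 0.0:  # perfect: no ham misidentified
        return 10
    for upper, points in HSR_BANDS:
        if hsr < upper:
            return points
    return 2


def combined_score_10(shr, hsr):
    """Same 40%/60% weighting as before, applied to the 10-point scale."""
    return round(0.4 * shr_points_10(shr) + 0.6 * hsr_points_10(hsr), 2)


TOOLS = {
    "GateDefender": (0.9954, 0.0000),
    "IronMail": (0.9880, 0.0000),
    "SpamNet": (0.9820, 0.0160),
    "CRM114": (0.9756, 0.0039),
    "SpamProbe": (0.9657, 0.0014),
}

for name, (shr, hsr) in TOOLS.items():
    print(f"{name:<13} {combined_score_10(shr, hsr)}")
```

Run against the five tools, this gives scores of 9.6, 9.2, 4.4, 6.4 and 7.2, matching the Score column of the table.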

Determining the right weights is difficult and subjective. Is a missed ham twice as bad as a missed spam, 10 times as bad? It's hard to know the answer. What is needed is a way of weighing the cost of an undelivered ham and the cost of a delivered spam.

Cost and sensitivity

To try to model that, imagine that an organization receives M messages per year, that a fraction Sp of the messages is spam (for example, 0.65 if 65% of messages are spam), and that the organization has determined that a delivered spam costs Cs (you choose the currency) and an undelivered ham costs Ch.

The annual cost of a spam filter can be determined in terms of its spam hit rate (SHR) and ham strike rate (HSR) as follows:

Cost = Sp * M * Cs * (1-SHR) + (1-Sp) * M * Ch * HSR

It's possible to simplify that formula when comparing filters by first eliminating M, yielding a cost per message (CPM):

CPM = Sp * Cs * (1-SHR) + (1-Sp) * Ch * HSR

And then, instead of assigning absolute values to the costs of missed messages, replace Cs and Ch with their relative costs. By assigning the cost of a delivered spam a base value of 1 and an undelivered ham a relative cost of H (not to be confused with the H used earlier for the number of ham messages), the formula can be used to compare filters:

Simplified cost = Sp * (1-SHR) + (1-Sp) * H * HSR

And given that the percentage of all messages that are spam is well known (and probably knowable for a given organization), an absolute value for Sp can be inserted. Imagine that 65% of all messages are currently spam:

Simplified cost = 0.65 * (1-SHR) + 0.35 * H * HSR

Now for any spam filter's published or tested spam hit rate and ham strike rate it's possible to plot H against the simplified cost. In that way an organization can determine which filter to choose based on the sensitivity to changes in H.
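As a rough illustration, the simplified cost can be tabulated for a range of values of H in Python. The function and variable names are mine, and the 65% spam fraction is the assumed figure used above, not a measured one.

```python
SPAM_FRACTION = 0.65  # Sp: assumed fraction of all messages that are spam


def simplified_cost(shr, hsr, h, sp=SPAM_FRACTION):
    """Simplified cost = Sp * (1 - SHR) + (1 - Sp) * H * HSR.

    shr, hsr: spam hit rate and ham strike rate, as fractions.
    h: cost of an undelivered ham relative to a delivered spam.
    """
    return sp * (1 - shr) + (1 - sp) * h * hsr


FILTERS = {
    "GateDefender": (0.9954, 0.0000),
    "IronMail": (0.9880, 0.0000),
    "SpamNet": (0.9820, 0.0160),
    "CRM114": (0.9756, 0.0039),
    "SpamProbe": (0.9657, 0.0014),
}

# One row per filter, one column per value of H from 1 to 10.
print(" " * 13 + " ".join(f"H={h:<4}" for h in range(1, 11)))
for name, (shr, hsr) in FILTERS.items():
    costs = [simplified_cost(shr, hsr, h) for h in range(1, 11)]
    print(f"{name:<13}" + " ".join(f"{c:.4f}" for c in costs))
```

Each row is a straight line in H: the spam hit rate fixes where the line starts and the ham strike rate fixes its slope.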

Figure 2, for example, is a graph showing the cost of each of the top five spam filters in this article with H varying from 1 to 10 (i.e. a false positive is between 1 and 10 times the cost of a delivered spam):

Figure 2. Simplified cost of each of the top five spam filters in this article.

Because GateDefender and IronMail had a ham strike rate of .0000, their cost is constant, and GateDefender (with the best spam hit rate) is the cheapest overall. (In a real test it would be better to evaluate the actual spam hit rate and ham strike rate before plugging them into the formulae above; it's unlikely that a ham strike rate of .0000 is currently feasible in the real world.)

An interesting crossover happens when H is around 7. At that point SpamProbe becomes cheaper to use than CRM114; this reflects SpamProbe's lower ham strike rate. SpamNet quickly becomes the most expensive solution because of its high ham strike rate.
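The crossover point can also be found directly by setting the simplified costs of two filters equal and solving for H. The sketch below does that; the function name is illustrative and the 65% spam fraction is the same assumption as before.

```python
def crossover_h(filter_a, filter_b, sp=0.65):
    """Value of H at which two filters have the same simplified cost.

    Each filter is a (spam hit rate, ham strike rate) pair. Setting
        sp * (1 - shr_a) + (1 - sp) * H * hsr_a
    equal to the same expression for filter b and solving for H gives
        H = sp * (shr_b - shr_a) / ((1 - sp) * (hsr_b - hsr_a)).
    Returns None when the ham strike rates are equal (the lines are parallel).
    """
    shr_a, hsr_a = filter_a
    shr_b, hsr_b = filter_b
    if hsr_a == hsr_b:
        return None
    return sp * (shr_b - shr_a) / ((1 - sp) * (hsr_b - hsr_a))


CRM114 = (0.9756, 0.0039)
SPAMPROBE = (0.9657, 0.0014)
SPAMNET = (0.9820, 0.0160)

print(f"CRM114 vs SpamProbe:  H = {crossover_h(CRM114, SPAMPROBE):.2f}")
print(f"SpamNet vs SpamProbe: H = {crossover_h(SPAMNET, SPAMPROBE):.2f}")
```

For CRM114 and SpamProbe this gives a value of H of roughly 7.35, and for SpamNet and SpamProbe roughly 2.07, so SpamNet is already the most expensive of the five once a lost ham is counted as a little more than twice as costly as a delivered spam. A result that is negative or below 1 means the lines do not cross in the range plotted.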

Conclusion

Spam filters are becoming more and more accurate; they are catching more spam and missing less ham. But it is still important to weigh two numbers when evaluating a filter: its ability to catch spam and its effectiveness at delivering ham.

Note

I am always on the lookout for new tests to include in the league table; if you know of any please email them to me. The figures in this article are from the published test results that I know about; other tests may show that the products mentioned have better performance than indicated here.
