The TREC 2005 Spam Filter Evaluation Track


Gordon V. Cormack

University of Waterloo, Canada
Editor: Helen Martin




The 14th Text Retrieval Conference (TREC 2005) took place in November 2005. One of the highlights of the event was the TREC spam filter evaluation effort, in which 53 spam filters developed by 17 organizations were tested on four separate email corpora totalling 318,482 messages. The purpose of the exercise was not to identify the best entry, but rather to provide a laboratory setting for controlled experiments. The results provide helpful insight into how well spam filters work, and which techniques merit further investigation.

The technique of using data-compression models to classify messages proved particularly promising, as evidenced by the fine performance of the filters submitted by Andrej Bratko and Bogdan Filipič of the Jožef Stefan Institute in Slovenia. Other techniques of note are the toolkit approach of Bill Yerazunis' CRM-114 group and the combination of weak and strong filters by Richard Segal of IBM.

Evaluation tools and corpora

Each spam filter was run in a controlled environment simulating personal spam filter use. A sequence of messages was presented to the filter, one message at a time, using a standard command-line interface. The filter was required to return two results for each message: a binary classification (spam or ham [not spam]), and a 'spamminess' score (a real number representing the estimated likelihood that the message is spam). After returning this pair of results, the filter was presented with the correct classification, thus simulating ideal user feedback. The software created for this purpose – the TREC Spam Filter Evaluation Tool Kit – is available for download under the GNU General Public License.
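The feedback loop just described can be sketched as follows. This is only an illustration: the `Filter` interface with `classify` and `train` methods is hypothetical, since the real toolkit drives each filter through a command-line interface rather than a Python object.

```python
# Sketch of the online evaluation loop the toolkit simulates.
# The Filter interface (classify/train) is hypothetical; the real
# toolkit drives each filter via a command-line interface.

def evaluate(filter_, messages, gold_labels):
    """Present messages one at a time; reveal the gold label after each."""
    judgments = []
    for msg, gold in zip(messages, gold_labels):
        # The filter must commit to a classification and a
        # spamminess score before seeing the true label.
        predicted, score = filter_.classify(msg)
        judgments.append((gold, predicted, score))
        # Ideal user feedback: the correct label, every time.
        filter_.train(msg, gold)
    return judgments
```

The recorded (gold, predicted, score) triples are all that the classification and ROC measures below need.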

The spam track departed from TREC tradition by testing with both public and private corpora. The public corpus tests were carried out by participants, while the private corpus tests were carried out by the corpus proprietors. This hybrid approach required participants both to run their filters, and to submit their implementations for evaluation by third parties.

The departure from tradition was occasioned by privacy issues which make it difficult to create a realistic email corpus. It is a simple matter to capture all the email delivered to a recipient or set of recipients. However, acquiring their permission, and that of their correspondents, to publish the email is nearly impossible. This leaves us with a choice between using an artificial public collection and using a more realistic private collection. The use of both strategies allowed us to investigate this trade-off.

The three private corpora each consisted of all the email (both ham and spam) received by an individual over a specific period. The public corpus came from two sources: Enron email released during the course of the Federal Energy Regulatory Commission’s investigation, and recent spam from a public archive. The spam was altered carefully so as to appear to have been addressed to the same recipients and delivered to the same mail servers during the same period as the Enron email. Despite some trepidation that evidence of this forgery would be detected by one or more of the filters, the results indicate that this did not happen.

To form a test corpus, each message must be augmented with a gold standard representing the true classification of a message as ham or spam. The gold standard is used in simulating user feedback and in evaluating the filter's effectiveness. As reported previously (see VB, May 2005, p.S1), much effort was put into ensuring that the gold standard was sufficiently accurate and unbiased. Lack of user complaint is insufficient evidence that a message has been classified correctly. Similarly, we believe that deleting hard-to-classify messages from the corpus introduces unacceptable bias. In comparing the TREC results with others, one must consider that these and other evaluation errors may tend to overestimate filter performance.

Evaluation measures

The primary measures of classification performance are ham misclassification percentage (hm%) and spam misclassification percentage (sm%). A filter makes a trade-off between its performances on these two measures. It is an easy matter to reduce hm% at the expense of sm%, and vice versa. The relative importance of these two measures is the subject of some controversy, with the majority opinion being that reducing hm% is more important, but not at all costs with respect to increasing sm%. At TREC we attempted to sidestep the issue by reporting the logistic average (lam%) of the two scores, which rewards equally the same multiplicative factor in ham or spam misclassification odds. More formally:

lam% = logit⁻¹( ( logit(hm%) + logit(sm%) ) / 2 )

where

logit(x) = log( odds(x) )

odds(x) = x / (100% - x)
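The logistic average is straightforward to compute. The sketch below works in percentages throughout, matching the definitions above:

```python
import math

def logit(x_pct):
    """Log of the misclassification odds of a percentage (0 < x_pct < 100)."""
    p = x_pct / 100.0
    return math.log(p / (1.0 - p))

def inv_logit(l):
    """Inverse of logit, returned as a percentage."""
    return 100.0 / (1.0 + math.exp(-l))

def lam(hm_pct, sm_pct):
    """Logistic average: rewards equally the same multiplicative
    factor in ham or spam misclassification odds."""
    return inv_logit((logit(hm_pct) + logit(sm_pct)) / 2.0)
```

Note that lam(x, x) = x, and that the measure is symmetric in its two arguments: averaging in log-odds space rather than directly prevents either error rate from dominating.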

Another way to sidestep the trade-off issue is to use the spamminess score to plot a Receiver Operating Characteristic (ROC) curve that represents all (hm%, sm%) pairs that could be achieved by the filter by changing a threshold parameter. Figure 1 shows the ROC curve for the best filter from each organization, as tested on the public corpus. In general, higher curves indicate superior performance regardless of the trade-off between hm% and sm%, while curves that intersect indicate different relative performance depending on the relative importance of hm% and sm%. The solid curve at the top (ijsSPAM2full; Bratko's filter) shows sm% = 9.89% when hm% = 0.01%, sm% = 1.78% when hm% = 0.1%, and so on.
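The curve is traced by sweeping a threshold over the recorded spamminess scores; each threshold yields one (hm%, sm%) point. A brute-force sketch, under the convention that a score above the threshold is called spam:

```python
def roc_points(ham_scores, spam_scores):
    """(hm%, sm%) pairs swept over every distinct score threshold.

    Scores above the threshold are classified as spam, so raising
    the threshold lowers hm% and raises sm%."""
    points = []
    for t in sorted(set(ham_scores) | set(spam_scores)):
        hm = sum(1 for s in ham_scores if s > t)    # ham called spam
        sm = sum(1 for s in spam_scores if s <= t)  # spam called ham
        points.append((100.0 * hm / len(ham_scores),
                       100.0 * sm / len(spam_scores)))
    return points
```

A filter whose scores separate the two classes perfectly yields a point at (0, 0); real filters trace a curve of intermediate trade-offs.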

A useful summary measure of performance is the area under the ROC curve, ROCA, a number between 0 and 1 that indicates overall performance. In addition to the geometric interpretation implied by its name, this area represents a probability: the probability that the filter will give a random spam message a higher spamminess score than a random ham message. TREC reports (1-ROCA) as a percentage, consistent with the other summary measures which measure error rates rather than success rates.
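The probabilistic interpretation gives a direct, if quadratic-time, way to compute the statistic (a sketch, with ties counted as half):

```python
def one_minus_roca_pct(ham_scores, spam_scores):
    """(1-ROCA) as a percentage, computed by brute-force pairing.

    ROCA = P(a random spam scores higher than a random ham);
    a tied pair contributes one half."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    roca = wins / (len(spam_scores) * len(ham_scores))
    return 100.0 * (1.0 - roca)
```

This pairwise formulation is equivalent to the geometric area under the ROC curve; production evaluation code would sort the scores once rather than compare every pair.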

TREC 2005's spam evaluation used three summary measures of performance: lam%, (1-ROCA)%, and sm% at hm% = 0.1%. Each provides a reasonable estimate of overall filter performance; none definitively identifies the best filter.


The TREC spam evaluations generated a vast number of curves and statistics, which will appear in the TREC 2005 proceedings, to be published early in 2006. Here we summarize the results with respect to the public corpus.


Figure 1. ROC curve for the best filter from each organization, as tested on the public corpus.

Table 1 associates each of the selected test runs (i.e. the best per organization) with its author. Only 12 of the filters were authored by official TREC 2005 participants; the other five were popular open-source spam filters, configured by the spam track organizers in consultation with their authors.

Run         Filter                                         Author
bogofilter  Bogofilter (open source)                       David Relson (non-participant)
ijsSPAM2    PPM-D compression model                        Andrej Bratko (Jožef Stefan Institute)
spamprobe   SpamProbe (open source)                        Brian Burton (non-participant)
spamasas-b  SpamAssassin, Bayes filter only (open source)  Justin Mason (non-participant)
crmSPAM3    CRM-114 (open source)                          Bill Yerazunis (MERL)
621SPAM1    SpamGuru                                       Richard Segal (IBM)
lbSPAM2     dbacl (open source)                            Laird Breyer
popfile     POPFile (open source)                          John Graham-Cumming (non-participant)
dspam-toe   DSPAM (open source)                            Jon Zdziarski (non-participant)
tamSPAM1    SpamBayes (open source)                        Tony Meyer
yorSPAM2                                                   Jimmy Huang (York University)
indSPAM3                                                   Indiana University
kidSPAM1                                                   Beijing U. of Posts & Telecom.
dalSPAM4                                                   Dalhousie University
pucSPAM2                                                   Egidio Terra (PUC Brazil)
ICTSPAM2                                                   Chinese Academy of Sciences
azeSPAM1                                                   U. Paris-Sud

Table 1. The selected test runs and their authors.

Table 2 shows the three classification-based measures (hm%, sm%, and lam%) for each filter, ordered by lam%. Note that hm% and sm% give nearly opposite rankings, reflecting their strong negative correlation and their dependence on each filter's threshold setting.


Table 2. The classification-based measures, ordered by lam%.

Table 3 shows the three summary measures – (1-ROCA)%, sm% at hm% = 0.1%, and lam% – and the rank of each filter according to each of the measures. Note that while the rankings are not identical, they have a high positive correlation. The measures with respect to the other corpora vary somewhat but give the same general impression.

Run    (1-ROCA)%    Rank    sm% @ hm%=0.1    Rank    lam%    Rank

Table 3. The summary measures and the rank of each filter according to those measures.


The most startling observation is that character-based compression models perform outstandingly well at spam filtering. Commonly used open-source filters perform well, though neither as well nor as poorly as has been reported elsewhere. We have reason to believe that reports on the performance of other filters are similarly unreliable; only standardized evaluation can test their credibility.

The main result from TREC is the toolkit and methodology for filter evaluation, which anyone may use to perform further tests. The public corpus will be made available to all, subject to a usage agreement. The private corpora will remain in escrow so that new filters may be tested against them. Plans are already under way for TREC 2006, in which both the existing tests and new ones will be run on new filters and corpora. The new tests will include modelling of unreliable user feedback, the use of external resources, and other email-processing applications.


