The TREC 2006 Spam Filter Evaluation Track

2007-01-01

Gordon Cormack

University of Waterloo, Canada
Editor: Helen Martin

Abstract

The 15th Text Retrieval Conference (TREC 2006) took place in November 2006. For the second time, TREC included a spam track, whose purpose was to create realistic standardized benchmarks to measure spam filter effectiveness in a laboratory setting. Gordon Cormack reports on the results.


Introduction

The 15th Text Retrieval Conference (TREC 2006) took place in November 2006. For the second time, TREC included a spam track, whose purpose was to create realistic standardized benchmarks to measure spam filter effectiveness in a laboratory setting.

The TREC 2006 spam track evaluated new and existing techniques with new data sets using, as a baseline, the test method defined for TREC 2005 [1].

This method – which we dub ‘immediate feedback’ – presents to the filter a chronological sequence of email messages for classification, and simulates the behaviour of an idealized user by presenting to the filter the true classification of each message immediately thereafter. TREC 2006 introduced two new tests – delayed feedback and active learning – to model different usage scenarios. Details of the tests appear in the TREC Spam Track Guidelines [2].

The spam track uses a combination of public and private test corpora. Public corpora offer the advantage that they may be used and reused widely to compare the efficacy of diverse filtering approaches. Private corpora are more realistic, but access to them is limited. For TREC 2006 two public and two private corpora were used. One public corpus was English; the other Chinese. The two private corpora contained new email from two individuals whose email comprised two of the TREC 2005 corpora.

The best-performing method from TREC 2005 – Bratko’s compression-based filter – was a strong, but not dominant, performer at TREC 2006. OSBF-Lua, from Assis (a CRM114 team member in 2005), and a soft margin perceptron from Tufts University also showed top performance. OSBF-Lua appears to have the edge in most tests, but further experiments would be necessary to show significant differences among these three filters. A team from Humboldt University in Berlin used a discriminative filter with extensive pre-training to show excellent results for the active learning and several of the delayed feedback tests.

Evaluation setup

The test framework presents a set of chronologically ordered email messages, one at a time, to a spam filter for classification. For each message, the filter yields a binary judgement – spam or ham (i.e. non-spam) – which is compared to a human-adjudicated gold standard. The filter also yields a ‘spamminess’ score, intended to reflect the likelihood that the classified message is spam, which is the subject of post-hoc ROC (Receiver Operating Characteristic) analysis. The results of ROC analysis are presented as a graph (ROC curve) or as a summary error probability (1-ROC area).
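By way of illustration, the following is a minimal sketch of how the 1-ROCA (%) summary can be computed from spamminess scores and gold-standard labels. It is not the evaluation toolkit's code, merely the standard pairwise formulation of ROC area; the function name is illustrative.

    # Minimal sketch (not the TREC toolkit's code): the ROC area equals the
    # probability that a randomly chosen spam message receives a higher
    # spamminess score than a randomly chosen ham message, counting ties as
    # one half; 1-ROCA (%) is its complement expressed as a percentage.
    def one_minus_roc_area_percent(scores, labels):
        spam = [s for s, lab in zip(scores, labels) if lab == "spam"]
        ham = [s for s, lab in zip(scores, labels) if lab == "ham"]
        wins = 0.0
        for s in spam:
            for h in ham:
                if s > h:
                    wins += 1.0
                elif s == h:
                    wins += 0.5
        roc_area = wins / (len(spam) * len(ham))
        return 100.0 * (1.0 - roc_area)

    # A perfect ranking scores 0.0; a random one approaches 50.0.
    print(one_minus_roc_area_percent([0.9, 0.8, 0.2, 0.1],
                                     ["spam", "spam", "ham", "ham"]))  # 0.0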

The baseline test simulates an ideal user who reports filter errors immediately and accurately to the filter so that it may amend its behaviour. But real users are not ideal, and may be expected to under-report filter errors, and to do so only after some delay. This scenario is modelled by the delayed feedback test, in which the gold standard classification for a message is communicated to the filter only after it has been required to classify in the order of 1,000 further messages.
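The difference between the two protocols amounts to when the gold-standard label is released to the filter. A minimal sketch of the evaluation loop follows, assuming a hypothetical filter object with classify() and train() methods; the toolkit's actual interface differs.

    # Sketch only: `filter_` is a hypothetical object with
    # classify(msg) -> (judgement, score) and train(msg, gold_label) methods.
    from collections import deque

    def run_stream(filter_, messages, gold_labels, delay=0):
        """Classify messages in chronological order, releasing each gold
        label only after `delay` further messages have been classified.
        delay=0 reproduces immediate feedback; a delay on the order of
        1,000 models the delayed feedback test."""
        pending = deque()     # (message, gold label) awaiting release
        results = []
        for msg, label in zip(messages, gold_labels):
            results.append(filter_.classify(msg))
            pending.append((msg, label))
            if len(pending) > delay:          # feedback now available
                filter_.train(*pending.popleft())
        return results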

When a spam filter is first deployed, there may be a set of unclassified email messages – such as those existing in the user’s mailbox at the time of deployment – available for prior analysis. This scenario is modelled by the active learning test. The filter is able to present to the user several messages (100, 200, 400, etc. in distinct tests) for classification; the user indicates to the filter whether or not each message is spam.

Following this analysis phase, the filter is required to classify a sequence of new messages.
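A minimal sketch of this protocol, again assuming a hypothetical filter interface (select(), train() and classify() are illustrative names, not the toolkit's), might look as follows:

    # Sketch only: the filter chooses which mailbox messages the user
    # should adjudicate, is trained on those labels, and then classifies
    # a sequence of new messages.
    def run_active_learning(filter_, mailbox, user_labels, new_messages, quota):
        """quota is the number of adjudications allowed (100, 200, 400, ...);
        user_labels maps a message to its gold standard ('spam' or 'ham')."""
        chosen = filter_.select(mailbox, quota)       # filter picks messages
        for msg in chosen:
            filter_.train(msg, user_labels[msg])      # user supplies the label
        return [filter_.classify(msg) for msg in new_messages]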

All tests were performed using the TREC Spam Filter Evaluation Toolkit, developed for this purpose. The toolkit is free software and is readily portable.

Test corpora

TREC 2006 used two public corpora, trec06p (English) and trec06c (Chinese), as well as two private corpora, MrX2 and SB2, whose sizes are given in Table 1.

Private corpora      Ham        Spam       Total
MrX2                 9,039      40,135     49,174
SB2                  9,274      2,695      11,969
Total                18,313     42,830     61,143

Public corpora       Ham        Spam       Total
trec06p              12,910     24,912     37,822
trec06c              21,766     42,854     64,620
Total                34,676     67,766     102,442

Table 1. Corpus statistics.

The ham and some of the spam messages in trec06p were crawled from the web. These messages were adjudicated by human judges assisted by several spam filters – none of which were participants in TREC – using the methodology developed for TREC 2005. The messages were augmented by approximately 22,000 spam messages collected in May 2006. Each spam message was altered to make it appear to have been addressed to the same recipient and delivered to the same mail server during the same time frame as some ham message.
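The following hypothetical sketch (the track's actual rewriting tool is not reproduced here, and align_spam_with_ham is an invented helper) illustrates the kind of header rewriting involved: the relevant headers of a spam message are replaced with those of a ham message so that it appears to have been addressed to the same recipient and delivered during the same time frame.

    # Illustrative sketch only, using Python's standard email module;
    # align_spam_with_ham is a hypothetical helper, not the track's tool.
    from email import message_from_string

    def align_spam_with_ham(spam_text, ham_text):
        spam = message_from_string(spam_text)
        ham = message_from_string(ham_text)
        for header in ("To", "Delivered-To", "Received", "Date"):
            del spam[header]                  # removes all occurrences, if any
            if ham[header] is not None:
                spam[header] = ham[header]    # copy the ham message's value
        return spam.as_string()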

The trec06c corpus used data provided by Quang-Anh Tran of the CERNET Computer Emergency Response Team (CCERT) at Tsinghua University, Beijing. The ham messages consisted of those sent by a mailing list; the spam messages were those sent to a spam trap in the same Internet domain.

The MrX2 corpus was derived from the same source as the MrX corpus used for TREC 2005. For comparability with MrX, a random subset of X’s email from October 2005 through April 2006 was selected so as to yield the same corpus size and ham/spam ratio as for MrX. This selection involved primarily the elimination of spam messages, whose volume had increased by about 50% since the 2003–2004 interval in which the original MrX corpus was collected; ham volume was essentially unchanged.

The SB2 corpus was collected from the same source as last year’s SB corpus. Spam volume had tripled since last year; all delivered messages were used in the corpus.

Results

Nine groups participated in the TREC 2006 filtering tasks; five of them also participated in the active learning task. For each task, each participant submitted up to four filter implementations for evaluation on the private corpora; in addition, each participant ran the same filters on the public corpora, which were made available following filter submission. All test runs are labelled with an identifier whose prefix indicates the group, and whose suffix indicates the corpus and test. Table 2 shows the identifier prefix for each submitted filter.

Group                                                  Filter prefix
Beijing University of Posts and Telecommunications     bpt
Harbin Institute of Technology                         hit
Humboldt University Berlin & Strato AG                 hub
Tufts University                                       tuf
Dalhousie University                                   dal
Jozef Stefan Institute                                 ijs
Tony Meyer                                             tam
Mitsubishi Electric Research Labs (CRM114)             CRM
Fidelis Assis                                          ofl

Table 2. Participant filters.

Figure 1 shows the best result for each participant in the immediate feedback test with the trec06p corpus. Each result is represented by a ROC curve. In general, the higher curves are better, and there is little to choose among the top performers. Table 3 (column: trec06p immediate) presents 1-ROCA (%) as a summary of the distance from the curve to the top-left (optimal) corner of the graph. The other columns of the table present the same summary statistic for the other corpora, and for the delayed feedback test.

Figure 1. trec06p public corpus – immediate feedback.

                   trec06p             trec06c             MrX2                SB2
Filter\Feedback    immediate  delayed  immediate  delayed  immediate  delayed  immediate  delayed
oflS1              0.0540     0.1668   0.0035     0.0666   0.0363     0.0651   0.1300     0.3692
tufS2              0.0602     0.2038   0.0031     0.0104   0.0691     0.1449   0.3379     0.6923
ijsS1              0.0605     0.2457   0.0083     0.1117   0.0809     0.0633   0.1633     0.4276
CRMS3              0.1136     0.2762   0.0105     0.0888   0.1393     0.1129   0.2983     0.4584
hubS3              0.1564     0.1958   0.0353     0.0495   0.2102     0.2294   0.6225     0.8104
hitS1              0.2884     0.5783   0.2054     1.3803   0.1412     0.5184   0.5806     1.2829
tamS4              0.2326     0.4129   0.1173     0.2705   0.1328     0.1755   0.4813     0.9653
bptS2              1.2109     1.9264   1.8912     2.5444   2.5486     2.9571   1.4311     2.9050
dalS1              3.1383     6.3238   0.2739     0.4817   2.5035     4.3461   4.1620     5.6777
Table 3. Summary 1-ROCA (%).

Figure 2 shows the performance of the active learning filters as a function of n – the number of messages presented by the filter to the user for adjudication. The filter from Humboldt University uses a method known as uncertainty sampling – in which messages that the filter finds most difficult to classify are presented for adjudication – to achieve excellent results for small n, at the expense of performance for larger n.

Figure 2. Active Learning – trec06p Public Corpus.
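Uncertainty sampling itself is simple to sketch. The following minimal example assumes a hypothetical filter whose score() method returns a spamminess value in [0, 1], with 0.5 as the decision threshold; it is not the Humboldt team's implementation.

    # Sketch only: pick the n pool messages the filter is least sure about,
    # i.e. those whose spamminess scores lie closest to the threshold.
    def uncertainty_sample(filter_, pool, n):
        return sorted(pool, key=lambda msg: abs(filter_.score(msg) - 0.5))[:n]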

Discussion

Although the Chinese corpus was much easier than the others, and SB2 was harder, results were generally consistent.

With a few exceptions, performance on the delayed feedback task was inferior to that on the baseline, as expected. It is not apparent that filters made much use of the unclassified data in the delayed feedback task; the individual participant reports in the TREC proceedings should clarify this. The active learning task presents a significant challenge.

A number of new techniques were brought to bear in TREC 2006, including several machine-learning techniques (which, other than the standard naïve Bayes and its derivatives, were conspicuously absent from TREC 2005). Arguably the best-performing filter, OSBF-Lua, is open-source software [3].

Comparison between TREC 2005 and TREC 2006 results indicates that:

  1. The best (and median) filter performance has improved over last year.

  2. The new corpora are no ‘harder’ than the old ones; spammers have not defeated content-based filters.

  3. Challenges remain in exploiting unclassified data for spam filtering, within the framework of the delayed feedback and active learning tasks.

The spam track will continue in TREC 2007 [4].

Acknowledgements

The author thanks Stefan Büttcher and Quang-Anh Tran for their invaluable contributions to this effort.

Bibliography

[1] Cormack, G. TREC 2005 Spam Track Overview. In Proceedings of TREC 2005 (Gaithersburg, MD, 2005).
