MIT Spam Conference 2007

2007-05-01

John Graham-Cumming

Independent author, France

Editor: Helen Martin

Abstract

John Graham-Cumming provides a roundup of the papers and presentations at the fifth MIT Spam Conference.

Table of contents

Introduction
Schedule
A spam challenge, blog spam and search engine optimization
Spam detection by header/envelope information
IP reputations and trusted remailers
Modified neurons and support vector machine filters
Tarpitting and SMTP slowing
Image spam
Spamlet

Introduction

On 30 March, the fifth spam conference took place at the Massachusetts Institute of Technology (MIT) in Cambridge, MA, USA. Although popularly known as the ‘MIT Spam Conference’, the 2007 event broadened its focus to include spam, phishing and ‘other cybercrimes’.

Schedule

A total of 14 talks were scheduled for the one-day event. The conference grouped the talks into four tracks: invited talks (which covered blog, search engine and email spam), ‘Considering the source’ (with three talks covering SPF enhancements, reputation services and email header munging), ‘Working the text’ (which covered ground often seen at this conference: machine learning and text classification approaches to spam filtering), and ‘Thinking outside the text box’ (which perhaps should simply have been called ‘Miscellanea’, since it covered tarpitting, image spam detection and AI for responding to spammers).

As in 2006, all the talks and related papers are available for download as a disk image (in the form of an ISO file) from the conference website (http://spamconference.org). Unlike the 2006 conference, there was no real-time web cast of the event this time, but YouTube videos of each talk are linked from the conference website. Unfortunately, the excellent quality web cast has been replaced by very poor quality video (some without any sound at all), which makes watching the conference very trying. However, the organizer has promised that higher quality videos will be available soon.

Attendance at the conference also dropped this year, with estimates of the number of delegates ranging from 40 to 75; a far cry from the hundreds that overflowed the room back at the first conference in 2003. However, the drop in numbers is not surprising given that there are now at least two other technical conferences also covering spam: the Conference on Email and Spam (CEAS, http://ceas.cc/) and the Virus Bulletin conference (VB2007, http://www.virusbtn.com/conference/).

Nevertheless, there were some good presentations, and the broadening of the agenda gave some fresh faces a chance to present topics that have traditionally been absent from this conference’s agenda.

A spam challenge, blog spam and search engine optimization

First up were Richard Segal from IBM Research and Gordon Cormack from the University of Waterloo. Richard talked about the upcoming ‘Live Spam Challenge’ that will take place as part of CEAS 2007 in August. The challenge will pit filters against each other over a 24-hour period on live spam and ham. Messages will be provided with full envelope information so that almost all spam-filtering technologies can be tested. As well as live ham and spam, the system will also provide simulated user feedback throughout the day so that filters can learn from the judgements of human recipients.

Next up was one of the most interesting talks of the day (although not the winner of the can-of-spam Best Paper award): Jessica Baumgarten talking about the different types of blog spam. Unfortunately, this presentation is not available from the conference website so you’ll have to make do with the YouTube video. [An article by Jessica Baumgarten on blog spam is also scheduled for the June issue of Virus Bulletin - Ed.]

After lunch, Aaron Emigh of Six Apart gave an unscheduled talk on the same subject (with assistance from Adam Thomason – this talk is available from the website), detailing some of the ways in which Six Apart deals with blog spam and showing that, once again, machine-learning filters like CRM-114 and DSPAM do a good job against this particular type of spam.

Last up before the break was Amanda Watlington of Searching For Profit, who gave an enlightening talk about the history and state of Search Engine Optimization (SEO) and Search Marketing.

Spam detection by header/envelope information

After a quick coffee and doughnut break the conference continued with Alberto Trevino and J.J. Ekstrom of Brigham Young University. Alberto talked about detecting spam solely by looking at header and envelope information for forged details. Just looking at HELO information they achieved a 61.8% spam detection rate with 0.33% false positives.

Looking at the validity of MAIL FROM achieved a 79.3% spam detection rate with 0.53% false positives. Combining the results gave a spam detection rate of 91.7% with 0.87% false positives. That’s not as good as some machine-learning spam filter authors claim (or as test results from the TREC Spam Track show), but this technique has the important advantage that it is independent of language, obfuscation, use of images, or any other content technique spammers try to use to get around a spam filter.
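To make the two figures concrete, here is a minimal Python sketch of how a detection rate and a false positive rate might be measured for each test and for their combination. The talk did not say how the results were combined, so the OR rule used here is an assumption, and the field names (`helo_forged`, `mail_from_invalid`) and the toy messages are purely hypothetical.

```python
def rates(messages, test):
    """Return (detection rate, false positive rate) of `test` on labelled mail."""
    spam = [m for m in messages if m["is_spam"]]
    ham = [m for m in messages if not m["is_spam"]]
    detection = sum(test(m) for m in spam) / len(spam)
    false_positive = sum(test(m) for m in ham) / len(ham)
    return detection, false_positive


def helo_test(message):
    # Hypothetical flag: the HELO string looked forged.
    return message["helo_forged"]


def mail_from_test(message):
    # Hypothetical flag: the MAIL FROM address failed validation.
    return message["mail_from_invalid"]


def combined_test(message):
    # Assumed combination rule: flag the message if either test fires.
    return helo_test(message) or mail_from_test(message)


def make(is_spam, helo_forged, mail_from_invalid):
    return {
        "is_spam": is_spam,
        "helo_forged": helo_forged,
        "mail_from_invalid": mail_from_invalid,
    }


# Toy corpus: four spam and four ham messages, invented for illustration.
MESSAGES = [
    make(True, True, False),
    make(True, False, True),
    make(True, True, True),
    make(True, False, False),
    make(False, False, False),
    make(False, False, False),
    make(False, False, True),
    make(False, False, False),
]

if __name__ == "__main__":
    for name, test in [("HELO", helo_test),
                       ("MAIL FROM", mail_from_test),
                       ("combined", combined_test)]:
        detection, false_positive = rates(MESSAGES, test)
        print(f"{name}: detection={detection:.2f}, false positives={false_positive:.2f}")
```

Under an OR rule the combined test catches any spam that either test catches, but it also inherits the false positives of both, which is the same pattern as the reported figures.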

IP reputations and trusted remailers

Next, Alberto Mujica (whose company Reputation Technologies was one of the sponsors of the event) gave a talk that described Reputation Technologies’ service offering. In his talk he outlined the advantages of IP address reputation management.

Last up before lunch were Joseph McIsaac and Alex Pogrebnyak of Reflexion Networks talking about an enhancement to SPF that they term the ‘Trusted Remailer’ record. This record would allow a domain to publish the addresses of remailers that they trust; if the mailer is present in the record, the mail can be accepted despite the fact that the standard SPF lookup would indicate that the remailing domain was not permitted to send for a specific domain.

Modified neurons and support vector machine filters

After lunch, Alexandru Catalin Cosoi from BitDefender talked about combining the output of different spam filters using a modified neuron (a single perceptron) to incorporate the output of each spam filter and measure the relevance of the filter’s output (the relevance can decay over time as spammers update their spam to avoid certain filter techniques). Alexandru claimed that by combining filters and using the relevance for each filter calculated by the neuron they saw an increase in spam detection accuracy and a decrease in false positives of greater than 50%. He did not, however, present any test data against any standard spam/ham set.
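The general idea of a single neuron that weights each filter’s verdict can be sketched in a few lines of Python. This is not BitDefender’s implementation: the class name, the learning rate, the multiplicative decay factor and the encoding of filter outputs as +1 (spam) or -1 (ham) are all assumptions made for illustration.

```python
class FilterCombiner:
    """A single perceptron whose weights act as per-filter relevance scores."""

    def __init__(self, n_filters, learning_rate=0.1, decay=0.99):
        self.weights = [0.0] * n_filters
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.decay = decay  # assumed: relevance fades a little on every update

    def score(self, outputs):
        return self.bias + sum(w * o for w, o in zip(self.weights, outputs))

    def predict(self, outputs):
        """Return 1 for spam, 0 for ham."""
        return 1 if self.score(outputs) > 0 else 0

    def update(self, outputs, label):
        # Let old relevance decay, then apply the classic perceptron rule.
        self.weights = [w * self.decay for w in self.weights]
        error = label - self.predict(outputs)
        if error:
            self.weights = [
                w + self.learning_rate * error * o
                for w, o in zip(self.weights, outputs)
            ]
            self.bias += self.learning_rate * error


def train_demo():
    # Filter 0 always agrees with the true label; filter 1 is uninformative.
    samples = [([1, 1], 1), ([-1, 1], 0), ([1, -1], 1), ([-1, -1], 0)]
    combiner = FilterCombiner(n_filters=2)
    for _ in range(5):
        for outputs, label in samples:
            combiner.update(outputs, label)
    return combiner


if __name__ == "__main__":
    print(train_demo().weights)
```

In this toy run the reliable filter ends up with a much larger weight than the uninformative one, which is the sense in which the learned weights can be read as relevance.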

Next up, Ángela Blanco and Manuel Martín-Merino from the Universidad Pontificia de Salamanca talked about methods of combining Support Vector Machine (SVM) spam filters to improve accuracy. They tested a variety of techniques using a corpus of around 5,000 messages; their best result was a spam detection rate of 89.9% with a false positive rate of 1.8%. Although they showed that their technique reduced false positives significantly, it was a pity that they did not produce a comparison with simple machine-learning techniques (such as Naïve Bayes or logistic regression) on the same data set, as the figures presented do not appear to represent an advance in the state of the art.

Tarpitting and SMTP slowing

More coffee, more doughnuts and it was time for Tobias Eggendorfer from the Universität der Bundeswehr München to talk about the latest news from his SMTP tarpitting experiments. He pointed out that many of the bulk mailers have become aware of tarpitting and thus are detecting deliberate slowness and dropping connections: hence it was time to update and to use the spammers’ awareness of tarpitting against them.

Tobias’s basic idea is to stutter (very slowly deliver the first few bytes of a connection) and then open up the connection for full speed. To make this transparent and compatible with existing SMTP servers his implementation is a network layer 2 bridge that can achieve connection control without affecting the contents of the IP or TCP header. The stuttering will cause a spammer to drop the connection (because they think they are in a tarpit), but will not affect a legitimate sender because they’ll quickly get a full speed connection once the stuttered portion is over.

By delaying each byte of the first 120 bytes of the SMTP connection transparently through the bridge by one second per byte, the total spam delivered to his test server dropped by 76.7%. In tests against real mail servers handling large amounts of ham, no false positives were observed.
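The arithmetic behind the stutter is simple: 120 bytes at one second per byte means a sender must tolerate about two minutes of slowness before the connection opens up. The following Python sketch models only that timing, not the layer 2 bridge itself; the function names and the sender patience values are hypothetical.

```python
def stutter_schedule(total_bytes, stutter_bytes=120, delay_per_byte=1.0):
    """Yield the artificial delay (in seconds) applied to each byte in turn."""
    for index in range(total_bytes):
        yield delay_per_byte if index < stutter_bytes else 0.0


def total_delay(total_bytes, stutter_bytes=120, delay_per_byte=1.0):
    """Total artificial delay added to a transfer of `total_bytes` bytes."""
    return sum(stutter_schedule(total_bytes, stutter_bytes, delay_per_byte))


def sender_completes(patience_seconds, total_bytes):
    """True if a sender willing to wait `patience_seconds` outlasts the stutter."""
    return total_delay(total_bytes) <= patience_seconds


if __name__ == "__main__":
    print(total_delay(500))            # delay on a 500-byte exchange
    print(sender_completes(30, 500))   # an impatient bulk mailer (assumed 30 s)
    print(sender_completes(300, 500))  # a patient legitimate server (assumed 300 s)
```

Once the first 120 bytes have passed, the schedule adds no further delay, which is why a patient sender pays a fixed, one-off cost.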

This paper would have been my pick for the best paper of the conference, but the speaker after Tobias actually won the award for another sort of connection-shaping presentation.

Ken Simpson from MailChannels talked about his company’s product (full disclosure: I am a member of MailChannels’ technical advisory board). One nice chart from his presentation showed how connections drop off as they are slowed down: spammers drop off rapidly if they can’t get a fast connection, whereas legitimate senders will hang around for minutes to get their messages delivered. He claimed that the MailChannels product (whose architecture he went on to describe) drops 80–90% of spam by slowing down SMTP connections, and that the product is able to handle the incredibly high load placed on it by spammers without affecting normal email delivery. Although this was a vendor presentation, the associated paper, which describes the actual implementation (using Perl), is well worth reading if you are technically inclined.

Image spam

Next up, Giorgio Fumera, Ignazio Pillai, Fabio Roli and Battista Biggio from the University of Cagliari described work they are doing on detection of image spam by looking at the obfuscation techniques used by spammers trying to avoid OCR. This is exactly analogous to work done on text classification of spam, where looking at the obfuscations used by spammers is often enough to detect spam without bothering with the actual text within the message. They showed that by calculating the perimetric complexity (a measure of the complexity of a black and white image defined as the square of the length of the boundary between black and white pixels divided by the total area of black), they could detect obscured spam images.
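The perimetric complexity measure as defined above is easy to compute on a small binary image. The Python sketch below follows that definition directly; the grid representation (lists of 0 for white and 1 for black), the use of 4-connected pixel edges to measure the boundary, and the choice to treat everything outside the image as white are assumptions, and the authors’ own implementation may differ.

```python
def perimetric_complexity(image):
    """Squared black/white boundary length divided by the black area.

    `image` is a list of rows, each a list of 0 (white) or 1 (black).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    area = sum(sum(row) for row in image)
    if area == 0:
        return 0.0

    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            # Count each edge of a black pixel that faces white or the border.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if not inside or image[nr][nc] == 0:
                    perimeter += 1
    return perimeter ** 2 / area


# A solid 4x4 block of black versus the same amount of black as speckle.
SOLID = [[1] * 4 for _ in range(4)]
SPECKLE = [[(r + c) % 2 for c in range(8)] for r in range(4)]

if __name__ == "__main__":
    print(perimetric_complexity(SOLID))    # 16.0
    print(perimetric_complexity(SPECKLE))  # 256.0
```

Both toy images contain 16 black pixels, but the speckled one scores 16 times higher, which illustrates why noise added to defeat OCR tends to push this measure up.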

They also mentioned that these techniques (both theirs and OCR) were often unnecessary because standard text-classifier based spam filters often have enough other text to work with (ironically, such as the random text that spammers insert to fool filters) without considering the image. They also pointed out that for non-obscured images, OCR-ing plus text classification currently works well.

Missing from the paper and presentation was any evaluation of these techniques in the real world. They showed a number of interesting examples, but without a test against a stream of real spams and hams (both with images) it’s hard to know whether these techniques would work in reality.

Spamlet

Lastly, Kenneth P. Dallmeyer, Peter C. Nelson, Elias D. Block and Brandon R. Elvidge from the University of Illinois at Chicago talked about their Spamlet system, which is designed to engage spammers of all types in useless conversations (such as keeping a 419 scammer emailing back and forth) to use up their resources.

Missing from the conference, but scheduled, was Nouman Azam from EME College Rawalpindi in Pakistan talking about reducing the number of features needed by a classifier and comparing term frequency, mutual information and latent semantic indexing on the Ling Spam corpus. In his paper (which is available in the downloadable conference proceedings) he determined that mutual information feature space reduction gave the best accuracy with up to 20 features.
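For readers unfamiliar with mutual information as a feature selection criterion, the following Python sketch ranks terms by the mutual information between a term’s presence in a message and the message’s class, then keeps the top k. It is a generic illustration, not the method from the paper: the binary presence/absence model, the function names and the four-message toy corpus are all assumptions.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between term presence and class label.

    `docs` is a list of sets of tokens; `labels` is a parallel list of classes.
    """
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter()
    label_counts = Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        label_counts[label] += count

    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_term * p_label))
    return mi


def top_features(docs, labels, k=20):
    """Return the k terms with the highest mutual information (ties by name)."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-mutual_information(docs, labels, t), t),
    )
    return ranked[:k]


# Toy corpus, invented for illustration.
DOCS = [
    {"viagra", "free"},
    {"viagra", "offer"},
    {"meeting", "free"},
    {"meeting", "notes"},
]
LABELS = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    print(top_features(DOCS, LABELS, k=2))
```

In the toy corpus, ‘viagra’ and ‘meeting’ each perfectly predict the class and score one bit, while ‘free’ appears equally in both classes and scores zero, so a small feature budget is spent on the informative terms.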

In all, despite diminished attendance figures, the conference provided excellent material and some interesting perspectives on issues relating to spam, and I would recommend checking out the proceedings at http://spamconference.org/.

[John Graham-Cumming will present a paper looking at past and future trends in spammer trickery, and outlining a proposed naming scheme for spammers’ tricks at VB2007 in Vienna, Austria, 19–21 September. For more information and online registration see http://www.virusbtn.com/conference/.]
