Ending spam

2005-10-01

John Graham-Cumming

The POPFile Project
Editor: Helen Martin

Abstract

John Graham-Cumming reviews: Ending spam by Jonathan A Zdziarski


See Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification on Amazon

Title. Ending Spam

Author. Jonathan A. Zdziarski

Publisher. No Starch Press

ISBN. 1593270526

Ever since Paul Graham posted his renowned ‘A Plan for Spam’ web page, the web has been the publishing medium of choice for the hackers behind the annual Spam Conference at MIT.

Jonathan Zdziarski has done an adequate job of summarizing this collective web wisdom in his book Ending Spam. The book covers all the major thoughts of the open source Bayesian spam-filtering community, but is marred by the author’s strong biases and missing explanations. Despite those problems the book is accessible to any reader with a computer science background and is essential reading for anyone wanting to understand Bayesian spam filtering.

The book opens with a redundant chapter recounting the history of spam from 1978 through 2005, followed by the oddly titled Chapter 2, ‘Historical approaches to fighting spam’, which describes an almost random collection of old and new spam-fighting techniques, yet omits others. Techniques such as greylisting and fuzzy hashes (e.g. DCC) are not mentioned. The omission of fuzzy hashing is odd because the chapter includes a discussion of ‘collaborative filtering’.

Chapter 3 provides an overview of a statistical filter’s building blocks and introduces terminology that the author has popularized through his dspam project. There are two big disappointments here: first, there is no explanation of Bayes’ theorem (just a couple of paragraphs that give a general description), and second, the section on ‘understanding accuracy’ promotes the use of a single ‘accuracy’ percentage as a way of comparing spam filters. It’s a pity that the author neither discusses false positives and false negatives nor points out that users care much more about false positives than false negatives, and that a single percentage accuracy figure can disguise a false positive problem.
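To illustrate how a single accuracy figure can hide a false positive problem, here is a minimal Python sketch. The message counts are invented for the example and are not taken from the book or from any real filter; the function name `rates` is also hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Return (accuracy, false positive rate, false negative rate).

    tp: spam correctly caught       fp: legitimate mail wrongly flagged
    tn: legitimate mail delivered   fn: spam that slipped through
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    fp_rate = fp / (fp + tn)  # share of legitimate mail lost
    fn_rate = fn / (fn + tp)  # share of spam missed
    return accuracy, fp_rate, fn_rate


# Hypothetical test corpus: 500 spam and 500 legitimate messages.
filter_a = rates(tp=490, fp=0, tn=500, fn=10)   # misses some spam
filter_b = rates(tp=500, fp=10, tn=490, fn=0)   # loses some real mail

for name, (acc, fpr, fnr) in (("A", filter_a), ("B", filter_b)):
    print(f"filter {name}: accuracy={acc:.1%} "
          f"false positives={fpr:.1%} false negatives={fnr:.1%}")
```

Both hypothetical filters score the same accuracy, yet only filter B discards legitimate mail, which is the failure users tend to mind most.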

It is also in Chapter 3 that the author’s open source axe to grind becomes obvious with the bizarre claim that ‘Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers will no longer need annual contracts for nightly updates [of rule sets] or as many software upgrades, which certainly puts them in a precarious financial position’. That’s probably news to the folks at Proofpoint (amongst others).

Chapter 4 describes in detail the operation of a statistical spam filter with a clearly worked example. In addition, the chapter explains the various mathematical techniques used in a number of filters (starting with Paul Graham’s original proposal and going through to the Inverse Chi-square test proposed by Gary Robinson).
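As a rough illustration of the starting point mentioned above, the following Python sketch implements the combining rule described in Paul Graham’s ‘A Plan for Spam’: per-token spam probabilities are multiplied together and normalized against the product of their complements. It is not code from the book; the function name, the sample probabilities and the exact clamping bounds used here are assumptions made for the example.

```python
from math import prod


def graham_combine(token_probs):
    """Combine per-token spam probabilities into one message score.

    Each probability is clamped to [0.01, 0.99] (an assumption of this
    sketch) so that no single token can force the result to exactly
    0 or 1, and so the denominator can never be zero.
    """
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    spam = prod(clamped)                    # evidence for spam
    ham = prod(1.0 - p for p in clamped)    # evidence for legitimate mail
    return spam / (spam + ham)


# Hypothetical per-token probabilities for two messages.
spammy = graham_combine([0.99, 0.95, 0.90, 0.20])
hammy = graham_combine([0.05, 0.10, 0.60, 0.02])
print(f"spammy message score: {spammy:.4f}")
print(f"hammy message score:  {hammy:.4f}")
```

Scores near 1 indicate spam and scores near 0 indicate legitimate mail; later techniques such as Robinson’s chi-square approach combine the same per-token evidence differently.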

Chapter 5 points out that messages need to be decoded into a readable form for a statistical filter to work. It brushes very lightly over quoted-printable and base64 encoding without describing how they work, and talks about some HTML encodings used by spammers to disguise messages. There’s also a small, odd section entitled ‘Message actualization’ that reads like an implementation detail of dspam.
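For readers who want to see what the decoding step involves, here is a small Python sketch using the standard library’s `quopri` and `base64` modules. The sample strings are invented for the example and the approach is not taken from the book: quoted-printable writes an awkward byte as `=` followed by two hex digits (and uses a trailing `=` as a soft line break), while base64 maps every three bytes onto four printable characters.

```python
import base64
import quopri

# Quoted-printable: "=C3=A9" is the UTF-8 encoding of "é", "=3D" is "=",
# and "=\n" is a soft line break that disappears on decoding.
qp_body = b"Caf=C3=A9 prices =3D ch=\neap"
qp_text = quopri.decodestring(qp_body).decode("utf-8")

# Base64: the same kind of text is unreadable until it is decoded.
b64_body = base64.b64encode("Buy now".encode("utf-8"))
b64_text = base64.b64decode(b64_body).decode("utf-8")

print(qp_text)
print(b64_body, "->", b64_text)
```

A filter that tokenized the raw encoded bodies would see fragments such as `ch=` instead of the words the recipient actually reads.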

Chapter 6 talks about message tokenization with an interesting discussion of what constitutes a word and how, for example, words in the subject line of an email are treated differently from the same words appearing in the body. The inadequate section on ‘internationalization’ reveals the author’s anglophone-centric world view with the statement: ‘The issue of foreign languages will eventually require a solution’ – I suggest ignoring this bit.
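A minimal Python sketch of the idea that the same word can count as a different token depending on where it appears. The token pattern, the `Subject*` prefix and the function name are assumptions chosen for illustration; they are not necessarily what the book or dspam uses.

```python
import re

# Assumed definition of a "word": runs of letters, digits and a few
# characters that often carry meaning in spam ($, !, ', -).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(subject, body):
    """Split a message into tokens, marking those found in the subject."""
    tokens = ["Subject*" + w.lower() for w in TOKEN_RE.findall(subject)]
    tokens += [w.lower() for w in TOKEN_RE.findall(body)]
    return tokens


tokens = tokenize("Free offer", "Lunch is free today")
print(tokens)
```

Because `Subject*free` and `free` are distinct tokens, the filter can learn separate statistics for each.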

Chapter 7 describes the tricks that spammers use to attempt to subvert spam filters. There’s an excellent discussion of why these tricks don’t work and the author busts through a few myths about statistical spam filtering with clear explanations and examples of actual spammer tricks.

Chapters 8 and 9 could have been omitted. Chapter 8 describes a number of database solutions and their relative merits with respect to spam filtering; Chapter 9 outlines some of the issues that a spam filter author faces when their filter is used in a large organization.

The chapters in Part III are the most lucid in the book. They draw heavily on the author’s previous writing and cover spam filter testing (Chapter 10), tokenization methods other than ‘split the message into words’ (Chapter 11), removing useless features from a message to improve accuracy (Chapter 13) and some examples of how Bayesian spam filters can collaborate (Chapter 14). Chapter 12 provides an interesting look at a non-Bayesian spam filtering technique using hidden Markov models.

An appendix highlights five spam fighters: POPFile (for which I was interviewed), SpamProbe, TarProxy, dspam and CRM114.

Overall this is a book worth buying. If you want to know how Bayesian spam filters work then open the book at Chapter 3; if you already know how they work then jump straight to Chapter 10.

