Ending spam


John Graham-Cumming

The POPFile Project
Editor: Helen Martin


John Graham-Cumming reviews: Ending spam by Jonathan A Zdziarski

See Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification on Amazon

Title. Ending Spam

Author. Jonathan A. Zdziarski

Publisher. No Starch Press

ISBN. 1593270526

Ever since Paul Graham posted his renowned ‘A Plan for Spam’ web page, the web has been the publishing medium of choice for the hackers behind the annual Spam Conference at MIT.

Jonathan Zdziarski has done an adequate job of summarizing this collective web wisdom in his book Ending Spam. The book covers all the major thoughts of the open source Bayesian spam-filtering community, but is marred by the author’s strong biases and missing explanations. Despite those problems the book is accessible to any reader with a computer science background and is essential reading for anyone wanting to understand Bayesian spam filtering.

The book opens with a redundant chapter recounting the history of spam from 1978 through 2005 and is followed by the oddly titled Chapter 2, ‘Historical approaches to fighting spam’, which describes an almost random collection of old and new spam fighting techniques, yet omits others. Techniques such as greylisting and fuzzy hashes (e.g. DCC) are not mentioned. The omission of fuzzy hashing is odd because the chapter includes a discussion of ‘collaborative filtering’.

Chapter 3 provides an overview of a statistical filter’s building blocks and introduces terminology that the author has popularized through his dspam project. There are two big disappointments here: first, there is no explanation of Bayes Theorem (just a couple of paragraphs that give a general description), and second, the section on ‘understanding accuracy’ promotes the use of a single ‘accuracy’ percentage as a way of comparing spam filters. It’s a pity that the author provides no discussion of false positives and false negatives, nor does he point out that users care much more about false positives than false negatives and that a single percentage accuracy figure can disguise a false positive problem.

It is also in Chapter 3 that the author’s open source axe to grind becomes obvious with the bizarre claim that ‘Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers will no longer need annual contracts for nightly updates [of rule sets] or as many software upgrades, which certainly puts them in a precarious financial position’. That’s probably news to the folks at Proofpoint (amongst others).

Chapter 4 describes in detail the operation of a statistical spam filter with a clearly worked example. In addition, the chapter explains the various mathematical techniques used in a number of filters (starting with Paul Graham’s original proposal and going through to the Inverse Chi-square test proposed by Gary Robinson).

Chapter 5 points out that messages need to be decoded into a readable form for a statistical filter to work. It brushes very lightly over quoted-printable and base 64 encoding without describing how they work, and talks about some HTML encodings used by spammers to disguise messages. There’s also a small, odd section entitled ‘Message actualization’ that reads like an implementation detail of dspam.

Chapter 6 talks about message tokenization with an interesting discussion of what constitutes a word and how, for example, words in the subject line of an email are treated differently from the same words appearing in the body. The inadequate section on ‘internationalization’ reveals the author’s anglophone-centric world view with the statement: ‘The issue of foreign languages will eventually require a solution’ – I suggest ignoring this bit.

Chapter 7 describes the tricks that spammers use to attempt to subvert spam filters. There’s an excellent discussion of why these tricks don’t work and the author busts through a few myths about statistical spam filtering with clear explanations and examples of actual spammer tricks.

Chapters 8 and 9 could have been omitted. Chapter 8 describes a number of database solutions and their relative merits with respect to spam filtering; chapter 9 outlines some of the issues that a spam filter author faces when their filter is used in a large organization.

The chapters in Part III are the most lucid in the book. They draw heavily on the author’s previous writing and cover spam filter testing (Chapter 10), tokenization methods other than ‘split the message into words’ (Chapter 11), removing useless features from a message to improve accuracy (Chapter 13) and some examples of how Bayesian spam filters can collaborate (Chapter 14). Chapter 12 provides an interesting look at a non-Bayesian spam filtering technique using Hidden Markov models.

An appendix highlights five spam fighters: POPFile (for which I was interviewed), SpamProbe, TarProxy, dspam and CRM114.

Overall this is a book worth buying. If you want to know how Bayesian spam filters work then open the book at Chapter 3; if you already know how they work then jump straight to Chapter 10.

Know of a useful infosecurity book? Why not tell us about it so we can let others know - email: editor@virusbtn.com.

View this book on Amazon



Latest articles:

VB99 paper: Giving the EICAR test file some teeth

There are situations that warrant the use of live viruses. There are also situations where the use of live viruses is unwarranted. Specifically, live viruses should not be used when safer and equally effective methods can be used to obtain the…

Powering the distribution of Tesla stealer with PowerShell and VBA macros

Since their return more than four years ago, Office macros have been one of the most common ways to spread malware. In this paper, Aditya K Sood and Rohit Bansal analyse a campaign in which VBA macros are used to execute PowerShell code, which in…

VB2017 paper: Android reverse engineering tools: not the usual suspects

In the Android security field, all reverse engineers will probably have used some of the most well-known analysis tools such as apktool, smali, baksmali, dex2jar, etc. These tools are indeed must‑haves for Android application analysis. However, there…

VB2017 paper: Exploring the virtual worlds of advergaming

As adverts in gaming (‘advergaming’) ecosystems continue to become more sophisticated, so the potential complications grow for parents, children and gamers, who just want to play without having to worry about where their data is going (and how it is…

Distinguishing between malicious app collusion and benign app collaboration: a machine-learning approach

Two or more mobile apps, viewed independently, may not appear to be malicious - but in combination, they could become harmful by exchanging information with one another and by performing malicious activities together. In this paper we look at how…

Bulletin Archive