Ending spam


John Graham-Cumming

The POPFile Project
Editor: Helen Martin


John Graham-Cumming reviews: Ending spam by Jonathan A Zdziarski


Title. Ending Spam

Author. Jonathan A. Zdziarski

Publisher. No Starch Press

ISBN. 1593270526

Ever since Paul Graham posted his renowned ‘A Plan for Spam’ web page, the web has been the publishing medium of choice for the hackers behind the annual Spam Conference at MIT.

Jonathan Zdziarski has done an adequate job of summarizing this collective web wisdom in his book Ending Spam. The book covers all the major ideas of the open source Bayesian spam-filtering community, but is marred by the author’s strong biases and by missing explanations. Despite those problems, the book is accessible to any reader with a computer science background and is essential reading for anyone wanting to understand Bayesian spam filtering.

The book opens with a redundant chapter recounting the history of spam from 1978 through 2005. It is followed by the oddly titled Chapter 2, ‘Historical approaches to fighting spam’, which describes an almost random collection of old and new spam-fighting techniques, yet omits others: greylisting and fuzzy hashes (e.g. DCC) are not mentioned. The omission of fuzzy hashing is odd because the chapter includes a discussion of ‘collaborative filtering’.

Chapter 3 provides an overview of a statistical filter’s building blocks and introduces terminology that the author has popularized through his dspam project. There are two big disappointments here. First, there is no explanation of Bayes’ theorem (just a couple of paragraphs that give a general description). Second, the section on ‘understanding accuracy’ promotes the use of a single ‘accuracy’ percentage as a way of comparing spam filters. It’s a pity that the author neither discusses false positives and false negatives nor points out that users care much more about false positives than false negatives, and that a single accuracy percentage can disguise a false positive problem.
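To make the point about accuracy concrete, here is a minimal sketch in Python. The two filters and all of their message counts are hypothetical, invented purely for illustration; they are not taken from the book or from any real filter.

```python
def rates(spam_caught, spam_missed, ham_passed, ham_blocked):
    """Return (accuracy, false positive rate, false negative rate)."""
    total = spam_caught + spam_missed + ham_passed + ham_blocked
    accuracy = (spam_caught + ham_passed) / total
    # False positive: a legitimate message (ham) wrongly blocked as spam.
    fp_rate = ham_blocked / (ham_passed + ham_blocked)
    # False negative: a spam message wrongly delivered.
    fn_rate = spam_missed / (spam_caught + spam_missed)
    return accuracy, fp_rate, fn_rate


# Hypothetical test corpus: 9,000 spam and 1,000 legitimate messages.
# Filter A misses 100 spams but never blocks legitimate mail.
filter_a = rates(spam_caught=8900, spam_missed=100, ham_passed=1000, ham_blocked=0)
# Filter B catches every spam but blocks 100 legitimate messages.
filter_b = rates(spam_caught=9000, spam_missed=0, ham_passed=900, ham_blocked=100)

for name, (acc, fp, fn) in (("A", filter_a), ("B", filter_b)):
    print(f"Filter {name}: accuracy={acc:.2%} FP rate={fp:.2%} FN rate={fn:.2%}")
```

Both hypothetical filters classify 9,900 of 10,000 messages correctly, so they share the same headline accuracy, yet filter B loses one legitimate message in ten while filter A loses none.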

It is also in Chapter 3 that the author’s open source axe to grind becomes obvious with the bizarre claim that ‘Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers will no longer need annual contracts for nightly updates [of rule sets] or as many software upgrades, which certainly puts them in a precarious financial position’. That’s probably news to the folks at Proofpoint (amongst others).

Chapter 4 describes in detail the operation of a statistical spam filter with a clearly worked example. In addition, the chapter explains the various mathematical techniques used in a number of filters (starting with Paul Graham’s original proposal and going through to the inverse chi-square test proposed by Gary Robinson).
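As a rough illustration of the starting point of that progression, the sketch below shows the way per-token spam probabilities are combined into a single message score in Paul Graham’s ‘A Plan for Spam’. It is not the book’s worked example: the function name `combine` and the three token probabilities are assumptions made up for this illustration, and the later techniques, including Robinson’s, are not shown.

```python
from math import prod


def combine(token_probs):
    """Combine per-token spam probabilities into one message score.

    Uses the product form from 'A Plan for Spam':
        P = (p1 * ... * pn) / (p1 * ... * pn + (1 - p1) * ... * (1 - pn))
    """
    # Clamp each probability to [0.01, 0.99] so that no single token
    # can force the score to exactly 0 or 1 (or cause a zero division).
    clamped = [min(0.99, max(0.01, p)) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


# Hypothetical probabilities for three tokens found in one message:
# two strongly spammy tokens and one that leans legitimate.
message_probs = [0.99, 0.95, 0.20]
score = combine(message_probs)
print(f"combined spam score: {score:.4f}")
```

Two strongly spammy tokens outweigh the single legitimate-leaning one, so the combined score lands close to 1. A real filter would also need to estimate each token’s probability from training counts, which this sketch leaves out.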

Chapter 5 points out that messages need to be decoded into a readable form for a statistical filter to work. It brushes very lightly over quoted-printable and base64 encoding without describing how they work, and talks about some HTML encodings used by spammers to disguise messages. There’s also a small, odd section entitled ‘Message actualization’ that reads like an implementation detail of dspam.
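For readers who want to see what that decoding step involves, here is a minimal sketch using Python’s standard `base64` and `quopri` modules. The helper name `decode_body` and the sample payloads are assumptions for illustration; this is not how dspam or the book implements decoding.

```python
import base64
import quopri


def decode_body(payload: bytes, transfer_encoding: str) -> str:
    """Decode a message body according to its Content-Transfer-Encoding."""
    encoding = transfer_encoding.strip().lower()
    if encoding == "base64":
        # Base64 packs every 3 bytes into 4 printable characters.
        raw = base64.b64decode(payload)
    elif encoding == "quoted-printable":
        # Quoted-printable writes a byte as '=' followed by two hex digits.
        raw = quopri.decodestring(payload)
    else:
        # 7bit, 8bit and binary bodies need no transfer decoding.
        raw = payload
    # Assume UTF-8 text for this sketch; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


print(decode_body(b"QnV5IG5vdw==", "base64"))
print(decode_body(b"Fr=65e offer", "quoted-printable"))
```

Without this step a tokenizer would see only `QnV5IG5vdw==` or `Fr=65e`, neither of which resembles the words a reader of the message sees.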

Chapter 6 talks about message tokenization with an interesting discussion of what constitutes a word and how, for example, words in the subject line of an email are treated differently from the same words appearing in the body. The inadequate section on ‘internationalization’ reveals the author’s anglophone-centric world view with the statement: ‘The issue of foreign languages will eventually require a solution’ – I suggest ignoring this bit.
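One simple way to treat subject-line words differently from body words is to tag them with a prefix, so that the filter keeps separate statistics for each. The sketch below assumes a `Subject*` prefix and an ASCII-only definition of a word; both choices are illustrative assumptions rather than the scheme described in the book.

```python
import re

# Assumed definition of a word: runs of ASCII letters, digits, dollar
# signs, apostrophes and hyphens. Non-English text is not handled here.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(subject: str, body: str) -> list[str]:
    """Split a message into lowercase tokens, marking subject-line words."""
    subject_tokens = ["Subject*" + t.lower() for t in TOKEN_RE.findall(subject)]
    body_tokens = [t.lower() for t in TOKEN_RE.findall(body)]
    return subject_tokens + body_tokens


tokens = tokenize("Free money", "Claim your free prize")
print(tokens)
```

Here `Subject*free` and `free` are distinct tokens, so a word that is damning in a subject line but harmless in a body can be scored accordingly.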

Chapter 7 describes the tricks that spammers use to attempt to subvert spam filters. There’s an excellent discussion of why these tricks don’t work and the author busts through a few myths about statistical spam filtering with clear explanations and examples of actual spammer tricks.

Chapters 8 and 9 could have been omitted. Chapter 8 describes a number of database solutions and their relative merits with respect to spam filtering; Chapter 9 outlines some of the issues that a spam filter author faces when their filter is used in a large organization.

The chapters in Part III are the most lucid in the book. They draw heavily on the author’s previous writing and cover spam filter testing (Chapter 10), tokenization methods other than ‘split the message into words’ (Chapter 11), removing useless features from a message to improve accuracy (Chapter 13) and some examples of how Bayesian spam filters can collaborate (Chapter 14). Chapter 12 provides an interesting look at a non-Bayesian spam filtering technique using Hidden Markov models.

An appendix highlights five spam fighters: POPFile (for which I was interviewed), SpamProbe, TarProxy, dspam and CRM114.

Overall, this is a book worth buying. If you want to know how Bayesian spam filters work, open the book at Chapter 3; if you already know how they work, jump straight to Chapter 10.

