Ending spam

2005-10-01

John Graham-Cumming

The POPFile Project
Editor: Helen Martin

Abstract

John Graham-Cumming reviews: Ending spam by Jonathan A Zdziarski


See Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification on Amazon

Title. Ending Spam

Author. Jonathan A. Zdziarski

Publisher. No Starch Press

ISBN. 1593270526

Ever since Paul Graham posted his renowned ‘A Plan for Spam’ web page, the web has been the publishing medium of choice for the hackers behind the annual Spam Conference at MIT.

Jonathan Zdziarski has done an adequate job of summarizing this collective web wisdom in his book Ending Spam. The book covers all the major thoughts of the open source Bayesian spam-filtering community, but is marred by the author’s strong biases and missing explanations. Despite those problems the book is accessible to any reader with a computer science background and is essential reading for anyone wanting to understand Bayesian spam filtering.

The book opens with a redundant chapter recounting the history of spam from 1978 through 2005 and is followed by the oddly titled Chapter 2, ‘Historical approaches to fighting spam’, which describes an almost random collection of old and new spam fighting techniques, yet omits others. Techniques such as greylisting and fuzzy hashes (e.g. DCC) are not mentioned. The omission of fuzzy hashing is odd because the chapter includes a discussion of ‘collaborative filtering’.

Chapter 3 provides an overview of a statistical filter’s building blocks and introduces terminology that the author has popularized through his dspam project. There are two big disappointments here: first, there is no explanation of Bayes Theorem (just a couple of paragraphs that give a general description), and second, the section on ‘understanding accuracy’ promotes the use of a single ‘accuracy’ percentage as a way of comparing spam filters. It’s a pity that the author provides no discussion of false positives and false negatives, nor does he point out that users care much more about false positives than false negatives and that a single percentage accuracy figure can disguise a false positive problem.

It is also in Chapter 3 that the author’s open source axe to grind becomes obvious with the bizarre claim that ‘Most manufacturers are a bit concerned with the idea of deploying a box that learns on its own. Their customers will no longer need annual contracts for nightly updates [of rule sets] or as many software upgrades, which certainly puts them in a precarious financial position’. That’s probably news to the folks at Proofpoint (amongst others).

Chapter 4 describes in detail the operation of a statistical spam filter with a clearly worked example. In addition, the chapter explains the various mathematical techniques used in a number of filters (starting with Paul Graham’s original proposal and going through to the Inverse Chi-square test proposed by Gary Robinson).

Chapter 5 points out that messages need to be decoded into a readable form for a statistical filter to work. It brushes very lightly over quoted-printable and base 64 encoding without describing how they work, and talks about some HTML encodings used by spammers to disguise messages. There’s also a small, odd section entitled ‘Message actualization’ that reads like an implementation detail of dspam.

Chapter 6 talks about message tokenization with an interesting discussion of what constitutes a word and how, for example, words in the subject line of an email are treated differently from the same words appearing in the body. The inadequate section on ‘internationalization’ reveals the author’s anglophone-centric world view with the statement: ‘The issue of foreign languages will eventually require a solution’ – I suggest ignoring this bit.

Chapter 7 describes the tricks that spammers use to attempt to subvert spam filters. There’s an excellent discussion of why these tricks don’t work and the author busts through a few myths about statistical spam filtering with clear explanations and examples of actual spammer tricks.

Chapters 8 and 9 could have been omitted. Chapter 8 describes a number of database solutions and their relative merits with respect to spam filtering; chapter 9 outlines some of the issues that a spam filter author faces when their filter is used in a large organization.

The chapters in Part III are the most lucid in the book. They draw heavily on the author’s previous writing and cover spam filter testing (Chapter 10), tokenization methods other than ‘split the message into words’ (Chapter 11), removing useless features from a message to improve accuracy (Chapter 13) and some examples of how Bayesian spam filters can collaborate (Chapter 14). Chapter 12 provides an interesting look at a non-Bayesian spam filtering technique using Hidden Markov models.

An appendix highlights five spam fighters: POPFile (for which I was interviewed), SpamProbe, TarProxy, dspam and CRM114.

Overall this is a book worth buying. If you want to know how Bayesian spam filters work then open the book at Chapter 3; if you already know how they work then jump straight to Chapter 10.

Know of a useful infosecurity book? Why not tell us about it so we can let others know - email: editor@virusbtn.com.

View this book on Amazon

twitter.png
fb.png
linkedin.png
hackernews.png
reddit.png

 

Latest articles:

VB2018 paper: Uncovering the wholesale industry of social media fraud: from botnets to bulk reseller panels

In this paper GoSecure researchers Masarah Paquet-Clouston and Olivier Bilodeau explore an undocumented segment of the social media fraud (SMF) industry: wholesaling, from botnet supply operations to bulk reselling.

VB2018 paper: Now you see it, now you don't: wipers in the wild

There has recently been a trend of APT campaigns including a 'wiper' functionality to destroy data, either as a means to remove evidence or as its core purpose. This paper examines three different classifications of wipers through examples of various…

VB2018 paper: Who wasn’t responsible for Olympic Destroyer

Paul Rascagnères & Warren Mercer present the malware that they have identified – with moderate confidence – as having been used in the attack against the 2018 Winter Olympic Games. They describe the malware’s propagation techniques and its…

VB2018 paper: From drive-by download to drive-by mining: understanding the new paradigm

Jérôme Segura discusses the rise of drive-by cryptocurrency mining, explaining how it works and putting it in the broader context of changes in the cybercrime landscape.

The dark side of WebAssembly

The WebAssembly (Wasm) format rose to prominence recently when it was used for cryptocurrency mining in browsers. This opened a Pandora’s box of potential malicious uses of Wasm. In this paper Aishwarya Lonkar & Siddhesh Chandrayan walk through some…


Bulletin Archive

We have placed cookies on your device in order to improve the functionality of this site, as outlined in our cookies policy. However, you may delete and block all cookies from this site and your use of the site will be unaffected. By continuing to browse this site, you are agreeing to Virus Bulletin's use of data as outlined in our privacy policy.