Pseudo-words for spam detection in an unmodified Naive Bayesian Text Classifier

John Graham-Cumming, POPFile Project

The POPFile program has proved highly accurate at spam detection, with a low false positive rate, yet it uses an unmodified Naïve Bayesian Text Classifier with no ‘magic’ values or tweaks. Initially, POPFile performed poorly against spam, but a library of email parsing code and a set of pseudo-words have brought it to over 99.8% accuracy. Pseudo-words are non-words fed into the classifier to indicate particular email features, for example the obfuscation of a spammy word such as Viagra. This paper details every POPFile pseudo-word, explains how each is created from spam and ham messages, and gives empirical data on their importance when scored against a large corpus of spam and ham messages.
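The pseudo-word idea can be illustrated with a minimal Python sketch. It is not POPFile's code (POPFile has its own parser and its own pseudo-word set), and every name in it is an assumption made for illustration: the `WATCHED` list, the `LEET` substitution table, the pseudo-word `trick:obfuscated`, the bucket names, and the choice of add-one smoothing. The point it shows is that the parser emits an extra token when it detects a feature, while the classifier itself stays an ordinary Naive Bayes over tokens.

```python
import math
import re
from collections import Counter

# Hypothetical watch-list and substitution table, for illustration only.
WATCHED = {"viagra"}
LEET = str.maketrans({"1": "i", "!": "i", "@": "a", "0": "o", "$": "s", "3": "e"})


def tokenize(text):
    """Return ordinary words plus a pseudo-word for each detected obfuscation."""
    tokens = []
    for raw in text.lower().split():
        raw = raw.strip(".,;:?\"'()")
        if not raw:
            continue
        # Undo simple character substitutions, then drop separator characters.
        plain = re.sub(r"[^a-z]", "", raw.translate(LEET))
        if plain in WATCHED and raw != plain:
            tokens.append("trick:obfuscated")  # hypothetical pseudo-word
            tokens.append(plain)
        else:
            tokens.append(raw)
    return tokens


class NaiveBayes:
    """Plain multinomial Naive Bayes; it treats pseudo-words like any token."""

    def __init__(self):
        self.word_counts = {}        # bucket -> Counter of tokens
        self.doc_counts = Counter()  # bucket -> number of training messages

    def train(self, bucket, text):
        self.word_counts.setdefault(bucket, Counter()).update(tokenize(text))
        self.doc_counts[bucket] += 1

    def classify(self, text):
        vocab = set().union(*self.word_counts.values())
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            score = math.log(self.doc_counts[bucket] / total_docs)
            # Add-one smoothing is an assumption of this sketch.
            denom = sum(counts.values()) + len(vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = bucket, score
        return best


nb = NaiveBayes()
nb.train("spam", "buy v1agra now")
nb.train("spam", "cheap v.i.a.g.r.a here")
nb.train("ham", "meeting at noon")
nb.train("ham", "lunch tomorrow at noon")
result = nb.classify("v!agra today")
```

In this sketch the spelling `v!agra` never appears in training, yet the message still shares the `trick:obfuscated` token with both spam examples. That shared evidence is what lets an unmodified classifier generalise across obfuscation variants.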

