Non-English spam: a case study
Vipul Sharma, Yanyan Yang and Jason Wallace, Proofpoint
There has been a significant increase in the volume and sophistication of non-English spam over the last few years.
We see spam in multiple languages including Russian, German, Japanese, Chinese, Dutch, Spanish, French, Norwegian,
Finnish, Italian, Danish, Swedish, Greek and Thai, and the list keeps growing. As the number of Internet
users increases across the globe, we expect an increase not only in the volume of non-English spam but
also in the number of languages used. Filtering such spam efficiently comes with its own challenges.
In this paper we discuss some of these challenges, including language detection, the implications of various
character sets, and a model for a language-independent spam filter. We discuss some intrinsic differences between
the structure and techniques of English spam and those of non-English spam, and we identify which properties should
and should not be used for efficient spam detection. We also share insights into the volume, rate of increase and
types of languages seen, and into our effectiveness against such spam. We show how our language identification
algorithm differs from other language detection algorithms. Finally, we discuss the benefits of using a
hybrid model of sender reputation and text classification in dealing with such spam.
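
The abstract does not describe the authors' algorithms, so the following is only an illustrative sketch of two of the ideas it names: using character sets as a coarse language hint, and blending a sender reputation score with a text classification score. The function names `dominant_script` and `hybrid_score`, the weighted-average blend and the default weight are all assumptions made for illustration, not Proofpoint's method.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script among the alphabetic characters.

    The script is approximated by the first word of each character's
    Unicode name (e.g. "CYRILLIC", "GREEK", "THAI", "LATIN", "CJK").
    This is a coarse hint only: it cannot tell Dutch from Danish.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def hybrid_score(reputation_score: float, text_score: float,
                 reputation_weight: float = 0.5) -> float:
    """Blend two spam scores in [0, 1] with a weighted average.

    The weighted average is an assumed combination rule; a real filter
    might use a trained model over both signals instead.
    """
    if not 0.0 <= reputation_weight <= 1.0:
        raise ValueError("reputation_weight must be in [0, 1]")
    return (reputation_weight * reputation_score
            + (1.0 - reputation_weight) * text_score)
```

A script hint like this only narrows the candidates: languages that share the Latin alphabet would still need a statistical language identifier. The reputation signal, by contrast, does not depend on the message language at all, which is one plausible reason a hybrid model helps with multilingual spam.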

The final VB100 of the year sees a double whammy of potential pitfalls for
our comparative participants: the Vista operating system, which still seems
shiny and new and a little scary (to both developers and users), and the
x64 architecture, whose ostensible compatibility with standard 32-bit
software belies oddities and intricacies that developers ignore at their
peril. The announcement of the test brought a few surprises, as several
regulars opted to skip this one, but the majority of veteran competitors
took part as usual, along with several newer faces, many of whom look set
to join the ranks of our regulars.