Marios Kokkodis and Michalis Faloutsos present the results of an empirical study of spamming activity, discussing the temporal spreading of spammers across the IP space and illustrating the evolution of high-activity spamming IP spaces.
Copyright © 2009 Virus Bulletin
Over the last few years, there has been an ongoing battle between botmasters and security administrators regarding the proliferation of bots. The former are constantly recruiting new members to their army, while the latter keep trying to improve their defences.
Intuitively, the larger a botnet becomes, the more harmful it can be. Since spamming is one of a botnet’s major activities, the proliferation of bots results in an increase in the volume of spam messages that travel across the Internet. Because of this, many studies have been conducted on the behaviour of spamming botnets. Even though the contribution of these studies is significant, it is also important to remain up to date, since spamming botnets evolve rapidly (e.g. by modifying their spamming tactics, expanding their army of compromised machines, updating the techniques they use to obfuscate their identities etc.). As a result of this constant evolution, empirical studies that deal with questions such as ‘who is sending all these unsolicited messages?’ become very important for both the evaluation and improvement of the currently available mitigation techniques.
In this article, we present the results of an empirical study that we conducted regarding spamming activity. In more detail, we discuss the temporal spreading of spammers across the IP space. Our study analyses spam messages received in the last four years and illustrates the evolution of high-activity spamming IP spaces. Our findings can be summarized into two main observations:
A previously unreported IP space has become a major source of spamming activity during the last two years.
There is a spreading trend of spamming activity across the IP space.
These two observations have grave significance since they can compromise the effectiveness of IP-filtering-based mitigation techniques. In the rest of this article we describe the analysis that led to our findings, and discuss some of the ensuing implications.
Before outlining our analysis, it is important to have a basic understanding of how a botnet functions. A botnet is a collection of compromised machines (i.e. bots) that are controlled centrally by a botmaster. Botnet sizes range from a few thousand to a few million compromised machines (e.g. the Conficker botnet has more than ten million bots in its army), while the amount of spam that a botnet can send ranges from hundreds of millions to a few billion messages per day. In addition to spamming, botnets often engage in other malicious activities (e.g. DoS attacks). However, in this study, we concentrate only on spamming.
Figure 1 shows an abstract view of a spamming botnet. From this, we can identify three major groups of participants in the spamming process:
The botmaster. This is the person (or persons) that control(s) everything that has to do with the botnet. The botmaster is in charge of:
Recruiting new members (bots) by crawling the Internet and attacking unprotected machines.
Managing his current resources to maximize his profit (e.g. splitting them into groups and assigning a different spam campaign to each group).
Managing the victim mailing lists (e.g. commanding bots to crawl the Internet and harvest new user accounts or to try randomly to guess some valid ones [e.g. from Google, Yahoo! etc.]).
The bots. These are compromised machines that blindly obey their masters’ commands. The bots are the origin of the spam messages received by Internet users.
The victims (represented by the ‘Internet’ cloud in Figure 1). These are listed user-accounts that receive spam.
Figure 1 provides a blueprint of the botnet spamming procedure: the botmaster assigns specific lists of users to each bot, and then commands them to begin sending spam messages.
Spam is a major problem that all network administrators have to overcome. The bad guys (spammers) are constantly improving their techniques, and so are the good ones (network administrators). As a result, a lot of work has been carried out in this field. Below are some fundamental findings that we already know about the origin of spam:
These findings are implicitly optimistic, as they suggest that by focusing on a few highly active IP prefixes, we may be able to fight spam at the IP level (e.g. block traffic from specific subnets). However, our study unveils a worrying trend: bots seem to be spreading widely across the IP space.
For our study, we used a publicly available dataset, which consisted of 2,046,520 spam messages (both email header and content). These messages were collected by various user accounts from three different domains over a four-year period (January 2006 – May 2009). The majority of these emails were flagged as spam by SpamAssassin, a well-known email filtering application. To increase our confidence in the dataset, we manually verified as spam a randomly chosen subset of the emails.
Extracting useful information from an email header is not trivial, since spammers usually bypass the SMTP protocol in order to obfuscate their identities. Therefore, we believe that it is important to clarify the parsing procedure we follow in our study.
Our ultimate goal here is to find a valid source IP for each message in the dataset. According to the SMTP protocol, each server that receives a message adds a Received record (e.g. Received: from example.com [22.214.171.124]) at the top of the email header. Hence, the earliest (bottom-most) Received record should include the IP of the first SMTP server that forwarded the email (i.e. the source IP). However, as mentioned above, in the case of spam messages the protocol is often violated, since spammers have developed techniques to hide (or obfuscate) their identities. An example of such a technique is to falsify the header information, either by modifying it or by inserting invalid Received headers. Therefore, the only relay whose IP address we can trust is the one that established the SMTP connection to our mail server. In our study, we used this address as the source IP for our analysis.
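The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the function name `source_ip`, the sample message and the assumption that the top-most Received header was written by the receiving mail server (with no further internal hops) are all ours.

```python
import ipaddress
import re
from email import message_from_string

# Matches a bracketed token such as [192.0.2.10] inside a Received header.
BRACKETED = re.compile(r"\[([0-9.]+)\]")


def source_ip(raw_message):
    """Return the IPv4 address in the top-most Received header, or None.

    Assumption: the top-most Received header was added by our own mail
    server, so it is the only record a spammer could not have forged.
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    if not received:
        return None
    match = BRACKETED.search(received[0])
    if match is None:
        return None
    try:
        # Reject strings that look like addresses but are not valid IPv4.
        return str(ipaddress.IPv4Address(match.group(1)))
    except ValueError:
        return None


# Hypothetical message: the second Received header could have been forged.
SAMPLE = (
    "Received: from relay.example.net (relay.example.net [192.0.2.10])\n"
    "\tby mx.example.org with SMTP; Mon, 4 May 2009 10:00:00 +0000\n"
    "Received: from forged.example.com ([203.0.113.99])\n"
    "\tby relay.example.net; Mon, 4 May 2009 09:59:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(source_ip(SAMPLE))  # 192.0.2.10
```

Only the first header is consulted. Everything below it is treated as untrusted, which mirrors the reasoning in the paragraph above.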
In order to provide better insight into our dataset we present some statistics in Table 1 and Table 2. In Table 1, we show the high-activity IP spaces for each of the four years of our dataset. In Table 2, we give the percentage of the IP space that is covered by these spaces, along with their respective contribution to the total volume of spam.
|Year||Space A||Space B||Space C|
|2006||58.* – 73.*, 80.* – 90.*||-||190.* – 222.*|
|2007||57.* – 92.*||121.* – 126.*||188.* – 222.*|
|2008||57.* – 96.*||116.* – 126.*||188.* – 222.*|
|2009||57.* – 97.*||113.* – 126.*||188.* – 222.*|
Table 1. Active IP chunks between 2006 and 2009.
|Year||Active IP space (% of total IP space)||Volume of spam (% of the total volume)|
|2006||22.6||92|
|2007||29.3||95|
|2008||32.4||91.5|
|2009||34.4||93.4|
Table 2. Contribution of the active spamming IP spaces to the total received volume of spam between 2006 and 2009.
More specifically, in the data from 2006, we can identify three high-activity spamming chunks of IPs (the first row in Table 1), which constitute 22.6% of the total IP address space, and are the origin of 92% of the total amount of spam that was received in 2006. This result is broadly in line with the Pareto principle that we mentioned before. In addition, it indicates that, by applying some kind of traffic control on those three IP spaces, we could significantly reduce the volume of spam received.
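As a rough illustration of the kind of traffic control mentioned above, the sketch below flags source addresses that fall inside the 2006 high-activity spaces. The ranges are copied from the first row of Table 1 at /8 granularity. Treating both endpoints as inclusive is our assumption, as are the names `HIGH_ACTIVITY_2006` and `in_high_activity_space`.

```python
import ipaddress

# First-octet ranges from the 2006 row of Table 1.
# Assumption: both endpoints of each range are inclusive.
HIGH_ACTIVITY_2006 = [(58, 73), (80, 90), (190, 222)]


def first_octet(ip):
    """Return the leading octet of a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(ip)) >> 24


def in_high_activity_space(ip, ranges=HIGH_ACTIVITY_2006):
    """True if the address lies in one of the given first-octet ranges."""
    octet = first_octet(ip)
    return any(low <= octet <= high for low, high in ranges)


print(in_high_activity_space("60.1.2.3"))   # True
print(in_high_activity_space("100.1.2.3"))  # False
```

A real deployment would more likely rate-limit or score such traffic than drop it outright, since these ranges also contain large numbers of legitimate hosts.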
In the 2007 data, the percentage covered by high-activity areas (presented in the second row in Table 1) rises to 29.3% of the total IP space, and is responsible for 95% of the total volume of received spam. Furthermore, the high-activity chunks of 2006 are only a subset of the respective spamming chunks of 2007 – an observation that shows a spreading of spammers over the IP space. In the data from 2008, we again identify three high-activity areas (third row in Table 1), which are the cause of 91.5% of the total volume of spam, and constitute 32.4% of the total IP address space. A similar argument can be made for the high-activity spamming IPs of 2009 (fourth row in Table 1): these chunks are responsible for 93.4% of the total amount of spam, while they cover 34.4% of the IP space.
In Figure 2 we present the spam Cumulative Distribution Function (CDF) for each of the last four years. In Table 1, we list the high-activity IP spaces with respect to spamming. The first unexpected observation is the intense spamming activity of the IP space between 113.* and 126.*. To the best of our knowledge, no one so far has observed high spam activity in this area (shown as an inset in Figure 2). Note that spaces A and C were reported by previous studies, which increases the confidence in our dataset. This new space shows low spam intensity in 2007 but by 2009, it has become one of the three major spamming IP areas, serving as the origin of 15% of the received volume of spam in 2009 while constituting only 5% of the total IP space.
Figure 2. The cumulative distribution of spamming activity across the IP space over the last four years. We show the two high-activity areas (left and right boxes) and an emerging high-activity area (middle box) not reported so far.
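A curve of the kind shown in Figure 2 could be approximated along the following lines. This is a sketch under our own assumptions: it bins addresses by first octet only (the figure may use a finer resolution), and the function names and toy address list are illustrative rather than taken from the study.

```python
from collections import Counter
from itertools import accumulate


def spam_cdf_by_first_octet(source_ips):
    """Return 256 values; entry k is the share of messages whose source
    address has a first octet less than or equal to k."""
    counts = Counter(int(ip.split(".")[0]) for ip in source_ips)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no source addresses given")
    running = accumulate(counts.get(octet, 0) for octet in range(256))
    return [value / total for value in running]


def share_between(cdf, low, high):
    """Share of messages whose first octet lies in [low, high]."""
    below = cdf[low - 1] if low > 0 else 0.0
    return cdf[high] - below


# Toy input: one source address per received message.
ips = ["58.1.1.1", "60.2.2.2", "121.3.3.3", "200.4.4.4"]
cdf = spam_cdf_by_first_octet(ips)
print(share_between(cdf, 113, 126))  # 0.25
```

A steep rise of the CDF over a narrow range of octets corresponds to a high-activity area. `share_between` reads off the contribution of any such range, for example 113.* to 126.*.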
The next important observation has to do with the ‘spreading’ trend of spamming activity across the IP space between 2006 and 2009. This ‘spreading’ observation is supported by two facts: (a) a new active area has emerged (2007–2009, as described before), and (b) the known major spamming areas have widened since 2006 (shown in both Table 1 and Figure 2).
There are several different ways to quantify this trend. For example, in 2006, the high-activity spaces covered 22.6% of the total IP space. This percentage increased every year and reached 34.4% in 2009, illustrating the spreading trend of the spamming areas.
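The arithmetic behind this trend can be made explicit with the coverage figures quoted in the text (22.6%, 29.3%, 32.4% and 34.4%). The helper names below are ours and the snippet is only a worked example.

```python
# Share of the total IP space covered by high-activity areas (% per year),
# as reported in the text.
coverage = {2006: 22.6, 2007: 29.3, 2008: 32.4, 2009: 34.4}


def yearly_changes(series):
    """Return (year, change in percentage points) for consecutive years."""
    years = sorted(series)
    return [
        (cur, round(series[cur] - series[prev], 1))
        for prev, cur in zip(years, years[1:])
    ]


def relative_growth(series, start, end):
    """Relative growth of the covered share between two years."""
    return (series[end] - series[start]) / series[start]


print(yearly_changes(coverage))  # [(2007, 6.7), (2008, 3.1), (2009, 2.0)]
print(f"{relative_growth(coverage, 2006, 2009):.1%}")  # 52.2%
```

In other words, the covered share grew by roughly half over the period, although the yearly increments became smaller.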
Another way to show this ‘spreading’ is to focus on the spam activity of the /16 subnets that were active throughout the period covered by the data. In Figure 3, we plot the cumulative percentage of the spam activity of these /16 prefixes as a percentage of the total received spam in each year. Note that the total on the y-axis does not add up to 100%, as there is contribution from subnets that were not part of the group we examined. The x-axis presents the active /16 prefixes, in order of decreasing activity. Conceptually, the closer the line is to the upper left corner, the more concentrated the spamming activity. In 2006, almost 90% of the total volume of received spam originated from these subnets. In the following years, the contribution of these subnets steadily decreased, dropping down to 52% in 2009. This indicates that over time, new IPs become responsible for an increasing portion of the total volume of received spam.
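The analysis behind Figure 3 might be reproduced roughly as shown below. This is a sketch: the input format (a mapping from year to one source IP per message), the function names and the toy data are our assumptions, not the study's actual pipeline.

```python
from collections import Counter


def prefix16(ip):
    """Return the /16 prefix of a dotted-quad address, e.g. '58.1'."""
    first, second = ip.split(".")[:2]
    return f"{first}.{second}"


def persistent_prefix_curve(ips_by_year, year):
    """Cumulative share (%) of `year`'s spam sent from /16 prefixes that
    were active in every year, ordered by decreasing activity in `year`."""
    per_year = {
        y: Counter(prefix16(ip) for ip in ips)
        for y, ips in ips_by_year.items()
    }
    # Prefixes that appear in every year of the dataset.
    persistent = set.intersection(*(set(c) for c in per_year.values()))
    counts = per_year[year]
    total = sum(counts.values())
    ranked = sorted((counts[p] for p in persistent), reverse=True)
    curve, running = [], 0
    for n in ranked:
        running += n
        curve.append(100.0 * running / total)
    return curve


# Toy data: the persistent prefixes are 58.1 and 200.5.
data = {
    2006: ["58.1.0.1", "58.1.0.2", "58.1.9.9", "200.5.1.1"],
    2009: ["58.1.0.1", "200.5.1.1", "121.7.1.1", "121.8.1.1"],
}
print(persistent_prefix_curve(data, 2006))  # [75.0, 100.0]
print(persistent_prefix_curve(data, 2009))  # [25.0, 50.0]
```

The last value of each curve is the total contribution of the persistent prefixes in that year. In the toy data it falls from 100% to 50%, the same kind of decline that the text reports for the real dataset.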
The implications of the two observations that we made from our analysis need to be discussed further. The spreading trend indicates that IP-filtering can barely keep up with bots. This is because spammers seem to exploit the entire active IP space, by constantly crawling the Internet and recruiting new members.
Another important point is the rapid expansion of botnets to newly allocated IP spaces. According to IANA, the /8 subnets 121.* to 123.* were allocated for the first time in 2006 and by 2007, they were already part of the high-activity spamming subnets. The same happened for /8 subnets 114.* to 120.* a year later.
We have described an empirical study of a publicly available archive of spam messages gathered during the last four years. Our analysis has revealed a worrying trend: spamming bots seem to have spread more widely across the IP space since 2006. This spreading has major implications, since IP-based filtering for bots and spam is becoming more challenging. At the moment, it seems that security administrators may be losing the war against botmasters.
 A spam campaign is a group of spam messages that have the same (or very similar) subject (e.g. a drugstore advertisement).
 Husain, H.; Phithakkitnukoon, S.; Palla, S.; Dantu, R. Behavior analysis of spam botnets. COMSWARE. 2008.
 Ramachandran, A.; Feamster, N. Understanding the network-level behavior of spammers. SIGCOMM. 2006.
 Xie, Y.; Yu, F.; Achan, K.; Panigrahy, R.; Hulten, G.; Osipkov, I. Spamming botnets: Signatures and characteristics. SIGCOMM. 2008.
 Chen, Z.; Ji, C.; Barford, P. Spatial-temporal characteristics of internet malicious sources. INFOCOM. 2008.
 The 80-20 rule in our case indicates that 80% of the received volume of spam originates from 20% of the IP space.
 Spam Archive. http://untroubled.org/spam/.
 In our study we considered all the valid IP addresses (i.e. allocated, unallocated and reserved) as IP space.
 Crawling the Internet means that some bots which are already members of the army find new unprotected machines across the active IP space and compromise them.