Spam and the digital divide

2007-12-01

Reza Rajabiun

COMDOM Software & York University, Canada
Editor: Helen Martin

Abstract

Reza Rajabiun considers the implications of spam for developing countries and the persistence of the digital divide.


Introduction

Over recent years, large volumes of spam seem to have become a permanent feature of the Internet. While the exact volume of spam changes on a daily basis, typically around 80% of all messages are classified as unwanted. Even if the theoretical ideal of a perfect Bayesian content filter were to exist, a high noise-to-signal ratio would still incur significant costs in terms of the physical and human resources required to provide end-users with access to new information technologies [1].

In addition to frustrating end-users and weakening their trust in information technology, spam increases the total volume of traffic that must be processed by ISPs and other network providers – thus also pushing up their costs. The increased costs can have significant implications for the provision of Internet and messaging services, especially in developing countries.

This article focuses on the implications of spam for developing countries and the persistence of the digital divide. We assert that the adoption of high-capacity, self-learning content filters at the server level must be an integral part of efforts to address the gap in access to information technologies across the global population.

The problem

Given the way in which spam has evolved over the past decade, it seems safe to assume that the very low cost of sending messages via email and other new information technologies is a strong incentive for individuals to send spam on open networks. Closing a network to certain classes of external traffic, or imposing some form of ‘tax’ on senders, would of course mitigate this. However, such drastic measures would be inconsistent with the role of the Internet as a platform on which end-users communicate cheaply and effectively, at both local and global levels.

Although we are not aware of any empirical studies that have measured the impact of spam on the digital divide, the extent of the gap between rich and poor countries in terms of access to new information technologies is clear. Recent data published by the International Telecommunication Union (ITU) highlights the challenge of bridging the digital divide. Figure 1 illustrates the evolution of this divide in terms of the percentage of Internet users in the population from 1995 to 2006. The graph shows that, while a large gap remains, some narrowing has taken place due to an increase in Internet availability in developing countries over the past few years [2]. As global access to the telecommunications infrastructure increases, we can expect to see a further increase in the volume of spam.

Figure 1. The digital divide.

A high proportion of noise relative to signal consumes large amounts of bandwidth and processing power, both of which are already scarce. This results in a high total cost of ownership (TCO) for providing Internet access to the billions of people who cannot otherwise benefit from educational and commercial services, as well as lower rates of return on investment – thus discouraging the public and private sectors from building network capacity.
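
As a rough, back-of-the-envelope illustration (the figures below are hypothetical, except for the 80% spam share cited in the introduction), the spam share of traffic determines how much capacity a provider must provision for every legitimate message it delivers:

```python
# Illustrative arithmetic only: capacity an ISP must provision per
# legitimate (ham) message as a function of the spam share of traffic.

def provisioning_factor(spam_fraction: float) -> float:
    """Total messages carried for each legitimate message delivered."""
    return 1.0 / (1.0 - spam_fraction)

for share in (0.5, 0.8, 0.9):
    print(f"{share:.0%} spam -> {provisioning_factor(share):.0f}x capacity per ham message")
```

At the 80% spam rate noted in the introduction, a provider must carry and process five messages for every one its users actually want – a direct multiplier on bandwidth, storage and server costs.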

Despite some narrowing of the digital divide, the ITU data shows that significant asymmetries persist geographically in the available Internet bandwidth. For instance, Denmark, a relatively small developed economy, has twice the capacity of Latin America and the Caribbean put together. From the end-users’ perspective, bandwidth constraints slow the downloading of data, which, outside of major urban areas, tends to take place through expensive dial-up connections and (increasingly) mobile platforms.

Additionally, if the emerging networks in developing countries get clogged up with spam too quickly, end-users’ trust in new information technologies will be undermined. This could present a significant obstacle to the development of new economic applications of telecommunications technology such as mobile text messaging payment systems in countries with underdeveloped banking infrastructures.

For the ISPs that provide the underlying services, bandwidth constraints exacerbate the traditional problems that face network operators in developed economies. These providers must employ a larger number of servers to process noisy incoming traffic. Bandwidth constraints mean that more of the processing must take place at the server level, in order to allow more people to access their messaging applications. The adoption of efficient anti-spam systems at the server level generally lowers the impact of spam on bandwidth, hardware and software.

Managing volatile and complex forms of spam places further demands on the administrative resources of service providers. A 2005 report by the Organization for Economic Cooperation and Development (OECD) highlighted how the impact on human resources is magnified for infrastructure providers in developing countries [3]. The report noted that administrators in developing countries have limited experience with spam, which threatens the stability of the operating infrastructure during sudden surges of global or local spam. Moreover, assigning scarce administrative skills to fighting spam carries an opportunity cost: those skills cannot be put towards more productive applications, from education and training to improving the efficiency of markets in developing countries [4]. Increased automation of the spam-filtering process will help mitigate the human resource costs of spam.

The 2005 OECD report estimated the cost of spam for a (very) large ISP, operating under administrative and bandwidth constraints common to developing countries, at around 10% of its operating budget. Given the economies of scale in information technology management, the overall costs are likely to be higher for smaller ISPs, which have less capacity to hire well-trained administrators, implement state-of-the-art hardware or obtain volume discounts on bandwidth from backbone operators.

The significance of the network costs of spam has resulted in proposals for a wide range of regulatory, economic and technological mechanisms for tackling the spam problem, some of which appear more practical than others.

Cost mitigation

Regulatory solutions to the spam problem have been adopted in an increasingly large number of jurisdictions since the early 2000s, including several developing countries. These laws typically impose restrictions on spammers through criminal and/or civil sanctions. However, as detailed by Ramachandran and Feamster [5], spammers have many ways of hiding their identity within the infrastructure of large backbone providers, for example through BGP spectrum agility techniques. These techniques render sender-oriented regulatory strategies ineffectual.

The economic approach to mitigating the spam problem suggests that the noise we observe today is an inevitable by-product of the adoption of technologies that radically lower the costs of sending information. A large number of proposals have been put forward for mechanisms aimed at reallocating the costs of sending massive volumes of messages back to the spammers. However, much like regulatory solutions, the implementation of economic mechanisms requires credible sender authentication procedures. Widely used spamming technologies that are available on a commercial basis can bypass authentication protocols easily; as a result, both regulatory and economic solutions have found limited practical success.

One unfortunate result of the resilience of spammers has been the increased use of blacklisting, which arguably threatens to divide the global email system. Blacklisting and other ad hoc methods of identifying spam can be inefficient and discriminatory. For instance, large parts of the Chinese system are now blocked by the rest of the world, raising significant concerns for China’s Internet users. Local networks may construct national or regional ‘walled gardens’ by limiting incoming and outgoing traffic through ad hoc administrative decisions, such as blocking messages containing text in non-Latin scripts such as Chinese, Cyrillic or Arabic (see the sketch below).
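
To see why such ad hoc rules are blunt instruments, consider a minimal sketch of a script-based blocking rule of the kind described above (a hypothetical rule for illustration, not any particular operator’s policy):

```python
import re

# Hypothetical ad hoc rule: reject any message containing Chinese, Cyrillic
# or Arabic characters. The rule is trivial to write, but it cannot tell a
# legitimate correspondent writing in one of these scripts from a spammer.
NON_LATIN = re.compile(r'[\u4e00-\u9fff\u0400-\u04ff\u0600-\u06ff]')

def blocked(message_body: str) -> bool:
    return bool(NON_LATIN.search(message_body))

print(blocked("Quarterly report attached."))       # False
print(blocked("你好, a genuine business enquiry"))  # True: legitimate mail dropped
```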

With the emergence of image spam, which places even higher demands on processing and bandwidth, some administrators have reacted similarly by excluding from their networks all messages that contain pictures. Such efforts may be justified to maintain the stability of a network in the shorter term, but clearly they limit the usefulness of the Internet as a global platform for personal and business communications.

Given the inadequacy of regulatory and economic solutions, the optimization and automation of anti-spam systems appears to be the most practical solution for reducing the network costs of spam, and hence their impact on the digital divide.

In recent years, the rising network costs of spam have motivated anti-spam software developers not only to enhance the accuracy of their systems by taking account of end-user preferences, but also to increase automation and throughput.

At least since the proposal by Sahami et al. [6], computer scientists have argued that Bayesian content filters offer the most efficient solution in terms of accuracy. One reason for this is that content classifiers that learn from end-user preferences about what constitutes ham and spam can take account of the subjective nature of the classification process in a heterogeneous network. The open source SpamAssassin project, which now serves as the core of numerous commercial front ends and appliances, builds on these early insights [7].
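
The following is a minimal sketch of the Bayesian approach in the spirit of Sahami et al. [6] – an illustrative naive Bayes classifier with add-one smoothing, not the actual SpamAssassin implementation; the class name and training data are invented for the example:

```python
import math
from collections import Counter

class BayesFilter:
    """Toy naive Bayes spam classifier trained on user-labelled messages."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        # Each message a user labels as spam or ham updates the token counts.
        self.messages[label] += 1
        self.tokens[label].update(text.lower().split())

    def spam_probability(self, text: str) -> float:
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"]))
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.tokens[label].values())
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(self.messages[label] / sum(self.messages.values()))
            for word in text.lower().split():
                score += math.log((self.tokens[label][word] + 1) / (total + vocab))
            scores[label] = score
        # Normalize the two log scores into a probability of spam.
        m = max(scores.values())
        odds = {k: math.exp(v - m) for k, v in scores.items()}
        return odds["spam"] / (odds["spam"] + odds["ham"])

f = BayesFilter()
f.train("spam", "cheap meds buy now")
f.train("ham", "meeting notes attached see agenda")
print(f.spam_probability("buy cheap meds"))  # ~0.9: classified as spam
```

Because the token counts come from each user’s (or site’s) own labelling decisions, the same engine can reach different verdicts on different networks – precisely the subjectivity that the fingerprinting approach discussed next gives up.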

However, some large ISPs switched to a second type of spam filtering in the early 2000s, which identifies spam by detecting large numbers of similar messages rather than by scanning and filtering the content of the messages themselves. Although less accurate than Bayesian filters, these so-called fingerprinting/checksum systems offered much higher throughput rates [8].
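
A minimal sketch of the fingerprinting/checksum idea follows. Production systems use fuzzy, randomization-resistant digests (see [8]); a plain cryptographic hash and an invented bulk threshold are used here purely for illustration:

```python
import hashlib
from collections import defaultdict

# Count how often each message fingerprint has been seen across the mail
# stream; a message repeated in bulk is presumed to be spam.
seen = defaultdict(int)
BULK_THRESHOLD = 1000  # hypothetical cut-off for 'bulk' mail

def fingerprint(body: str) -> str:
    # Normalize whitespace and case, then hash the result.
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def looks_like_bulk(body: str) -> bool:
    fp = fingerprint(body)
    seen[fp] += 1
    return seen[fp] > BULK_THRESHOLD
```

Note that an exact hash is defeated by changing a single character per copy – precisely the weakness spammers went on to exploit, as described below.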

Our tests indicate that a Linux server running SpamAssassin on a 1.7 GHz CPU can process around 20 messages per second. The throughput rates of the leading fingerprinting/checksum systems available today (as reported by their providers) converge to around 100 messages per second on comparable hardware and OS configurations. This difference explains to some degree why some large ISPs switched to commercial fingerprinting systems, despite their limited accuracy relative to Bayesian filters: in theory, fingerprinting systems lowered the total number of servers required to handle a given volume of traffic by a factor of five.

Unfortunately, spammers quickly learned to automate the production of large volumes of messages that each appear unique to a fingerprinting system. To some degree, the battle between spammers and these systems has contributed to the growth and sophistication of the spam we observe today [9].

More recently, developers of Bayesian systems have increased their throughput rates radically by implementing advanced pattern-scanning and content-classification techniques. For instance, the Tachyon Core scanning engine in the COMDOM Antispam for Servers software produces throughput rates of around 600 messages per second on configurations similar to those noted above. In other words, anti-spam developers have increased the processing capacity of their software more than thirtyfold in less than five years. This improvement means that one mail server running a second-generation Bayesian filter can handle the same volume of traffic as six servers relying on the fastest fingerprinting/checksum systems.
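
The server-count arithmetic implied by these figures can be made explicit. The throughput rates (20, 100 and 600 messages per second) are those reported above; the peak load is a hypothetical example:

```python
import math

PEAK_LOAD = 600  # messages/second; hypothetical ISP peak load

throughputs = {
    "first-generation Bayesian (SpamAssassin)": 20,
    "fingerprinting/checksum": 100,
    "second-generation Bayesian (Tachyon Core)": 600,
}

for system, rate in throughputs.items():
    servers = math.ceil(PEAK_LOAD / rate)
    print(f"{system}: {servers} server(s) for {PEAK_LOAD} msg/s")
# -> 30, 6 and 1 server(s) respectively
```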

In addition to their higher processing efficiency and accuracy, Bayesian content filters allow for the decentralization and automation of spam identification. Managing a fingerprinting system necessitates a centralized architecture through which anti-spam software developers adjust their checksum-generating algorithms in response to changes in randomization techniques. This design requires communication between local servers and a centralized database of checksums, which further drains bandwidth and processing power. Advanced Bayesian content filters learn automatically from end-user behaviour, store this knowledge in a local database, and then identify spam/ham based on the historical preferences used to train them. Localization eases bandwidth constraints, while advances in automation reduce the need for continuous administrative intervention and ad hoc exercises in rule setting.
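
A sketch of this decentralized feedback loop is shown below (an illustrative design with invented names, not the internals of any specific product): user corrections update a token database held entirely on the local server, so classification never requires a round trip to a central service.

```python
from collections import Counter

# Purely local token database: trained, stored and queried on this server.
local_db = {"spam": Counter(), "ham": Counter()}

def on_user_feedback(message_text: str, marked_as_spam: bool) -> None:
    """Called when an end-user marks a message as spam or not-spam."""
    label = "spam" if marked_as_spam else "ham"
    local_db[label].update(message_text.lower().split())

def spamminess(message_text: str) -> float:
    """Crude score: fraction of tokens seen more often in spam than ham."""
    words = message_text.lower().split()
    spammy = sum(1 for w in words if local_db["spam"][w] > local_db["ham"][w])
    return spammy / len(words) if words else 0.0

on_user_feedback("cheap meds buy now", marked_as_spam=True)
print(spamminess("buy cheap meds"))  # 1.0 - all tokens now look spammy locally
```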

Implementation

Unfortunately, the development of more efficient technological solutions does not necessarily translate into increased availability of electronic communications. One reason for this is the fixed switching costs arising from decisions made earlier about operating systems and security applications. However, switching costs are likely to be more relevant to the choice of anti-spam technologies in developed countries, where more people are already ‘tied in’ to older and/or less efficient software.

The urban/rural divide within developing countries poses specific challenges in terms of extending points of contact between end-users and the local hubs required to process and deliver their messages. Adoption of more efficient anti-spam technologies will lower the network costs facing all ISPs. However, this does not mean that existing providers will necessarily use these resources to extend access to more remote areas.

On a more positive note, it is worth remembering that advances in mobile technologies are making it increasingly less costly to extend the traditional reach of the digital economy beyond urban areas. Solving the ‘last mile’ problem lowers the costs of extending access into areas with low population density. In conjunction with low-cost, multi-tasking mobile devices, such as those of the ‘one laptop per child’ program, such advances have the potential to narrow the divide we observe today at the global level [10]. Unfortunately, if the experience of developed countries is any guide, the reduction in costs will be accompanied by a rise in undesirable content for the new end-users.

Some of the noise will be from the large global flows of spam that we see today, sent mostly from the networks of large operators in Western Europe and North America. Another portion will be produced by local sellers, who will use the new technologies to find buyers for their products and services.

Regardless of their origins, large volumes of spam necessitate the capacity to scan and filter electronic content efficiently – that is, accurately, quickly, and with a minimum of administrative intervention. The adoption of fast, self-learning filters should be encouraged through targeted programs that condition technology licensing on increased levels of access [11]. The fewer the resources required to run messaging servers, the more resources (both human and physical) will be available to narrow the digital divide.

Bibliography

[1] Loder, T.C.; Van Alstyne, M.W.; Wash, R. ‘Information asymmetry and thwarting spam’. 2004. http://ssrn.com/abstract=488444. (An intuitive economic model of spam production, the perfect Bayesian filter, and other classes of solutions.)
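
[2] International Telecommunication Union (ITU). Data on Internet users as a percentage of the population, 1995–2006. http://www.itu.int/.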

[3] OECD. ‘Spam issues in developing countries’. 2005. http://www.oecd-antispam.org/.

[4] Chowdhury, S.K. ‘Search cost and rural producers’ trading choice between middlemen and consumers in Bangladesh’. Journal of Institutional and Theoretical Economics (JITE), Mohr Siebeck, Tübingen, vol. 127(3), 2004. (An insightful analysis of the impact of communication technologies on the dynamics of local exchange.)

[5] Ramachandran, A.; Feamster, N. ‘Understanding the network-level behavior of spammers’. Proceedings of SIGCOMM ’06, Pisa, Italy, 2006.

[6] Sahami, M.; Dumais, S.; Heckerman, D.; Horvitz, E. ‘A Bayesian approach to filtering junk email’. AAAI Workshop on Learning for Text Categorization, 1998.
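
[7] The Apache SpamAssassin Project. http://spamassassin.apache.org/.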

[8] Kosik, P.; Rajabiun, R. ‘Antispam technology impact assessment: fingerprinting versus Bayesian filtering’. September 2007. http://www.comdomsoft.com/en/antispam/white-papers/. (An updated review of the accuracy and throughput rates of fingerprinting/checksum systems and Bayesian content filters used at the ISP and large corporate server level.)

[9] http://www.jgc.org/tsc.html contains updated data and analysis of the history and emerging forms of spam.
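
[10] One Laptop per Child. http://laptop.org/.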

[11] For example, the COMDOM Software Educational Program at: http://www.comdomsoft.com/en/education.
