Fighting spam using tar pits

2007-09-01

Tobias Eggendorfer

Independent researcher, Germany
Editor: Helen Martin

Abstract

Tobias Eggendorfer describes how both SMTP and HTTP tar pits offer interesting ways of helping to get rid of spam.


Introduction

Spam and viruses are the biggest threats to Internet usage according to a recent survey. One thing they have in common is that both can be distributed by email. For a spammer, email is the main tool, while for a virus writer, email is just one of many propagation vectors.

One of the most important considerations for spam prevention techniques is to keep the number of false positives and false negatives to a minimum. A false negative is a spam message that ends up in a user’s inbox, wasting his time and increasing the risk that he accidentally deletes an important message because it is buried among hundreds of spam messages. A false positive is a ‘ham’, i.e. a non-spam message, that is moved to the spam folder. In a business environment this could mean lost business because an order does not get the required attention, or it could put a company at risk of being sued for not fulfilling an order in time, increasing the economic cost of a false positive by several orders of magnitude.

Most anti-spam and anti-malware techniques are symptomatic cures for these epidemics; few attempt to tackle the root of the problem. Anti-malware programs try to identify malware as soon as it is transferred onto a computer, whether by email, by a downloaded program file or through a remotely exploitable security hole. A causal therapy would be to design an operating system with security in mind and to dedicate know-how and labour time to security testing. OpenBSD is an example of how effective this can be.

Symptomatic therapy

One anti-spam technique that attempted to tackle the root of the problem was sender authentication and identification – which included Domain-ID, Sender-ID and SPF. Although well intended, these technologies were widely expected to fail, because all known options break important email functionality. Email forwarding is a very important function, but it is virtually impossible with these security measures in place. With people moving around the world, travelling and changing jobs frequently, having access to their communication systems through any service they choose is a must. In a paper presented at the Conference on Email and Anti-Spam (CEAS), a Google employee explained that the vast majority of GoogleMail users have their email forwarded from some other account to GoogleMail, thus complicating sender authentication. It is safe to assume that a lot of web mail service providers encounter the same problem.

But removing important functionality was not sender authentication’s only problem: spammers could also register their domains with perfectly valid SPF records in DNS. Identifying spam using any of the sender authentication techniques is like playing Russian roulette – some will survive.

Furthermore, patents and attempts to gain influence, power and money became entangled with these techniques when they first became available. It is likely that this deterred many developers and administrators from implementing the technology, and the resulting lack of uptake also contributed to its failure.

Greylisting

Another approach to getting rid of spam without implementing a filter is greylisting. Greylisting takes advantage of the fact that most bulk mailers implement only a limited subset of SMTP. In particular, they often lack the functionality to deal with temporary error conditions on the server side. A temporary error is a failure that might resolve itself within a short period of time without any intervention from the administrator. Examples are a user over-quota condition, a temporary failure to connect to a company’s LDAP directory to determine whether a user exists locally, or an overload condition on the server.

The SMTP standard anticipates these issues and requires MTAs to retry sending such messages for a certain period of time. In its default configuration, the very common MTA sendmail retries delivery for five days. Spammers’ bulk mailers, however, do not resend messages. Greylisting systems therefore respond to an incoming message with a temporary error code and store a tuple consisting of the client’s IP address and the envelope sender and recipient of the message. If, after a certain period of time, the client reconnects and tries to deliver a message with a matching tuple, the mail is accepted. According to its proponents, greylisting reduced spam by 80% when it was first introduced.
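
The core decision logic is simple enough to sketch in a few lines of Python (a minimal illustration, not any particular implementation; the 300-second delay and the in-memory store are assumptions – production greylisters persist the tuples and hook into the MTA):

    import time

    GREY_DELAY = 300   # seconds a new tuple must wait before acceptance (assumed)
    seen = {}          # (client_ip, envelope_from, envelope_to) -> first contact time

    def greylist_check(client_ip, envelope_from, envelope_to):
        """Return an SMTP reply for this delivery attempt."""
        key = (client_ip, envelope_from, envelope_to)
        now = time.time()
        if key not in seen:
            seen[key] = now                # first contact: remember the tuple
        if now - seen[key] < GREY_DELAY:
            return "451 4.7.1 Greylisted, please try again later"
        return "250 OK"                    # a real MTA retried; bulk mailers rarely do

A legitimate MTA retries after the delay and is accepted; a bulk mailer that never retries is silently lost.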

Now, however, the efficiency of greylisting is dropping, as bulk mailers have learned to resend their messages. Greylisting.org explains greylisting’s new main advantage:

This delay in new sender contacts also gives you a lot of extra power. This may be an hour, but in this hour there is a large chance that the mass mailer/spammer has been identified by the more conventional anti-spam software. Thus, when he retries, it is likely that we will know him for what he really is!

Obviously, there are other ways to leave a message waiting until the spam filter has been updated.

Also, some providers claim that greylisting wastes their resources: each message must be queued for longer, forcing a large provider to add terabytes of storage space to accommodate the waiting messages.

Greylisting is also incompatible with a setup often found at larger providers: if a server farm is used instead of a single outgoing mail server, the resend attempt often comes from a different IP address than the original. Greylisting proponents argue that this should not be an issue, as only a few relevant providers use this technology and those can be whitelisted manually. This might be feasible for a small environment with limited worldwide contacts, but not for an international one.

Furthermore, a lot of mail services are incompatible with greylisting; e.g. messages sent by Yahoo Groups did not reach recipients who used greylisting. Once again, the supporters of greylisting came up with the solution of whitelisting those IPs, and also explained that the companies concerned must be using a broken MTA. But a problem introduced by a new technology can’t be the existing system’s fault – a new safety feature should be compatible with existing technology.

Taking these issues into account, and adding the ever-decreasing effectiveness of greylisting, it seems that greylisting is not the solution to the spam problem either.

Preventing spam

By looking at the problems with existing attempts to reduce spam, we can draw up a list of requirements for a new concept. The most important are: compatibility, efficiency and, last but not least, free availability unencumbered by patents.

One method to prevent spammers from spamming is simply to prevent them from collecting email addresses. One of their sources is malware-infected computers. Some malware searches the local hard disk for email addresses. Obviously, in order to prevent this method of address harvesting we need to prevent the malware from getting onto the machine – a problem that is not yet fully resolved and beyond the scope of this article.

However, another major source of email addresses is the Internet itself, both Usenet and the World Wide Web. The latter has become more important over the years and, with forums and newsgroups increasingly mirrored to the web, is the more promising source for an email harvester.

One thing that can be done to prevent the harvesting of email addresses from a web page is to obfuscate the addresses in such a way that they are unreadable to harvesters, yet compatible with any browser technology and barrier-free. The latter requirement is not met by the often-suggested use of a graphical representation of the email address. We analysed the efficiency of various obfuscation methods and found a rather simple one to be very effective: simply by adding a white space after every other letter, the address is blown up and becomes very difficult to find automatically in any document. For example, finding ‘us er @ ex am ple. co m’ in this text is not trivial, as the left and right boundaries of the address are hard to identify.
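
As a rough sketch of the idea (plain Python, not the Apache module itself):

    def obfuscate(address):
        """Insert a space after every other character to blur the address's boundaries."""
        return " ".join(address[i:i + 2] for i in range(0, len(address), 2))

    print(obfuscate("user@example.com"))   # -> us er @e xa mp le .c om

A human reader can still reconstruct the spaced address, while a harvester’s pattern matcher no longer finds a contiguous token to extract.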

Ongoing research confirms this still to be a secure and efficient way of obfuscating an email address. We therefore developed an Apache module that obfuscates addresses on the fly during output using this method, thus making secure obfuscation a matter of installing and enabling it in the Apache configuration file.

HTTP tar pit

Obfuscation is a rather selfish approach: it helps users protect their own inboxes, but it is of no use to the wider Internet community. Therefore, we looked for a way to stop harvesters while they are in the process of collecting email addresses. To do so, we developed an HTTP tar pit.

In brief, the HTTP tar pit creates random web pages linking back to itself and thereby traps the harvester in an infinite loop. Obviously, the links need to be different every time, because no decent spider would return to a page it has visited before. Our tar pit creates random file names with random, yet plausible, file extensions. The server is configured to redirect every request for one of those random URLs to the tar pit script. This is done using Apache’s ErrorDocument method and resetting the HTTP status code from ‘404 Not Found’ to ‘200 OK’.
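
A minimal sketch of such a script, assuming a Python CGI behind Apache’s ErrorDocument directive (the link count, name length and extension list are arbitrary choices, not those of our module):

    #!/usr/bin/env python3
    # Hypothetical tar pit page generator. Apache maps missing files to it with:
    #   ErrorDocument 404 /cgi-bin/tarpit.py
    import random
    import string

    EXTENSIONS = [".html", ".htm", ".php", ".asp"]    # plausible-looking endings

    def random_link():
        name = "".join(random.choices(string.ascii_lowercase, k=8))
        return "/" + name + random.choice(EXTENSIONS)

    # Override the 404 with a 200 so the harvester sees a normal page.
    print("Status: 200 OK")
    print("Content-Type: text/html")
    print()
    print("<html><body>")
    for _ in range(20):
        link = random_link()
        print('<a href="%s">%s</a><br>' % (link, link))
    print("</body></html>")

Every generated URL is new, so the 404 handler fires again on the next request and the loop never ends.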

As some harvesters impose a maximum link depth per domain in order to avoid endless loops, we use DNS wild cards to create random subdomains. This resets the harvester’s link counter and thus keeps him in the tar pit. To increase the effect further, we run several interconnected tar pits on multiple machines with multiple IPs, further obfuscating their existence.
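
A single wildcard record in the tar pit’s zone is all the DNS side requires; a hypothetical BIND-style entry (names and address are placeholders):

    *.tarpit.example.com.    IN    A    192.0.2.10

Any subdomain the tar pit invents – aj7fk.tarpit.example.com, x91bd.tarpit.example.com and so on – resolves to the same tar pit host, while looking like a brand-new site to the harvester’s per-domain link counter.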

Because the HTTP tar pit offers more new links to itself than there are new links on an average web page, the tar pit’s links pollute the list of pages to visit that is maintained by the harvester. Thus, the more often the tar pit is visited, the more efficient it gets until it takes up almost 100% of the harvester’s links to visit.

Besides catching harvesters in an endless link loop, the tar pit also stutters each byte slowly to the client to delay the communication further and hold the harvester for even longer. This stuttering needs to be adjusted carefully to the time-outs harvesters use, so that they do not disconnect too quickly but stay in the trap.
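
The stuttering itself is trivial; a sketch in Python, with the per-byte delay as a tunable assumption:

    import time

    STUTTER_DELAY = 0.5   # seconds per byte (assumed; must stay below harvester time-outs)

    def stutter(conn, data):
        """Send data one byte at a time, sleeping between bytes."""
        for i in range(len(data)):
            conn.sendall(data[i:i + 1])
            time.sleep(STUTTER_DELAY)

At half a second per byte, even a small generated page keeps a connection open for minutes.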

When implementing this HTTP tar pit, we also took into consideration the ‘good’ spiders used by search engines. If they were trapped, their operators might even sue for compensation. Fortunately, the Robots Exclusion Standard offers a method of telling spiders not to analyse certain pages. Therefore, we set up a robots.txt – which almost all harvesters ignore – to protect good spiders. If a search engine spider were to ignore this information, we would not be liable for it becoming trapped.
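
The corresponding robots.txt is a two-line affair (the path is, of course, site-specific):

    User-agent: *
    Disallow: /tarpit/

Standard-conforming spiders will never request anything below /tarpit/; harvesters, which ignore the file, walk straight in.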

An argument often put forward against relying on robots.txt is that harvesters could easily start conforming to the standard too. But if they did, preventing them from collecting an email address would become as simple as adding the page containing the address to robots.txt’s list of pages not to visit.

Adding an SMTP tar pit

To be an attractive target for spammers, the tar pit should offer email addresses, since harvesters print every newly found address to their user interface. This serves as a kind of progress meter, and it would stop listing new addresses as soon as the harvester was mostly visiting links within the tar pit. The harvester’s human operator would notice its reduced effectiveness, start investigating, and ultimately discover that he has run into an HTTP tar pit – and blacklist its URL.

However, just printing out random email addresses from the tar pit is not a good solution, as this could lead either to spamming random genuine addresses or to bounce spam if the addresses are nonexistent and the sender address was forged. We therefore decided to set up an SMTP tar pit and list addresses that point to this tar pit. The SMTP tar pit adds another level of frustration to the spammer.

Like its HTTP counterpart, an SMTP tar pit delays communication between the spamming client and the tar pit server. With SMTP, creating endless loops of links is impossible, but by stuttering the server’s responses byte by byte and adding artificial overhead in the form of extra-long responses with lots of SMTP continuation lines, the slowdown is remarkable. On a regular connection, delivering a message usually takes a fraction of a second, whereas on an SMTP tar pit it might take up to an hour.
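
A fragment of what such a tar pit’s greeting might look like in Python – a hypothetical sketch reusing the stuttering idea, with padded ‘220-’ continuation lines:

    import time

    def send_slowly(conn, text, delay=0.5):
        """Stutter an SMTP response to the client one byte at a time."""
        for ch in text.encode("ascii"):
            conn.sendall(bytes([ch]))
            time.sleep(delay)

    def greet(conn):
        # Multi-line SMTP replies use '220-' on every line but the last;
        # padding each line adds artificial overhead to the dialogue.
        lines = ["220-%s\r\n" % ("x" * 60) for _ in range(20)]
        lines.append("220 tarpit.example.com ESMTP\r\n")
        send_slowly(conn, "".join(lines))

At half a second per byte, this greeting alone occupies the client for over ten minutes, before a single SMTP command has been exchanged.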

Supporters of SMTP tar pits therefore claim that they can block a spammer’s sending process and thereby protect the Internet from a spam run. This is not true – bulk mailers can connect to multiple servers at the same time and are tar pit-aware (i.e. they disconnect quickly if they recognize an SMTP tar pit) – but in our setup this did not matter, because we just needed an SMTP server to take care of the email addresses published by the HTTP tar pit.

Adding the SMTP tar pit to the HTTP tar pit increased the latter’s efficiency by several orders of magnitude: we found harvesters stuck looping in the tar pit for several weeks, making hundreds of thousands of visits during that time.

Identification of harvesters

Since most visits to a tar pit are made by harvesters, the tar pit also offers a simple method of identifying the IP addresses from which harvesting activity originates. This information can help protect other web pages: if the IP addresses used by harvesters are made available to other web servers, those servers can block the harvesters’ access.

To do this, we built another Apache module, this time an input filter that looks up the client’s IP in a database of known harvesting IPs, populated by our (by then distributed) network of combined HTTP and SMTP tar pits. If an IP is listed there, access to the protected page is forbidden and an error message is displayed. The harvester is thus prevented from collecting email addresses from the page, because it cannot access it.

Since humans might accidentally click into a tar pit, we decided to impose the website ban only for a certain amount of time, depending on the frequency of visits to the HTTP tar pit. We also chose to reassess the listing of an IP address after 24 hours, as we realized that a lot of the harvesting on our tar pit network was done from dynamically assigned IP addresses. Blocking those IPs for longer than absolutely necessary might annoy the user to whom the IP is assigned after it has been used for harvesting – even though, from an anti-spammer’s perspective, it would be helpful if the harvesting activity were to result in complaints to the provider.
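
The core of the check, combining the lookup with the 24-hour reassessment, might be sketched as follows (hypothetical Python; the real implementation is an Apache input filter and the database layout here is an assumption):

    import time

    BLOCK_TTL = 24 * 3600    # reassess a listing after 24 hours
    harvesters = {}          # ip -> time of last visit to a tar pit

    def is_blocked(client_ip):
        """Deny access if the IP has visited a tar pit within the last 24 hours."""
        last_seen = harvesters.get(client_ip)
        if last_seen is None:
            return False
        if time.time() - last_seen > BLOCK_TTL:
            del harvesters[client_ip]     # dynamic IPs get a clean slate
            return False
        return True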

Currently, however, this harvester identification does not offer 100% protection, because the harvester has to have visited a tar pit before visiting a protected page. Obviously, this is out of our control.

SMTP tar pit simulator

While researching SMTP tar pits, we found that spammers quickly disconnect once they realize the remote server is a tar pit. We decided to take advantage of this behaviour by setting up an SMTP tar pit simulator, first on a bridge and later as a patch for the widely used mail server sendmail.

Our tar pit simulator behaves like an SMTP tar pit for a certain number of bytes, i.e. it stutters the first 60–120 bytes to the client slowly and then opens the connection up to full speed. Our tests showed that approximately 80% of the connections spammers made to our servers were dropped during the stuttering period. We did not find a single ham sender that disconnected – meaning that, to our knowledge, the system does not generate any false positives.
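
A sketch of the simulator’s send path (hypothetical Python; the 60–120-byte window follows the description above, the per-byte delay is an assumption):

    import random
    import time

    def simulated_send(conn, data, delay=0.5):
        """Stutter only the first 60-120 bytes, then open up to full speed."""
        threshold = random.randint(60, 120)    # randomized, so harder to fingerprint
        for i in range(min(threshold, len(data))):
            conn.sendall(data[i:i + 1])
            time.sleep(delay)
        if len(data) > threshold:
            conn.sendall(data[threshold:])     # full speed from here on

A tar pit-aware bulk mailer gives up during the stuttered prefix; a patient, legitimate MTA sits out the first minute or so and then proceeds normally.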

Although it is not a perfect solution, it significantly reduces the workload of the spam filter on the mail server, allowing it either to perform more computing-intensive mail analysis or to run on cheaper hardware. Reducing spam by 80% would mean bringing spam levels back to those of 2001.

The advantage of the tar pit simulator is that spammers can only adapt to it by accepting a higher risk of being trapped in a real tar pit. It thus becomes an economic decision for them: disconnect quickly to avoid being caught in a real tar pit, or wait longer in case it is ‘only’ a simulator. The longer they wait, the greater their loss if they end up in a real tar pit.

Therefore, the more unpredictable the simulation time, the harder it is for spammers to adapt. We suggest that this time be randomized.

A prerequisite for the tar pit simulator to work is the existence of real SMTP tar pits on the Internet, even though these are not effective at fighting spam by themselves. This is another reason to combine the HTTP tar pit described above with an SMTP tar pit rather than a plain mail server.

Conclusion

To sum up, both SMTP and HTTP tar pits offer interesting ways of getting rid of spam. Our HTTP tar pit prevents the collection of email addresses from web pages and helps to identify harvesting IPs, which means that access to web pages can be blocked dynamically, protecting them from harvesters. Finally, a simulated SMTP tar pit can reduce the amount of spam a mail server has to deal with by 80%, providing significant relief to the local mail infrastructure.
