The challenges of collecting and monitoring URLs that point to malware


André D. Corrêa

Malware Patrol, Brazil
Editor: Helen Martin


Since 2005, the Malware Patrol Project has been cataloguing URLs used in phishing scams and distributing block lists for the most popular proxies and anti-spam systems. André D. Corrêa describes the challenges of collecting and monitoring malicious URLs.

The Malware Patrol project [1] is a fully automated and user-contributed system for collecting, analysing, blocking, alerting and monitoring URLs for the presence of malware. Since 2005 it has been cataloguing URLs used in phishing scams and distributing block lists for the most popular proxies and anti-spam systems. Every URL collected is visited daily to make sure the list is reliable and up to date.

From the ground up

In June 2005 a discussion took place on the GTS mailing list (a Brazilian security group) regarding URLs pointing to malware. List members began submitting addresses and encouraging network administrators to block their users’ access to them to prevent infections. It was no surprise that, a couple of days later, nobody knew which URLs were still active or what kind of malware they hosted. With this in mind, the Malware Patrol project was set up – initially operating under the name ‘Malware Block List’.

Since the beginning, the goal of the project has been to provide a central point for the collection, analysis and monitoring of URLs pointing to dangerous file extensions, and also to distribute, free of charge to non-commercial users, up-to-date lists of infected addresses.

A rudimentary system was developed initially just to merge URLs and create a few block lists. It quickly became necessary to visit the addresses to make sure their status hadn’t changed. Another important feature introduced during the first few weeks was the possibility of user contributions either via a web form or by forwarding suspect emails directly to a specially crafted mailbox.

Extracting URLs

With the modernization and proliferation of phishing scams, fraudsters keep finding new and interesting ways to obfuscate the URLs included in their messages. URL obfuscation is indispensable as a means to trick users into downloading and installing malware. We find a vast variety of techniques being used to hide URLs, but there are also still a lot of ‘newbie’ phishers sending simple HTML email messages with URLs hidden by ‘href’ tags. Common obfuscation methods include using ‘short URL’ services, domain names that look legitimate except for the addition of a couple of letters or numbers, long URLs that begin with credible names, use of the ‘@’ symbol (e.g. http://[email protected]) and hexadecimal URL encoding. More advanced phishers are using JavaScript to create URLs during onClick(), multiple HTTP redirects, browser/crawler detection, Content-Disposition pages and fake anti-virus/anti-malware sites.
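Two of the methods listed above, the ‘@’ symbol and hexadecimal encoding, can be unwrapped with ordinary URL parsing. The Python sketch below is illustrative only and is not the project’s extractor; the function name `real_host` and the example hosts are assumptions made for the illustration.

```python
from urllib.parse import urlsplit, unquote


def real_host(url):
    """Return (host, had_userinfo) for a URL after percent-decoding.

    Assumption: decoding the whole URL before parsing is acceptable here
    because only the host is of interest, not the exact path.
    """
    # Decode %xx sequences first so that an encoded host name is exposed.
    decoded = unquote(url)
    parts = urlsplit(decoded)
    # Anything before '@' in the authority is userinfo, not the host,
    # so 'http://credible-name@other-host/' really points at other-host.
    had_userinfo = "@" in parts.netloc
    return parts.hostname, had_userinfo
```

Calling `real_host("http://www.bank.example@evil.example/login")` reports `evil.example` as the host and flags the presence of userinfo, which is the part a hurried reader mistakes for the destination.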

It is easy to envisage the technical and social engineering aspects of phishing scams evolving a lot more, becoming even more sophisticated and dangerous for end-users.

The constant evolution of obfuscation methods means that the project’s URL extraction system must undergo constant development and improvement. One of the most recent threats that is forcing a profound change in the URL extraction system is the use of Content-Disposition [2], [3]. This feature is discussed in RFC 2183 (August 1997) and RFC 2616 (June 1999). Basically it is an extension of the MIME protocol that instructs the user agent on how to work with an attachment. Using Content-Disposition it is possible to point users to a non-dangerous URL (e.g. a PHP file) and from there send executable files or infected documents. It is an important security threat because proxies and content-filtering systems won’t block users’ access to PHP files, for instance, whereas they would deny access to EXE files.
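To make the problem concrete, the following Python sketch shows how a filter could look at the file name announced in a Content-Disposition header instead of the extension in the URL. It is a minimal illustration, not the project’s code: the function names are hypothetical, and the extension set is simply the list of commonly blocked extensions given later in this article.

```python
import os
from email.message import Message

# Extensions the article names as commonly blocked.
DANGEROUS = {".exe", ".pif", ".bat", ".scr", ".cmd", ".reg", ".com"}


def served_filename(content_disposition):
    """Extract the filename parameter from a Content-Disposition value."""
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    # get_filename() returns None when no filename parameter is present.
    return msg.get_filename()


def is_dangerous_download(url_path, content_disposition=None):
    """Judge a response by the served file name, falling back to the URL path."""
    name = served_filename(content_disposition) if content_disposition else None
    effective = name or url_path
    ext = os.path.splitext(effective)[1].lower()
    return ext in DANGEROUS
```

A filter that checks only the URL sees `/download.php` and lets it through; one that also reads the header sees `update.exe`.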

This is a complicated issue for the Malware Patrol project due to the fact that URLs are filtered based on file extension. With a limited amount of processing power and bandwidth available there is a need to concentrate on dangerous URLs. Processing non-dangerous extensions, like PHP or ASP files, dramatically increases the CPU and bandwidth requirements.

Yet another block list

Common sense says we already have enough DNSBLs and block lists out there, so why bother running another block list? Well, the first ‘Real-time Blackhole List’ was created by Paul Vixie in 1997 [4]. Since then, DNSBLs and URL lists have become more sophisticated and focused on specific aspects, but none of them concentrate on URLs that point to malware. More importantly, most URL lists become outdated very quickly. There is a need for an accurate list of addresses that point to malware and the only way to maintain such a list is to visit every URL daily to verify its status.

Lots of factors can change the status of a URL. One of the most common situations involves malware being hosted on free web accounts. Such free accounts have a limited bandwidth allowance, and when that is exceeded access to the account is denied. Therefore, if a binary receives lots of hits it can quickly become unavailable, but will be reachable again in a few hours or the next day when the free hosting provider puts it back online.

Another important driver of status change is the removal of malware by hosting providers. Sometimes domain or hosting administrators act when alerted about malicious binaries on their servers and remove them. One of the most criticized aspects of DNSBLs and URL lists is their slow response to removal requests. The Malware Patrol takes these requests seriously. Having a system that automatically validates all addresses in the database every day makes the removal process very fast and reliable.

Moreover, there are other good reasons to run a block list of malicious URLs including: current usage of Content-Disposition, distribution of binaries not yet classified by anti-virus solutions, alerting system administrators of malware hosted on their servers and cooperation with security groups.

Not so dangerous extensions

In addition to the common dangerous extensions everybody is blocking today (such as exe, pif, bat, scr, cmd, reg and com), there are the ‘not so dangerous extensions’. These cannot be blocked by default on proxies and content filters, yet they pose serious threats to users. Examples include pdf [5], swf and png [6] files, for which critical bugs were found recently. This is another strong motivation for using an up-to-date URL block list. These days, it is no longer possible to trust file extensions.

Detecting new malware

Approximately 20% of the samples collected by the project every day are pieces of malware that have not yet been classified by anti-virus vendors. This is an important threat for users that rely solely on anti-virus systems for protection. Currently, when a sample is not identified as malware by the anti-virus software we use, it goes through a scoring system. This system evaluates some important characteristics of the binary including: domain name and file name patterns, file extension, use of packers and cryptors, presence of strings, URLs or IRC commands, etc.
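A scoring system of this kind can be pictured as a weighted checklist. The Python sketch below is only an illustration of the idea: the weights, the threshold, the keyword list and all names are hypothetical and are not the values the project uses.

```python
import os
from dataclasses import dataclass

# Hypothetical weights and threshold, chosen only for illustration.
WEIGHTS = {"name": 2, "extension": 2, "packed": 3, "irc": 3, "urls": 1}
THRESHOLD = 6
SUSPICIOUS_WORDS = ("bank", "update", "invoice")
DANGEROUS_EXT = {".exe", ".pif", ".bat", ".scr", ".cmd", ".reg", ".com"}


@dataclass
class Sample:
    file_name: str
    packed: bool          # a packer or cryptor was detected
    irc_commands: bool    # IRC command strings were found in the binary
    embedded_urls: int    # number of URLs found in the binary


def score(sample):
    """Add up the weights of every characteristic the sample exhibits."""
    total = 0
    name = sample.file_name.lower()
    if any(word in name for word in SUSPICIOUS_WORDS):
        total += WEIGHTS["name"]
    if os.path.splitext(name)[1] in DANGEROUS_EXT:
        total += WEIGHTS["extension"]
    if sample.packed:
        total += WEIGHTS["packed"]
    if sample.irc_commands:
        total += WEIGHTS["irc"]
    if sample.embedded_urls > 0:
        total += WEIGHTS["urls"]
    return total


def needs_vendor_analysis(sample):
    """True when the score is high enough to forward the sample."""
    return score(sample) >= THRESHOLD
```

The advantage of a simple additive score is that new characteristics can be added without reworking the rest of the pipeline; the cost is that the weights have to be tuned by hand.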

When a binary has a high score, the sample is sent to partner anti-virus vendors for analysis and classification. The project has already identified a lot of new malware this way. Most of the malware found on phishing sites consists of trojans targeting financial institutions.

Sandbox systems are not used for active malware analysis due to the complexity involved in automating these systems and the large amount of hardware resources required.

Malware vs. PoC

Occasionally, some binaries and source code are crawled but their classification as malware is not obvious. It is important to differentiate malware targeting innocent users from software used for teaching or research purposes. In this category we can include: proofs of concept for vulnerabilities or bugs, hacktools, spamtools and even spyware. Although these can certainly be used for malicious purposes, they are not convenient for phishing scams. Proofs of concept are usually distributed in their source code format, needing compilation to run; hacktools and spamtools are attacking tools but have limited value when installed in a victim’s machine; and finally spyware may not be considered malware in the strict sense of the term.

Special attention must be paid to the rogue anti-virus, anti-spyware, anti-malware and scareware [7] that are proliferating on the Internet. Users are easily convinced to download and use them. Some of the current scams direct users to fake sites that appear to run an anti-virus scanner on the victim’s machine. The scanner reports numerous (non-existent) viruses or trojans on the system and claims that they can only be removed if the victim pays for the (rogue) software licence.

Alerting and removing malware from the net

When malware is found, it is important to notify system and network administrators and urge them to remove it from the Internet as soon as possible and to investigate the related criminal activity. This is done by the alert system that sends email messages to domain and ASN administrators and to the CSIRT responsible for the top level domain name of the domain hosting the malware.

It is disappointing that fewer than 15% of the alerts elicit a response. Although there are many Internet documents and RFCs [8] specifying the email addresses to be used by domain administrators, the majority of those addresses bounce, leaving no way of contacting the administrator.

Administrators should pay more attention and domain registrars could play an important role in educating and enforcing the existence of the abuse mailbox. This is necessary so that alerts and important information can be sent to administrators. Those mailboxes also need to be checked frequently – most of them bounce because they are full.

Many huge service providers take an annoying approach to abuse mailboxes. They simply respond with a default message pointing to a web form that must be filled in to file a complaint. Although it is understandable that enormous volumes of messages (mostly spam) are sent to those addresses, this approach doesn’t work when security groups are trying to cooperate and it is not permitted by any RFC.

After a few weeks of running the malware alert system an unexpected side effect was noticed: the alert mailbox started receiving spam and phishing scams.

Cooperating with the security community

The cooperation and exchange of information with security groups and professionals is essential for any project that captures and redistributes data. The Malware Patrol is open to establishing working relationships with any serious security group that can send and/or wants to receive information regarding malware and phishing scams. Some of the groups that already cooperate with the project include: CAIS - RNP [9], Team Cymru [10], Web of Trust [11], SURBL [12], the now ceased CastleCops project, as well as more than 10,000 personal contributors.

Although the project has many spam traps established, contributions from users and security professionals are also very important to capture the most recent phishing scams. There are two contribution channels available: a web form that can be filled in with suspect URLs, and an email address to which suspect messages can be forwarded. URLs are extracted automatically from the forwarded messages and put in a queue for later analysis. The exchange of information with security groups is mostly done via email.

Other ways of acquiring URLs, either already in use or under development, include: monitoring mailing lists and newsgroups related to information security, botnets and hacking; web crawling; and IRC monitoring bots.

Keep crawling

New URLs go to a queue that is visited by an automated crawler every hour. The crawler visits the addresses and grabs binaries, if available. If no binary is found, it verifies the HTTP status code and HTML output generated by the web server. The crawler can follow many levels of redirection via HTTP and HTML and can also handle most of the Content-Disposition scenarios.

Every URL in the database has a status that varies from infected to not available, not found, invalid Content-Type and others. Those addresses are visited every day to check that their status hasn’t changed. The only way to keep the block lists accurate is to frequently visit all addresses. There is also a need to impersonate different browsers because some fraudsters use browser detection to prevent crawlers from grabbing their binaries.
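The daily check can be thought of as mapping each fetch result to one of these statuses. The Python sketch below illustrates that mapping under stated assumptions: the status names come from the text, but the rules that assign them, the ‘needs scan’ label and the user-agent strings are illustrative guesses and not the project’s logic.

```python
from typing import Optional

# Example browser identities to rotate through; the strings are illustrative.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
]


def user_agent_for(attempt: int) -> str:
    """Cycle through the browser identities on successive attempts."""
    return USER_AGENTS[attempt % len(USER_AGENTS)]


def classify(status_code: Optional[int], content_type: str) -> str:
    """Map the final response (after redirects) to a URL status.

    status_code is None when the connection failed or timed out.
    """
    if status_code is None:
        return "not available"
    if status_code == 404:
        return "not found"
    if status_code >= 400:
        return "not available"
    # Assumption: a page of text where a binary was expected counts as
    # an invalid Content-Type.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/html", "text/plain"):
        return "invalid Content-Type"
    # A binary was retrieved; only scanning can decide whether it is infected.
    return "needs scan"
```

Keeping the classification separate from the fetching code makes it straightforward to repeat a fetch with a different browser identity when the first attempt returns a page instead of a binary.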

Block lists for everyone

For end-users, the most important information produced by the project is the block lists. Today we have 29 different formats available for non-commercial use. They include popular open-source software like: Squid, SpamAssassin, Postfix, Firekeeper, DansGuardian and even ClamAV. There are two types of list: the ‘regular’ list, which includes protocol, host name, domain name and directories; and the ‘aggressive’ list that just includes protocol, host name and domain name, therefore blocking access to the entire infected domains.
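The difference between the two list types is easiest to see by deriving both entries from the same URL. The Python sketch below is a minimal illustration; the function names and the example URL are hypothetical, and the output is a bare URL prefix, not any of the project’s actual list formats.

```python
import posixpath
from urllib.parse import urlsplit


def regular_entry(url):
    """Protocol, host and directories, with the file name stripped."""
    parts = urlsplit(url)
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return f"{parts.scheme}://{parts.hostname}{directory}"


def aggressive_entry(url):
    """Protocol and host only, which blocks the entire site."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.hostname}/"
```

For `http://files.evil.example/a/b/setup.exe`, the regular entry is `http://files.evil.example/a/b/` and the aggressive entry is `http://files.evil.example/`. Neither reveals the file name. Note that this sketch drops any port number, which a real list would need to decide how to handle.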

File names and extensions of malware are never disclosed to end-users. The project also refuses to exchange malware samples without a really good reason to do so.

There are 60 minutes in every hour

Most list downloads are made by automated scheduled jobs and it is easy to figure out that administrators like the minute zero of every hour to run those jobs. This causes a tremendous overhead on web servers during the three or four minutes before and after the turn of the hour. It is strongly advisable to set schedules to run at other times when the system is running lower on CPU usage. This also helps to prevent download timeouts that can lead to broken or incomplete block lists.
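One simple way to spread the load is for each client to derive its download minute from something unique to it, such as its host name. The Python sketch below shows the idea; the function name is hypothetical, and the four-minute margin either side of the hour is taken from the observation above.

```python
import hashlib


def download_minute(hostname, margin=4):
    """Pick a stable minute of the hour that avoids the turn of the hour.

    With margin=4 the result lies between minute 5 and minute 55 inclusive,
    and the same host name always yields the same minute.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    first = margin + 1            # first allowed minute
    last = 59 - margin            # last allowed minute
    span = last - first + 1       # number of allowed minutes
    return first + int.from_bytes(digest[:4], "big") % span
```

The returned value can then be used as the minute field of the scheduled job, so that each installation settles on its own slot and keeps it between runs.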

Servers and software

Excluding the anti-virus scanners used to identify and categorize malware, all other software employed by the project is open source. Servers are hosted in three distinct data centres and run Linux Slackware [13] or FreeBSD [14]. MySQL is used for the database. The crawlers are multi-threaded and written in Perl and C. Apache and lighttpd are used as web servers, and Exim handles SMTP and runs the URL extraction system.

What’s next?

Malware Patrol is a not-for-profit project that runs on volunteer efforts, user contributions and donations. With limited hardware, software and financial resources, there is a need to concentrate efforts on service availability and data integrity. Meanwhile, the improvements queue is always growing. Important developments scheduled for the coming months include: installing new hardware to support the growing number of users, integration with browser plug-ins, creation of a DNSRBL, improvements to crawlers and URL extractors, and creation of thumbnails of the emails and sites used in phishing scams.

New list users, contributors and security groups are always welcome.

Also, users and companies that find the service useful are encouraged to make a donation to help keep the project going.


For the last four years the Malware Patrol project has been collecting, monitoring and distributing lists of URLs that point to malware. The project provides an additional security tool for system and network administrators in the daily task of keeping users safe. End-users and the security community are always responsive to reliable tools and information sources.

The phishing scam landscape is constantly evolving and although anti-virus solutions and user education are important ways to help prevent infections and the losses caused by malware, it is also necessary to closely monitor the evolution of these threats.


[1] Malware Patrol.

[2] The Content-Disposition Header Field.

[3] Hypertext Transfer Protocol - HTTP/1.1.

[4] DNSBL - Wikipedia.

[5] Adobe - Security bulletins and advisories.

[6] Multiple vulnerabilities in libpng. TA04-217A.

[7] Wikipedia - Scareware.

[8] Mailbox names for common services, roles and functions.

[9] CAIS - RNP.

[10] Team Cymru.

[11] Web of Trust.

[13] Linux Slackware.


