Michael Venable and colleagues explain how program-matching techniques can help in triage, in-depth malware analysis and signature generation.
Copyright © 2007 Virus Bulletin
The number of variants in malware families appears to be on the rise and is turning into a veritable flood. New defences must be found to detect these variants and curtail the flood.
We propose that program comparison techniques can be an effective shield by assisting in triage, in-depth malware analysis and signature generation. The Vilo program search portal is used as an example to illustrate the usefulness of approximated program-matching and the extraction of commonalities and differences from malware variants.
A recent trend in malware production is to generate large numbers of variants at increasingly rapid rates. It is now not uncommon to see thousands of versions of a malicious program released in a short space of time, with each version differing in only minor ways. For instance, consider the recent case of the ‘Storm Worm’. CommTouch Software reported that over 54,000 variants had been released in under two weeks.
The seemingly endless stream of variants places increased strain on anti-virus researchers, who seek to ensure their products are able to recognize each of the variants.
Something needs to be done to counteract the flood of malware variants. But what?
We argue that program comparison tools are a useful long-term defence against variants – in particular, tools that can determine program similarity, search for matches in a database, and describe commonalities and differences. These tools can be used to organize triage processes and leverage organizational knowledge.
Further, tools that analyse differences and commonalities lie at the heart of assisting in-depth variant analysis and family-aware signature generation. The argument is illustrated using Vilo, a set of tools for searching and comparing program variants.
Vilo has been shown to be effective at partitioning malware repositories and can perform searches quickly enough for interactive querying.
This paper introduces Vilo and its capabilities. For a more thorough introduction, we invite the reader to visit the Vilo website.
Variant flood attacks are illustrative of how malware authors are increasingly shifting their focus from targeting vulnerabilities in everyday products to targeting vulnerabilities in anti-virus systems. In the case of the variant battle, the weakness is in the defence infrastructure that relies heavily on signatures.
Signatures are frequently reactive – they tend to be effective in defending only after the initial specific attack has been made. In particular, they often fail to detect new versions of malicious programs that have been altered just enough so that the existing signatures no longer match. Moreover, signature generation is a time-consuming, but necessary task that requires expertise.
This combination of properties leaves the infrastructure vulnerable to attacks in the form of a rapid influx of variants.
In one form of the attack, all that is required is to produce signature-defeating variants faster than the signatures can be constructed and distributed. So long as it is relatively easy to crank out a new modification that evades the signatures, it will remain an effective attack. In another form of the attack, a rapid flooding of a large number of variants increases the difficulty of matching all variants whilst simultaneously creating a denial-of-service attack on the limited resources of anti-virus analysts.
Malware authors have recognized these opportunities and are creating variations on a massive scale. The headlines that the ‘Storm Worm’ trojan has made are unsurprising when one reads the trends forewarned in anti-virus industry reports. For example, according to Microsoft’s ‘Security Intelligence Report’, Microsoft found 97,924 variants of malware within the first half of 2006. According to Symantec and Microsoft, typically only a few hundred families appear in any half-year period. This places the number of variants in an average family in the thousands per half-year period.
The Microsoft data shows that the top seven families account for more than 50 per cent of all variants found. The top 25 families account for over 75 per cent. Thus it is a solid bet that any new malicious program found in the wild is a variation of some previous program. The lion’s share of the work in handling the flood of new programs would be done if one could automatically recognize even just these top 25 families.
Numerous methods can be employed to construct such variants, including packing, manually altering and rebuilding malware, using automated malware generation tools, and automated code modification, such as that found in metamorphic malware. In short, the effort needed to create variants different enough to cause havoc is low relative to the number of problems they create. Some means must be found to counteract the variation attacks, but what can be done?
Behaviour-based heuristics have become an important means of detecting previously unseen variations. Typically, detection in this fashion involves running the potentially malicious program in some type of virtual environment, such as a sandbox. While it executes, the behaviour of the program is monitored until its observed or inferred behaviour sufficiently matches a known proscribed behaviour. Thanks to increasingly powerful machines, this approach is becoming feasible in a growing number of circumstances. Still, behaviour-based detection methods can be expensive, and there are many ways in which malware authors can defeat the sandboxing or emulation. Other techniques are therefore still needed.
When the variations are constructed using automated methods of mutating the code, the properties of the mutating engine itself may form an entry point for a counter-attack. It may be feasible, for example, to normalize programs before trying to match them, i.e. to remove the variations caused by the mutation engines. Several research groups have worked on this approach, including ourselves. We have shown that it is possible in some circumstances to produce a ‘perfect’ normalizer for metamorphic engines. It may also be feasible to detect the use of the mutation engine itself by observing properties of the generated code. This approach works much like matching a piece of literary work to its author by observing writing style.
Apart from such mutation engine counter-attacks, the prime counter-attack is most likely to be found in the nature of the variants themselves: similarity.
One possible approach is to capitalize on the high overlap of code between program variants and draw out the similarities and differences in the actual programs themselves. By creating a ‘similarity score’ between two programs, one could quickly deduce the behaviour of a new sample. Searches can be performed on new samples and anything matching sufficiently closely to a known malicious program can be labelled as malicious.
In addition, knowing the similarities and differences between two files can help steer manual analysis in the right direction. For example, differences can pinpoint new functionality that may need to be analysed further, while similarities identify areas that may previously have been analysed, promoting reuse of organizational knowledge.
To explore this possibility, we have created a demonstration portal called Vilo that performs searches on whole binaries and provides tools to assist analysts in extracting the similarities and differences between files. We argue that program comparison techniques such as these will be important shields in the defence against the variant flood attacks. The general approach is described below and applications to anti-virus analysis are outlined.
The core part of Vilo is a search component. It receives search requests in the form of binary programs. In response, it delivers a list of programs found to be similar to the query, paired with computed similarity scores. A web-based search portal exists to serve as a human-friendly interface to Vilo.
Via the portal, users can upload whole binary programs and receive a listing of related programs in order of similarity (as illustrated in Figure 1). For each matched file, users can ‘drill down’ to view additional information, such as the embedded strings and assembly listing, and compare it against the uploaded file. With Vilo, analysts can map malware relationships, find commonalities and dissimilarities between programs, and view ‘hot spots’ in the assembly listings that the two programs have in common.
The search method used is an adaptation of text retrieval matching using the so-called tf × idf term-vector query matching methods. These have been used for matching text documents to queries and for the related task of detecting duplicate documents. The search method has been designed so that it is insensitive to changes such as instruction reordering; does not allow common code sequences, such as function prologues or library code, to skew the results; and prefers simple analyses, such as disassembly, over complicated ones, to reduce the likelihood of an unsuccessful analysis.
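To illustrate the general idea (this is our own sketch of tf × idf term-vector matching, not Vilo’s patent-pending algorithm), the following code weights opcode n-grams by tf × idf and compares programs with cosine similarity; the choice of n-grams as features is an assumption for the example:

```python
import math
from collections import Counter

def ngrams(opcodes, n=3):
    """Sliding windows of n consecutive opcodes. Reordering distant
    instructions leaves most windows intact, keeping matching robust."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def tfidf_vectors(docs):
    """Weight each n-gram by tf x idf across the corpus. idf down-weights
    features shared by many programs (e.g. function prologues)."""
    N = len(docs)
    tfs = [Counter(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())          # document frequency of each n-gram
    return [{t: c * math.log(N / df[t]) for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the idf term drives the weight of ubiquitous n-grams towards zero, boilerplate code contributes little to the score, while n-grams shared by only a few samples dominate the comparison.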
Figure 2 illustrates Vilo’s likely place in the analysis pipeline. Vilo has access to the collection of known malicious files and is able to integrate into the existing queue management infrastructure. There, it is available to service requests to support triage, analysis and the generation of new malware signatures.
Next, we will look in more detail at how Vilo benefits each of these three areas.
Anti-virus companies receive new malware samples through a wide variety of sources. It is common to have more sample submissions than people available to analyse them, resulting in a queue into which incoming samples are placed while awaiting analysis.
For efficiency, it is necessary to remove known malicious and benign samples from this queue. This is commonly done by feeding the samples to various anti-virus scanners and removing any files that are identified as malicious. The rest of the samples are submitted to analysts for further analysis.
Unfortunately, many variants are not identified as malicious by the scanners. The unidentified variants must be submitted for further analysis, even though near-identical samples may previously have been analysed. This redundant work can be eliminated by catching the variants before they go to the analysts.
Vilo can assist in this area by filtering files that match closely any known malicious files. Using Vilo’s similarity score, it is a simple task to find and remove variants from the queue of incoming files. The web interface in Figure 1 illustrates how the search can help in triage. Here, the results suggest that this sample is likely to be a variant of Bagle.R.
Continuing this example, all organizational documentation on Bagle.R could be delivered along with the new sample to the analyst, thus promoting knowledge reuse and reducing the amount of rediscovery needed on the part of the analyst.
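Such a triage filter can be sketched in a few lines. Here Jaccard overlap of extracted feature sets stands in for Vilo’s similarity score, and the threshold value is an illustrative assumption:

```python
def jaccard(a, b):
    """Set-overlap similarity over extracted features (e.g. opcode
    n-grams); a simple stand-in for Vilo's similarity score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def triage(sample_features, known_samples, threshold=0.8):
    """Return the family label of the closest known sample if its score
    clears the threshold; otherwise None, i.e. queue it for an analyst."""
    best_label, best_score = None, 0.0
    for label, features in known_samples.items():
        score = jaccard(sample_features, features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

Samples that clear the threshold can be labelled and removed from the queue immediately; everything else falls through to manual analysis, so the filter can only reduce, never add to, the analysts’ workload.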
During analysis of a new sample, it can be helpful to have similar files at hand (particularly if past analysis results can be retrieved as well) and to know what makes the files alike as well as how they differ. This information can be used to guide the analyst and to decrease workload. Knowing how two files differ helps the analyst quickly identify new functionality that has been introduced in newly released variants. In addition to running the malware in a virtual machine in the hopes of learning its behaviour, the analyst can find the exact location of new code and can then use that to determine what step to take next in the analysis process.
Similarities identified by Vilo can provide valuable insight into the behaviour of a new sample before any detailed analysis begins. Knowing that a sample is 90 per cent identical to another sample is a good indication that they share much of their functionality; moreover, instructions in the sample can be matched against a database of code segments to reveal specific functionality. For example, if a code segment within the sample under inspection matches a segment in the database that is known to be a backdoor, then it can be concluded that the sample also features a backdoor, without the need to launch a virtual machine. Vilo’s code comparison tools make this possible.
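A toy sketch of this segment-lookup idea, assuming a hypothetical database mapping behaviour labels to known code segments (exact tuple equality over already-normalized instructions stands in for Vilo’s approximate matching):

```python
def infer_behaviours(sample_segments, labelled_db):
    """Report each behaviour whose known code segment occurs verbatim in
    the sample. Segments are tuples of (already normalized) instructions."""
    present = set(sample_segments)
    return {name for name, segment in labelled_db.items() if segment in present}
```

A match against a labelled segment lets the analyst attribute behaviour to the sample statically, deferring dynamic analysis to the genuinely novel parts.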
Vilo allows the user to view a side-by-side comparison of the assembly listing for the uploaded and matched files. Included in this view is a colour-coded overview bar making it easy to spot commonalities between the two assembly listings quickly. A section of the bar with bright red colouring indicates that the corresponding part of the file contains a high number of matches, whereas dark blue indicates very few or no matches. The user can click the overview bar to go to the corresponding position in the file, making it a snap to zoom in and find code similarities between two files.
Figure 3 shows a comparison of two variants of the Klez worm. The degree of commonality between the two files can be seen at a glance from the overview bar near the top of the window.
The lines of code are also colour-coded and can be clicked to have the program find the corresponding matching code in the other file. In the figure, we’ve done exactly this to find a piece of code shared by both files. The selected portion is shown as blue text. Notice the matched lines are not identical (jump targets are different). Vilo’s approximate search is not affected by such simple differences.
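This insensitivity to differing jump targets can be approximated by masking immediates before comparing lines. The following sketch is illustrative only; the regex and the `ADDR` placeholder are our assumptions, not Vilo’s actual normalization:

```python
import re

HEX_ADDR = re.compile(r'0x[0-9A-Fa-f]+')

def normalize(line):
    """Mask hexadecimal immediates so two lines that differ only in an
    address or jump target compare as equal."""
    return HEX_ADDR.sub('ADDR', line.strip())

def shared_lines(listing_a, listing_b):
    """Normalized instruction lines appearing in both assembly listings."""
    return set(map(normalize, listing_a)) & set(map(normalize, listing_b))
```

With addresses masked, relocated or lightly patched code still lines up, which is exactly the behaviour the Klez comparison in Figure 3 exhibits.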
Though not shown, users can also view a similar comparison of embedded strings as well as PE (Portable Executable) file information.
Current static signature generation typically involves extracting a byte sequence from the sample that is common among variants while distinct enough to limit false positives.
When done manually, this is a very time-consuming activity requiring a good understanding of the malware on the part of the analyst. Vilo’s search makes it possible to find all common variants easily, and its binary comparison algorithm provides the functionality needed to isolate similarities and differences, making it possible to create signatures that are relevant to all or most of the members within a family.
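The isolation of a family-wide byte sequence can be sketched naively as a search for the longest substring present in every sample. This brute-force version is for illustration only; production signature generation is far more selective:

```python
def common_signature(samples, min_len=8):
    """Longest byte sequence present in every sample, found by brute force
    over substrings of the shortest sample. A real pipeline would also
    screen the candidate against clean files to limit false positives."""
    base = min(samples, key=len)
    for length in range(len(base), min_len - 1, -1):
        for start in range(len(base) - length + 1):
            candidate = base[start:start + length]
            if all(candidate in sample for sample in samples):
                return candidate
    return None
```

Searching from the longest length downwards returns the most specific shared sequence first; the `min_len` floor discards fragments too short to be distinctive.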
For years, anti-virus products have relied on the presence of static signatures within malicious software as a means of malware identification. In many cases, it is possible to identify several variants of a known malicious program using only a single signature. However, as the number of variants increases, so do the number of signatures required and the time analysts must spend inspecting the variants.
Malware authors have realized this and have begun creating variants on a grand scale, reaching into the tens of thousands and easily overwhelming the current infrastructure.
Vilo offers a unique search algorithm suitable for finding variants of known malicious programs, making it applicable in the areas of triage, manual analysis and signature generation. Vilo can operate in the anti-virus back-end as a filter for incoming malware samples: already-analysed samples could be culled from incoming queues and related programs grouped together to improve the efficiency of the analysts.
Vilo includes a web-based user interface that, when given a program, presents the user with a ranked ordering of related programs, making it possible to map out malware relationships. Vilo also provides tools to isolate the differences between two files. This information guides the analyst by highlighting new functionality to be analysed and reduces the amount of time needed to analyse a file. In addition, Vilo can assist in signature generation by identifying pieces of code that are similar among a group of files.
Malware authors have attacked a weak spot in the anti-virus industry, but the high degree of similarity between variants can prove to be a weakness in its own right. Vilo’s patent-pending search algorithm is well-suited for detecting the types of variations typically found in malware, making it a good defence against the incoming flood – a shield in the variation battle.
Vilo website: http://vilo.cacs.louisiana.edu/.
Walenstein, A. et al. Normalizing metamorphic malware using term rewriting. Proceedings of the 6th IEEE Workshop on Source Code Analysis and Manipulation, 2006.