The ideal tools of an ideal virus lab


Jozsef Matrai

VirusBuster Ltd, Hungary
Editor: Helen Martin


Each profession has its own set of tools, and whenever there is an improvement in those tools, the work of that profession becomes more efficient. Every company in the anti-virus industry has its own confidential technology for studying malicious and potentially malicious code. However, creating all the necessary tools for malware analysis in-house is not always economical, particularly for small companies. This article is aimed at anyone who is a potential user or creator of malware analysis tools.


Running, debugging and watching

Malicious code tends to involve a lot more computational effort than non-malicious code. For example, non-malicious code might simply say: 'I want to URLDownloadToFileA to the local file'. However, to avoid analysis, malicious code might scan through memory for the 'M', 'Z' signature of loaded DLL images, find the entry point of URLDownloadToFileA using only a checksum computed from the characters of the function name, and compute the URL string using a long formula.
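From the analyst's side, the checksum trick described above can be undone with a reverse lookup table. The following is a minimal sketch only: the rotate-and-add checksum and the short export list are assumptions made for illustration, since the checksum a given sample uses has to be recovered from its code.

```python
def name_checksum(name: str) -> int:
    """Hypothetical 32-bit checksum: rotate left by 5 bits, then add each byte."""
    h = 0
    for byte in name.encode("ascii"):
        h = ((h << 5) | (h >> 27)) & 0xFFFFFFFF
        h = (h + byte) & 0xFFFFFFFF
    return h


def build_reverse_table(export_names):
    """Map each checksum to the export names that produce it."""
    table = {}
    for name in export_names:
        table.setdefault(name_checksum(name), []).append(name)
    return table


# Illustrative export list; a real table would cover every export of every DLL.
EXPORTS = ["URLDownloadToFileA", "GetTickCount", "WriteFile", "GetLastError"]
TABLE = build_reverse_table(EXPORTS)


def resolve(checksum: int):
    """Return the candidate API names for a checksum constant found in a sample."""
    return TABLE.get(checksum, [])
```

A checksum constant found in the sample is then looked up with resolve(); more than one candidate name means the checksum collides and the call needs a closer look.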

As a result of this complexity, we need a tool for running, debugging and watching the resource usage of malicious code.

A virtual world

What is the problem with current commercial products? Commercial debuggers are, in general, hardware-level debuggers. On Intel x86-compatible machines they overwrite instructions with the INT3 opcode 0xCC to interrupt a running program, set the trap flag (TF) of the EFLAGS register to single-step, or program the DRx debug registers to set hardware breakpoints. All of these changes are visible to the code being debugged, which means that malicious code may behave differently under debug control than in reality.
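To see why a 0xCC breakpoint is visible to the code being debugged, consider a routine that checksums its own bytes. The sketch below is an illustration under assumptions: the seven-byte routine and the use of CRC32 are made up for the example and do not come from any particular sample.

```python
import zlib

INT3 = 0xCC  # x86 software breakpoint opcode


def set_software_breakpoint(code: bytes, offset: int) -> bytes:
    """Return a copy of the code with one byte overwritten by INT3, as a debugger would."""
    patched = bytearray(code)
    patched[offset] = INT3
    return bytes(patched)


def integrity_check(code: bytes, expected_crc: int) -> bool:
    """Self-check of the kind a protected program might perform on its own code."""
    return zlib.crc32(code) == expected_crc


# push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret
ROUTINE = bytes.fromhex("558bec33c05dc3")
EXPECTED_CRC = zlib.crc32(ROUTINE)
```

The untouched routine passes integrity_check(), while the copy returned by set_software_breakpoint() fails it. An emulator can stop at the same address without modifying a single byte of the code.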

A much better solution is to use a virtual environment, such as an emulator, which makes it a lot harder for the malicious code to determine whether it is running on a real machine or under a debugger.

Consider the following scenario: imagine a piece of Trojan code that tries to determine whether it is connected to the Internet or running within an emulated network. The code checks for the existence of two files: ' MUST_EXIST_FILE' and ' {RANDOMSTRING}_MUST_NOT_EXIST_FILE' (where {RANDOMSTRING} represents a random alphabetical string). When the Trojan code discovers that it is running within an emulated network, it stops working completely, even if the executable is restarted.

What happens when you attempt to study this code on a real machine?

  • At first, you see two queries in the Bind log: '' and ''. You reconfigure Bind so that queries to the two sites are redirected to a local server and restore the client machine by overwriting the whole hard disk from the original image.

  • On your second attempt you see two queries in the Apache log of your local machine: ' MUST_EXIST_FILE' and ' DFSDFDS_MUST_NOT_EXIST_FILE'. You reconfigure the web server to host these two pages and restore the client machine again.

  • After many failed attempts you find yourself reconfiguring the web server, making sure that the file '' exists and the file '' does not exist. You keep waiting in the hope of finding out whether this Trojan does anything other than making web server queries.

If you had been able to restart the program so that the GetTickCount() calls returned the same values each time, the random number generator would produce the same output: the download queries are always '' and ' WERWER_MUST_NOT_EXIST_FILE'. A human operator might not be able to do this, but a virtual user can. In this case, you only have four combinations to try: the two files multiplied by the two states (exists or does not exist).

In the virtual world you must have the ability to step into the same river twice or more. If you can feed all GetTickCount() values to a polymorphic engine, it will produce all possible outputs.
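The idea of replaying GetTickCount() can be sketched in a few lines. Everything here is an assumption made for illustration: ReplayClock stands in for the emulator's timer, and the linear congruential generator stands in for whatever generator the Trojan really uses.

```python
import itertools
import string


class ReplayClock:
    """Feeds a fixed, repeatable sequence of tick values to the emulated program."""

    def __init__(self, ticks):
        self._ticks = list(ticks)
        self._position = 0

    def get_tick_count(self) -> int:
        value = self._ticks[self._position % len(self._ticks)]
        self._position += 1
        return value


def random_prefix(clock: ReplayClock, length: int = 6) -> str:
    """Hypothetical tick-seeded generator for the {RANDOMSTRING} part of the name."""
    state = clock.get_tick_count() & 0xFFFFFFFF
    letters = []
    for _ in range(length):
        state = (state * 1103515245 + 12345) & 0x7FFFFFFF
        letters.append(string.ascii_uppercase[(state >> 16) % 26])
    return "".join(letters)


def server_configurations(files):
    """Every exists / does-not-exist combination for the given files."""
    return [dict(zip(files, states))
            for states in itertools.product([True, False], repeat=len(files))]
```

With the same tick values replayed, random_prefix() returns the same string on every run, and server_configurations() lists the four cases a virtual user would have to try for the two files.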

Virtual machines may be controlled both by real users and by virtual users. The virtual user may be a script such as the following that allows UNEXEPACK to run only while it is not in our own process or while it is in UNEXEPACK's own code:

while (Computer[0].Motherboard.CPU[0].CR3 != OUR_PROCESS_CR3
|| Computer[0].Motherboard.CPU[0].EIP >= 0x00600000U ) {
	/* loop body not shown in the original: let the emulated CPU execute here */
}

Skipping the emulation of OS functions can also help speed things up. Consider a Win32 executable loaded at 0x00400000...0x00403FFF. It uses DLL images loaded at 0x60000000...0xBFFFFFFF. The virtual user (script) can recognize Win32 function calls easily, depending on which address range the instruction pointer register, EIP, falls into. Instead of having to run all necessary Windows processes you can run only one process, and instead of stepping through Windows functions, you can use faster replacement code.
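As a rough sketch of this dispatch, the virtual user only has to look at the address range the EIP falls into. The two ranges are taken from the text. The entry address, the stub and the register dictionary are hypothetical, and stack clean-up is left out.

```python
IMAGE_RANGE = range(0x00400000, 0x00404000)  # the executable, as in the text
DLL_RANGE = range(0x60000000, 0xC0000000)    # the DLL images, as in the text


def classify_eip(eip: int) -> str:
    """Decide which part of the address space the instruction pointer is in."""
    if eip in IMAGE_RANGE:
        return "program"
    if eip in DLL_RANGE:
        return "win32"
    return "unknown"


# Hypothetical table: DLL entry address -> (name, fast replacement for the function).
REPLACEMENTS = {
    0x7C80929C: ("GetTickCount", lambda regs: 123456),
}


def virtual_user_step(regs: dict, return_address: int) -> bool:
    """Short-circuit a recognized Win32 call; return True if one was replaced."""
    if classify_eip(regs["eip"]) != "win32":
        return False
    entry = REPLACEMENTS.get(regs["eip"])
    if entry is None:
        return False
    _name, stub = entry
    regs["eax"] = stub(regs)       # result goes where the caller expects it
    regs["eip"] = return_address   # resume in the program, skipping the DLL code
    return True
```

When virtual_user_step() returns False, the emulator simply carries on executing instructions as usual.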

Recent PE EXE packers/protectors can cause a headache when replacing Win32 calls. Consider in the unpacked code:

FF 15 xx xx xx xx CALL [ADVAPI32DLL_RegQueryValueExA]

The packer will overwrite it with:

E8 xx xx xx xx CALL equivalent

The first time I came across such a substitution, I thought it would be very easy to handle: all I had to do was to compare the CALLed byte sequence with all byte sequences at DLL export entries. Unfortunately, the replaced function entry code can be obfuscated. There are many Win32 exported entries that look like this:

	etc ...

It is very common for routines to begin with a JMP to a common address. When I ran an experiment to find the equivalent Win32 address of a replaced call, I ran the replaced call for cases where the EIP lay inside the user address space and stored the EIP and the general registers. I then ran each Win32 exported entry, to a maximum depth of 1,000 instructions, for as long as the EIP and general registers differed from the stored values. In approximately 80 per cent of cases the answer was exactly one exported entry. In the remaining cases the answer was two or more entries, or no entry at all.
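The matching step of that experiment can be modelled in miniature. In the sketch below routines are plain Python functions acting on a register dictionary, the export names are made up, and the 1,000-instruction limit is left out, so it shows only the comparison logic and is not a working emulator.

```python
def set_eax(value):
    """Build a toy routine that leaves a fixed value in EAX."""
    def routine(regs):
        regs["eax"] = value
        return regs
    return routine


def obfuscated_call(regs):
    """Toy stand-in for a packer-replaced call: same effect as ExportA, different code."""
    regs["eax"] = 0
    regs["eax"] += 1
    return regs


# Hypothetical export table; ExportB and ExportC are deliberately indistinguishable.
EXPORTS = {"ExportA": set_eax(1), "ExportB": set_eax(2), "ExportC": set_eax(2)}


def find_equivalent_exports(replaced_call, exports, start_regs):
    """Return the exports that leave the registers in the same state as the replaced call."""
    target = replaced_call(dict(start_regs))
    return [name for name, routine in exports.items()
            if routine(dict(start_regs)) == target]
```

Starting from identical registers, obfuscated_call matches exactly one export. A replaced call that leaves 2 in EAX matches two exports, and one that leaves a value that no export produces matches none, which mirrors the three outcomes reported above.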


When performing malware analysis we need a good resource-monitoring system. Generally we want to watch the disk I/O at file read and write level, but sometimes we want to monitor at the sector read/write level. We would also like to view the network traffic as TCP/IP packets (for email, FTP and HTTP) and as Ethernet packets. This is another headache. Making a hardware emulator is far cheaper than emulating an operating system. The hardware is completely documented, and the scope is much smaller than an OS.

In a hardware emulator, Ethernet traffic, commands sent to the disk, etc. can be observed, but usually we do not want to do this - we want to monitor file I/O, email I/O, etc. Observation of these things is possible only if the resource-monitoring system knows something about the file system and network protocols.
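As a small illustration of what 'knowing something about the protocols' means, the sketch below lifts a raw Ethernet frame to the level an analyst cares about. It assumes an untagged Ethernet II frame carrying IPv4, and it guesses the service from the destination port alone, which is a simplification.

```python
import struct

# Destination ports for the protocols mentioned in the text.
WELL_KNOWN_PORTS = {21: "FTP", 25: "SMTP (email)", 80: "HTTP"}


def describe_frame(frame: bytes) -> str:
    """Summarize a raw Ethernet frame as a TCP conversation, if it is one."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:
        return "non-IPv4 frame"
    ip = frame[14:]
    header_length = (ip[0] & 0x0F) * 4   # IHL field, counted in 32-bit words
    if ip[9] != 6:                       # IPv4 protocol number 6 is TCP
        return "IPv4, not TCP"
    src_port, dst_port = struct.unpack("!HH", ip[header_length:header_length + 4])
    service = WELL_KNOWN_PORTS.get(dst_port, "unknown service")
    return f"TCP {src_port} -> {dst_port} ({service})"
```

File-level monitoring needs the same kind of knowledge about the file system that this function has about packet headers.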

We have to be able to build our virtual world from components: computers, CPUs, mainboards, Internet servers and so on. Portability is another crucial factor for emulators. Currently we are in transition between the '32-bit age' and the '64-bit age', so any hurriedly-written, non-portable code will quickly end up in the trash.

Instead of constructing an instruction emulator, there is a more complicated solution:

  1. Make a global descriptor table in which there is no accessible segment for Ring3 code.

  2. Make a local descriptor table with all segments marked as non-readable, non-writeable and non-executable, with the exception of six descriptors for the code you want to study (ES, CS, SS, DS, FS, GS).

  3. Make an I/O permission bitmap in which access to all I/O ports is disabled.

  4. Run the code you want to study at Ring3. What happens when the OS is called? A small routine, such as GetLastError(), begins and returns. What about a KERNEL32::WriteFile call? The Ring3 DLL code calls a Ring0 KERNEL routine (INT 0x2E under Win2K, INT 0x80 under Linux x86), which causes an exception. This is the way to study the Ring0 calls. Of course, you cannot see the GetLastError() calls.


Reading disassembled code is time-consuming, and it would be much better to use decompilers for code that originated in a high-level language. The most important languages are Microsoft Visual C++, Borland C++Builder/Delphi and Microsoft Visual Basic, which has a special compiled format.

A decompiler may be able to resolve complex Win32 data types. Consider C code using the Win32 SYSTEMTIME data type. When EBX points to such a structure, WORD [EBX] is the field wYear and WORD [EBX + 2] is wMonth. However, if the data type cannot be recomposed, the SYSTEMTIME structure becomes an 'unsigned short int Array[8]', an access to wYear becomes 'Array[0]', an access to wMonth becomes 'Array[1]', and the resulting C code will be less readable than the output of a very intelligent decompiler.

Code comparison is also a very important goal. Imagine that one analysed program contains routines A, B, C and D. When another researcher analysing a different program finds routines A, B, C and D, they should be able to refer to the former analysis.
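The SYSTEMTIME point can be shown with the same 16 bytes viewed both ways. This is only an illustration of the readability gap; the field names follow the Win32 SYSTEMTIME definition (eight consecutive WORDs), and the helper names are made up.

```python
import struct
from collections import namedtuple

# Field order as in the Win32 SYSTEMTIME structure: eight consecutive 16-bit words.
SystemTime = namedtuple(
    "SystemTime",
    "wYear wMonth wDayOfWeek wDay wHour wMinute wSecond wMilliseconds",
)


def as_array(raw: bytes):
    """What a decompiler sees without type information: unsigned short Array[8]."""
    return list(struct.unpack("<8H", raw))


def as_systemtime(raw: bytes) -> SystemTime:
    """The same bytes once the data type has been recomposed."""
    return SystemTime(*struct.unpack("<8H", raw))
```

Reading as_systemtime(raw).wMonth tells the analyst what the code is doing, whereas as_array(raw)[1] does not.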


For efficient malware analysis we need a virtual world, decompilers and code comparators. Emulation is the easiest solution from an algorithmic point of view. A lot of free software is available to help us to build virtual worlds, such as Bochs (emulator), Windows Emulator, Samba, Apache, and so on. They know file formats, network protocols and hardware specifications.

Code comparison is algorithmically simple at assembly level, but it is very difficult at high-level language level. When code comparison tools are being developed, it is important to retain backward-compatibility with routines that have been analysed previously.
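At assembly level, one simple way to make routines comparable is to hash a normalized form of their instructions. The sketch below is an assumption-laden illustration and does not reproduce the method of any tool mentioned in this article: it works on textual disassembly, reduces operands to their kind, and uses SHA-256 as the fingerprint.

```python
import hashlib


def normalise(instruction: str) -> str:
    """Keep the mnemonic and the kind of each operand; drop concrete addresses and constants."""
    mnemonic, _, operands = instruction.strip().partition(" ")
    kinds = []
    for operand in (part.strip() for part in operands.split(",")):
        if not operand:
            continue
        if operand.startswith("["):
            kinds.append("mem")
        elif operand[0].isdigit():
            kinds.append("imm")
        else:
            kinds.append("reg")
    return (mnemonic.lower() + " " + ",".join(kinds)).strip()


def fingerprint(routine) -> str:
    """Hash the normalized listing so that relocated copies of a routine compare equal."""
    text = "\n".join(normalise(line) for line in routine)
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def lookup(routine, library: dict):
    """Return the name under which this routine was analysed before, if any."""
    return library.get(fingerprint(routine))
```

A library keyed by fingerprint lets a second researcher find the earlier analysis of routine A even when call targets and constants differ. Because the fingerprint is the stored key, changing the normalization later would invalidate existing entries, which is the backward-compatibility concern raised above.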

Decompilation is also very difficult. No commercial or open source solutions are available for the very complex tasks, but I have heard of some in-house solutions. For example, Ero Carrera and Gergely Erdélyi introduced a code comparison tool for malware-naming in their VB2004 conference paper 'Digital genome mapping', and Lubos Vrtik introduced a VBA6 decompiler in his VB2003 conference paper 'Inside VBA6 decompiler'.


This article contains my own views on the tools that are needed by a very intelligent malware analysis lab, and I would welcome the opinions of others.

One day I will be sitting in the ideal virus lab, studying software that looks like this:

if ( MD5SUM ( _1st_Input() ) == _CONST_1 ) {
	if ( MD5SUM ( _2nd_Input() ) == _CONST_2 ) {
		if ( MD5SUM ( _3rd_Input() ) == _CONST_3 ) {
			_1st_Output();
} } }

As I study it, I will be considering how to explain to an average, less-than-gifted user the circumstances in which the software produces the results of '_1st_Output()'. If only that were my biggest problem.


