The ideal tools of an ideal virus lab


Jozsef Matrai

VirusBuster Ltd, Hungary
Editor: Helen Martin


Each profession has its own set of tools, and whenever there is an improvement in those tools, the work of that profession becomes more efficient. Every company in the anti-virus industry has its own confidential technology for studying malicious and potentially malicious code. However, creating all the necessary tools for malware analysis in-house is not always economical, particularly for small companies. This article is aimed at anyone who is a potential user or creator of malware analysis tools.



Running, debugging and watching

Malicious code tends to involve a lot more computational effort than non-malicious code. For example, non-malicious code might simply say: 'I want to call URLDownloadToFileA to download a URL to a local file'. However, to avoid analysis, malicious code might scan through memory for the 'MZ' signatures of loaded DLL files, find the entry point of URLDownloadToFileA using only a checksum made from the characters of the function name, and compute the URL string using a long formula.
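
The checksum trick can be sketched in a few lines of Python. This is only an illustration: the rotate-and-add checksum is an assumption (the technique works with any checksum), and the export table with its addresses is made up.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    value &= 0xFFFFFFFF
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def name_checksum(name):
    """Illustrative checksum (an assumption): rotate right by 13, add each byte."""
    acc = 0
    for byte in name.encode("ascii"):
        acc = (ror32(acc, 13) + byte) & 0xFFFFFFFF
    return acc


def resolve_by_checksum(exports, wanted):
    """Return the names of all exports whose checksum equals `wanted`."""
    return [name for name in exports if name_checksum(name) == wanted]


# Hypothetical export table: function name -> made-up entry point address.
EXPORTS = {
    "URLDownloadToFileA": 0x60001000,
    "URLDownloadToFileW": 0x60001040,
    "URLOpenStreamA": 0x60001080,
}

# The malicious code would embed only this number, never the readable name.
WANTED = name_checksum("URLDownloadToFileA")

for name in resolve_by_checksum(EXPORTS, WANTED):
    print(f"checksum {WANTED:#010x} resolves to {name} at {EXPORTS[name]:#010x}")
```

An analyst who sees only the constant in the code has to hash every export name to learn which function is being called, which is exactly the extra effort the malware author intends.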

As a result of this complexity, we need a tool for running, debugging and watching the resource usage of malicious code.

A virtual world

What is the problem with current commercial products? Commercial debuggers are, in general, hardware-level debuggers. On Intel x86-compatible machines they overwrite instructions with the opcode 0xCC (INT 3) to interrupt a running program, set the trap flag of the EFLAGS register, or modify the DRx debug registers. All of these changes are visible to the program being debugged, which means that malicious code may behave differently under debug control than in reality.
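
As a rough illustration of why the 0xCC patch is detectable, the following Python sketch models a debugger writing a software breakpoint into a code buffer and a self-check that notices the change. The CRC32 self-check is an assumption chosen for brevity; real samples use many different integrity checks.

```python
import zlib

INT3 = 0xCC  # opcode a software breakpoint writes over the original byte


def set_breakpoint(code, offset):
    """Patch in INT3 as a debugger would; return the byte that was replaced."""
    original = code[offset]
    code[offset] = INT3
    return original


def looks_debugged(code, expected_crc):
    """Self-check: has the code changed since its checksum was recorded?"""
    return zlib.crc32(bytes(code)) != expected_crc


# Sample code bytes: push ebp; mov ebp, esp; pop ebp; ret
CODE = bytearray(b"\x55\x8b\xec\x5d\xc3")
CLEAN_CRC = zlib.crc32(bytes(CODE))

before = looks_debugged(CODE, CLEAN_CRC)
saved = set_breakpoint(CODE, 1)
after = looks_debugged(CODE, CLEAN_CRC)
print(before, after)
```

An emulator never has to modify the code it runs, so a check of this kind has nothing to find.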

A much better solution is to use a virtual environment, such as an emulator, which makes it a lot harder for the malicious code to determine whether it is running on a real machine or under a debugger.

Consider the following scenario: imagine a piece of Trojan code that tries to determine whether it is connected to the Internet or running within an emulated network. The code checks for the existence of two remote files: 'MUST_EXIST_FILE' and '{RANDOMSTRING}_MUST_NOT_EXIST_FILE' (where {RANDOMSTRING} represents a random alphabetical string). When the Trojan code discovers that it is running within an emulated network, it stops working completely, even if the executable is restarted.

What happens when you attempt to study this code on a real machine?

  • At first, you see two queries in the Bind log, one for each of the two sites the Trojan contacts. You reconfigure Bind so that queries to the two sites are redirected to a local server, and restore the client machine by overwriting the whole hard disk from the original image.

  • On your second attempt you see two queries in the Apache log of your local machine: 'MUST_EXIST_FILE' and 'DFSDFDS_MUST_NOT_EXIST_FILE'. You reconfigure the web server to host these two pages and restore the client machine again.

  • After many failed attempts you find yourself reconfiguring the web server, making sure that the file 'MUST_EXIST_FILE' exists and the file ending in '_MUST_NOT_EXIST_FILE' does not exist. You keep waiting in the hope of finding out whether this Trojan does anything other than making web server queries.

If you had been able to restart the program so that the GetTickCount() calls returned the same values each time, the random number generator would produce the same output: the download queries would always be for 'MUST_EXIST_FILE' and 'WERWER_MUST_NOT_EXIST_FILE'. A human operator might not be able to do this, but a virtual user can. In this case, you only have four combinations to try: the two files multiplied by the two states (exists or does not exist).

In the virtual world you must have the ability to step into the same river twice or more. If you can feed all GetTickCount() values to a polymorphic engine, it will produce all possible outputs.
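
A minimal Python sketch of this idea follows. It assumes, purely for illustration, that the Trojan seeds its random generator from the tick count alone; `random_prefix` and the fixed tick value are hypothetical stand-ins, not the behaviour of any particular sample.

```python
import random
import string
from itertools import product


def random_prefix(tick_count, length=6):
    """Stand-in for the Trojan's generator, seeded only by the tick count."""
    rng = random.Random(tick_count)
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def file_names(tick_count):
    """The two files the Trojan would ask for, given one tick count."""
    return ("MUST_EXIST_FILE", random_prefix(tick_count) + "_MUST_NOT_EXIST_FILE")


FIXED_TICK = 123456  # value the emulator returns from every GetTickCount() call

first_run = file_names(FIXED_TICK)
second_run = file_names(FIXED_TICK)
print(first_run == second_run)

# Two files multiplied by two states: four server configurations to try.
configurations = [dict(zip(first_run, states))
                  for states in product((True, False), repeat=2)]
for config in configurations:
    print(config)
```

Because the names no longer change between runs, the four configurations can be tested one after another instead of by trial and error.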

Virtual machines may be controlled both by real users and by virtual users. The virtual user may be a script such as the following that allows UNEXEPACK to run only while it is not in our own process or while it is in UNEXEPACK's own code:

while (Computer[0].Motherboard.CPU[0].CR3 != OUR_PROCESS_CR3
|| Computer[0].Motherboard.CPU[0].EIP >= 0x00600000U ) {
	/* let the emulated machine keep running */
}

Skipping the emulation of OS functions can also help speed things up. Consider a Win32 executable loaded at 0x00400000...0x00403FFF. It uses DLL images loaded at 0x60000000...0xBFFFFFFF. The virtual user (script) can recognize Win32 function calls easily, depending on which address range the instruction pointer register, EIP, belongs to. Instead of having to run all necessary Windows processes you can run only one process, and instead of stepping through Windows functions, you can use faster replacement code.
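
The address-range test can be sketched as follows in Python. The two ranges are the ones from the example above; the replacement table, its address and the `next_action` helper are hypothetical and only show the shape of the decision.

```python
# Address ranges taken from the example in the text.
EXE_RANGE = (0x00400000, 0x00403FFF)
DLL_RANGE = (0x60000000, 0xBFFFFFFF)


def classify_eip(eip):
    """Say which region of the address space EIP currently points into."""
    if EXE_RANGE[0] <= eip <= EXE_RANGE[1]:
        return "exe"
    if DLL_RANGE[0] <= eip <= DLL_RANGE[1]:
        return "dll"
    return "other"


# Hypothetical table: made-up DLL entry address -> fast stand-in routine.
REPLACEMENTS = {
    0x60001000: lambda: 0,  # e.g. a stub that always reports success
}


def next_action(eip):
    """Decide what the virtual user does when execution reaches `eip`."""
    region = classify_eip(eip)
    if region == "dll" and eip in REPLACEMENTS:
        return ("replace", REPLACEMENTS[eip]())
    if region == "dll":
        return ("step-through-dll", None)
    return ("emulate", None)


print(next_action(0x00401234))
print(next_action(0x60001000))
print(next_action(0x60002000))
```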

Recent PE EXE packers/protectors can cause a headache when replacing Win32 calls. Consider in the unpacked code:

FF 15 xx xx xx xx CALL [ADVAPI32DLL_RegQueryValueExA]

The packer will overwrite it with:

E8 xx xx xx xx CALL equivalent

The first time I came across such a substitution, I thought it would be very easy to handle: all I had to do was to compare the CALLed byte sequence with all byte sequences at DLL export entries. Unfortunately, the replaced function entry code can be obfuscated. There are many Win32 exported entries that look like this:

	etc ...

It is very common for routines to begin with a JMP to a common address. When I did an experiment to find the equivalent Win32 address of a replaced call, I ran the replaced call for as long as the EIP lay inside the user address space, then stored the EIP and the general registers. Next I ran every Win32 exported entry for up to 1,000 instructions, stopping as soon as the EIP and general registers equalled the stored values. In approximately 80 per cent of the cases the answer was one exported entry. In the other cases the answer was more than two entries or even no entries.
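
The matching step of that experiment might look like the Python sketch below. It assumes that an emulator has already produced, for each exported entry, the sequence of machine states reached when running from that entry; the state layout and all trace values here are made up.

```python
from itertools import islice

MAX_DEPTH = 1000  # instruction limit quoted in the text


def find_equivalent_exports(stored_state, export_traces, max_depth=MAX_DEPTH):
    """Return exports whose first `max_depth` states include `stored_state`."""
    matches = []
    for name, trace in export_traces.items():
        if any(state == stored_state for state in islice(trace, max_depth)):
            matches.append(name)
    return matches


# State = (EIP, (EAX, EBX)); every value is invented for the example.
STORED = (0x60002000, (1, 3))

TRACES = {
    "RegQueryValueExA": [(0x60001000, (1, 2)), (0x60002000, (1, 3))],
    "RegOpenKeyExA": [(0x60001100, (1, 2)), (0x60003000, (4, 3))],
    "RegCloseKey": [(0x60001200, (1, 2)), (0x60001205, (1, 2))],
}

print(find_equivalent_exports(STORED, TRACES))
```

A result with exactly one name is the easy case; an empty list or several names corresponds to the ambiguous cases described above.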


When performing malware analysis we need a good resource-monitoring system. Generally we want to watch the disk I/O at file read and write level, but sometimes we want to monitor at the sector read/write level. We would also like to view the network traffic as TCP/IP packets (for email, FTP and HTTP) and as Ethernet packets. This is another headache, because making a hardware emulator is far cheaper than emulating an operating system: the hardware is completely documented, and its scope is much smaller than that of an OS.

In a hardware emulator, Ethernet traffic, commands sent to the disk, etc. can be observed, but usually we do not want to do this - we want to monitor file I/O, email I/O, etc. Observation of these things is possible only if the resource-monitoring system knows something about the file system and network protocols.
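
To show what 'knowing something about the protocols' means in practice, here is a small Python sketch that turns a raw Ethernet frame, as a hardware emulator would see it, into the TCP payload an analyst actually wants to read. It handles only untagged Ethernet carrying IPv4 and TCP, does not verify checksums, and the sample frame is built by hand with made-up addresses.

```python
import socket
import struct

ETH_HEADER = struct.Struct("!6s6sH")          # dst MAC, src MAC, EtherType
IP_HEADER = struct.Struct("!BBHHHBBH4s4s")    # fixed 20-byte IPv4 header
TCP_HEADER = struct.Struct("!HHIIHHHH")       # fixed 20-byte TCP header


def parse_frame(frame):
    """Decode an Ethernet/IPv4/TCP frame; return None for anything else."""
    _dst, _src, ether_type = ETH_HEADER.unpack_from(frame, 0)
    if ether_type != 0x0800:                  # not IPv4
        return None
    ip_start = ETH_HEADER.size
    fields = IP_HEADER.unpack_from(frame, ip_start)
    ip_header_len = (fields[0] & 0x0F) * 4    # IHL is counted in 32-bit words
    total_len, protocol = fields[2], fields[6]
    if protocol != 6:                         # not TCP
        return None
    tcp_start = ip_start + ip_header_len
    tcp = TCP_HEADER.unpack_from(frame, tcp_start)
    tcp_header_len = (tcp[4] >> 12) * 4       # data offset, in 32-bit words
    payload = frame[tcp_start + tcp_header_len:ip_start + total_len]
    return {
        "src": (socket.inet_ntoa(fields[8]), tcp[0]),
        "dst": (socket.inet_ntoa(fields[9]), tcp[1]),
        "payload": payload,
    }


def build_sample_frame():
    """Build a made-up HTTP request frame (checksums left as zero)."""
    payload = b"GET /MUST_EXIST_FILE HTTP/1.0\r\n\r\n"
    tcp = TCP_HEADER.pack(1025, 80, 0, 0, (5 << 12) | 0x18, 8192, 0, 0)
    ip = IP_HEADER.pack(0x45, 0, 20 + len(tcp) + len(payload), 0, 0, 64, 6, 0,
                        socket.inet_aton("10.0.0.2"),
                        socket.inet_aton("10.0.0.1"))
    eth = ETH_HEADER.pack(bytes.fromhex("020000000001"),
                          bytes.fromhex("020000000002"), 0x0800)
    return eth + ip + tcp + payload


print(parse_frame(build_sample_frame()))
```

The same layering applies to disk monitoring: sector reads only become file reads once the monitor understands the file system on the emulated disk.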

We have to be able to build our virtual world from components: computers, CPUs, mainboards, Internet servers and so on. Portability is another crucial factor for emulators. Currently we are in transition between the '32-bit age' and the '64-bit age', so any hurriedly-written, non-portable code will quickly end up in the trash.

There is a more complicated alternative to constructing an instruction emulator:

  1. Make a global descriptor table in which no segment is accessible to Ring3 code.

  2. Make a local descriptor table with all segments marked as non-readable, non-writeable and non-executable, with the exception of six descriptors for the code you want to study (ES, CS, SS, DS, FS, GS).

  3. Make an I/O privilege table, with access to any I/O ports disabled.

  4. Run the code you want to study at Ring3. What happens when the OS is called? A small routine, such as GetLastError(), begins and returns. What about a KERNEL32::WriteFile call? The Ring3 DLL code calls a Ring0 KERNEL routine (INT 0x2E under Win2K, INT 0x80 under Linux x86), which causes an exception. This is the way to study the Ring0 calls. Of course, you cannot see the GetLastError() calls.


Reading disassembled code is time-consuming, and it would be much better to use decompilers for code that originated in a high-level language. The most important development environments are: Microsoft Visual C++, Borland Builder/Delphi and Microsoft Visual Basic, which has a special compiled format.

A decompiler may be able to resolve Win32 complex data types. Consider C code using the Win32 SYSTEMTIME data type. When EBX points to such a structure, WORD [EBX] is the field wYear and WORD [EBX + 2] is wMonth. However, if the data type cannot be recomposed, the SYSTEMTIME structure is an 'unsigned short int Array[8]', an access to wYear is 'Array[0]', an access to wMonth is 'Array[1]', and the resulting C code will be less readable than the output of a very intelligent decompiler.

Code comparison is also a very important goal. Imagine that one analysed program contains routines A, B, C and D. When another researcher analysing a different program finds A, B, C and D routines, they should be able to refer to the former analysis.
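
The SYSTEMTIME example can be made concrete with a short Python sketch that views the same 16 bytes twice: once with the Win32 field names (wYear, wMonth and so on) and once as a plain array of eight 16-bit values. The sample date is arbitrary.

```python
import ctypes
import struct


class SYSTEMTIME(ctypes.LittleEndianStructure):
    """Field layout of the Win32 SYSTEMTIME structure: eight 16-bit words."""
    _fields_ = [
        ("wYear", ctypes.c_uint16),
        ("wMonth", ctypes.c_uint16),
        ("wDayOfWeek", ctypes.c_uint16),
        ("wDay", ctypes.c_uint16),
        ("wHour", ctypes.c_uint16),
        ("wMinute", ctypes.c_uint16),
        ("wSecond", ctypes.c_uint16),
        ("wMilliseconds", ctypes.c_uint16),
    ]


# Raw memory as the analysed code would see it at [EBX]; values are arbitrary.
RAW = struct.pack("<8H", 2005, 3, 2, 15, 12, 30, 0, 0)

typed = SYSTEMTIME.from_buffer_copy(RAW)   # what a type-aware decompiler shows
untyped = struct.unpack("<8H", RAW)        # 'unsigned short int Array[8]'

print(typed.wYear, typed.wMonth)           # readable field names
print(untyped[0], untyped[1])              # same values, meaning lost
```

Both views carry the same information, but only the first tells the reader that offset 0 is a year and offset 2 is a month.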


For efficient malware analysis we need a virtual world, decompilers and code comparators. Emulation is the easiest solution from an algorithmic point of view. A lot of free software is available to help us to build virtual worlds, such as Bochs (emulator), Windows Emulator, Samba, Apache, and so on. They know file formats, network protocols and hardware specifications.

Code comparison is algorithmically simple at assembly level, but it is very difficult at high-level language level. When code comparison tools are being developed, it is important to retain backward-compatibility with routines that have been analysed previously.
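
At assembly level, one simple way to compare routines is to hash a normalised listing and look the hash up in a table of earlier analyses. The Python sketch below assumes textual disassembly as input and treats every hexadecimal literal as relocatable. Both choices are simplifications, and the listings and table entry are invented.

```python
import hashlib
import re

HEX_LITERAL = re.compile(r"\b0x[0-9a-f]+\b")


def fingerprint(listing):
    """Hash a routine after replacing every hex literal with a placeholder."""
    normalised = []
    for line in listing:
        line = " ".join(line.lower().split())
        if line:
            normalised.append(HEX_LITERAL.sub("IMM", line))
    return hashlib.sha256("\n".join(normalised).encode("ascii")).hexdigest()


ROUTINE_A = ["push ebp", "mov ebp, esp", "call 0x00401000", "pop ebp", "ret"]
ROUTINE_A_RELOCATED = ["push ebp", "mov ebp, esp", "call 0x00521000",
                       "pop ebp", "ret"]
ROUTINE_B = ["xor eax, eax", "ret"]

# Table of previously analysed routines: fingerprint -> analyst's note.
KNOWN = {fingerprint(ROUTINE_A): "routine A (analysed earlier)"}

print(KNOWN.get(fingerprint(ROUTINE_A_RELOCATED), "unknown"))
print(KNOWN.get(fingerprint(ROUTINE_B), "unknown"))
```

If the normalisation rules ever change, the stored fingerprints have to be recomputed or kept alongside the new ones, which is the backward-compatibility concern mentioned above.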

Decompilation is also very difficult. No commercial or open source solutions are available for the very complex tasks, but I have heard of some in-house solutions. For example, Ero Carrera and Gergely Erdélyi introduced a code comparison tool for malware-naming in their VB2004 conference paper 'Digital genome mapping', and Lubos Vrtik introduced a VBA6 decompiler in his VB2003 conference paper 'Inside VBA6 decompiler'.


This article contains my own views on the tools that are needed by a very intelligent malware analysis lab, and I would welcome the opinions of others.

One day I will be sitting in the ideal virus lab, studying software that looks like this:

if ( MD5SUM ( _1st_Input() ) == _CONST_1 ) {
	if ( MD5SUM ( _2nd_Input() ) == _CONST_2 ) {
		if ( MD5SUM ( _3rd_Input() ) == _CONST_3 ) {
			_1st_Output();
} } }

As I study I will be considering how to explain to an average, less-than-gifted user the circumstances in which the software produces the results of '_1st_Output()'. If only that were my biggest problem.


