Anti-unpacker tricks – part fourteen

2010-11-01

Peter Ferrie

Microsoft, USA
Editor: Helen Martin

Abstract

Last year, a series of articles described some tricks that might become common in the future, along with some countermeasures. In this final article of the series we look at anti-unpacking by anti-emulating.


New anti-unpacking tricks continue to be developed as older ones are constantly being defeated. Last year, a series of articles described some tricks that might become common in the future, along with some countermeasures [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].

In this final article of the series we look at anti-unpacking by anti-emulating.

Unless stated otherwise, all of the techniques described here were discovered and developed by the author.

Software interrupts

Interrupt 0x2E

On 32-bit versions of Windows XP and later, if the CPU supports the SYSEXIT instruction, then on return from interrupt 0x2E the EDX register holds the address of the next instruction to execute.

Example code looks like this:

   ;any value will work
   ;but requires user32.dll loaded
   or    eax, -1
   int 2eh
l1: cmp edx, offset l1
   jne being_debugged

The reason for this is not obvious. The following disassembly shows more:

   test d [esp+4], 1 ;ring check
l1: jne    l2 ;taken if ring 3
   ...
l2: iretd ;return to caller
l3: test b [esp+9], 1 ;check T flag
   jne l2 ;use iret if set
   pop edx ;edx=eip
   add esp, 4 ;discard cs
   and b [esp+1], -3 ;clear I flag
   popfd ;load flags
   pop ecx ;discard error code
   sti
   sysexit ;fast return to caller

In this disassembly, there is no reference to either l1 or l3. However, what cannot be seen here is that code exists elsewhere in the kernel, which checks for CPU support for the SYSEXIT instruction. If such support exists, then the kernel adjusts the value at l1+1 such that the branch reaches l3 instead of l2.

When the patch is active, the only way to reach l2 is if the T flag is set. In all other cases, the faster SYSEXIT instruction is used instead of the IRETD instruction. As a side effect of that change, the EDX register always contains the EIP value on return.
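
The effect of the patched branch can be summarised with a small model. The following Python sketch is not kernel code: the function names and parameters are hypothetical, and the assumption that the IRETD path leaves EDX holding some unrelated value is ours. It only encodes the decision described above, that is, IRETD when SYSEXIT is unsupported or the T flag is set, and otherwise SYSEXIT with EDX holding the return EIP.

```python
def int2e_return(sysexit_supported, trap_flag, return_eip, edx_other):
    """Model which return path is taken and what EDX holds afterwards.

    Hypothetical helper that mirrors the prose description only.
    edx_other stands for whatever unrelated value EDX would hold
    if the IRETD path were taken (an assumption of this sketch).
    """
    if not sysexit_supported or trap_flag:
        # l2: IRETD takes EIP from the stack and does not load EDX
        return "iretd", edx_other
    # l3: 'pop edx' loads the saved EIP, then SYSEXIT resumes at EDX
    return "sysexit", return_eip


def looks_debugged(edx_after, expected_eip):
    """Mirror of the 'cmp edx, offset l1 / jne being_debugged' check."""
    return edx_after != expected_eip
```

Under this model, a caller that is being single-stepped (T flag set) or that runs in an environment which does not reproduce the SYSEXIT path would fail the comparison.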

Interestingly, Windows 2000 contains similar code, as we can see in this disassembly:

  ;check for SYSEXIT support
  ;internal flag, not CPUID value
  test d [xxxxxxxx], 1000
  je  l1 ;taken if not supported
  test d [esp+4], 1 ;ring check
  je  l1 ;taken if ring 0
  ...
  pop edx ;edx=eip
  add esp, 8 ;discard cs, eflags
  pop ecx ;discard error code
  sti
  sysexit ;fast return to caller
l1: iretd ;return to caller

Here, a variable is checked instead of using an altered branch. This approach has poorer performance, but it avoids the in-memory patch. However, the code that queries the CPU capabilities does not contain any code that enables this feature. As a result, the SYSEXIT path is never reached.

There is an additional unexpected behaviour in the 32-bit version of Windows Vista and later Windows versions. If the value in the EAX register exceeds the size of the standard service table, then Windows will call through the ntdll KiUserCallbackDispatcher() function, which in turn calls through the PEB->KernelCallbackTable table. The index that is used depends on the Windows version. For Windows Vista, the index is currently 0x4c, and for Windows 7, the index is currently 0x4a. These values could change in the future, but it is trivial to fill a table that can support any value. This technique could be used to redirect execution in an obfuscated manner for those platforms.
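
Filling a table so that any index works can be sketched as follows. This Python fragment only models the table layout: `build_callback_table`, `lookup`, the slot count of 0x80 and the target address are assumptions chosen for illustration, not values taken from Windows.

```python
import struct


def build_callback_table(target, slots=0x80):
    """Return a packed table of 32-bit pointers that all refer to `target`.

    `slots` is an arbitrary illustrative size; it only needs to exceed
    whatever index the running Windows version happens to use.
    """
    return struct.pack("<%dI" % slots, *([target] * slots))


def lookup(table, index):
    """Read the 32-bit little-endian pointer stored at `index`."""
    return struct.unpack_from("<I", table, index * 4)[0]
```

Because every slot holds the same pointer, the lookup yields the same destination for index 0x4a, 0x4c, or any other index below the chosen size.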

Example code looks like this:

  call    GetVersion
  cmp al, 6
  jb  l1 ;not Vista+
  push    offset l2
  call    GetModuleHandleA
  push    offset l3
  push    eax
  call    GetProcAddress
  xchg    ecx, eax
  jecxz l1 ;not supported
  push    eax
  push    esp
  push    -1 ;GetCurrentProcess()
  call    ecx
  pop ecx
  loop    l1 ;taken if not WOW64
  mov eax, fs:[ecx+30h]
  mov d [eax+2ch], offset l4
  int 2eh
  jmp     being_debugged
l1: ...
l2: db     "kernel32", 0
l3: db     "IsWow64Process", 0
l4: dd     4ah dup (0)
  dd  offset l1 ;Windows 7
  dd  0
  dd  offset l1 ;Vista

Operand-size override

The operand-size override (0x66) can be used on instructions that transfer control. The result is that the EIP register is truncated to a 16-bit value. Execution resumes (if possible) from the resulting address.

Example code looks like this:

  xor ebx, ebx
  push    40h
  mov eax, esp
  push    3000h
  push    esp
  push    ebx
  push    eax
  push    -1 ;GetCurrentProcess()
  call    NtAllocateVirtualMemory
  xchg    ecx,eax
  db  66h
  jecxz l1
l1: ...

In this example, execution continues at the address (l1&0xffff). This technique works with all types of conditional branch: the short form (opcodes 0x7x) and the near form (opcodes 0x0f 0x8x).

Example code looks like this:

  xor ebx, ebx
  push    40h
  mov eax, esp
  push    3000h
  push    esp
  push    ebx
  push    eax
  push    -1 ;GetCurrentProcess()
  call    NtAllocateVirtualMemory
  test    eax, eax
  db  66h
  je  l1
l1: ...

In this example, execution continues at the address (l1&0xffff). This technique also works with relative calls and relative jumps.

Example code looks like this:

  xor ebx, ebx
  push 40h
  mov eax, esp
  push 3000h
  push esp
  push ebx
  push eax
  push -1 ;GetCurrentProcess()
  call NtAllocateVirtualMemory
  call l1 ;determine eip
l1: pop ax ;discard low 16 bits
  call small l2
l2: ...
l3: ...

As with the previous examples, execution continues at a truncated address, in this case (l2&0xffff). However, unlike the previous examples, this one can return to l3, with a balanced stack, simply by executing a 32-bit RET instruction.
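
The reason a 32-bit RET balances the stack can be shown with a little arithmetic. The Python sketch below is a model under stated assumptions rather than a trace of real execution: `rebuilt_return_address` and its parameter names are hypothetical. The first CALL pushes a 32-bit address, 'pop ax' removes its low 16 bits, and the 16-bit CALL pushes a 16-bit return offset in their place.

```python
def rebuilt_return_address(l1, after_small_call):
    """Model the stack after 'call l1 / pop ax / call small ...'.

    l1               -- 32-bit address pushed by the first CALL
    after_small_call -- address of the instruction after the 16-bit CALL
    Both names are hypothetical labels used only in this sketch.
    """
    high = l1 & 0xFFFF0000                # left on the stack by 'pop ax'
    low = after_small_call & 0x0000FFFF   # pushed by the 16-bit CALL
    return high | low                     # value popped by a 32-bit RET
```

The rebuilt value equals the true address of the instruction after the 16-bit CALL only when that instruction and l1 lie in the same 64KB-aligned block, which is the case for adjacent instructions unless they straddle such a boundary.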

Note the explicit mention of a ‘32-bit RET instruction’. This is important because the technique also works with all types of return (near and far).

Example code looks like this:

  xor ebx, ebx
  push 40h
  mov eax, esp
  push 3000h
  push esp
  push ebx
  push eax
  push -1 ;GetCurrentProcess()
  call NtAllocateVirtualMemory
  push small offset l1
  db  66h
  ret
l1: ...

As in the last example, execution continues at the address (l1&0xffff). Finally, this technique also works with the IRET instruction.

Example code looks like this:

  xor ebx, ebx
  push 40h
  mov eax, esp
  push 3000h
  push esp
  push ebx
  push eax
  push -1 ;GetCurrentProcess()
  call NtAllocateVirtualMemory
  pushfw
  push small cs
  push small offset l1
  iretw
l1: ...

As with the previous example, execution continues at the address (l1&0xffff).

Since this is a most uncommon use of the operand-size override, it is possible that some emulators will not support it.
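
For an emulator, supporting this behaviour amounts to one extra masking step when a control transfer carries the prefix. The following Python sketch illustrates that step for a relative branch. It is a simplified model, not code from any particular emulator, and the function and parameter names are hypothetical.

```python
def branch_target(next_eip, displacement, operand_size_override=False):
    """Compute the destination of a relative branch in 32-bit code.

    next_eip     -- address of the instruction after the branch
    displacement -- signed relative offset encoded in the branch
    """
    target = (next_eip + displacement) & 0xFFFFFFFF
    if operand_size_override:
        # A 0x66 prefix selects a 16-bit operand size here,
        # so only the low 16 bits of EIP survive.
        target &= 0xFFFF
    return target
```

For instance, a branch whose untruncated destination is 0x00401234 resumes at 0x1234 when the prefix is present.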

Multi-tasking

The CPU supports the running of multiple tasks. Each of those tasks has access to various resources such as the CPU registers and the FPU. However, when a task switch occurs, the CPU saves only the CPU registers and none of the FPU state. Instead, the CPU sets the ‘TS’ (Task Switched) bit in the CR0 control register, which signifies that a task switch has occurred. Whenever the CPU encounters an FPU, MMX, or SSE instruction (with a few exceptions), it checks the state of that bit. If the bit is set, then the CPU checks the state of the ‘MP’ (Monitor Coprocessor) bit. This bit is under software control. If it is also set, then the CPU raises an ‘NM’ (Device Not Available) exception that refers to the co-processor. The software-based task manager intercepts that exception and saves the state of the FPU, MMX and SSE environment prior to clearing the TS bit to avoid a redundant save.

The reason the MP bit exists is to avoid the relatively large overhead of saving the FPU state in the event that it is entirely unnecessary because a task did not use the FPU at all. There is also the possibility that several related tasks might share the FPU. In such a case, the task manager can also clear the MP bit to avoid a redundant save.

The task-switching behaviour can be exploited as an anti-emulation trick. Specifically, a process can execute an FPU instruction, thus causing the NM exception to be raised and the FPU state to be saved. The task manager will clear the TS bit in response to this event, and potentially clear the MP bit too. After some time passes and other tasks are executed, the task manager will set the MP bit again if it was cleared, and the processor will set the TS bit again. This cycle will continue until eventually the process resumes execution. At that time, the two bits should be set. A process can detect this cycle.

Example code looks like this:

  wait ;raise NM
l1: smsw ax
  and al, 0ah
  cmp al, 0ah
  je  l1 ;wait while TS and MP
l2: smsw ax
  test al, 2 ;wait for MP
  je  l2
  test al, 8 ;check for TS
  je  being_debugged

This technique is used by Waledac. However, it does not work on the 64-bit versions of Windows. Specifically, the loop at l2 never exits, because the MP bit is never set again for the process.
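
The bit tests in the detection loop can be restated in a few lines. The Python sketch below models only the masks applied to the value returned by SMSW (MP is bit 1, TS is bit 3). The function names and the returned strings are hypothetical, and it does not model how the operating system changes the bits over time.

```python
MP = 0x02  # bit 1 of the machine status word
TS = 0x08  # bit 3 of the machine status word


def both_set(msw):
    """True while the first loop (l1) keeps spinning."""
    return (msw & (MP | TS)) == (MP | TS)


def verdict(msw):
    """Mirror the second half of the check (the loop at l2 onwards)."""
    if not (msw & MP):
        return "waiting"  # loop at l2 keeps polling for MP
    return "ok" if (msw & TS) else "being_debugged"
```

An emulator that reports MP set but TS clear at that point would be classified as being_debugged, and one that never sets MP again would spin at l2, which matches the behaviour described for the 64-bit versions of Windows.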

VirtualPC-specific

There are some common methods in shellcode for finding the value of the EIP register using instructions that contain no bytes with a value of zero. One of those methods uses an FPU instruction.

Example code looks like this:

l1: fldz
  fnstenv [esp-0ch]
  pop eax
l2: ...

When l2 is reached, the value in the EAX register will be the address of l1. Thus, given the following code, it seems reasonable to assume that the branch at l3 will never be taken:

l1: fldz
  fnstenv [esp-0ch]
  pop eax
l2: cmp eax, offset l1
l3: jne being_debugged

However, this is an invalid assumption. In VirtualPC, single-stepping over the fldz instruction results in a completely different value in the EAX register. The cause is unknown at the time of writing, but the value appears to be a constant (0x74b036). This means that the code could be altered in a very subtle way.

Example code looks like this:

org 74b035h
l1: fldz
  fnstenv b [esp-0ch]
  pop eax
  dec b [eax+(offset l2-offset l1)-1]
  mov eax, offset l3+01000000h
l2: mov ecx, offset being_debugged
  jmp eax
l3: ;...

If the code executes freely, then execution continues from l3. However, single-stepping over the fldz instruction causes the ‘mov ecx’ instruction to become a ‘mov eax’ instruction, thus causing execution to resume from being_debugged.

That is a very subtle anti-debugging trick indeed.
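
The one-byte difference can be checked with a small model of the two MOV instructions. The Python sketch below is an illustration only: `jump_target` and the sample addresses are hypothetical, and it relies on the standard encodings 0xB8 for 'mov eax, imm32' and 0xB9 for 'mov ecx, imm32'. When the saved address is l1, the DEC hits the top byte of the first immediate. When it is l1+1, the DEC hits the opcode of the second instruction.

```python
import struct

MOV_EAX, MOV_ECX = 0xB8, 0xB9  # 'mov eax, imm32' / 'mov ecx, imm32'


def jump_target(eax_offset, l3, being_debugged):
    """Apply the DEC to a two-instruction model and return the value of EAX.

    eax_offset is 0 when FNSTENV reports l1 and 1 when it reports l1+1.
    l3 and being_debugged are illustrative 32-bit addresses.
    """
    code = bytearray()
    code += struct.pack("<BI", MOV_EAX, (l3 + 0x01000000) & 0xFFFFFFFF)
    l2 = len(code)                      # offset of 'mov ecx, ...'
    code += struct.pack("<BI", MOV_ECX, being_debugged)

    index = l2 - 1 + eax_offset         # byte hit by the DEC
    code[index] = (code[index] - 1) & 0xFF

    eax = None
    for pos in (0, l2):                 # execute both MOV instructions
        opcode, imm = struct.unpack_from("<BI", code, pos)
        if opcode == MOV_EAX:
            eax = imm
    return eax                          # destination of 'jmp eax'
```

With an offset of 0 the model yields l3, and with an offset of 1 it yields being_debugged, in line with the behaviour described above.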
