Inside the Microsoft script encoder

2006-01-01

Peter Ferrie

Symantec Security Response, USA
Editor: Helen Martin

Abstract

Peter Ferrie delves inside the Microsoft script encoder.


Introduction

When the Microsoft Script Encoder was released in 1999, it was predicted that malware authors would use it to obfuscate their code. As a result, tools claiming to be able to decode the files produced by the script encoder started to appear almost instantly.

You break it, you buy it

Recently, I was given an encoded script to examine, and I was told that it might contain an exploit of some kind. Since I am not in the habit of carrying script decoding tools with me, I downloaded a script decoder tool and used it to decode the file.

The result was a partially decoded file containing a fragment of what looked like shellcode and a lot of garbage bytes. Since the file was in ANSI format, there were three possibilities: that the file contained double-byte character set (DBCS) characters which were not being decoded correctly on the English system; that the script was broken; or that the tool contained a bug.

Assuming the first case, I tried decoding the file using other language formats that the tool supported, but again I was unsuccessful.

There are four languages that use DBCS characters: traditional Chinese, simplified Chinese, Japanese and Korean. Since the tool that I had downloaded supported only three of these languages, I decided to download several other script decoder tools in the hope that one of them would support the remaining language.

When the half-dozen or so tools that I had downloaded also failed to decode the script, I ruled out the third possibility. How likely is it that every copy of a tool would have the same bug? (As a matter of fact this is more likely than one might imagine, as I found out afterwards via a completely unrelated matter: try searching for tools that enable 'Unreal' mode on x86 and x86-64 processors, notice how many people claim to have found it, and notice that none of them enable the A20 line first.)

That left the second possibility - that the script was broken.

See Script run, run script run

Although I knew that running the script wouldn't provide a conclusive result, I ran it anyway. Sure enough, the Windows Scripting Host reported that the script was invalid. The question was: why?

Since I had long forgotten the details of encoded scripts, I downloaded the Microsoft Script Encoder tool (screnc.exe) and started to reverse-engineer it. Under normal circumstances, one would assume that examination of an encoder would be sufficient to provide an understanding of decoding methods. However, in this instance that is not the case, and it seems that the creators of the decoding tools all made the same mistake.

Screnc.exe

The first interesting thing I noticed about the Microsoft Script Encoder is that it supports ANSI, UTF-8, Little-Endian and Big-Endian Unicode input file formats. This is interesting because neither Microsoft's own VBScript and JScript scripting engines (before encoding), nor the decoder built into those scripting engines (after encoding), support anything but ANSI and Little-Endian Unicode. Upon attempting to execute files in the other formats, the Windows Scripting Host reports that they are invalid.

The second interesting thing is that the encoding is done by the VBScript and JScript scripting engines themselves. The reason for this is that these engines support dynamic encoding, using the EncodeScriptFile method. It seems that this method was not noticed by malware authors. The rest was fairly straightforward: encoded files begin with the signature '#@~^', followed by the base64-encoded length of the script and then by the script itself. After the script come the base64-encoded checksum and the closing signature '^#~@'.
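The layout just described can be sketched in a few lines of Python. This is a minimal illustration rather than a description of what the engines do internally: the function names are hypothetical, and it assumes that each base64 field is eight characters holding a four-byte little-endian value (which is consistent with the samples in this article, where 'AQAAAA==' is a length of 1) and that the payload contains no escape sequences, so its character count equals the stored length.

```python
import base64

START_SIG = "#@~^"
END_SIG = "^#~@"
FIELD_LEN = 8  # base64 text for a 4-byte value, including the "==" padding


def decode_field(field: str) -> int:
    """Decode one 8-character base64 field into an unsigned 32-bit integer.

    Assumption: the four bytes are little-endian.
    """
    raw = base64.b64decode(field, validate=True)
    if len(raw) != 4:
        raise ValueError("expected a 4-byte value")
    return int.from_bytes(raw, "little")


def parse_block(block: str):
    """Split one encoded block into (length, payload, checksum)."""
    if not block.startswith(START_SIG) or not block.endswith(END_SIG):
        raise ValueError("missing signature")
    body = block[len(START_SIG):-len(END_SIG)]
    length = decode_field(body[:FIELD_LEN])
    payload = body[FIELD_LEN:-FIELD_LEN]
    checksum = decode_field(body[-FIELD_LEN:])
    if len(payload) != length:
        raise ValueError("length field does not match payload")
    return length, payload, checksum


print(parse_block("#@~^AQAAAA==raQAAAA==^#~@"))  # (1, 'r', 105)
```

The payload itself ('r' here) is still encoded; only the framing is taken apart.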

The checksum is simply the sum of all of the characters from the script before it was encoded. It is used during the decoding phase to verify that the script has been decoded correctly, rather than to verify that the encoded script has not been altered.
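As a rough check of that description, the checksum field can be recomputed from a plain script. The sketch below assumes a 32-bit sum of character codes stored as four little-endian bytes; the function names are hypothetical. Under those assumptions the result for 'x=1' is '5gAAAA==', which is the field that appears in the 'x=1' sample later in this article.

```python
import base64


def checksum(script: str) -> int:
    """Sum of the character codes of the unencoded script, kept to 32 bits."""
    return sum(ord(ch) for ch in script) & 0xFFFFFFFF


def checksum_field(script: str) -> str:
    """Render the checksum as the 8-character base64 field (assumed little-endian)."""
    raw = checksum(script).to_bytes(4, "little")
    return base64.b64encode(raw).decode("ascii")


print(checksum("x=1"))        # 230
print(checksum_field("x=1"))  # 5gAAAA==
```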

If the script is not already in the Unicode format, it is converted to the Unicode format in memory prior to encoding. If the original file was in the ANSI format, the current code page is used to perform the translation, which causes the DBCS problem described above.
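The effect of translating with the wrong code page is easy to reproduce. The following sketch is only an analogy for what happens inside the encoder: it uses Python's 'cp932' (Japanese) and 'cp1252' (Western European) codecs to stand in for the system code pages, and the helper name is hypothetical.

```python
def ansi_to_unicode(ansi_bytes: bytes, code_page: str) -> str:
    """Translate ANSI bytes to Unicode using the given code page."""
    return ansi_bytes.decode(code_page, errors="replace")


original = "\u65e5\u672c"              # two Japanese characters
ansi_bytes = original.encode("cp932")  # how a Japanese system would store them

# The same bytes read back under two different "current" code pages.
on_japanese_system = ansi_to_unicode(ansi_bytes, "cp932")
on_english_system = ansi_to_unicode(ansi_bytes, "cp1252")

print(on_japanese_system == original)  # True
print(on_english_system == original)   # False: four unrelated characters
```

Since the checksum is computed over the characters that result from the translation, a script translated under a different code page will not produce the same sum.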

Once in the Unicode format, characters are not encoded if more than seven bits are required to identify them. In those cases, the character is simply copied instead. However, if the resulting encoded script is then saved in the ANSI format, any characters that cannot be represented in seven bits or fewer will be replaced by the system default for untranslatable characters (which is usually the '?' character). If such a replacement is made, the script can no longer be decoded properly.
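The loss on saving can be imitated in the same way. In this sketch (hypothetical helper name, with 'cp1252' standing in for the system's ANSI code page), the characters that the encoder would have copied unchanged are replaced with '?' when the text is written out, after which the sum of the characters no longer matches.

```python
def save_as_ansi(text: str, code_page: str = "cp1252") -> bytes:
    """Write text as ANSI, substituting '?' for untranslatable characters."""
    return text.encode(code_page, errors="replace")


copied_through = 'msg="\u65e5\u672c"'  # characters needing more than seven bits
saved = save_as_ansi(copied_through)

print(saved)  # b'msg="??"'

# The character sum before and after saving is different, so a checksum
# computed over the original characters can no longer be reproduced.
before = sum(ord(ch) for ch in copied_through)
after = sum(saved)
print(before == after)  # False
```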

Compound interest

Now the fun begins - watching how Microsoft's VBScript and JScript scripting engines deal with encoded files. Immediately we see that what screnc.exe produces is not all that those engines will accept.

The first thing the script decoder does is to search the entire script for the signature '#@~^'. This means that the script is not required to appear at the start of the file. Screnc.exe can produce such files only when the script is inside an HTML file. However, all of the decoding tools that I tried supported this behaviour.

The encoded script is decoded into the same location in the script at which it is found (but not the same location in memory). This means that unencoded script can appear before and/or after encoded script, even though screnc.exe cannot produce such files. None of the tools that I tried were affected by this, since they all found the script no matter where it was. However, one of the tools did not append the unencoded script that appeared after the encoded script.

The entire script is searched for all signatures that exist. This means that multiple encoded scripts can appear in a single file! Screnc.exe cannot produce such files, and none of the tools that I tried supported them, either. But wait - it gets worse ...

What the #@~^?!

Any part of a script can be encoded, even down to the level of individual characters. The result is that a script such as:

oh_this="bad"

can become (the unencoded characters are 'oh_th', 's="b' and 'd"'; each of the two encoded blocks carries a single character):

oh_th#@~^AQAAAA==raQAAAA==^#~@s="b#@~^AQAAAA==CYQAAAA==^#~@d"
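For blocks that hold exactly one character, the splice can be demonstrated without knowing how the payload is encoded, because the checksum of a one-character script is simply that character's code. The sketch below relies on that shortcut only; it is not how the scripting engines decode, the names are hypothetical, and it assumes four-byte little-endian fields. Blocks longer than one character are left untouched.

```python
import base64
import re

# signature, 8-char length field, payload, 8-char checksum field, signature
BLOCK = re.compile(r"#@~\^(.{8})(.*?)(.{8})\^#~@", re.DOTALL)


def field(value: str) -> int:
    """Decode a base64 field as a little-endian integer (assumed layout)."""
    return int.from_bytes(base64.b64decode(value), "little")


def splice_single_chars(text: str) -> str:
    """Replace each one-character block in place with its original character."""
    def replace(match):
        length = field(match.group(1))
        payload = match.group(2)
        check = field(match.group(3))
        if length == 1 and len(payload) == 1:
            return chr(check)  # the checksum is the character itself
        return match.group(0)  # longer blocks need the real decoder
    return BLOCK.sub(replace, text)


line = 'oh_th#@~^AQAAAA==raQAAAA==^#~@s="b#@~^AQAAAA==CYQAAAA==^#~@d"'
print(splice_single_chars(line))  # oh_this="bad"
```

The unencoded text before, between and after the blocks is carried across unchanged, which is the behaviour described above.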

Fortunately, recursive encoding is not allowed, since the script decoder makes only a single pass over the script and decodes it to a different location in memory. If the script decoder had decoded to the same memory location, then it might have been possible to support recursive encoding to arbitrary levels, which would have made the problem much worse.

Given that an encoded script can appear anywhere in a file, it might seem surprising that all of the authors of the decoder tools made the same assumption: that the '#@~^' signature is a guarantee that what follows is an encoded script.

Of course, that's simply not true. Thus, the line:

x="#@~^" :#@~^AwAAAA==a{F5gAAAA==^#~@:msgbox(x)

was not decoded by any of those tools, yet Microsoft's VBScript and JScript scripting engines decoded and executed it correctly (it prints '1', not '#@~^'). Also note the space that appears after the ' " '. Without it, even Microsoft's VBScript and JScript scripting engines are fooled into believing that what follows the first '#@~^' signature is an encoded script, since the ':' character is a valid entry in the base64 dictionary that is used to decode the script length. The decoded length is then an enormous value, too large for the Windows Scripting Host, which reports that the script is invalid and exits without executing any further script.

Although unencoded script can appear both before and after encoded script, all of the decoding is done before any script is interpreted, so an encoded block cannot be assembled at run time by string concatenation.

For example,

a="#@~^AQAAAA=="+"qMQAAAA==^#~@"

does not decode to 'a=1'. Additionally, decoding is not done after any script is interpreted, so this line:

eval(x)

where x is the encoded script

'#@~^AQAAAA==CYQAAAA==^#~@'

that was read from a file, is not decoded and does not evaluate to 'a'.
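Returning to the false-signature line earlier in this section, one way for a tool to avoid the shared assumption is to treat each '#@~^' as a candidate only, and to accept it when the fields around it are consistent. The sketch below is one possible approach and not a model of the scripting engines: it uses a strict base64 check (stricter than the engines, as the next section explains), assumes four-byte little-endian fields and a payload with no escape sequences, and all of the names are hypothetical.

```python
import base64
import binascii

START_SIG, END_SIG, FIELD_LEN = "#@~^", "^#~@", 8


def try_field(text: str):
    """Return the field's integer value, or None if it is not a valid field."""
    try:
        raw = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return None
    return int.from_bytes(raw, "little") if len(raw) == 4 else None


def find_blocks(text: str):
    """Return (offset, payload, checksum) for every consistent block."""
    blocks, pos = [], 0
    while True:
        start = text.find(START_SIG, pos)
        if start < 0:
            return blocks
        head = start + len(START_SIG)
        length = try_field(text[head:head + FIELD_LEN])
        if length is not None:
            payload_start = head + FIELD_LEN
            tail = payload_start + length
            check = try_field(text[tail:tail + FIELD_LEN])
            end = tail + FIELD_LEN
            if check is not None and text.startswith(END_SIG, end):
                blocks.append((start, text[payload_start:tail], check))
                pos = end + len(END_SIG)
                continue
        pos = start + 1  # a signature, but not a block: keep searching


line = 'x="#@~^" :#@~^AwAAAA==a{F5gAAAA==^#~@:msgbox(x)'
print(find_blocks(line))  # [(10, 'a{F', 230)]
```

Note that this scanner also skips the first signature when the space is removed, whereas the engines, as described above, do not. A tool that needs to predict what the engines will do would have to copy their laxer behaviour instead.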

(That's a bug)

As mentioned above, the length and checksum of the decoded script are stored in base64 format. A bug exists in the base64 decoder in Microsoft's VBScript and JScript scripting engines, which does not limit the input values correctly. This can be used to obfuscate the true length and/or checksum from tools which accept only files whose length and checksum are correct.

There is also an integer overflow bug in Microsoft's VBScript and JScript scripting engines that causes a crash while calculating the length of the script. The bug is triggered if the top bit is set in a decoded length that would otherwise point within the file.
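To make the phrase 'top bit set' concrete, the sketch below shows three ways of reading a length field in which bit 31 has been added to an otherwise valid length of 3. It is arithmetic only: where exactly the engines go wrong is not stated here, and the code makes no claim about that. The function name is hypothetical and the four-byte little-endian layout is an assumption.

```python
import base64


def length_views(field: str):
    """Return the field read as unsigned, as signed, and with bit 31 cleared."""
    raw = base64.b64decode(field, validate=True)
    unsigned = int.from_bytes(raw, "little")
    signed = int.from_bytes(raw, "little", signed=True)
    low31 = unsigned & 0x7FFFFFFF
    return unsigned, signed, low31


# 'AwAAgA==' is the bytes 03 00 00 80, i.e. 0x80000003.
print(length_views("AwAAgA=="))  # (2147483651, -2147483645, 3)
```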

Conclusion

So what happened to that script? In the end, it was simply broken. There were extra characters inserted throughout the script, so the decoded length did not match; and there were some characters whose value was incorrect ('.' instead of the tab character, for example), so the decoded checksum did not match. After I had identified and removed the extra characters, and corrected the incorrect characters, the script decoded properly on an English system. It even contained an exploit, but it was one that we knew about already.
