Anshuman Singh University of Louisiana at Lafayette
Andrew Walenstein University of Louisiana at Lafayette
Arun Lakhotia University of Louisiana at Lafayette
Supervised machine learning methods have become popular for malware detection. However, these methods assume that the malware population is stationary or drifts very slowly over time. A rapidly drifting population can quickly make machine-learning-based malware classifiers outdated, necessitating frequent retraining.
In this paper, we examine whether this assumption holds. To do so, we measure the rate of drift in unpacked samples of six malware families, ordering the samples temporally by their PE header timestamps. We represent each sample as a tf-idf vector of opcode 2-grams and measure drift as the cosine similarity between each sample and one of the early samples. The results show sharp, distinct, flat similarity bands at different similarity values in each of the malware families we studied. This result is counter-intuitive: an evolving malware family would be expected to produce samples whose similarity to an early sample decreases over time, giving a monotonically falling similarity curve.
These similarity bands suggest one or more of the following explanations. A malware family may itself contain independently evolving subfamilies. Population drift may exist but be quite slow. Alternatively, the bands may reflect the use of different packers that leave signature artifacts in the unpacked code.
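The drift measurement described above can be illustrated with a minimal sketch. It assumes each sample is already unpacked and disassembled into a list of opcode mnemonics paired with its PE header timestamp. The function names (`opcode_bigrams`, `tfidf_vectors`, `cosine`, `drift_curve`), the raw-count term frequency, and the `log(N/df) + 1` idf weighting are illustrative assumptions, not necessarily the exact variant used in our experiments.

```python
import math
from collections import Counter


def opcode_bigrams(opcodes):
    """Return the adjacent opcode pairs (2-grams) of one sample."""
    return list(zip(opcodes, opcodes[1:]))


def tfidf_vectors(opcode_lists):
    """Build one sparse tf-idf vector (a dict) per sample.

    Assumption: tf is the raw 2-gram count and idf is log(N / df) + 1,
    so that 2-grams present in every sample keep a non-zero weight.
    """
    counts = [Counter(opcode_bigrams(ops)) for ops in opcode_lists]
    n = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each sample contributes at most 1 per 2-gram
    return [
        {g: tf * (math.log(n / df[g]) + 1.0) for g, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drift_curve(samples, reference_index=0):
    """Order samples by timestamp and compare each to one early sample.

    `samples` is a list of (timestamp, opcode_list) pairs. The result is a
    list of (timestamp, similarity) pairs in temporal order, where the
    similarity is measured against the sample at `reference_index`.
    """
    ordered = sorted(samples, key=lambda s: s[0])
    vectors = tfidf_vectors([ops for _, ops in ordered])
    ref = vectors[reference_index]
    return [(ts, cosine(ref, vec)) for (ts, _), vec in zip(ordered, vectors)]
```

Plotting the returned similarities against the timestamps gives the kind of curve discussed above: steady drift would appear as a falling curve, whereas flat bands would appear as clusters of points at a few fixed similarity values.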