Clean data profiling

Catherine Robinson, Julie Weber, Bartlomiej Uscilowski and Thomas Parsons, Symantec


The volume of malicious software currently being created is so high that it has prompted discussion in the AV industry about whether a blacklisting model will remain feasible in the future. In this context, clean data sets are becoming increasingly important, and so is the need to classify them.

In this paper, we discuss problems and solutions related to gathering and profiling large clean data sets. We provide guidelines for gathering clean files, keeping them uncompromised, and determining both their level of trust and their intrinsic quality (usefulness).

We present a systematic approach to profiling files and managing the metadata in a clean set. Based on the nature of the data that needs to be extracted, we group the profiling metadata into two categories: lower-level and higher-level information. Lower-level metadata is extracted automatically from the files themselves and contains information that helps in locating files and determining their type. Higher-level metadata consists of information that allows files to be categorised. We present possible sources of this information, which can be obtained either automatically or through manual annotation. We also attempt to define a naming convention for identifying software and to standardise the type of data that can be queried.
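As a rough illustration of the lower-level category, the following minimal Python sketch extracts a few fields directly from a file. It is not the authors' implementation. The field names, the `profile_file` and `guess_type` helpers, and the small magic-byte table are all assumptions made for this example; the abstract does not specify which fields or file types the paper uses.

```python
import hashlib
import os

# Hypothetical magic-byte table for illustration only; a real clean set
# would need to recognise far more file types than these.
MAGIC_SIGNATURES = [
    (b"MZ", "pe-or-dos-executable"),
    (b"\x7fELF", "elf"),
    (b"PK\x03\x04", "zip"),
    (b"%PDF", "pdf"),
]


def guess_type(header: bytes) -> str:
    """Return a coarse file type label based on the leading bytes."""
    for magic, label in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def profile_file(path: str) -> dict:
    """Collect assumed lower-level metadata: name, size, hash and type."""
    sha256 = hashlib.sha256()
    header = b""
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            if not header:
                header = chunk[:8]  # keep the first bytes for type detection
            sha256.update(chunk)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
        "file_type": guess_type(header),
    }
```

A record like this could then be joined, by hash, with higher-level annotations such as vendor or product name, whether those are gathered automatically or entered by hand.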

Finally, we review existing clean data sets, both profiled and unprofiled, and their shortcomings for this particular use.

