VB100 Methodology

How VB100 testing is performed


This document describes the methodology of the VB100. The methodology document is updated to include the most recent additions and adjustments to the VB100 comparative system, but where possible should still be applicable to historic tests. For details on how any changes to the methodology affect understanding of past tests, please contact vbtest@virusbulletin.com

Core goals

The purpose of the VB100 comparative is to provide an insight into the relative performance of the solutions taking part in our tests, covering as wide a range of areas as possible within the limitations of time and available resources.

The results of our tests should not be taken as definitive indicators of the potential of any product reviewed, as all solutions may contain additional features not covered by our tests, and may offer more or less protection depending on their configuration and the specifics of a given deployment.

VB100 certification is designed to be an indicator of general quality and should be monitored over a period of time. Achieving certification in a single comparative can only show that the solution in question has met the certification requirements in that specific test. A pattern of regular certifications and few or no failed attempts should be understood to indicate that the solution’s developers have strong quality control processes and strong ties to industry-wide sample-sharing initiatives, ensuring constant access to and coverage of the most prevalent threats.

Alongside the pass/fail data, we recommend taking into account the additional information provided in each report, and also suggest consultation of other reputable independent testing and certification organizations.

Product set-up and operation

Products are installed from the installation packages provided for the test, and any accompanying updates are also added. For the majority of tests, where a live internet connection is available, products are given a manual update signal if required, then systems are rebooted and a second update is run before testing commences. Should updating fail to complete successfully, and the failure is made obvious, the lab team may make further efforts to force an update; if the failure is not clearly reported (or if updates cannot be effected with reasonable effort), the product will be tested as is.

Version information details are recorded before and after each test run, but update settings are left at the defaults and additional updates may occur during the test run, depending on the design of the product. Where applicable, tools are run to confirm a functioning connection to online lookup systems, and the results recorded for later reference.

Malware detection measures

Currently, the malware detection rates recorded in our reports cover only static detection of inactive malware present on the hard drive of the test system, not active infections or infection vectors.

For on-demand tests, products are directed to scan sample sets using a context-menu or ‘right-click’ scan option. If this is not available, the standard on-demand scan from the product interface is used. If this facility proves unsuitable for a given test, any available command-line scanning tool may be used as a last resort.

In all cases the default settings are used, with the exception of automatic cleaning/quarantining/removal, which is disabled where possible, and logging options, which are adjusted where applicable to ensure the full details of scan results are kept for later processing.

In on-access measures, sample sets are accessed using bespoke tools which trigger the on-read protection capabilities of products to check and, where necessary, block access to malicious files. Again, automatic cleaning and removal is disabled where possible. For solutions which provide on-write but not on-read detection, sample sets are copied from one partition of the test system to another, or written to the test system from a remote machine. In the case of solutions which offer on-read detection but default to other methods, settings may be changed to enable on-read for malicious test sets to facilitate testing.

It is important in this setting to understand the difference between detection and protection. The results we report show only the core detection capabilities of traditional malware technology. Many of the products under test may offer additional protective layers to supplement this, including but not limited to: firewalls, spam filters, web and email content filters and parental controls, software and device whitelisting, URL and file reputation filtering including online lookup systems, behavioural/dynamic monitoring, HIPS, integrity checking, sandboxing, virtualization systems, backup facilities, encryption tools, data leak prevention and vulnerability scanning. The additional protection offered by these diverse components is not measured in our tests. Users may also obtain more or less protection than we observe by adjusting product settings to fit their specific requirements.

Performance measures

The performance data included in our tests is intended as a guide only, and should not be taken as an indicator of the exact speeds and resource consumptions a user can expect to observe on their own systems. Much of the data is presented in the form of relative values compared to baselines recorded while performing identical activities on identical hardware, and is thus not appropriate for inferring specific performances in other settings; it should instead be used to provide insight into how products perform relative to one another.

An automated system performs a suite of standard activities, including launching popular applications, installing and uninstalling packages, moving and copying files around the system, archiving files and extracting from archives, downloading files from a web server and so on. The sample files and applications used are selected to represent the types of items most commonly used in normal, everyday situations, for both business and home users.

System resource usage is measured using the Windows performance monitor tool. Levels of memory and CPU usage are recorded every five seconds during each of several tasks. Data from the on-access speed test periods plus an additional on-access run over the system partition, as well as from the automated suite of activities, is used for the ‘heavy file access’ measures, and periods of inactivity for the ‘idle system’ measures.

During all these measures the solution’s main interface, a single instance of Windows Explorer and a single command prompt window are open on the system, as well as any additional windows required by the testing tools. The results are compared with baseline figures obtained during the same baseline test runs as used for the on-access speed calculations, to produce the final results showing the percentage increase in resource usage during the various activities covered.
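The percentage-increase calculation described above is simple arithmetic; as an illustrative sketch (the sample values below are invented, not real test data):

```python
def pct_increase(measured: float, baseline: float) -> float:
    """Percentage increase in resource usage (RAM or CPU) over the
    baseline recorded during identical activity on identical hardware."""
    return (measured - baseline) / baseline * 100

# Hypothetical memory-usage samples (MB), polled every five seconds
# during a task, compared against an averaged unprotected baseline:
idle_samples = [410, 415, 405, 410]
baseline_idle = 400
avg = sum(idle_samples) / len(idle_samples)
print(round(pct_increase(avg, baseline_idle), 2))  # 2.5
```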

In our server tests only, an additional set of more technical performance measures is recorded to provide more detailed insight into how the operating speed of a system is affected by security solutions.

On-demand speed figures are presented as a simple throughput rate, determined by measuring the length of time taken to scan a standard set of clean sample files using the standard on-demand scan from the product interface. The size of the sample set is divided by the time taken to scan it, resulting in a value in megabytes of data processed per second (MB/s).

On-access speeds are gathered by running a file-opening tool over the same sets; speeds are recorded by the tool and compared with the time taken to perform the same action on an unprotected system (these baselines are taken several times and an average baseline time is used for all calculations). The difference in the times is divided by the size of the sample set, to give the additional time taken to open the samples in seconds per megabyte of data (s/MB).
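The two speed figures described above reduce to straightforward formulas; a minimal sketch (the set size and timings below are hypothetical):

```python
def on_demand_throughput(set_size_mb: float, scan_time_s: float) -> float:
    """On-demand speed: megabytes of data scanned per second (MB/s)."""
    return set_size_mb / scan_time_s

def on_access_overhead(set_size_mb: float, protected_time_s: float,
                       baseline_time_s: float) -> float:
    """On-access overhead: additional seconds taken to open the samples
    per megabyte of data (s/MB), relative to the averaged baseline time
    recorded on an unprotected system."""
    return (protected_time_s - baseline_time_s) / set_size_mb

# Invented figures for a 2,000 MB clean sample set:
print(on_demand_throughput(2000, 80))      # 25.0 MB/s
print(on_access_overhead(2000, 150, 100))  # 0.025 s/MB
```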

Both on-demand and on-access measures are taken with the default settings, with an initial ‘cold’ measure showing the products' performance on first sight of the sample sets and ‘warm’ measures showing the average of several subsequent scans over the same test sets. This provides an indication of whether products are using smart caching techniques to avoid re-scanning items that have already been checked.

An additional run is performed with the settings adjusted, where possible, to include all file types and to scan inside archive files. This is done to allow closer comparison between products with more or less thorough settings by default.

The level of settings used by default and available is shown in the archive type table. These results are based on scanning and accessing a set of archives in which the Eicar test file is embedded at different depths. An uncompressed copy of the file is also included in the sample set, with its file extension changed to a random one not used by any executable file type to determine whether solutions rely on file extensions to decide whether or not to check a given item.
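As an illustration of how such a test set might be constructed (this is a sketch, not the lab's actual tooling), the standard EICAR test string can be embedded in zip archives nested to increasing depths:

```python
import io
import zipfile

# The industry-standard EICAR test string: harmless, but detected by
# convention as a test 'virus' by anti-malware products.
EICAR = (r"X5O!P%@AP[4\PZX54(P^)7CC)7}$"
         r"EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*")

def nested_eicar_zip(depth: int) -> bytes:
    """Return a zip archive with eicar.com embedded `depth` levels deep."""
    payload, name = EICAR.encode("ascii"), "eicar.com"
    for level in range(depth):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr(name, payload)
        payload, name = buf.getvalue(), f"level{level + 1}.zip"
    return payload

# Build archives at depths 1-5, kept in memory here:
archives = {d: nested_eicar_zip(d) for d in range(1, 6)}
```

The uncompressed copy with a randomised extension, mentioned above, would simply be the same string written to a plain file; it is omitted from this sketch.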

Stability measures

The aim of the stability rating system is to provide a guide to the stability and quality of products participating in VB100 comparative reviews. It is designed only to cover areas of product performance observed during VB100 testing, and all bugs noted must be observed during the standard process of carrying out VB100 comparative tests; thus all products should have an equal chance of displaying errors or problems present in the areas covered.

Bug classification

Bugs and problems are classified as very minor, minor, serious, severe and very severe. The following is an incomplete list of examples of each category:

Very minor – Error messages displayed but errors not impacting product operation or performance;

Minor – Minor (non-default) product options not functioning correctly; product interface unresponsive for brief periods (under 30 seconds);

Serious – Scan crashes or freezes; product interface freezes, or becomes unresponsive for long periods (over 30 seconds, with protection remaining active); scans fail to produce accurate reporting of findings; product ignores configuration in a way which could damage data;

Severe – System unresponsive; system requires reboot; protection disabled or rendered ineffective;

Very severe – BSOD; system unusable; product non-functional.

Bugs will be counted as the same issue if a similar outcome is noted under similar circumstances. For each bug treated as unique, a raw score of 1 point will be accrued for very minor problems, 2 points for minor problems, 5 points for serious problems, 10 points for severe problems and 20 points for very severe problems. These raw scores will then be adjusted depending on two additional factors:

Bug repeatability

All issues should be double-checked to test reproducibility. Issues will be classed as “reliably reproducible” if they can be made to re-occur every time a specific set of circumstances is applied; “partially reproducible”, if the problem happens sometimes but not always in similar situations; “occasional”, if the problem occurs in less than 10% of similar tests; and “one-off” if the problem does not occur more than twice during testing, and not more than once under the same or similar circumstances. One-off and occasional issues will have a points multiplier of x0.5; reliably reproducible issues will have a multiplier of x2.

Bug circumstances

As some of our tests apply unusually high levels of stress to products, this will be taken into account when calculating the significance of problems. Those that occur only during high-stress tests using unrealistically large numbers of malware samples will be given a multiplier of x0.5.

Product classification

The system classifies products on a five-level stability scale. The five labels indicate the following stability status:

Solid (0 points) – the product displayed no issues of any kind during testing.

Stable (0.1-4.9 points) – a small number of minor or very minor issues were noted, but the product remained stable and responsive throughout testing.

Fair (5-14.9 points) – a number of minor issues, or very few serious but not severe issues, but none that threatened to compromise the functioning of the product or the usability of the system.

Buggy (15-29.9 points) – many small issues, several fairly serious ones or a few severe problems were observed; the product or system may have become unresponsive or required rebooting when under heavy stress.

Flaky (30+ points) – a number of serious or severe issues compromised the operation of the product, or rendered the test system unusable.
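Putting the raw scores, multipliers and classification bands together, the scheme can be sketched as follows. The example bug list is invented, and the x1.0 multiplier for "partially reproducible" issues is an assumption (no multiplier is stated for that class above):

```python
# Raw points per severity class, as described above.
SEVERITY_POINTS = {"very minor": 1, "minor": 2, "serious": 5,
                   "severe": 10, "very severe": 20}

# Reproducibility multipliers; x1.0 for "partially reproducible"
# is assumed, as no multiplier is stated for that class.
REPRO_MULTIPLIER = {"one-off": 0.5, "occasional": 0.5,
                    "partially reproducible": 1.0,
                    "reliably reproducible": 2.0}

def bug_score(severity: str, reproducibility: str,
              high_stress_only: bool = False) -> float:
    """Adjusted score for one unique bug."""
    score = SEVERITY_POINTS[severity] * REPRO_MULTIPLIER[reproducibility]
    if high_stress_only:  # seen only in unrealistically large sample runs
        score *= 0.5
    return score

def classify(total: float) -> str:
    """Map a total points score to the five-level stability label."""
    if total == 0:
        return "Solid"
    if total < 5:
        return "Stable"
    if total < 15:
        return "Fair"
    if total < 30:
        return "Buggy"
    return "Flaky"

# Hypothetical product: one reliably reproducible minor issue (2 x 2 = 4)
# plus one one-off serious crash in a high-stress run (5 x 0.5 x 0.5 = 1.25).
total = (bug_score("minor", "reliably reproducible")
         + bug_score("serious", "one-off", high_stress_only=True))
print(total, classify(total))  # 5.25 Fair
```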

Sample selection and validation

The sample sets for the speed tests are built by harvesting all available files from a selection of clean systems and dividing them into categories of file types, as described in the test results. They should thus represent a reasonable approximation of the ratios of different types of files on a normal system. These files also form part of the false positive sample set; the remaining portion is made up of a selection of items from a wide range of sources, including popular software download sites, the download areas of major software development houses, software included on pre-installed computers, and media provided with hardware and popular magazines.

In all cases, packages used in the clean sets are installed on test systems to check for obvious signs of malware infiltration, and false positives are confirmed by solution developers prior to publication wherever possible. Samples used are rated for significance in terms of size of user base, and any item adjudged too obscure or rare is discarded from the set. The set is regularly cleaned of items considered too old to remain significant.

Samples used in the infected test set also come from a range of sources. The WildList samples used for the core certification set stem from the master samples maintained by the WildList Organization. These are validated in our own lab, and in the case of true viruses, only fresh replications generated by us are included in the test sets (rather than the original samples themselves).

Any set of polymorphic viruses used will include a range of complex viruses, selected either for their current or recent prevalence or for their interest value as presenting particular difficulties in detection; again all samples are replicated and verified in our own lab.

For the other sets, including the RAP sets, any sample gathered by our labs in the appropriate time period and confirmed as malicious by us is considered fair game for inclusion. Sources include the sharing systems of malware labs and other testing bodies, independent organizations and corporations, and individual contributors as well as our own direct gathering systems. All samples are marked with the date on which they are first seen by our lab. The RAP collection period begins ten days prior to the product submission deadline for each test, and runs until ten days after that deadline; the deadline date itself is considered the last day of 'week -1'.

The sample sets for the Response tests are compiled from samples seen in the week prior to each test run. As this test is performed with a live web connection and not all products under test can be run in parallel, each product may be exposed to a different set of samples, or to the same set of samples at a different time of day. To counter any bias this may introduce, the test is run multiple times during the test period, with the order of products arranged to ensure none is given any time advantage, and average scores are calculated. Samples are classified and filtered to try to ensure reasonable equivalency between sets.

All samples are verified and classified in our own labs using both in-house and commercially available tools. To be included in our test sets all samples must satisfy our requirements for malicious behaviour; adware and other ‘grey’ items of potentially unwanted nature are excluded from both the malicious and clean sets as far as possible.

Reviews and comments

The product descriptions, test reports and conclusions included in the comparative review aim to be as accurate as possible to the experiences of the test team in running the tests. Of necessity, some degree of subjective opinion is included in these comments, and readers may find that their own feelings towards and opinions of certain aspects of the solutions tested differ from those of the lab test team. We recommend reading the comments, conclusions and additional information in full wherever possible.

