VB ESA - M365 Methodology

How the VB ESA - M365 comparative tests are carried out

Overview

Introduction

VB ESA - M365 is Virus Bulletin’s continuously running performance test programme for solutions that supplement Microsoft 365’s native security (Exchange Online Protection (EOP)[1]) by adding extra detection layers. Products can participate in the programme either publicly or privately; this methodology documents the full details of the test programme in both cases.

 

Test objective

Public VB ESA - M365 tests are designed to quantify the changes in email filtering performance that result from deploying tested solutions alongside Microsoft 365, and to compare performance metrics among the evaluated products.

Outside the public test periods (and for privately tested products), VB seeks to provide participating vendors with continuous feedback about the performance of their product.

 

Suitable products

  • Integrated Cloud Email Security (ICES): solutions that integrate API-first, without MX record changes. They monitor mailboxes, detect post-delivery threats, and automate remediation.
  • Secure Email Gateways (SEGs) for Microsoft 365: solutions that sit in front of Microsoft 365 via MX routing.

 

The comparative base

The Microsoft 365 comparative base is used to establish a common reference against which the additional detection and filtering capabilities of Microsoft 365 email security add-ons are measured. The comparative base represents email samples that were handled by Microsoft 365 alone and therefore fall outside the scope of evaluation for the tested add-on products.

A sample is included in the comparative base if it meets the following conditions:

  • Microsoft 365 assigned the sample one of the following status values, observed across all test instances in which the tested add-on solutions were deployed:
    • Failed
    • FilteredAsSpam
  • Microsoft 365 assigned the sample the following status in at least one test instance:
    • Quarantined

 

Relationship between baseline and product results

Product effectiveness is measured only on samples that are not part of the comparative base.

The samples included in the baseline are excluded from product scoring and do not contribute to detection or filtering metrics for the tested add-ons.

 

Interpretation and limitations

The Microsoft 365 comparative base is a synthetic construct created for the purpose of this test and should not be interpreted as a definitive measure of Microsoft 365 performance in isolation.

The following limitations apply:

  • The baseline reflects Microsoft 365 behaviour under the specific configurations and licensing used in the test.
  • Aggregation across multiple instances is used to ensure consistency and attribution clarity and may not correspond to the behaviour of a single production tenant.
  • Microsoft 365 filtering decisions may vary over time due to model updates and threat intelligence changes.

Despite these limitations, the comparative base provides a stable and transparent reference that enables meaningful comparison of add-on performance.

 

Test outline

The test exposes each tested product to both unwanted and legitimate emails, and records the product’s response to these emails.

 

Test cases (emails)

VB utilizes a variety of email sources for test cases:

  • Spam emails
    • Third-party real-time email feeds: a typical example would be Project Honey Pot and other similar services.
    • Virus Bulletin’s own threat intelligence: emails collected through our own spam traps and through other means.
  • Legitimate emails
    • Ham emails: email discussion list emails.
    • Newsletters: both commercial and non-commercial opt-in newsletters.

All emails used in the test were collected in the wild. Virus Bulletin does not create new emails, e.g. to simulate spear-phishing tactics. Some modifications to the in-the-wild emails are necessary to facilitate testing and to protect intellectual property, as detailed later on in this document.

Emails are forwarded to the tested product without undue delay upon receipt by our threat intelligence, in order to stay as close to real time as possible.

The legitimate emails used are predominantly written in English, whereas unwanted emails represent a wide variety of languages.

Note that full solutions and complementary solutions may be subjected to a slightly different mix of emails when there is a potential conflict of interest (for example, if the vendor that supplies the email feed also has its products publicly tested).

Emails are sorted into the following categories and subcategories:

  • Legitimate emails
    • Ham
    • Newsletters
  • Spam emails
    • Assorted: any unwanted email that does not fit into any of the more specific categories outlined below.
    • Phishing: spam emails containing a link that leads either to malware or to an attempt to steal credentials.
    • Malware: emails with an attachment that is either malware itself or would likely download malware. It is possible that a password (often present in the email) would need to be entered in order for the malware to be executed.

 

Test results

The products’ responses are referenced against the respective body of the test cases and are sorted into the following categories:

  • True positive = unwanted test case identified as unwanted (e.g. spam)
  • True negative = legitimate test case identified as legitimate (legitimate email treated as such)
  • False positive = legitimate test case identified as unwanted (false alarm)
  • False negative = unwanted test case identified as legitimate (unwanted email missed)
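
The four outcome categories follow directly from two yes/no facts: whether the test case is unwanted, and whether the product treated it as unwanted. A minimal sketch, with a hypothetical function name:

```python
# Hypothetical sketch: map a test case and a product response to an outcome.

def classify_outcome(is_unwanted, flagged_as_unwanted):
    """Return the result category for one test case."""
    if is_unwanted:
        # Unwanted email: caught is a true positive, missed is a false negative.
        return "true positive" if flagged_as_unwanted else "false negative"
    # Legitimate email: flagged is a false positive, passed is a true negative.
    return "false positive" if flagged_as_unwanted else "true negative"
```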

Definitions

Incremental Detection Rate (IDR)

The Incremental Detection Rate is the proportion of malicious samples not filtered by Microsoft 365 that are detected by the tested add-on.

The Incremental Detection Rate is calculated as:

IDR = (Number of malicious samples detected by the add-on) ÷ (Total number of malicious samples not filtered by Microsoft 365)

Where:

  • ‘Detected’ refers to any action by the tested add-on that prevents delivery or places the message under administrative control (e.g. block or quarantine).
  • The denominator includes only spam samples that passed Microsoft 365 filtering and were therefore outside the comparative base.
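
As a worked illustration of the formula above, the calculation can be sketched as follows. The function name is hypothetical, and returning None for an empty denominator is an assumption made here for safety; the methodology does not state how that case is handled.

```python
# Hypothetical sketch of the IDR calculation described above.

def incremental_detection_rate(detected_by_addon, not_filtered_by_m365):
    """Both arguments are sample counts outside the comparative base."""
    if not_filtered_by_m365 == 0:
        # Assumption: the rate is undefined when no samples passed Microsoft 365.
        return None
    return detected_by_addon / not_filtered_by_m365
```

For example, an add-on that detects 950 of 1,000 samples that passed Microsoft 365 filtering has an IDR of 0.95, i.e. 95%.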

Residual sample set

The residual sample set is the set of email samples that pass Microsoft 365 filtering and are therefore eligible for evaluation by the tested solutions.

 

Award criteria

In addition to certification, the test recognises exceptional performance through a series of awards and badges. These distinctions are intended to highlight products that demonstrate outstanding incremental protection beyond Microsoft 365’s native filtering, while maintaining strict control over false positives.

All awards are evaluated exclusively on samples outside the Microsoft 365 comparative base.

Awards:

  • VB ESA - M365: Spam IDR ≥ 80%; False Positives ≤ 1; Newsletter FPs ≤ 3.
  • VB ESA - M365+: Spam IDR ≥ 95%; False Positives 0; Newsletter FPs 0.

Badges:

  • VB ESA - M365 Top Performer: highest Spam IDR; False Positives 0; Newsletter FPs 0.
  • VB ESA - M365 Phishing 100: Phishing FNs 0; False Positives 0; Newsletter FPs 0.
  • VB ESA - M365 Malware 100: Malware FNs 0; False Positives 0; Newsletter FPs 0.
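
The thresholds above can be expressed as a simple check. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the Spam IDR is passed as a fraction (0.80 for 80%), and the function lists every set of criteria that is met, since the criteria do not state whether the two award tiers are mutually exclusive. The Top Performer badge is omitted because it depends on a comparison across all tested products.

```python
# Hypothetical sketch: list the awards and badges whose criteria are met.

def awards_and_badges(spam_idr, false_positives, newsletter_fps,
                      phishing_fns, malware_fns):
    earned = []
    # Several distinctions require zero false positives of either kind.
    clean = false_positives == 0 and newsletter_fps == 0
    if spam_idr >= 0.80 and false_positives <= 1 and newsletter_fps <= 3:
        earned.append("VB ESA - M365")
    if spam_idr >= 0.95 and clean:
        earned.append("VB ESA - M365+")
    if phishing_fns == 0 and clean:
        earned.append("VB ESA - M365 Phishing 100")
    if malware_fns == 0 and clean:
        earned.append("VB ESA - M365 Malware 100")
    return earned
```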

 

Testing procedure

Overview

The product lifecycle in VB ESA - M365 begins with an initial product setup, typically done in cooperation with the vendor.

This is followed by continuous testing for the designated testing period. For publicly tested products that join the test on a commercial basis, the designated testing period is approximately one year, during which four shorter periods are designated as official test periods. Data obtained from the official test periods serve as the basis for public, comparative test results and certification.

 

Test environment

The test environment uses a Microsoft 365 tenant with one licensed user account on Microsoft 365 Business Basic or Microsoft 365 Business Standard. The assessment focuses on email filtering capabilities; these licences include Exchange Online Protection (EOP)[2] and do not include Defender for Office 365[3].

In order to allow the uninterrupted flow of emails from VB’s server, a connector[4] is configured in the “Exchange admin center” for each of the tested solutions.

 

Introducing test cases to the test environment

Test cases (emails) are sent continuously to the tested products.

Each email is delivered in an individual SMTP transaction. Virus Bulletin seeks to keep modifications to the original, in-the-wild versions of the emails to a minimum; however, a number of differences between the original and in-test emails are inevitable. Some of these changes are equivalent to having a front-end SMTP server that rewrites the recipient information:

  • SMTP connections will be made from an IP address belonging to Virus Bulletin, instead of the original sending IP.
  • The SMTP transaction HELO/EHLO domain will be overwritten by that of the actual sending host in the Virus Bulletin infrastructure. Note that the domain will be preserved in the Received: header, as described later on in this section.
  • The SMTP transaction recipient (also known as the SMTP envelope recipient) will be rewritten to user@vbspamtest.com, where “user” is an ID assigned by Virus Bulletin.

The more invasive changes are:

  • Any references to the original spam trap within the email MIME will be replaced in the same manner as with the SMTP transactions. Note that this might break any digital signatures, most notably DKIM. This replacement affects both the MIME header and the MIME parts and happens both at mailbox user level (e.g. the subject “Notice for <spam-recipient>” may become “Notice for <rewritten-user>”) and domain level (e.g. “Your mailbox at <spamtrap-domain> is suspended” may become “Your mailbox at vbspamtest.com is suspended”).
  • Ham emails – which commonly originate from mailing lists – will be re-engineered to appear as if they were sent to the vbspamtest.com domain directly.
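
The replacement of spam trap references can be pictured with a short sketch. It is an illustration only: the function name, the case-insensitive matching and the order of the two replacements are assumptions, and the actual rewriting performed by Virus Bulletin may differ.

```python
import re

# Hypothetical sketch: replace references to the original spam trap address.
# Assumptions: matching is case-insensitive, and the full address is replaced
# before any remaining bare domain references.


def rewrite_trap_references(text, trap_user, trap_domain, new_user,
                            new_domain="vbspamtest.com"):
    def swap(old, new, source):
        # A function replacement avoids backslash interpretation in `new`.
        return re.sub(re.escape(old), lambda _match: new, source,
                      flags=re.IGNORECASE)

    # Mailbox user level: the full address of the original recipient.
    text = swap(f"{trap_user}@{trap_domain}", f"{new_user}@{new_domain}", text)
    # Domain level: any remaining mention of the spam trap domain.
    return swap(trap_domain, new_domain, text)
```

A rewrite of this kind changes the message body and headers, which is why signatures such as DKIM may no longer validate afterwards.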

Some metadata about the original email will be retained through a Received: header, again as if the email were received first by a front-end SMTP server. A single new Received: header will be inserted into the email, in the following fashion:

Received: from <original-reverse-dns> (HELO <original-helo-domain> [<originating-ip>]) by <vb-mta>
(<vb-mta-software>) with [E]SMTP id <vb-message-id>; <date>

where

  • <original-reverse-dns> is the reverse DNS of the original sending IP, if available, otherwise the literal “Unknown”.
  • <original-helo-domain> is the original HELO/EHLO command argument domain.
  • <vb-mta> is the FQDN of the Virus Bulletin SMTP server that received the email.
  • <vb-mta-software> is the standard (and cosmetic) label for the SMTP server used by Virus Bulletin.
  • <vb-message-id> is Virus Bulletin’s own message ID for identifying the email and the tested product (it is not the original Message-ID).
  • <date> is the date and time of receipt.
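
The header template above can be assembled as in the following sketch. The function and parameter names are hypothetical, the date is formatted in the usual RFC 5322 style as an assumption, and the header is produced on a single line without folding.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical sketch: build the inserted Received: header from its parts.


def build_received_header(reverse_dns, helo_domain, origin_ip, vb_mta,
                          mta_software, message_id, received_at, esmtp=True):
    # Fall back to the literal "Unknown" when no reverse DNS is available.
    rdns = reverse_dns or "Unknown"
    protocol = "ESMTP" if esmtp else "SMTP"
    return (
        f"Received: from {rdns} (HELO {helo_domain} [{origin_ip}]) "
        f"by {vb_mta} ({mta_software}) with {protocol} id {message_id}; "
        f"{format_datetime(received_at)}"
    )


# Example with placeholder values (documentation-reserved names and addresses).
example = build_received_header(
    None, "mail.example.com", "192.0.2.1", "mta.example.net",
    "ExampleMTA", "abc123",
    datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
)
```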

 

Recording product response

A tested product receives test case emails from the VB infrastructure and the product’s response is recorded. These responses are sorted into two categories:

  • “Spam” – the product declared the email to be unwanted.
  • “Ham” – the product either did not act on the email, or explicitly designated it to be a “wanted” email.

For Integrated Cloud Email Security (ICES) the following rules are applied:

  • The email will be considered to be marked as ‘spam’ if:
    • The email was read from the ‘Junk email’ mailbox.
    • The email was read from a custom mailbox where the blocked emails are added by the tested product.
  • The email will be considered to be marked as ‘ham’ if:
    • Microsoft 365’s status for the email is Delivered.
    • Neither of the ‘spam’ responses was observed.
    • The email was read from the ‘Inbox’ mailbox.
    • The email was read from a custom mailbox where the legitimate emails are added by the tested product.

For Secure Email Gateways (SEGs) for Microsoft 365 the following rules apply:

  • The email will be considered to be marked as ‘spam’ if:
    • An SMTP 5xx error occurs while the email is being sent to the product.
    • The SMTP transaction fails due to repeated SMTP 4xx transient errors, host unavailability or transaction interruption, and the retry attempts (up to six, at 20-minute intervals) are exhausted.
    • The email was read from the ‘Junk email’ mailbox.
    • The email was read from a custom mailbox where the blocked emails are added by the tested product.
  • The email will be considered to be marked as ‘ham’ if:
    • Microsoft 365’s status for the email is Delivered.
    • None of the ‘spam’ responses were observed.
    • The email was read from the ‘Inbox’ mailbox.
    • The email was read from a custom mailbox where the legitimate emails are added by the tested product.
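
The rules for Secure Email Gateways can be sketched as a small decision function. This is an illustration under stated assumptions: the names are hypothetical, each delivery attempt is represented by its final SMTP reply code (or None when the host was unavailable or the transaction was interrupted), "up to six" is read as six retries after the initial attempt, and the custom mailbox name is a placeholder.

```python
# Hypothetical sketch: derive a binary verdict for a Secure Email Gateway.

MAX_RETRIES = 6  # assumption: six retries follow the initial attempt
SPAM_MAILBOXES = {"Junk email", "vendor-blocked"}  # second name is a placeholder


def seg_verdict(attempt_codes, mailbox=None):
    """attempt_codes: one SMTP reply code (or None) per delivery attempt."""
    for code in attempt_codes:
        if code is not None and 500 <= code <= 599:
            return "spam"  # permanent rejection by the product
        if code is not None and 200 <= code <= 299:
            break  # the product accepted the email
    else:
        # No attempt was accepted: only transient failures so far.
        if len(attempt_codes) > MAX_RETRIES:
            return "spam"  # retries exhausted
        return "pending"  # further retries are still due
    # Accepted: the verdict depends on where the email was read from.
    return "spam" if mailbox in SPAM_MAILBOXES else "ham"
```

The "pending" value is not a test result; it only marks emails for which a retry is still outstanding in this sketch.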

 

Test case validation

To closely simulate real-time conditions, emails from real-world sources are promptly introduced into the testing environment without unnecessary delay. Despite employing the best threat intelligence and meticulously crafted automation, some of the emails received by the tested products may not be relevant for the test. Therefore, Virus Bulletin regularly reviews and validates test case emails, discarding any that are deemed unsuitable.

Note that in this validation process, many perfectly suitable emails may also be discarded due to the limited capacity for manual review.

 

Feedback

As a general rule, feedback is provided to the participating products on a weekly basis. No feedback is given during the official test periods; participants will be given advance warning of any other interruption to feedback (e.g. due to holidays or maintenance periods). The feedback provided is non-comparative by nature, i.e. the feedback by itself is not suitable to determine how a product ranks against other products in the test.

This feedback is for the vendor’s own information only, and sharing of the details publicly either by the vendor or by Virus Bulletin is not permitted.

Feedback includes:

  • performance metrics on the test case bodies
  • speed measurements, where applicable
  • test cases for false negatives and false positives, including
    • emails, as sent to the product (MIME)
    • email transaction logs
    • the header of the email as it was returned by the product, if applicable.

Note that Virus Bulletin may cap the number of emails shared at 400 per email category.

 

Disputes

Disputes may be submitted at any time. For the official test period, however, Virus Bulletin requires that public test participants submit their disputes within 10 business days of receiving the feedback, to ensure timely publication of the public report.

Disputes are evaluated on a case-by-case basis. The vendor is asked to provide supporting data or evidence, if any, along with their dispute. Although all efforts will be made to resolve disputed issues to the satisfaction of all parties, Virus Bulletin reserves the right to make the final decision.

To reflect the broad nature of real-life issues, the scope of the disputes is not limited.

 

Policies

Product build and configuration policies

Public test reports can only be representative for the reader if the testing is conducted using publicly available product versions. For this reason, vendors are not permitted to use any enhancement of the product that is not available for general audiences. We encourage (but do not require) vendors to share with the public the configuration used for their product in the test.

Tests are usually conducted with the latest generally available version of a product or service being tested. Deviations from this policy will be documented in the report.

Private tests are not subject to these constraints.

For both public and private product testing, credentials must be provided for a Microsoft 365 account with a Microsoft 365 Business Basic or Microsoft 365 Business Standard licence. This account must have administrator rights and be configured with the specific protection offered by the product being tested.

 

Binary classification mapping

Email security products often classify emails into various categories.

However, the VB ESA - M365 test relies on binary classification – there is either a hit on a test case (“email not wanted”), or there is no hit (“legitimate”). It is the product vendor’s responsibility to map their own classifications into either category above. In lieu of such mapping from the vendor, the Virus Bulletin test team will endeavour to set up the mapping themselves.
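
A mapping of this kind can be as simple as a lookup table. The vendor labels below are hypothetical examples; each vendor's actual classifications and their mapping will differ.

```python
# Hypothetical sketch: map a vendor's own labels to the binary test verdicts.

VERDICT_MAP = {
    "clean": "ham",
    "spam": "spam",
    "phishing": "spam",
    "malware": "spam",
}


def to_binary(vendor_label):
    # A KeyError for an unknown label makes unmapped classifications visible.
    return VERDICT_MAP[vendor_label.strip().lower()]
```

Raising an error for unknown labels, rather than defaulting to either category, is a design choice in this sketch so that a missing mapping is noticed instead of silently counted.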

 

Withdrawal from the test (opting out)

Products participating in the public test cannot be withdrawn from the test once the official test period has started. Public interest dictates that a test report is to be published, regardless of whether or not it is favourable from the vendor’s perspective.

However, Virus Bulletin may, at its discretion, allow withdrawal of a product in extraordinary circumstances, when compelling reasons suggest that its inclusion in the report would bear no relevance to the public. Examples of such situations are: collected data is proven to be tainted by lab-specific technical issues; significant testing errors have occurred, such as deviations from protocol, etc. Note that technical issues that impact not just the particular test environment but a wider user base (e.g. cloud outage, faulty rule updates, etc.) at the time of the test do not qualify as a basis for withdrawal and Virus Bulletin may proceed to publishing the report.

 

Technical issue resolution

Virus Bulletin pledges to work with the vendor to resolve technical issues with the product and notify the vendor as soon as possible when such issues are detected.

 

Vendor commentary

Prior to the publication of a report, the vendor of the product may choose to provide commentary to be included in the report notes. This is to ensure that the vendor’s perspective receives a fair representation. Such commentaries can be useful when the report contents are disputed by the vendor. Commentaries are subject to reasonable length limits and editorial approval.

 

Product audit

Vendors of full email security solutions are provided with remote access to their products and may audit their configuration, state, or logs at any time.

 

Participant inclusion

The public test series features participants that have signed up and opted to be included in public testing. Virus Bulletin may also choose to include products at its own discretion. For this latter type of products, Virus Bulletin commits to:

  • Extend an invitation to the vendor of the product and offer to adopt a voluntary participant status, with ample time allowed for the vendor to consider and prepare for the public test. Voluntary participant status grants the vendor the same range of rights and level of service as other vendors enjoy, but only for the duration of the process required to set up, test and dismantle the product.
  • For products that do not adopt a voluntary participant status, Virus Bulletin will:
    • show diligence and reasonable care during the testing process.
    • allow the vendor to include vendor commentary, as described in the policies above.

 

Changelog

Version 1.0

  • Initial form of the methodology.

 
