Beyond lexical and PDNS: using signals on graphs to uncover online threats at scale

Friday 6 October 14:00 - 14:30, Red room

Dhia Mahjoub (Cisco Umbrella (OpenDNS))
David Rodriguez (Cisco Umbrella (OpenDNS))

At their core, botnet domains are not necessarily defined by their lexical features or by their query volume. Botnets are graph-based. From a sequence of DNS graphs, one can mine subgraphs distinctive of botnets spreading malware, harvesting credentials, or delivering DDoS attacks to cripple high-value online assets.

Typical methods for detecting botnets using DNS leverage either 1) lexical characteristics of domain names (n-gram entropy, perplexity), or 2) traffic and static graph properties (measuring bursts in query volume and the similarity of inter-client traffic, respectively). These insights build on the characteristics of algorithmically generated domains (AGDs) but miss the temporal nature of machines surfing the Internet: graphs change from one time window to the next.
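To make the lexical baseline concrete, here is a minimal sketch of an n-gram entropy feature. It is an illustration only and is not taken from the talk; the function name `char_entropy` and the choice of Shannon entropy in bits over character n-grams are assumptions.

```python
import math
from collections import Counter


def char_entropy(label: str, n: int = 1) -> float:
    """Shannon entropy (in bits) of the character n-grams in a domain label.

    Hypothetical helper for illustration; real detectors typically combine
    several lexical features rather than relying on this one alone.
    """
    # Slide a window of width n over the label to collect its n-grams.
    grams = [label[i:i + n] for i in range(len(label) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    total = len(grams)
    # H = -sum(p * log2(p)) over the empirical n-gram distribution.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A label made of one repeated character scores zero, while a label whose characters are all different scores the maximum for its length. Short, random-looking AGD labels tend to sit closer to that maximum, which is the property a lexical classifier relies on.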

In this talk, we propose a novel method unifying the interactions between client machines, hostnames and hosting IPs by building a tripartite graph consisting of tens of millions of vertices and edges. We then propose methods to represent a sequence of graphs as signals to be mined in order to detect botnet attacks and online threats in general.
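As a rough illustration of the idea, the sketch below builds one small tripartite graph per time window and reads a single per-hostname signal off the sequence. The record layout `(window, client, hostname, ip)`, the function names, and the choice of "distinct clients per window" as the signal are all assumptions made for this example; the abstract does not specify the representation used in the talk, and a real deployment at tens of millions of vertices would need a distributed implementation.

```python
from collections import defaultdict


def build_tripartite(records):
    """Group DNS records into one tripartite edge set per time window.

    records: iterable of (window, client, hostname, ip) tuples
    (a hypothetical schema chosen for this sketch).
    """
    graphs = defaultdict(lambda: {"client_host": set(), "host_ip": set()})
    for window, client, hostname, ip in records:
        g = graphs[window]
        g["client_host"].add((client, hostname))  # client -> hostname edge
        g["host_ip"].add((hostname, ip))          # hostname -> hosting IP edge
    return dict(graphs)


def hostname_signal(graphs, hostname):
    """Distinct clients querying `hostname` in each window, in window order."""
    signal = []
    for window in sorted(graphs):
        clients = {c for c, h in graphs[window]["client_host"] if h == hostname}
        signal.append((window, len(clients)))
    return signal


records = [
    (0, "c1", "a.example", "192.0.2.1"),
    (0, "c2", "a.example", "192.0.2.1"),
    (1, "c1", "a.example", "192.0.2.2"),
    (1, "c2", "b.example", "192.0.2.3"),
]
graphs = build_tripartite(records)
print(hostname_signal(graphs, "a.example"))
```

The returned list is a small time series, so any signal-processing or change-detection method can be applied to it from one window to the next.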

As our first use case, we set aside lexical features and move beyond traditional degree and centrality graph metrics. Instead, we pair client machines with hostnames and reveal that a bot in a botnet is characterised by three things: 1) the variety of hostnames it queries, 2) the popularity of those hostnames, and 3) the frequency with which the bot repeats itself. Using Hadoop technologies, we show that these signals are scalable (to the millions) and distinguish Necurs, Conficker, Suppobox, PykSpa, and more.
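The three per-client signals can be sketched on a single machine as follows. The exact definitions here are assumptions made for illustration, not the formulas from the talk: variety is the number of distinct hostnames a client queries, popularity is the mean number of distinct clients querying those hostnames, and repetition is the mean number of queries per distinct hostname. The name `client_signals` is hypothetical, and the talk computes its signals with Hadoop rather than in-memory Python.

```python
from collections import Counter, defaultdict


def client_signals(queries):
    """Compute illustrative variety/popularity/repetition signals per client.

    queries: iterable of (client, hostname) pairs, one pair per DNS query.
    """
    per_client = defaultdict(Counter)   # client -> hostname -> query count
    host_clients = defaultdict(set)     # hostname -> set of querying clients
    for client, hostname in queries:
        per_client[client][hostname] += 1
        host_clients[hostname].add(client)

    signals = {}
    for client, counts in per_client.items():
        variety = len(counts)
        # Mean number of distinct clients seen for the hostnames this client queries.
        popularity = sum(len(host_clients[h]) for h in counts) / variety
        # Mean number of times this client repeats a query for the same hostname.
        repetition = sum(counts.values()) / variety
        signals[client] = {
            "variety": variety,
            "popularity": popularity,
            "repetition": repetition,
        }
    return signals
```

Under these assumed definitions, a client cycling through many AGDs that almost no other machine queries would show high variety and low popularity, which is the kind of contrast the three signals are meant to expose.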

In our second use case, we tackle the difficult task of predicting the number of domains within a family of AGDs. Introducing a measure involving the popularity of domains and repetition of a bot, we can approximate the number of domains an ideal classifier should catch.

We also show how we combine botnet detection derived from monitoring hosting IP space (e.g. fast flux) with detection based on infected clients' behaviour. This provides a unified model to track botnet threats. In closing, we’ll explain how to monitor these threats using various forms of cohort analysis and analysis of variance techniques.

This talk should be useful to data analysts and security researchers, as our new methods have proved efficient and scalable at uncovering Internet-scale trends and tracking massive, highly dispersed threats.

Dhia Mahjoub

Dr Dhia Mahjoub is the Head of Security Research at Cisco Umbrella (OpenDNS). He leads the core research team focused on large-scale threat detection and threat intelligence and advises on R&D strategy. Dhia has a background in networks and security, has co-authored patents with OpenDNS and holds a Ph.D. in graph algorithms applied to wireless sensor network problems. He regularly works with prospects and customers and speaks at conferences worldwide, including Black Hat, Defcon, Virus Bulletin, BotConf, ShmooCon, FloCon, Kaspersky SAS, Infosecurity Europe, RSA, Usenix Enigma, ACSC, NCSC, and Les Assises de la sécurité.

@DhiaLite

David Rodriguez

David Rodriguez is a security researcher and data scientist at Cisco Umbrella (OpenDNS). He has co-authored multiple pending patents with Cisco in distributed machine learning applications centred around deep learning and behavioural analytics. He has an M.A. in mathematics from San Francisco State University and previously worked at Location Labs by Avast and Esurance. David has spoken at SAI Computing Conference 2016, Black Hat 2017, and at Data Science meetups in the Bay Area.


