VB2017 preview: Beyond lexical and PDNS (guest blog)

Posted by    on   Oct 5, 2017

In this special guest blog post, VB2017 Silver sponsor Cisco Umbrella writes about a paper that researchers Dhia Mahjoub and David Rodriguez will present at the conference this Friday.

 

In the past decade, detection of DGA (Domain Generation Algorithm) domains has relied primarily on lexical analysis of domain names, tracking of NX (non-resolving) domains, and malware reversing. The earliest works have been groundbreaking but since then, we have only observed small incremental improvements, combining machine learning techniques with sandbox-based analysis of DGA malware.

cisco-blog-fig1.png

Figure 1: Time-dependent user-domain interactions.

In a talk to be given by Cisco Umbrella researchers Dhia Mahjoub and David Rodrigeuz at VB2017 this Friday, we propose to completely reframe the problem and take advantage of a worldwide visibility into user-domain interactions via DNS. We introduce a novel approach to not only represent client-domain interactions as a bipartite graph but also carefully study the evolution of the topological properties of this graph over time, hence the concept of 'time series on graphs'.

cisco-blog-fig2.png 

Figure 2: A client (a yellow triangle) querying domains (grey circles) at different rates at different times.

We unravel these time-dependent graphs by tracking bots surfing the Internet. What we mean by 'bot', is a client machine infected with malware, controlled by some other machine(s). We then study how these machines query domains on the Internet. Reciprocally, we study domains receiving queries from bots. And here's the breakthrough: the sender of queries and the receiver of queries appear to be symmetrical and loaded with action. 

Stepping back, it becomes clear that a bot is not only defined by the speed at which it queries domains, but also by the diversity of domains, repetition, and popularity of those domains. Similarly, algorithmically generated domains typically deployed in botnets, are not only defined by the speed at which they are queried by clients, but also by the diversity of clients, repetition, and chattiness of those clients. 

cisco-blog-fig3.png

Figure 3: Graphical properties serving as building blocks to signals. 

From a machine to domain edge, in the graphs we analyse are values indicating the force with which a machine is attracted to a domain, or a domain is attracted to a client. This interaction is one of millions that occur hourly, creating one very noisy graph. As we observe this graph over time, we see the values fluctuate with differing velocity. The beauty is to see these sender/receiver signals isolate domains used in a broad variety of campaigns: Necurs, Conficker, Suppobox, PykSpa, and more.

cisco-blog-fig4.png

Figure 4: Interactions of one node in a graph with another, with edge weights varying over time. 

Using Hadoop technologies, we derive methods for creating and storing these signals computed on graphs, mapping the interactions of tens of millions of user-domains. In our talk, we will explain how we broke this problem down into smaller sub-problems that could be solved with effective MapReduce jobs woven in Oozie workflows, and why we chose not to use Spark and GraphX but to build our own graph and graph metric techniques.

Come to our talk to learn about these new methods to analyse and define a few intuitive and yet effective features on the nodes of any bipartite graph, but with a network security twist:

  1. Chattiness of a user IP (or the number of unique domains this user queried over a period of time)
  2. Popularity of a domain (or the number of unique user IPs that queried this domain over a period of time)
  3. Jaccard similarity for a user IP (or the percentage of similar domains this user IP queries from one hour to the next)
  4. Jaccard similarity for a domain (or the percentage of similar user IPs that queried this domain from one hour to the next)
  5. Spread (or the ratio between the average and median Jaccard similarity for a user IP or a domain).

'Beyond lexical and PDNS: using signals on graphs to uncover online threats at scale' will be presented by Dhia Mahjoub and David Rodriguez at 14:00 on Friday 6 October in the Red room.

VB2017-325w.jpg

 Tags

dga cisco vb2017 pdns
twitter.png
fb.png
linkedin.png
googleplus.png
reddit.png

 

Latest posts:

Didn't come to VB2017? Tell us why!

Virus Bulletin is a company - and a conference - with a mission: to further the research in and facilitate the fight against digital threats. To help us in this mission, we want to hear from those who didn't come to Madrid. What is your impression of…

Montreal will host VB2018

Last week, we announced the full details of VB2018, which will take place 3-5 October 2018 at the Fairmont The Queen Elizabeth hotel in Montreal, Quebec, Canada.

VB2017 preview: Beyond lexical and PDNS (guest blog)

In a special guest blog post, VB2017 Silver sponsor Cisco Umbrella writes about a paper that researchers Dhia Mahjoub and David Rodriguez will present at the conference this Friday.

Avast to present technical details of CCleaner hack at VB2017

The recently discovered malicious CCleaner version has become one of the biggest security stories of 2017. Two researchers from Avast, the company that had recently acquired CCleaner developer Piriform, will share the results of their investigations…

VB2017 preview: Walking in your enemy's shadow: when fourth-party collection becomes attribution hell

We preview the VB2017 paper by Kaspersky Lab researchers Juan Andrés Guerrero-Saade and Costin Raiu on fourth-party collection and its implications for attack attribution.