VB2017 preview: Beyond lexical and PDNS (guest blog)

Posted on Oct 5, 2017

In this special guest blog post, VB2017 Silver sponsor Cisco Umbrella writes about a paper that researchers Dhia Mahjoub and David Rodriguez will present at the conference this Friday.

 

In the past decade, detection of DGA (Domain Generation Algorithm) domains has relied primarily on lexical analysis of domain names, tracking of NX (non-resolving) domains, and malware reversing. The earliest works were groundbreaking, but since then we have observed only small incremental improvements, which combine machine learning techniques with sandbox-based analysis of DGA malware.

cisco-blog-fig1.png

Figure 1: Time-dependent user-domain interactions.

In a talk to be given by Cisco Umbrella researchers Dhia Mahjoub and David Rodriguez at VB2017 this Friday, we propose to completely reframe the problem and take advantage of worldwide visibility into user-domain interactions via DNS. We introduce a novel approach that not only represents client-domain interactions as a bipartite graph but also carefully studies how the topological properties of this graph evolve over time, hence the concept of 'time series on graphs'.
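To make the idea of 'time series on graphs' concrete, here is a minimal sketch in Python. It is an illustration only, not the implementation described in the talk: the record layout `(hour, client, domain)`, the function names, and the choice of hourly query count as the edge weight are all assumptions made for this example.

```python
def build_hourly_graphs(queries):
    """Group DNS query records into one bipartite graph per hour.

    `queries` is an iterable of (hour, client, domain) tuples (assumed layout).
    Each hourly graph maps a (client, domain) edge to its query count,
    which stands in here for the edge weight (an assumption).
    """
    graphs = {}
    for hour, client, domain in queries:
        edges = graphs.setdefault(hour, {})
        edges[(client, domain)] = edges.get((client, domain), 0) + 1
    return graphs


def edge_series(graphs, client, domain):
    """Return the weight of one client-domain edge for each hour, in hour order."""
    return [graphs[hour].get((client, domain), 0) for hour in sorted(graphs)]


# Toy data: hypothetical clients and domains, two hours of traffic.
queries = [
    (0, "10.0.0.1", "example.com"),
    (0, "10.0.0.1", "example.com"),
    (0, "10.0.0.2", "qwhdkzj.net"),
    (1, "10.0.0.1", "qwhdkzj.net"),
]
graphs = build_hourly_graphs(queries)
series = edge_series(graphs, "10.0.0.1", "example.com")
```

Each hourly dictionary is one snapshot of the bipartite graph, and `edge_series` reads a single edge across the snapshots, which is the kind of per-edge time series the approach tracks.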

cisco-blog-fig2.png 

Figure 2: A client (a yellow triangle) querying domains (grey circles) at different rates at different times.

We unravel these time-dependent graphs by tracking bots surfing the Internet. By 'bot', we mean a client machine that is infected with malware and controlled by some other machine(s). We then study how these machines query domains on the Internet. Reciprocally, we study domains receiving queries from bots. And here's the breakthrough: the sender of queries and the receiver of queries appear to be symmetrical and loaded with action.

Stepping back, it becomes clear that a bot is defined not only by the speed at which it queries domains, but also by the diversity, repetition, and popularity of those domains. Similarly, the algorithmically generated domains typically deployed in botnets are defined not only by the speed at which they are queried by clients, but also by the diversity, repetition, and chattiness of those clients.

cisco-blog-fig3.png

Figure 3: Graphical properties serving as building blocks to signals. 

Each machine-to-domain edge in the graphs we analyse carries a value indicating the force with which a machine is attracted to a domain, or a domain is attracted to a client. This interaction is one of millions that occur hourly, creating one very noisy graph. As we observe this graph over time, we see the values fluctuate with differing velocity. The beauty is to see these sender/receiver signals isolate domains used in a broad variety of campaigns: Necurs, Conficker, Suppobox, PykSpa, and more.

cisco-blog-fig4.png

Figure 4: Interactions of one node in a graph with another, with edge weights varying over time. 

Using Hadoop technologies, we derive methods for creating and storing these signals computed on graphs, mapping the interactions of tens of millions of user-domains. In our talk, we will explain how we broke this problem down into smaller sub-problems that could be solved with effective MapReduce jobs woven into Oozie workflows, and why we chose not to use Spark and GraphX but to build our own graph and graph metric techniques.

Come to our talk to learn about these new methods to analyse and define a few intuitive and yet effective features on the nodes of any bipartite graph, but with a network security twist:

  1. Chattiness of a user IP (or the number of unique domains this user queried over a period of time)
  2. Popularity of a domain (or the number of unique user IPs that queried this domain over a period of time)
  3. Jaccard similarity for a user IP (or the percentage of similar domains this user IP queries from one hour to the next)
  4. Jaccard similarity for a domain (or the percentage of similar user IPs that queried this domain from one hour to the next)
  5. Spread (or the ratio between the average and median Jaccard similarity for a user IP or a domain).
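The five features above can be sketched in a few lines of Python. This is a small in-memory illustration under stated assumptions, not the Hadoop-based pipeline described in the talk: the data layout (one set of `(client, domain)` edges per hour), all function names, and the handling of empty sets and a zero median are choices made for this example. Jaccard similarity is returned as a fraction between 0 and 1 rather than a percentage.

```python
from statistics import mean, median


def neighbours(edges, node, side):
    """Nodes adjacent to `node` in one hourly edge set.

    side=0: `node` is a client, returns the domains it queried.
    side=1: `node` is a domain, returns the clients that queried it.
    """
    return {edge[1 - side] for edge in edges if edge[side] == node}


def chattiness(edges, client):
    """Number of unique domains a client queried in the period."""
    return len(neighbours(edges, client, 0))


def popularity(edges, domain):
    """Number of unique clients that queried a domain in the period."""
    return len(neighbours(edges, domain, 1))


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 0.0 for two empty sets (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hourly_jaccard(hours, node, side):
    """Similarity of a node's neighbour set from each hour to the next."""
    sets = [neighbours(edges, node, side) for edges in hours]
    return [jaccard(x, y) for x, y in zip(sets, sets[1:])]


def spread(similarities):
    """Ratio of mean to median similarity; None if the median is zero (assumption)."""
    med = median(similarities)
    return mean(similarities) / med if med else None


# Toy data: four hourly snapshots with hypothetical clients and domains.
hours = [
    {("c1", "a.com"), ("c1", "b.com"), ("c2", "a.com")},
    {("c1", "a.com"), ("c1", "c.com"), ("c2", "a.com")},
    {("c1", "a.com"), ("c1", "c.com")},
    {("c1", "a.com"), ("c1", "c.com")},
]
client_sims = hourly_jaccard(hours, "c1", 0)
domain_sims = hourly_jaccard(hours, "a.com", 1)
client_spread = spread(client_sims)
```

In this toy data, client `c1` changes one of its two domains after the first hour and then settles, so its hour-to-hour similarity rises from one third to 1.0, while domain `a.com` loses one of its two clients in the third hour.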

'Beyond lexical and PDNS: using signals on graphs to uncover online threats at scale' will be presented by Dhia Mahjoub and David Rodriguez at 14:00 on Friday 6 October in the Red room.

