Thorough Analysis For Using Data Science To Detect Malicious Domains
Analyzing existing enterprise traffic logs with a data science approach is an efficient way to detect signs of a breach. VPN and Active Directory logs can be used to detect compromised account activity. Database or file-level access logs can be used to detect insider threats. Mining these voluminous logs requires different machine learning and data mining methods, depending on the use case. As an example of User & Entity Behavior Analytics (UEBA), in this post I’ll sketch out some data science approaches used in the field to detect malicious domains by learning from web proxy logs.
Lexical Analysis of Domain Names
Malware can communicate with short-lived domains to evade the domain blacklisting used in a signature-based approach. These domain names are algorithmically generated, for example iwxyrxthxswyxrcx.com or b0lz1md5qvf4w7w9w3iy9x2s17c5z4h0y1s.info. One can immediately see that such domains, built from random alphanumeric strings, look and feel very different from normal domains, which are reasonably human-readable.
When I first tackled this problem more than 3 years ago in a data science service engagement for a client, the solution was to use a language modeling technique called N-gram modeling to detect these nefarious domains. This technique follows these steps:
- Start from a large population of legitimate domains. For every domain (mydomain.com), parse its second-level domain name to get all pairs of consecutive characters (m-y, y-d, d-o, o-m, m-a, a-i, i-n). Count the number of times each pair occurs in the population.
- From the counts, I obtain the likelihood of observing each consecutive character pair in the population. Common pairs (e.g. m-a) get a high probability while rare pairs (e.g. q-x) get a low probability. Together these probabilities form the bi-gram model.
- Given an unknown domain (gdhcix.com), parse it to get all letter pairs. Then simply look up and multiply the individual probabilities of the letter pairs (g-d, d-h, h-c, c-i, i-x) to obtain the probability of seeing this particular collection in the normal population. A collection of many low-probability pairs, such as those from randomly generated domain names, will have a low overall probability. A simple threshold on this probability then flags suspicious domains.
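The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the engagement. The function names, the tiny training list, the add-one smoothing for unseen pairs, and the assumed 36-character alphabet are all my own assumptions. It sums log-probabilities, which is equivalent to multiplying the probabilities but avoids numeric underflow on long names.

```python
import math
from collections import Counter

def second_level_label(domain):
    """Return the label left of the TLD, e.g. 'mydomain' for 'www.mydomain.com'.
    Simplification (assumption): multi-part suffixes such as 'co.uk' are not handled."""
    parts = domain.lower().strip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]

def bigrams(label):
    """All pairs of consecutive characters in the label."""
    return [label[i:i + 2] for i in range(len(label) - 1)]

def train_bigram_model(legit_domains):
    """Count how often each character pair occurs across legitimate domains."""
    counts = Counter()
    for domain in legit_domains:
        counts.update(bigrams(second_level_label(domain)))
    return counts

def log_probability(domain, counts, vocab_size=36 * 36):
    """Sum of log-probabilities of the domain's character pairs.
    Add-one smoothing (assumption) keeps unseen pairs from having probability zero;
    vocab_size assumes an alphabet of a-z and 0-9."""
    total = sum(counts.values())
    pairs = bigrams(second_level_label(domain))
    return sum(math.log((counts[p] + 1) / (total + vocab_size)) for p in pairs)

# Hypothetical, very small "legitimate" population for illustration only.
legit = ["google.com", "amazon.com", "mydomain.com",
         "facebook.com", "microsoft.com", "wikipedia.org"]
model = train_bigram_model(legit)
for name in ("domain.com", "gdhcix.com"):
    print(name, log_probability(name, model))
```

In practice the training population would be far larger, and the threshold would be tuned on held-out data. Because every extra pair lowers the product, comparing names of very different lengths may call for a per-pair average instead of a raw sum.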
I have since repeated this method elsewhere. It is simple to implement and catches low-hanging fruit that was previously impossible to catch with rules. Malware’s use of algorithmically generated domain names has been around for a while. Since then, sophisticated hackers have moved on to randomly generated domain names based on dictionary words, instead of 30+ random alphanumeric characters. Devising ways to overcome this is another discussion topic. However, the simple technique described here still works so long as malware that generates malicious domains from random characters continues to exist.
Behavior Modeling of Domains
One can go beyond the statistical lexical analysis above. Malicious and legitimate domains also differ in their behavioral cues, or features, and a machine learning classifier can be trained to tell them apart. The key to its success is the definition of the domain features used as input to the classifier.
Security domain knowledge dictates the features for domains. An upcoming ISI-ICDM’15 workshop paper, which I co-authored with my ex-colleagues, enumerates all the features used. They are broadly categorized into three buckets:
- Features in the first category describe connections to and from the domain, for example the percentage of traffic containing images, or the average number of bytes sent.
- The next category covers the URL’s lexical information, such as the length of the URL or the percentage of special characters.
- The final category is based on the WHOIS database, such as the number of days since the domain was first registered or last updated, or the number of distinct registered organizations for the domain.
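As a small illustration of the second bucket, the sketch below computes the two lexical features named above. The function name and the definition of "special character" (any character that is not a letter or digit) are my assumptions, and the paper may define these features differently.

```python
def url_lexical_features(url):
    """Compute simple lexical features of a URL string.
    Assumption: a 'special character' is anything that is not alphanumeric."""
    length = len(url)
    special = sum(1 for ch in url if not ch.isalnum())
    return {
        "length": length,
        "pct_special": 100.0 * special / length if length else 0.0,
    }

# Hypothetical URL for illustration only.
print(url_lexical_features("http://a.b/c?d=1"))
```

Features from the other two buckets would be computed per domain from the proxy logs and WHOIS records and joined into one feature vector.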
Domain labels are required to build a classifier. This is where I leveraged external threat intelligence for supervised learning. The idea is to classify unknown domains with a model that learns the distributions of behavioral features of bad and legitimate domains, using the labels as a guide. Given a large collection of features, a Random Forest model is a good choice because it can explore the high-dimensional space and is robust to a skewed data set, where the number of known malicious domains is far smaller than that of legitimate domains. I had good experience with this model in the field. On one enterprise’s data, cross-validation yielded 46% recall at 100% precision, which is very good.
In counter-terrorism work, link analysis has been effective in finding additional suspects by traversing the connectivity network of known bad guys. Similarly, exploiting the connectivity information between internal users and external web domains can yield insightful signals. Formally, given a bipartite graph in which internal nodes (users) connect to external nodes (domains), I perform information propagation to “spread” the risk, or degree of badness, from the known bad nodes to the rest of the nodes in an iterative fashion.
How does this work? Initially, all nodes are assigned a risk of zero, except the few bad domains known from external intelligence. The non-zero risk of these bad domains is transferred to the internal users connecting to them in the graph. Having accumulated risk from all the bad domains, these users then pass the risk on to all the other domains they connect to. As a result, previously unknown domains may now get non-zero risk. Once the process has been iterated until convergence, all nodes on the network, users and domains alike, have been assigned risk scores. A threshold is then applied to identify suspect entities. I found good detection with this graph analysis technique alone. While these risk scores are derived from network connectivity information, I further combine them with the scores produced by another analysis method, namely the Random Forest model discussed above, which derives risk scores from behavioral information. This hybrid approach has proven to be effective; details are given in the upcoming paper [1].
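The propagation loop can be sketched as follows. The exact update rule is not spelled out here, so this sketch makes its own assumptions: each node takes the damped mean of its neighbors' risks, known bad domains stay clamped at 1.0, and the damping factor of 0.8 makes risk decay with distance instead of saturating. The function and parameter names are hypothetical.

```python
from collections import defaultdict

def propagate_risk(edges, known_bad, damping=0.8, max_iter=100, tol=1e-6):
    """Spread risk over a bipartite user-domain graph.
    edges: iterable of (user, domain) pairs; known_bad: set of bad domains.
    Returns (user_risk, domain_risk) dicts with scores in [0, 1]."""
    domains_of = defaultdict(set)  # user -> domains the user connected to
    users_of = defaultdict(set)    # domain -> users who connected to it
    for user, domain in edges:
        domains_of[user].add(domain)
        users_of[domain].add(user)

    # All nodes start at zero risk except the known bad domains.
    domain_risk = {d: (1.0 if d in known_bad else 0.0) for d in users_of}
    user_risk = {u: 0.0 for u in domains_of}

    for _ in range(max_iter):
        # Users pick up risk from the domains they connect to.
        for u, ds in domains_of.items():
            user_risk[u] = damping * sum(domain_risk[d] for d in ds) / len(ds)
        # Users pass risk on to their domains; known bad domains stay at 1.0.
        change = 0.0
        for d, us in users_of.items():
            if d in known_bad:
                continue
            new = damping * sum(user_risk[u] for u in us) / len(us)
            change = max(change, abs(new - domain_risk[d]))
            domain_risk[d] = new
        if change < tol:  # converged
            break
    return user_risk, domain_risk

# Hypothetical proxy-log edges for illustration only.
edges = [("alice", "evil.com"), ("alice", "odd.info"),
         ("bob", "odd.info"), ("bob", "news.com"),
         ("carol", "news.com"), ("dave", "shop.com")]
users, domains = propagate_risk(edges, known_bad={"evil.com"})
print(sorted(domains.items(), key=lambda kv: -kv[1]))
```

In this toy graph, odd.info shares a user with the known bad domain and ends up with a higher score than news.com, which is one hop further away, while shop.com is disconnected from any bad domain and stays at zero.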
An additional benefit of the graph technique is that it is inherently visual. One can always render the node-to-node connectivity network on the screen. The insight is visually constructed, enabling easy and immediate analysis. In the figure below, red nodes are users; black nodes are known bad domains; green nodes are computationally determined suspicious domains; the lines indicate the communication connection. One can see there are interesting observations worthy of further study; for example, the green nodes with highly shared users.
To summarize, I’ve outlined some machine learning approaches to detect malicious domains by learning from standard web proxy logs. Of course, the same tools work for behavior analytics on other logs as well, including infrastructure authentication and access logs. Depending on the use case, some need only simple tools, going no further than profiling one-dimensional signals to flag events that fall more than a few standard deviations from the norm. Other use cases need more advanced tools, such as those described in this blog. The takeaway is that data science applies handsomely to UEBA. It has been shown to be effective for security analytics, and more advances in this direction will come.
[1] Shi, Lin, Fang, Zhai. A Hybrid Learning from Multi-Behavior for Malicious Domain Detection on Enterprise Network. ISI-ICDM’15.