MITRE released a significant update to the MITRE ATT&CK framework, which included several new and updated techniques. One of those techniques, domain generation algorithms (DGAs), was submitted by our research team. We are excited to be contributors, and in this post we explain the technique in more detail.
Domain Generation Algorithm (T1483)
Domain generation algorithms have remained a main channel of communication for malware over the past 10 years. A DGA takes a seed, such as dictionary words, DWORD values, or random digits, and quickly generates domain names from it, often gibberish strings like hcbhjbdjbjhsb.ru. Malware uses these domains to receive instructions to exfiltrate data, deliver updates, and execute commands on a system remotely. Earlier families of malware used a static list of IP addresses or domains, which defenders eventually blocked. Attackers now write sophisticated DGA code that generates thousands of domains, of which only a few actually carry command and control instructions, so that their connection persists over the long term and stays resilient against enforcement actions. Malware such as Kraken, Conficker, Murofet and Chopstick showcases DGAs whose seeds range from date-dependent to static and dynamic.
Figure 1. An example of how DGA exfiltrates data from a target
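To make the idea of a date-dependent seed concrete, here is a minimal sketch of a toy generator. It is purely illustrative and does not reproduce the algorithm of any malware family named above; the function name `generate_domains`, the seed string, the SHA-256 hashing step, and the `.ru` suffix are all assumptions chosen for the example.

```python
import hashlib
from datetime import date


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".ru") -> list[str]:
    """Derive `count` pseudo-random domain names from a static seed and a date.

    Hypothetical toy DGA: anyone who knows the seed and the date can
    recompute the same list, which is how malware and its operator stay
    in sync without hard-coding any domain.
    """
    domains = []
    for index in range(count):
        material = f"{seed}-{day.isoformat()}-{index}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 14 hex digits onto the letters a..p so the label
        # looks like a gibberish string rather than a hex dump.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:14])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    for name in generate_domains("example-seed", date(2019, 1, 1)):
        print(name)
```

Because the output changes every day, a static blocklist built from yesterday's domains says nothing about today's, which is the property that defeats list-based blocking.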
Detecting dynamically generated domains can be challenging due to the rapid rotation of DGA seeds, constantly evolving malware families, sinkhole awareness, and the complexity of the algorithms themselves.
This makes signature-based detection of these signals ineffective and calls for a machine learning approach to detect them efficiently.
Using machine learning, DGA detection becomes a solvable problem. Some may be familiar with n-grams from natural language processing, where analysts count how often words follow each other in normal speech or writing. Similarly, n-grams can be used to analyze the character sequences in a domain name. If the sequences in a suspected domain name never follow each other in common use on the internet, the name has a high probability of being random.
One approach is to segment the domains into substrings of size “n”. Each substring of length n is called a gram. The larger the value of n, the smaller the number of substrings, and vice versa, as can be seen in the figure below. According to research, n values of 3, 4, and 5 give the best accuracy when predicting randomness. For example, a 3-gram approach for the word “youarepwned” would yield you, oua, uar, are, rep, epw, pwn, wne, ned. We therefore test the substrings (you, uar, epw, and so on) for randomness, excluding the top 1 million ranked domains, such as the Alexa top million websites. We then check domains with high randomness against CDN whitelists, and if the domain is not present, we compare its randomness to a threshold value. If the level of randomness is higher than the threshold, the domain is deemed DGA-generated. This helps us increase the chance of detection and predict the range of random domains that may be generated.
Figure 2. An example of 2-gram and 3-gram breakdowns for “youarepwned” showing predictable randomness in domain generation
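The segmentation, whitelist check, and threshold test described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production model: the helper names (`ngrams`, `build_known_grams`, `randomness_score`, `is_likely_dga`), the scoring rule (the fraction of grams never seen in a benign corpus), and the 0.8 threshold are all hypothetical. In practice the benign corpus would be something like a top 1 million domain list rather than the two names used here.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_known_grams(benign_labels: list[str], n: int = 3) -> set[str]:
    """Collect the n-grams that occur in a corpus of benign domain labels."""
    known: set[str] = set()
    for label in benign_labels:
        known.update(ngrams(label.lower(), n))
    return known


def randomness_score(label: str, known: set[str], n: int = 3) -> float:
    """Fraction of the label's n-grams that never appear in the benign corpus."""
    grams = ngrams(label.lower(), n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in known)
    return unseen / len(grams)


def is_likely_dga(domain: str, known: set[str], allowlist: set[str],
                  threshold: float = 0.8, n: int = 3) -> bool:
    """Flag a domain when any non-TLD label scores above the threshold."""
    domain = domain.lower()
    if domain in allowlist:  # e.g. known CDN hostnames
        return False
    parts = domain.split(".")
    labels = parts[:-1] or parts  # drop the TLD when there is one
    score = max((randomness_score(label, known, n) for label in labels), default=0.0)
    return score > threshold


if __name__ == "__main__":
    print(ngrams("youarepwned", 3))
    known = build_known_grams(["google", "youtube"])
    print(is_likely_dga("krbsectyxfpxsofe.ru", known, allowlist=set()))
```

Scoring every label except the top-level domain means the same check covers second- and third-level names. A real deployment would need a far larger benign corpus, since with a tiny one almost any unfamiliar name looks random.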
We apply the same method to second- and third-level domains to detect the occurrence of the word (for example, xuxu(dot)youarepwned(dot)net). Our behavior-based analytics approach, combined with deep learning, helps detect DGAs before they cause any damage. In a previous post, our Chief Data Scientist Derek Lin discusses behavioral modeling for DGAs using a random forest model. For example, in the case of ransomware, the attacker will encrypt the files, request encryption keys, and send sensitive data. Any malicious request that occurs alongside a randomly generated domain provides substantial evidence of abnormal activity that can be investigated further. Inputs include logs that contain domain access information, such as DNS NX domain logs, DNS request/response logs, WHOIS information, passive DNS traffic, proxy traffic and EDR activity.
Figure 3: Exabeam Advanced Analytics detects a likely DGA domain krbsectyxfpxsofe.ru.
Signature-based detection is not considered an effective measure against DGAs because the algorithms change so rapidly. In addition to techniques like entropy change, frequency analysis and Markov chains, Exabeam provides extensive detection techniques for behavior analytics using n-grams and machine learning. As DGA techniques evolve, predicting an adversary's actions will remain challenging, which underscores the need to better prepare ourselves by sharing intelligence and working collaboratively.