In this blog series, I’ve talked about the applicability of data science to user and entity behavior analytics (UEBA). I argued that statistical analysis is best driven by expert knowledge, and I gave some machine learning examples for finding contextual information to prioritize alerts. In this post, let’s explore more use cases and examples where machine learning applies.
An Entity Categorization Example
In my last post, I discussed how a data-driven classifier can be used to determine if an account is a human or service account. We can do the same for asset type classification: whether a machine asset is a workstation or a server. This classification is critical to enable Stateful Tracking™ in UEBA to organize activities in user sessions. Since a meaningful user session starts with a user logon event to a workstation among other conditions, we need to know the logon machine’s type.
Like service accounts, assets proliferate within the environment, and usually there isn’t a central repository that categorizes the different types of assets. Even when such a repository is present, it is rarely up to date. This presents a good application of machine learning: build a classifier using the asset’s own behavior data. The activity stream from a workstation differs from that of a server. By designing a set of behavior indicators and applying a suitable learning algorithm, we can build a classifier that identifies the type of an asset on the network.
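To make the idea concrete, here is a minimal sketch of such a classifier. It is only an illustration, not the method used in any product: the three behavior indicators, the sample values, and the nearest-centroid algorithm are all assumptions chosen to keep the example short.

```python
from statistics import mean

# Hypothetical behavior indicators per asset (assumed for illustration):
#   distinct_users   - average number of distinct accounts logging on per day
#   interactive_frac - fraction of logons that are interactive
#   offhours_frac    - fraction of activity outside business hours
LABELED_ASSETS = [
    ((1.0, 0.90, 0.05), "workstation"),
    ((2.0, 0.80, 0.10), "workstation"),
    ((1.5, 0.95, 0.02), "workstation"),
    ((40.0, 0.05, 0.45), "server"),
    ((25.0, 0.10, 0.50), "server"),
    ((60.0, 0.02, 0.40), "server"),
]


def fit(samples):
    """Learn min-max scaling and one centroid (mean vector) per label."""
    vectors = [features for features, _ in samples]
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]

    def scale(features):
        # Rescale every indicator to [0, 1] so no single one dominates.
        return tuple(
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(features, lows, highs)
        )

    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(scale(features))
    centroids = {
        label: tuple(mean(col) for col in zip(*rows))
        for label, rows in by_label.items()
    }
    return scale, centroids


def classify(features, scale, centroids):
    """Return the label whose centroid is closest to the scaled features."""
    point = scale(features)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


scale, centroids = fit(LABELED_ASSETS)
print(classify((1.2, 0.85, 0.04), scale, centroids))
print(classify((30.0, 0.08, 0.48), scale, centroids))
```

In practice the labeled examples would come from the subset of assets whose type is already known, and a richer feature set and learning algorithm would likely be needed.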
A Detection Example
Another use case for machine learning is threat detection. There is no single magical algorithm that processes multiple data sources to find noteworthy outliers. In enterprise logs, data sources are heterogeneous; data may be incomplete; logged entities’ behaviors are nuanced and require expert analysis. Therefore, it’s best to use machine learning to address targeted use cases. An example is the detection of algorithmically generated domain (AGD) names.
It is common practice for malware to establish communications over a pseudo-random domain name that is generated by an algorithm. Access to such a domain indicates that malware may be communicating outside of the network. Pseudo-random domain names are effectively impossible to detect using regular expressions, but they can be detected through probabilistic language modeling. Here we might use a letter-based N-gram model to estimate the likelihoods of N-letter sequences, learned from a large corpus of normal-looking words or web domains. For example, if N=2, the bigrams from the word “exabeam” are “e-x”, “x-a”, “a-b”, … “a-m”. There are roughly 700 such letter bigrams, and we can train their likelihoods from millions of domain names. The model represents how a large collection of normal domains appears. A pseudo-random domain name, broken into its bigrams, will score low against the model. Hence, the domain is likely algorithmically generated and can be further evaluated with more contextual information, such as when the domain was first registered.
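The bigram scoring described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the tiny training list stands in for the millions of domains a real model would need, and the add-one smoothing and the average log-likelihood score are my choices for the sketch, not details from a production system.

```python
import math
from collections import Counter

# Characters assumed valid in a domain label; used only to size the smoothing.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Toy stand-in for a large corpus of normal-looking domain names.
NORMAL_NAMES = [
    "google", "facebook", "amazon", "wikipedia", "youtube", "twitter",
    "linkedin", "microsoft", "netflix", "instagram", "dropbox", "salesforce",
]


def bigrams(name):
    """Split a name into overlapping two-letter sequences."""
    name = name.lower()
    return [name[i:i + 2] for i in range(len(name) - 1)]


def train(corpus, alpha=1.0):
    """Return a function giving log P(second letter | first letter)."""
    pair_counts = Counter()
    first_counts = Counter()
    for name in corpus:
        for pair in bigrams(name):
            pair_counts[pair] += 1
            first_counts[pair[0]] += 1
    vocab = len(ALPHABET)

    def log_prob(pair):
        # Additive smoothing keeps unseen bigrams from scoring -infinity.
        numerator = pair_counts[pair] + alpha
        denominator = first_counts[pair[0]] + alpha * vocab
        return math.log(numerator / denominator)

    return log_prob


def score(name, log_prob):
    """Average log-likelihood per bigram; lower means less word-like."""
    pairs = bigrams(name)
    if not pairs:
        return 0.0
    return sum(log_prob(pair) for pair in pairs) / len(pairs)


log_prob = train(NORMAL_NAMES)
for candidate in ("linkbook", "xqzvkjwq"):
    print(candidate, round(score(candidate, log_prob), 2))
```

Averaging over the number of bigrams keeps long and short names comparable. A threshold on the score would have to be tuned on real data, and a low score should only flag the domain for the further contextual checks mentioned above.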
A Context Estimation Example
If we have an alert that a user accessed an asset for the first time, how much weight should we give to this alert? The alert must be viewed within its context. A good piece of context is whether members of the user’s peer group have accessed this particular asset before. Active Directory (AD) data does provide some peer group information for a user, but in typical enterprises it is incomplete or out of date. Peer group labels such as department, title, or office location have been observed to cover far less than the ideal 100% of a user population. Without complete labeling coverage, using AD peer groups as alert context is suboptimal.
On the other hand, a machine learning-based recommendation system is well suited to this problem. Netflix, Amazon, and others in the data analytics industry have long used recommendation system technology to predict the next movie a user is likely to watch or the next item a user is likely to buy, based on data from other members who share the same viewing or buying patterns. By the same token, we can use a recommendation-type system to find a user’s peer group: the users who share the same historical asset access patterns. Given a first-time asset access alert for a user, we can keep or remove the alert based on how frequently the user’s peers access the same asset. This is a prime example of using machine learning to reduce false positives. We don’t always have to focus on detection use cases for UEBA. Reducing false positives via machine learning increases precision and therefore improves detection.
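As a rough sketch of this idea, the example below finds a user’s peers from shared access history and uses them to decide whether a first-time access alert is worth keeping. It uses a simple neighborhood approach with Jaccard similarity rather than a full recommendation engine, and the user names, asset names, peer count, and threshold are all hypothetical.

```python
# Hypothetical access history: user -> set of assets accessed in the past.
ACCESS_HISTORY = {
    "alice": {"git01", "build02", "wiki"},
    "bob": {"git01", "build02", "jira"},
    "carol": {"git01", "build02", "wiki", "jira"},
    "dave": {"payroll", "hr-db", "wiki"},
    "erin": {"payroll", "hr-db"},
}


def jaccard(a, b):
    """Overlap between two asset sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_peers(user, history, k=2):
    """Return the k users whose access history is most similar to the user's."""
    others = [other for other in history if other != user]
    others.sort(key=lambda other: jaccard(history[user], history[other]),
                reverse=True)
    return others[:k]


def peer_access_rate(user, asset, history, k=2):
    """Fraction of the user's peers who have accessed the asset before."""
    peers = find_peers(user, history, k)
    if not peers:
        return 0.0
    return sum(asset in history[peer] for peer in peers) / len(peers)


def keep_alert(user, asset, history, k=2, threshold=0.5):
    """Keep a first-time access alert only if few peers use the asset."""
    return peer_access_rate(user, asset, history, k) < threshold


print(find_peers("alice", ACCESS_HISTORY))
print(keep_alert("alice", "jira", ACCESS_HISTORY))
print(keep_alert("alice", "payroll", ACCESS_HISTORY))
```

Here a first-time access to an asset that the user’s closest peers already use is treated as a likely false positive, while access to an asset none of them use stays on the analyst’s list.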
Simplicity Is King
I hope you enjoyed this blog series from a data scientist’s perspective. Here are some parting words. For the uninitiated, machine learning for UEBA may sound intimidating. But like an iPhone, a good UEBA system hides all the complexity under the hood. That is the main challenge for a machine learning-based UEBA system. Exposing machine learning output directly to end users without a useful explanation only forces analysts to do more investigative work; it doesn’t reduce the workload. Despite its underlying complexity and sophistication, a good machine learning-based UEBA system must strive to keep its appearance simple. Simplicity is king in the world of UEBA.
You can also learn more here: http://www.exabeam.com/product/applications/