
A User and Entity Behavior Analytics System Explained – Part II

In my last blog, I talked about the role of statistical analysis in a User and Entity Behavior Analytics (UEBA) system. Expert-driven statistical modeling is a core component of an anomaly detection system. It is intuitive and easy for analysts of all levels to use and understand. In part II of this series, I’ll discuss the role of machine learning in a UEBA system.

Machine learning is a method used to devise complex models and algorithms that learn from data or make predictions from it. In the UEBA context, I’ve always felt that using a single complex modeling technique to detect users’ anomalous behavior has a low likelihood of success. The lack of ground truth and the reliance on unsupervised learning don’t bode well for UEBA applications that require very low false positive rates. Enterprise environments are complex, fraught with uncertainty and ill-defined data sets. Network context information is not always reliable. User behaviors are not necessarily boxed in, and the environment is always in a state of flux. All of these factors make a monolithic detection algorithm over multiple data sources difficult to realize. Instead, machine learning is best suited to solving targeted use cases. I’ll give an example of a patent-pending machine learning application used in Exabeam’s UEBA system.

Exabeam uses machine learning to better estimate a potential alert’s context so that we can calibrate the alert’s score. A high volume of activity might be abnormal for a human user but perfectly normal for a service account. Raising an alert without considering the context is prone to a high rate of false positives. Therefore, wherever possible, we leverage an enterprise’s existing account labeling information in the UEBA system deployment. However, not all environments have such data readily available. More often than not, the information is incomplete, since such data is hard to maintain and mushrooms out of IT control as the environment grows. Also, maintaining such data typically has not been critical for core IT operations. Imperfect labels are not a showstopper, though: categorizing noisy data is where machine learning delivers the most value. We can use it to estimate, or even correct, the labeling information of an account, that is, whether it is a service account or a user account.

We have created and deployed several algorithms for the task. One method leverages information from the Lightweight Directory Access Protocol (LDAP) files that enterprises maintain for directory services to provide records of network entities such as users and assets. Every entity is described by a collection of key-value pairs. Some keys are semi-standardized, some are not, and the value of a key might be free text. Human eyes tend to do a fair job of identifying whether an account is a service or user account by reading the key-value pairs. How do we create an algorithm for a computer to do the same, or better, for automated classification?

Data science work starts with an exploratory study to vet ideas. Here, a simple application of Singular Value Decomposition (SVD) is used to validate the assumption that there is enough signal in the text data to possibly separate accounts in the LDAP files. To do so, we first represent each account with a binary vector of size N, where N is the number of keys used in the LDAP files; N is typically on the order of several hundred across enterprise environments. Each dimension holds a Boolean value indicating whether a key is present in the account or not.
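The key-presence representation described above can be sketched in a few lines of Python. This is only an illustration, not the production code: the account names, LDAP keys, and the helper `build_key_presence_matrix` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical LDAP-style records: each account is a set of key-value pairs.
# The keys and values here are made up for illustration.
accounts = {
    "jsmith": {"cn": "John Smith", "mail": "jsmith@example.com",
               "manager": "cn=boss", "telephoneNumber": "555-0100"},
    "svc_backup": {"cn": "svc_backup", "servicePrincipalName": "backup/host01",
                   "description": "nightly backup job"},
    "adoe": {"cn": "Alice Doe", "mail": "adoe@example.com",
             "manager": "cn=boss"},
}


def build_key_presence_matrix(accounts):
    """Return (account names, key names, binary matrix of shape accounts x N)."""
    # N is the number of distinct keys seen across all accounts.
    keys = sorted({key for attrs in accounts.values() for key in attrs})
    column = {key: j for j, key in enumerate(keys)}
    names = list(accounts)
    X = np.zeros((len(names), len(keys)), dtype=np.uint8)
    for i, name in enumerate(names):
        for key in accounts[name]:
            X[i, column[key]] = 1  # 1 means the key is present; values are ignored
    return names, keys, X


names, keys, X = build_key_presence_matrix(accounts)
print(keys)
print(X)
```

Only the presence of a key is encoded here; the free-text values are ignored, which matches the Boolean representation described in the text.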

In a large environment with tens of thousands of accounts, we’d have tens of thousands of such vectors. SVD is a conventional dimensionality reduction technique that transforms vectors in a high N-dimensional space into a space with a much smaller number of dimensions. This is great for visualization and modeling. To wit, we plot the singular values for the N transformed dimensions in Figure 1. It shows that most of the signal is captured in just the first few dimensions.


Figure 1. Singular Values for Transformed Dimensions
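As a rough sketch of how a singular value spectrum like this could be computed, the snippet below runs NumPy's SVD on a synthetic key-presence matrix. The data is randomly generated for illustration (two made-up account populations with different key-presence probabilities) and is not the data behind Figure 1. The sizes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys = 50  # assumed number of LDAP keys (N); real environments may have hundreds

# Synthetic stand-in for the binary account vectors: two populations whose
# keys are present with different probabilities.
p_a = rng.uniform(0.05, 0.95, n_keys)
p_b = rng.uniform(0.05, 0.95, n_keys)
X = np.vstack([
    rng.random((300, n_keys)) < p_a,
    rng.random((100, n_keys)) < p_b,
]).astype(float)

# Singular values only, returned in descending order.
singular_values = np.linalg.svd(X, compute_uv=False)

# Share of the total squared singular values captured by each dimension.
energy = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(energy)

for k in (1, 2, 5, 10):
    print(f"first {k:2d} dimensions capture {cumulative[k - 1]:.1%} of the energy")
```

Plotting `singular_values` against their index gives the kind of curve the figure describes. A steep drop after the first few values suggests that a low-dimensional projection retains most of the structure.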

If we retain just the first few important dimensions, say just two, and visualize the accounts in a two-dimensional space (Figure 2), we clearly see a good separation in the population.

Figure 2. Visualizing Two Dimensions
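The two-dimensional view can be sketched as a projection onto the top two singular directions. Again, this is an illustration on synthetic data with made-up group sizes and key probabilities, not the data or code behind the figure. The `group` array exists only to summarize the result and is not used by the SVD itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys = 50  # assumed number of LDAP keys (N)

# Synthetic binary account vectors from two hypothetical populations.
p_a = rng.uniform(0.05, 0.95, n_keys)
p_b = rng.uniform(0.05, 0.95, n_keys)
group = np.array([0] * 300 + [1] * 100)
key_probability = np.where(group[:, None] == 0, p_a, p_b)
X = (rng.random((400, n_keys)) < key_probability).astype(float)

# Thin SVD: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Coordinates of every account in the top two transformed dimensions.
coords = U[:, :2] * s[:2]

# The labels are only used here to summarize where each population lands.
for g in (0, 1):
    print(f"group {g} centroid:", coords[group == g].mean(axis=0).round(2))
```

The `coords` array can be passed to any scatter-plot routine. In an unlabeled setting the grouping would not be known up front, and the plot would be inspected for visible clusters instead.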

This is a simple way of using unsupervised learning to gain insight for the classification task. Although at this stage we do not know whether one cluster in the figure corresponds to user accounts and the other to service accounts, we have a good indication that a computer algorithm can separate the data population using the textual data alone. Ultimately, we experimented with building classifiers both with and without leveraging existing labels. The classifier is then chosen to suit the production requirements.

There are still other ways to skin the classification problem. The text-based account classification described here is one. Another is to classify accounts using their behavior data. We first define behavior features derived from the accounts’ activities recorded in the logs. Obvious behavior features include the number of events generated or received by an account, the number of hosts an account connects to, and so on. Assuming most of the account population consists of user accounts, we have tried an unsupervised learning approach, such as a one-class Support Vector Machine (SVM), on the collected features to classify accounts.
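A minimal sketch of the behavior-based idea, assuming scikit-learn is available, might look like the following. The two features (daily event count and number of distinct hosts), the synthetic counts, the log transform, and the `nu` and `gamma` settings are all illustrative assumptions rather than the configuration used in the product.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic behavior features per account: [events per day, distinct hosts].
# Most accounts behave like human users; a few behave like service accounts.
users = np.column_stack([rng.poisson(200, 500), rng.poisson(3, 500)])
services = np.column_stack([rng.poisson(50000, 10), rng.poisson(40, 10)])
features = np.vstack([users, services]).astype(float)

# Counts are heavy-tailed, so compress them with log1p before scaling.
scaled = StandardScaler().fit_transform(np.log1p(features))

# nu is an approximate upper bound on the fraction of accounts treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = model.fit_predict(scaled)  # +1 = fits the majority pattern, -1 = outlier

flagged = np.flatnonzero(labels == -1)
print(f"{len(flagged)} of {len(labels)} accounts flagged as atypical")
```

Accounts labeled -1 are candidates for being something other than ordinary user accounts. In practice such output would need to be checked against whatever labels exist before being trusted.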

In this blog, we have illustrated an example of using machine learning to derive network account context information. Other context derivation examples include estimating the host machine type (a server or a workstation), a user’s peer group, and an asset’s peer group. The more we use data to understand the complex and noisy IT environment, the better the quality of the alert signals we can produce. In UEBA, context information derivation is not the only area where machine learning helps. Other areas ripe for machine learning applications include various targeted detection problems and false positive control. I’ll choose an example or two in the last part of this blog series.


Topics: data science