Risk assessment in UEBA (user and entity behavior analytics) works much like how humans assess risk in their surroundings. In an unfamiliar setting, the brain constantly takes in data about objects, sounds, temperature, and so on, and weighs this sensory evidence against learned patterns to determine whether, and what, risk is present. A UEBA system works in a similar manner: it ingests data from different log sources, such as Windows AD, VPN, database, badge, file, proxy, and endpoint logs. Given these inputs and learned behaviors, how do we fuse the information into a final score for risk ranking?
Before we dive deeper, let me first share some general thoughts on the construction of security analytics systems, to frame the question better. When I started working in the security field, my first instinct as a data scientist was the traditional one: define a set of learning features, generally continuous in value, and feed them into an available machine learning algorithm (SVM, decision tree, and the like) to identify outliers or malicious entity sessions. But it soon became clear that such a conventional monolithic learning framework (Figure 1) has little chance of production success. First, security data is heterogeneous, and we cannot expect all data sources to be available from the start for learning purposes. This makes construction of comprehensive features difficult, if possible at all; when a new data source is added, the need to re-train or re-tune the monolithic model makes it impractical for a production system. Second, the flexibility to quickly configure and deploy learning features is extremely important, and it is impractical to re-learn a single-algorithm system every time new features are added. Third, even if such an over-encompassing monolithic algorithm existed, using a wide variety of data for malicious event detection in one model tends to be a black-box approach. This goes against the must-have user requirement that output be easily explained and interpreted.
So, it is not surprising that an effective security analytics system consists of statistical indicators or sensors that can be added to meet new data demands and are easy to interpret. As a result, instead of having an end-to-end monolithic framework, we have a collection of independent indicators. An explicit step is now needed to fuse the outputs together.
Some indicators are based on statistical analysis for anomaly detection, e.g. whether a user accessed an asset abnormally. Some are simply based on facts, e.g. whether there is a malware alert found on an asset. Others involve machine learning, such as detecting a DGA (Domain Generation Algorithm) domain with bigram models or neural networks. A few others rely on context derived by machine learning to aid anomaly detection, e.g. selecting the best peer group via behavior analysis for peer analysis. These indicators are designed to be as statistically independent as possible. At Exabeam, there are more than a few hundred such indicators across a variety of data types, each carefully developed according to security expertise, data science, and field experience.
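To make the architecture concrete, here is a minimal sketch of a session being evaluated by a collection of independent indicators, each carrying an expert-assigned anchor score. The indicator names, session fields, and scores are illustrative assumptions, not Exabeam's actual indicators.

```python
# Hypothetical sketch: each indicator independently inspects a session
# and either triggers or not. Names, fields, and scores are illustrative.

def first_time_asset_access(session):
    """Statistical indicator: user accessed an asset for the first time."""
    return session.get("new_asset_access", False)

def malware_alert_on_asset(session):
    """Fact-based indicator: a malware alert exists on the asset."""
    return session.get("malware_alert", False)

# (indicator_name, check_function, expert-assigned anchor score)
INDICATORS = [
    ("first_time_asset_access", first_time_asset_access, 20),
    ("malware_alert_on_asset", malware_alert_on_asset, 40),
]

def triggered_anomalies(session):
    """Return (name, anchor_score) for every indicator that fires."""
    return [(name, score) for name, check, score in INDICATORS if check(session)]

session = {"new_asset_access": True, "malware_alert": False}
print(triggered_anomalies(session))  # [('first_time_asset_access', 20)]
```

Because each indicator is self-contained, adding a new one for a new data source is just appending to the list; no existing indicator needs retraining.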
How do we now fuse these indicator outputs into a final session score? An obvious approach is to define an anchor score for each triggered indicator, or anomaly, and then sum the scores of all anomalies within a session. However, this simple approach is not optimal across different environments. On a global level, some indicators are prone to trigger more or less often across the user population, sometimes for environment-specific reasons. On a local level, some indicators tend to trigger more for specific accounts; for example, a first-time access anomaly to an asset for a service account. Indicators that trigger frequently are less informative in the security context. A simple sum of triggered indicator scores creates unnecessary score inflation in the final session score, resulting in an increased false positive rate and decreased precision.
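The naive fusion approach, and its inflation problem, can be sketched in a few lines. The indicator names and scores below are illustrative assumptions:

```python
# Naive fusion: sum the anchor scores of all triggered anomalies.
# Indicator names and anchor scores are illustrative.

def naive_session_score(anomalies):
    """anomalies: list of (indicator_name, anchor_score) pairs."""
    return sum(score for _, score in anomalies)

# A service account that routinely touches new assets trips the same
# first-time-access indicator in session after session, inflating its
# scores even though the trigger carries little information here.
noisy_session = [("first_time_asset_access", 20),
                 ("first_time_asset_access", 20),
                 ("abnormal_logon_time", 15)]
print(naive_session_score(noisy_session))  # 55
```

The benign service account scores 55, potentially outranking a genuinely suspicious session whose rarer anomalies carry lower anchor scores.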
At Exabeam, an anomaly score is first adjusted dynamically based on a variety of factors from the behavior profiles. See Figure 2. One example is score adjustment for peer group-based indicators based on the degree of membership. Then we use a Bayesian method to reduce the false positives associated with indicators that trigger frequently. Individual risk scores are weighted according to the observed historical triggering frequencies at both the global and user levels before we sum them up into a final session score. The Bayesian score model is trained periodically to capture the population's and individual users' dynamic anomaly-triggering behavior.
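One simple way to realize this kind of frequency-based down-weighting is to shrink an indicator's anchor score by the surprisal of its smoothed trigger probability. This is a hedged sketch in the spirit of the approach described above, not Exabeam's actual model; the rarity cutoff and the choice of taking the smaller of the global and user-level weights are assumptions for illustration.

```python
import math

RARE_P = 0.01  # assumed cutoff: trigger rates at or below this keep full weight

def rarity_weight(trigger_count, total_sessions, smoothing=1.0):
    """Weight in (0, 1]: rare indicators keep full weight, frequent ones shrink."""
    # Laplace-smoothed trigger probability for this indicator
    p = (trigger_count + smoothing) / (total_sessions + 2.0 * smoothing)
    # Surprisal ratio: -log(p) relative to -log(RARE_P), capped at 1
    return min(1.0, math.log(p) / math.log(RARE_P))

def calibrated_score(anchor_score, global_weight, user_weight):
    """Take the more pessimistic (smaller) of the two frequency weights."""
    return anchor_score * min(global_weight, user_weight)

# Indicator is rare across the population but fires constantly for this user:
g_w = rarity_weight(trigger_count=5, total_sessions=10_000)  # -> 1.0 (rare globally)
u_w = rarity_weight(trigger_count=90, total_sessions=100)    # small (frequent locally)
print(round(calibrated_score(20, g_w, u_w), 2))  # 0.5
```

The service account's chronic first-time-access anomaly now contributes almost nothing, while the same anomaly on a user who has never triggered it keeps its full anchor score.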
In summary, this example of a UEBA scoring system is both expert-driven and data-driven. Statistics-, fact-, or machine learning-based anomalies in a session initially carry expert-assigned anchor scores. Each score is then calibrated based on a variety of data factors specific to that anomaly. Finally, Bayesian modeling is used to learn indicators' triggering frequencies to enhance the precision of the final output scores. This scoring process has performed well in the field, and it is also highly explainable for later investigation efforts.
The mathematical details of various such calibrations are important and would be topics for future blogs.