
Account Resolution via Market Basket Analysis

Machine learning and statistical analysis have many practical applications in detecting malicious users and entities as part of User & Entity Behavior Analytics (UEBA) solutions. Threat detection typically garners the attention; this is as true on the show floor of security conferences as it is in the text of marketing material. Equally important, although less often mentioned, is the application of machine learning to context estimation. Contextual information, such as whether a machine is a laptop or a server, or whether an account is a service or human account, helps analysts define statistical detection indicators and provides richer information for security alert triage and investigation. When contextual information is not available or up to date (as is often the case for many enterprises), using machine learning to derive context is particularly advantageous. While several of my past blogs have focused on other applications of this technique, for example dynamic peer analysis, today’s post discusses using machine learning to indicate whether two or more accounts actually belong to the same person.

The motivation for this use case is two-fold:

Forensic Investigation

Generally speaking, context assists in forensic investigation.  For example, let’s say an employee named Joe Smith has several accounts he uses, including:

  • “JoeS” – a normal user account used for day-to-day work.
  • “js_admin” – an administration account for his network management tasks.
  • “smith1” – an account used for some cloud-based applications.

Without the context that JoeS, js_admin, and smith1 are actually the same user, we can’t link their activities together. As in the parable of the blind men describing an elephant, it is difficult to understand Joe Smith’s behavior if we have only partial visibility into his functional roles and their related accounts. By combining the activity streams from multiple accounts, an analyst gets a complete view of Joe’s activity and can better interpret raised alerts during a forensic investigation.

Detecting Account Misuse

In addition to helping with forensic analysis, context helps detect account misuse. Suppose we observe that another user account, “alice”, which belongs to an employee named Alice, initiates an account-switching event to js_admin. It is important to raise an alert because js_admin is Joe’s account. Such an event indicates one of two things, both of which demand attention: Joe and Alice are sharing accounts (a poor IT practice), or there is a potential security risk indicative of malicious behavior.
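
As a rough illustration of the check described above, the sketch below flags an account switch whose source and target accounts resolve to different people. The owner mapping and the function name are hypothetical; in practice the mapping would come from whatever account resolution output is available.

```python
# Hypothetical account-to-owner mapping, e.g. the output of account resolution.
ACCOUNT_OWNER = {
    "JoeS": "Joe Smith",
    "js_admin": "Joe Smith",
    "smith1": "Joe Smith",
    "alice": "Alice",
}


def is_suspicious_switch(source_account, target_account, owners=ACCOUNT_OWNER):
    """Return True when an account switch crosses from one owner to another."""
    source_owner = owners.get(source_account)
    target_owner = owners.get(target_account)
    # Without context for both accounts we cannot judge the switch.
    if source_owner is None or target_owner is None:
        return False
    return source_owner != target_owner


print(is_suspicious_switch("alice", "js_admin"))  # different owners
print(is_suspicious_switch("JoeS", "js_admin"))   # same owner
```

Returning False for unknown accounts is a design choice made for this sketch; a real deployment might prefer to surface those cases separately.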

So, the question is how to identify Joe’s multiple accounts.

Taking a Data-driven Approach

The idea is simple. Enterprise authentication logs, such as Windows authentication logs or cloud application logs, contain account logon activity data that includes the account used, the time, and the IP address involved. If two or more accounts are frequently observed using the same IP address within the same time unit, such as a day, then these accounts likely belong to one user. The intuition is sound: users usually have a go-to machine (e.g. a laptop) or a few machines from which they normally log in to their accounts to conduct their work. With the problem framed this way, it is easy to see that classic Market Basket Analysis using association rule learning is a reasonable algorithm choice for performing account resolution analysis.

Account Resolution via Market Basket Analysis

Market Basket Analysis gets its name from its typical application in the retail sector. A basket is a single transaction at a checkout counter containing a number of purchased items. Given enough historical transactions, we can identify items that tend to be purchased together in a basket. From there, we can derive association rules of the form: if certain items are bought, then we can expect customers to buy certain other items as well. The items discovered this way are highly associated or correlated with one another. Retailers use these findings to decide on the location and promotion of goods inside the store. Without diving into details, the Apriori algorithm is the search strategy used to find these relationships among items across all baskets.
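
For readers who want a feel for the search strategy, here is a compact and unoptimized sketch of a level-wise frequent itemset search in the spirit of Apriori. The grocery baskets and the 0.5 support threshold are made up for illustration, and production work would normally rely on an established library instead.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    current = {}
    for item in {item for b in baskets for item in b}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Made-up retail transactions.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
frequent = apriori(baskets, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

The pruning step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidate combinations never need to be counted.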

From the analysis perspective, our account resolution use case is no different from the retail application. Consider a tuple of an IP address and a day as a basket, and the logon accounts seen on that IP on that day as the items in the basket. For a user like Joe Smith, on many days we see logon activity from both JoeS and js_admin on one machine (e.g. his laptop); that is, many IP + day baskets would contain the items {JoeS, js_admin}. Similarly, many other IP + day baskets may contain the items {JoeS, smith1}.
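
The basket construction described above can be sketched in a few lines. The event tuples below are hypothetical, simplified logon records (timestamp, source IP, account); real Windows or cloud application logs would need parsing first, and the field layout here is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, simplified logon events: (ISO timestamp, source IP, account).
events = [
    ("2017-03-01T08:55:00", "10.0.0.12", "JoeS"),
    ("2017-03-01T09:10:00", "10.0.0.12", "js_admin"),
    ("2017-03-01T13:30:00", "10.0.0.12", "JoeS"),
    ("2017-03-02T09:02:00", "10.0.0.12", "JoeS"),
    ("2017-03-02T10:45:00", "10.0.0.12", "smith1"),
    ("2017-03-02T11:00:00", "10.0.0.40", "bob"),
]


def build_baskets(events):
    """Group accounts into baskets keyed by (IP address, calendar day)."""
    baskets = defaultdict(set)
    for timestamp, ip, account in events:
        day = datetime.fromisoformat(timestamp).date()
        # A set keeps each account once per basket, however often it logs on.
        baskets[(ip, day)].add(account)
    return dict(baskets)


baskets = build_baskets(events)
for (ip, day), accounts in sorted(baskets.items()):
    print(ip, day, sorted(accounts))
```

Using a set per basket means repeated logons by the same account on the same IP and day count once, which matches the retail notion of an item either being in the basket or not.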

Of course, machines are not exclusive to Joe; otherwise, account resolution would be trivial. Just as retail baskets can contain a variety of items, the IP + day baskets in which Joe’s accounts are seen can occasionally contain other accounts as well, for example those of Bob or Tom. Despite this “noise” or variability in the data, the algorithm is capable of finding accounts that tend to co-occur across various IP + day baskets. By properly thresholding various statistics (e.g. support, confidence, and lift) on these associated items, we get a good estimate of which accounts belong to the same user.
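
To make the thresholding step concrete, here is a minimal sketch that computes support, confidence, and lift for account pairs and keeps the pairs that pass all three cutoffs. The baskets and threshold values are made up, and taking the smaller of the two directional confidences is an assumption of this sketch, chosen so that the pair has to be strongly associated in both directions.

```python
from collections import Counter
from itertools import combinations


def pair_metrics(baskets):
    """Compute support, confidence, and lift for every co-occurring pair."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    metrics = {}
    for (a, b), count in pair_counts.items():
        support = count / n                    # P(a and b)
        conf_ab = count / item_counts[a]       # P(b | a)
        conf_ba = count / item_counts[b]       # P(a | b)
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        metrics[(a, b)] = {
            "support": support,
            "confidence": min(conf_ab, conf_ba),  # assumption: require both
            "lift": lift,
        }
    return metrics


def same_user_candidates(baskets, min_support, min_confidence, min_lift):
    """Return the account pairs that pass all three thresholds."""
    return sorted(
        pair for pair, m in pair_metrics(baskets).items()
        if m["support"] >= min_support
        and m["confidence"] >= min_confidence
        and m["lift"] >= min_lift
    )


# Made-up IP + day baskets; bob and tom play the role of noise.
baskets = [
    {"JoeS", "js_admin"},
    {"JoeS", "js_admin", "bob"},
    {"JoeS", "js_admin"},
    {"JoeS", "smith1"},
    {"bob", "tom"},
    {"tom"},
    {"JoeS", "js_admin", "tom"},
    {"bob"},
]
candidates = same_user_candidates(
    baskets, min_support=0.3, min_confidence=0.6, min_lift=1.2
)
print(candidates)
```

In this toy data the pair (JoeS, smith1) appears together only once, so it does not pass the support threshold. That is the tradeoff the thresholds control: stricter cutoffs suppress noise but need more observations before a true pairing is accepted.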

Fine Tuning the Results

Data science output is probabilistic, and there is always some tradeoff between false positives and detection rates. If desired, additional heuristics can be applied to sharpen the results. For example, corporate account naming conventions usually draw on letters from the user’s actual name, as with JoeS and js_admin for Joe Smith. We can employ a post-processing step that achieves high-precision output by requiring the names of same-user account candidates to meet some minimum degree of string similarity.
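
One possible form of such a post-processing step is sketched below using Python's standard difflib module. The role-token list, the normalization, and the 0.5 threshold are all assumptions for illustration; a real deployment would tune them to its own naming conventions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical role markers to ignore when comparing account names.
ROLE_TOKENS = {"admin", "adm", "svc"}


def normalize(account):
    """Lowercase, split on non-letters, and drop role tokens."""
    tokens = re.split(r"[^a-z]+", account.lower())
    return "".join(t for t in tokens if t and t not in ROLE_TOKENS)


def name_similarity(account_a, account_b):
    """Similarity in [0, 1] between two normalized account names."""
    a, b = normalize(account_a), normalize(account_b)
    if not a or not b:
        return 0.0  # nothing left to compare after normalization
    return SequenceMatcher(None, a, b).ratio()


def filter_by_name(candidate_pairs, min_similarity=0.5):
    """Keep only candidate pairs whose names are sufficiently similar."""
    return [
        (a, b) for a, b in candidate_pairs
        if name_similarity(a, b) >= min_similarity
    ]


# Made-up candidate pairs, e.g. from the association analysis.
pairs = [("JoeS", "js_admin"), ("JoeS", "bob")]
print(filter_by_name(pairs))
```

A filter like this trades recall for precision: a pair such as JoeS and smith1 shares few characters and would be dropped even though both accounts belong to Joe Smith, so the step is best treated as optional.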

Account resolution is a good example of using machine learning to enrich contextual information for user accounts. This blog illustrated how simple Market Basket Analysis can be applied to the task of adding valuable user context.  I hope you found it interesting!


Topics: data science
2017