The Wrong (and Right) Way to Engage Data Science in Security Analytics
More enterprises are waking up to the fact that data analytics is becoming an inseparable part of their cybersecurity defense posture. An immediate question is how to integrate the traditional security operations center (SOC) with the data science team. Beyond its obvious implications for organizational structure, the answer to this question matters for deriving value from data science work and for making a data analytics product procurement decision. I’ll give some examples and use cases relating to this question.
The Wrong Way: Viewing data science as a product
In this scenario, the SOC team views the data science team’s output as another alert-generating product, alongside all the security products it has already purchased. The deterministic world that the SOC lives in hasn’t necessarily converged with the data scientists’ probabilistic world. Therefore, organizationally and functionally, it is convenient to have a clear boundary between the two teams, where alerts generated by the data science team flow to the SOC, leaving the SOC to perform the subsequent triage and forensic investigations. The hope is that data science, with all its advanced mathematics, can produce high-quality alerts with minimal engagement from the SOC. Unfortunately, this is a dangerous route with unrealistic expectations.
A typical scenario goes like this. The SOC team, realizing the limits of deterministic correlation rules, wants to find user behavior anomalies in log data to discover compromised account activity as early as possible. This high-level use case description is then handed over to the data scientists, who go off to work. Meanwhile, the SOC team is overwhelmed with its day-to-day investigations and wants to hear from the data science team only when there are results.
Data scientists are a creative sort and they can always come up with something that is anomalous. Besides, anomaly detection isn’t a new research topic; many tools abound. It’s not hard to construct an anomaly detection framework.
Speaking from experience, let me illustrate with an example. Given the historical infrastructure access records of a user, we can create an N×M matrix, where N is the number of time points and M is the number of network assets the user has ever touched. Each matrix cell (i,j) captures the user’s access frequency at time i for network asset j. Given new user behavior data captured in a new row for time N+1, the goal is to find the degree of anomaly of this new row relative to the rest of the data. Once the data is organized mathematically in this way, there is more than one way to measure the distance of the new row from the rest of the data. One way is to compute such a distance in a lower-dimensional spectral domain via singular value decomposition. This technique and others like it do find anomalies. These are then presented to the SOC. End of data science work.
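To make the idea concrete, here is a minimal sketch of one such spectral distance, assuming NumPy is available. It treats each time point as a row of the N×M history, keeps the top k right singular vectors of the mean-centered history, and scores the new observation by how much of it falls outside that low-dimensional subspace (a reconstruction residual). The function name, the parameter k, and the choice of residual norm as the distance are my assumptions for illustration; other distances in the spectral domain would be equally valid.

```python
import numpy as np

def spectral_anomaly_score(history, new_row, k=2):
    """Score how far new_row lies from the dominant access patterns.

    history: (N, M) array, rows are time points, columns are network assets.
    new_row: (M,) array of access counts for the next time point.
    k: number of singular vectors to keep (an assumed tuning parameter).
    """
    history = np.asarray(history, dtype=float)
    new_row = np.asarray(new_row, dtype=float)

    # Center on the user's average access profile.
    mean = history.mean(axis=0)

    # Rows of vt are right singular vectors: directions in asset space
    # along which this user's historical behavior varies the most.
    _, _, vt = np.linalg.svd(history - mean, full_matrices=False)
    basis = vt[:k]

    # Project the new observation onto the retained directions and
    # measure what is left over.
    centered = new_row - mean
    projection = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - projection))

# Toy history: the user only ever touches the first two assets.
history = [[5, 3, 0, 0],
           [6, 2, 0, 0],
           [4, 4, 0, 0],
           [5, 3, 0, 0]]

print(spectral_anomaly_score(history, [6, 2, 0, 0], k=1))  # familiar pattern
print(spectral_anomaly_score(history, [5, 3, 0, 7], k=1))  # never-seen asset
```

The second observation scores far higher than the first because the access to a previously untouched asset cannot be explained by the user’s historical pattern. Note that the score alone says nothing about why the access happened, which is exactly the gap discussed next.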
The problem is that a highly dynamic modern enterprise network always has plenty of anomalies. For example, IT admins’ legitimate actions frequently generate noisy anomalies. Furthermore, anomalies without explanation or context are useless for triage. Investigating frequent false-positive anomalies requires non-trivial forensic effort. To the SOC, this is yet another noisy product when they are already inundated with alerts from other security products. SOC teams are busy folks; without further collaboration, the results quickly fall short of expectations.
Such data science-led output, viewed as some black-box product output with minimal or no involvement from SOC, is bound to disappoint.
The Right Way: Closely integrating security and data science teams
In this scenario, the security team embraces data science work, treating the data scientists as part of the team rather than drawing a fence between the two. Rather than regarding the data science team as just another upstream alert-generating source, the SOC collaborates proactively with the data science team to create detailed use cases for more focused needs.
Take the same example as above. In addition to the core data modeling effort from data scientists, SOC analysts actively participate to provide their perspectives. Rather than delivering piecemeal work, data scientists now have the full picture of what makes a useful output. For example, presenting alerts to the SOC is just the beginning. The SOC would want to know the full context in order to triage and investigate:
- Is this a service or a human account?
- Is the device accessed an executive-level device?
- Is the device a host or a server?
- What are the prior activities (parsed, transformed, and sessionized)?
Indeed, pulling this contextual information is not trivial. Some factual information requires the right architecture and platform support for efficient data retrieval. Other derived information opens up additional data science work. Active collaboration between the two teams opens up possibilities that make the end-to-end security workflow actually work.
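One hypothetical way to carry this context alongside an alert is sketched below. The `EnrichedAlert` fields, the `enrich` helper, and the lookup tables are all assumptions made for illustration; in practice each lookup would be backed by whatever asset inventory, directory, and log platform the enterprise runs, and some fields (such as the account type) would themselves come from a model rather than a table.

```python
from dataclasses import dataclass, field

@dataclass
class EnrichedAlert:
    account: str
    device: str
    anomaly_score: float
    account_type: str = "unknown"     # "human" or "service"
    device_role: str = "unknown"      # "host" or "server"
    executive_device: bool = False
    prior_sessions: list = field(default_factory=list)

def enrich(alert, account_types, device_info, sessions):
    """Attach triage context to an alert from simple lookup tables.

    account_types: dict mapping account name to "human" or "service".
    device_info:   dict mapping device name to {"role": ..., "executive": ...}.
    sessions:      dict mapping account name to a list of prior sessions.
    Missing entries leave the corresponding field at its default.
    """
    info = device_info.get(alert.device, {})
    alert.account_type = account_types.get(alert.account, "unknown")
    alert.device_role = info.get("role", "unknown")
    alert.executive_device = info.get("executive", False)
    alert.prior_sessions = sessions.get(alert.account, [])
    return alert

# Made-up example data.
alert = enrich(
    EnrichedAlert(account="svc_backup", device="exec-laptop-01", anomaly_score=0.9),
    account_types={"svc_backup": "service"},
    device_info={"exec-laptop-01": {"role": "host", "executive": True}},
    sessions={"svc_backup": ["nightly backup run"]},
)
print(alert)
```

Even in this toy form, an analyst reading the enriched alert can see at a glance that a service account touched an executive’s device, which is a far better starting point for triage than a bare score.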
For example, to determine whether an alerted account is a human or service account, data scientists need the SOC’s intuition about what makes good machine learning features in order to build a classifier that leverages historical infrastructure access data. The SOC may suggest example features such as whether an account generates many events, connects to many hosts, or has periodic behavior. In the machine learning world, human input still counts for a lot. As another example, to determine whether a device is a host or a server, data scientists need security analysts’ domain knowledge to parse and retrieve the relevant Windows Active Directory events for modeling.
In short, I urge enterprises not to treat the SOC and data science teams as two separate organizational or functional entities. See them as one team, or risk failure.
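As a rough sketch of how such SOC-suggested features might be computed before being fed to a classifier, consider the following. The function name, the input shape (a list of timestamp and host pairs), and the use of the coefficient of variation of inter-event gaps as a periodicity signal are all assumptions for illustration; a real pipeline would choose features and thresholds together with the analysts.

```python
from statistics import mean, pstdev

def account_features(events):
    """Compute simple per-account features.

    events: list of (timestamp_seconds, host) tuples for one account.
    Returns event volume, host fan-out, and a periodicity hint.
    """
    times = sorted(t for t, _ in events)
    hosts = {h for _, h in events}

    # Gaps between consecutive events; highly regular gaps suggest
    # scheduled, machine-driven activity.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) >= 2 and mean(gaps) > 0:
        gap_cv = pstdev(gaps) / mean(gaps)   # near 0 means very periodic
    else:
        gap_cv = None                        # not enough data to judge

    return {
        "event_count": len(events),
        "distinct_hosts": len(hosts),
        "gap_cv": gap_cv,
    }

# Made-up examples: a job firing every 60 seconds vs. irregular logins.
scheduled = [(0, "db01"), (60, "db01"), (120, "db01"), (180, "db01")]
irregular = [(0, "hostA"), (10, "hostB"), (500, "hostC"), (3000, "hostA")]

print(account_features(scheduled))
print(account_features(irregular))
```

The perfectly regular account yields a gap variation of zero, while the irregular one yields a much larger value. Features like these are only inputs; labeled examples and analyst review would still be needed to train and validate any human-versus-service classifier.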
Buying a data analytics product
Incidentally, the same argument can be made for the process of making an analytics-based security product procurement decision. Evaluating and purchasing such a product based solely on a data science team’s evaluation, without direct feedback from the security team, increases the risk of SOC end users getting something they don’t need: another noisy alerting source that saps their resources. One can’t evaluate such a product on data science alone, for example by the number of machine learning algorithms it has. The security folks must be involved in the purchasing decision and must judge how well the product meaningfully transforms the data for interpretation, presents context, facilitates the triage and forensic process, and ultimately makes their operations more efficient, data science or not.