A security executive recently reflected with me on his experience building a security analytics practice in his enterprise. His team has come a long way, having hired a couple of data scientists and set up the requisite Big Data infrastructure. Some lessons have been learned, but some challenges remain. As a data scientist who loves to get his hands dirty with data, I believe there are clear benefits to building data science models that target emerging use cases on common or enterprise-custom data sources. But from my past data science engagements with clients, I also recognize the challenges an enterprise faces in building a data science practice for cyber security in house.
Use Case Definition and Scoping
Finding the right use cases is the most important first step in the data science journey. This is easier said than done. While enterprises log data from a wide variety of sources, it is not always apparent what use cases can be built from them. Recognized examples are the use of authentication logs for user-behavior analytics (UBA) and the use of web access logs for malicious domain detection (Oprea et al., 2014). Other interesting examples are the use of proxy logs to detect poisoned watering hole domains, or the use of logical and physical access logs to identify possible account misuse or compromise. It takes a joint effort among data scientists, security folks, and business owners to create and define use cases. Together they need to discuss each use case's value proposition, data availability, and modeling complexity. Prioritizing candidate use cases needs input from all stakeholders. Without stakeholder alignment, data science use cases risk falling short of expectations during or after execution.
Given the hyperbole around Big Data analytics, it is easy to over-promise with data science. A use case that is poorly scoped is likely to fail in execution, too. Take the use case of leveraging web logs to detect malicious domain names by modeling each domain’s behavior. Do you have historical known-malicious domain names available for supervised learning? If not, expect unsupervised learning to produce a higher false positive rate. Even if you have some past malicious domain names, what is the quality, volume, and relevance of these labels? Known adware distribution domains aren’t useful for detecting tier-1 C&C domains. Proper use case scoping includes a data reality check before execution begins.
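The data reality check described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `label_reality_check`, the category strings, and the toy domains are all hypothetical, and it assumes the labels are available as a simple mapping from domain name to threat category.

```python
from collections import Counter


def label_reality_check(observed_domains, labeled_malicious, target_category):
    """Summarize how many known-bad labels are usable for one use case.

    observed_domains: iterable of domain names seen in the web logs
    labeled_malicious: dict mapping domain name -> threat category (assumed format)
    target_category: the category the use case is meant to detect
    """
    observed = {d.lower() for d in observed_domains}
    # Labels of the right kind (e.g. C&C rather than adware).
    relevant = {
        d.lower() for d, cat in labeled_malicious.items() if cat == target_category
    }
    return {
        "labels_total": len(labeled_malicious),
        "labels_by_category": dict(Counter(labeled_malicious.values())),
        "labels_relevant": len(relevant),
        # Relevant labels that actually appear in our own logs.
        "labels_relevant_and_observed": len(relevant & observed),
    }


# Hypothetical toy data, for illustration only.
observed = ["example.com", "ads-tracker.test", "cc-node.test", "intranet.test"]
labels = {
    "ads-tracker.test": "adware",
    "popup-farm.test": "adware",
    "cc-node.test": "c2",
}
report = label_reality_check(observed, labels, target_category="c2")
print(report)
```

If the count of relevant labels that also appear in the logs is close to zero, supervised learning is probably off the table for that use case, and the scope should say so up front.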
Similarly, for an insider threat detection use case, what data do we have to support it? Authentication logs and the more granular file-level (or database) access logs carry different levels of signal. Plan to invest the effort needed to obtain the data with the strongest signal possible.
Folks with both a data science background and security domain knowledge are hard to find. However, close collaboration between the two camps can make magic happen. This is also easier said than done. Security analysts’ input is critical: their forensic feedback on data science-based outputs is what enables iterative model tuning. Unfortunately, from what I’ve seen, unless the two sets of folks are in the same business unit and fully aligned, busy security analysts under constant fire drills have neither the time nor the incentive to work with data scientists.
Data scientists aren’t created equal. We all come with different experiences and biases. Given that security analytics is still relatively new, the solution frameworks a team builds are only as good as the people who build them. Staff the team wisely.
Unlike in other domains, it’s difficult to evaluate a data science model’s output in security analytics. For example, in image analytics, the validation of, say, a neural network model’s classification result is immediately visual. Similarly, in bank fraud detection, model validation is not difficult: either the model caught the fraud, or it had a false negative that resulted in monetary loss. Unfortunately, cyber security analytics deals with the unknown. Unless we are lucky, known breaches are either few or non-existent over the time period in which the prediction model is to be benchmarked. Objective evaluation is difficult. Some enterprises resort to employing a red team to run penetration tests as simulated attacks.
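When a red team exercise is available, its planted activity can serve as partial ground truth. The sketch below is one possible way to score a model's alerts against it, assuming alerts and red-team activity can both be reduced to sets of entity identifiers. The function name `score_alerts` and the host names are hypothetical.

```python
def score_alerts(alerted_entities, red_team_entities):
    """Compare model alerts with entities the red team is known to have touched.

    Alerts outside the red-team set are reported as "unconfirmed" rather than
    false positives: they may be real, unrelated findings that need review.
    """
    alerted = set(alerted_entities)
    planted = set(red_team_entities)
    hits = alerted & planted
    return {
        "detected": sorted(hits),
        "missed": sorted(planted - alerted),
        "unconfirmed": sorted(alerted - planted),
        # Recall against the simulated attack only.
        "recall": len(hits) / len(planted) if planted else 0.0,
        # A lower bound on precision, since unconfirmed alerts may be genuine.
        "precision_lower_bound": len(hits) / len(alerted) if alerted else 0.0,
    }


# Hypothetical toy data, for illustration only.
alerted = ["host-a", "host-b", "host-c", "host-d"]
planted = ["host-b", "host-d", "host-e"]
result = score_alerts(alerted, planted)
print(result)
```

The recall figure only says how well the model caught this particular simulated attack. It does not generalize to attacks the red team did not attempt.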
Subjective and incidental evaluation is also possible. If the goal is to alert on new malware, one must be prepared to perform deep forensic work, such as machine image analysis. If the goal is to detect insider threats, the right investigation channel must be set up. In at least one enterprise I worked with, a formal inquiry had to be filed with HR to launch an investigation. This is quite challenging for data science projects that need to show value quickly, especially if more than a few output alerts are presented for review. Hence, depending on how difficult evaluation is, a typical subjective criterion is simply the ability of the data science model (or security product, for that matter) to sensibly explain its own alerts with minimal false positives. Such evaluation challenges must be recognized up front so that a data science-based solution can be designed accordingly.
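To make "explaining its own alerts" concrete, here is a deliberately simple sketch: a per-user baseline with a z-score threshold, where the alert text carries the evidence an analyst needs. This is an illustrative stand-in, not how any particular model or product works. The function name `explain_alert`, the feature (distinct hosts accessed per day), and the threshold of 3.0 are all assumptions.

```python
import statistics


def explain_alert(user, history, today, threshold=3.0):
    """Return a human-readable alert if today's count is far above baseline.

    history: past daily counts for this user (assumed feature: distinct hosts)
    today: today's count
    Returns None when no alert is warranted.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Flat baseline: a z-score is undefined, so this sketch stays silent.
        return None
    z = (today - mean) / stdev
    if z < threshold:
        return None
    return (
        f"{user}: {today} distinct hosts accessed today vs. a baseline of "
        f"{mean:.1f} (+/- {stdev:.1f}) over {len(history)} days; z-score {z:.1f}"
    )


# Hypothetical toy data, for illustration only.
history = [3, 4, 3, 5, 4, 3, 4]
alert = explain_alert("alice", history, today=12)
print(alert)
```

The point of the design is that the alert states what was observed, what was expected, and over what period, so a reviewer can judge it without rerunning the model.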
Data science projects typically start with a proof-of-concept (POC). But there is much work to do in transitioning from a POC to an operationalized framework. A POC usually starts with a batched data set and a singular focus on finding the maximum signal possible. Issues pertinent to eventual successful operationalization are often overlooked. For example, runtime performance constraints, guidance on tuning model parameters in production, integration with the existing case workflow, ease of alert interpretation, and the staffing required to maintain the models are all critical elements that are best taken into account even during the POC stage. Otherwise, a POC may degenerate into just an interesting science project that is never operationalized.
To Build or To Buy?
This brings us to the question of whether an enterprise should build its own security analytics projects or buy products from vendors.
To build is to commit to a long-term investment in a data science journey. It starts with an organization’s vision for a data science-driven culture, so as to secure the necessary long-term support for data science, infrastructure, and security: the three legs of security analytics. Also see the great blog post from Pivotal that discusses the data science transformational path for an organization. Given the vision, and if executed correctly, build-your-own can pay off in the long run, enabling numerous use cases that leverage new ideas and data sources that vendors otherwise don’t use.
To buy brings some immediate benefits, not least coverage of the popular use cases. But my advice is to dive deep into vendors’ offerings. In the area of UBA, not all products are created equal. A good product hides the data science under its hood, keeping the focus on meeting security analysts’ needs with a superb ability to explain its results, all while keeping false positives very low.
It’s an interesting question to wrestle with. My take is that there is probably room for both decisions, so long as the challenges of build-your-own are recognized and the security product landscape is well surveyed.
Oprea, Alina, et al. “Detection of early-stage enterprise infection by mining large-scale log data.” arXiv preprint arXiv:1411.5005 (2014).