Data Science and Security Research: Two Parts of the Whole
My friend and Exabeam's Chief Data Scientist, Derek Lin, previously wrote a blog post about the wrong and right ways for data science and security operations teams to interact. In this post, I would like to expand on that idea and talk about the nature of the two disciplines, their complementary aspects, and how each is indispensable for meaningful security analytics.
Data science has done wonders in many vertical domains, from retail to marketing to biology. It is only natural to try to apply it to the security domain, especially now that traditional rule-based approaches have reached their limits. However, the security domain has some unique aspects that make it more challenging for data science to consume directly.
A major difference between security and other domains is that elsewhere the result required from data science is usually well defined, e.g. the presence or absence of a disease, or the list of movies a user is likely to watch. In security analytics, however, the required output is the identification of “malicious behavior.” This definition is subjective and much vaguer. Analyzing results is also difficult, as there is usually more than one way to explain unusual behavior (more on that here).
Another challenge is the wide gap between raw security data and the data that algorithms can successfully consume. My favorite example is that a simple logon in a Windows environment produces at least four log events of different types. Without an understanding of the intricacies of the Kerberos protocol, how it is implemented in Windows, and how it interacts with the security logs, these events will be treated independently, whereas in fact they are all part of a single transaction.
Trying to digest raw security data directly will almost surely produce results that are statistically significant yet useless.
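To make the logon example concrete, here is a minimal Python sketch of collapsing several raw events into one transaction before any statistics are computed. The `LogEvent` record, the `group_into_transactions` helper, and the two-second window are all hypothetical simplifications of my own; the event IDs (4768 and 4769 for Kerberos ticket requests, 4624 for a successful logon) are used only as illustrative values, and real correlation would rely on far more than a user name and a timestamp.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LogEvent:
    """Hypothetical, heavily simplified log record (not a real log schema)."""
    timestamp: float  # seconds since some reference point
    user: str
    event_id: int


def group_into_transactions(events: List[LogEvent],
                            window_seconds: float = 2.0) -> List[List[LogEvent]]:
    """Group each user's events into bursts.

    An event joins the user's current burst if it occurs within
    window_seconds of that burst's most recent event; otherwise it
    starts a new burst. The window is an illustrative assumption.
    """
    transactions: List[List[LogEvent]] = []
    current_by_user: Dict[str, List[LogEvent]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        current = current_by_user.get(ev.user)
        if current and ev.timestamp - current[-1].timestamp <= window_seconds:
            current.append(ev)
        else:
            current = [ev]
            current_by_user[ev.user] = current
            transactions.append(current)
    return transactions


events = [
    LogEvent(100.0, "alice", 4768),
    LogEvent(100.2, "alice", 4769),
    LogEvent(100.3, "alice", 4624),
    LogEvent(100.4, "bob", 4768),
    LogEvent(500.0, "alice", 4768),
]
transactions = group_into_transactions(events)
for tx in transactions:
    print(tx[0].user, [ev.event_id for ev in tx])
```

An algorithm fed the five raw rows would count five independent observations; after grouping, it sees three transactions, which is much closer to what actually happened. Knowing which events belong together, and why, is exactly the kind of knowledge that comes from security research rather than from the data itself.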
Security research can fill many of the gaps mentioned above. Good security researchers will not only understand the meaning of the logs, they will also know how to extract information that is concealed in them or present only indirectly. They will understand things like the difference between the logging mechanisms of domain controllers and member servers, and whether an event is likely to appear often or seldom. They will know the meaning of hundreds of event types and sub-types, and which events are must-haves and which can be safely ignored.
However, when it comes to identifying servers or users that are frequently observed together, or applying collective inference to decisions, most security researchers will have a harder time. In most cases, security researchers will not be familiar with concepts such as Probabilistic Graphical Models, Bayesian Networks, and Collaborative Filtering, or with the problems they can solve. This is where the security domain ends and the data science domain starts again.
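As a small illustration of what "frequently observed together" can mean in practice, the Python sketch below scores pairs of users by the Jaccard similarity of the sets of servers they access. This is only a basic building block of the kind used in neighborhood-style collaborative filtering, not a graphical model or a complete recommender; the `jaccard_user_similarity` function and the sample user and server names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def jaccard_user_similarity(
    access_pairs: Iterable[Tuple[str, str]]
) -> Dict[Tuple[str, str], float]:
    """Score each pair of users by the overlap of the servers they accessed.

    The score is |A & B| / |A | B|, where A and B are the two users'
    server sets: 1.0 means identical sets, 0.0 means nothing in common.
    """
    servers_by_user: Dict[str, Set[str]] = defaultdict(set)
    for user, server in access_pairs:
        servers_by_user[user].add(server)

    scores: Dict[Tuple[str, str], float] = {}
    for u, v in combinations(sorted(servers_by_user), 2):
        a, b = servers_by_user[u], servers_by_user[v]
        scores[(u, v)] = len(a & b) / len(a | b)
    return scores


# Hypothetical (user, server) access observations.
access_pairs = [
    ("alice", "db01"), ("alice", "web01"),
    ("bob", "db01"), ("bob", "web01"),
    ("carol", "hr01"),
]
scores = jaccard_user_similarity(access_pairs)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

Users with highly similar access patterns can be treated as an implicit peer group, so that a server that is new to one user but routine for their peers is judged differently from one that none of them has ever touched.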
As you can see, there is a symbiotic relationship between data science and security research. Neither will be fully effective without the other. Good security research and top-notch data science are both needed to provide meaningful security analytics. Each practitioner will have their strengths and shortcomings, but bringing together the right mix of both will create the complete solution that is so desperately needed today.