Data Science, Engineering and Security: 3 Legs of User Behavior Analytics
The term “user behavior analytics” (UBA) has been generating buzz in the cyber security community since last year. Gartner formally declared UBA a space for security vendors last year, and this year’s RSA Conference saw a higher level of UBA activity than before. This is a topic that I hold dear. Before joining Exabeam, I spent several years working on general behavior modeling for enterprise applications using security logs. The goals and benefits of UBA are becoming better understood, so here I want to offer my perspective on its components.
There are three components in UBA:
- Data science methods that learn from past data to flag anomalies and produce high-precision alerts;
- Platform support that allows efficient ingestion and processing of large, high-velocity data streams; and
- Security expertise that drives the direction of the use cases.
Organizations building out UBA capacity must fully understand and support all three legs of the UBA stool.
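To make the first leg concrete, here is a minimal sketch of learning from past data to flag anomalies. It assumes the history is available as simple (user, host) pairs, and the function names `build_baseline` and `rarity_score` are hypothetical. It illustrates the general idea only and is not a description of any particular product's method.

```python
from collections import Counter, defaultdict


def build_baseline(history):
    """Count how often each user accessed each host in past logs.

    `history` is an iterable of (user, host) pairs (an assumed input shape).
    """
    baseline = defaultdict(Counter)
    for user, host in history:
        baseline[user][host] += 1
    return baseline


def rarity_score(baseline, user, host):
    """Return a score in [0, 1]; higher means more unusual for this user."""
    seen = baseline.get(user)
    if not seen:
        return 1.0  # no history at all for this user
    total = sum(seen.values())
    # Fraction of this user's past accesses that did NOT go to this host.
    return 1.0 - seen[host] / total


history = [("alice", "mail")] * 9 + [("alice", "wiki")]
baseline = build_baseline(history)
for host in ("mail", "wiki", "payroll-db"):
    print(host, round(rarity_score(baseline, "alice", host), 2))
```

A host the user has never touched scores 1.0, while a host that dominates the user's history scores close to 0. A real system would combine many such signals before raising an alert.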
The machine learning foundations of general behavioral modeling for anomaly detection have been around for many years. “Anomaly detection: A survey,” by Chandola, Banerjee, and Kumar, gives a good overview of general anomaly detection techniques. For security-centric behavior modeling in particular, there is a rich body of work on ISP-level network data, such as malicious domain detection, and on single-host log data, such as user command usage profiling. Only recently have publications appeared that study user population monitoring with enterprise-level network logs. This isn’t surprising. Enterprises are understandably reluctant to open up and share their internal security logs because of privacy concerns, and this has impeded research on techniques for enterprise-level UBA. In other communities, such as image processing or speech and language processing, the availability of standard benchmark data has enabled great strides in research. The security industry has no such benchmarks, which is a significant reason why data science research leveraging enterprise security logs for UBA has been spotty and slow to catch on.
However, we live in interesting times. Recurring high-profile breaches have encouraged, or forced, the security community to look for ways to bridge the gap. This year saw the first publication that uses a private enterprise-level authentication log, such as one from Active Directory, to analyze user behavior. In addition, the security community has started to recognize the need to engage data scientists on security problems, whether or not the work is academic. For example, over the past two years I have worked with large enterprises on custom data science engagements to build in-house insider threat detection applications using advanced machine learning methods. But expensive data science work for security should not be a privilege reserved for large enterprises with the means. Vendors now offer commercial solutions for enterprises that do not wish, or lack the means, to build in-house data science capacity. The use of data science for UBA will become more widely adopted.
For years, I have argued that security is a Big Data problem that requires infrastructure for massive storage, efficient querying, and data science library support. A massively parallel processing (MPP) database such as Greenplum or HP Vertica is well suited to data science studies. Researchers can iterate over very long periods of historical data to detect threats already present on the network (Mandiant’s 2015 M-Trends report still cites a malware dwell time of 200+ days) and get the results back in a reasonable amount of time. For the many security use cases where modeling and detection accuracy is key and real-time processing is not a requirement, a general MPP database is an appropriate platform.
In just over a year, the technology has moved to the next level, and so have use case requirements. In-memory, stream-based platforms such as Apache Spark or Pivotal’s GemFire enable more real-time use cases. For example, it is desirable to detect an account takeover as soon as it happens in order to minimize the damage. The challenge for data science work on such platforms lies in the time-space and accuracy-performance tradeoffs. Enterprises and vendors that stay close to advances in platform engineering will have an edge in UBA use cases.
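One way to picture the time-space tradeoff is a detector that keeps only a constant amount of state per user instead of the full history. The sketch below uses Welford's online algorithm for the running mean and variance. It is plain Python and does not use any Spark or GemFire API. The class names, the `threshold` and `warmup` parameters, and the choice of "logins per hour" as the measured value are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online algorithm: mean and variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else (x - self.mean) / std


class StreamingDetector:
    """Scores each new value against the user's running baseline."""

    def __init__(self, threshold=3.0, warmup=5):
        self.stats = defaultdict(RunningStats)
        self.threshold = threshold  # assumed cutoff, in standard deviations
        self.warmup = warmup        # observations required before flagging

    def observe(self, user, value):
        s = self.stats[user]
        flagged = s.n >= self.warmup and abs(s.zscore(value)) > self.threshold
        s.update(value)  # the new value joins the baseline either way
        return flagged


detector = StreamingDetector()
for logins_per_hour in [10, 12, 11, 9, 10, 11, 50]:
    if detector.observe("alice", logins_per_hour):
        print("anomaly:", logins_per_hour)
```

The state per user is three numbers, which makes the detector cheap to run on every event. The price is accuracy: it cannot model seasonality or use percentiles the way a batch job over the full history can.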
Data scientists typically do not speak the language of security, and security analysts usually aren’t well versed in the language of math. But today’s security climate is forcing data science and security skill sets to converge in the quest to perfect UBA. At Exabeam, for example, our security analysts and data scientists work side by side. Consider Active Directory (AD): it is the most common authentication mechanism in enterprises, yet it also produces the most indecipherable logs. Without security analysts’ expertise in interpreting the recorded AD events, data scientists cannot leverage the logged data to model user behavior and detect anomalous access patterns.
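As a small illustration of the kind of knowledge a security analyst contributes, the sketch below translates a few well-known Windows Security event IDs and logon types into labels a data scientist can model. The event IDs and logon type codes are standard Windows values. The input record layout (`event_id`, `user`, `host`, `logon_type`), the output labels, and the `normalize` function are hypothetical, and this is not a description of how Exabeam parses AD logs.

```python
# Standard Windows Security event IDs, mapped to assumed label names.
EVENT_MEANINGS = {
    4624: "logon_success",                    # account successfully logged on
    4625: "logon_failure",                    # account failed to log on
    4768: "kerberos_tgt_request",             # Kerberos TGT requested
    4769: "kerberos_service_ticket_request",  # Kerberos service ticket requested
    4776: "ntlm_credential_validation",       # credential validation (NTLM)
}

# Standard Windows logon type codes (a subset).
LOGON_TYPES = {2: "interactive", 3: "network", 10: "remote_interactive"}


def normalize(record):
    """Turn a raw event dict (assumed layout) into a modeling-friendly dict.

    Returns None for event IDs this sketch does not know about.
    """
    event_id = record.get("event_id")
    activity = EVENT_MEANINGS.get(event_id)
    if activity is None:
        return None
    out = {
        "user": (record.get("user") or "").lower(),
        "host": record.get("host"),
        "activity": activity,
    }
    if event_id in (4624, 4625):
        out["logon_type"] = LOGON_TYPES.get(record.get("logon_type"), "other")
    return out


raw = {"event_id": 4624, "user": "ALICE", "host": "ws01", "logon_type": 10}
print(normalize(raw))
```

Once events carry labels like these, questions such as "has this user ever logged on remotely to this host before?" become straightforward to model.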
In summary, data science, platform support, and security expertise are the key components of user behavior analytics for enterprise security applications. In future posts, I will share more thoughts on each of them.
Litan, A., & Nicolett, M. (2014). Market Guide for User Behavior Analytics. Gartner Inc.
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15.
Mahmoud, M., Nir, M., & Matrawy, A. (2015). A Survey on Botnet Architectures, Detection and Defences. International Journal of Network Security, 17(3), 272-289.
Salem, M. B., Hershkop, S., & Stolfo, S. J. (2008). A survey of insider attack detection research. In Insider Attack and Cyber Security (pp. 69-90). Springer US.
Kent, A. D., Liebrock, L. M., & Neil, J. C. (2015). Authentication graphs: Analyzing user behavior within an enterprise network. Computers & Security, 48, 150-166.