Traditionally, security technologies used two primary analytical techniques to detect security incidents:
The common denominator of these older techniques is that they are good at detecting known bad behavior. However, they suffer from two key drawbacks:
Addressing unknown risks requires advanced analytics. These risks include insider threats, which are tricky to detect because the users involved are legitimately logged into corporate systems. Advanced threat analytics technology can:
To achieve these types of analysis, new analytics methods are needed, as well as access to larger volumes of data than ever before.
Data science is a new discipline that leverages scientific and mathematical analysis of data sets, as well as human understanding and exploration, to derive business insights from big data.
Data science is helping security analysts and security tools make better use of security data, to discover hidden patterns and better understand system behavior.
Machine learning is part of the general field of Artificial Intelligence (AI). It uses statistical techniques to allow machines to learn without being explicitly programmed.
Machine learning goes beyond correlation rules, to examine unknown patterns and use algorithms for prediction, classification and insight generation.
Artificial Intelligence (AI) is claimed to be a part of many security analytics solutions. Don’t take vendor claims for granted—check what exactly is included in the term “AI”. How are vendors building their models? Which algorithms are used? Look under the hood to understand what exactly is being offered.
In supervised learning, the machine learns from a data set that contains inputs and known outputs. A function or model is built that makes it possible to predict what the output variables will be for new, previously unseen inputs.
Security tools learn to analyze new behavior and determine if it is “similar to” previous known good or known bad behavior.
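As a rough illustration of learning from inputs with known outputs, the sketch below labels a new sample by a majority vote of its nearest labeled neighbors. The nearest-neighbor method, the two features (failed logins per hour and megabytes uploaded) and all data values are assumptions chosen for illustration, not something prescribed above.

```python
from collections import Counter
from math import dist

# Hypothetical labeled training data:
# (failed_logins_per_hour, mb_uploaded) -> known label
TRAINING = [
    ((1, 5), "good"),
    ((0, 12), "good"),
    ((2, 8), "good"),
    ((40, 300), "bad"),
    ((55, 450), "bad"),
    ((35, 380), "bad"),
]

def classify(sample, training=TRAINING, k=3):
    """Label a new sample by majority vote of its k nearest labeled neighbors."""
    nearest = sorted(training, key=lambda item: dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((3, 10)))    # close to the known good behavior
print(classify((50, 400)))  # close to the known bad behavior
```

The model here is just the stored examples plus a distance measure; a production tool would use far more features and a trained model, but the "similar to known good or known bad" idea is the same.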
In unsupervised learning, the system learns from a dataset that contains only input variables. There is no correct answer; instead, the algorithm is encouraged to discover new patterns in the data.
Security tools use unsupervised learning to detect and act on abnormal behavior (without classifying it or understanding if it is good or bad).
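To make the contrast with labeled data concrete, here is a minimal sketch that groups unlabeled values into two clusters with a basic k-means loop. The choice of k-means, the function name `two_means` and the session-length data are assumptions for illustration only.

```python
def two_means(values, iterations=10):
    """Split unlabeled 1-D values into two clusters with a basic k-means loop."""
    centers = [min(values), max(values)]  # deterministic starting centers
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for v in values:
            # Assign each value to the closer of the two centers.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Hypothetical unlabeled data: session lengths in minutes
sessions = [4, 5, 6, 5, 7, 240, 6, 255, 4]
small, large = two_means(sessions)
print(small, large)
```

No label ever tells the algorithm which sessions are suspicious. It only reports that two sessions behave very differently from the rest, which is the kind of pattern an analyst or a downstream rule can then investigate.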
Deep learning techniques are loosely modeled on the human brain: they create networks of digital “neurons” and use them to process small pieces of data, to assemble a bigger picture. Deep learning is most commonly applied to unstructured data, and can automatically learn the significant features of data artifacts. Most modern applications of deep learning utilize supervised learning.
Deep learning is primarily used in packet stream and malware binary analysis, to discover features of traffic patterns and software programs and identify malicious activity.
Data mining is the use of analytics techniques, including statistical and machine learning methods, to uncover hidden insights in large volumes of data. For example, data mining can uncover hidden relations between entities, discover frequent sequences of events to assist prediction, and discover classification models which help group entities into useful categories.
Data mining techniques are used by security tools to perform tasks like anomaly detection in very large data sets, classification of incidents or network events, and prediction of future attacks based on historic data.
UEBA solutions are based on a concept called baselining. They build profiles that model standard behavior for users, hosts and devices (called entities) in an IT environment. Using primarily machine learning techniques, they identify activity that is anomalous, compared to the established baselines, and detect security incidents.
The primary advantage of UEBA over traditional security solutions is that it can detect unknown or elusive threats, such as zero-day attacks and insider threats. In addition, UEBA reduces the number of false positives because it adapts to and learns actual system behavior, rather than relying on predetermined rules which may not be relevant in the current context.
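Baselining can be sketched in a few lines. The example below summarizes one entity's past activity as a mean and standard deviation, then flags values that deviate strongly from it. The statistical model, the threshold of three standard deviations, the function names and the download volumes are all assumptions for illustration; real UEBA products build much richer profiles.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize an entity's past activity as (mean, standard deviation)."""
    return mean(history), stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the baseline mean."""
    avg, spread = baseline
    if spread == 0:
        # A perfectly constant history: any change at all is a deviation.
        return value != avg
    return abs(value - avg) / spread > threshold

# Hypothetical daily download volumes (MB) for one user over two weeks
history = [120, 135, 110, 128, 140, 118, 125, 132, 121, 129, 115, 138, 127, 122]
baseline = build_baseline(history)

print(is_anomalous(130, baseline))  # a typical day for this user
print(is_anomalous(900, baseline))  # far outside this user's own norm
```

Because the baseline is computed per entity, the same 900 MB download could be normal for a backup server and anomalous for this user, which is how baselining avoids one-size-fits-all rules.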
Random Forest is a powerful supervised learning algorithm that addresses the shortcomings of classic decision tree algorithms. A decision tree attempts to fit behavior to a hierarchical tree of known parameters.
For example, consider a tree in which customer satisfaction is distributed according to two variables, product color and customer age. A decision tree algorithm may wrongly conclude that a different color or a slightly different age is a good predictor of satisfaction. This is called overfitting: the model fits quirks and noise in its training data so closely that it makes poor predictions on new data.
Random Forest automatically builds a large number of decision trees, each trained on a random sample of the data and a random subset of the variables, so that each tree emphasizes different information about the population under analysis. It then obtains the result of each tree, and takes a majority vote of all the trees to obtain the final result. Training each tree on a random resample of the data and aggregating the results is a technique called bagging.
By combining all the trees together, Random Forest can cancel out the errors of each individual tree and dramatically improve model fitting.
In a security context: Random Forest can help analyze sequential event paths and improve predictions about new events, even when the underlying data is insufficient or improperly structured.
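The bagging and majority-vote idea can be sketched as follows. This is a deliberately simplified stand-in: each "tree" is a one-split stump rather than a full decision tree, and the function names, features and data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree: choose the threshold on `feature` with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train many stumps, each on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))             # random feature per stump
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest

def predict(forest, row):
    """Return the majority vote of all stumps for one row."""
    votes = Counter(left if row[feature] <= threshold else right
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (failed_logins_per_hour, mb_uploaded)
rows = [(1, 5), (0, 12), (2, 8), (3, 6), (40, 300), (55, 450), (35, 380), (48, 410)]
labels = ["good"] * 4 + ["bad"] * 4
forest = train_forest(rows, labels)

print(predict(forest, (0, 4)))
print(predict(forest, (60, 500)))
```

Any single stump may be skewed by the particular rows it happened to sample, but the vote across many stumps smooths out those individual errors. A real implementation would grow deeper trees and typically come from an established machine learning library.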
Dimension Reduction is the process of converting a data set with a high number of dimensions (or parameters describing the data) to a data set with fewer dimensions, without losing important information.
For example, if the data includes one dimension for the length of objects in centimeters and another dimension for inches, one of these dimensions is redundant and does not really add any information, as can be seen by their high correlation. Removing one of these dimensions will make the data easier to explain.
Generally speaking, a Dimension Reduction algorithm can determine which dimensions do not add relevant information and reduce a data set with n dimensions to k, where k<n.
Besides correlation analysis, other ways to remove redundant dimensions include analysis of missing values; variables with low variance across the data set; using decision trees to automatically pick the least important variables, and augmenting those trees with Random Forest; factor analysis; Backward Feature Elimination (BFE); and Principal Component Analysis (PCA).
In a security context: Security data typically consists of logs with a large number of data points about events in IT systems. Dimension Reduction can be used to remove the dimensions that are not necessary for answering the question at hand, helping security tools identify anomalies more accurately.
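The centimeters-versus-inches case can be sketched with a simple correlation filter. The function names, the 0.95 cutoff and the sample values are assumptions for illustration, and the sketch assumes no column is constant (a constant column would cause a division by zero).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with a column already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

# Hypothetical data set with n = 3 dimensions; two of them measure the same length
columns = {
    "length_cm": [10.0, 25.4, 50.8, 76.2, 100.0],
    "length_in": [3.94, 10.0, 20.0, 30.0, 39.37],
    "weight_kg": [5.0, 1.2, 9.7, 3.3, 4.1],
}
reduced = drop_redundant(columns)
print(list(reduced))  # k = 2 dimensions remain
```

This covers only the correlation-analysis approach. Methods such as PCA go further by combining dimensions into new ones rather than just dropping columns.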
Isolation Forest is a relatively new technique for detecting anomalies or outliers. It isolates data points by randomly selecting a feature of the data, then randomly selecting a split value between the maximum and minimum values of that feature. The process is repeated until the data point is separated from the rest of the data set.
The system repeats this process many times, building a random decision tree each time. An anomaly score is then computed for each data point, based on the following assumptions:
A threshold is defined, and data points which require relatively long paths through the trees to become fully isolated are determined to be “normal”, with the rest determined to be “abnormal”.
Isolation Forest is a technique that can be used by UEBA and other next-gen security tools to identify data points that are anomalous compared to the surrounding data.
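The isolation idea can be sketched by counting how many random splits it takes to separate one point from the rest. This is a simplified sketch, not the full algorithm: it skips the subsampling and score normalization used in practice, and the function names and data are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return depth                             # cannot split further on this feature
    split = rng.uniform(low, high)               # random value between min and max
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth of `point` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: (logins_per_day, mb_transferred), with one obvious outlier
normal = [(10, 100), (11, 120), (12, 105), (13, 130),
          (14, 110), (10.5, 125), (12.5, 115), (13.5, 135)]
outlier = (90, 900)
data = normal + [outlier]

print(average_depth(outlier, data))     # few splits needed: likely anomalous
print(average_depth(normal[2], data))   # many splits needed: likely normal
```

A point that sits far from the rest is usually cut off by the first or second random split, while a point inside a dense group needs many more. Comparing the average depth against a threshold gives the “normal” versus “abnormal” decision.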
Security Information and Event Management (SIEM) systems are a core component of large security organizations. They capture, organize and analyze log data and alerts from security tools across the organization. Traditionally, SIEM correlation rules were used to automatically identify and alert on security incidents.
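For comparison with the learning-based methods, a traditional correlation rule can be sketched as a fixed condition over log events. The rule below (five failed logins by one user within 60 seconds), the event format and the function name are hypothetical examples, not taken from any specific SIEM product.

```python
from collections import defaultdict, deque

def failed_login_alerts(events, max_failures=5, window_seconds=60):
    """Return (user, timestamp) each time a user reaches `max_failures` failed logins within the window."""
    recent = defaultdict(deque)   # user -> timestamps of recent failures
    alerts = []
    for timestamp, user, outcome in sorted(events):
        if outcome != "failure":
            continue
        times = recent[user]
        times.append(timestamp)
        # Drop failures that have fallen out of the time window.
        while times and timestamp - times[0] > window_seconds:
            times.popleft()
        if len(times) >= max_failures:
            alerts.append((user, timestamp))
    return alerts

# Hypothetical log events: (timestamp_seconds, user, outcome)
events = [
    (0, "alice", "failure"), (5, "bob", "failure"), (10, "alice", "failure"),
    (20, "alice", "failure"), (30, "alice", "failure"), (40, "alice", "failure"),
    (45, "alice", "success"), (500, "bob", "failure"),
]
alerts = failed_login_alerts(events)
print(alerts)
```

The rule catches exactly the pattern it was written for and nothing else, which is why rules like this are effective for known bad behavior but miss attacks that do not match a predefined condition.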
Because SIEMs provide context on users, devices and events in virtually all IT systems across the organization, they offer ripe ground for advanced analytics techniques. Today’s SIEMs either integrate with advanced analytics platforms like UEBA, or provide these capabilities as an integral part of their product.
Next-generation SIEMs can leverage machine learning, deep learning and UEBA to go beyond correlation rules and provide:
Exabeam is an example of a next-generation SIEM that comes with advanced analytics capabilities built in—including complex threat identification, automatic event timelines, dynamic peer grouping of similar users or entities, lateral movement detection and automatic detection of asset ownership.