Six Design Considerations for Your Security Data Lake
Large numbers of enterprises, such as retail conglomerates, consumer banks, airlines, and insurance companies are joining the rush to set up data lakes for handling petabytes of security data and logs. But many executives and architects assume that once they finish setting up log sources, applying parsers, and arming their SOC analysts with reports, their data lake will deliver the goods. Alas, if only that were true!
What usually ends up happening is years of frustration, millions of dollars spent, and multiple security threats left undetected or detected too late. So if you’re one of those considering investing in a new security data lake or replacing your existing one, take a pause. Be mindful of how you design yours.
To help with this exercise, here are six design considerations for your security data lake based on our interactions with successful Exabeam customers.
1. Design with the end in mind.
Clearly define your business goals, constraints, and use cases before designing your data lake. Many businesses find it easiest to carry these over from their legacy data lake; this is possibly the biggest mistake you can make, because designing around yesterday’s requirements means missing out on the new opportunities a modern platform offers. Giving informed thought to how your new data lake will actually be used is a prerequisite for its design.
What’s more, your business goals and constraints will also influence the fundamental architecture decisions you’ll have to make. Here are some questions for you to consider:
- Do you have strategic goals related to migrating to the cloud?
- Do you foresee growth in certain types of business applications and infrastructure resources over the next 3–5 years?
- Do you have limited bandwidth and data source connectivity challenges, or do you cluster infrastructure in some locations?
- Do you need geographical isolation of logs for compliance with regulations?
- Have you realized significant cost savings through AWS Reserved Instances or long-term data center leases in some locations?
Make certain to capture all such details before you start designing. They will help you answer questions regarding cluster sizing and decentralization.
Determine the core value of your security data lake. Is it:
- Data acquisition and aggregation
- Data curation and enrichment
- Supporting insight generation
Or will your data lake be a combination of all three? This will help you identify the integrations and advanced features you really need. Otherwise you could be forced to painstakingly build and maintain custom capabilities.
2. Listen to your users.
Perhaps you’re wondering why this is being called out so high on the list. Here’s why.
What comes naturally to the initial business and technical evaluators of a data lake vendor doesn’t necessarily carry over to the real-world analysts who are called upon to investigate security threats under a mountain of stress.
Your users (SOC analysts, admins, auditors)—those who’ve been using your existing system—will also be the users of the new data lake. Working on the front line, they have the insights to tell you what works and what doesn’t. Learn what a day in their life looks like; this will enable you to place a premium on user experience and performance.
Here are some additional considerations:
- How many team members comprise your SOC?
- How much time do they spend on daily tasks and repetitive actions that could be better handled through automation?
- What level of automation do they expect?
- What is their level of comfort in learning a new query language?
When we created Exabeam Data Lake, we fully understood these challenges. Choosing to use Elasticsearch as our product core, we understood that the average SOC analyst has little time to learn how Elastic or Lucene works. So we went to great lengths to fully optimize it for the SOC team experience. Today it provides a complete point-and-click user interface, coupled with a significant collection of out-of-the-box compliance reporting and dashboard content.
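To see why query-language comfort matters, here is a minimal sketch of the kind of raw Elasticsearch query body an analyst would otherwise assemble by hand. The index fields (`event.outcome`, `user.name`, and so on) are illustrative placeholders, not Exabeam’s actual schema; a point-and-click interface generates structures like this behind the scenes.

```python
# Sketch of a raw Elasticsearch-style query body for "failed logons by one
# user in the last 24 hours". Field names here are hypothetical, chosen only
# to show what hand-written query construction looks like.

def failed_logon_query(user: str) -> dict:
    """Build a bool-filter query dict for failed logons by `user`, last 24h."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.outcome": "failure"}},
                    {"term": {"event.action": "logon"}},
                    {"term": {"user.name": user}},
                    {"range": {"@timestamp": {"gte": "now-24h"}}},
                ]
            }
        },
        "size": 100,
    }

query = failed_logon_query("jdoe")
```

Multiply this by every ad hoc investigation question, and the cost of asking stressed analysts to learn a new query language becomes clear.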
3. Identify all your data sources and retention needs.
Four dimensions of logs—volume, velocity, variety, and retention—require your focus when designing a robust logging infrastructure.
- Don’t make the mistake of carrying over the Events Per Second or Daily Log Volume from your existing system. You might hear from an internal team how your current system ingests only a portion of your logs on a selective basis because “Vendor X bills us incrementally by log volume.” But the Exabeam Data Lake pricing model eliminates that concern, so identify all of your disparate data sources—firewalls, network devices, Windows devices, email, applications and more.
- Identify those data sources that generate a variable volume of logs and are prone to spikes due to traffic volume, seasonality, and other reasons.
- Understand different log formats and the proportion of structured vs unstructured data. This will help you plan and prepare parsing requirements before you begin deployment. With the Exabeam Data Lake, you don’t need to think about writing regular expressions to parse your content—we provide hundreds of out-of-the-box parsers to help you make sense of your security logs. But in case you need a new one, we’ll quickly build it for you—it’s that easy.
- Tradeoffs exist between search performance and long-term storage costs, so know your log retention requirements upfront. Under “searchable retention,” logs stay indexed and quickly searchable for six months; that window is well below the one-to-seven year “long-term compliance/archival retention” option used by insurance companies, banks, and other financial institutions.
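To make the parsing point concrete, here is a minimal sketch of a hand-written parser for a single, made-up firewall log format. The line format and field names are purely illustrative; the takeaway is that every distinct source format needs its own regex like this, which is exactly the maintenance burden a library of out-of-the-box parsers removes.

```python
import re
from typing import Optional

# Hypothetical firewall log line; real formats vary by vendor, which is why
# each distinct data source needs its own parser.
LINE = "2024-05-01T12:00:00Z DENY src=10.0.0.5 dst=203.0.113.9 dport=443 proto=tcp"

# One hand-written regex per log format: tedious to build and maintain at scale.
FIREWALL_RE = re.compile(
    r"(?P<ts>\S+)\s+(?P<action>ALLOW|DENY)\s+"
    r"src=(?P<src>\S+)\s+dst=(?P<dst>\S+)\s+"
    r"dport=(?P<dport>\d+)\s+proto=(?P<proto>\w+)"
)

def parse_firewall(line: str) -> Optional[dict]:
    """Turn an unstructured log line into structured fields; None on mismatch."""
    m = FIREWALL_RE.match(line)
    return m.groupdict() if m else None

event = parse_firewall(LINE)
```

Surveying your structured vs. unstructured mix up front tells you how many such formats your deployment must handle on day one.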
4. Regarding disaster recovery, high availability, and fault tolerance, know what you need and why.
With the rise of globally distributed systems, businesses are quick to list all three as must-have requirements for their data lakes. But often they fall prey to specifying “the works” without really understanding what their business requires; when quizzed about these capabilities, many IT and DevOps teams admit they don’t have an answer. Teams have invested millions in building a fully operational mirror site when all they needed was to meet compliance requirements for keeping a set of log copies in a secondary location. You don’t want to make the same mistake.
Disaster recovery typically refers to a set of policies and procedures that restore operations and mission-critical system availability. Banks and other regulated industries require detailed disaster recovery mechanisms. But each business must determine the level of automation and sophistication it requires for itself. Some teams operate mirror sites with full replication of logs, context data, and user-defined data; this can become very complex and costly.
High availability means your system offers a high level of operational performance for a given period of time. You want to be certain there is no single point of failure (SPOF) across your infrastructure and data pipelines. But businesses have different SLAs for availability. What makes sense for a bank may not necessarily apply to a SaaS business. So don’t fall into the “more 9s than we require” trap.
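The “more 9s than we require” trap is easy to quantify: each additional nine shrinks the allowed downtime tenfold, and typically raises cost accordingly. The figures below are plain arithmetic, not any vendor’s SLA numbers.

```python
# Translate an availability SLA ("number of nines") into allowed downtime
# per year, so the cost of each extra nine can be weighed against real need.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

for label, sla in [("two nines", 0.99), ("three nines", 0.999),
                   ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label} ({sla:.5f}): {downtime_minutes_per_year(sla):,.1f} min/year")
```

Three nines allows roughly 8.8 hours of downtime a year; five nines, about five minutes. Deciding which of those your business actually needs is the question to answer before writing the SLA into your requirements.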
Fault tolerance means your system can continue to operate normally if a portion of it fails. Exabeam Data Lake offers a high degree of availability and fault tolerance throughout its pipeline. Its Kafka component ensures that data ingestion remains uninterrupted—even if the rest of the pipeline is down or being serviced. Similarly, data replication across its Elasticsearch nodes ensures that, even if one becomes temporarily unreachable, its data remains integral and can be searched.
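The role a durable buffer like Kafka plays in that pipeline can be sketched with a plain in-memory queue. This is a toy stand-in, not Kafka itself: it only illustrates how decoupling ingestion from indexing keeps data flowing in while the downstream store is down, with the backlog draining once it recovers.

```python
from collections import deque

# Toy stand-in for a durable ingestion buffer (the role Kafka plays in a
# log pipeline). Not Kafka: it just shows ingestion/indexing decoupling.

class BufferedPipeline:
    def __init__(self) -> None:
        self.buffer = deque()   # events accepted but not yet indexed
        self.index = []         # events the downstream store has indexed
        self.indexer_up = True  # health of the downstream indexing tier

    def ingest(self, event: str) -> None:
        # Ingestion always succeeds, regardless of downstream health.
        self.buffer.append(event)
        self.drain()

    def drain(self) -> None:
        # Only move events downstream while the indexer is reachable.
        while self.indexer_up and self.buffer:
            self.index.append(self.buffer.popleft())

pipe = BufferedPipeline()
pipe.ingest("evt-1")
pipe.indexer_up = False   # downstream outage or maintenance window
pipe.ingest("evt-2")      # still accepted: buffered, not lost
pipe.indexer_up = True
pipe.drain()              # backlog catches up after recovery
```

Replication across Elasticsearch nodes addresses the other half of the problem: once events are indexed, losing a single node does not make their data unreachable.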
5. Don’t put the cart before the horse.
No doubt you’ve at least heard of Apache Spark, Amazon Kinesis, Elasticsearch, Kafka, Hadoop, and/or Cloudera from your engineers or IT. These are all very powerful technologies that leading innovative companies have used to solve their infrastructure and business problems. But that doesn’t mean they provide an out-of-the-box solution for all your security data lake needs. So the next time someone on your team informs you about this terrific new technology or the next big thing, ask them why it matters to your organization and about which specific business problems it solves.
6. Think long-term and design for growth.
As your business grows, your production environments correspondingly increase in scale, variety, and complexity. Your log volume will also expand, as you’ll want to add logs from newer applications and infrastructure that weren’t online when you initially designed your data lake. In selecting an architecture that can grow with your business, you want a system that lets you scale in a predictable and cost-efficient manner. Using Elasticsearch to power its operations, Exabeam Data Lake offers you that flexibility, scaling easily without forcing you to re-architect your infrastructure.
If you would like to share feedback with our Data Lake product management team, feel free to reach out through your technical account manager or leave a message on Exabeam Community.