Is this Chad's Personal E-mail Address? A Data Exfiltration Context

Data exfiltration is a common, multi-faceted security threat every enterprise faces. It’s defined as the unauthorized transfer of private data or intellectual property from a corporate computer to an external location.

One way such illegitimate data transfer occurs is through the e-mail channel. It is all too easy for a disgruntled or departing employee to e-mail confidential data to a personal account.

How can this scenario be addressed?

Several existing security products attempt to thwart illegitimate data transfers. For example, data loss prevention (DLP) solutions detect the presence of sensitive data in flight or at rest, usually by matching it against predefined signatures. Despite such offerings, detecting unauthorized data transfer remains an enterprise security challenge, particularly because of the volume of false positives these products generate.

Additional contextual information is needed to help calibrate data transfer alerts and reduce false positives. For example, is an external address to which data is being emailed a personal account? E-mail from chad@enterprise.com to chadx103@gmail.com should raise an analyst’s eyebrows and serve as evidence of risk.

But how can you match Chad with his external email accounts? Short of reading such data from known HR records (if they exist), one method is to mine it from historical e-mail records—a data science problem.

A data exploration exercise

A typical data science approach has two phases. The first is an exploration phase, where an examiner “feels” the data to gain intuition about it. The second phase is to engineer machine learning algorithms. At Exabeam, such exploration has enabled us to develop classification heuristics to determine if two e-mail addresses belong to the same person.

String-matching method

Most users adopt conventions in naming their personal email accounts. One observation is that a user’s “handle” is often based on their first and/or last name. To leverage this, naming variants can be evaluated for their similarity to an external address. For example, Chad might use chad_doe, chadd, cdoe, or chad.doe.

This approach is similar to the domain name permutation used to identify phishing sites. With an effectiveness rating of about 10%, it catches only the low-hanging fruit when linking users to external addresses, but it yields near-zero false positives.
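The naming-variant idea can be sketched in a few lines of Python. This is only an illustration, not Exabeam's implementation: the function names (`handle_variants`, `normalize_local_part`, `is_name_match`), the particular set of variants, and the choice to strip trailing digits are all assumptions made for the example.

```python
import re


def handle_variants(first, last):
    """Generate plausible personal-handle variants from a first and last name.

    The variant list is a hypothetical starting point, not an exhaustive one.
    """
    f, l = first.strip().lower(), last.strip().lower()
    if not f or not l:
        return set()
    variants = {
        f + l,      # chaddoe
        l + f,      # doechad
        f + l[0],   # chadd
        f[0] + l,   # cdoe
    }
    for sep in (".", "_", "-"):
        variants.add(f + sep + l)   # chad.doe, chad_doe, chad-doe
        variants.add(l + sep + f)   # doe.chad, doe_chad, doe-chad
    return variants


def normalize_local_part(address):
    """Lower-case the part before '@' and drop trailing digits (assumption)."""
    local = address.split("@", 1)[0].lower()
    return re.sub(r"\d+$", "", local)


def is_name_match(first, last, external_address):
    """True if the external handle exactly equals one of the name variants."""
    return normalize_local_part(external_address) in handle_variants(first, last)


# Handles taken from the examples in the text.
for addr in ("chad_doe@gmail.com", "cdoe@gmail.com", "1koodood16@gmail.com"):
    print(addr, is_name_match("Chad", "Doe", addr))
```

Requiring an exact match against a variant, rather than a fuzzy similarity score, is one way to keep false positives low at the cost of coverage. A handle such as chadx103 would not match under this sketch and would need a looser similarity measure.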

Behavior-based method

But not all personal e-mail addresses use actual names as their root; many bear no relation to the owner's name. Here, historical e-mail records can be used to determine whether there are sufficient behavioral fingerprints to link a corporate sender to an external receiver (e.g., chad@enterprise.com to 1koodood16@gmail.com). Insights gleaned from data exploration in real environments have permitted Exabeam to develop such associative heuristics.

These are based on a variety of factors, including:

  • Frequency of communication between corporate and external addresses
  • Direction of communication between the addresses
  • Textual content within e-mail Subject: fields

Based on such factors, various metrics are used to classify whether a pair of work/personal addresses belongs to the same person. One observed metric is whether the Subject: line is empty (supposing that one doesn’t bother completing the field when e-mailing oneself). Another is the ratio of messages between sender and receiver that are marked as forwarded versus replied.
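As a rough illustration of how such metrics might be computed from historical e-mail records, consider the Python sketch below. Everything specific in it is an assumption: the `Message` record, the use of "Fwd:"/"FW:" and "Re:" subject prefixes to tell forwards from replies, the `+ 1` smoothing in the ratio, and the threshold values in `looks_like_same_person`. None of these are the actual rules or thresholds used by Exabeam.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    recipient: str
    subject: str


def pair_metrics(messages, work, external):
    """Compute simple behavioral metrics for one work/external address pair."""
    outbound = [m for m in messages if m.sender == work and m.recipient == external]
    inbound = [m for m in messages if m.sender == external and m.recipient == work]
    both = outbound + inbound
    total = len(both)
    if total == 0:
        return None  # no history between the two addresses

    subjects = [m.subject.strip().lower() for m in both]
    empty = sum(1 for s in subjects if not s)
    # Assumption: forwards and replies are recognized by their subject prefix.
    forwarded = sum(1 for s in subjects if s.startswith(("fw:", "fwd:")))
    replied = sum(1 for s in subjects if s.startswith("re:"))

    return {
        "total": total,                              # frequency
        "outbound_fraction": len(outbound) / total,  # direction
        "empty_subject_fraction": empty / total,
        # +1 in the denominator avoids division by zero (a smoothing choice).
        "forward_to_reply_ratio": forwarded / (replied + 1),
    }


def looks_like_same_person(metrics, min_total=5, min_empty=0.3, min_fwd_ratio=2.0):
    """Hypothetical rule: enough traffic plus at least one self-mail signal."""
    if metrics is None or metrics["total"] < min_total:
        return False
    return (
        metrics["empty_subject_fraction"] >= min_empty
        or metrics["forward_to_reply_ratio"] >= min_fwd_ratio
    )
```

Lowering `min_total`, `min_empty`, or `min_fwd_ratio` makes the rule claim more address pairs, which trades more false positives for higher coverage.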

Such non-trivial data exploration enables Exabeam to develop useful metrics. Firm benchmark numbers are hard to come by, as there is no ground truth. Yet from the metrics we’ve constructed heuristics-based rules.

Consider the following: in an environment where it is not known how many users have sent e-mail to their own personal accounts while at work, this method has matched up to 15% of the user base with personal e-mail addresses, with near-zero false positives. Other enterprises may see different results depending on the security practices in place. (Note: the thresholds of the heuristic rules can be relaxed to claim more e-mail addresses if a use case can tolerate some false positives.)

In anomaly detection of malicious activity, context is key to mitigating false positives. For detecting data exfiltration via the e-mail channel, DLP alerts can be prioritized by knowing whether a personal e-mail address is involved. This is one way data analytics for context estimation can be used in Exabeam’s Advanced Analytics security offering.

Topics: data science
2017