In a previous blog, I provided an introductory overview of implementing data science for cybersecurity. Today, I’d like to dive deeper and build a threat detection system using data science. Whether you’re building this yourself or evaluating a security vendor, this blog can serve as a guide to understanding how you can identify unusual behaviors in your enterprise with data science.
Let’s start with our objective: use log data already in your environment to accurately identify threats within it. The standard rules-and-thresholds approach to this goal has been problematic, and it’s worth spending a couple of minutes discussing why that has been the case. Carnegie Mellon’s CERT stipulates the following rule if your SOC team is dealing with a “leaver,” or an employee about to leave your organization:
- if the mail is from the departing insider
- and the message was sent in the last 30 days
- and the recipient is not in the organization’s domain
- and the total bytes summed by day are more than a specified threshold
- then send an alert to the security operator
It’s impossible to select a single rule or threshold that will apply to every person in your organization. Let’s assume John Smith normally sends out 40 email attachments every day—sometimes a bit more, sometimes a bit less. Assuming this is normally distributed, we can visualize John’s normal email attachment behavior as a green bell curve that peaks at 40 (Figure 1).
The rule says that if John is behaving maliciously, he’ll send out more than 40 attachments. Let’s say that number is 60. This becomes a red curve that peaks at 60 (Figure 1), resulting in two curves: one indicating normal (green) and one indicating abnormal behavior (red).
Now the challenge of setting a threshold is clear: what happens if John sends out 50 email attachments? Is this normal or abnormal? Looking at the curves in Figure 1 it is clear you can’t really know for sure; the number 50 is covered by both green and red curves.
In reality, leaver data theft incidents are even more complicated than the illustration in Figure 1. A purposefully malicious insider will spend most of his or her time behaving normally and, every once in a while, sneak in malicious behavior—it's just natural human behavior. So, in reality, the abnormal, red curve is depressed (Figure 2).
Again, setting a simple threshold simply does not work. Say you set a threshold that triggers an alert when John sends out more than 60 attachments. The good news is that you’re catching some of that red abnormal behavior: everything to the right of the dashed line in Figure 2. The bad news is that you’re missing a lot of it: the low-and-slow abnormal activity to the left of the dashed line. On the other hand, if you set the threshold too low, you’ll just generate a lot of noise and false positives. The threshold approach simply leaves too much room for both false positives and missed threats.
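To make this trade-off concrete, here is a minimal sketch using only Python’s standard library. The two bell curves follow the John Smith example above; the standard deviation of 8 is an invented illustrative value.

```python
# Sketch of the threshold trade-off using two hypothetical normal curves:
# "normal" behavior centered at 40 attachments/day, "abnormal" at 60.
from statistics import NormalDist

normal = NormalDist(mu=40, sigma=8)    # John's usual behavior (green curve)
abnormal = NormalDist(mu=60, sigma=8)  # behavior during data theft (red curve)

threshold = 60
# False negatives: abnormal activity below the threshold goes unnoticed.
miss_rate = abnormal.cdf(threshold)
# False positives: normal activity above the threshold raises an alert.
false_alarm_rate = 1 - normal.cdf(threshold)

print(f"missed abnormal activity: {miss_rate:.0%}")
print(f"false alarms on normal days: {false_alarm_rate:.2%}")
```

With the threshold sitting right at the abnormal curve’s peak, fully half of the malicious activity is missed; lowering it buys detection at the cost of alerting on John’s ordinary busy days.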
So let’s look at a new approach by building some new curves. We can figure out what normal looks like for everyone just by counting. What you see below in Figure 3 are four different histograms built from email logs, showing counts of email attachments for four different individuals. Each histogram can be fitted to a curve, such as a gamma distribution. You can build a system that tracks and counts emails and email attachments for every individual and then fits each history to a curve. When a user sends out an email, you can compute how abnormal that activity is (indicated by the red curves). It wouldn’t be very abnormal for the individual in the top right-hand corner to send out three email attachments in a day. For the individual in the top left-hand corner, however, it would be quite unusual to send out three email attachments.
This is a great example of how we use volumetric data to determine abnormal sizes, and the same approach can be applied to detect other types of abnormal behavior. The idea is to just start counting and then use that information to create a curve that describes normal, which in turn allows you to quantify abnormality. Statisticians will recognize this technique as anomaly detection by building probability density functions and performing likelihood estimation.
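As a hedged sketch of that counting-and-fitting approach, assuming SciPy is available: the daily attachment counts below are invented for one hypothetical user, fitted to a gamma distribution, and then used to score how unusual a new day’s count would be.

```python
# Per-user anomaly detection by fitting a gamma distribution to counts.
import numpy as np
from scipy import stats

# Hypothetical history of daily attachment counts for one user.
daily_counts = np.array([0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 1, 2, 0, 1, 0])

# Fit a gamma distribution to this user's "normal" curve.
# (A small offset avoids zeros, which a gamma density cannot model directly.)
shape, loc, scale = stats.gamma.fit(daily_counts + 0.5, floc=0)

def abnormality(count):
    """Tail probability: how unlikely is a day with this many attachments?"""
    return stats.gamma.sf(count + 0.5, shape, loc=loc, scale=scale)

# Three attachments is routine for some users, rare for this one.
print(f"P(X >= 3) = {abnormality(3):.4f}")
```

The same fit, run per individual, yields a different curve for each person, which is exactly why a single organization-wide threshold cannot work.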
Being able to detect and quantify abnormal behaviors using the above approach is a good start. But we can do even more! To reduce noise, let’s compute a risk score that factors in multiple clues at once.
For example, an employee named Jenny is exhibiting multiple abnormal behaviors: sending an unusually high volume of email attachments for her, working at unusual hours for her, and accessing repositories she doesn’t normally access. Each of these behaviors has an associated quantity determined using the anomaly detection technique above; now let’s sum these “probabilities” together to create an aggregated risk score for Jenny.
In other words, we want a way to take in all the data that is active in your environment, run it through anomaly models, and then sum up those probabilities in some form. I encourage you to look to methods that are statistically relevant or explainable. In our case, we chose to look at probabilities together and come up with a score that reflects the risk associated with all these behaviors happening around the same time, to the same thing (or “entity”).
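One statistically explainable way to combine clues, sketched below with invented model names and probabilities, is to sum the negative log-probabilities (surprisals) of each behavior. This is an illustrative choice, not necessarily the exact production formula.

```python
# Combining per-model probabilities into one aggregated risk score.
import math

# Each anomaly model emits (model_name, probability of the observed behavior).
jennys_clues = [
    ("email_attachment_volume", 0.002),  # unusually large volume for Jenny
    ("login_hour", 0.01),                # working at 3 a.m.
    ("repository_access", 0.05),         # a repo she never touches
]

def aggregate_risk(clues):
    """Sum of surprisals: rarer behaviors contribute more to the score."""
    return sum(-math.log(p) for _, p in clues)

score = aggregate_risk(jennys_clues)
```

The appeal of this form is interpretability: several mildly unusual behaviors happening around the same time, to the same entity, add up to a high score, while one slightly odd event does not.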
Careful consideration went into the mathematical architecture we settled on, pictured above in Figure 4, to make it resilient to the conditions of a typical production environment, which can be messy and high-volume. We made a number of strategic choices, but let me highlight a few important ones.
Arbitrary data sources: It was important for us to build this framework independent of the dataset, so that it applies equally whether you have a single dataset like Active Directory or multiple datasets like Active Directory, DNS, NetFlow, and more. Pragmatically, you (the data scientist) don’t have much control over the data sources in the environment.
Feature extraction: We’ve done all of the data science up front and then built the feature extraction into the code. We know which independent variables are important, what to count, where, and when. This is critical for performance, because many of these data sources move very quickly and we want to be able to score in real-time.
Simple models: Most of our models are straightforward, one-dimensional models; this is not an exercise in deep learning. That’s important because these simple models are cost-effective, battle-tested, generalizable, and converge quickly. Univariate models require less computational power to run, and simple models tend to generalize more effectively than complex models that may overfit.
Multiple models: At Interset, we run hundreds of models at the same time. This is an advantage because different customers, tenants, business areas, and users all behave differently, so not all models are going to be equally effective at all times. By running a large population of models, weaker learners can be dominated by stronger learners in every unique environment. The most applicable models will converge quickly in those environments, and the weaker learners won’t impact the system negatively.
“Universal join”: We don’t want to “hard code” the set of data sources or models. How, then, can we meaningfully combine the outputs of different models from different data sources, where the semantics of each model can differ wildly? Forcing everything to emit a probability allows us to “universally join” the pieces together by looking at three qualities: probability, entity, and time. The input may come from different datasets with different semantics and input variables, but the output is always an entity (e.g., a user account or IP address), a period of time (event A happened in the same timeframe as event B), and a probability score. This enables more flexibility, which is critical because we want to be able to grow into our desired use cases and environments.
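The idea can be sketched as follows. The record layout and the grouping key are illustrative assumptions, not Interset’s actual schema: every model output is reduced to (entity, time bucket, probability), so outputs from wildly different datasets can be grouped together.

```python
# Sketch of the "universal join": group model outputs by (entity, time).
from collections import defaultdict
from typing import NamedTuple

class ModelOutput(NamedTuple):
    entity: str       # e.g. a user account or IP address
    hour_bucket: int  # coarse time window the event fell into
    probability: float

outputs = [
    ModelOutput("jenny", 14, 0.002),    # from an email model
    ModelOutput("jenny", 14, 0.010),    # from an Active Directory model
    ModelOutput("10.0.0.7", 14, 0.300), # from a NetFlow model
]

# Clues about the same thing at the same time land together,
# no matter which dataset or model produced them.
joined = defaultdict(list)
for out in outputs:
    joined[(out.entity, out.hour_bucket)].append(out.probability)
```

Because the join key never mentions the source dataset, new models and new data sources can be added without changing the aggregation logic downstream.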
Ease of orchestration and interpretation: We squash the output of our score. When this is deployed into a SOC environment, you want your SOC analysts (who aren’t data scientists) to understand the scores and use them most effectively. A simple number that’s between 0 and 100 is very effective. This is really critical in production because you want to be able to stabilize the output scores in a way that allows your SOC to independently build playbooks and implement SOAR systems that won’t be impacted by additional datasets and models.
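One possible squashing function is a logistic curve that maps an unbounded aggregate score onto 0–100. The midpoint and steepness values below are illustrative choices, not the production formula.

```python
# Squash an unbounded raw risk score into a stable 0-100 range for the SOC.
import math

def squash(raw_score, midpoint=10.0, steepness=0.5):
    """Logistic squash: bounded output however many models feed in."""
    return 100.0 / (1.0 + math.exp(-steepness * (raw_score - midpoint)))

# Raw scores grow as models are added; the squashed score stays in range.
for raw in (0.0, 10.0, 25.0):
    print(f"raw={raw:5.1f} -> {squash(raw):5.1f}")
```

The point of squashing is stability: adding a new dataset or model shifts raw scores, but the 0–100 scale the analysts and SOAR playbooks depend on stays put.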
Ultimately, you have to make your mathematical choices with the production environment in mind. Trust me, we made a few mistakes along the way and learned this the hard way. Theoretically, one approach may seem better; pragmatically, a different approach may be what actually gets it into production.
If you’d like to see how our approach translated into action, be sure to check out my session at the 2019 Micro Focus Cybersecurity Summit, Data Science 201: Probabilistic Approaches to Anomaly Detection.