Editor’s note: Vasudha is a presenter for the upcoming ODSC East 2019 this April 30-May 3 in Boston! Check out her talk “Detecting Cybersecurity Incidents with Machine Learning” there.
Data breaches are a growing threat to both companies and consumers. For individuals, breaches can result in the release of private consumer information such as credit card numbers and health records, as well as personal identifiers like home addresses and social security numbers. Many individuals are unfortunately familiar with the lasting consequences to their credit and privacy.
For companies, a data breach can mean the public release of sensitive financial information, proprietary trade secrets, and customer data, resulting in hefty fines and damage to the brand. Many companies are therefore interested in detecting and preventing breaches as they happen. To do that, we need ways to identify the unauthorized transfer of data from a server or a computer, known as data exfiltration.
A data exfiltration incident begins when an attacker gains entry to a company’s network, for example through phishing emails or unpatched vulnerabilities. Once inside, the malicious actor searches for data and begins aggregating it. Most companies have alarms in place that flag suspicious activity. To avoid setting off those alarms, an attacker moves through this stage slowly, and can spend weeks or months quietly collecting information. When the data collection is finished, the malicious actor transfers the data off the network to an external destination. Mature attackers often wipe the traces of their activity behind them, so that companies may not realize that a breach has occurred even after the fact.
What can we do to stay safe?
What would it take to detect data exfiltration incidents before the data leaves the network? A direct approach would be to look for unusual aggregation of data. One example is a computer accessing data out of line with its function, such as an engineering account collecting earnings figures. Another sign would be a folder on a particular machine that gets steadily larger over a period of time. Catching these kinds of flags would help a network admin take preventative action, but the relevant data is rarely collected.
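To make the "steadily growing folder" signal concrete, here is a minimal sketch. It assumes you already have periodic size measurements for a folder, which, as noted above, is often not the case. The function name `grows_steadily` and both thresholds are hypothetical choices for illustration, not values from our study.

```python
def grows_steadily(sizes, min_growth_ratio=2.0, min_rising_fraction=0.8):
    """Return True if a folder's size rises on most samples and ends much larger.

    sizes: folder sizes (e.g. in MB) sampled at regular intervals, oldest first.
    min_growth_ratio: hypothetical threshold for final size / initial size.
    min_rising_fraction: hypothetical share of intervals that must show growth.
    """
    if len(sizes) < 2 or sizes[0] <= 0:
        return False  # not enough data to judge a trend
    steps = list(zip(sizes, sizes[1:]))
    rising = sum(1 for before, after in steps if after > before)
    return (rising / len(steps) >= min_rising_fraction
            and sizes[-1] / sizes[0] >= min_growth_ratio)


# Slow, persistent growth is flagged; a single jump or normal churn is not.
print(grows_steadily([10, 12, 15, 19, 24, 30, 37]))
print(grows_steadily([10, 10, 10, 500]))
print(grows_steadily([10, 11, 10, 11, 10, 11, 10]))
```

Requiring growth across most intervals, rather than a large total alone, is what separates quiet aggregation from a one-off bulk copy, which would more likely trip a conventional alarm.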
In the meantime, companies are increasingly interested in detecting incidents at the data transfer stage. This means detecting an attack after the fact rather than preventing it outright, but it would allow an affected company to start remediation. A network typically has an enormous volume of outgoing traffic, the vast majority of which is legitimate and not of concern. Picking the malicious transfers out of this flood, while avoiding false positives, is the central challenge in building an exfiltration detector.
Enter machine learning, a useful tool for this situation. Anomaly detection can automate the search for outliers and separate out the subset of traffic that warrants further investigation. There are several ways to increase the fidelity of the resulting alerts. We studied how to use statistical properties of the traffic itself to better define typical behavior on the network, as well as how to use contextual information about recurring behavior that doesn’t need flagging. For a full discussion of the methods and results, check out my talk at ODSC East 2019!
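As a rough illustration of the two ideas above (a statistical baseline for typical behavior, plus context about recurring behavior that doesn’t need flagging), here is a minimal sketch. It is not the method from the talk: it swaps in a simple robust z-score based on the median and median absolute deviation (MAD) of each host's own history. The function name, the host names, the allowlist, the minimum history length, and the 3.5 threshold are all assumptions made for the example.

```python
from statistics import median


def flag_unusual_transfers(history, today, allowlist=frozenset(), threshold=3.5):
    """Return (host, score) pairs whose outbound volume today is far above baseline.

    history: dict of host -> list of past daily outbound volumes (e.g. in MB)
    today: dict of host -> today's outbound volume
    allowlist: hosts with known recurring large transfers (assumed context)
    threshold: hypothetical cutoff on the robust z-score
    """
    flagged = []
    for host, sent in today.items():
        if host in allowlist:
            continue  # known recurring behavior that doesn't need flagging
        past = history.get(host, [])
        if len(past) < 5:
            continue  # too little history to define "typical" (assumed minimum)
        med = median(past)
        mad = median([abs(x - med) for x in past])
        # Floor the scale so a perfectly constant baseline cannot divide by zero.
        scale = max(mad, 0.01 * med, 1.0)
        score = 0.6745 * (sent - med) / scale  # 0.6745 scales MAD to ~1 std dev
        if score > threshold:
            flagged.append((host, round(score, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical daily outbound volumes per host.
history = {
    "eng-laptop": [100, 110, 95, 105, 100, 98, 102],
    "backup-server": [5000, 5100, 4900, 5050, 5000, 5000, 4950],
    "hr-desktop": [50, 55, 45, 52, 48, 50, 51],
}
today = {"eng-laptop": 900, "backup-server": 50000, "hr-desktop": 53}

print(flag_unusual_transfers(history, today, allowlist={"backup-server"}))
```

The median and MAD are used instead of the mean and standard deviation so that a few large past transfers do not inflate the baseline and hide the next one. The allowlist is the crudest possible form of contextual information; in practice that context would likely need to be richer than a list of host names.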