Recap of Machine Learning For Network-Based IDS Study

March 18, 2016

Misuse Detection vs. Anomaly Detection

An IDS generally has to deal with large volumes of noisy network traffic, uneven data distribution, the difficulty of distinguishing normal from abnormal network behavior, and the need to adapt to a constantly changing environment. Bad packets generated by software bugs, corrupt DNS data, and local packets that escaped can create a significantly high false-alarm rate.

Understanding how a NIDS recognizes network activity helps identify the core weaknesses of the system. Strategies for classifying network behavior typically fall into two categories: misuse detection and anomaly detection.

Misuse detection techniques examine both network and system activity for known instances of misuse using signature-matching algorithms. This technique is effective at detecting attacks that are already known, but it cannot recognize an attack for which no signature exists.

The IDS may generate alerts, but reacting to every alert wastes time and resources and can destabilize the system. To overcome this problem, the IDS should not start its elimination procedure as soon as the first symptom is detected; instead, it should collect alerts and decide based on how they correlate. Typically, misuse detection rules act like a firewall that looks out for:

Use of one of the many known SMTP/SSH exploits
Port scans
User commands that indicate abuse
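
The idea of collecting alerts and deciding on their correlation, rather than reacting to the first symptom, can be sketched for the port scan case. This is only an illustration, not how any particular IDS works: the event format (timestamp, source IP, destination port), the function name detect_port_scans, and both threshold values are assumptions made for the example.

```python
from collections import defaultdict

def detect_port_scans(events, window_seconds=10.0, min_distinct_ports=5):
    """Return the source IPs that touched at least `min_distinct_ports`
    distinct ports within any sliding window of `window_seconds`.

    `events` is an iterable of (timestamp, src_ip, dst_port) tuples; this
    format is hypothetical and chosen only for the example.
    """
    recent = defaultdict(list)  # src_ip -> [(timestamp, port), ...] still inside the window
    flagged = set()
    for ts, src, port in sorted(events):
        # Drop observations that have fallen out of the window.
        recent[src] = [(t, p) for t, p in recent[src] if ts - t <= window_seconds]
        recent[src].append((ts, port))
        # A single connection is never enough; only a correlated burst is flagged.
        if len({p for _, p in recent[src]}) >= min_distinct_ports:
            flagged.add(src)
    return flagged

events = [(float(i), "10.0.0.9", 20 + i) for i in range(6)]  # one host probing six ports
events += [(0.0, "10.0.0.5", 443), (30.0, "10.0.0.5", 443)]  # ordinary repeat traffic
print(detect_port_scans(events))
```

The rule only fires once several related observations accumulate for the same source, which is the "be patient and correlate" behavior described above. A real signature set would be far larger and tuned to the environment.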

Anomaly detection, on the other hand, proceeds by comparing every instance to what is “normal” for the network. Such a system needs a profile of the network, which can be a problem because it takes time and resources to train an anomaly detection sensor and build a profile that reflects normal network usage. Deviations that such a sensor might flag include:

Excessive bandwidth usage
Excessive system calls from a process
More than one entity using a service
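
As a rough sketch of the "profile, then compare" idea, the snippet below learns a baseline for one metric (bandwidth) and flags values that stray too far from it. The sample numbers, the function names build_profile and is_anomalous, and the three-standard-deviation cutoff are all assumptions for illustration; a real sensor would profile many metrics and likely use a more robust model.

```python
import statistics

def build_profile(training_values):
    """Summarize 'normal' behavior as a mean and a sample standard deviation."""
    return statistics.mean(training_values), statistics.stdev(training_values)

def is_anomalous(value, profile, max_z=3.0):
    """Flag a value that lies more than `max_z` standard deviations from the mean."""
    mean, stdev = profile
    if stdev == 0:
        # A perfectly flat baseline: anything different is a deviation.
        return value != mean
    return abs(value - mean) / stdev > max_z

# Hypothetical bandwidth samples (MB per minute) from a training period.
normal_bandwidth = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54]
profile = build_profile(normal_bandwidth)

print(is_anomalous(51, profile))   # close to the learned baseline
print(is_anomalous(400, profile))  # far above the learned baseline
```

The cost mentioned above shows up in build_profile: the baseline is only as good as the training period, so traffic collected during an unusual week would produce a misleading profile.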

Benefits of Machine Learning to NIDS

The challenge is to efficiently capture and classify the various behaviors in a computer network, since they cannot be categorized under a single umbrella. A few modern machine learning approaches can be used to find malicious activity within the network:

Supervised/Unsupervised Learning

One popular strategy is to monitor a network’s activity for anomalies, or anything that deviates from normal network behavior. Anomaly detection creates models of normal behavior for networks, systems, applications, end users, and other devices, and then looks for deviations from those patterns of behavior much faster than manual review would allow.

In practice, this means putting a variety of supervised (classification) and unsupervised (clustering) machine learning algorithms in place to detect anomalous patterns of user behavior, as gleaned from sources such as server logs, Active Directory entries, and virtual private network (VPN) logs.

Unsupervised machine learning does this processing work without human intervention in order to identify potential security issues. It can process millions of data points each minute and automatically identify anomalous behavior. It then correlates anomalies across multiple data sources to determine their potential impact.
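
To make the unsupervised case concrete, here is a minimal sketch that needs no labels: each session is scored by how far it sits from its nearest neighbors, and sessions that are unusually isolated are reported. This is one simple distance-based technique chosen for illustration, not the specific algorithm of any product; the feature pairs, the function names, and the `factor` cutoff are assumptions.

```python
import math
import statistics

def knn_outlier_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

def find_outliers(points, k=3, factor=3.0):
    """Return indices whose score exceeds `factor` times the median score."""
    scores = knn_outlier_scores(points, k)
    cutoff = factor * statistics.median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]

# Hypothetical per-session features: (requests per minute, KB transferred).
sessions = [(10, 120), (12, 130), (11, 125), (9, 118), (13, 128), (10, 122), (95, 900)]
print(find_outliers(sessions))  # index of the isolated session
```

The pairwise comparison is quadratic in the number of points, so this form would not scale to millions of data points per minute; it only shows the principle. Features on very different scales should also be normalized first, otherwise the largest-valued feature dominates the distance.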

Artificial Neural Networks (ANN)

A neural network consists of a collection of highly interconnected processing units called neurons, loosely modeled on the human brain. ANNs can learn by example, generalize from limited, noisy, and incomplete data, and make better predictions the more they “learn” from the network. They have been successfully employed in a broad spectrum of data-intensive applications. If ANNs are fed raw network traffic data, they can learn over time what counts as an anomaly or outlier on the network and detect such events faster.
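
"Learning by example" can be illustrated with the smallest possible building block: a single sigmoid neuron trained by gradient descent. This is a sketch under simplifying assumptions, not a full ANN and not a recommended architecture. The two features, their values, and the labels are invented, and the inputs are already scaled to the range 0 to 1 rather than being raw traffic.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_neuron(samples, labels, epochs=2000, learning_rate=0.5):
    """Train one sigmoid neuron with stochastic gradient descent on log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            output = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = output - label  # gradient of log loss w.r.t. the pre-activation
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

def predict(features, weights, bias):
    """Return the neuron's estimate (0..1) that the input is an attack."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

# Hypothetical, already-scaled features: (connection rate, failed-login ratio).
samples = [(0.1, 0.0), (0.2, 0.1), (0.15, 0.05), (0.9, 0.8), (0.8, 0.9), (0.95, 0.7)]
labels = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = attack

weights, bias = train_neuron(samples, labels)
print(predict((0.12, 0.02), weights, bias))  # unseen input resembling normal traffic
print(predict((0.85, 0.85), weights, bias))  # unseen input resembling attack traffic
```

Neither test input appears in the training data, so the two predictions show generalization from examples in miniature. A practical network stacks many such units in layers and needs far more data than six rows.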

Takeaways

Machine learning has potential, but it demands a lot of cognitive effort. Because machine learning has been a hot topic throughout the tech industry, there is a lot of vagueness about how effective these solutions actually are, and the same applies to NIDS. Network traffic is composed of many individual sessions, which adds up to an enormous amount of variety and unpredictable behavior. Determining what qualifies as “normal” is hard for a system to do, which could result in higher false positive rates.

Human activity, application behavior, and network traffic are all heavily autocorrelated, making it hard to understand what activity is normal. This gives malicious actors plenty of opportunity to “hide in plain sight” and even to train the system to treat malicious activity as normal. This is the disconnect between what the system reports and what the operator wants, and it is the root cause of too many false positives.

The best solution is to train the model properly on a well-understood dataset and use machine learning models for fraud detection and analysis. A NIDS cannot detect all malicious activity with a signature-based system alone. Jumping straight into classification or regression is not recommended; first take time to look at and analyze the data, and understand the features, their relation to each other, and their relation to the output.

This exploratory step gives a lot of insight into the problem. It can also provide possible answers to questions that arise from odd behavior of the learning model. The learning model can then be trained as an effective predictor of what lies in network traffic.
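
One simple way to start looking at the data before modeling is to check how strongly each feature moves with the output. The sketch below computes the Pearson correlation between each feature and the label on a tiny made-up table. The column names and every value are hypothetical, and correlation only captures linear relationships, so this is a first look rather than a complete analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labeled sessions; column names and values are made up.
rows = [
    {"duration": 12, "bytes_out": 300, "failed_logins": 0, "is_attack": 0},
    {"duration": 15, "bytes_out": 280, "failed_logins": 0, "is_attack": 0},
    {"duration": 11, "bytes_out": 5200, "failed_logins": 6, "is_attack": 1},
    {"duration": 14, "bytes_out": 310, "failed_logins": 1, "is_attack": 0},
    {"duration": 13, "bytes_out": 4800, "failed_logins": 5, "is_attack": 1},
    {"duration": 12, "bytes_out": 290, "failed_logins": 0, "is_attack": 0},
]

labels = [r["is_attack"] for r in rows]
correlations = {
    name: pearson([r[name] for r in rows], labels)
    for name in ("duration", "bytes_out", "failed_logins")
}
# List features from the strongest to the weakest linear relationship with the label.
for name, value in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} {value:+.2f}")
```

A feature with a weak correlation is not necessarily useless, and two strongly correlated features may be redundant with each other, which is why the same check is worth running between features as well as against the output.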