A considerable number of resources cover machine learning for cybersecurity and its ability to protect us from cyberattacks. Still, it's important to scrutinize how Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) can actually help in cybersecurity right now, and what the hype is all about.
First of all, I have to disappoint you. Unfortunately, machine learning will never be a silver bullet for cybersecurity, unlike in image recognition or natural language processing, two areas where machine learning is thriving. There will always be a person trying to find weaknesses in systems or ML algorithms and to bypass security mechanisms. What's worse, hackers can now use machine learning to carry out their nefarious endeavors.
Fortunately, machine learning can aid in solving the most common tasks, including regression, prediction, and classification. In an era of extremely large amounts of data and a cybersecurity talent shortage, ML seems to be the only solution.
This article is an introduction written to give a practical technical understanding of the current advances and future directions of ML research applied to cybersecurity.
Machine Learning Terminology
Stop calling everything 'AI' - learn the terms.
AI (Artificial Intelligence) - a broad concept. A science of making things smart or, in other words, of having machines perform human tasks (e.g., visual recognition, NLP, etc.). The main point is that AI is not exactly machine learning or smart things. It can be a classic program installed in your robot cleaner, such as edge detection. Roughly speaking, AI is anything that somehow carries out human tasks.
ML (Machine Learning) - an approach (just one of many) to AI that uses a system capable of learning from experience. It is intended not only for AI goals (e.g., copying human behavior); it can also reduce the effort and time spent on both simple and difficult tasks, such as stock price prediction. In other words, ML is a system that recognizes patterns from examples rather than from explicitly programmed rules. If your system learns constantly, makes decisions based on data rather than hard-coded algorithms, and changes its behavior, it's machine learning.
DL (Deep Learning) - a set of techniques for implementing machine learning that recognize patterns of patterns, as in image recognition. Such systems first identify object edges, then a structure, then an object type, and finally the object itself. The point is that deep learning is not exactly the same as deep neural networks. Other algorithms have been improved to learn patterns of patterns, such as Deep Q-Learning in reinforcement learning tasks.
The definitions show that the cybersecurity field refers mostly to machine learning (not to AI), and a large part of its tasks are not human-related.
Machine learning means solving certain tasks with the use of an approach and particular methods based on data you have.
Most of the tasks are subclasses of the most common ones, which are described below.
Regression (or prediction) - a task of predicting the next value based on the previous values.
Classification - a task of separating things into different categories.
Clustering - similar to classification but the classes are unknown, grouping things by their similarity.
Association rule learning (or recommendation) - a task of recommending something based on the previous experience.
Dimensionality reduction - or generalization, a task of searching common and most important features in multiple examples.
Generative models - a task of creating something based on the previous knowledge of the distribution.
There are different approaches in addition to these tasks. You can use only one approach for some tasks, but there can be multiple approaches to other tasks.
Approaches to Solving ML Tasks
Trends of the past:
Supervised learning. A task-driven approach. First of all, you label the data, for example by feeding a model examples of executable files and stating whether each file is malware or not. Based on this labeled data, the model can make decisions about new data. The disadvantage is the limited amount of labeled data.
Ensemble learning. This is an extension of supervised learning that mixes different simple models to solve the task. There are different methods of combining simple models.
Unsupervised learning. A data-driven approach. It can be used when there is no labeled data and the model has to mark the data by itself based on its properties. It is usually intended to find anomalies in data and is considered more powerful in general, as it's almost impossible to label all data. Currently, it works less precisely than supervised approaches.
Semi-supervised learning. As the name implies, semi-supervised learning tries to combine the benefits of both supervised and unsupervised approaches when only some of the data is labeled.
Future trends (well, probably)
Reinforcement learning. An environment-driven approach that can be used when the behavior has to react to a changing environment. It's like a kid learning about the environment by trial and error.
Active learning. It's more like a subclass of reinforcement learning that will probably grow into a separate class. Active learning resembles a teacher who can help correct errors and behavior in addition to environmental changes.
Machine Learning tasks and Cybersecurity
Let's see the examples of different methods that can be used to solve machine learning tasks and how they are related to cybersecurity tasks.
Regression (or prediction) is simple. Knowledge about existing data is used to get an idea of new data. Take house price prediction as an example. In cybersecurity, it can be applied to fraud detection. The features (e.g., the total amount of suspicious transactions, location, etc.) determine the probability of fraudulent actions.
As for the technical aspects of regression, all methods can be divided into two large categories: machine learning and deep learning. The same division applies to the other tasks.
For each task, examples of ML and DL methods are given.
Machine learning for regression
Below is a short list of machine learning methods (having their own advantages and disadvantages) that can be used for regression tasks.
SVR (Support Vector Regression)
You can find a detailed explanation of each method here.
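To make the fraud example more concrete, here is a minimal sketch of how features can be turned into a probability of fraud. It uses logistic regression trained with plain gradient descent, not SVR, because logistic regression outputs a probability directly. The feature choice, the toy data, and the function names are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=3000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def fraud_probability(x, weights, bias):
    """Probability (0..1) that a transaction with features x is fraudulent."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical toy data: [amount in thousands, 1 if unusual location else 0]
transactions = [[0.1, 0], [0.2, 0], [0.3, 1], [0.5, 0],
                [2.5, 1], [3.0, 1], [2.8, 0], [3.5, 1]]
is_fraud = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(transactions, is_fraud)
print(round(fraud_probability([3.2, 1], weights, bias), 2))   # large, unusual location
print(round(fraud_probability([0.15, 0], weights, bias), 2))  # small, usual location
```

In practice the features would need scaling and far more data; the point is only that the model maps transaction features to a score that can be thresholded.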
Deep learning for regression
For regression tasks, the following deep learning models can be used:
Artificial Neural Network (ANN)
Recurrent Neural Network (RNN)
Neural Turing Machines (NTM)
Differentiable Neural Computer (DNC)
Classification is also straightforward. Imagine you have two piles of pictures classified by type (e.g., dogs and cats). In terms of cybersecurity, a spam filter separating spam from other messages can serve as an example. Spam filters are probably the first ML approach applied to cybersecurity tasks.
The supervised learning approach is usually used for classification where examples of certain groups are known. All classes should be defined in the beginning.
Below is a list of related algorithms.
Machine learning for classification
K-Nearest Neighbors (K-NN)
Support Vector Machine (SVM)
Random Forest Classification
Methods like SVM and random forests are generally considered to work best. Keep in mind that there are no one-size-fits-all rules, and they may not work well for your particular task.
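As an illustration of supervised classification, the sketch below implements K-Nearest Neighbors from the list above for a toy spam filter. The two features (number of links and number of exclamation marks per message) and the data are hypothetical, chosen only to keep the example small.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features per message: [number of links, number of "!" characters]
messages = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 7], [6, 4], [8, 9], [7, 6]]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

print(knn_predict(messages, labels, [6, 5]))  # spam
print(knn_predict(messages, labels, [1, 2]))  # ham
```

K-NN needs no training step, which makes it easy to follow, but every prediction scans all labeled examples, so it scales poorly compared with SVM or random forests.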
Deep learning for classification
Artificial Neural Network
Convolutional Neural Networks
Deep learning methods work better if you have more data. But they consume more resources, especially if you are planning to use them in production and re-train the systems periodically.
Clustering is similar to classification, with one major difference: the classes of the data are unknown. It is not even known whether the data can be classified. This is unsupervised learning.
Supposedly, the best task for clustering is forensic analysis, where the reasons, course, and consequences of an incident are obscure. All activities have to be grouped in order to find anomalies. Malware analysis solutions (i.e., malware protection or secure email gateways) may implement it to separate legitimate files from outliers.
Another interesting area where clustering can be applied is user behavior analytics. In this instance, application users are clustered so that it is possible to see whether they belong to a particular group.
Usually, clustering is not applied to solve a particular cybersecurity task on its own; it is more like one of the subtasks in a pipeline (e.g., grouping users into separate groups to adjust risk values).
Machine learning for clustering
K-nearest neighbors (KNN)
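The user-grouping subtask mentioned above can be sketched with k-means, a widely used clustering algorithm. This is a minimal version under stated assumptions: the per-user features are hypothetical, the number of clusters is fixed at two, and the first k points are used as initial centroids to keep the result deterministic.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical per-user features: [logins per day, MB downloaded per day]
users = [[2, 10], [40, 900], [3, 12], [2, 9], [38, 950], [4, 11], [42, 870]]
assignments, centroids = kmeans(users, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 0, 1]
```

No labels are supplied anywhere: the two groups (light and heavy users) emerge from the data, and a risk value could then be adjusted per group.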
Deep learning for clustering
Self-Organizing Maps (SOM), or Kohonen networks
Association Rule Learning (Recommendation Systems)
Netflix and SoundCloud recommend films or songs according to your movie or music preferences. In cybersecurity, this principle can be used primarily for incident response. If a company faces a wave of incidents and applies various types of responses, a system can learn which type of response suits a particular incident (e.g., mark it as a false positive, change a risk value, run an investigation). Risk management solutions can also benefit if they automatically assign risk values to new vulnerabilities or misconfigurations based on their descriptions.
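The incident response idea can be sketched as a very small rule learner. It counts how often each incident attribute co-occurred with each response and recommends the response with the highest confidence, i.e. the share of past incidents with that attribute that received that response. This is a single-antecedent simplification of association rule mining; the attribute names, responses, and history are hypothetical.

```python
from collections import Counter, defaultdict

def learn_rules(history):
    """Count how often each incident attribute co-occurred with each response."""
    attribute_counts = Counter()
    pair_counts = defaultdict(Counter)
    for attributes, response in history:
        for attribute in attributes:
            attribute_counts[attribute] += 1
            pair_counts[attribute][response] += 1
    return attribute_counts, pair_counts

def recommend(attributes, attribute_counts, pair_counts):
    """Pick the response backed by the highest-confidence rule attribute -> response."""
    best_response, best_confidence = None, 0.0
    for attribute in attributes:
        for response, count in pair_counts.get(attribute, {}).items():
            confidence = count / attribute_counts[attribute]
            if confidence > best_confidence:
                best_response, best_confidence = response, confidence
    return best_response, best_confidence

# Hypothetical history of (incident attributes, analyst response).
history = [
    (("port_scan", "internal_source"), "mark_false_positive"),
    (("port_scan", "internal_source"), "mark_false_positive"),
    (("port_scan", "external_source"), "raise_risk"),
    (("malware_alert", "external_source"), "run_investigation"),
    (("malware_alert", "internal_source"), "run_investigation"),
    (("malware_alert", "external_source"), "run_investigation"),
]
attribute_counts, pair_counts = learn_rules(history)
print(recommend(("malware_alert", "internal_source"), attribute_counts, pair_counts))
# ('run_investigation', 1.0)
```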
There are algorithms used for solving recommendation tasks.
Machine learning for association rule learning
Deep learning for association rule learning
Deep Restricted Boltzmann Machine (RBM)
Deep Belief Network (DBN)
The latest recommendation systems are based on restricted Boltzmann machines and their updated versions, such as promising deep belief networks.
Dimensionality reduction, or generalization, is not as popular as classification, but it is necessary if you deal with complex systems with unlabeled data and many potential features. You can't simply apply clustering, because typical methods restrict the number of features or don't work at all. Dimensionality reduction can help handle this and cut unnecessary features. Like clustering, dimensionality reduction is usually one of the tasks in a more complex model. As for cybersecurity tasks, dimensionality reduction is common in face detection solutions, like the ones you use on your iPhone.
You can find more on dimensionality reduction here (including the general description of the methods and their features).
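One of the simplest ways to cut unnecessary features is variance-threshold feature selection: a column that barely changes across examples cannot help distinguish them, so it is dropped. The sketch below is only that simple variant (methods such as PCA build new combined features instead), and the feature names and values are hypothetical.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.01):
    """Keep only the feature columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    keep = [i for i, column in enumerate(columns) if pvariance(column) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical per-file features; "is_pe_file" is constant, so it carries no signal here.
names = ["entropy", "is_pe_file", "num_imports"]
rows = [[7.2, 1, 12], [4.1, 1, 85], [7.8, 1, 9], [5.0, 1, 60]]

reduced, kept = drop_low_variance(rows, names)
print(kept)        # ['entropy', 'num_imports']
print(reduced[0])  # [7.2, 12]
```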
The task of generative models differs from the above-mentioned ones. While those tasks deal with the existing information and associated decisions, generative models are designed to simulate the actual data (not decisions) based on the previous decisions.
A simple task in offensive cybersecurity is to generate a list of input parameters to test a particular application for injection vulnerabilities.
Alternatively, you can have a vulnerability scanning tool for web applications. One of its modules tests files for unauthorized access. These tests can mutate existing filenames to identify new ones. For example, if a crawler detected a file called login.php, it's worth checking for any backup or test copies by trying names like login_1.php, login_backup.php, login.php.2017. Generative models are good at this.
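The filename mutation idea can be sketched with a model far simpler than a GAN: learn naming templates from previously observed pairs of original and leftover files, then apply the most frequent templates to a newly crawled filename. The observed pairs and function names below are hypothetical, and the template extraction assumes that a variant starts with the original file's stem.

```python
from collections import Counter

def learn_templates(observed_pairs):
    """Turn (original, variant) filename pairs into counted templates like '{stem}_backup.{ext}'."""
    templates = Counter()
    for original, variant in observed_pairs:
        stem, _, ext = original.rpartition(".")
        if stem and variant.startswith(stem):
            rest = variant[len(stem):].replace("." + ext, ".{ext}", 1)
            templates["{stem}" + rest] += 1
    return templates

def generate_candidates(filename, templates, limit=5):
    """Apply the most frequently observed templates to a new filename."""
    stem, _, ext = filename.rpartition(".")
    return [t.format(stem=stem, ext=ext) for t, _ in templates.most_common(limit)]

# Hypothetical pairs seen in earlier scans: (file found by the crawler, leftover copy).
observed = [
    ("index.php", "index_backup.php"),
    ("admin.php", "admin_backup.php"),
    ("config.php", "config.php.2017"),
    ("index.php", "index_1.php"),
    ("admin.php", "admin_backup.php"),
]
templates = learn_templates(observed)
candidates = generate_candidates("login.php", templates)
print(candidates)
```

Because the templates are ranked by how often they were seen, the scanner can try the most likely leftover names first instead of brute-forcing every mutation.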
Machine learning generative models
Deep learning generative models
Generative adversarial networks (GANs)
Recently, GANs have shown impressive results. They can successfully mimic video. Imagine how this could be used to generate examples for fuzzing.
Cybersecurity Tasks and Machine Learning
Instead of looking at ML tasks and trying to apply them to cybersecurity, let's look at the common cybersecurity tasks and machine learning opportunities. There are three dimensions (Why, What, and How).
The first dimension is a goal, or a task (e.g., detect threats, predict attacks, etc.). According to Gartner's PPDR model, all security tasks can be divided into five categories:
The second dimension is a technical layer and an answer to the "What" question (e.g., at which level to monitor issues). Here is the list of layers for this dimension:
network (network traffic analysis and intrusion detection);
application (WAF or database firewalls);
Each layer has different subcategories. For example, network security can be wired, wireless, or cloud. Rest assured that you can't apply the same algorithms with the same hyperparameters to all three areas, at least in the near future. The reason is the lack of data and of algorithms that can find better dependencies among the three areas, so that one algorithm could be swapped for another.
The third dimension is a question of "How" (e.g., how to check the security of a particular area):
in transit in real time;
For example, if you are working on endpoint protection and looking for intrusions, you can monitor the processes of an executable file, do static binary analysis, analyze the history of actions on the endpoint, etc.
Some tasks should be solved in three dimensions. Sometimes, there are no values in some dimensions for certain tasks. Approaches can be the same in one dimension. Nonetheless, each particular point of this three-dimensional space of cybersecurity tasks has its intricacies.
It's difficult to detail them all so let's focus on the most important dimension - technology layers. Look at the cybersecurity solution from this perspective.
Machine learning for Network Protection
Network protection is not a single area but a set of different solutions that focus on a protocol such as Ethernet, wireless, SCADA, or even virtual networks like SDNs.
Network protection refers to well-known Intrusion Detection System (IDS) solutions. Some of them used a kind of ML years ago and mostly dealt with signature-based approaches.
ML in network security implies new solutions called Network Traffic Analytics (NTA), aimed at in-depth analysis of all traffic at each layer and at detecting attacks and anomalies.
How can ML help here? There are some examples:
regression to predict the network packet parameters and compare them with the normal ones;
classification to identify different classes of network attacks such as scanning and spoofing;
clustering for forensic analysis.
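The first item, predicting a traffic parameter and comparing it with what is actually observed, can be sketched with ordinary least squares. The example fits a line that predicts a flow's total bytes from its packet count using baseline traffic, then flags flows that deviate too far from the prediction. The chosen parameters, the baseline numbers, and the 50% tolerance are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def is_anomalous(packets, total_bytes, slope, intercept, tolerance=0.5):
    """Flag a flow whose byte count deviates from the prediction by more than tolerance (a fraction)."""
    predicted = slope * packets + intercept
    return abs(total_bytes - predicted) > tolerance * predicted

# Hypothetical baseline flows recorded during normal operation.
normal_packets = [10, 20, 30, 40, 50]
normal_bytes = [5200, 9900, 15100, 20300, 24800]
slope, intercept = fit_line(normal_packets, normal_bytes)

print(is_anomalous(25, 12600, slope, intercept))  # False: close to the predicted size
print(is_anomalous(25, 36000, slope, intercept))  # True: far more data than expected
```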
You can find at least 10 academic research papers describing diverse approaches.
The new generation of antivirus is Endpoint Detection and Response. It's better to learn features from executable files or from process behavior. Keep in mind that if you deal with machine learning at the endpoint layer, your solution may differ depending on the type of endpoint (e.g., workstation, server, container, cloud instance, mobile, PLC, IoT device). Every endpoint has its own specifics, but the tasks are common:
regression to predict the next system call for an executable process and compare it with the real ones;
classification to divide programs into such categories as malware, spyware, and ransomware;
clustering for malware protection on secure email gateways (e.g., to separate legitimate file attachments from outliers).
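The first task in this list, predicting the next system call and comparing it with the real one, can be approximated with a first-order Markov model: count which call follows which in traces of normal behavior, then measure how many transitions in a new trace were never seen. This is a stand-in for a real predictive model, and the traces and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_transitions(traces):
    """Count which system call follows which in traces of normal behavior."""
    followers = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            followers[current][following] += 1
    return followers

def surprise_ratio(trace, followers):
    """Share of transitions in trace that were never seen during training."""
    pairs = list(zip(trace, trace[1:]))
    if not pairs:
        return 0.0
    unseen = sum(
        1 for current, following in pairs
        if following not in followers.get(current, {})
    )
    return unseen / len(pairs)

# Hypothetical traces of a process behaving normally.
normal_traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
    ["open", "write", "close"],
]
followers = train_transitions(normal_traces)

print(surprise_ratio(["open", "read", "close"], followers))                     # 0.0
print(surprise_ratio(["open", "read", "connect", "send", "close"], followers))  # 0.75
```

A process whose ratio exceeds some threshold would be handed over for closer inspection; the threshold itself has to be tuned per endpoint type.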
Academic papers about endpoint protection and malware specifically are gaining popularity. Here are a few examples:
Application security is my favorite area, by the way, especially ERP Security.
Where can you use ML in app security? In WAFs or in code analysis, both static and dynamic. As a reminder, application security can differ. There are web applications, databases, ERP systems, SaaS applications, microservices, etc. It's almost impossible to build a universal ML model that deals with all threats effectively in the near future. However, you can try to solve some of the tasks.
Here are examples of what you can do with machine learning for application security:
regression to detect anomalies in HTTP requests (for example, XXE and SSRF attacks and auth bypass);
classification to detect known types of attacks like injections (SQL, XSS, RCE, etc.);
clustering user activity to detect DDoS attacks and mass exploitation.
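The classification item can be sketched with a multinomial naive Bayes classifier over tokens of a request parameter. The training samples, labels, and tokenizer below are assumptions; a real WAF model would need a much larger labeled dataset and more careful features.

```python
import math
import re
from collections import Counter

def tokenize(value):
    """Lowercase words plus individual punctuation characters."""
    return re.findall(r"[a-z0-9_]+|[^\sa-z0-9_]", value.lower())

def train_naive_bayes(samples):
    """samples: (parameter value, label) pairs; returns per-label token counts and label counts."""
    token_counts = {}
    label_counts = Counter()
    for value, label in samples:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokenize(value))
    return token_counts, label_counts

def classify(value, token_counts, label_counts):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocabulary = set().union(*token_counts.values())
    scores = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for token in tokenize(value):
            score += math.log((counts[token] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled parameter values.
samples = [
    ("john", "benign"), ("blue shoes size 42", "benign"),
    ("2017-05-01", "benign"), ("hello world", "benign"),
    ("' OR 1=1 --", "sqli"), ("1; DROP TABLE users", "sqli"),
    ("' UNION SELECT password FROM users --", "sqli"),
    ("<script>alert(1)</script>", "xss"),
    ("<img src=x onerror=alert(1)>", "xss"),
]
token_counts, label_counts = train_naive_bayes(samples)

print(classify("' OR 'a'='a' --", token_counts, label_counts))
print(classify("<script>alert(document.cookie)</script>", token_counts, label_counts))
print(classify("red shoes size 40", token_counts, label_counts))
```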
More resources providing ideas of using ML for application security:
User behavior analytics started with Security Information and Event Management (SIEM).
SIEM was able to solve numerous tasks if configured properly, including user behavior search and ML. Then UEBA solutions declared that SIEM couldn't handle new, more advanced types of attacks and constant changes in behavior.
The market has accepted the point that a special solution is required if threats are to be viewed from the user level.
However, even UEBA tools don't cover everything connected with different kinds of user behavior. There are domain users, application users, SaaS users, social networks, messengers, and other accounts that should be monitored.
Unlike malware detection, which focuses on common attacks and allows a classifier to be trained, user behavior is one of the most complex layers and an unsupervised learning problem. As a rule, there is no labeled dataset and no clear idea of what to look for. Therefore, creating a universal algorithm for all types of users is tricky in the user behavior area. Here are the tasks that companies solve with the help of ML:
regression to detect anomalies in user actions (e.g., logging in at an unusual time);
classification to group different users for peer-group analysis;
clustering to separate groups of users and detect outliers.
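The login-time example can be sketched with a per-user baseline instead of a full regression model: compute the mean and standard deviation of a user's past login hours and flag a new login that lies too many standard deviations away (a z-score test). The history and the threshold of 3 are assumptions, and this simple version ignores the fact that hours wrap around midnight.

```python
from statistics import mean, pstdev

def is_unusual_login(hour, past_hours, z_threshold=3.0):
    """Flag a login whose hour lies more than z_threshold standard deviations from the user's mean."""
    center = mean(past_hours)
    spread = pstdev(past_hours)
    if spread == 0:
        # All past logins happened at the same hour; anything else is unusual.
        return hour != center
    return abs(hour - center) / spread > z_threshold

# Hypothetical login hours (24-hour clock) for one user over recent weeks.
past_hours = [9, 9, 10, 8, 9, 10, 9, 8, 10, 9]

print(is_unusual_login(9, past_hours))  # False
print(is_unusual_login(3, past_hours))  # True
```

Since no labeled attacks are needed, this fits the unsupervised nature of the user behavior layer described above.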
The process area is last but not least. When dealing with it, you need to know the business process in order to find something anomalous. Business processes can differ significantly. You can look for fraud in a banking or retail system, or on a plant floor in manufacturing. These are totally different, and they demand a lot of domain knowledge. In machine learning, feature engineering (the way you represent data to your algorithm) is essential to achieving results, and the features are likewise different for every process.
In general, here are examples of tasks in the process area:
regression to predict the next user action and detect outliers such as credit card fraud;
classification to detect known types of fraud;
clustering to compare business processes and detect outliers.
Most of the research papers you can find relate to banking fraud; ICS and SCADA system security is much less represented.
Malware Data Science: Attack Detection and Attribution (Sept 2018) - As the title suggests, this book focuses on malware. It had just been released at the time of writing, so I can't give any feedback yet. But I bet it is a must-read for everyone on endpoint protection teams.
There are more areas left; I have outlined only the basics. On the one hand, machine learning is definitely not a silver-bullet solution if you want to protect your systems. Undoubtedly, there are many issues with interpretability (particularly for deep learning algorithms), but humans also cannot interpret their own decisions, right?
On the other hand, with the growing amount of data and the decreasing number of experts, ML is the only remedy. It works now and will be mandatory soon. It is better to start right now.
Keep in mind, hackers are also starting to use ML in their attacks. My next article will reveal how exactly attackers can utilize ML.