Anomaly Detection

June 23rd, 2017

Anomaly detection is the identification of rare observations that do not conform to the normal pattern in a dataset. Depending on the application domain, it is also referred to as outlier, fraud, intrusion, misuse, deviation, or exception detection.

Anomaly detection has a wide range of applications, such as host or network intrusion detection, fraud detection in the banking and telecom industries, tumor detection, monitoring of electrocardiograms and vital functions in the medical domain, and image-based forensic analysis (document manipulation).

Broadly speaking, there can be three types of anomalies:

1. Point or Global Anomaly
An individual data point is anomalous with respect to the rest of the data, e.g. a 120-year-old person or an Internet payment with an unusually large transaction value.

2. Contextual Anomaly
An observation that is anomalous only in a specific context, e.g. an 80-year-old undergraduate student, or an extremely low temperature in June or July.

3. Collective Anomaly
A subset of instances that is anomalous as a group because it deviates significantly from the remainder of the data, even when the individual instances are not anomalous by themselves. Examples include delays in network traffic or in cargo shipments.
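
To make the difference between the first two types concrete, here is a minimal sketch in Python. It assumes a modified z-score based on the median and the median absolute deviation (MAD), with the commonly used cutoff of 3.5; the function names, the sample values, and the choice of month as the context are all illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict
from statistics import median


def point_anomalies(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def contextual_anomalies(readings, threshold=3.5):
    """Apply the same test separately inside each context (here: the month)."""
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)
    flagged = []
    for context, values in by_context.items():
        flagged.extend((context, v) for v in point_anomalies(values, threshold))
    return flagged


# Hypothetical sample data, chosen only for illustration.
ages = [22, 25, 31, 40, 38, 29, 120, 35]
readings = [("June", t) for t in [24, 26, 25, 27, 2, 26]]
readings += [("January", t) for t in [1, 3, 2, 0, 2, -1]]

print(point_anomalies(ages))           # [120]
print(contextual_anomalies(readings))  # [('June', 2)]
```

A temperature of 2 degrees appears in both months, but it is flagged only in June: the value is judged against its own context rather than against the whole dataset.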

Anomaly detection is a challenging machine learning problem because the dataset is unbalanced: anomalous instances are rare and normal instances are abundant, which makes the two hard to model together. There is no general-purpose anomaly detection solution; each application requires a domain-specific one. Understanding the domain and engineering new features play a critical role in developing a successful anomaly detection solution.

Depending on the nature of the dataset, anomaly detection can be approached with one of the following three techniques:

1. Unsupervised Anomaly Detection
It is one of the most commonly used anomaly detection techniques. It is applied when the dataset is unlabeled, i.e. instances are not labelled as 'normal' or 'anomaly'. The majority of instances are usually normal, and anomalous instances are very rare. The challenge in unsupervised anomaly detection is to look at each instance and determine whether it fits the remainder of the dataset. Clustering algorithms are often used to perform unsupervised anomaly detection.
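
As one possible illustration of the clustering idea, the sketch below uses the noise definition from density-based clustering (DBSCAN): a point is a core point if it has at least min_pts neighbours within radius eps, and a point that is neither a core point nor within eps of one belongs to no cluster. The function name, the eps and min_pts values, and the sample points are assumptions made for this example; other clustering algorithms could be used instead.

```python
from math import dist


def dbscan_noise(points, eps, min_pts):
    """Return the points that DBSCAN would label as noise (members of no cluster)."""
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # Core points sit in dense regions.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    # Noise: not a core point and not within eps of any core point.
    return [
        points[i]
        for i, nb in enumerate(neighbors)
        if i not in core and not any(j in core for j in nb)
    ]


# Hypothetical unlabeled data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 0.5),
]

print(dbscan_noise(points, eps=0.5, min_pts=3))  # [(9.0, 0.5)]
```

No labels are used anywhere: the isolated point is flagged only because it does not fit the structure of the remaining data. In practice the result depends strongly on eps and min_pts, which have to be tuned for the domain.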

2. Supervised Anomaly Detection
In supervised anomaly detection, instances in the dataset are labelled as 'normal' or 'anomaly'. It can therefore be treated as a binary classification problem and solved by training a classifier on the given dataset. The unbalanced nature of the data and the skewed anomaly class can make supervised anomaly detection a difficult problem.
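
The effect of the skewed anomaly class can be shown with a small worked example. The sketch below assumes a hypothetical dataset of 1000 labelled instances with 10 anomalies, and compares a trivial baseline that always predicts 'normal' with a hypothetical detector that catches 8 of the 10 anomalies at the cost of 20 false alarms. The numbers are invented purely for illustration.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision and recall, with label 1 = anomaly and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 10 anomalies followed by 990 normal instances.
y_true = [1] * 10 + [0] * 990

# Baseline: always predict 'normal'.
baseline_pred = [0] * 1000

# Hypothetical detector: 8 anomalies caught, 2 missed, 20 false alarms.
detector_pred = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

baseline = evaluate(y_true, baseline_pred)
detector = evaluate(y_true, detector_pred)
print(baseline)  # accuracy 0.99, recall 0.0
print(detector)  # accuracy 0.978, recall 0.8
```

The baseline reaches higher accuracy than the detector while finding no anomalies at all, which is why precision and recall on the anomaly class are usually more informative than accuracy for this kind of problem.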

3. Semi-supervised Anomaly Detection
It involves building a model from a training dataset that represents normal behavior. The model is then used to generate the instance it expects for each test case. A test instance is flagged as an anomaly when the distance between it and the instance generated by the learned model is large.
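
A minimal sketch of this idea follows, under strong simplifying assumptions: the "model" is just the per-feature mean and standard deviation learned from normal instances only, the instance it generates is the mean vector, and the distance is a standardized Euclidean distance compared against an assumed threshold of 3.0. The feature choice (heart rate and body temperature), the function names, and all values are hypothetical; a real system would typically use a richer model.

```python
from math import sqrt
from statistics import mean, stdev


def fit_normal_profile(normal_rows):
    """Learn per-feature mean and standard deviation from normal instances only."""
    columns = list(zip(*normal_rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def anomaly_score(row, profile):
    """Standardized Euclidean distance between a row and the learned mean vector."""
    means, stds = profile
    return sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds)))


def is_anomaly(row, profile, threshold=3.0):
    """Flag a row whose distance from normal behavior exceeds the assumed threshold."""
    return anomaly_score(row, profile) > threshold


# Hypothetical training data containing normal behavior only:
# (heart rate in bpm, body temperature in degrees Celsius).
normal_rows = [
    (70, 36.6), (72, 36.7), (68, 36.5),
    (75, 36.8), (71, 36.6), (69, 36.7),
]
profile = fit_normal_profile(normal_rows)

print(is_anomaly((73, 36.7), profile))   # False
print(is_anomaly((140, 39.5), profile))  # True
```

No anomalous example is needed for training; the threshold is the only place where a decision about "how far is too far" enters, and it has to be chosen for the domain.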