This vignette describes the anomaly detection algorithm and provides some examples.
The anomaly detection problem is set up here as a binary classification task, where each observation \(x\) is classified as anomalous (class 1) or non-anomalous (class 0). We assume that anomalies are much less frequent than normal observations.
The scoring method is as follows. Given an observation \(x\) with features \(\{x_1, ..., x_n\}\), compute the probability density \(p(x)\). If \(p(x) < \epsilon\), where \(\epsilon\) is a tuned threshold parameter, the observation is classified as anomalous.
This algorithm assumes that each of the features \(x_i\) is normally distributed with mean \(\mu_i\) and variance \(\sigma_i^2\): \(x_i \sim \mathcal{N}(\mu_i,\sigma_i^2)\).
For a variable with mean \(\mu\) and variance \(\sigma^2\), the probability density of the normal distribution is given by:
\[p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
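As a small illustration (not part of the package itself), the density above can be evaluated directly; the helper name `normal_pdf` is hypothetical:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x: exp(-(x-mu)^2 / (2*sigma2)) / sqrt(2*pi*sigma2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# At the mean of a standard normal, the density is 1/sqrt(2*pi) ≈ 0.3989.
print(normal_pdf(0.0, 0.0, 1.0))
```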
The normal distribution is parametrized by mean \(\mu\) and variance \(\sigma^2\). Given \(m\) observed values \(\{x^{(1)},x^{(2)}, ..., x^{(m)}\}\) of a feature, the parameters are estimated using maximum likelihood:
\[\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}\] \[\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)^2\]
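These estimates are straightforward to compute for one feature; a minimal sketch (the function name `estimate_gaussian` is illustrative, not the package's API):

```python
def estimate_gaussian(samples):
    """Maximum-likelihood estimates of mean and variance for one feature.

    Note the variance uses the 1/m (biased) normalizer, matching the
    formulas above, not the 1/(m-1) sample variance.
    """
    m = len(samples)
    mu = sum(samples) / m
    sigma2 = sum((x - mu) ** 2 for x in samples) / m
    return mu, sigma2

print(estimate_gaussian([1.0, 2.0, 3.0]))  # mean 2.0, variance 2/3
```

For multiple features, the same estimate is applied to each feature column independently, yielding \(\mu_j\) and \(\sigma_j^2\) per feature.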
The joint probability density is calculated under the assumptions that the features are independent of each other and each feature is normally distributed. For an observation \(x \in \mathbb{R}^n\) with features \(x_1, ..., x_n\) and per-feature parameters \(\mu_j\) and \(\sigma_j^2\):
\[p(x) = \prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)\]
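Combining the per-feature densities with the \(\epsilon\) threshold gives the full scoring rule. A minimal sketch, assuming per-feature parameters are already estimated (function names are hypothetical):

```python
import math

def joint_density(x, mus, sigma2s):
    """p(x) under the independence and normality assumptions:
    the product of per-feature normal densities."""
    p = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        p *= math.exp(-(xj - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

def is_anomaly(x, mus, sigma2s, epsilon):
    """Classify x as anomalous (class 1) when its density falls below epsilon."""
    return joint_density(x, mus, sigma2s) < epsilon

# A point far from the mean of a 2-feature standard normal has very low density.
print(is_anomaly([5.0, 5.0], [0.0, 0.0], [1.0, 1.0], epsilon=1e-4))
```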
Using the above assumptions and definitions, the algorithm is implemented as follows:
1. Split the data into a training set, a cross-validation set, and a test set. The training set should contain only non-anomalous observations; anomalous observations are placed in the cross-validation and test sets.
2. Estimate \(\mu_j\) and \(\sigma_j^2\) for each feature from the training set.
3. Choose the threshold \(\epsilon\), e.g. by selecting the value that maximizes the \(F_1\) score on the cross-validation set.
4. Classify the cross-validation set using \(\mu_j\) and \(\sigma_j^2\) from step 2 and \(\epsilon\) from step 3.
5. Classify the test set using \(\mu_j\) and \(\sigma_j^2\) from step 2 and \(\epsilon\) from step 3, for final evaluation.

The anode package is based on the anomaly detection lectures from Andrew Ng’s Machine Learning course.
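The steps above can be sketched end to end in Python. This is a minimal illustration under the stated assumptions, not the package's implementation; all function names, the tiny data set, and the simple grid search over \(\epsilon\) are hypothetical choices:

```python
import math

def gaussian_params(train):
    """Step 2: per-feature maximum-likelihood estimates of mu and sigma^2."""
    m, n = len(train), len(train[0])
    mus = [sum(row[j] for row in train) / m for j in range(n)]
    sigma2s = [sum((row[j] - mus[j]) ** 2 for row in train) / m for j in range(n)]
    return mus, sigma2s

def density(x, mus, sigma2s):
    """p(x): product of per-feature normal densities."""
    p = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        p *= math.exp(-(xj - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

def f1_score(preds, labels):
    """F1 over boolean predictions/labels (True = anomalous)."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def select_epsilon(cv_x, cv_y, mus, sigma2s, n_grid=100):
    """Step 3: pick the epsilon maximizing F1 on the cross-validation set."""
    ps = [density(x, mus, sigma2s) for x in cv_x]
    lo, hi = min(ps), max(ps)
    best_eps, best_f1 = lo, -1.0
    for k in range(1, n_grid + 1):
        eps = lo + (hi - lo) * k / n_grid
        f1 = f1_score([p < eps for p in ps], cv_y)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps

# Tiny synthetic example: training set of normal points only (step 1) ...
train = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0], [0.0, -0.05]]
mus, sigma2s = gaussian_params(train)                      # step 2
# ... cross-validation set containing one labeled anomaly.
cv_x = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0]]
cv_y = [False, True, False]
eps = select_epsilon(cv_x, cv_y, mus, sigma2s)             # step 3
# Steps 4-5: classify by comparing densities to eps.
print([density(x, mus, sigma2s) < eps for x in cv_x])
```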