Week 9 notes: Anomaly Detection

2017-09-14

Introduction

No notes.

Density Estimation
Problem Motivation

Build model p(x) from data.

If p(xtest) < ε, it’s an anomaly.

Gaussian Distribution

Gaussian (Normal) Distribution

Say x ~ N(μ, σ2), i.e. x is a Gaussian-distributed variable with mean μ, standard deviation σ, and variance σ2. Then the probability density function is:

    p(x; μ, σ2) = (1 / (√(2π) σ)) exp(−(x − μ)2 / (2σ2))

Figure 1. The red curve is the standard normal distribution
Source: Wikipedia - Normal distribution

Parameter estimation

Given the dataset x(1), .., x(m), where each x(i) is drawn from a Gaussian distribution, the task is to estimate μ and σ2:

    μ = (1/m) Σ i=1..m x(i)
    σ2 = (1/m) Σ i=1..m (x(i) − μ)2

In machine learning people tend to use the 1/m formula, but in practice whether it is 1/m or 1/(m − 1) makes essentially no difference, assuming m is reasonably large.
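
A minimal NumPy sketch of these estimates (the sample values are made up for illustration):

    import numpy as np

    x = np.array([2.1, 1.9, 2.3, 2.0, 1.8])   # hypothetical 1-D training samples
    mu = x.mean()                              # (1/m) * sum of the x(i)
    sigma2 = ((x - mu) ** 2).mean()            # the 1/m version of the variance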

Algorithm

Density estimation

Training set: x(1),…,x(m).

Each sample is a vector x ∈ Rn of n features. Assuming each feature xj is distributed Gaussian with mean μj and variance σj2 (and treating the features as independent), the model is:

    p(x) = p(x1; μ1, σ12) p(x2; μ2, σ22) ⋯ p(xn; μn, σn2) = Π j=1..n p(xj; μj, σj2)

Anomaly detection algorithm

  1. Compute μ1,..,μn; σ1,..,σn (fit the parameters of each feature on the training set).
  2. Given a new example x, compute p(x). Flag an anomaly if p(x) < ε (a sketch of the whole pipeline follows this list).
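
A rough NumPy sketch of the pipeline under the independence assumption above; the data, the new examples, and the threshold ε are placeholders you would replace with your own:

    import numpy as np

    def fit_gaussian(X):
        """Estimate mu_j and sigma_j^2 for each feature (column) of X."""
        return X.mean(axis=0), X.var(axis=0)    # 1/m formula for the variance

    def p(X, mu, sigma2):
        """p(x) = product over j of the univariate Gaussian densities."""
        d = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        return d.prod(axis=1)

    X_train = np.random.randn(1000, 2)          # hypothetical (all normal) training set
    mu, sigma2 = fit_gaussian(X_train)

    X_new = np.array([[0.1, -0.2], [5.0, 5.0]]) # hypothetical new examples
    epsilon = 1e-3
    anomalies = p(X_new, mu, sigma2) < epsilon  # typically array([False, True])
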
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System

The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm (a number to compare for instance).

Assume we have some labeled data, of anomalous and non-anomalous examples. (y = 0 if normal, y = 1 if anomalous).

Training set x(i) (unlabeled).

Cross validation set (xcv(i), ycv(i)).

Test set (xtest(i), ytest(i)).

Aircraft engines motivating example

10000 good engines (normal).

20 flawed engines (anomalous).

We split the data into:

  • Training set: 6000 (60%) good engines (y = 0, but we assume they’re unlabeled).

  • CV: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).

  • Test: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).

We work on the training set, compute μ1,..,μn; σ1,..,σn and the model p(x).

Algorithm evaluation

Fit model p(x) on training set (the vast majority of them are normal).

On a cross validation/test example x, predict:

  • y = 1 if p(x) < ε (anomaly).
  • y = 0 if p(x) > ε (normal).

and see how often it gets the label right.

Classification accuracy is not a good way to measure the algorithm’s performance, because of skewed classes (so an algorithm that always predicts y = 0 will have high accuracy). Instead, use these possible evaluation metrics:

  • True positive, false positive, false negative, true negative.
  • Precision/Recall.
  • F1-score.

We can also use the cross validation set to choose the parameter ε: try many different values of ε and pick the one that maximizes (say) the F1-score, or otherwise does well on the CV set.
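
A hedged sketch of that ε search; p_cv (the values p(x) of the cross validation examples) and y_cv (their 0/1 labels) are assumed to be NumPy arrays computed beforehand:

    import numpy as np

    def select_epsilon(p_cv, y_cv):
        """Try many thresholds, keep the one with the best F1-score on the CV set."""
        best_eps, best_f1 = 0.0, 0.0
        for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
            pred = (p_cv < eps).astype(int)            # y = 1 means "anomaly"
            tp = np.sum((pred == 1) & (y_cv == 1))
            fp = np.sum((pred == 1) & (y_cv == 0))
            fn = np.sum((pred == 0) & (y_cv == 1))
            if tp == 0:
                continue
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f1 = 2 * precision * recall / (precision + recall)
            if f1 > best_f1:
                best_eps, best_f1 = eps, f1
        return best_eps, best_f1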

Anomaly Detection vs. Supervised Learning

Problem: if we have labeled data (we know some examples are anomalies and some are not), why don’t we just use supervised learning?

Describe when to use anomaly detection versus supervised learning

Anomaly detection:

  • Very small number of positive examples (y = 1), (0-20 is common), large number of negatives (y = 0) examples.
  • Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.

Supervised learning:

  • Large number of positive and negative examples.
  • Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.
Choosing what features to use

Non-gaussian features

If we plot a histogram of the feature xi and it does not look like a bell-shaped curve, we can play with different transformations of the data in order to make it look more Gaussian.

The algorithm will usually work okay even if we don’t, but if we use these transformations to make the data more gaussian, it might work a bit better.


Figure 2. Gaussian and Non-gaussian feature
Source: Coursera Machine Learning course

For example, if the second feature is not bell-shaped, taking a log(x) transformation can make it look more Gaussian.
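
A quick way to experiment with such transformations in NumPy (the feature values here are made up; plotting a histogram before and after is the usual sanity check):

    import numpy as np

    x = np.random.exponential(scale=2.0, size=1000)  # a skewed, non-Gaussian feature
    x_log = np.log(x + 1)      # log(x + c); the constant keeps the log defined at 0
    x_sqrt = np.sqrt(x)        # other candidates: x**0.5, x**(1/3), ...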

Error analysis for anomaly detection

Want p(x) large for normal examples x.

Want p(x) small for anomalous examples x.

What if: p(x) is comparable (say, both large) for normal and anomalous examples.

We then look at that anomalous example, figure out what went wrong, and see if we can come up with a new feature that helps distinguish this example from the rest of the normal examples.

Monitoring computers in a data center

Choose features that might take on unusually large or small values in the event of an anomaly.

  • x1 = memory use of computer.
  • x2 = number of disk accesses/sec.
  • x3 = CPU load.
  • x4 = network traffic.

The new features may be combinations of the available features (a small sketch follows the list):

  • x5 = x3 / x4.
  • x6 = x32 / x4 (CPU load squared divided by network traffic).
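
Building such combined features is just element-wise arithmetic on the existing columns; a small illustration with made-up values:

    import numpy as np

    x3 = np.array([0.8, 0.9, 0.95])     # CPU load
    x4 = np.array([120.0, 115.0, 2.0])  # network traffic
    x5 = x3 / x4                        # large when CPU is busy but traffic is low
    x6 = x3 ** 2 / x4                   # emphasizes that unusual combination even more
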
Predicting Movie Ratings
Problem Formulation

Example: Predicting movie ratings

User rates movies using zero to five stars.


Figure 3. User rates movies using zero to five stars
Source: Coursera Machine Learning course

Some notations:

  • nu: no. users
  • nm: no. movies
  • r(i, j) = 1 if user j has rated movie i.
  • y(i, j) = rating given by user j to movie i (defined only if r(i, j) = 1).

Problem: predict the values of the question marks (to recommend it to the user).

Content Based Recommendations

Content-based recommender systems

Let’s say each movie has two features:

  • x1 = degree of romance.
  • x2 = degree of action.

Figure 4. Two new features of 5 movies
Source: Coursera Machine Learning course

Now each movie can be represented by a feature vector, for example the first movie “Love at last”: x(1)=[1 0.9 0]T (including bias).

For each user j, learn a parameter θ(j). Predict user j as rating movie i with (θ(j))Tx(i) stars.

We denote m(j) = no. of movies rated by user j.

To learn θ(j), this is actually a linear regression problem:

    min over θ(j):  (1/(2m(j))) Σ i: r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2

If we want, we can also add a regularization term (not regularizing the bias term θ0, of course), and we can drop the constant m(j) because it doesn’t affect the optimization:

    min over θ(j):  (1/2) Σ i: r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2 + (λ/2) Σ k=1..n (θk(j))2

To learn all of θ(1),..,θ(nu), sum this objective over the users:

    min over θ(1),..,θ(nu):  (1/2) Σ j=1..nu Σ i: r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2 + (λ/2) Σ j=1..nu Σ k=1..n (θk(j))2

And the gradient descent update:


Figure 5. Gradient Descent update
Source: Coursera Machine Learning course
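
A small sketch of learning one user’s θ(j) by gradient descent on the objective above. X holds the feature vectors (with bias x0 = 1) of the movies that user j rated, y the ratings they gave; the learning rate, λ and the numbers are arbitrary:

    import numpy as np

    def learn_theta(X, y, lam=1.0, alpha=0.01, iters=2000):
        """Regularized linear regression for one user."""
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            err = X @ theta - y          # (theta(j)^T x(i) - y(i,j)) for each rated movie
            grad = X.T @ err
            grad[1:] += lam * theta[1:]  # don't regularize the bias term theta_0
            theta -= alpha * grad
        return theta

    X = np.array([[1.0, 0.9, 0.0],       # movies rated by user j
                  [1.0, 0.1, 1.0]])
    y = np.array([5.0, 0.0])             # the ratings user j gave them
    theta_j = learn_theta(X, y)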

It may be very difficult to get the features for the movies/items. In the next lesson, we will talk about an approach that does not assume we have these features.

Collaborative Filtering
Collaborative Filtering

Let’s consider the problem where we don’t know the values of the movies’ features (e.g. x1 and x2 are unknown).

Furthermore, suppose the users now tell us how much they like romance and action movies. For example θ(Alice) = θ(1) = [0 5 0]T, meaning that Alice likes romance movies (θ1(1) = 5) and doesn’t really like action movies (θ2(1) = 0).

From this information, it becomes possible to infer the values of x1 and x2 for each movie.

For example, the movie Love at last is liked by Alice and Bob and hated by Carol and Dave. Since Alice and Bob like romance movies while Carol and Dave like action movies, we can reasonably conclude that this is a romance movie, and x(1) might be [1 1.0 0.0]T.

Mathematically, what we’re really asking is what feature vector should x(1) be so that:

  • (θ(1))Tx(1) ≈ 5.
  • (θ(2))Tx(1) ≈ 5.
  • (θ(3))Tx(1) ≈ 0.
  • (θ(4))Tx(1) ≈ 0.

Optimization algorithm

Given θ(1),..,θ(nu), to learn x(i):

    min over x(i):  (1/2) Σ j: r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2 + (λ/2) Σ k=1..n (xk(i))2

Given θ(1),..,θ(nu), to learn all of x(1),..,x(nm):

    min over x(1),..,x(nm):  (1/2) Σ i=1..nm Σ j: r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2 + (λ/2) Σ i=1..nm Σ k=1..n (xk(i))2
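
As a sketch, solving for one movie’s x(i) given the users’ θ’s is a small regularized least-squares problem. Theta stacks the θ(j) of the users who rated movie i, y holds their ratings; λ and the numbers are made up:

    import numpy as np

    def learn_x(Theta, y, lam=1.0):
        """Closed form: (Theta^T Theta + lam*I) x = Theta^T y."""
        n = Theta.shape[1]
        return np.linalg.solve(Theta.T @ Theta + lam * np.eye(n), Theta.T @ y)

    Theta = np.array([[5.0, 0.0],    # users who rated this movie (romance vs. action)
                      [5.0, 0.0],
                      [0.0, 5.0]])
    y = np.array([5.0, 5.0, 0.0])    # their ratings
    x_i = learn_x(Theta, y)          # comes out close to [1, 0]: a romance movie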

Summary

Given x(1),..,x(nm) (and movie ratings), can estimate θ(1),..,θ(nu).

Given θ(1),..,θ(nu), can estimate x(1),..,x(nm).

Chicken and egg problem

Initially, we can initialize θ to some random values, then use it to learn x, then use x to learn a better θ, then use θ to learn a better x, and so on… This actually works, and it will cause the algorithm to converge to a reasonable set of features for the movies and a reasonable set of parameters for the users.

In the next lesson we will improve this algorithm and make it quite a bit more computationally efficient.

Collaborative Filtering Algorithm

There’s a more efficient algorithm that doesn’t need to go back and forth, but can solve for θ and x simultaneously.

Put (1) and (2) together into a single cost function:

    J(x(1),..,x(nm), θ(1),..,θ(nu)) = (1/2) Σ (i,j): r(i,j)=1 ((θ(j))Tx(i) − y(i,j))2 + (λ/2) Σ i=1..nm Σ k=1..n (xk(i))2 + (λ/2) Σ j=1..nu Σ k=1..n (θk(j))2

If we hold x constant and minimize J with respect to θ, it becomes problem (1). If we do the opposite, it becomes problem (2).

We do away with the convention x0 = 1, so there is no need for θ0; x and θ are now both n-dimensional.

Collaborative filtering algorithm

  1. Initialize θ(1),..,θ(nu), x(1),..,x(nm) to small random values.

  2. Minimize J(x(1),..,x(nm), θ(1),..,θ(nu)) using gradient descent (or an advanced optimization algorithm); see the sketch after this list. E.g. for every j = 1,..,nu, i = 1,..,nm:

    xk(i) := xk(i) − α (Σ j: r(i,j)=1 ((θ(j))Tx(i) − y(i,j)) θk(j) + λ xk(i))
    θk(j) := θk(j) − α (Σ i: r(i,j)=1 ((θ(j))Tx(i) − y(i,j)) xk(i) + λ θk(j))

In these formulas we no longer have the convention x0 = 1, so every parameter is regularized.

  3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of θTx.
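
A compact NumPy sketch of this algorithm. Y is the ratings matrix (movies × users), R the 0/1 matrix with r(i, j); the number of features, λ, the learning rate and the iteration count are placeholders:

    import numpy as np

    def collaborative_filtering(Y, R, n=10, lam=1.0, alpha=0.005, iters=500):
        n_m, n_u = Y.shape
        X = np.random.randn(n_m, n) * 0.01        # step 1: small random initialization
        Theta = np.random.randn(n_u, n) * 0.01
        for _ in range(iters):                    # step 2: gradient descent on J
            err = (X @ Theta.T - Y) * R           # only count pairs with r(i, j) = 1
            X_grad = err @ Theta + lam * X
            Theta_grad = err.T @ X + lam * Theta
            X -= alpha * X_grad
            Theta -= alpha * Theta_grad
        return X, Theta

    # step 3: predict user j's rating of movie i as theta(j)^T x(i),
    # i.e. predictions = X @ Theta.T
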
Low rank matrix factorization
Vectorization: low rank matrix factorization

This lesson talks about the vectorized implementation of the collaborative filtering algorithm, and other things you can do with this algorithm.

Group all the ratings (by movie and user) into a matrix Y:

Predicted ratings matrix:


Figure 6. Predicted ratings matrix; entry (i, j) = predicted rating of user j on movie i.
Source: Coursera Machine Learning course

This predicted ratings matrix can be written as XΘT, where:


Figure 7. X and Θ matrix.
Source: Coursera Machine Learning course

This algorithm is called Low Rank Matrix Factorization; the term comes from the fact that, in linear algebra, the matrix XΘT has low rank.
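
In NumPy this vectorized prediction is literally one line; X and Theta below stand for the matrices in Figure 7, filled with made-up numbers:

    import numpy as np

    X = np.array([[0.9, 0.0],        # one row per movie: its learned features
                  [0.1, 1.0]])
    Theta = np.array([[5.0, 0.0],    # one row per user: their learned parameters
                      [0.0, 5.0]])
    predictions = X @ Theta.T        # entry (i, j) = predicted rating of user j on movie i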

Finding related movies

For each product i, we learn a feature vector x(i). Although we don’t know exactly what these features are, and they’re not easy to visualize, the algorithm usually learns very meaningful features that capture whatever the most important properties of a movie are that cause you to like or dislike it.

Problem: How to find movies j related to movie i?

If the distance between movie i and movie j is small (i.e. ||x(i) − x(j)|| is small), then the two movies might be similar. (You can pick the K movies with the smallest distance to movie i to find the K most similar movies.)
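
A sketch of that search, where X is the learned movie-feature matrix (one row per movie):

    import numpy as np

    def most_similar(X, i, K=5):
        """Indices of the K movies whose feature vectors are closest to movie i."""
        dists = np.linalg.norm(X - X[i], axis=1)   # ||x(i) - x(j)|| for every j
        dists[i] = np.inf                          # exclude the movie itself
        return np.argsort(dists)[:K]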

Implementation Detail: Mean Normalization

Let’s consider a user who hasn’t rated any movies.


Figure 8. Eve hasn't rated any movies.
Source: Coursera Machine Learning course

According to the optimization objective, the only term that affects θ(5) is the regularization term. Minimizing it, θ(5) will end up equal to 0.

So when we go to predict how user 5 would rate any movie, the result is 0 for every movie, and we have no basis for deciding which movie to recommend.

The idea of mean normalization will help us fix this problem:


Figure 9. Build Y matrix, compute the mean of each row, then subtract from each row the average rating for that movie. (normalizing each movie to have an average rating of zero).
Source: Coursera Machine Learning course

Then we learn θ and x from the new Y matrix.

For user j, on movie i predict: (θ(j))Tx(i) + μi.

For user 5, on movie i predict: (θ(5))Tx(i) + μi = μi (since θ(5) = 0).

This is reasonable because if a person hasn’t rated/watched any movie, we just predict for each of the movies the average rating that those movies got.

If some movie has no ratings, we can handle it similarly (subtract the mean of each column instead of each row).
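
A small sketch of the normalization step and the corresponding prediction; Y and R are as before, and the code assumes every movie has at least one rating (the no-rating case would be handled column-wise as noted above):

    import numpy as np

    def normalize_ratings(Y, R):
        """Subtract each movie's mean rating, computed over rated entries only."""
        mu = (Y * R).sum(axis=1) / R.sum(axis=1)   # per-movie average rating
        Y_norm = (Y - mu[:, None]) * R             # shift only the rated entries
        return Y_norm, mu

    # after learning X and Theta on Y_norm, predict:
    # prediction[i, j] = X[i] @ Theta[j] + mu[i]   # a brand-new user just gets mu[i]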