Week 9 notes: Anomaly Detection
2017-09-14Introduction
No notes.
Density Estimation
Problem Motivation
Build model p(x) from data.
If p(xtest) < ɛ, it’s an anomaly.
Gaussian Distribution
Gaussian (Normal) Distribution
Say , if x is a distributed Gaussian with mean μ, standard deviation σ, and variance σ2; then the probability density function (hàm mật độ xác suất) is:
Parameter estimation
Given the dataset x(1), ..,x(m); where x(i) is a distributed Gaussian. The task is to find μ and σ2:
In machine learning people tend to lean 1/m formula but in practice whether it is 1/m or 1/(m-1) it makes essentially no difference assuming m is reasonably large.
Algorithm
Density estimation
Traning set: x(1),…,x(m).
Each sample is
Anomaly detection algorithm
- Compute μ1,..,μn; σ1,..,σn.
- Given new example x, compute p(x). Anomaly if p(x) < ɛ.
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
The important of real-number evaluation
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm (a number to compare for instance).
Assume we have some labeled data, of anomalous and non-anomalous examples. (y = 0 if normal, y = 1 if anomalous).
Training set x(i) (unlabeled).
Cross validation set (xcv(i), ycv(i)).
Test set (xtest(i), ytest(i)).
Aircraft engines motivating example
1000 good engines (normal).
20 flawed engines (anomalous).
We split the data into:
-
Training set: 6000 (60%) good engines (y = 0, but we assume they’re unlabeled).
-
CV: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).
-
Test: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).
We work on the training set, compute μ1,..,μn; σ1,..,σn and the model p(x).
Algorithm evaluation
Fit model p(x) on training set (the vast majority of them are normal).
On a cross validation/test example x, predict:
- y = 1 if p(x) < ε (anomaly).
- y = 0 if p(x) > ε (normal).
and see how often it gets the label right.
Classification accuracy is not a good way to measure the algorithm’s performance, because of skewed classes (so an algorithm that always predicts y = 0 will have high accuracy). Instead, use these possible evaluation metrics:
- True positive, false positive, false negative, true negative.
- Precision/Recall.
- F1-score.
Can also use cross validation set to choose parameter ε: try many different values of ε and pick the one that minimizes (say) F1-score, or otherwise does well on your cv set.
Anomaly Detection vs. Supervised Learning
Problem: If we have the label data (we know some of them are anomalies, some are not anomalies), why don’t we just use supervised learning?
Describe when to use anomaly detection versus supervised learning
Anomaly detection:
- Very small number of positive examples (y = 1), (0-20 is common), large number of negatives (y = 0) examples.
- Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
- Large number of positive and negative examples.
- Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.
Choosing what features to use
Non-gaussian features
If we plot the Gaussian distribution p(xi; μi, σi2) of the feature xi and the graph is not a bell shaped curve, we can play with different transformations of the data in order to make it more Gaussian.
The algorithm will usually work okay even if we don’t, but if we use these transformations to make the data more gaussian, it might work a bit better.
For example, in the second feature, if we take the log(x)
transformations, we can make it become Gaussian.
Error analysis for anomaly detection
Want p(x) large for normal examples x.
Want p(x) small for anomalous examples x.
What if: p(x) is comparable (say, both large) for normal and anomalous examples.
We need to look at that example and figure out what went wrong, and see if we can come up with a new feature x2 that helps to distinguish between this bad example, compared to the rest of normal examples.
Monitoring computers in a data center
Choose feature that might take on unusually large or small values in the event of an anomaly.
- x1 = memory use of computer.
- x2 = number of disk accesses/sec.
- x3 = CPU load.
- x4 = network traffic.
The new features may be the combination of available features:
- x5 = x3 / x4.
- x6 = x32 / x4.
Predicting Movie Ratings
Problem Formulation
Example: Predicting movie ratings
User rates movies using zero to five starts.
Some notations:
- nu: no. users
- nm: no. movies
- r(i, j) = 1 if user j has rated movie i.
- y(i, j) = rating given by user j to move i (defined only if r(i, j) = 1).
Problem: predict the values of the question marks (to recommend it to the user).
Content Based Recommendations
Content-based recommender systems
Let’s say each movies have two features:
- x1 = degree of romantic.
- x2 = degree of action.
Now each movie can be represented by a feature vector, for example the first movie “Love at last”: x(1)=[1 0.9 0]T (including bias).
For each user j, learn a parameter θ(j). Predict user j as rating movie i with (θ(j))Tx(i) stars.
We denote m(j) = no. of movies rated by user j.
To learn θ(j), this is actually a linear regression problem:
If want, we can also add the regularization term (ignore bias term of θ course); and we can also eliminate the constant m(j) because it doesn’t affect the optimization objective.
To learn θ(1),..,θ(nu):
And the gradient descent update:
It maybe very difficult to get the features for the movies/items. In the next lesson, we will talk about an approach that does not assume that we have these features.
Collaborative Filtering
Collaborative Filtering
Let consider the problem when we don’t know the values of the movies’ features. (e.g. x1 and x2 are unknown).
Further more, now the users will tell us how much they like romantic and action movies. For example θ(Alice) = θ(1) = [0 5 0]T, it means that Alice likes romantics movies (θ1(1) = 5) and doesn’t really like action movies (θ2(1) = 0).
From these figures, it becomes possible to try to infer what are the values of x1 and x2 for each movie.
For example; the movie Love at last is liked by Alice, Bob and hated by Carol, Dave. Since Alice, Bob like romantic movies; Carol, Dave like action movies; we can reasonably conclude that this is a romantic movie, and x(1) might be [1 1.0 0.0]T.
Mathematically, what we’re really asking is what feature vector should x(1) be so that:
- (θ(1))Tx(1) ≈ 5.
- (θ(2))Tx(1) ≈ 5.
- (θ(3))Tx(1) ≈ 0.
- (θ(4))Tx(1) ≈ 0.
Optimization algorithm
Given θ(1),..,θ(nu), to learn x(i):
Given θ(1),..,θ(nu), to learn x(1),..,x(nm):
Summary
Given x(1),..,x(nm) (and movie ratings), can estimate θ(1),..,θ(nu).
Given θ(1),..,θ(nu), can estimate x(1),..,x(nm).
Chicken and egg problem
Initially, we can random some value for θ, then use it to learn x, then use x to learn better θ, then use θ to learn better x, and so on… This actually works and if you do this, this will actually cause your algorithm to converge to a reasonable set of features for your movies and a reasonable set of parameter for your different users.
In the next lesson we will improve this algorithm and make it quite a bit more computationally efficient.
Collaborative Filtering Algorithm
There’s a more efficient algorithm that doesn’t need to go back and forth, but can solve for θ and x simultaneously.
Put (1) and (2) together:
If we hold the x constant, and minimize J with the respect to θ, it becomes (1) problem. If we do the opposite, it becomes (2) problem.
We can get away with the convention x0 = 1, thus no need for θ0. x and θ are now both n dimension.
Collaborative filtering algorithm
-
Initialize θ(1),..,θ(nu), x(1),..,x(nm) to small random values.
-
Minimize J(θ(1),..,θ(nu), x(1),..,x(nm)) using gradient descent (or an advanced optimization algorithm). E.g. for every j = 1,..,nu, i = 1,..,n</sub>m</sub>:
In this formula we’re no longer have the convention x0 = 1.
- For a user with parameter θ and a movie with (learned) features x, predict a star rating of θTx.
Low rank matrix factorization
Vectorization: low rank matrix factorization
Talk about the vectorization implementation of collaborative filtering algorithm and other things you can do with this algorithm.
Take all the data (move and user) into a matrix:
Predicted ratings matrix:
This predicted ratings matrix can be written as XΘT, where:
This algorithm called Low Rank Matrix Factorization, this term comes from the property that the matrix XΘT has a “low rank” property in linear algebra.
Finding related movies
For each product i, we learn a feature vector x(i). Although neither we know what exactly these features are, nor they’re easy to visualize; the algorithm will usually learn very meaningful features for capturing whatever are the most important properties of a movies that causes you to like or dislike it.
Problem: How to find movies j related to movie i?
If the distance between movie i and j is small (e.g. small), then they might be similar. (You can use KNN to find K most similar movies to movie i).
Implementation Detail: Mean Normalization
Let’s consider a user who hasn’t rated any movies.
According to the optimization formula, the only factor that affects to θ5 is regularization term. By minimizing this term, θ5 will end up equals 0.
So when we go to predict how user 5 would rate any movies, the result is 0. Then we can’t recommend this user any movie to watch.
The idea of mean normalization will help us fix this problem:
Then we learn θ and x from the new Y matrix.
For user j, on movie i predict: θ(j)x(i) + μi.
For user 5, on movie i predict: θ(5)x(i) + μi = μi.
This is reasonable because if a person hasn’t rated/watched any movie, we just predict for each of the movies the average rating that those movies got.
If we have some movie with no ratings, we can solve it similarly (subtract from the column instead of from row).