Week 9 notes: Anomaly Detection
20170914Introduction
No notes.
Density Estimation
Problem Motivation
Build model p(x) from data.
If p(x_{test}) < ɛ, it’s an anomaly.
Gaussian Distribution
Gaussian (Normal) Distribution
Say , if x is a distributed Gaussian with mean μ, standard deviation σ, and variance σ^{2}; then the probability density function (hàm mật độ xác suất) is:
Parameter estimation
Given the dataset x^{(1)}, ..,x^{(m)}; where x^{(i)} is a distributed Gaussian. The task is to find μ and σ^{2}:
In machine learning people tend to lean 1/m formula but in practice whether it is 1/m or 1/(m1) it makes essentially no difference assuming m is reasonably large.
Algorithm
Density estimation
Traning set: x^{(1)},…,x^{(m)}.
Each sample is
Anomaly detection algorithm
 Compute μ_{1},..,μ_{n}; σ_{1},..,σ_{n}.
 Given new example x, compute p(x). Anomaly if p(x) < ɛ.
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
The important of realnumber evaluation
When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm (a number to compare for instance).
Assume we have some labeled data, of anomalous and nonanomalous examples. (y = 0 if normal, y = 1 if anomalous).
Training set x^{(i)} (unlabeled).
Cross validation set (x_{cv}^{(i)}, y_{cv}^{(i)}).
Test set (x_{test}^{(i)}, y_{test}^{(i)}).
Aircraft engines motivating example
1000 good engines (normal).
20 flawed engines (anomalous).
We split the data into:

Training set: 6000 (60%) good engines (y = 0, but we assume they’re unlabeled).

CV: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).

Test: 2000 (20%) good engines (y = 0), 10 anomalous (y = 1).
We work on the training set, compute μ_{1},..,μ_{n}; σ_{1},..,σ_{n} and the model p(x).
Algorithm evaluation
Fit model p(x) on training set (the vast majority of them are normal).
On a cross validation/test example x, predict:
 y = 1 if p(x) < ε (anomaly).
 y = 0 if p(x) > ε (normal).
and see how often it gets the label right.
Classification accuracy is not a good way to measure the algorithm’s performance, because of skewed classes (so an algorithm that always predicts y = 0 will have high accuracy). Instead, use these possible evaluation metrics:
 True positive, false positive, false negative, true negative.
 Precision/Recall.
 F_{1}score.
Can also use cross validation set to choose parameter ε: try many different values of ε and pick the one that minimizes (say) F_{1}score, or otherwise does well on your cv set.
Anomaly Detection vs. Supervised Learning
Problem: If we have the label data (we know some of them are anomalies, some are not anomalies), why don’t we just use supervised learning?
Describe when to use anomaly detection versus supervised learning
Anomaly detection:
 Very small number of positive examples (y = 1), (020 is common), large number of negatives (y = 0) examples.
 Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
Supervised learning:
 Large number of positive and negative examples.
 Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.
Choosing what features to use
Nongaussian features
If we plot the Gaussian distribution p(x_{i}; μ_{i}, σ_{i}^{2}) of the feature x_{i} and the graph is not a bell shaped curve, we can play with different transformations of the data in order to make it more Gaussian.
The algorithm will usually work okay even if we don’t, but if we use these transformations to make the data more gaussian, it might work a bit better.
For example, in the second feature, if we take the log(x)
transformations, we can make it become Gaussian.
Error analysis for anomaly detection
Want p(x) large for normal examples x.
Want p(x) small for anomalous examples x.
What if: p(x) is comparable (say, both large) for normal and anomalous examples.
We need to look at that example and figure out what went wrong, and see if we can come up with a new feature x_{2} that helps to distinguish between this bad example, compared to the rest of normal examples.
Monitoring computers in a data center
Choose feature that might take on unusually large or small values in the event of an anomaly.
 x_{1} = memory use of computer.
 x_{2} = number of disk accesses/sec.
 x_{3} = CPU load.
 x_{4} = network traffic.
The new features may be the combination of available features:
 x_{5} = x_{3} / x_{4}.
 x_{6} = x_{3}^{2} / x_{4}.
Predicting Movie Ratings
Problem Formulation
Example: Predicting movie ratings
User rates movies using zero to five starts.
Some notations:
 n_{u}: no. users
 n_{m}: no. movies
 r(i, j) = 1 if user j has rated movie i.
 y^{(i, j)} = rating given by user j to move i (defined only if r(i, j) = 1).
Problem: predict the values of the question marks (to recommend it to the user).
Content Based Recommendations
Contentbased recommender systems
Let’s say each movies have two features:
 x_{1} = degree of romantic.
 x_{2} = degree of action.
Now each movie can be represented by a feature vector, for example the first movie “Love at last”: x^{(1)}=[1 0.9 0]^{T} (including bias).
For each user j, learn a parameter θ^{(j)}. Predict user j as rating movie i with (θ^{(j)})^{T}x^{(i)} stars.
We denote m^{(j)} = no. of movies rated by user j.
To learn θ^{(j)}, this is actually a linear regression problem:
If want, we can also add the regularization term (ignore bias term of θ course); and we can also eliminate the constant m^{(j)} because it doesn’t affect the optimization objective.
To learn θ^{(1)},..,θ^{(nu)}:
And the gradient descent update:
It maybe very difficult to get the features for the movies/items. In the next lesson, we will talk about an approach that does not assume that we have these features.
Collaborative Filtering
Collaborative Filtering
Let consider the problem when we don’t know the values of the movies’ features. (e.g. x_{1} and x_{2} are unknown).
Further more, now the users will tell us how much they like romantic and action movies. For example θ^{(Alice)} = θ^{(1)} = [0 5 0]^{T}, it means that Alice likes romantics movies (θ_{1}^{(1)} = 5) and doesn’t really like action movies (θ_{2}^{(1)} = 0).
From these figures, it becomes possible to try to infer what are the values of x_{1} and x_{2} for each movie.
For example; the movie Love at last is liked by Alice, Bob and hated by Carol, Dave. Since Alice, Bob like romantic movies; Carol, Dave like action movies; we can reasonably conclude that this is a romantic movie, and x^{(1)} might be [1 1.0 0.0]^{T}.
Mathematically, what we’re really asking is what feature vector should x^{(1)} be so that:
 (θ^{(1)})^{T}x^{(1)} ≈ 5.
 (θ^{(2)})^{T}x^{(1)} ≈ 5.
 (θ^{(3)})^{T}x^{(1)} ≈ 0.
 (θ^{(4)})^{T}x^{(1)} ≈ 0.
Optimization algorithm
Given θ^{(1)},..,θ^{(nu)}, to learn x^{(i)}:
Given θ^{(1)},..,θ^{(nu)}, to learn x^{(1)},..,x^{(nm)}:
Summary
Given x^{(1)},..,x^{(nm)} (and movie ratings), can estimate θ^{(1)},..,θ^{(nu)}.
Given θ^{(1)},..,θ^{(nu)}, can estimate x^{(1)},..,x^{(nm)}.
Chicken and egg problem
Initially, we can random some value for θ, then use it to learn x, then use x to learn better θ, then use θ to learn better x, and so on… This actually works and if you do this, this will actually cause your algorithm to converge to a reasonable set of features for your movies and a reasonable set of parameter for your different users.
In the next lesson we will improve this algorithm and make it quite a bit more computationally efficient.
Collaborative Filtering Algorithm
There’s a more efficient algorithm that doesn’t need to go back and forth, but can solve for θ and x simultaneously.
Put (1) and (2) together:
If we hold the x constant, and minimize J with the respect to θ, it becomes (1) problem. If we do the opposite, it becomes (2) problem.
We can get away with the convention x_{0} = 1, thus no need for θ_{0}. x and θ are now both n dimension.
Collaborative filtering algorithm

Initialize θ^{(1)},..,θ^{(nu)}, x^{(1)},..,x^{(nm)} to small random values.

Minimize J(θ^{(1)},..,θ^{(nu)}, x^{(1)},..,x^{(nm)}) using gradient descent (or an advanced optimization algorithm). E.g. for every j = 1,..,n_{u}, i = 1,..,n</sub>m</sub>:
In this formula we’re no longer have the convention x_{0} = 1.
 For a user with parameter θ and a movie with (learned) features x, predict a star rating of θ^{T}x.
Low rank matrix factorization
Vectorization: low rank matrix factorization
Talk about the vectorization implementation of collaborative filtering algorithm and other things you can do with this algorithm.
Take all the data (move and user) into a matrix:
Predicted ratings matrix:
This predicted ratings matrix can be written as XΘ^{T}, where:
This algorithm called Low Rank Matrix Factorization, this term comes from the property that the matrix XΘ^{T} has a “low rank” property in linear algebra.
Finding related movies
For each product i, we learn a feature vector x^{(i)}. Although neither we know what exactly these features are, nor they’re easy to visualize; the algorithm will usually learn very meaningful features for capturing whatever are the most important properties of a movies that causes you to like or dislike it.
Problem: How to find movies j related to movie i?
If the distance between movie i and j is small (e.g. small), then they might be similar. (You can use KNN to find K most similar movies to movie i).
Implementation Detail: Mean Normalization
Let’s consider a user who hasn’t rated any movies.
According to the optimization formula, the only factor that affects to θ^{5} is regularization term. By minimizing this term, θ^{5} will end up equals 0.
So when we go to predict how user 5 would rate any movies, the result is 0. Then we can’t recommend this user any movie to watch.
The idea of mean normalization will help us fix this problem:
Then we learn θ and x from the new Y matrix.
For user j, on movie i predict: θ^{(j)}x^{(i)} + μ_{i}.
For user 5, on movie i predict: θ^{(5)}x^{(i)} + μ_{i} = μ_{i}.
This is reasonable because if a person hasn’t rated/watched any movie, we just predict for each of the movies the average rating that those movies got.
If we have some movie with no ratings, we can solve it similarly (subtract from the column instead of from row).