Week 8 notes: Unsupervised Learning
2017-09-13
Introduction
No notes.
Clustering
Introduction
No notes.
KMeans Algorithm
K-means is by far the most popular and most widely used clustering algorithm.
K-means is an iterative algorithm that alternates between two steps:
 Cluster assignment step.
 Move centroid step.
Input:
 K (number of clusters).
 Training set {x^{(1)}, .., x^{(m)}}; each x^{(i)} is an n-dimensional vector.
K-means algorithm
Randomly initialize K cluster centroids μ_{1}, .., μ_{K}.
Repeat {
 for i = 1 to m (Cluster assignment step)
 c^{(i)} = index (from 1 to K) of cluster centroid closest to x^{(i)}.
 for k = 1 to K (Move centroid step)
 μ_{k} = average (mean) of points assigned to cluster k.
}
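The loop above can be sketched in NumPy. This is a minimal illustration of the pseudocode, not course code; the function name and the empty-cluster re-seeding are my own choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """One run of K-means on X (m x n): returns assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct training examples as centroids.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Cluster assignment step: c[i] = index of centroid closest to x[i].
        c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2).argmin(axis=1)
        # Move centroid step: mu[k] = mean of the points assigned to cluster k
        # (re-seed from a random example if a cluster ends up empty).
        mu = np.array([X[c == k].mean(axis=0) if np.any(c == k)
                       else X[rng.integers(len(X))] for k in range(K)])
    return c, mu
```

The pairwise-distance line broadcasts X (m, 1, n) against mu (1, K, n) to get an (m, K) distance matrix in one shot.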
Optimization Objective
μ_{c^{(i)}} = cluster centroid of the cluster to which example x^{(i)} has been assigned.
Optimization objective: minimize the cost function J(c^{(1)}, .., c^{(m)}, μ_{1}, .., μ_{K}) = (1/m) Σ_{i=1}^{m} ||x^{(i)} - μ_{c^{(i)}}||^{2}.
Cluster assignment step: minimize J w.r.t c^{(i)}.
Move centroid step: minimize J w.r.t μ_{k}.
The cost function J must never increase over the iterations; if it ever goes up, there is a bug.
The cost function J here is also called the distortion function.
Random Initialization
Helps K-means avoid local optima.
Recommended way
Should have K < m.
Randomly pick K training examples as K cluster centroids.
Depending on the random initialization, K-means can end up at different solutions; in particular, it can get stuck at local optima.
Local optima
The term local optima refers to local optima of the cost function J.
If you want to increase the chance that K-means finds the best possible clustering, try multiple random initializations:
for i = 1 to 100 {
 Randomly initialize K-means.
 Run K-means; get c^{(1)}, .., c^{(m)}, μ_{1}, .., μ_{K}.
 Compute cost function J (distortion).
}
Pick the clustering that gave the lowest cost J (out of the 100 runs).
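This best-of-N procedure can be sketched as follows (my own NumPy illustration; `kmeans_once` repeats a compact single K-means run so the snippet stands alone):

```python
import numpy as np

def distortion(X, c, mu):
    # J(c, mu) = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2
    return float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))

def kmeans_once(X, K, rng, n_iters=50):
    # One K-means run with random-example initialization (minimal version).
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        mu = np.array([X[c == k].mean(axis=0) if np.any(c == k)
                       else X[rng.integers(len(X))] for k in range(K)])
    return c, mu

def kmeans_best(X, K, n_runs=100, seed=0):
    """Run K-means n_runs times; keep the clustering with the lowest J."""
    rng = np.random.default_rng(seed)
    best_J, best = np.inf, None
    for _ in range(n_runs):
        c, mu = kmeans_once(X, K, rng)
        J = distortion(X, c, mu)
        if J < best_J:
            best_J, best = J, (c, mu)
    return best_J, best
```

Each run uses a fresh random initialization drawn from the shared generator, and only the lowest-distortion clustering is kept.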
Choosing the Number of Clusters
Elbow method: run K-means with different values of K, compute the corresponding cost J for each, plot (K, J), then choose K at the elbow point.
Shortcoming: in practice the curve often doesn't have a clear elbow shape, so it is hard to tell where the elbow is.
Choosing the value of K
Ask: for what later purpose are you running K-means?
Choose the value of K that best serves that downstream purpose.
Motivation
Motivation 1: Data Compression
No notes.
Motivation 2: Visualization
No notes.
Principal Component Analysis
Principal Component Analysis Problem Formulation
Reduce from n dimensions to k dimensions: find k vectors u^{(1)}, u^{(2)}, .., u^{(k)} onto which to project the data, so as to minimize the projection error.
Principal Component Analysis Algorithm
Data preprocessing
Training set: x^{(1)}, .., x^{(m)}.
Preprocessing (feature scaling/mean normalization)
Replace each x_{j}^{(i)} with x_{j}^{(i)} - μ_{j}, where μ_{j} = (1/m) Σ_{i=1}^{m} x_{j}^{(i)}.
If different features on different scales (e.g., x_{1} = size of house, x_{2} = number of bedrooms), scale features to have comparable range of values.
PCA Algorithm
Reduce data from n dimensions to k dimensions.
Compute the covariance matrix: Σ = (1/m) Σ_{i=1}^{m} (x^{(i)})(x^{(i)})^{T}.
Σ is going to be an n×n matrix.
Compute the eigenvectors of matrix Σ via SVD (Singular Value Decomposition): [U, S, V] = svd(Σ).
U is an n×n matrix whose ith column is the vector u^{(i)}. To reduce the data from n dimensions to k dimensions, take the first k columns u^{(1)}, .., u^{(k)}: these are the k directions onto which we project the data. Call this n×k matrix U_{reduce}.
Project x (n-dimensional) onto U_{reduce} to get z (k-dimensional): z = U_{reduce}^{T} x.
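Putting the preprocessing, covariance, SVD, and projection steps together in NumPy (a minimal sketch; the function and variable names are mine):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD on mean-normalized data X (m x n). Returns the mean,
    U_reduce (n x k), the singular values S, and the projections Z (m x k)."""
    mu = X.mean(axis=0)
    Xn = X - mu                          # mean normalization
    Sigma = (Xn.T @ Xn) / len(X)         # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)       # columns of U are u^(1), .., u^(n)
    U_reduce = U[:, :k]                  # first k principal directions
    Z = Xn @ U_reduce                    # z^(i) = U_reduce^T x^(i)
    return mu, U_reduce, S, Z
```

Note that the projection is done on the mean-normalized data; the stored mean is needed later to map reconstructions back to the original coordinates.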
Applying PCA
Reconstruction from Compressed Representation
Given the point z in k dimensions, how can we go back to x in n dimensions? Use x_{approx} = U_{reduce} z, which approximates the original x.
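A tiny numeric sketch of this inverse mapping, with toy numbers of my own (U_{reduce} here projects 2-D data onto the x1 axis, so its single column is orthonormal):

```python
import numpy as np

U_reduce = np.array([[1.0],
                     [0.0]])          # n x k (n = 2, k = 1), orthonormal column
x = np.array([3.0, 0.2])              # original point in n = 2 dimensions
z = U_reduce.T @ x                    # compression:    z = U_reduce^T x
x_approx = U_reduce @ z               # reconstruction: x_approx = U_reduce z
# x_approx lands back in n dimensions; the off-axis component (0.2) is lost.
```

The reconstruction is exact only for points that already lie in the subspace spanned by U_{reduce}'s columns; everything orthogonal to it is discarded.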
Choosing the Number of Principal Components
k = number of principal components that we’ve retained.
Choosing k
Average squared projection error: (1/m) Σ_{i=1}^{m} ||x^{(i)} - x_{approx}^{(i)}||^{2}; this is what PCA minimizes.
Total variation in the data: (1/m) Σ_{i=1}^{m} ||x^{(i)}||^{2}; on average, how far are my training examples from the origin?
Typically, choose k to be the smallest value so that:
[(1/m) Σ_{i=1}^{m} ||x^{(i)} - x_{approx}^{(i)}||^{2}] / [(1/m) Σ_{i=1}^{m} ||x^{(i)}||^{2}] ≤ 0.01 (*)
It means 99% of the variance is retained.
Algorithm
Try PCA with k = 1
Compute U_{reduce}, z^{(1)}, .., z^{(m)}, x^{(1)}_{approx}, ..,x^{(m)}_{approx}.
Check whether the ratio in (*) is ≤ 0.01. If not, increase k and try again until a satisfying value of k is found.
But doing this can be expensive, because for each k we have to recompute everything. There is a cheaper way:
S is a diagonal matrix returned by svd, and for a given k, check whether:
1 - (Σ_{i=1}^{k} S_{ii}) / (Σ_{i=1}^{n} S_{ii}) ≤ 0.01, i.e., (Σ_{i=1}^{k} S_{ii}) / (Σ_{i=1}^{n} S_{ii}) ≥ 0.99.
Again, it means 99% of the variance is retained.
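This check can be coded directly from the diagonal of S, since the SVD is computed only once (my own sketch; the function name is an assumption):

```python
import numpy as np

def smallest_k(S, retain=0.99):
    """Smallest k with sum(S[:k]) / sum(S) >= retain, where S holds the
    singular values of the covariance matrix in decreasing order."""
    frac = np.cumsum(S) / np.sum(S)          # variance retained for k = 1..n
    return int(np.searchsorted(frac, retain) + 1)
```

`searchsorted` finds the first index where the retained-variance fraction reaches the threshold, so no loop over candidate values of k is needed.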
Advice for applying PCA
Supervised learning speedup
Input: (x^{(1)}, y^{(1)}), ..,(x^{(m)}, y^{(m)}).
The inputs x are very high-dimensional, say n = 10000.
Extract input:

Unlabeled dataset: x^{(1)}, ..,x^{(m)} in R^{10000}.

Then we apply PCA and get z^{(1)}, ..,z^{(m)} in R^{1000}.
New training set: (z^{(1)}, y^{(1)}), ..,(z^{(m)}, y^{(m)}).
Then we can pass this into learning algorithms: linear regression, neural network…
Note: Mapping x^{(i)} to z^{(i)} should be defined by running PCA only on the training set. This mapping can be applied as well to the examples x_{cv}^{(i)} and x_{test}^{(i)} in the cross validation and test sets.
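A sketch of that discipline (my own NumPy illustration): the mean and U_{reduce} are learned from the training set only, then the same mapping is reused on the other splits.

```python
import numpy as np

def fit_pca(X_train, k):
    # Learn the mapping (mean + U_reduce) from the training set only.
    mu = X_train.mean(axis=0)
    Xn = X_train - mu
    Sigma = (Xn.T @ Xn) / len(X_train)
    U, _, _ = np.linalg.svd(Sigma)
    return mu, U[:, :k]

def apply_pca(X, mu, U_reduce):
    # Apply the same training-set mapping to any split (train, cv, or test).
    return (X - mu) @ U_reduce

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 10))   # stand-ins for the labeled inputs
X_cv = rng.normal(size=(20, 10))
mu, U_reduce = fit_pca(X_train, k=3)
Z_train = apply_pca(X_train, mu, U_reduce)
Z_cv = apply_pca(X_cv, mu, U_reduce)
```

Fitting PCA on the cross-validation or test examples would leak information from those splits into the feature mapping, which is why the fit and apply steps are kept separate.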
Application of PCA
Compression
 Reduce memory/disk needed to store data.
 Speed up learning algorithms.
Visualization
Bad use of PCA: to prevent overfitting
Use z instead of x to reduce the number of features to k < n.
Thus, fewer features, less likely to overfit.
This might work OK, but it is not a great way to address overfitting (use regularization instead): PCA doesn't look at the labels y, so it might throw away valuable information.
PCA is sometimes used where it shouldn’t be
People often design an ML system with PCA baked in right from the beginning. In fact, one should first ask: how would the whole thing do without PCA?
Before implementing PCA, first try running whatever you want to do with the original/raw data x. Only if that doesn't do what you want should you implement PCA and consider using z.