ML Coursera 8  Week 8: Unsupervised Learning
Posted on 27/10/2018, in Machine Learning.This note was first taken when I learnt the machine learning course on Coursera.
Lectures in this week: Lecture 13, Lecture 14.
error This note of Alex Holehouse is also very useful (need to be used alongside with my note).
Clustering
Unsupervised Learning
 This is the first unsupervised learnig.
 There is no label associated to it.
 There is X but not y.
 Unsupervised learning
 Try and determining structure in the data
 Clustering algorithm groups data together based on data features
Kmeans Algorithm
 The most popular and widely used algorithm.
 See here for the figure of idea

Algorithm: Cluster assignment step and Move centroid step
Optimization objective
 Kmeans also have an optimization objective (cost function like supervised learning)
 Knowing this is helpful because
 Help for debugging
 Help find better clustering

distortion cost function

Minimize to find out the $c^{(i)}$ and $\mu_{c^{(i)}}$
Random Initialization
 Also talk about how to avoid local optimal as well.
 $K$: number of clusters, $m$: the number of training examples (usually $K<m$)
 Random choose $K$ training examples as a initial cluster.
 If we choose the wrong examples, it may lead to a local optima
 To avoid that, dun random Kmeans multiple times (100 times, for example)
 Give 100 diff ways to K
 Choose the one make $J$ min
 Only apply when $K=2,...,10$
Choosing the number of clusters
 Elbow method: check the cost function J wrt number of clusters.
 Check the “elbow” (cái cùi chỏ) where the graph change direction much.
 EM isn't used very often!
 If $k=5$ have bigger J than $k=3$ then kmeans got stuck in a bad local minimum. You should try rerunning kmeans with multiple random initializations.
 Choose the number of clusters depend on later/downstream purpose (choose the size of Tshirt, 3 or 5 for example).
Dimensionality reduction
 Motivation: data compression + data visualization
 Househole’s note.
Data compression
 Speeds up algorithms + Reduces space used by data for them.
 If all points in 2D lay on 1 straight line, we can reduce them to 1D cases. The same for 3D to 2D
Data Visualization
 We want to reduce dimension to 2 or 3 so that we can visualize the data.
Principal component analysis (PCA)
PCA formulation
 Househole’s note.
 For the problem of dimensionality reduction the most commonly used algorithm is PCA
 For example, we want to find a line to translate 2D to 1D: all data point projected to this line should be small. The vector of this line is what we wanna find.
 You should normally do mean normalization and feature scaling (see again) on your data before PCA
 PCA is not linear regression!!: PCA lấy error là các projection trong khi LG lấy error là sự sai lệch (theo y)
 Below photo comapre pca (right) and linear regression (left).
 LR has y to compare while PCA has no y, every x has equal role.
PCA algorithm
 How to use PCA for yourself + and how to reduce dimension for your data.
 Data preprocessing: feature scaling/mean normalization
svd
in matlab/octave = “singular value decomposition” oreig
function with the same function (svd
is more numerically stable than eig) Compute covariance matrix $\Sigma$
 Compute Eigenvectors of matrix $\Sigma$

face Covariance matrix
In probability theory and statistics, a covariance matrix is a matrix whose element in the $i, j$ position is the covariance between the ith and jth elements of a random vector.
 After having U, just take first k column if we wanna k dimension from n

Algorithm:
Applying PCA
Reconstruction from compressed representation
From z back to x: 2D back to 1D for example,
Choosing the number of principal components
Choo k from 1 to the one get the fraction less than 0.001.
The numerator is small: we lose very little information in the dimensionality reduction, so when we decompress we regenerate the same data.
Advice for applying PCA
 Andrew uses PCA very often
 Defined by PCA only on the training set
 And then use those obtained parameters to the Cross validation data and test set
 A bad use of PCA: Use it to prevent overfitting –> Better to use regularization
 Always use normal approach (without PCA) to solve a problem, if you CANNOT DO MORE, then think about adding PCA.
Programming assignment
Kmeans Clustering
 The Kmeans algorithm is a method to automatically cluster similar data examples together.

Kmeans algorithm is as follows
% Initialize centroids centroids = kMeansInitCentroids(X, K); for iter = 1:iterations % Cluster assignment step: Assign each data point to the % closest centroid. idx(i) corresponds to cˆ(i), the index % of the centroid assigned to example i idx = findClosestCentroids(X, centroids); % Move centroid step: Compute means based on centroid % assignments centroids = computeMeans(X, idx, K); end
 Size: $X\in \mathbb{R}^{m\times n}$, $K\in \mathbb{R}$, centroids $\in \mathbb{R}^{K\times n}$, $c\in \mathbb{R}^{m\times 1}$ (
idx
is c) $m$ examples, $n$ features

File findClosestCentroids.m
for i=1:size(X,1) min = 100; % initial for k=1:K if norm(X(i,:)centroids(k,:)) < min min = norm(X(i,:)centroids(k,:)); min_idx = k; end end idx(i,1) = min_idx; end

Computing centroid means: file computeCentroids.m
for k=1:K idxCk = find(idx == k); % index of all X that corresponding to centroid k centroids(k,:) = sum(X(idxCk,:))/size(idxCk,1); end

File kMeansInitCentroids.m (Random initialization)
% Initialize the centroids to be random examples % Randomly reorder the indices of examples randidx = randperm(size(X, 1)); % Take the first K examples as centroids centroids = X(randidx(1:K), :);
Image compression with Kmeans
Check the guide in ex7.pdf, page 7.
In this exercise, you will use the Kmeans algorithm to select the 16 colors that will be used to represent the compressed image. Concretely, you will treat every pixel in the original image as a data example and use the Kmeans algorithm to find the 16 colors that best group (cluster) the pixels in the 3dimensional RGB space. Once you have computed the cluster centroids on the image, you will then use the 16 colors to replace the pixels in the original image.
PCA
PCA consists of two computational steps:
 First, you compute the covariance matrix of the data.
 Then, you use Octave/MATLAB’s
SVD
function to compute the eigenvectors $U_1, U_2,\ldots,U_n$.
Before using PCA, it is important to first normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.
File pca.m
Sigma = 1/m * (X'*X);
[U, S, ~] = svd(Sigma);
Dimensionality Reduction with PCA
File projectData.m:
Ureduce = U(:,1:K);
Z = X*Ureduce;
File recoverData.m
Ureduce = U(:,1:K);
X_rec = Z*Ureduce';
Face Image Dataset
In this part of the exercise, you will run PCA on face images to see how it can be used in practice for dimension reduction.
For example, if you were training a neural network to perform person recognition (gven a face image, predict the identitfy of the person), you can use the dimension reduced input of only a 100 dimensions instead of the original pixels.