Anh-Thi DINH

ML Coursera 8 - Week 8: Unsupervised Learning

Posted on 27/10/2018, in Machine Learning.

This note was first taken when I learnt the machine learning course on Coursera.
Lectures in this week: Lecture 13, Lecture 14.

error This note of Alex Holehouse is also very useful (need to be used alongside with my note).

settings_backup_restore Go back to Week 7.


Unsupervised Learning

  • This is the first unsupervised learnig.
  • There is no label associated to it.
  • There is X but not y.
  • Unsupervised learning
    • Try and determining structure in the data
    • Clustering algorithm groups data together based on data features

K-means Algorithm

  • The most popular and widely used algorithm.
  • See here for the figure of idea
  • Algorithm: Cluster assignment step and Move centroid step

    Clustering algorithm 1

Optimization objective

  • K-means also have an optimization objective (cost function like supervised learning)
  • Knowing this is helpful because
    • Help for debugging
    • Help find better clustering
  • distortion cost function

    Distortion cost function 1

  • Minimize to find out the $c^{(i)}$ and $\mu_{c^{(i)}}$

    Clustering algorithm 2

Random Initialization

  • Also talk about how to avoid local optimal as well.
  • $K$: number of clusters, $m$: the number of training examples (usually $K<m$)
  • Random choose $K$ training examples as a initial cluster.
    • If we choose the wrong examples, it may lead to a local optima
    • To avoid that, dun random K-means multiple times (100 times, for example)
      • Give 100 diff ways to K
      • Choose the one make $J$ min
      • Only apply when $K=2,...,10$

Choosing the number of clusters

  • Elbow method: check the cost function J wrt number of clusters.
    • Check the “elbow” (cái cùi chỏ) where the graph change direction much.
    • EM isn't used very often!

    Elbow method 1

  • If $k=5$ have bigger J than $k=3$ then k-means got stuck in a bad local minimum. You should try re-running k-means with multiple random initializations.
  • Choose the number of clusters depend on later/downstream purpose (choose the size of T-shirt, 3 or 5 for example).

Dimensionality reduction

Data compression

  • Speeds up algorithms + Reduces space used by data for them.
    • If all points in 2D lay on 1 straight line, we can reduce them to 1D cases. The same for 3D to 2D

Data compression

Data Visualization

  • We want to reduce dimension to 2 or 3 so that we can visualize the data.

Principal component analysis (PCA)

PCA formulation

  • Househole’s note.
  • For the problem of dimensionality reduction the most commonly used algorithm is PCA
  • For example, we want to find a line to translate 2D to 1D: all data point projected to this line should be small. The vector of this line is what we wanna find.
  • You should normally do mean normalization and feature scaling (see again) on your data before PCA
  • PCA is not linear regression!!: PCA lấy error là các projection trong khi LG lấy error là sự sai lệch (theo y)
    • Below photo comapre pca (right) and linear regression (left).
    • LR has y to compare while PCA has no y, every x has equal role.

    PCA vs LR

PCA algorithm

  • How to use PCA for yourself + and how to reduce dimension for your data.
  • Data preprocessing: feature scaling/mean normalization
  • svd in matlab/octave = “singular value decomposition” or eig function with the same function (svd is more numerically stable than eig)
    • Compute covariance matrix $\Sigma$
    • Compute Eigenvectors of matrix $\Sigma$

    PCA algorithm

    • face Covariance matrix

      In probability theory and statistics, a covariance matrix is a matrix whose element in the $i, j$ position is the covariance between the i-th and j-th elements of a random vector.

    • After having U, just take first k column if we wanna k dimension from n

    PCA algorithm

  • Algorithm:

    PCA algorithm

Applying PCA

Reconstruction from compressed representation

From z back to x: 2D back to 1D for example,

PCA Reconstruction

Choosing the number of principal components

PCA choosing k

Choo k from 1 to the one get the fraction less than 0.001.

PCA choosing k

The numerator is small: we lose very little information in the dimensionality reduction, so when we decompress we regenerate the same data.

Advice for applying PCA

  • Andrew uses PCA very often
  • Defined by PCA only on the training set
    • And then use those obtained parameters to the Cross validation data and test set
  • A bad use of PCA: Use it to prevent over-fitting –> Better to use regularization
  • Always use normal approach (without PCA) to solve a problem, if you CANNOT DO MORE, then think about adding PCA.

Programming assignment

keyboard_arrow_right File ex7.pdf.

K-means Clustering

  • The K-means algorithm is a method to automatically cluster similar data examples together.
  • K-means algorithm is as follows

    % Initialize centroids
    centroids = kMeansInitCentroids(X, K);
    for iter = 1:iterations
      % Cluster assignment step: Assign each data point to the
      % closest centroid. idx(i) corresponds to cˆ(i), the index
      % of the centroid assigned to example i
      idx = findClosestCentroids(X, centroids);
      % Move centroid step: Compute means based on centroid
      % assignments
      centroids = computeMeans(X, idx, K);
  • Size: $X\in \mathbb{R}^{m\times n}$, $K\in \mathbb{R}$, centroids $\in \mathbb{R}^{K\times n}$, $c\in \mathbb{R}^{m\times 1}$ (idx is c)
    • $m$ examples, $n$ features
  • File findClosestCentroids.m

    for i=1:size(X,1)
          min = 100; % initial
          for k=1:K
              if norm(X(i,:)-centroids(k,:)) < min
                      min = norm(X(i,:)-centroids(k,:));
                      min_idx = k;
          idx(i,1) = min_idx;
  • Computing centroid means: file computeCentroids.m

    for k=1:K
          idxCk = find(idx == k); % index of all X that corresponding to centroid k
          centroids(k,:) = sum(X(idxCk,:))/size(idxCk,1);
  • File kMeansInitCentroids.m (Random initialization)

    % Initialize the centroids to be random examples
    % Randomly reorder the indices of examples
    randidx = randperm(size(X, 1));
    % Take the first K examples as centroids
    centroids = X(randidx(1:K), :);

Image compression with K-means

Check the guide in ex7.pdf, page 7.

In this exercise, you will use the K-means algorithm to select the 16 colors that will be used to represent the compressed image. Concretely, you will treat every pixel in the original image as a data example and use the K-means algorithm to find the 16 colors that best group (cluster) the pixels in the 3-dimensional RGB space. Once you have computed the cluster centroids on the image, you will then use the 16 colors to replace the pixels in the original image.


PCA consists of two computational steps:

  • First, you compute the covariance matrix of the data.
  • Then, you use Octave/MATLAB’s SVD function to compute the eigenvectors $U_1, U_2,\ldots,U_n$.

Before using PCA, it is important to first normalize the data by subtracting the mean value of each feature from the dataset, and scaling each dimension so that they are in the same range.

File pca.m

Sigma = 1/m * (X'*X);
[U, S, ~] = svd(Sigma);

Dimensionality Reduction with PCA

File projectData.m:

Ureduce = U(:,1:K);
Z = X*Ureduce;

File recoverData.m

Ureduce = U(:,1:K);
X_rec = Z*Ureduce';

Face Image Dataset

In this part of the exercise, you will run PCA on face images to see how it can be used in practice for dimension reduction.

For example, if you were training a neural network to perform person recognition (gven a face image, predict the identitfy of the person), you can use the dimension reduced input of only a 100 dimensions instead of the original pixels.

keyboard_arrow_right Go to Week 9.