Anh-Thi DINH

ML Coursera 7 - Week 7: Support Vector Machines

Posted on 20/10/2018, in Machine Learning.

This note was first taken when I learnt the machine learning course on Coursera.
Lectures in this week: Lecture 12.

error From this note, I see that this note of Alex Holehouse is really detailed so that I can use it for the future reference. I don’t have enough time for noting as previous note.

settings_backup_restore Go back to Week 6.
  • So far, we’ve seen a range of different algorithms
    • With supervised learning algorithms - performance is pretty similar. What matters more often is:
      • The amount of training data
      • Skill of applying algorithms
  • One final supervised learning algorithm that is widely used - support vector machine (SVM)
    • Compared to both logistic regression and neural networks, a SVM sometimes gives a cleaner way of learning non-linear functions
    • Later in the course we’ll do a survey of different supervised learning algorithms

Large margin classification

Optimization objective

  • Alternative view of logistic regression: we see the cost function is a function of $z=X\Theta$.
  • To build a SVM we must redefine our cost functions
    • Instead of a curved line create two straight lines (magenta) which acts as an approximation to the logistic regression $y = 1$ function
    • So here we define the two cost function terms for our SVM graphically

    Two new cost functions

    $$ \begin{align} cost_0 &= -\log(1-\dfrac{1}{1+e^{-z}}) \\ cost_1 &= -\log(\dfrac{1}{1+e^{-z}}) \end{align} $$
  • So we get (new SVM cost function)

  • We use another notation to minimize problem

  • Unlike logistic, $h_{\theta}(x)$ doesn’t give us a probability, but instead we get a direct prediction of 1 or 0 (as mentioned before)

Large margin intuition

  • Logistic regression only need $X\Theta\ge 0$ to get $h=1$ or $X\Theta <0$ to get $0$. SVM gives us a clearer way ($X\Theta \ge 1$ and $X\Theta <-1$ respectively).

    Support vector machine

  • Consider a case in which $C$ very huge, we need to choose $\Theta$ such that $A=0$ ($A$ in $CA+B$).
    • If $y1$, we need to find $\Theta$ such that $X\Theta \ge 1$
    • If $y=0$, find $\Theta$ such that $X\Theta <-1$
  • When $CA=0$, the minimization problem becomes,

    SVM 3

  • Mathematically, that black line has a larger minimum distance (margin) from any of the training examples

    Larger minimum distance

Mathematics behind large classification


Kernels I

  • That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms

  • We choose landmarks $l^{(1)}, l^{(2)}, \ldots$ and then using the similarity (kernel) between $x$ and each landmark $l^{(i)}$.

    $$ \begin{align} \text{similarity} = k(x,l^{(i)}) &= \exp\left( -\dfrac{\Vert x-l^{(i)}\Vert^2}{2\sigma^2} \right) \\ f_i &:= \text{similarity}(x,l^{(i)}). \end{align} $$
    • $\sigma$: standard deviation
    • $\sigma^2$: variance.
    • $\Vert \cdot \Vert$: Euclidean distance
    • There are many kernels, above def of similarity is** Gaussian kernel**.

    Kernel and similarity

    • We call $f$ landmark also!

Kernels II

  • Where do we get the landmarks from?
  • For each example place a landmark at exactly the same location
    • Given $m$ examples of $n$ features $(x^{(i)}, y^{(i)}$ for $i=1,m$.
    • Choose landmarks: $l^{(i)} = x^{(i)}$ where $i=1,m$.
    • We will build $n$ landmark $f^{(i)}$, each of them is built from
  • Note that $m$ input elements $x^{(i)}$ becomes $m+1$ landmarks $f$ ($f^{0} = 1$)
  • $\Theta$ now becomes $\Theta \in \mathbb{R}^{(m+1)\times 1}$.
  • $f\in \mathbb{R}^{(m+1)}$ also.
  • Predict $1$ if $\Theta^Tf \ge 0$
  • SVM learning algorithm

    $$ \min_{\Theta} C\sum_{i=1}^m \left( y^{(i)} cost_1 (\Theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\Theta^Tf^{(i)})\right) + \dfrac{1}{2}\sum_{j=1}^m\phi_j^2, \quad (n=m \text{ in this case}) $$
  • We minimize using $f$ as the feature vector instead of $x$
  • $m=n$ because number of features is the number of training data examples we have.
  • It’s really expensive because there may be a lot of features (= number of training examples). It's good to use shelf software to minimize this function instead. DON’T write your own software to do that!!
  • Variance vs Bias trade-off
    • Large $C$ ($\frac{1}{\lambda}$): low bias, high variance $\Rightarrow$ overfitting.
    • Small C gives a hypothesis of high bias low variance $\Rightarrow$ underfitting
    • Large $\sigma^2$: f features vary more smoothly - higher bias, lower variance
    • Small $\sigma^2$: f features vary unexpectedly - low bias, high variance

SVMs in practice

Using an SVM

  • Don’t write yourown codes to linearize the SVM, use already-writen library such as liblinear, libsvm, …
    • Choice of $C$
    • Choice of kernel (similarly functions)
      • No kernel (“linear kernel”, use $X\Theta$)
      • Gaussian kernel (above): need to choose $\sigma^2$
  • Do perform feature scaling before using Gaussian kernel
  • Not all similarity functions you develop are valid kernels $\Rightarrow$ Must satisfy **Merecer's Theorem** to make sure SVM packages’ optimizations run correctly, and do not diverge.
  • Polynomial kernel:
    • use when $x$ and $l$ are both strictly non-negative
    • People not use this much.
    • parameters: $const$ and $degree$
  • Other kernels: string kernel (input data using texts, string,…), chi-square kernel, histogram intersection kernel,…
  • Remember: choose whatever kernel performs best on cross-validation data


  • Many packages have built in multi-class classification packages
  • Otherwise use one-vs all method
  • Not a big issue

Logistic regression vs. SVM

  • If n (features) is large vs. m (training set)
    • Feature vector dimension is 10 000
    • Training set is 10 - 1000
    • Then use logistic regression or SVM with a linear kernel
  • If n is small and m is intermediate
    • n = 1 - 1000
    • m = 10 - 10 000
    • Gaussian kernel is good
  • If n is small and m is large
    • n = 1 - 1000
    • m = 50 000+
    • SVM will be slow to run with Gaussian kernel
    • In that case
      • Manually createMul(Dclass*classifica(on or add more features
      • Use *logistic Mul(Dclassclassifica(onregression of SVM with a linear kernel** Mul(Dclass*classifica(on
  • Logistic regressionMul(Dclass*classifica(on and SVM with a linear kernel are pretty similar (performance, works)
  • SVM has a convex optimization problem - so you get a
  • Neural network likely to work well for most of these settings, but may be slower to train.

Programming Assignment


  • A large $C$ parameter tells the SVM to try to classify all the examples correctly. $C$ plays a role similar to $\frac{1}{\lambda}$ , where $\lambda$ is the regularization parameter that we were using previously for logistic regression.

    settings_backup_restore See again Regularized logistic regression.

    Difference C for SVM $C=1$

    Difference C for SVM $C=100$

  • Most SVM software packages (including svmTrain.m) automatically add the extra feature $x_0 = 1$ for you and automatically take care of learning the intercept term $\theta_0$. So when passing your training data to the SVM software, there is no need to add this extra feature $x_0 = 1$ yourself.

SVM with Gaussian Kernels

  • To find non-linear decision boundaries with the SVM, we need to first implement a Gaussian kernel.
  • You can think of the Gaussian kernel as a similarity function that measures the “distance" between a pair of examples $(x^{(i)}, x^{(j)})$. The Gaussian kernel is also parameterized by a bandwidth parameter, $\sigma$, which determines how fast the similarity metric decreases (to 0) as the examples are further apart.
  • The Gaussian kernel function defined as

  • File gaussianKernel.m

    sim = exp( -(norm(x1-x2))^2/(2*sigma^2) );

Example Dataset 3

File dataset3Params.m

range = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
predictionErrMin = 100000; % initial

for i=1:size(range,2)
   for j=1:size(range,2)
       model= svmTrain(X, y, range(i), @(x1, x2) gaussianKernel(x1, x2, range(j))); 
       predictions = svmPredict(model, Xval);
       predictionErr = mean(double(predictions ~= yval));
       if predictionErr < predictionErrMin
           predictionErrMin = predictionErr;
           C = range(i);
           sigma = range(j);

Spam Classification

  • You need to convert each email into a feature vector $x\in \mathbb{R}^n$. The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.

Preprocessing Emails

  • One method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc.
    • we could replace each URL in the email with the unique string httpaddr to indicate that a URL was present.
  • Usually, we do:
    • Lower-casing: convert entire email to lowercase.
    • Stripping HTML: All HTML tags are removed from the emails.
    • Normalizing URLs: All URLs are replaced with the text httpaddr
    • Normalizing Email Addresses: with the text emailaddr.
    • Normalizing Numbers: number.
    • Normalizing Dollars: All dollar signs ($) are replaced with the text dollar.
    • Word Stemming: “discount”, “discounts”, “discounted” and “discounting” replace by discount
    • Removal of non-words: Non-words and punctuation have been removed. all tabs, spaces, newlines becomes 1-space character.

Vocabulary List

  • Our vocabulary list was selected by choosing all words which occur at least a 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
  • Given the vocabulary list, we can now map each word in the preprocessed emails (e.g., Figure 9) into a list of word indices that contains the index of the word in the vocabulary list.
    • If the word exists, you should add the index of the word into the word indices variable.
    • If the word does not exist, and is therefore not in the vocabulary, you can skip the word.
  • File processEmail.m

    for i=1:length(vocabList)
        if strcmp(str, vocabList{i})
            word_indices = [word_indices; i]; 

Extracting Features from Emails

  • You will now implement the feature extraction that converts each email into a vector in $\mathbb{R}^n$.
  • $n=$ number of words in vocabulary list.
  • $x_i\in {0,1}$ for an email corresponds to whether the i-th word in the dictionary occurs in the email.
  • File emailFeatures.m

keyboard_arrow_right Go to Week 8.