ML Coursera 7  Week 7: Support Vector Machines
Posted on 20/10/2018, in Machine Learning.This note was first taken when I learnt the machine learning course on Coursera.
Lectures in this week: Lecture 12.
error From this note, I see that this note of Alex Holehouse is really detailed so that I can use it for the future reference. I don’t have enough time for noting as previous note.
 So far, we’ve seen a range of different algorithms
 With supervised learning algorithms  performance is pretty similar. What matters more often is:
 The amount of training data
 Skill of applying algorithms
 With supervised learning algorithms  performance is pretty similar. What matters more often is:
 One final supervised learning algorithm that is widely used  support vector machine (SVM)
 Compared to both logistic regression and neural networks, a SVM sometimes gives a cleaner way of learning nonlinear functions
 Later in the course we’ll do a survey of different supervised learning algorithms
Large margin classification
Optimization objective
 Alternative view of logistic regression: we see the cost function is a function of $z=X\Theta$.
 To build a SVM we must redefine our cost functions
 Instead of a curved line create two straight lines (magenta) which acts as an approximation to the logistic regression $y = 1$ function
 So here we define the two cost function terms for our SVM graphically
$$ \begin{align} cost_0 &= \log(1\dfrac{1}{1+e^{z}}) \\ cost_1 &= \log(\dfrac{1}{1+e^{z}}) \end{align} $$ 
So we get (new SVM cost function)

We use another notation to minimize problem
 Unlike logistic, $h_{\theta}(x)$ doesn’t give us a probability, but instead we get a direct prediction of 1 or 0 (as mentioned before)
Large margin intuition

Logistic regression only need $X\Theta\ge 0$ to get $h=1$ or $X\Theta <0$ to get $0$. SVM gives us a clearer way ($X\Theta \ge 1$ and $X\Theta <1$ respectively).
 Consider a case in which $C$ very huge, we need to choose $\Theta$ such that $A=0$ ($A$ in $CA+B$).
 If $y1$, we need to find $\Theta$ such that $X\Theta \ge 1$
 If $y=0$, find $\Theta$ such that $X\Theta <1$

When $CA=0$, the minimization problem becomes,

Mathematically, that black line has a larger minimum distance (margin) from any of the training examples
Mathematics behind large classification
Kernels
Kernels I

That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms

We choose landmarks $l^{(1)}, l^{(2)}, \ldots$ and then using the similarity (kernel) between $x$ and each landmark $l^{(i)}$.
$$ \begin{align} \text{similarity} = k(x,l^{(i)}) &= \exp\left( \dfrac{\Vert xl^{(i)}\Vert^2}{2\sigma^2} \right) \\ f_i &:= \text{similarity}(x,l^{(i)}). \end{align} $$ $\sigma$: standard deviation
 $\sigma^2$: variance.
 $\Vert \cdot \Vert$: Euclidean distance
 There are many kernels, above def of similarity is** Gaussian kernel**.
 We call $f$ landmark also!
Kernels II
 Where do we get the landmarks from?
 For each example place a landmark at exactly the same location
 Given $m$ examples of $n$ features $(x^{(i)}, y^{(i)}$ for $i=1,m$.
 Choose landmarks: $l^{(i)} = x^{(i)}$ where $i=1,m$.
 We will build $n$ landmark $f^{(i)}$, each of them is built from
 Note that $m$ input elements $x^{(i)}$ becomes $m+1$ landmarks $f$ ($f^{0} = 1$)
 $\Theta$ now becomes $\Theta \in \mathbb{R}^{(m+1)\times 1}$.
 $f\in \mathbb{R}^{(m+1)}$ also.
 Predict $1$ if $\Theta^Tf \ge 0$

SVM learning algorithm
$$ \min_{\Theta} C\sum_{i=1}^m \left( y^{(i)} cost_1 (\Theta^Tf^{(i)}) + (1y^{(i)})cost_0(\Theta^Tf^{(i)})\right) + \dfrac{1}{2}\sum_{j=1}^m\phi_j^2, \quad (n=m \text{ in this case}) $$  We minimize using $f$ as the feature vector instead of $x$
 $m=n$ because number of features is the number of training data examples we have.
 It’s really expensive because there may be a lot of features (= number of training examples). It's good to use shelf software to minimize this function instead. DON’T write your own software to do that!!
 Variance vs Bias tradeoff
 Large $C$ ($\frac{1}{\lambda}$): low bias, high variance $\Rightarrow$ overfitting.
 Small C gives a hypothesis of high bias low variance $\Rightarrow$ underfitting
 Large $\sigma^2$: f features vary more smoothly  higher bias, lower variance
 Small $\sigma^2$: f features vary unexpectedly  low bias, high variance
SVMs in practice
Using an SVM
 Don’t write yourown codes to linearize the SVM, use alreadywriten library such as liblinear, libsvm, …
 Choice of $C$
 Choice of kernel (similarly functions)
 No kernel (“linear kernel”, use $X\Theta$)
 Gaussian kernel (above): need to choose $\sigma^2$
 Do perform feature scaling before using Gaussian kernel
 Not all similarity functions you develop are valid kernels $\Rightarrow$ Must satisfy **Merecer's Theorem** to make sure SVM packages’ optimizations run correctly, and do not diverge.
 Polynomial kernel:
 use when $x$ and $l$ are both strictly nonnegative
 People not use this much.
 parameters: $const$ and $degree$
 Other kernels: string kernel (input data using texts, string,…), chisquare kernel, histogram intersection kernel,…
 Remember: choose whatever kernel performs best on crossvalidation data
Mul(Dclass*classifica(on
 Many packages have built in multiclass classification packages
 Otherwise use onevs all method
 Not a big issue
Logistic regression vs. SVM
 If n (features) is large vs. m (training set)
 Feature vector dimension is 10 000
 Training set is 10  1000
 Then use logistic regression or SVM with a linear kernel
 If n is small and m is intermediate
 n = 1  1000
 m = 10  10 000
 Gaussian kernel is good
 If n is small and m is large
 n = 1  1000
 m = 50 000+
 SVM will be slow to run with Gaussian kernel
 In that case
 Manually createMul(Dclass*classifica(on or add more features
 Use *logistic Mul(Dclassclassifica(onregression of SVM with a linear kernel** Mul(Dclass*classifica(on
 Logistic regressionMul(Dclass*classifica(on and SVM with a linear kernel are pretty similar (performance, works)
 SVM has a convex optimization problem  so you get a
 Neural network likely to work well for most of these settings, but may be slower to train.
Programming Assignment
SVM

A large $C$ parameter tells the SVM to try to classify all the examples correctly. $C$ plays a role similar to $\frac{1}{\lambda}$ , where $\lambda$ is the regularization parameter that we were using previously for logistic regression.
settings_backup_restore See again Regularized logistic regression.$C=1$
$C=100$

Most SVM software packages (including svmTrain.m) automatically add the extra feature $x_0 = 1$ for you and automatically take care of learning the intercept term $\theta_0$. So when passing your training data to the SVM software, there is no need to add this extra feature $x_0 = 1$ yourself.
SVM with Gaussian Kernels
 To find nonlinear decision boundaries with the SVM, we need to first implement a Gaussian kernel.
 You can think of the Gaussian kernel as a similarity function that measures the “distance" between a pair of examples $(x^{(i)}, x^{(j)})$. The Gaussian kernel is also parameterized by a bandwidth parameter, $\sigma$, which determines how fast the similarity metric decreases (to 0) as the examples are further apart.

The Gaussian kernel function defined as

File gaussianKernel.m
sim = exp( (norm(x1x2))^2/(2*sigma^2) );
Example Dataset 3
File dataset3Params.m
range = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
predictionErrMin = 100000; % initial
for i=1:size(range,2)
for j=1:size(range,2)
model= svmTrain(X, y, range(i), @(x1, x2) gaussianKernel(x1, x2, range(j)));
predictions = svmPredict(model, Xval);
predictionErr = mean(double(predictions ~= yval));
if predictionErr < predictionErrMin
predictionErrMin = predictionErr;
C = range(i);
sigma = range(j);
end
end
end
Spam Classification
 You need to convert each email into a feature vector $x\in \mathbb{R}^n$. The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.
Preprocessing Emails
 One method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc.
 we could replace each URL in the email with the unique string
httpaddr
to indicate that a URL was present.
 we could replace each URL in the email with the unique string
 Usually, we do:
 Lowercasing: convert entire email to lowercase.
 Stripping HTML: All HTML tags are removed from the emails.
 Normalizing URLs: All URLs are replaced with the text
httpaddr
 Normalizing Email Addresses: with the text
emailaddr
.  Normalizing Numbers:
number
.  Normalizing Dollars: All dollar signs ($) are replaced with the text
dollar
.  Word Stemming: “discount”, “discounts”, “discounted” and “discounting” replace by
discount
 Removal of nonwords: Nonwords and punctuation have been removed. all tabs, spaces, newlines becomes 1space character.
Vocabulary List
 Our vocabulary list was selected by choosing all words which occur at least a 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
 Given the vocabulary list, we can now map each word in the preprocessed emails (e.g., Figure 9) into a list of word indices that contains the index of the word in the vocabulary list.
 If the word exists, you should add the index of the word into the word indices variable.
 If the word does not exist, and is therefore not in the vocabulary, you can skip the word.

File processEmail.m
for i=1:length(vocabList) if strcmp(str, vocabList{i}) word_indices = [word_indices; i]; end end
Extracting Features from Emails
 You will now implement the feature extraction that converts each email into a vector in $\mathbb{R}^n$.
 $n=$ number of words in vocabulary list.
 $x_i\in {0,1}$ for an email corresponds to whether the ith word in the dictionary occurs in the email.

File emailFeatures.m
x(word_indices)=1;