### ML Coursera 7 - Week 7: Support Vector Machines

Posted on 20/10/2018, in Machine Learning.

This note was first taken while I was taking the machine learning course on Coursera.
Lectures in this week: Lecture 12.

While writing this note, I found that Alex Holehouse's notes are really detailed, so I can use them for future reference. I don't have enough time to take notes as detailed as in previous weeks.

• So far, we’ve seen a range of different algorithms
• With supervised learning algorithms - performance is pretty similar. What matters more often is:
• The amount of training data
• Skill of applying algorithms
• One final supervised learning algorithm that is widely used - support vector machine (SVM)
• Compared to both logistic regression and neural networks, a SVM sometimes gives a cleaner way of learning non-linear functions
• Later in the course we’ll do a survey of different supervised learning algorithms

## Large margin classification

### Optimization objective

• Alternative view of logistic regression: we see the cost function is a function of $z=X\Theta$.
• To build a SVM we must redefine our cost functions
• Instead of a curved line, we create two straight lines (magenta) which act as an approximation to the logistic regression cost
• So here we define the two cost function terms for our SVM graphically, as piecewise-linear approximations of the logistic terms

\begin{align} cost_0 &\approx -\log\left(1-\dfrac{1}{1+e^{-z}}\right) \\ cost_1 &\approx -\log\left(\dfrac{1}{1+e^{-z}}\right) \end{align}
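As a sketch, the two straight-line costs can be written with the standard hinge form (slope 1, hinge points at $z = \pm 1$; the lecture leaves the exact slope unspecified, so this is one common choice):

```python
def cost1(z):
    """SVM cost for y = 1: zero once z >= 1, grows linearly below that."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """SVM cost for y = 0: zero once z <= -1, grows linearly above that."""
    return max(0.0, 1.0 + z)
```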
• So we get the new SVM cost function

$$\min_{\Theta}\ C\sum_{i=1}^m \left( y^{(i)} cost_1(\Theta^Tx^{(i)}) + (1-y^{(i)})\, cost_0(\Theta^Tx^{(i)})\right) + \dfrac{1}{2}\sum_{j=1}^n \Theta_j^2$$

• We use another notation for the minimization problem: instead of the logistic form $A + \lambda B$, the SVM convention is $CA + B$, where $C$ plays the role of $\frac{1}{\lambda}$

• Unlike logistic, $h_{\theta}(x)$ doesn’t give us a probability, but instead we get a direct prediction of 1 or 0 (as mentioned before)

### Large margin intuition

• Logistic regression only needs $X\Theta\ge 0$ to get $h=1$ or $X\Theta <0$ to get $h=0$. The SVM asks for a clearer margin ($X\Theta \ge 1$ and $X\Theta \le -1$ respectively).

• Consider a case in which $C$ is very large: we then need to choose $\Theta$ such that $A=0$ ($A$ being the first term in $CA+B$).
• If $y^{(i)}=1$, we need to find $\Theta$ such that $\Theta^Tx^{(i)} \ge 1$
• If $y^{(i)}=0$, find $\Theta$ such that $\Theta^Tx^{(i)} \le -1$
• When $CA=0$, the minimization problem becomes

$$\min_{\Theta}\ \dfrac{1}{2}\sum_{j=1}^{n}\Theta_j^2 \quad \text{s.t.} \quad \Theta^Tx^{(i)} \ge 1 \text{ if } y^{(i)}=1, \qquad \Theta^Tx^{(i)} \le -1 \text{ if } y^{(i)}=0$$

• Mathematically, the boundary the SVM chooses (the black line in the lecture figure) has a larger minimum distance (margin) from any of the training examples

## Kernels

### Kernels I

• A hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector $f$, which so far simply contains the various high-order $x$ terms

• We choose landmarks $l^{(1)}, l^{(2)}, \ldots$ and then use the similarity (kernel) between $x$ and each landmark $l^{(i)}$ as the new features.

\begin{align} \text{similarity} = k(x,l^{(i)}) &= \exp\left( -\dfrac{\Vert x-l^{(i)}\Vert^2}{2\sigma^2} \right) \\ f_i &:= \text{similarity}(x,l^{(i)}). \end{align}
• $\sigma$: standard deviation
• $\sigma^2$: variance.
• $\Vert \cdot \Vert$: Euclidean norm (so $\Vert x-l^{(i)}\Vert$ is the Euclidean distance)
• There are many kernels; the definition of similarity above is the **Gaussian kernel**.

• We also call the $f_i$ landmark features.
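The similarity computation above can be sketched in plain Python (the function name and list-based vectors are my own choices for illustration):

```python
import math

def gaussian_kernel(x, l, sigma):
    """Gaussian similarity between example x and landmark l (both lists of floats)."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

When $x$ coincides with the landmark the similarity is 1, and it decays toward 0 as $x$ moves away; a smaller $\sigma$ makes the decay faster.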

### Kernels II

• Where do we get the landmarks from?
• For each example place a landmark at exactly the same location
• Given $m$ examples with $n$ features, $(x^{(i)}, y^{(i)})$ for $i=1,\ldots,m$.
• Choose landmarks: $l^{(i)} = x^{(i)}$ for $i=1,\ldots,m$.
• For each example we build a feature vector $f^{(i)}$, whose components are $f_j^{(i)} = \text{similarity}(x^{(i)}, l^{(j)})$
• Note that each $n$-dimensional input $x^{(i)}$ becomes an $(m+1)$-dimensional feature vector $f^{(i)}$ (with $f_0 = 1$)
• $\Theta$ now becomes $\Theta \in \mathbb{R}^{(m+1)\times 1}$.
• $f\in \mathbb{R}^{m+1}$ also.
• Predict $1$ if $\Theta^Tf \ge 0$
• SVM learning algorithm

$$\min_{\Theta} C\sum_{i=1}^m \left( y^{(i)} cost_1 (\Theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\Theta^Tf^{(i)})\right) + \dfrac{1}{2}\sum_{j=1}^m\Theta_j^2, \quad (n=m \text{ in this case})$$
• We minimize using $f$ as the feature vector instead of $x$
• $m=n$ because the number of features equals the number of training examples we have.
• It’s really expensive because there may be a lot of features (as many as training examples). It’s good to use off-the-shelf software to minimize this function instead. DON’T write your own software to do that!!
• Large $C$ ($\frac{1}{\lambda}$): low bias, high variance $\Rightarrow$ overfitting.
• Small C gives a hypothesis of high bias low variance $\Rightarrow$ underfitting
• Large $\sigma^2$: features $f$ vary more smoothly - higher bias, lower variance
• Small $\sigma^2$: features $f$ vary less smoothly - lower bias, higher variance
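The landmark construction and prediction rule above can be sketched as follows (pure Python; the helper names are my own):

```python
import math

def landmark_features(x, landmarks, sigma):
    """Map x to f = [1, sim(x, l1), ..., sim(x, lm)] with the Gaussian kernel."""
    f = [1.0]  # f_0 = 1
    for l in landmarks:
        sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, l))
        f.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return f

def predict(theta, x, landmarks, sigma):
    """Predict 1 if theta^T f >= 0, else 0."""
    f = landmark_features(x, landmarks, sigma)
    score = sum(t * fi for t, fi in zip(theta, f))
    return 1 if score >= 0 else 0
```

With the landmarks placed on the training examples themselves, `theta` has $m+1$ entries, one per landmark plus the intercept.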

## SVMs in practice

### Using an SVM

• Don’t write your own code to train the SVM; use an already-written library such as liblinear, libsvm, … You still need to specify:
• Choice of $C$
• Choice of kernel (similarity function)
• No kernel (“linear kernel”, use $X\Theta$)
• Gaussian kernel (above): need to choose $\sigma^2$
• Do perform feature scaling before using Gaussian kernel
• Not all similarity functions you develop are valid kernels $\Rightarrow$ they must satisfy **Mercer's Theorem** so that SVM packages’ optimizations run correctly and do not diverge.
• Polynomial kernel:
• use when $x$ and $l$ are both strictly non-negative
• People don’t use this much in practice.
• parameters: $const$ and $degree$
• Other kernels: string kernel (input data using texts, string,…), chi-square kernel, histogram intersection kernel,…
• Remember: choose whatever kernel performs best on cross-validation data
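As a sketch, the polynomial kernel mentioned above is usually written $k(x,l) = (x^Tl + \text{const})^{\text{degree}}$; a plain-Python version (the function name is my own):

```python
def polynomial_kernel(x, l, const=1.0, degree=2):
    """Polynomial similarity between x and l, parameterized by const and degree."""
    dot = sum(xi * li for xi, li in zip(x, l))
    return (dot + const) ** degree
```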

### Multi-class classification

• Many packages have built-in multi-class classification support
• Otherwise use the one-vs-all method
• Not a big issue
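A minimal sketch of the one-vs-all method, assuming we already have one trained binary scorer per class (the scorer interface here is hypothetical):

```python
def one_vs_all_predict(scorers, x):
    """Run each class's binary scorer on x and pick the most confident class."""
    scores = [score(x) for score in scorers]
    return scores.index(max(scores))
```

With $K$ classes we train $K$ binary classifiers (class $k$ vs. the rest) and predict the class whose scorer outputs the largest value.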

### Logistic regression vs. SVM

• If n (features) is large vs. m (training set)
• Feature vector dimension is 10 000
• Training set is 10 - 1000
• Then use logistic regression or SVM with a linear kernel
• If n is small and m is intermediate
• n = 1 - 1000
• m = 10 - 10 000
• Gaussian kernel is good
• If n is small and m is large
• n = 1 - 1000
• m = 50 000+
• SVM will be slow to run with Gaussian kernel
• In that case
• Manually create or add more features
• Use logistic regression or SVM with a linear kernel
• Logistic regression and SVM with a linear kernel are pretty similar (in performance and how they work)
• SVM has a convex optimization problem - so you get a global minimum
• Neural network likely to work well for most of these settings, but may be slower to train.

## Programming Assignment

### SVM

• A large $C$ parameter tells the SVM to try to classify all the examples correctly. $C$ plays a role similar to $\frac{1}{\lambda}$ , where $\lambda$ is the regularization parameter that we were using previously for logistic regression.

See again Regularized logistic regression.

Decision boundary plots for $C=1$ and for $C=100$.

• Most SVM software packages (including svmTrain.m) automatically add the extra feature $x_0 = 1$ for you and automatically take care of learning the intercept term $\theta_0$. So when passing your training data to the SVM software, there is no need to add this extra feature $x_0 = 1$ yourself.

### SVM with Gaussian Kernels

• To find non-linear decision boundaries with the SVM, we need to first implement a Gaussian kernel.
• You can think of the Gaussian kernel as a similarity function that measures the “distance” between a pair of examples $(x^{(i)}, x^{(j)})$. The Gaussian kernel is also parameterized by a bandwidth parameter, $\sigma$, which determines how fast the similarity metric decreases (to 0) as the examples are further apart.
• The Gaussian kernel function is defined as

$$k(x^{(i)},x^{(j)}) = \exp\left( -\dfrac{\Vert x^{(i)}-x^{(j)}\Vert^2}{2\sigma^2} \right)$$

• File gaussianKernel.m

sim = exp( -(norm(x1-x2))^2/(2*sigma^2) );  % Gaussian similarity between x1 and x2


### Example Dataset 3

File dataset3Params.m

% Grid-search C and sigma over the suggested values,
% keeping the pair with the lowest cross-validation error.
range = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30];
predictionErrMin = 100000; % initial value, larger than any possible error

for i = 1:length(range)
  for j = 1:length(range)
    model = svmTrain(X, y, range(i), @(x1, x2) gaussianKernel(x1, x2, range(j)));
    predictions = svmPredict(model, Xval);
    predictionErr = mean(double(predictions ~= yval));
    if predictionErr < predictionErrMin
      predictionErrMin = predictionErr;
      C = range(i);       % best C so far
      sigma = range(j);   % best sigma so far
    end
  end
end


### Spam Classification

• You need to convert each email into a feature vector $x\in \mathbb{R}^n$. The following parts of the exercise will walk you through how such a feature vector can be constructed from an email.

### Preprocessing Emails

• One method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc.
• We could replace each URL in the email with the unique string httpaddr to indicate that a URL was present.
• Usually, we do:
• Lower-casing: convert entire email to lowercase.
• Stripping HTML: All HTML tags are removed from the emails.
• Normalizing URLs: All URLs are replaced with the text httpaddr
• Normalizing Email Addresses: All email addresses are replaced with the text emailaddr.
• Normalizing Numbers: All numbers are replaced with the text number.
• Normalizing Dollars: All dollar signs ($) are replaced with the text dollar.
• Word Stemming: “discount”, “discounts”, “discounted” and “discounting” are all replaced by discount.
• Removal of non-words: Non-words and punctuation are removed; all tabs, spaces, and newlines become a single space character.

### Vocabulary List

• Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
• Given the vocabulary list, we can now map each word in the preprocessed emails (e.g., Figure 9) into a list of word indices that contains the index of the word in the vocabulary list.
• If the word exists, you should add the index of the word into the word_indices variable.
• If the word does not exist, and is therefore not in the vocabulary, you can skip the word.
• File processEmail.m

for i = 1:length(vocabList)
  if strcmp(str, vocabList{i})
    word_indices = [word_indices; i];
  end
end

### Extracting Features from Emails

• You will now implement the feature extraction that converts each email into a vector in $\mathbb{R}^n$.
• $n =$ number of words in the vocabulary list.
• $x_i\in \{0,1\}$ for an email corresponds to whether the $i$-th word in the dictionary occurs in the email.
• File emailFeatures.m

x(word_indices) = 1;  % mark every vocabulary word that occurs in the email
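A plain-Python equivalent of the one-liner above (the function name is my own; note that Octave indices are 1-based):

```python
def email_features(word_indices, n):
    """Build the binary feature vector x in R^n from 1-based word indices."""
    x = [0] * n
    for idx in word_indices:
        x[idx - 1] = 1  # shift the 1-based Octave index to 0-based
    return x
```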

Go to Week 8.