- So far, we've seen a range of different algorithms
- With supervised learning algorithms - performance is pretty similar
- What matters more often is:
- The amount of training data
- Skill of applying algorithms

- One final supervised learning algorithm that is widely used - the **support vector machine (SVM)**
- Compared to both logistic regression and neural networks, an SVM sometimes gives a cleaner way of learning non-linear functions
- Later in the course we'll do a survey of different supervised learning algorithms

- Start with logistic regression, see how we can modify it to get the SVM
- As before, the logistic regression hypothesis is h_{θ}(x) = 1/(1 + e^{-θ^{T}x})
- The sigmoid activation function g(z) = 1/(1 + e^{-z}) produces the familiar S-shaped curve
- In order to explain the math, we use z = θ^{T}x

- What do we want logistic regression to do?
- We have an example where y = 1
- Then we hope h_{θ}(x) is close to 1
- With h_{θ}(x) close to 1, (θ^{T}x) must be **much larger** than 0

- Similarly, when y = 0
- Then we hope h_{θ}(x) is close to 0
- With h_{θ}(x) close to 0, (θ^{T}x) must be **much less** than 0

- This is our classic view of logistic regression
- Let's consider another way of thinking about the problem

- Alternative view of logistic regression
- If you look at the cost function, each example contributes a term of the form -(y log(h_{θ}(x)) + (1 - y) log(1 - h_{θ}(x))) to the overall cost function
- For the overall cost function, we sum over all the training examples using the above function, and have a 1/m term

- If you then plug in the hypothesis definition (h_{θ}(x)), you get the expanded per-example cost -y log(1/(1 + e^{-θ^{T}x})) - (1 - y) log(1 - 1/(1 + e^{-θ^{T}x}))
- So each training example contributes that term to the cost function for logistic regression (see the short sketch below)
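As an illustration of the per-example term above, here is a minimal sketch assuming NumPy; the function name `example_cost` is just illustrative, not from the course:

```python
import numpy as np

def example_cost(theta, x, y):
    """Per-example logistic regression cost:
    -(y*log(h(x)) + (1-y)*log(1-h(x))), where h(x) = 1 / (1 + e^(-theta'x))."""
    h = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))
```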

- If y = 1 then only the first term in the objective matters
- If we plot the functions vs. z we get the following graph
- This plot shows the cost contribution of an example when y = 1 given z
- So if z is big, the cost is low - this is good!
- But if z is 0 or negative the cost contribution is high
- This is why, when logistic regression sees a positive example, it tries to set θ^{T}x to be a very large term

- If y = 0 then only the second term matters
- We can again plot it and get a similar graph
- Same deal, if z is small then the cost is low
- But if z is large then the cost is massive


- To build an SVM we must redefine our cost functions
- When y = 1
- Take the y = 1 function and create a new cost function
- Instead of a curved line, create two straight lines (magenta) which act as an approximation to the logistic regression y = 1 cost function
- Take point (1) on the z axis
- Flat from 1 onwards
- Grows linearly as z drops below 1

- This means we have two straight lines
- A flat segment where the cost is 0 (for z >= 1)
- A straight line which grows as z decreases below 1

- So this is the new y=1 cost function
- Gives the SVM a computational advantage and an easier optimization problem
- We call this function **cost**_{1}(z)


- Similarly
- When y = 0
- Do the equivalent with the y=0 function plot
- We call this function **cost**_{0}(z)

- So here we define the two cost function terms for our SVM graphically
- How do we implement this?
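Before looking at the full objective, here is a minimal sketch of the two straight-line cost terms, assuming NumPy; the notes only fix the shape (flat past the kink at ±1, then linear growth), so the unit slope used here is an assumption:

```python
import numpy as np

def cost1(z):
    """SVM cost term for y = 1: zero when z >= 1, grows linearly as z drops below 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost term for y = 0: zero when z <= -1, grows linearly as z rises above -1."""
    return np.maximum(0.0, 1.0 + z)
```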

- As a comparison/reminder we have logistic regression below
- If this looks unfamiliar it's because we previously had the minus sign outside the expression

- For the SVM we take our two logistic regression y=1 and y=0 terms described previously and replace with
- cost_{1}(θ^{T}x)
- cost_{0}(θ^{T}x)

- So we get

- Following SVM notation conventions, we rename a few things here
__1) Get rid of the 1/m terms__
- This is just a slightly different convention
- By removing 1/m we should get the same optimal values for θ
- 1/m is a constant, so should get same optimization
- e.g. say you have a minimization problem whose minimum is at u = 5
- If you multiply the cost function by a constant, you still get the same minimizing value u = 5
- The minimal value itself is different, but that's irrelevant - we only care about the minimizing u

__2) For logistic regression we had two terms__
- Training data set term (i.e. that we sum over m) = **A**
- Regularization term (i.e. that we sum over n) = **B**
- So we could describe it as A + λB
- Need some way to deal with the trade-off between regularization and data set terms
- Set different values for λ to parametrize this trade-off

- Instead of parameterizing this as A + λB
- For SVMs the convention is to use a different parameter called C
- So do CA + B
- If C were equal to 1/λ then the two functions (CA + B and A + λB) would give the same value

- So, our overall equation is: min_{θ} C [Σ_{i=1..m} y^{(i)} cost_{1}(θ^{T}x^{(i)}) + (1 - y^{(i)}) cost_{0}(θ^{T}x^{(i)})] + (1/2) Σ_{j=1..n} θ_{j}^{2}

- Unlike logistic regression, h_{θ}(x) doesn't give us a probability; instead we get a direct prediction of 1 or 0
- So if θ^{T}x is equal to or greater than 0 --> h_{θ}(x) = 1
- Else --> h_{θ}(x) = 0
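Putting the pieces together, a minimal sketch of the C·A + B objective and the direct 1/0 prediction, assuming NumPy; `svm_cost` and `svm_predict` are illustrative names, X is assumed to carry a leading column of ones, and the straight-line cost terms from the sketch above are inlined:

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """C * sum_i [ y_i*cost1(theta'x_i) + (1-y_i)*cost0(theta'x_i) ] + 0.5 * sum_{j>=1} theta_j^2."""
    z = X @ theta
    data_term = np.sum(y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)   # theta_0 is conventionally not regularized
    return C * data_term + reg_term

def svm_predict(theta, X):
    """Direct prediction: h(x) = 1 if theta'x >= 0, else 0 (no probability)."""
    return (X @ theta >= 0).astype(int)
```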


- Sometimes people refer to the SVM as a **large margin classifier**
- We'll consider what that means and what an SVM hypothesis looks like
- The SVM cost function is as above, and we've drawn out the cost terms below
- Left is cost_{1} and right is cost_{0}
- What does it take to make these terms small?
- If y = 1
- cost_{1}(z) = 0 only when z >= 1

- If y = 0
- cost_{0}(z) = 0 only when z <= -1

- Interesting property of SVM
- If you have a positive example, you only really *need* z to be greater than or equal to 0
- If this is the case then you predict 1

- SVM wants a bit more than that - doesn't want to *just* get it right, but have the value be quite a bit bigger than zero
- Throws in an extra safety margin factor


- Logistic regression does something similar
- What are the consequences of this?
- Consider a case where we set C to be huge
- C = 100,000
- So considering we're minimizing CA + B
- If C is huge we're going to pick an A value so that A is equal to zero
- What is the optimization problem here - how do we make A = 0?

- Making A = 0
- If y = 1
- Then to make our "A" term 0 we need to find a value of θ so (θ^{T}x) is greater than or equal to 1

- Similarly, if y = 0
- Then to make "A" = 0 we need to find a value of θ so (θ^{T}x) is equal to or less than -1

- So, if we think of our optimization problem as a way to ensure that this first "A" term is equal to 0, we re-factor our optimization problem into just minimizing the "B" (regularization) term, because
- When A = 0 --> A*C = 0

- So we're minimizing B, under the constraints (θ^{T}x^{(i)}) >= 1 when y^{(i)} = 1 and (θ^{T}x^{(i)}) <= -1 when y^{(i)} = 0

- Turns out when you solve this problem you get interesting decision boundaries
- The green and magenta lines are functional decision boundaries which could be chosen by logistic regression
- But they probably don't generalize too well

- The black line, by contrast, is the one chosen by the SVM because of this safety margin imposed by the optimization
- More robust separator

- Mathematically, that black line has a larger minimum distance (margin) from any of the training examples
- By separating with the largest margin you incorporate robustness into your decision making process

- We looked at this when C is very large
- The SVM is more sophisticated than this large-margin view might suggest
- If you were just using large margin then SVM would be very sensitive to outliers
- You would risk letting a single ridiculous outlier hugely impact your classification boundary
- A single example might not represent a good reason to move the decision boundary
- If C is very large then we *do* use this quite naive maximize-the-margin approach
- So we'd change the black boundary to the magenta one

- But if C is reasonably small (not too large), then you stick with the black decision boundary

- What about non-linearly separable data?
- Then SVM still does the right thing if you use a normal size C
- So the idea of the SVM being a large margin classifier is only really relevant when you have no outliers and the data is easily linearly separable

- Means we ignore a few outliers


- Have two (2D) vectors u and v - what is the inner product (*u*^{T}*v*)?
- Plot *u* on a graph - i.e. *u*_{1} vs. *u*_{2}

- One property which is good to have is the **norm** of a vector
- Written as ||u||
- This is the euclidean length of vector u

- So ||u|| = SQRT(*u*_{1}^{2} + *u*_{2}^{2}) = real number
- i.e. the length of the arrow above
- Can show via Pythagoras

- For the inner product, take *v* and orthogonally project it down onto u
- First we can plot v on the same axis in the same way (*v*_{1} vs. *v*_{2})
- Measure the length/magnitude of the projection
- So here, the green line is the projection
- p = length along u to the intersection
- p is the magnitude of the projection of vector *v* onto vector *u*

- Possible to show that *u*^{T}*v* = p * ||u||
- So this is one way to compute the inner product
- *u*^{T}*v* = *u*_{1}*v*_{1} + *u*_{2}*v*_{2}
- So therefore p * ||u|| = *u*_{1}*v*_{1} + *u*_{2}*v*_{2}
- This is an important rule in linear algebra

- We can reverse this too
- So we could do *v*^{T}*u* = *v*_{1}*u*_{1} + *v*_{2}*u*_{2}
- Which would obviously give you the same number


- p can be negative if the angle between them is 90 degrees or more
- So here p is negative
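A small numerical check of the projection rule, assuming NumPy; the vectors are made-up values, not the ones in the figure:

```python
import numpy as np

u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.sqrt(u[0] ** 2 + u[1] ** 2)   # ||u||, same as np.linalg.norm(u)
p = (u @ v) / norm_u                      # signed length of the projection of v onto u

print(u @ v)        # u'v computed directly: u1*v1 + u2*v2 = 5.0
print(p * norm_u)   # p * ||u|| gives the same 5.0, as the rule states
```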

- Use the vector inner product theory to try and understand SVMs a little better

- For the following explanation we make two simplifications
- Set θ_{0} = 0 (i.e. ignore intercept terms)
- Set n = 2 - (x_{1}, x_{2})
- i.e. each example has only 2 features

- Given we only have two parameters, we can simplify our function to (1/2)(θ_{1}^{2} + θ_{2}^{2})
- And, this can be re-written as (1/2)(SQRT(θ_{1}^{2} + θ_{2}^{2}))^{2}
- Should give the same thing

- We may notice that
- The term in red is the norm of θ
- If we take θ as a 2x1 vector
- If we assume θ_{0} = 0 it's still true

- So, finally, this means our optimization function can be re-defined as min_{θ} (1/2)||θ||^{2}
- So the SVM is minimizing the squared norm

- Given this, what are the (θ^{T}x) terms doing?
- Given θ and given example x, what is this equal to?
- We can look at this in a comparable manner to how we just looked at u and v

- Say we have a single positive training example (red cross below)
- Although we haven't been thinking about examples as vectors it can be described as such
- Now, say we have our parameter vector θ and we plot that on the same axis
- The next question is what is the inner product of these two vectors
- p, is in fact p^{i}, because it's the length of p for example i
- Given our previous discussion we know
- (θ^{T}x^{i}) = p^{i} * ||θ|| = θ_{1}x^{i}_{1} + θ_{2}x^{i}_{2}
- So these are both equally valid ways of computing θ^{T}x^{i}

- What does this mean?
- The constraints we defined earlier
- (θ^{T}x) >= 1 if y = 1
- (θ^{T}x) <= -1 if y = 0

- Can be replaced/substituted with the constraints
- p^{i} * ||θ|| >= 1 if y = 1
- p^{i} * ||θ|| <= -1 if y = 0

- Writing that into our optimization objective


- So, given we've redefined these functions let us now consider the training example below
- Given this data, what boundary will the SVM choose? Note that we're still assuming θ_{0} = 0, which means the boundary has to pass through the origin (0,0)
- Green line - small margins
- The SVM would not choose this line
- The decision boundary comes very close to the examples
- Let's discuss *why* the SVM would **not** choose this decision boundary

- Looking at this line
- We can show that θ is at 90 degrees to the decision boundary
- **θ is always at 90 degrees to the decision boundary** (can show with linear algebra, although we're not going to!)

- So now lets look at what this implies for the optimization objective
- Look at the first example (x^{1})
- Project a line from x^{1} onto the θ vector (so it hits at 90 degrees)
- The distance between the intersection and the origin is (**p**)^{1}

- Similarly, look at the second example (x^{2})
- Project a line from x^{2} onto the θ vector
- This is the magenta line, which will be **negative** (**p**)^{2}

- If we look at these two projections together below we see a graphical representation of what's going on
- We find that both these p values are going to be pretty small
- If we look back at our optimization objective
- We know we need p^{1} * ||θ|| to be bigger than or equal to 1 for positive examples
- If p is small
- Means that ||θ|| must be pretty large

- Similarly, for negative examples we need p^{2} * ||θ|| to be smaller than or equal to -1
- We saw in this example p^{2} is a small negative number
- So ||θ|| must be a large number

- Why is this a problem?
- The optimization objective is trying to find a set of parameters where the norm of theta is small
- So this doesn't seem like a good direction for the parameter vector (because as p values get smaller ||θ|| must get larger to compensate)
- So we should make p values larger which allows ||θ|| to become smaller

- So let's choose a different boundary
- Now if you look at the projection of the examples to θ we find that p^{1} becomes large and ||θ|| can become small

- So with some values drawn in
- This means that by choosing this second decision boundary we can make ||θ|| smaller
- Which is why the SVM chooses this hypothesis as better
- This is how we generate the large margin effect
- The magnitude of this margin is a function of the p values
- So by maximizing these p values we minimize ||θ||
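A tiny numerical illustration of this trade-off, assuming NumPy and a made-up positive example; `min_norm_needed` is an illustrative helper, not course code:

```python
import numpy as np

x = np.array([1.0, 2.0])   # a made-up positive example (y = 1), with theta_0 = 0

def min_norm_needed(theta_dir, x):
    """Smallest ||theta|| along this direction satisfying p * ||theta|| >= 1,
    where p is the projection of x onto the theta direction."""
    p = x @ (theta_dir / np.linalg.norm(theta_dir))
    return 1.0 / p

# A boundary passing close to x (theta nearly orthogonal to x): small p, large ||theta|| required
print(min_norm_needed(np.array([2.0, -0.5]), x))   # ~2.06
# A boundary far from x (theta pointing along x): large p, small ||theta|| suffices
print(min_norm_needed(np.array([1.0, 2.0]), x))    # ~0.45
```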

- Finally, we did this derivation assuming θ_{0} = 0
- If this is the case we're entertaining only decision boundaries which pass through (0,0)
- If you allow θ_{0} to take other values then this simply means you can have decision boundaries which cross the axes at points other than (0,0)
- Can show with basically the same logic that, even when θ_{0} is non-zero, the optimization objective described above (when C is very large) means the SVM is still looking for a large margin separator between the classes

- What are kernels and how do we use them
- We have a training set
- We want to find a non-linear boundary
- Come up with a complex set of polynomial features to fit the data
- Have h_{θ}(x) which
- Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
- Else returns 0

- Another way of writing this (new notation) is
- That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a **new feature vector f**, which simply contains the various high order x terms
- e.g.
- h_{θ}(x) = θ_{0} + θ_{1}f_{1} + θ_{2}f_{2} + θ_{3}f_{3}
- Where
- f_{1} = x_{1}
- f_{2} = x_{1}x_{2}
- f_{3} = ...
- i.e. not specific values, but each of the terms from your complex polynomial function

- Is there a better choice of feature f than the high order polynomials?
- As we saw with computer imaging, high order polynomials become computationally expensive

- New features
- Define three features in this example (ignore x_{0})
- Have a graph of x_{1} vs. x_{2} (don't plot the values, just define the space)
- Pick three points in that space
- These points l^{1}, l^{2}, and l^{3} were chosen manually and are called **landmarks**
- Given x, define f1 as the similarity between (x, l^{1})
- f1 = exp(- (|| x - l^{1} ||^{2}) / 2σ^{2})
- **|| x - l^{1} ||^{2}** is the squared euclidean distance between the point x and the landmark l^{1}
- Discussed more later

- If we remember our statistics, we know that
- σ is the **standard deviation**
- σ^{2} is commonly called the **variance**

- Remember that, as discussed, f1 measures how similar x is to the landmark l^{1}

- So, f2 is defined as
- f2 = similarity(x, l^{2}) = exp(- (|| x - l^{2} ||^{2}) / 2σ^{2})

- And similarly
- f3 = similarity(x, l^{3}) = exp(- (|| x - l^{3} ||^{2}) / 2σ^{2})

- This similarity function is called a **kernel**
- This particular function is a **Gaussian kernel**

- So, instead of writing similarity between x and l we might write
- f1 = k(x, l^{1})


- So let's see what these kernels do and why the functions defined make sense
- Say x is close to a landmark
- Then the squared distance will be ~0
- So f1 ≈ exp(-0 / 2σ^{2})
- Which is basically e^{-0}
- Which is close to 1

- Say x is far from a landmark
- Then the squared distance is big
- Gives e^{-large number}
- Which is close to zero

- Each landmark defines a new feature

- If we plot f1 as a function of x (for a landmark at [3,5]) we get a plot like this
- Notice that when x = [3,5] then f1 = 1
- As x moves away from [3,5] then the feature takes on values close to zero
- So this measures how close x is to this landmark

- **σ^{2}** is a parameter of the Gaussian kernel
- Defines the steepness of the rise around the landmark

- Above example σ^{2} = 1
- Below σ^{2} = 0.5
- We see here that as you move away from [3,5] the feature f1 falls to zero much more rapidly

- The inverse can be seen if σ^{2} = 3
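A minimal sketch of the Gaussian kernel, assuming NumPy; the landmark at [3,5] mirrors the example above, and `gaussian_kernel` is just an illustrative name:

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq):
    """f = exp( -||x - l||^2 / (2 * sigma^2) )."""
    diff = x - l
    return np.exp(-(diff @ diff) / (2.0 * sigma_sq))

landmark = np.array([3.0, 5.0])
print(gaussian_kernel(np.array([3.0, 5.0]), landmark, 1.0))   # x on the landmark -> 1.0
print(gaussian_kernel(np.array([6.0, 1.0]), landmark, 1.0))   # far from the landmark -> ~0
print(gaussian_kernel(np.array([4.0, 5.0]), landmark, 0.5))   # smaller sigma^2 -> falls off faster
```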

- Given this definition, what kinds of hypotheses can we learn?
- With training examples x we predict "1" when
- θ_{0} + θ_{1}f_{1} + θ_{2}f_{2} + θ_{3}f_{3} >= 0
- For our example, let's say we've already run an algorithm and got
- θ_{0} = -0.5
- θ_{1} = 1
- θ_{2} = 1
- θ_{3} = 0
- Given our placement of the three landmarks, what happens if we evaluate an example at the **magenta dot** below?
- Looking at our formula, we know f1 will be close to 1, but f2 and f3 will be close to 0
- So if we look at the formula we have
- θ_{0} + θ_{1}f_{1} + θ_{2}f_{2} + θ_{3}f_{3} ≈ -0.5 + 1 + 0 + 0 = 0.5
- 0.5 is greater than or equal to 0, so we predict y = 1

- If we had **another point** far away from all three
- This equates to -0.5
- So we predict 0

- Considering our parameters, for points near l^{1} and l^{2} you predict 1, but for points near l^{3} you predict 0
- Which means we create a non-linear decision boundary that looks something like this:
- Inside we predict y = 1
- Outside we predict y = 0

- So this shows how we can create a non-linear boundary with landmarks and the kernel function in the support vector machine
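A worked version of the example above, assuming NumPy; the θ values are the ones given in the notes, while the landmark positions are made up for illustration:

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    diff = x - l
    return np.exp(-(diff @ diff) / (2.0 * sigma_sq))

landmarks = [np.array([3.0, 2.0]), np.array([3.0, 4.0]), np.array([1.0, 1.0])]  # made-up l1, l2, l3
theta = np.array([-0.5, 1.0, 1.0, 0.0])                                         # theta_0 .. theta_3

def predict(x):
    """h(x) = 1 if theta_0 + theta_1*f1 + theta_2*f2 + theta_3*f3 >= 0, else 0."""
    f = np.array([1.0] + [gaussian_kernel(x, l) for l in landmarks])
    return int(f @ theta >= 0)

print(predict(np.array([3.0, 2.1])))   # near l1: f1 ~ 1, so roughly -0.5 + 1 >= 0 -> predict 1
print(predict(np.array([9.0, 9.0])))   # far from all landmarks: roughly -0.5 < 0 -> predict 0
```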
- But
- How do we get/choose the landmarks?
- What other kernels can we use (other than the Gaussian kernel)?

- Filling in missing detail and practical implications regarding kernels
- Spoke about picking landmarks manually, defining the kernel, and building a hypothesis function
- Where do we get the landmarks from?
- For complex problems we probably want lots of them

- Take the training data
- For each example place a landmark at exactly the same location
- So end up with m landmarks
- One landmark per location per training example
- Means our features measure how close to a training set example something is

- Given a new example, compute all the f values
- Gives you a feature vector f (f_{0} to f_{m})
- f_{0} = 1 always

- A more detailed look at generating the f vector
- If we had a training example, the features we compute would be using (x^{i}, y^{i})
- So we just cycle through each landmark, calculating how close to that landmark x^{i} actually is
- f_{1}^{i} = k(x^{i}, l^{1})
- f_{2}^{i} = k(x^{i}, l^{2})
- ...
- f_{m}^{i} = k(x^{i}, l^{m})

- Somewhere in the list we compare x to itself... (i.e. when we're at f_{i}^{i})
- So because we're using the Gaussian kernel this evaluates to 1

- Take these m features (f_{1}, f_{2} ... f_{m}) and group them into an [m+1 x 1] dimensional vector called f
- f^{i} is the f feature vector for the ith example
- And add a 0th term f_{0} = 1 (see the sketch below)
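A minimal sketch of generating these feature vectors for the whole training set, assuming NumPy; `kernel_features` is an illustrative name and σ² is left as a parameter:

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    diff = x - l
    return np.exp(-(diff @ diff) / (2.0 * sigma_sq))

def kernel_features(X, sigma_sq=1.0):
    """Map each example x^i to f^i = [1, k(x^i, l^1), ..., k(x^i, l^m)],
    where the landmarks l^1..l^m are the m training examples themselves.
    Returns an (m, m+1) matrix; row i is the feature vector for example i."""
    m = X.shape[0]
    F = np.ones((m, m + 1))
    for i in range(m):
        for j in range(m):
            F[i, j + 1] = gaussian_kernel(X[i], X[j], sigma_sq)   # equals 1 when i == j
    return F
```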

- Given these kernels, how do we use a support vector machine

- Predict y = 1 if (θ^{T}f) >= 0
- Because θ = [m+1 x 1]
- And f = [m+1 x 1]

- So, this is how you make a prediction assuming you already have θ
- How do you get θ?

- Use the SVM learning algorithm
- Now, we minimize using f as the feature vector instead of x
- By solving this minimization problem you get the parameters for your SVM

- In this setup, m = n
- Because number of features is the number of training data examples we have

- One final mathematical detail (not crucial to understand)
- If we ignore θ_{0} then ||θ||^{2} = θ^{T}θ
- What many implementations actually minimize is θ^{T}Mθ
- Where the matrix M depends on the kernel you use
- Gives a slightly different minimization - means we determine a rescaled version of θ
- Allows more efficient computation, and scaling to much bigger training sets
- If you have a training set with 10 000 values, means you get 10 000 features
- Solving for all these parameters can become expensive
- So by adding this in we avoid a for loop and use a matrix multiplication algorithm instead

- You can apply kernels to other algorithms
- But they tend to be very computationally expensive
- But the SVM is far more efficient - so more practical

- Lots of good off the shelf software to minimize this function
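As a concrete example of leaning on such a package, here is a sketch assuming scikit-learn is available (its SVC class wraps libsvm). Note that its RBF kernel is parameterised by gamma rather than σ², with gamma = 1/(2σ²); the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up 2D training set: positives clustered around [3, 3]
X = np.array([[3.0, 3.0], [3.2, 2.8], [2.8, 3.1], [0.0, 0.0], [6.0, 6.0], [0.5, 5.5]])
y = np.array([1, 1, 1, 0, 0, 0])

sigma_sq = 1.0
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / (2.0 * sigma_sq))  # gamma = 1 / (2*sigma^2)
clf.fit(X, y)

print(clf.predict([[3.1, 3.0], [0.2, 0.3]]))   # expect roughly [1, 0]
```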

**SVM parameters (C)**
- Bias and variance trade off
- Must choose C
- C plays a role similar to 1/λ (where λ is the regularization parameter)

- Large C gives a hypothesis of **low bias, high variance** --> overfitting
- Small C gives a hypothesis of **high bias, low variance** --> underfitting

**SVM parameters (σ^{2})**
- Parameter for calculating the f values
- Large σ^{2} - f features vary more smoothly - higher bias, lower variance
- Small σ^{2} - f features vary abruptly - lower bias, higher variance


- So far we've spoken about the SVM in a very abstract manner
- What do you need to actually do this?
- Use SVM software packages (e.g. liblinear, libsvm) to solve for the parameters θ
- Need to specify
- Choice of parameter C
- Choice of kernel

- We've looked at the **Gaussian kernel**
- Need to define σ (σ^{2})
- Discussed σ^{2} above

- When would you choose a Gaussian?
- If n is small and/or m is large
- e.g. 2D training set that's large

- If you're using a Gaussian kernel then you may need to implement the kernel function
- e.g. a function fi = kernel(x1,x2)
- Returns a real number

- Some SVM packages will expect you to define the kernel
- Although, some SVM implementations include the Gaussian and a few others
- Gaussian is probably most popular kernel

- NB - make sure you perform **feature scaling** before using a Gaussian kernel
- If you don't, features with a large range of values will dominate the f value

- Could use no kernel - **linear kernel**
- Predict y = 1 if (θ^{T}x) >= 0
- So no f vector
- Get a standard linear classifier

- Why do this?
- If n is large and m is small then
- Lots of features, few examples
- Not enough data - risk overfitting in a high dimensional feature-space

- Other choices of kernel
- Linear and Gaussian are most common
- Not all similarity functions you develop are valid kernels
- Must satisfy **Mercer's Theorem**
- SVMs use numerical optimization tricks
- Meaning certain optimizations can be made, but the kernel must satisfy the theorem

**Polynomial Kernel**
- We measure the similarity of x and l by doing one of
- (x^{T}l)^{2}
- (x^{T}l)^{3}
- (x^{T}l+1)^{3}

- General form is
- (x^{T}l + Con)^{D}

- (x
- If they're similar then the inner product tends to be large
- Not used that often
- Two parameters
- Degree of polynomial (D)
- The constant you add to x^{T}l (Con)

- Usually performs worse than the Gaussian kernel
- Used when x and l are both non-negative
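A minimal sketch of the polynomial kernel, assuming NumPy; `polynomial_kernel`, `con`, and `degree` are illustrative names for the (Con, D) parameters above:

```python
import numpy as np

def polynomial_kernel(x, l, con=1.0, degree=2):
    """Polynomial kernel (x'l + Con)^D."""
    return (x @ l + con) ** degree

x = np.array([1.0, 2.0])
l = np.array([2.0, 1.0])
print(polynomial_kernel(x, l))                     # (4 + 1)^2 = 25
print(polynomial_kernel(x, l, con=0.0, degree=3))  # (x'l)^3 = 64
```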

**String kernel**
- Used if input is text strings
- Use for text classification

**Chi-squared kernel**
**Histogram intersection kernel**

- Many SVM packages have built-in multi-class classification support
- Otherwise use the one-vs-all method
- Not a big issue

- When should you use SVM and when is logistic regression more applicable
- If n (features) is large vs. m (training set)
- e.g. text classification problem
- Feature vector dimension is 10 000
- Training set is 10 - 1000
- Then use logistic regression or SVM with a linear kernel

- If n is small and m is intermediate
- n = 1 - 1000
- m = 10 - 10 000
- Gaussian kernel is good

- If n is small and m is large
- n = 1 - 1000
- m = 50 000+
- SVM will be slow to run with Gaussian kernel

- In that case
- Manually create or add more features
- Use logistic regression or SVM with a linear kernel

- Logistic regression and SVM with a linear kernel are pretty similar
- Do similar things
- Get similar performance

- A lot of the SVM's power comes from using different kernels to learn complex non-linear functions
- For all these regimes a well designed NN should work
- But, for some of these problems an NN might be slower - a well-implemented SVM would be faster

- SVM has a convex optimization problem - so you get a global minimum
- It's not always clear how to choose an algorithm
- Often more important to get enough data
- Designing new features
- Debugging the algorithm

- The SVM is widely perceived as a very powerful learning algorithm