- Multiple variables = multiple features
- In original version we had
- X = house size, use this to predict
- y = house price

- If in a new scheme we have more variables (such as number of bedrooms, number floors, age of the home)
- x
_{1}, x_{2}, x_{3},_{ }x_{4 }are the four features- x
_{1}- size (feet squared) - x
_{2}- Number of bedrooms - x
_{3}- Number of floors - x
_{4}- Age of home (years)

- x
- y is the output variable (price)

- x
- More notation
**n**- number of features (n = 4)

**m**- number of examples (i.e. number of rows in a table)

**x**^{i}- vector of the input for an example (so a vector of the four parameters for the i
^{th }input example) - i is an index into the training set
- So
- x is an n-dimensional feature vector
- x
^{3}is, for example, the 3rd house, and contains the four features associated with that house

- vector of the input for an example (so a vector of the four parameters for the i
**x**_{j}^{i }- The value of feature j in the ith training example
- So
- x
_{2}^{3 }is, for example, the number of bedrooms in the third house

- x

- Now we have multiple features
- What is the form of our hypothesis?
- Previously our hypothesis took the form;
- h
_{θ}(x) = θ_{0}+ θ_{1}x- Here we have two parameters (theta 1 and theta 2) determined by our cost function
- One variable x

- h
- Now we have multiple features
- h
_{θ}(x) = θ_{0}+ θ_{1}x_{1}+ θ_{2}x_{2}+ θ_{3}x_{3}+ θ_{4}x_{4}

- h
- For example
- h
_{θ}(x) = 80 + 0.1x_{1}+ 0.01x_{2}+ 3x_{3}- 2x_{4}- An example of a hypothesis which is trying to predict the price of a house
- Parameters are still determined through a cost function

- h
- For convenience of notation, x
_{0}= 1- For every example i you have an additional 0th feature for each example
- So now your
**feature vector**is n + 1 dimensional feature vector indexed from 0- This is a column vector called x
- Each example has a column vector associated with it
- So let's say we have a new example called "X"

**Parameters**are also in a 0 indexed n+1 dimensional vector- This is also a column vector called θ
- This vector is the same for each example

- Considering this, hypothesis can be written
- h
_{θ}(x) = θ_{0}x_{0}+ θ_{1}x_{1}+ θ_{2}x_{2}+ θ_{3}x_{3}+ θ_{4}x_{4}

- h
- If we do
- h
_{θ}(x) =θX^{T}- θ
is an [1 x n+1] matrix^{T } - In other words, because θ
is a column vector, the transposition operation transforms it into a row vector^{ } - So before
- θ
was a matrix [n + 1 x 1]^{ }

- θ
- Now
- θ
is a matrix [1 x n+1]^{T }

- θ
- Which means the inner dimensions of θ
and X match, so they can be multiplied together as^{T}- [1 x n+1] * [n+1 x 1]
- = h
_{θ}(x) - So, in other words, the transpose of our parameter vector * an input example X gives you a predicted hypothesis which is [1 x 1] dimensions (i.e. a single value)

- θ
- This x
_{0}= 1 lets us write this like this

- h
- This is an example of multivariate linear regression

- Fitting parameters for the hypothesis with gradient descent
- Parameters are θ
_{0}to θ_{n} - Instead of thinking about this as n separate values, think about the parameters as a single vector (θ)
- Where θ is n+1 dimensional

- Parameters are θ
- Our cost function is

- Similarly, instead of thinking of J as a function of the n+1 numbers, J() is just a function of the parameter vector
- J(θ)

**Gradient descent**

- Once again, this is
- θ
_{j}= θ_{j}- learning rate (α) times the partial derivative of J(θ) with respect to θ_{J(...)} - We do this through a
**simultaneous update**of every θ_{j}value

- θ
- Implementing this algorithm
- When n = 1

- Above, we have slightly different update rules for θ
_{0}and θ_{1}- Actually they're the same, except the end has a previously undefined x
_{0}^{(i)}as 1, so wasn't shown

- Actually they're the same, except the end has a previously undefined x
- We now have an almost identical rule for multivariate gradient descent

- What's going on here?
- We're doing this for each j (0 until n) as a simultaneous update (like when n = 1)
- So, we re-set θ
_{j}to- θ
_{j}minus the learning rate (α) times the partial derivative of of the θ vector with respect to θ_{j} - In non-calculus words, this means that we do
- Learning rate
- Times 1/m (makes the maths easier)
- Times the sum of
- The hypothesis taking in the variable vector, minus the actual value, times the j-th value in that variable vector for EACH example

- θ
- It's important to remember that

- These algorithm are highly similar

- Having covered the theory, we now move on to learn about some of the practical tricks
- Feature scaling
- If you have a problem with multiple features
- You should make sure those features have a similar scale
- Means gradient descent will converge more quickly

- e.g.
- x1 = size (0 - 2000 feet)
- x2 = number of bedrooms (1-5)
- Means the contours generated if we plot θ
_{1}vs. θ_{2}give a very tall and thin shape due to the huge range difference

- Running gradient descent on this kind of cost function can take a long time to find the global minimum

- Pathological input to gradient descent
- So we need to rescale this input so it's more effective
- So, if you define each value from x1 and x2 by dividing by the max for each feature
- Contours become more like circles (as scaled between 0 and 1)

- May want to get everything into -1 to +1 range (approximately)
- Want to avoid large ranges, small ranges or very different ranges from one another
- Rule a thumb regarding acceptable ranges
- -3 to +3 is generally fine - any bigger bad
- -1/3 to +1/3 is ok - any smaller bad

- Can do
**mean normalization** - Take a feature x
_{i} - Replace it by (x
_{i}- mean)/max - So your values all have an average of about 0

- Replace it by (x

- Take a feature x

- Instead of max can also use standard deviation

- Focus on the learning rate (α)
- Topics
- Update rule
- Debugging
- How to chose α

- Plot min J(θ) vs. no of iterations
- (i.e. plotting J(θ) over the course of gradient descent

- If gradient descent is working then J(θ) should decrease after every iteration
- Can also show if you're not making huge gains after a certain number
- Can apply heuristics to reduce number of iterations if need be
- If, for example, after 1000 iterations you reduce the parameters by nearly nothing you could chose to only run 1000 iterations in the future
- Make sure you don't accidentally hard-code thresholds like this in and then forget about why they're their though!

- Number of iterations varies a lot
- 30 iterations
- 3000 iterations
- 3000 000 iterations
- Very hard to tel in advance how many iterations will be needed
- Can often make a guess based a plot like this after the first 100 or so iterations

- Automatic convergence tests
- Check if J(θ) changes by a small threshold or less
- Choosing this threshold is hard
- So often easier to check for a straight line
- Why? - Because we're seeing the straightness in the context of the whole algorithm
- Could you design an automatic checker which calculates a threshold based on the systems preceding progress?

- Check if J(θ) changes by a small threshold or less
- Checking its working
- If you plot J(θ) vs iterations and see the value is increasing - means you probably need a smaller α
- Cause is because your minimizing a function which looks like this

- If you plot J(θ) vs iterations and see the value is increasing - means you probably need a smaller α

- But you overshoot, so reduce learning rate so you actually reach the minimum (green line)

- So, use a smaller α

- Number of iterations varies a lot
- Another problem might be if J(θ) looks like a series of waves
- Here again, you need a smaller α

- However
- If α is small enough, J(θ) will decrease on every iteration
- BUT, if α is too small then rate is too slow
- A less steep incline is indicative of a slow convergence, because we're decreasing by less on each iteration than a steeper slope

- Typically
- Try a range of alpha values
- Plot J(θ) vs number of iterations for each version of alpha
- Go for roughly threefold increases
- 0.001, 0.003, 0.01, 0.03. 0.1, 0.3

- Choice of features and how you can get different learning algorithms by choosing appropriate features
- Polynomial regression for non-linear function

- Example
- House price prediction
- Two features
- Frontage - width of the plot of land along road (x
_{1}) - Depth - depth away from road (x
_{2})

- Frontage - width of the plot of land along road (x

- Two features
- You don't have to use just two features
**Can create new features**

- Might decide that an important feature is the land area
- So, create a new feature = frontage * depth (x
_{3}) - h(x) = θ
_{0}+ θ_{1}x_{3}- Area is a better indicator

- So, create a new feature = frontage * depth (x
- Often, by defining new features you may get a better model

- House price prediction
- Polynomial regression
- May fit the data better
- θ
_{0}+ θ_{1}x + θ_{2}x^{2}e.g. here we have a quadratic function - For housing data could use a quadratic function
- But may not fit the data so well - inflection point means housing prices decrease when size gets really big
- So instead must use a cubic function

- How do we fit the model to this data
- To map our old linear hypothesis and cost functions to these polynomial descriptions the easy thing to do is set
- x
_{1}= x - x
_{2}= x^{2} - x
_{3}= x^{3}

- x
- By selecting the features like this and applying the linear regression algorithms you can do polynomial linear regression
- Remember, feature scaling becomes even more important here

- To map our old linear hypothesis and cost functions to these polynomial descriptions the easy thing to do is set
- Instead of a conventional polynomial you could do variable ^(1/something) - i.e. square root, cubed root etc
- Lots of features - later look at developing an algorithm to chose the best features

- For some linear regression problems the normal equation provides a better solution
- So far we've been using gradient descent
- Iterative algorithm which takes steps to converse

- Normal equation solves θ analytically
- Solve for the optimum value of theta

- Has some advantages and disadvantages

- Simplified cost function
- J(θ) = aθ
^{2 }+ bθ + c- θ is just a real number, not a vector

- Cost function is a quadratic function
- How do you minimize this?
- Do
- Take derivative of J(θ) with respect to θ
- Set that derivative equal to 0
- Allows you to solve for the value of θ which minimizes J(θ)

- Do

- J(θ) = aθ
- In our more complex problems;
- Here θ is an n+1 dimensional vector of real numbers
- Cost function is a function of the vector value
- How do we minimize this function
- Take the partial derivative of J(θ) with respect θ
_{j }and set to 0 for every j - Do that and solve for θ
_{0}to θ_{n} - This would give the values of θ which minimize J(θ)

- Take the partial derivative of J(θ) with respect θ

- How do we minimize this function
- If you work through the calculus and the solution, the derivation is pretty complex
- Not going to go through here
- Instead, what do you need to know to implement this process

- Here
- m = 4
- n = 4

- To implement the normal equation
- Take examples
- Add an extra column (x
_{0}feature) - Construct a matrix (X -
**the design matrix**) which contains all the training data features in an [m x n+1] matrix - Do something similar for y
- Construct a column vector y vector [m x 1] matrix

- Using the following equation (X transpose * X) inverse times X transpose y

- If you compute this, you get the value of theta which minimize the cost function

- Have m training examples and n features
- The design matrix (X)
- Each training example is a n+1 dimensional feature column vector
- X is constructed by taking each training example, determining its transpose (i.e. column -> row) and using it for a row in the design A
- This creates an [m x (n+1)] matrix

**Vector y**- Used by taking all the y values into a column vector

- What is this equation?!
- (X
^{T}* X)^{-1} - What is this --> the inverse of the matrix (X
^{T }* X)- i.e. A = X
^{T }X - A
^{-1 }= (X^{T }X)^{-1}

- i.e. A = X

- What is this --> the inverse of the matrix (X

- (X
- In octave and MATLAB you could do;
**pinv(X'*x)*x'*y**- X' is the notation for X transpose
- pinv is a function for the inverse of a matrix

- In a previous lecture discussed feature scaling
- If you're using the normal equation then no need for feature scaling

*Gradient descent*- Need to chose learning rate
- Needs many iterations - could make it slower
- Works well even when
*n*is massive (millions)- Better suited to big data
- What is a big
*n*though- 100 or even a 1000 is still (relativity) small
- If n is 10 000 then look at using gradient descent

*Normal equation*- No need to chose a learning rate
- No need to iterate, check for convergence etc.
- Normal equation needs to compute (X
^{T }X)^{-1}- This is the inverse of an n x n matrix
- With most implementations computing a matrix inverse grows by O(n
^{3 })- So not great

- Slow of
*n*is large - Can be much slower

- Advanced concept
- Often asked about, but quite advanced, perhaps optional material
- Phenomenon worth understanding, but not probably necessary

- When computing (X
^{T }X)^{-1}* X^{T }* y)- What if (X
^{T }X) is non-invertible (singular/degenerate)- Only some matrices are invertible
- This should be quite a rare problem
- Octave can invert matrices using
- pinv (pseudo inverse)
- This gets the right value even if (X
^{T }X) is non-invertible

- This gets the right value even if (X
- inv (inverse)

- pinv (pseudo inverse)

- Octave can invert matrices using

- What does it mean for (X
^{T }X) to be non-invertible- Normally two common causes
**Redundant features**in learning model- e.g.
- x
_{1}= size in feet - x
_{2}= size in meters squared

- x

- e.g.
**Too many features**- e.g. m <= n (m is much larger than n)
- m = 10
- n = 100

- Trying to fit 101 parameters from 10 training examples
- Sometimes work, but not always a good idea
- Not enough data
- Later look at
*why*this may be too little data - To solve this we
- Delete features
- Use
**regularization**(let's you use lots of features for a small training set)

- e.g. m <= n (m is much larger than n)

- Normally two common causes
- If you find (X
^{T }X) to be non-invertible- Look at features --> are features linearly dependent?
- So just delete one, will solve problem

- Look at features --> are features linearly dependent?

- What if (X