- We will learn about
- State of the art
- How to do the implementation

- Applications of machine learning include
- Search
- Photo tagging
- Spam filters

- The AI dream of building machines as intelligent as humans
- Many people believe the best way to do that is to mimic how humans learn

- What the course covers
- Learn about state of the art algorithms
- But the algorithms and math alone are no good
- Need to know how to get these to work in problems

- Why is ML so prevalent?
- Grew out of AI
- Build intelligent machines
- You can program a machine how to do some simple thing
- For the most part hard-wiring AI is too difficult

- Best way to do it is to have some way for machines to learn things themselves
- A mechanism for learning - if a machine can learn from input then it does the hard work for you

- Database mining
- Machine learning has recently become so big partly because of the huge amount of data being generated
- Large datasets from growth of automation and the web
- Sources of data include
- Web data (click-stream or click through data)
- Mine to understand users better
- Huge segment of Silicon Valley

- Medical records
- Electronic records -> turn records into knowledge

- Biological data
- Gene sequences, ML algorithms give a better understanding of human genome

- Engineering info
- Data from sensors, log reports, photos etc

- Applications that we cannot program by hand
- Autonomous helicopter
- Handwriting recognition
- This is very inexpensive - when you write an address on an envelope, algorithms can read it and automatically route the envelope through the post

- Natural language processing (NLP)
- AI pertaining to language

- Computer vision
- AI pertaining to vision

- Self customizing programs
- Netflix
- Amazon
- iTunes genius
- Take users' info
- Learn based on your behavior

- Understand human learning and the brain
- If we can build systems that mimic (or try to mimic) how the brain works, this may push our own understanding of the associated neurobiology

- Here we...
- Define what it is
- When to use it

- There isn't a well defined definition
- Couple of examples of how people have tried to define it

- Arthur Samuel (1959)
- *Machine learning:* "Field of study that gives computers the ability to learn without being explicitly programmed"
- Samuel wrote a checkers playing program
- Had the program play 10000 games against itself
- Work out which board positions were good and bad depending on wins/losses

- Tom Mitchell (1998)
- **Well posed learning problem:** "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- The checkers example
- E = the experience of playing 10,000s of games
- T is playing checkers
- P is whether you win the next game or not

- Several types of learning algorithms
- **Supervised learning** - Teach the computer how to do something, then let it use its new-found knowledge to do it

- **Unsupervised learning** - Let the computer learn how to do something, and use this to determine structure and patterns in data

- Reinforcement learning
- Recommender systems

- This course
- Look at practical advice for applying learning algorithms
- Learning a set of tools and **how** to apply them

- Probably the most common problem type in machine learning
- Starting with an example
- How do we predict housing prices
- Collect data regarding housing prices and how they relate to size in feet

- __Example problem:__ "Given this data, a friend has a house 750 square feet - how much can they be expected to get?"
- What approaches can we use to solve this?
- Straight line through data
- Maybe $150 000

- Second order polynomial
- Maybe $200 000

- One thing we discuss later - how to choose a straight or curved line?
- Each of these approaches represent a way of doing supervised learning

- *What does this mean?*
- We gave the algorithm a data set where a "right answer" was provided
- So we know actual prices for houses
- The idea is we can learn what makes the price a certain value from the **training data**
- The algorithm should then produce more right answers based on new data where we don't know the price already
- i.e. predict the price

- We also call this a **regression problem**
- Predict continuous valued output (price)
- No real discrete delineation

- Another example
- Can we determine whether a breast tumour is malignant or benign based on tumour size?

- Looking at data
- Five of each
- Can you estimate prognosis based on tumor size?
- This is an example of a **classification problem**
- Classify data into one of two discrete classes - no in between, either malignant or not
- In classification problems, can have a discrete number of possible values for the output
- e.g. maybe have four values
- 0 - benign
- 1 - type 1
- 2 - type 2
- 3 - type 3

- In classification problems we can plot data in a different way

- Use only one attribute (size)
- In other problems may have multiple attributes
- We may also, for example, know age and tumor size

- Based on that data, you can try and define separate classes by
- Drawing a straight line between the two groups
- Using a more complex function to define the two groups (which we'll discuss later)
- Then, when you have an individual with a specific tumor size and who is a specific age, you can hopefully use that information to place them into one of your classes

- You might have many features to consider
- Clump thickness
- Uniformity of cell size
- Uniformity of cell shape

- The most exciting algorithms can deal with an infinite number of features
- How do you deal with an infinite number of features?
- Neat mathematical trick in support vector machine (which we discuss later)
- If you have an infinitely long list - we can develop an algorithm to deal with that

- **Summary**
- Supervised learning: we give the algorithm a data set with the "right answers" included
- Regression problem
- Classification problem

- Second major problem type
- In unsupervised learning, we get unlabeled data
- Just told - here is a data set, can you find structure in it?

- One way of doing this would be to cluster data into groups
- This is a **clustering algorithm**

- Example of clustering algorithm
- Google news
- Groups news stories into cohesive groups

- Used in many other problems as well
- Genomics
- Microarray data
- Have a group of individuals
- For each, measure the expression of a gene
- Run algorithm to cluster individuals into types of people

- Organize computer clusters
- Identify potential weak spots or distribute workload effectively

- Social network analysis
- Customer data

- Astronomical data analysis
- Algorithms give amazing results

- Basically
- Can you automatically generate structure
- Because we don't give it the answer, it's unsupervised learning

- Cocktail party problem
- Lots of overlapping voices - hard to hear what everyone is saying
- Two people talking
- Microphones at different distances from speakers

- Record slightly different versions of the conversation depending on where your microphone is
- But overlapping nonetheless

- Have recordings of the conversation from each microphone
- Give them to a cocktail party algorithm
- Algorithm processes audio recordings
- Determines there are two audio sources
- Separates out the two sources

- Is this a very complicated problem?
- Algorithm can be done with one line of code!
- **[W,s,v] = svd((repmat(sum(x.*x,1), size(x,1),1).*x)*x');**
- Not easy to identify
- But, programs can be short!
- Using Octave (or MATLAB) for examples
- Often prototype algorithms in Octave/MATLAB to test as it's very fast
- Only once it's shown to work, migrate it to C++
- Gives a much faster, more agile development

- Understanding this algorithm
- **svd** - singular value decomposition, a linear algebra routine built into Octave
- In C++ this would be very complicated!

- Shown that using MATLAB to prototype is a really good way to do this
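For readers without Octave, the one-liner above can be roughly sketched in NumPy. The microphone data and mixing matrix here are made up for illustration, and this shows only the SVD step from the one-liner, not a full source-separation pipeline:

```python
import numpy as np

# Hypothetical two-microphone setup: x is a (2, n_samples) array of recordings.
rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))        # two stand-in source signals
A = np.array([[1.0, 0.5], [0.5, 1.0]])    # mixing matrix: each mic hears both sources
x = A @ s                                 # the recorded mixtures

# Rough NumPy translation of the Octave line
# [W,s,v] = svd((repmat(sum(x.*x,1), size(x,1),1).*x)*x');
weighted = np.tile(np.sum(x * x, axis=0), (x.shape[0], 1)) * x
W, sv, Vt = np.linalg.svd(weighted @ x.T)

print(W.shape, sv.shape)  # (2, 2) (2,)
```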

- Housing price data example used earlier
- Supervised learning regression problem

- What do we start with?
- Training set (this is your data set)
- Notation (*used throughout the course*)
- m = number of **training examples**
- x's = input variables / features
- y's = output variable / "target" variables
- (x, y) - a single training example
- (x^{i}, y^{i}) - a specific example (the i^{th} training example)
- i is an index into the training set

- With our training set defined - how do we use it?
- Take training set
- Pass into a learning algorithm
- Algorithm outputs a function (denoted *h*) (h = **hypothesis**)
- This function takes an input (e.g. size of new house)
- Tries to output the estimated value of Y

- How do we represent hypothesis *h*?
- Going to represent h as:
- h_{θ}(x) = θ_{0} + θ_{1}x
- h(x) (shorthand)

- What does this mean?
- Means Y is a linear function of x!
- θ_{i} are **parameters**
- θ_{0} is the zero condition (intercept)
- θ_{1} is the gradient (slope)

- This kind of function is a linear regression with one variable
- Also called **univariate linear regression**

- So in summary
- A hypothesis takes in some variable
- Uses parameters determined by a learning system
- Outputs a prediction based on that input
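As a minimal sketch, the hypothesis above is just a linear function; the parameter values below are illustrative, not learned:

```python
# Hypothesis for one-variable linear regression: h_theta(x) = theta_0 + theta_1 * x
def h(x, theta0, theta1):
    """Predict y from input x; theta0 is the intercept, theta1 the slope."""
    return theta0 + theta1 * x

# e.g. a 750 sq ft house with made-up parameters theta0 = 50, theta1 = 0.2
print(h(750, 50.0, 0.2))  # 200.0
```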

- A cost function lets us figure out how to fit the best straight line to our data
- Choosing values for θ_{i} (parameters)
- Different values give you different functions
- If θ_{0} is 1.5 and θ_{1} is 0 then we get a horizontal line at y = 1.5
- If θ_{1} > 0 then we get a positive slope

- Based on our training set we want to generate parameters which make the straight line fit the data
- Choose these parameters so h_{θ}(x) is close to y for our training examples
- Basically, use the x's in the training set with h_{θ}(x) to give an output which is as close to the actual y value as possible
- Think of h_{θ}(x) as a "y imitator" - it tries to convert the x into y, and considering we already have y we can evaluate how well h_{θ}(x) does this

- To formalize this:
- We want to solve a **minimization problem**
- Minimize (h_{θ}(x) - y)^{2}
- i.e. minimize the difference between h(x) and y for each/any/every example

- Sum this over the training set

- Minimize the squared difference between predicted house price and actual house price
- 1/2m
- 1/m - means we determine the average
- 1/2m the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the smallest value is still the smallest value!)

- Minimizing over θ_{0}/θ_{1} means we get the values of θ_{0} and θ_{1} which on average give the minimal deviation of h(x) from y when we use those parameters in our hypothesis function

- More cleanly, this is a cost function

- And we want to minimize this cost function
- Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time

- **So to recap**
- **Hypothesis** - is like your prediction machine: throw in an *x* value, get a putative *y* value
- **Cost** - is a way to, using your training data, determine values for your θ parameters which make the hypothesis as accurate as possible
- This cost function is also called the squared error cost function
- This cost function is a reasonable choice for most regression functions
- Probably most commonly used function

- In case J(θ_{0}, θ_{1}) is a bit abstract, going into what it does, why it works and how we use it in the coming sections
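A small sketch of the squared error cost function, using a made-up training set whose points lie exactly on y = x:

```python
# J(theta0, theta1) = (1/2m) * sum over the training set of (h(x) - y)^2
def J(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]    # toy data lying exactly on y = x
print(J(0, 1, xs, ys))    # 0.0 - the line y = x fits perfectly
print(J(0, 0.5, xs, ys))  # ≈ 0.583 - a worse fit costs more
```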

- Lets consider some intuition about the cost function and why we want to use it
- The cost function determines parameters
- The value associated with the parameters determines how your hypothesis behaves, with different values generating different hypotheses

- Simplified hypothesis
- Assumes θ_{0} = 0

- Cost function and goal here are very similar to when we have θ_{0}, but with a simpler parameter
- Simplified hypothesis makes visualizing cost function J() a bit easier

- So the hypothesis passes through (0,0)
- Two key functions we want to understand
- h_{θ}(x)
- Hypothesis is a function of x - a function of the size of the house

- J(θ_{1})
- Is a function of the parameter θ_{1}

- So for example
- θ_{1} = 1
- J(θ_{1}) = 0

- Plot
- θ_{1} vs J(θ_{1})
- Data
- 1) θ_{1} = 1, J(θ_{1}) = 0
- 2) θ_{1} = 0.5, J(θ_{1}) = ~0.58
- 3) θ_{1} = 0, J(θ_{1}) = ~2.3

- If we compute a range of values and plot J(θ_{1}) vs θ_{1} we get a polynomial (looks like a quadratic)

- The optimization objective for the learning algorithm is to find the value of θ_{1} which minimizes J(θ_{1})
- So, here θ_{1} = 1 is the best value for θ_{1}

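The J(θ_{1}) values above can be checked with a quick sweep, assuming the toy data set {(1,1), (2,2), (3,3)} from the plot and holding θ_{0} = 0:

```python
# Cost as a function of theta_1 alone (theta_0 fixed at 0)
def J(theta1, data):
    m = len(data)
    return sum((theta1 * x - y) ** 2 for x, y in data) / (2 * m)

data = [(1, 1), (2, 2), (3, 3)]
for t in (1.0, 0.5, 0.0):
    print(t, round(J(t, data), 2))
# theta_1 = 1 gives 0, theta_1 = 0.5 gives ~0.58, theta_1 = 0 gives ~2.33
```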

- Assume you're familiar with contour plots or contour figures
- Using same cost function, hypothesis and goal as previously
- It's OK to skip parts of this section if you don't understand contour plots

- Using our original complex hypothesis with two variables,
- So cost function is
- J(θ_{0}, θ_{1})

- Example,
- Say
- θ_{0} = 50
- θ_{1} = 0.06

- Previously we plotted our cost function by plotting
- θ_{1} vs J(θ_{1})

- Now we have two parameters
- Plot becomes a bit more complicated
- Generates a 3D surface plot where the axes are
- X = θ_{1}
- Z = θ_{0}
- Y = J(θ_{0}, θ_{1})

- We can see that the height (y) indicates the value of the cost function, so find where y is at a minimum
- Instead of a surface plot we can use **contour figures/plots**
- Set of ellipses in different colors
- Each colour is the same value of J(θ_{0}, θ_{1}), but obviously plots to different locations because θ_{1} and θ_{0} will vary
- Imagine a bowl-shaped function coming out of the screen so the middle is the concentric circles

- Each point (like the red one above) represents a pair of parameter values for θ_{0} and θ_{1}
- Our example here put the values at
- θ_{0} = ~800
- θ_{1} = ~-0.15

- Not a good fit
- i.e. these parameters give a value on our contour plot far from the center

- If we have
- θ_{0} = ~360
- θ_{1} = 0
- This gives a better hypothesis, but still not great - not in the center of the contour plot

- Finally we find the minimum, which gives the best hypothesis

- Doing this by eye/hand is a pain in the ass
- What we really want is an efficient algorithm for finding the minimum for θ_{0} and θ_{1}

- Minimize cost function J
- Gradient descent
- Used all over machine learning for minimization

- Start by looking at a general J() function
- Problem
- We have J(θ_{0}, θ_{1})
- We want to get **min J(θ_{0}, θ_{1})**

- Gradient descent applies to more general functions
- J(θ_{0}, θ_{1}, θ_{2} .... θ_{n})
- min J(θ_{0}, θ_{1}, θ_{2} .... θ_{n})

- Start with initial guesses
- Start at 0,0 (or any other value)
- Keep changing θ_{0} and θ_{1} a little bit to try and reduce J(θ_{0}, θ_{1})

- Each time you change the parameters, you select the gradient which reduces J(θ_{0}, θ_{1}) the most possible
- Do so until you converge to a local minimum
- Has an interesting property
- Where you start can determine which minimum you end up in
- Here we can see one initialization point led to one local minimum
- The other led to a different one

- Do the following until convergence

- What does this all mean?
- Update θ_{j} by setting it to θ_{j} minus α times the partial derivative of the cost function with respect to θ_{j}

- Notation
- :=
- Denotes assignment
- NB a = b is a *truth assertion*

- α (alpha)
- Is a number called the **learning rate**
- If α is big have an aggressive gradient descent
- If α is small take tiny steps

- Derivative term
- Not going to talk about it now, derive it later

- There is a subtlety about how this gradient descent algorithm is implemented
- Do this for θ_{0} and θ_{1}
- For j = 0 and j = 1 this means we **simultaneously** update both
- How do we do this?
- Compute the right hand side for both θ_{0} and θ_{1}
- So we need a temp value

- Then, update θ_{0} and θ_{1} at the same time

- If you implement the non-simultaneous update it's not gradient descent, and will behave weirdly
- But it might look sort of right - so it's important to remember this!
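A sketch of the simultaneous update, with the partial derivatives passed in as placeholder functions (the names `dJ_d0` and `dJ_d1` are assumptions for illustration):

```python
# One gradient descent step that updates both parameters simultaneously
def step(theta0, theta1, alpha, dJ_d0, dJ_d1):
    # Compute BOTH right hand sides before assigning - hence the temp values
    temp0 = theta0 - alpha * dJ_d0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_d1(theta0, theta1)
    return temp0, temp1

# e.g. on J = theta0^2 + theta1^2, whose partials are 2*theta0 and 2*theta1
t0, t1 = step(1.0, 2.0, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b)
print(t0, t1)  # 0.8 1.6
```

Updating one parameter and then computing the other's derivative from the already-updated value would give the non-simultaneous variant warned about above.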

- To understand gradient descent, we'll return to a simpler function where we minimize one parameter to help explain the algorithm in more detail
- min over θ_{1} of J(θ_{1}), where θ_{1} is a real number

- Two key terms in the algorithm
- Alpha
- Derivative term

- Notation nuances
- Partial derivative vs. derivative
- Use partial derivative when we have multiple variables but only differentiate with respect to one
- Use derivative when we are differentiating a function of a single variable

- Derivative term

- Derivative says
- Let's take the tangent at the point and look at the slope of the line
- If the slope is negative (we're to the left of the minimum), the derivative term is negative; alpha is always positive, so the update makes θ_{1} bigger, moving it towards the minimum
- Similarly, if the slope is positive (we're to the right of the minimum), the update makes θ_{1} smaller, again moving towards the minimum

- Alpha term (α)
- What happens if alpha is too small or too large
- Too small
- Take baby steps
- Takes too long

- Too large
- Can overshoot the minimum and fail to converge

- When you get to a local minimum
- Gradient of tangent/derivative is 0
- So derivative term = 0
- alpha * 0 = 0
- So θ_{1} = θ_{1} - 0
- So θ_{1} remains the same

- As you approach the global minimum the derivative term gets smaller, so your update gets smaller, even with alpha fixed
- Means as the algorithm runs you take smaller steps as you approach the minimum
- So no need to change alpha over time
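This shrinking-step behaviour can be seen on a toy convex function J(θ) = θ² (derivative 2θ, minimum at 0), with alpha held fixed throughout:

```python
# Gradient descent on J(theta) = theta^2 with a fixed learning rate
def descend(theta, alpha=0.1, iters=50):
    steps = []
    for _ in range(iters):
        update = alpha * (2 * theta)   # alpha * dJ/dtheta
        steps.append(abs(update))
        theta = theta - update
    return theta, steps

theta, steps = descend(5.0)
print(steps[0] > steps[-1])  # True - steps shrink even though alpha never changes
print(abs(theta) < 1e-3)     # True - converged near the minimum at 0
```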

- Apply gradient descent to minimize the squared error cost function J(θ_{0}, θ_{1})
- Now we have a partial derivative

- So here we're just expanding out the first expression
- J(θ_{0}, θ_{1}) = 1/2m ....
- h_{θ}(x) = θ_{0} + θ_{1}x

- So we need to determine the derivative for each parameter - i.e.
- When j = 0
- When j = 1

- Figure out what this partial derivative is for the θ_{0} and θ_{1} case
- When we differentiate this expression for j = 0 and j = 1 we get the following

- To check this you need to know multivariate calculus
- So we can plug these values back into the gradient descent algorithm
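Plugging the two derivatives into the update rule gives batch gradient descent for linear regression, sketched here with a toy data set (the alpha and iteration count are arbitrary choices, not values from the course):

```python
# Batch gradient descent for h(x) = theta0 + theta1*x, using
# dJ/dtheta0 = (1/m) * sum(h(x) - y) and dJ/dtheta1 = (1/m) * sum((h(x) - y) * x)
def fit(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errs) / m
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1  # simultaneous update
    return t0, t1

t0, t1 = fit([1, 2, 3], [1, 2, 3])
print(t0, t1)  # converges to approximately (0, 1): the perfect-fit line y = x
```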

- How does it work
- Risk of reaching different local optima
- The linear regression cost function is always a **convex function** - always has a single minimum
- Bowl shaped
- One global optimum
- So gradient descent will always converge to the global optimum

- In action
- Initialize values to
- θ_{0} = 900
- θ_{1} = -0.1

- End up at a global minimum
- This is actually **Batch Gradient Descent**
- Refers to the fact that at each step you look at all the training data
- Each step computes over m training examples

- Sometimes non-batch versions exist, which look at small data subsets
- We'll look at other forms of gradient descent (to use when m is too large) later in the course

- There exists a numerical solution for finding the minimum directly
- **Normal equations** method
- Gradient descent scales better to large data sets though
- Used in lots of contexts in machine learning

- **1) Normal equation for numeric solution**
- To solve the minimization problem [ min J(θ_{0}, θ_{1}) ] exactly, we can use a numeric method which avoids the iterative approach used by gradient descent
- Has advantages and disadvantages
- Advantage
- No longer an alpha term
- Can be much faster for some problems

- Disadvantage
- Much more complicated

- Advantage
- We discuss the normal equation in the **linear regression with multiple features** section
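As a preview, the normal equation solves for θ in closed form via θ = (XᵀX)⁻¹Xᵀy - the formula and the NumPy sketch below are ahead of where the notes have got to, and the data is made up:

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([1.0, 2.0, 3.0])                # toy data on the line y = x

X = np.column_stack([np.ones_like(xs), xs])   # column of 1s pairs with theta_0
theta = np.linalg.solve(X.T @ X, X.T @ ys)    # solve rather than invert, for stability
print(theta)  # ≈ [0. 1.] - no alpha, no iterations needed
```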

- **2) We can learn with a larger number of features**
- So may have other features which contribute towards the price
- e.g. with houses
- Size
- Age
- Number of bedrooms
- Number of floors

- x1, x2, x3, x4

- e.g. with houses
- With multiple features becomes hard to plot
- Can't really plot in more than 3 dimensions
- Notation becomes more complicated too
- Best way to deal with this is the notation of linear algebra
- Gives notation and set of things you can do with matrices and vectors
- e.g. Matrix

- We see here this matrix shows us
- Size
- Number of bedrooms
- Number floors
- Age of home

- All in one variable
- Block of numbers, take all data organized into one big block

- Vector
- Shown as *y*
- Shows us the prices

- Need linear algebra for more complex linear regression models
- Linear algebra is good for making computationally efficient models (as seen later too)
- Provides a good way to work with large data sets
- Typically vectorization of a problem is a common optimization technique
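As a small illustration of vectorization, predicting prices for several houses at once becomes a single matrix-vector product (the feature values and parameters below are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2104.0],   # each row: [1, size in feet] for one house
              [1.0, 1416.0],
              [1.0, 852.0]])
theta = np.array([50.0, 0.1])  # hypothetical parameters [theta_0, theta_1]

# Loop version: one prediction at a time
preds_loop = [theta[0] + theta[1] * row[1] for row in X]

# Vectorized version: all predictions in one matrix-vector product
preds_vec = X @ theta

print(np.allclose(preds_vec, preds_loop))  # True - same result, one operation
```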