### ML Coursera 3 - Week 3: Logistic Regression

Posted on 11/09/2018, in Machine Learning.

I first took these notes while following the machine learning course on Coursera.
Lectures in this week: Lecture 6, Lecture 7.


## Classification & Representation

### Classification

• Variable $y$ has discrete values.

• Other names for the two values of $y$: 1 (positive class), 0 (negative class) $\Rightarrow$ binary classification problem.

• If $y$ has more than 2 values, it’s called multiclass classification.

• Linear regression does not work well here because a few extreme examples can affect the fitted line much more than the others (blue line).

• $h_{\theta}$ may take values $>1$ or $<0$, but we want $0\le h_{\theta} \le 1$. That’s why we need logistic regression, whose $h_{\theta}$ always lies in $[0,1]$.

• Don’t be misled by the name: logistic regression is a classification algorithm, used when $y$ takes discrete values.

### Hypothesis representation

• What function are we going to use to represent the hypothesis?

• Logistic regression

\begin{align} h_{\theta}(x) &= g(\theta^Tx) \\ g(z) &= \dfrac{1}{1+e^{-z}}, \\ h_{\theta}(x) &= \dfrac{1}{1+e^{-\theta^Tx}} \end{align}
• They are the same: sigmoid function = logistic function = $g(z)$

• Some probabilities

\begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*}
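The hypothesis and the two probabilities above can be sketched in a few lines. This is a minimal illustration in Python (standard library only) rather than Octave; the parameter values and the helper name `hypothesis` are assumptions chosen for the example, not values from the course.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already contain x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-3.0, 1.0, 1.0]   # illustrative parameters, not fitted to any data
x = [1.0, 2.0, 2.0]        # x_0 = 1, x_1 = 2, x_2 = 2, so theta^T x = 1

p1 = hypothesis(theta, x)  # P(y = 1 | x; theta), about 0.731
p0 = 1.0 - p1              # P(y = 0 | x; theta)
print(p1, p0)
```

Note that `math.exp(-z)` can overflow for very negative `z`; a production implementation would guard against that.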

### Decision Boundary

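From the above figure, we see that $g(z) \ge 0.5$ exactly when $z \ge 0$.

With $z = \theta^Tx$ and the rule "predict $y = 1$ when $h_{\theta}(x) \ge 0.5$", we have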

\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}

The decision boundary is the line that separates the area where $y = 0$ and where $y = 1$. It is created by our hypothesis function.

An example,

The decision boundary is a property of the hypothesis and its parameter $\theta$, not of the training set. The training set is used only to fit the parameter $\theta$.
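To make the rule $\theta^Tx \ge 0 \Rightarrow y = 1$ concrete, here is a small Python sketch. The vector $\theta = (-3, 1, 1)$ is a hypothetical choice for illustration; with it, the boundary is the line $x_1 + x_2 = 3$.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (i.e. h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # hypothetical parameters: boundary x_1 + x_2 = 3

below = predict(theta, [1.0, 1.0, 1.0])     # x_1 + x_2 = 2 < 3
above = predict(theta, [1.0, 2.0, 2.0])     # x_1 + x_2 = 4 > 3
on_line = predict(theta, [1.0, 1.5, 1.5])   # x_1 + x_2 = 3, exactly on the boundary
print(below, above, on_line)  # 0 1 1
```

Only $\theta$ appears in `predict`: no training example is needed once the parameters are known.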

## Logistic regression model

### Cost function

Look back to the cost function in linear regression.

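In logistic regression, the cost of a single example is

$$\text{Cost}(h_{\theta}(x),y) = \begin{cases} -\log(h_{\theta}(x)) & \text{if } y = 1 \\ -\log(1-h_{\theta}(x)) & \text{if } y = 0 \end{cases}$$

or we can write,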

$$\text{Cost}(h_{\theta}(x),y) = -y \log(h_{\theta}(x)) - (1-y)\log(1-h_{\theta}(x))$$

The cost function is rewritten as

$$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$

Vectorization

\begin{align*} h &= g(X\theta) \\ J(\theta) &= \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align*}

• If the hypothesis is “wrong” ($h \to 1$ while $y = 0$, or $h \to 0$ while $y = 1$) then $\text{Cost}\to \infty$
• $J(\theta)$ in this form is always convex.
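The cost $J(\theta)$ above can be computed directly from its definition. The following is a minimal Python sketch with plain lists (not the Octave code used in the exercises); the tiny data set is made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1 - h)
    return total / m

# Made-up data: each row is [x_0 = 1, x_1]
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0, 0, 1]

# With theta = 0 every h equals 0.5, so J = log(2)
j_zero = cost([0.0, 0.0], X, y)

# A confidently wrong prediction (h close to 1 while y = 0) is penalized heavily
j_wrong = cost([10.0, 0.0], [[1.0, 0.0]], [0])
print(j_zero, j_wrong)
```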

### Simplified Cost Function and Gradient Descent

Review the gradient descent in linear regression.

In this logistic regression,

Repeat{ \begin{align*} \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \end{align*} (Simultaneously update all $\theta_j$) }

Notice that the above equation looks the same as the one in linear regression; the difference is the definition of $h_{\theta}$!

Vectorization

$$\theta := \theta -\frac{\alpha}{m} X^T(g(X\theta)-y)$$
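The update rule can be sketched as a short loop. This is an illustrative Python version with plain lists, assuming a made-up one-feature data set and arbitrarily chosen values for `alpha` and `iters`; it is not the course's Octave implementation.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    """(1/m) * X^T (g(X theta) - y), computed element by element."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, v in enumerate(x_i):
            grad[j] += err * v / m
    return grad

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeat theta := theta - alpha * gradient, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = gradient(theta, X, y)  # computed from the old theta
        # simultaneous update of every theta_j
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: each row is [x_0 = 1, x_1]
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if sum(t * v for t, v in zip(theta, x_i)) >= 0 else 0 for x_i in X]
print(theta, predictions)
```

Building the whole gradient before touching `theta` is what makes the update simultaneous.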

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize $\theta$ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave/Matlab provides them.

A single function that returns both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$

function [jVal, gradient] = costFunction(theta)
jVal = [...code to compute J(theta)...];
gradient = [...code to compute derivative of J(theta)...];
end


Then we can use Octave’s fminunc() optimization algorithm along with the optimset() function that creates an object containing the options we want to send to fminunc().

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);


We pass fminunc() our cost function, our initial vector of theta values, and the options object that we created beforehand.

fmincg works similarly to fminunc, but is more efficient for dealing with a large number of parameters.

## Multiclass classification: one-vs-all

$y$ can take more than the two values 0 and 1. We keep using binary classification: for each class, we treat that class as positive and group all the others into the negative class.

$y$ with $n+1$ values $\Rightarrow n+1$ binary classification problems.

\begin{align*} y &\in \lbrace 0, 1 ... n\rbrace \\ h_\theta^{(0)}(x) &= P(y = 0 | x ; \theta) \\ h_\theta^{(1)}(x) &= P(y = 1 | x ; \theta) \\ \cdots & \\ h_\theta^{(n)}(x) &= P(y = n | x ; \theta) \\ \mathrm{prediction} &= \max_i( h_\theta ^{(i)}(x) ) \end{align*}

After finding optTheta with fmincg, we compute $h_{\theta}$ for every class and pick the class with the highest probability (highest $h$). That’s why we have the prediction line above.
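The prediction step amounts to an argmax over the per-class hypotheses. Here is a hedged Python sketch; `all_theta` holds one hand-picked, hypothetical parameter vector per class (in practice each would come from the optimizer), and `predict_one_vs_all` is a name chosen for this example.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, x):
    """Return the class k whose classifier gives the highest h^(k)(x)."""
    scores = [
        sigmoid(sum(t * v for t, v in zip(theta_k, x)))
        for theta_k in all_theta
    ]
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical parameters for 3 classes; each row is [theta_0, theta_1]
all_theta = [[2.0, -1.0], [-1.0, 0.5], [-4.0, 1.0]]

c_small = predict_one_vs_all(all_theta, [1.0, 0.0])    # theta^T x = 2, -1, -4
c_medium = predict_one_vs_all(all_theta, [1.0, 4.0])   # theta^T x = -2, 1, 0
c_large = predict_one_vs_all(all_theta, [1.0, 10.0])   # theta^T x = -8, 4, 6
print(c_small, c_medium, c_large)  # 0 1 2
```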

• Why max?

We want to choose a $\Theta$ such that for all $j\in \{ 0,\ldots,n \}$,

Don’t forget that we consider $h_{\theta} \ge 0.5$ as true. Because of that, there is only one option: the maximum over all $j$.

See ex 4 for an example in practice.

## Solving the problem of overfitting

### The problem of overfitting

When we have many features, $h_{\theta}$ may fit the training set very well ($J(\theta) \simeq 0$) but fail to generalize to new examples.

### Cost function

Options to solve:

• Reduce the number of features
  • Manually select which features to keep
  • Use a model selection algorithm
• Keep all features but reduce the magnitude/values of the parameters $\theta_j$ (regularization)
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$

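Because we minimize the cost, we can add penalty terms such as $1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$ to it. Any sizeable $\theta_3$ or $\theta_4$ then makes the cost very large, so the minimum is reached with $\theta_3, \theta_4 \approx 0$, i.e. those features almost disappear from the hypothesis. Penalizing all the parameters $\theta_1, \ldots, \theta_n$ in the same way gives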

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$

If $\lambda$ is too large, the problem of underfitting occurs!
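The regularized cost above can be written out term by term. The sketch below is a Python illustration with a made-up data set; `lam` stands for $\lambda$, and the sum over `theta[1:]` reflects that the penalty starts at $j = 1$.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/(2m)) * [sum((h - y)^2) + lam * sum(theta_j^2 for j >= 1)]."""
    m = len(y)
    sq_err = sum(
        (sum(t * v for t, v in zip(theta, x_i)) - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (sq_err + penalty) / (2 * m)

# Made-up data that theta = [0, 1] fits exactly (h(x) = x_1)
X = [[1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0]
theta = [0.0, 1.0]

j_plain = regularized_cost(theta, X, y, lam=0.0)  # perfect fit, no penalty
j_reg = regularized_cost(theta, X, y, lam=4.0)    # (0 + 4 * 1^2) / (2 * 2)
print(j_plain, j_reg)  # 0.0 1.0
```

Even a perfect fit has a non-zero cost once $\lambda > 0$, which is why a very large $\lambda$ pushes every $\theta_j$ towards 0 and leads to underfitting.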

### Regularized linear regression

See again GD in linear regression, multiple variables and logistic regression.

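Repeat{ \begin{align*} \theta_0 &:= \theta_0 - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \theta_j &:= \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right], \quad j \in \lbrace 1, \ldots, n \rbrace \end{align*} }

The update for $\theta_j$ can be rewritten as

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

Intuitively, the first term shrinks $\theta_j$ by a small amount on every update (since $1 - \alpha\frac{\lambda}{m} < 1$), while the second term is exactly the same as it was before.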

#### Normal equation

See again normal equation linear regression.
\begin{align} \theta &= (X^TX + \lambda \cdot L)^{-1} X^T y \\ L &= \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}_{(n+1)\times (n+1)} \end{align}
• $X$: $m\times (n+1)$ matrix
• $m$ training examples, $n$ features.
• $L$ has a 0 in its top-left entry because we don’t regularize $\theta_0$ (the parameter of $x_0$).
• If $m<n$ then $X^TX$ is non-invertible, but for $\lambda > 0$ the matrix $X^TX + \lambda\cdot L$ is invertible!
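A small worked example of the last point, with numbers made up for illustration: take $m = 1$ training example and $n = 2$ features, with $X = \begin{bmatrix} 1 & 1 & 2 \end{bmatrix}$.

\begin{align*} X^TX &= \begin{bmatrix} 1 & 1 & 2 \\ 1 & 1 & 2 \\ 2 & 2 & 4 \end{bmatrix}, \quad \det(X^TX) = 0 \text{ (all rows are multiples of the first)} \\ X^TX + \lambda\cdot L &= \begin{bmatrix} 1 & 1 & 2 \\ 1 & 1+\lambda & 2 \\ 2 & 2 & 4+\lambda \end{bmatrix} \end{align*}

Subtracting the first row from the second, and twice the first row from the third, gives an upper triangular matrix with diagonal $1, \lambda, \lambda$, so $\det(X^TX + \lambda\cdot L) = \lambda^2$, which is non-zero for any $\lambda > 0$.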

### Regularized logistic regression

See again cost function for logistic regression.

We can regularize this equation by adding a term to the end:

$$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] + \dfrac{\lambda}{2m}\sum_{j=1}^n \theta_j^2.$$

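Repeat{ \begin{align*} \theta_0 &:= \theta_0 - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \\ \theta_j &:= \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m}\theta_j \right], \quad j \in \lbrace 1, \ldots, n \rbrace \end{align*} }

This has the same form as GD for regularized linear regression; the only difference in this case is the definition of $h_{\theta}(x)$.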
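One regularized update for logistic regression can be sketched as follows. This Python version is an illustration only: the function names, the single made-up training example, and the values of `alpha` and `lam` are all assumptions. The gradient adds $\frac{\lambda}{m}\theta_j$ for $j \ge 1$ and leaves $\theta_0$ unregularized.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized logistic cost; theta_0 gets no penalty."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
        for j, v in enumerate(x_i):
            grad[j] += err * v / m
    for j in range(1, len(theta)):  # skip j = 0
        grad[j] += lam / m * theta[j]
    return grad

def gradient_step(theta, X, y, alpha, lam):
    """One simultaneous update theta := theta - alpha * gradient."""
    grad = regularized_gradient(theta, X, y, lam)
    return [t - alpha * g for t, g in zip(theta, grad)]

# Made-up example: one training point with theta^T x = 0, so h = 0.5
X = [[1.0, 2.0]]
y = [1]
theta = [1.0, -0.5]

grad = regularized_gradient(theta, X, y, lam=2.0)          # [-0.5, -2.0]
new_theta = gradient_step(theta, X, y, alpha=0.1, lam=2.0)
print(grad, new_theta)
```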

## Programming exercise: Logistic Regression

See again How to submit?.

### Logistic Regression

• plotData: plot the points of X with a different marker for each of the two classes in y

XPos = X(y==1, :);
XNeg = X(y==0, :);
plot(XPos(:,1), XPos(:,2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(XNeg(:,1), XNeg(:,2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);

• sigmoid.m: recall that $g(z) = \dfrac{1}{1+e^{-z}}$,

g = 1 ./ (1 + exp(-z));

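• costFunction.m: recall that the cost function in logistic regression is

$$J(\theta) = - \frac{1}{m} \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$

or in vectorization,

$$J(\theta) = \frac{1}{m} \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right), \quad h = g(X\theta),$$

and its gradient is

$$\dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$

or in vectorization,

$$\nabla J(\theta) = \dfrac{1}{m} X^T(g(X\theta) - y).$$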
h = sigmoid(X*theta); % hypothesis
J = 1/m * ( -y' * log(h) - (1-y)' * log(1-h) );
grad = 1/m * X' * ( h - y);

• fminunc:
• Set the GradObj option to on, which tells fminunc that our function returns both the cost and the gradient

  options = optimset('GradObj', 'on', 'MaxIter', 400);

• Notice that by using fminunc, you did not have to write any loops yourself, or set a learning rate like you did for gradient descent.

• predict.m: remember that we predict $y = 1$ when $h_{\theta}(x) \ge 0.5$ and $y = 0$ otherwise,

h = sigmoid(X*theta); % m x 1
p = (h >= 0.5);


### Regularized logistic regression

costFunctionReg.m: recall that,

\begin{align} \dfrac{\partial J(\theta)}{\partial \theta_0} &= \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}, \text{ for } j=0 \\ \dfrac{\partial J(\theta)}{\partial \theta_j} &= \left( \dfrac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \dfrac{\lambda}{m}\theta_j, \text{ for } j\ge 1 \end{align}
h = sigmoid(X*theta); % hypothesis