Stanford CS229: Linear Regression and Gradient Descent


Introduction

I have recently started taking CS229 with Andrew Ng through Stanford. This is a course that can take you to the moon if done correctly. With that being said, I want to document everything I learn and take it to the next level by doing further research and gaining a deeper understanding.

Linear Regression:

In the context of machine learning, linear regression uses a set of input features (X1, X2, …, Xn) to predict a continuous output variable Y.

The equation for multiple linear regression is:

Y = θ0 + θ1*X1 + θ2*X2 + … + θn*Xn + e

Y = Output variable
θ0, θ1, …, θn = Parameters of the model
X1, X2, …, Xn = Input features
e = Error term
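
To make this concrete, here is a minimal NumPy sketch of the hypothesis hθ(x) under this model. The function name `predict` and the way I handle the intercept column are my own choices for illustration, not anything from the course materials:

```python
import numpy as np

def predict(X, theta):
    """Predicted Y = θ0 + θ1*X1 + ... + θn*Xn for each row of X (no error term)."""
    m = X.shape[0]
    X_b = np.hstack([np.ones((m, 1)), X])  # prepend a column of 1s so θ0 acts as the intercept
    return X_b @ theta                      # theta is a vector of length n + 1
```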

Cost Function:

Linear regression typically uses a cost function called Mean Squared Error (MSE) to measure the error in the model’s predictions. The goal of linear regression is to find the parameters θ that minimize the cost function.

The cost function J(θ) for linear regression is:

J(θ) = 1/(2m) * Σ (hθ(xi) - yi)^2

where:

m is the number of training examples
hθ(xi) is the predicted output for the ith training example using the current parameters θ
yi is the actual output for the ith training example
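
As a rough sketch, this is what J(θ) looks like in NumPy. I assume here that X already carries the leading column of 1s for θ0:

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(θ) = 1/(2m) * Σ (hθ(x_i) - y_i)^2, the mean squared error cost."""
    m = len(y)
    residuals = X @ theta - y                # hθ(x_i) - y_i for every training example
    return residuals @ residuals / (2 * m)   # dot product gives the sum of squared errors
```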

Gradient Descent:

Gradient Descent is an optimization algorithm used to minimize the cost function. It iteratively adjusts the parameters θ in the direction that reduces the cost function the most, until the cost function converges to a minimum.

The update rule for gradient descent is:

θj := θj - α * ∂/∂θj J(θ)

where:

α is the learning rate, determining the step size in each iteration
∂/∂θj J(θ) is the partial derivative of the cost function with respect to the parameter θj

The partial derivative of the cost function with respect to a parameter θj is:

∂/∂θj J(θ) = 1/m * Σ (hθ(xi) - yi) * xij

where xij is the jth feature of the ith training example.

Therefore, the update rule becomes:

θj := θj - α * 1/m * Σ (hθ(xi) - yi) * xij

Here, α is the learning rate, which controls how large a step we take in each iteration.
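
Here is a rough batch gradient descent loop built around this update rule. The default values for α and the number of iterations are arbitrary choices for illustration, and X is again assumed to include the column of 1s:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Repeatedly apply θj := θj - α * 1/m * Σ (hθ(x_i) - y_i) * x_ij."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m  # vector of partial derivatives ∂J/∂θj
        theta = theta - alpha * gradient      # simultaneous update of every θj
    return theta
```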

Normal Equation:

The Normal Equation is an analytical approach to linear regression that can be used to find the values of the parameters θ that minimize the cost function. Unlike gradient descent, this method does not require choosing a learning rate α, and it does not require many iterations since it directly computes the solution.

The normal equation is:

θ = (X^T * X)^-1 * X^T * y

where:

X is the matrix of input features (including a column of 1s for the intercept term θ0)
y is the vector of output variables

The cost function in matrix form is:

J(θ) = 1/2 * (Xθ - y)^T * (Xθ - y)

(The 1/m factor is dropped here, since a constant scaling does not change which θ minimizes J.)

We can differentiate this with respect to θ:

∂/∂θ J(θ) = X^T * (Xθ - y)

Setting this equal to zero, we get:

X^T * X * θ = X^T * y

Solving for θ (assuming X^T * X is invertible) gives the normal equation stated above:

θ = (X^T * X)^-1 * X^T * y
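
For completeness, here is a small NumPy sketch of solving this system. I use `np.linalg.solve` on X^T * X * θ = X^T * y rather than explicitly inverting X^T * X, which is numerically safer; the data below is made up so the result is easy to check:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T * X * θ = X^T * y for θ directly, without gradient descent."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up example where y = 1 + 2*x exactly, so θ should come out as [1, 2]
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of 1s handles the intercept θ0
y = np.array([3.0, 5.0, 7.0])
print(normal_equation(X, y))  # ≈ [1. 2.]
```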