Stanford CS229: Linear Regression and Gradient Descent
Introduction
I recently started taking CS229 with Andrew Ng through Stanford. This is a course that can take you to the moon if done correctly. With that in mind, I want to document everything I learn and take it a step further by doing additional research to gain a deeper understanding.
Linear Regression:
In the context of machine learning, linear regression uses a set of input features (X1, X2, …, Xn) to predict a continuous output variable Y.
The equation for multiple linear regression is:
Y = θ0 + θ1X1 + θ2X2 + … + θnXn + e
Y = Output variable
θ0, θ1, …, θn = Parameters of the model
X1, X2, …, Xn = Input features
e = Error term
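To ground the notation, here is a minimal NumPy sketch of the hypothesis; the feature values and parameters are hypothetical, and a leading column of ones in X absorbs the intercept θ0:

```python
import numpy as np

# Hypothetical data: m = 4 training examples, n = 2 features.
# The leading column of ones corresponds to the intercept theta_0.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 0.5],
              [1.0, 4.0, 1.5],
              [1.0, 3.0, 2.0]])
theta = np.array([0.5, 1.0, -0.2])  # [theta_0, theta_1, theta_2]

# h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n, vectorized as X @ theta
predictions = X @ theta
print(predictions)
```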
Cost Function:
Linear regression typically uses a cost function called Mean Squared Error (MSE) to measure the error in the model’s predictions. The goal of linear regression is to find the parameters θ that minimize the cost function.
The cost function J(θ) for linear regression is:
J(θ) = 1/(2m) * Σ (hθ(xi) - yi)^2
where:
m is the number of training examples
hθ(xi) is the predicted output for the ith training example using the current parameters θ
yi is the actual output for the ith training example
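A small NumPy sketch of this cost function, assuming X carries a leading column of ones and theta is the current parameter vector:

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2), the MSE cost above."""
    m = len(y)
    residuals = X @ theta - y   # h_theta(x_i) - y_i for each training example
    return (residuals @ residuals) / (2 * m)
```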
Gradient Descent:
Gradient Descent is an optimization algorithm used to minimize the cost function. It iteratively adjusts the parameters θ in the direction that reduces the cost function the most, until the cost function converges to a minimum.
The update rule for gradient descent is:
θj := θj - α * ∂/∂θj J(θ)
where:
α is the learning rate, determining the step size in each iteration
∂/∂θj J(θ) is the partial derivative of the cost function with respect to the parameter θj

For linear regression, this partial derivative works out to:
∂/∂θj J(θ) = 1/m * Σ (hθ(xi) - yi) * xij

where xij is the jth feature of the ith training example.
Therefore, the update rule becomes:
θj := θj - α * 1/m * Σ (hθ(xi) - yi) * xij
This update must be applied simultaneously for all parameters θj, and the process repeats until the cost converges.
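To make the loop concrete, here is a minimal batch gradient descent sketch in NumPy; the default learning rate and iteration count are arbitrary illustrative choices, not values from the course:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    Each iteration applies
        theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
    simultaneously for all j, via the vectorized gradient X^T (X theta - y) / m.
    """
    m = len(y)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # (1/m) * Σ (hθ(xi) - yi) * xij for all j
        theta = theta - alpha * gradient
    return theta
```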
Normal Equation:
The Normal Equation is an analytical approach to linear regression that finds the values of the parameters θ that minimize the cost function in closed form. Unlike gradient descent, this method does not require choosing a learning rate α, and it does not iterate; it computes the solution directly.
The normal equation is:
θ = (X^T * X)^-1 * X^T * y
where:
X is the matrix of input features
y is the vector of output variables

The cost function in matrix form is:
J(θ) = 1/2 * (Xθ - y)^T * (Xθ - y)
We can differentiate this with respect to θ:
∂/∂θ J(θ) = X^T * (Xθ - y)
Setting this equal to zero, we get:
X^T * X * θ = X^T * y

Solving for θ (assuming X^T * X is invertible) recovers the normal equation:

θ = (X^T * X)^-1 * X^T * y
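As a quick sketch of how this looks in code (again assuming X includes a column of ones for the intercept), np.linalg.solve is preferable to explicitly inverting X^T * X, which can be numerically unstable:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta.

    Uses np.linalg.solve instead of forming an explicit inverse,
    which is both faster and more numerically stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

If X^T * X is singular (for example, with redundant features or more features than examples), np.linalg.pinv(X) @ y gives the minimum-norm least-squares solution instead.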