CS229 Problem Set #1: Supervised Learning


Introduction

In this blog post, I will delve into the first problem set of the CS229 course on Supervised Learning. This problem set focuses on linear classifiers, specifically logistic regression and Gaussian discriminant analysis (GDA). I will explore the concepts, assumptions, and strengths and weaknesses of these two algorithms. Additionally, I will discuss Poisson regression and locally weighted linear regression.

Problem 1: Linear Classifiers (Logistic Regression and GDA)

The first problem in the CS229 problem set deals with logistic regression and Gaussian discriminant analysis (GDA) as linear classifiers. We are given two datasets, and our task is to perform binary classification on them using logistic regression and GDA.

Part (a): Logistic Regression

We are asked to find the Hessian of the average empirical loss function for logistic regression and show that it is positive semi-definite. The average empirical loss for logistic regression is defined as:

J(θ) = -1/m * Σ [y(i)log(hθ(x(i))) + (1-y(i))log(1-hθ(x(i)))],

where y(i) ∈ {0, 1}, hθ(x) = g(θ^T * x), and g(z) = 1/(1 + exp(-z)).

To show that the Hessian is positive semi-definite, we need to prove that z^T H z ≥ 0 for any vector z. We can start by showing that Σ_i Σ_j z_i x_i x_j z_j = (x^T z)^2 ≥ 0, and use the fact that g′(z) = g(z)(1 − g(z)).
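Sketching the argument with the definitions above, the Hessian works out to a weighted sum of outer products, which gives positive semi-definiteness directly:

H = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)}) \bigl(1 - h_\theta(x^{(i)})\bigr) \, x^{(i)} (x^{(i)})^T,
\qquad
z^T H z = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)}) \bigl(1 - h_\theta(x^{(i)})\bigr) \bigl(z^T x^{(i)}\bigr)^2 \ge 0,

since 0 < h_θ(x) < 1 and squares are non-negative.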

Part (b): Logistic Regression Implementation

The next part of the problem involves implementing logistic regression using Newton’s Method. I need to train the logistic regression classifier using the provided training dataset and write the model’s predictions to a file. I continue training until the updates to θ become small.

import numpy as np
import util
from linear_model import LinearModel

def main(train_path, eval_path, pred_path):
    """Problem 1(b): Logistic regression with Newton's Method.
    
    Args:
        train_path: Path to CSV file containing dataset for training.
        eval_path: Path to CSV file containing dataset for evaluation.
        pred_path: Path to save predictions.
    """
    x_train, y_train = util.load_dataset(train_path, add_intercept=True)

    # Train logistic regression
    model = LogisticRegression(eps=1e-5)
    model.fit(x_train, y_train)

    # Plot data and decision boundary
    util.plot(x_train, y_train, model.theta, 'output/p01b_{}.png'.format(pred_path[-5]))

    # Save predictions
    x_eval, y_eval = util.load_dataset(eval_path, add_intercept=True)
    y_pred = model.predict(x_eval)
    np.savetxt(pred_path, y_pred > 0.5, fmt='%d')

class LogisticRegression(LinearModel):
    """Logistic regression with Newton's Method as the solver."""

    def fit(self, x, y):
        """Run Newton's Method to minimize J(theta) for logistic regression.
        
        Args:
            x: Training example inputs. Shape (m, n).
            y: Training example labels. Shape (m,).
        """
        # Init theta
        m, n = x.shape
        self.theta = np.zeros(n)

        # Newton's method
        while True:
            # Save old theta
            theta_old = np.copy(self.theta)
            
            # Hypothesis values h_theta(x) for every training example
            h_x = 1 / (1 + np.exp(-x.dot(self.theta)))
            # Hessian H = (1/m) X^T diag(h(1-h)) X and gradient of J(theta)
            H = (x.T * h_x * (1 - h_x)).dot(x) / m
            gradient_J_theta = x.T.dot(h_x - y) / m

            # Update theta
            self.theta -= np.linalg.inv(H).dot(gradient_J_theta)

            # Check for convergence
            if np.linalg.norm(self.theta - theta_old, ord=1) < self.eps:
                break

    def predict(self, x):
        """Make predictions given new inputs x.
        
        Args:
            x: Inputs of shape (m, n).
            
        Returns:
            Predicted outputs of shape (m,).
        """
        return self.predict_proba(x) >= 0.5

    def predict_proba(self, x):
        """Compute the predicted probabilities of class 1 given new inputs x.
        
        Args:
            x: Inputs of shape (m, n).
            
        Returns:
            Predicted probabilities of class 1 of shape (m,).
        """
        return 1 / (1 + np.exp(-x.dot(self.theta)))

Part (c): Gaussian Discriminant Analysis (GDA)

In this part, we revisit Gaussian discriminant analysis (GDA). We are given the joint distribution of (x, y) and asked to show that the posterior distribution can be written as p(y=1|x; φ, µ0, µ1, Σ) = 1/(1 + exp(-(θ^T * x + θ0))), where θ ∈ R^n and θ0 ∈ R are functions of φ, Σ, µ0, and µ1.
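Carrying out the algebra (a sketch following the standard derivation with a shared covariance Σ), the parameters of the resulting sigmoid come out to:

\theta = \Sigma^{-1} (\mu_1 - \mu_0),
\qquad
\theta_0 = \frac{1}{2}\bigl(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1\bigr) - \log\frac{1 - \phi}{\phi},

which is exactly what the GDA implementation below computes for theta[1:] and theta[0].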

Part (d): Maximum Likelihood Estimation

For this part, we assume that n (the dimension of x) is 1, and we are asked to derive the maximum likelihood estimates of the parameters φ, µ0, µ1, and Σ. Given the dataset, we calculate the maximum likelihood estimates based on the derived formulas.
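For reference, the maximum likelihood estimates take the familiar closed form (written here for general n; the problem asks for the n = 1 case):

\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\},
\qquad
\mu_k = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = k\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = k\}} \quad (k \in \{0, 1\}),
\qquad
\Sigma = \frac{1}{m} \sum_{i=1}^{m} \bigl(x^{(i)} - \mu_{y^{(i)}}\bigr)\bigl(x^{(i)} - \mu_{y^{(i)}}\bigr)^T.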

Part (e): GDA Implementation

I need to implement GDA using the provided dataset and calculate the parameters φ, µ0, µ1, and Σ. I then derive θ based on these parameters and use the resulting GDA model to make predictions on the validation set.

import numpy as np
import util

from linear_model import LinearModel


def main(train_path, eval_path, pred_path):
    """Problem 1(e): Gaussian discriminant analysis (GDA)

    Args:
        train_path: Path to CSV file containing dataset for training.
        eval_path: Path to CSV file containing dataset for evaluation.
        pred_path: Path to save predictions.
    """
    # Load dataset
    x_train, y_train = util.load_dataset(train_path, add_intercept=False)

    # *** START CODE HERE ***
    
    # Train GDA
    model = GDA()
    model.fit(x_train, y_train)

    # Plot data and decision boundary
    util.plot(x_train, y_train, model.theta, 'output/p01e_{}.png'.format(pred_path[-5]))

    # Save predictions
    x_eval, y_eval = util.load_dataset(eval_path, add_intercept=True)
    y_pred = model.predict(x_eval)
    np.savetxt(pred_path, y_pred > 0.5, fmt='%d')

    # *** END CODE HERE ***


class GDA(LinearModel):
    """Gaussian Discriminant Analysis (GDA) classifier."""

    def fit(self, x, y):
        """Fit a GDA model to training set given by x and y.

        Args:
            x: Training example inputs. Shape (m, n).
            y: Training example labels. Shape (m,).

        Returns:
            theta: GDA model parameters.
        """
        # *** START CODE HERE ***
        
        # Init theta
        m, n = x.shape
        self.theta = np.zeros(n+1)

        # Compute phi, mu_0, mu_1, sigma
        y_1 = sum(y == 1)
        phi = y_1 / m
        mu_0 = np.sum(x[y == 0], axis=0) / (m - y_1)
        mu_1 = np.sum(x[y == 1], axis=0) / y_1
        sigma = ((x[y == 0] - mu_0).T.dot(x[y == 0] - mu_0) + (x[y == 1] - mu_1).T.dot(x[y == 1] - mu_1)) / m

        # Compute theta
        sigma_inv = np.linalg.inv(sigma)
        self.theta[0] = 0.5 * (mu_0 + mu_1).dot(sigma_inv).dot(mu_0 - mu_1) - np.log((1 - phi) / phi)
        self.theta[1:] = sigma_inv.dot(mu_1 - mu_0)
        
        # Return theta
        return self.theta

        # *** END CODE HERE ***

    def predict(self, x):
        """Make a prediction given new inputs x.

        Args:
            x: Inputs of shape (m, n).

        Returns:
            Outputs of shape (m,).
        """
        # *** START CODE HERE ***
        
        return 1 / (1 + np.exp(-x.dot(self.theta)))

        # *** END CODE HERE ***

Part (f): Visualization of Decision Boundaries

In this part, I visualize the training data for Dataset 1 and plot the decision boundary found by logistic regression and GDA on the same figure. I use different symbols to represent examples with y=0 and y=1.

Part (g): Comparison of Logistic Regression and GDA

I repeat the visualization and comparison process for Dataset 2. I analyze which algorithm performs better and discuss the reasons for the observed performance.

Part (h): Transformation for Improved GDA Performance

As an extra credit task, I explore if a transformation of the x’s can significantly improve the performance of GDA on the dataset where it initially performed worse. I discuss the transformation and its impact on GDA’s performance.

Problem 2: Incomplete, Positive-Only Labels

The second problem focuses on training binary classifiers in situations where we have labels only for a subset of the positive examples. We are given a dataset with true labels t(i), partial labels y(i), and input features x(i). Our task is to construct a binary classifier h for the true labels t using only the partial labels y and the input features x.

Part (a): Probability of Being Labeled

We are asked to show that the probability of an example being labeled differs by a constant factor from the probability of an example being positive. That is, we need to prove that p(t=1|x) = p(y=1|x)/α for some α ∈ R.
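The key step (a sketch) uses the assumption that labeled examples are chosen at random from the positive examples, so y is conditionally independent of x given t:

p(y = 1 \mid x) = p(y = 1 \mid t = 1, x) \, p(t = 1 \mid x) = \alpha \, p(t = 1 \mid x),
\qquad
\alpha := p(y = 1 \mid t = 1),

since p(y = 1 \mid t = 0, x) = 0 and p(y = 1 \mid t = 1, x) does not depend on x.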

Part (b): Estimating α

In this part, I derive an expression for α using a trained classifier h and a held-out validation set V. I show that h(x(i)) ≈ α for all x(i) ∈ V+ (labeled examples). I assume that p(t=1|x) ≈ 1 when x(i) ∈ V+.
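Chaining the result from part (a) with that assumption gives the estimate (a sketch):

h\bigl(x^{(i)}\bigr) \approx p\bigl(y = 1 \mid x^{(i)}\bigr) = \alpha \, p\bigl(t = 1 \mid x^{(i)}\bigr) \approx \alpha
\quad \text{for } x^{(i)} \in V_{+}.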

Part (c): Partial Label Classification Implementation

We are provided with a dataset and asked to implement logistic regression using the partial labels y. We train the classifier, rescale the predictions using the estimated value of α, and visualize the decision boundaries on the test set.

import numpy as np
import util

from p01b_logreg import LogisticRegression

# Character to replace with sub-problem letter in plot_path/pred_path
WILDCARD = 'X'


def main(train_path, valid_path, test_path, pred_path):
    """Problem 2: Logistic regression for incomplete, positive-only labels.

    Run under the following conditions:
        1. on t-labels,
        2. on y-labels,
        3. on y-labels with correction factor alpha.

    Args:
        train_path: Path to CSV file containing training set.
        valid_path: Path to CSV file containing validation set.
        test_path: Path to CSV file containing test set.
        pred_path: Path to save predictions.
    """
    pred_path_c = pred_path.replace(WILDCARD, 'c')
    pred_path_d = pred_path.replace(WILDCARD, 'd')
    pred_path_e = pred_path.replace(WILDCARD, 'e')

    # *** START CODE HERE ***
    #######################################################################################
    # Problem (c)
    x_train, t_train = util.load_dataset(train_path, label_col='t', add_intercept=True)
    x_test, t_test = util.load_dataset(test_path, label_col='t', add_intercept=True)

    model_t = LogisticRegression()
    model_t.fit(x_train, t_train)

    util.plot(x_test, t_test, model_t.theta, 'output/p02c.png')

    t_pred_c = model_t.predict(x_test)
    np.savetxt(pred_path_c, t_pred_c > 0.5, fmt='%d')
    #######################################################################################
    # Problem (d)
    x_train, y_train = util.load_dataset(train_path, label_col='y', add_intercept=True)
    x_test, y_test = util.load_dataset(test_path, label_col='y', add_intercept=True)

    model_y = LogisticRegression()
    model_y.fit(x_train, y_train)

    util.plot(x_test, y_test, model_y.theta, 'output/p02d.png')

    y_pred = model_y.predict(x_test)
    np.savetxt(pred_path_d, y_pred > 0.5, fmt='%d')
    #######################################################################################  
    # Problem (e)
    x_valid, y_valid = util.load_dataset(valid_path, label_col='y', add_intercept=True)

    # Estimate alpha as the mean predicted probability over the labeled
    # (y = 1) validation examples, i.e. over V+ as derived in part (b)
    alpha = np.mean(model_y.predict_proba(x_valid[y_valid == 1]))

    # Shift of the decision boundary implied by the 1/alpha rescaling
    correction = 1 + np.log(2 / alpha - 1) / model_y.theta[0]
    util.plot(x_test, t_test, model_y.theta, 'output/p02e.png', correction)

    # Rescale predicted probabilities by 1/alpha before thresholding
    t_pred_e = model_y.predict_proba(x_test) / alpha
    np.savetxt(pred_path_e, t_pred_e > 0.5, fmt='%d')
    #######################################################################################
    # *** END CODE HERE ***

Problem 3: Poisson Regression

The third problem focuses on Poisson regression, which is a type of generalized linear model (GLM) used for count data. We are asked to derive properties of the Poisson distribution and of Poisson regression.

Part (a): Exponential Family Representation

I need to show that the Poisson distribution is in the exponential family and provide the values for b(y), η, T(y), and a(η). The exponential family representation for the Poisson distribution is given by p(y; η) = b(y)exp(ηT(y) - a(η)).
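Rewriting the Poisson pmf p(y; λ) = e^(−λ) λ^y / y! in that form gives (a quick sketch):

p(y; \lambda) = \frac{1}{y!} \exp\bigl(y \log\lambda - \lambda\bigr)
\;\Rightarrow\;
b(y) = \frac{1}{y!}, \quad \eta = \log\lambda, \quad T(y) = y, \quad a(\eta) = \lambda = e^{\eta}.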

Part (b): Canonical Response Function

We are asked to determine the canonical response function for Poisson regression. For a GLM with a Poisson response variable, the canonical response function is g(η) = E[y; η] = e^η, so with η = θ^T * x the predicted mean is λ = exp(θ^T * x).

Part (c): Stochastic Gradient Ascent Update Rule

In this part, we derive the stochastic gradient ascent update rule for Poisson regression from the log-likelihood. Rather than setting the derivative to zero, we take the derivative of the per-example log-likelihood with respect to θ and use it directly in the gradient ascent update.
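A sketch of the derivation, using the exponential-family form from part (a):

\log p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = y^{(i)} \theta^T x^{(i)} - e^{\theta^T x^{(i)}} - \log\bigl(y^{(i)}!\bigr),
\qquad
\frac{\partial}{\partial \theta_j} \log p = \bigl(y^{(i)} - e^{\theta^T x^{(i)}}\bigr) x_j^{(i)},

so the stochastic gradient ascent update is θ_j := θ_j + α (y^{(i)} − exp(θ^T x^{(i)})) x_j^{(i)}, which is exactly the vectorized update used in the implementation below.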

Part (d): Poisson Regression Implementation

We are provided with a dataset and asked to implement Poisson regression using gradient ascent to maximize the log-likelihood of θ. We train the model on the training split and make predictions on the evaluation split.

import matplotlib.pyplot as plt
import numpy as np
import util

from linear_model import LinearModel


def main(lr, train_path, eval_path, pred_path):
    """Problem 3(d): Poisson regression with gradient ascent.

    Args:
        lr: Learning rate for gradient ascent.
        train_path: Path to CSV file containing dataset for training.
        eval_path: Path to CSV file containing dataset for evaluation.
        pred_path: Path to save predictions.
    """
    # Load training set
    x_train, y_train = util.load_dataset(train_path, add_intercept=False)

    # *** START CODE HERE ***
    
    model = PoissonRegression(step_size=lr, eps=1e-5)
    model.fit(x_train, y_train)

    x_eval, y_eval = util.load_dataset(eval_path, add_intercept=False)
    y_pred = model.predict(x_eval)
    np.savetxt(pred_path, y_pred)

    plt.figure()
    plt.plot(y_eval, y_pred, 'bx')
    plt.xlabel('true counts')
    plt.ylabel('predicted counts')
    plt.savefig('output/p03d.png')

    # *** END CODE HERE ***


class PoissonRegression(LinearModel):
    """Poisson regression with gradient ascent as the solver."""

    def fit(self, x, y):
        """Run gradient ascent to maximize likelihood for Poisson regression.

        Args:
            x: Training example inputs. Shape (m, n).
            y: Training example labels. Shape (m,).
        """
        # *** START CODE HERE ***
        
        m, n = x.shape
        self.theta = np.zeros(n)

        while True:
            theta = np.copy(self.theta)
            self.theta += self.step_size * x.T.dot(y - np.exp(x.dot(self.theta))) / m

            if np.linalg.norm(self.theta - theta, ord=1) < self.eps:
                break

        # *** END CODE HERE ***

    def predict(self, x):
        """Make a prediction given inputs x.

        Args:
            x: Inputs of shape (m, n).

        Returns:
            Floating-point prediction for each input, shape (m,).
        """
        # *** START CODE HERE ***
        
        return np.exp(x.dot(self.theta))

        # *** END CODE HERE ***

Problem 4: Convexity of Generalized Linear Models

The fourth problem explores the convexity of Generalized Linear Models (GLMs) and their use of exponential family distributions to model the output.

Part (a): Mean of the Distribution

I derive an expression for the mean of an exponential family distribution and show that it can be represented as the gradient of the log-partition function with respect to the natural parameter.
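A sketch of the argument: differentiate the normalization identity ∫ b(y) exp(ηT(y) − a(η)) dy = 1 with respect to η (taking T(y) = y and assuming we can exchange differentiation and integration):

0 = \frac{\partial}{\partial \eta} \int p(y; \eta)\, dy = \int \bigl(y - a'(\eta)\bigr)\, p(y; \eta)\, dy
\;\Rightarrow\;
\mathbb{E}[y; \eta] = \frac{\partial a(\eta)}{\partial \eta}.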

Part (b): Variance of the Distribution

I derive an expression for the variance of an exponential family distribution and show that it can be expressed as the derivative of the mean with respect to the natural parameter.
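Differentiating the same identity a second time (same assumptions as above) gives:

\mathrm{Var}(y; \eta) = \mathbb{E}[y^2; \eta] - \mathbb{E}[y; \eta]^2 = \frac{\partial^2 a(\eta)}{\partial \eta^2} = \frac{\partial}{\partial \eta}\, \mathbb{E}[y; \eta].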

Part (c): Convexity of NLL Loss

In this part, I write the negative log-likelihood (NLL) loss function as a function of the model parameters and calculate its Hessian. I show that the Hessian is positive semi-definite, indicating that the NLL loss of GLMs is a convex function.
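Combining the two previous results, a sketch of why the Hessian is positive semi-definite for a GLM with η = θ^T x:

\nabla_\theta^2 \, \mathrm{NLL}(\theta) = \sum_{i=1}^{m} \frac{\partial^2 a(\eta)}{\partial \eta^2}\bigg|_{\eta = \theta^T x^{(i)}} x^{(i)} (x^{(i)})^T
= \sum_{i=1}^{m} \mathrm{Var}\bigl(y \mid x^{(i)}; \theta\bigr)\, x^{(i)} (x^{(i)})^T,

so z^T H z = Σ_i Var(y | x^{(i)}; θ) (z^T x^{(i)})² ≥ 0, since variances are non-negative.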

Problem 5: Locally Weighted Linear Regression

The fifth problem introduces locally weighted linear regression, where different training examples are weighted differently. We minimize a weighted sum of squared errors to find the optimal parameters.

Part (a): Weighted Loss Function

We express the weighted loss function in matrix form and define the appropriate weight matrix W.
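One way to write it (a sketch, with W absorbing the 1/2 factor):

J(\theta) = (X\theta - \vec{y})^T W (X\theta - \vec{y}),
\qquad
W = \mathrm{diag}\!\left(\frac{w^{(1)}}{2}, \ldots, \frac{w^{(m)}}{2}\right),

which expands back to J(θ) = 1/2 Σ_i w^{(i)} (θ^T x^{(i)} − y^{(i)})².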

Part (b): Normal Equation in the Weighted Setting

We generalize the normal equation to the weighted setting by taking the derivative of the weighted loss function and setting it to zero, which yields the value of θ that minimizes the weighted loss.
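In the notation above (a sketch), the gradient and the resulting closed-form solution are:

\nabla_\theta J(\theta) = 2 X^T W (X\theta - \vec{y}) = 0
\;\Rightarrow\;
\theta = \bigl(X^T W X\bigr)^{-1} X^T W \vec{y}.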

Part (c): Locally Weighted Linear Regression Implementation

We implement locally weighted linear regression using the derived normal equation and train the model on the provided dataset. We tune the hyperparameter τ and evaluate the model's performance on the validation and test sets.
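No implementation is included in this post for this part, so here is a minimal sketch of what the model could look like, assuming Gaussian weights w^(i) = exp(−||x − x^(i)||² / (2τ²)) and the same (m, n) data conventions as the classes above; the class and attribute names are illustrative, not taken from the original starter code.

import numpy as np


class LocallyWeightedLinearRegression:
    """Locally weighted linear regression (hypothetical sketch, bandwidth tau)."""

    def __init__(self, tau):
        self.tau = tau   # bandwidth of the Gaussian weighting kernel
        self.x = None
        self.y = None

    def fit(self, x, y):
        # LWR is non-parametric: "fitting" just stores the training set.
        self.x = x
        self.y = y

    def predict(self, x):
        m, _ = x.shape
        y_pred = np.zeros(m)
        for i in range(m):
            # Gaussian weights of every training example relative to the query x[i]
            w = np.exp(-np.sum((self.x - x[i]) ** 2, axis=1) / (2 * self.tau ** 2))
            W = np.diag(w / 2)
            # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
            theta = np.linalg.inv(self.x.T.dot(W).dot(self.x)).dot(self.x.T).dot(W).dot(self.y)
            y_pred[i] = x[i].dot(theta)
        return y_pred

Tuning τ then amounts to computing the mean squared error on the validation set for a grid of candidate values and keeping the best one before reporting the test MSE.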

Conclusion

In this blog post, I have covered the CS229 Problem Set #1 on Supervised Learning. I explored logistic regression, Gaussian discriminant analysis (GDA), Poisson regression, and locally weighted linear regression. I discussed the concepts, assumptions, implementations, and evaluations of these algorithms. Through this problem set, I gained a deeper understanding of linear classifiers, the use of exponential family distributions, and the importance of parameter estimation and model evaluation.