Logistic Regression and Stochastic Gradient Descent

2021-07-21

字数统计: 514字 | 阅读时长≈ 3分

SGD and GD on LR

We provide you with a dataset of handwritten digits1 that contains a training set of 60000 examples and a test set of 2018 examples (“hw4 lr.zip”). Each image in this dataset has 28 × 28 pixels and the associated label is the handwritten digit—that is, an integer from the set {0, 1, · · · , 9}—in the image. In this exercise, you need to build a logistic regression classifier to predict if a given image has the handwritten digit 9 in it or not. You can use your favorite programming language to finish this exercise.

(a) Choose a proper normalization method to process the data matrix. Please report the normalization method you use.

(b) Find a Lipschitz constant of ∇L(w), where L(w) is the objective function of the logistic regression after normalizing and w is the model parameter to be estimated. Please report your result.
(a) Use GD and SGD to train the logistic regression classifier on the training set, respectively. Evaluate the classification accuracy on the training set after each iteration. Stop the iteration when Accuracy ≥ 90% or total steps are more than 5000. Please plot the accuracy of these two classifiers (the one trained by GD and the other trained by SGD) versus the iteration step on one graph.

(b) Compare the total iteration counts and the total time cost of the two methods (GD and SGD), respectively. Please report your result.

(c) Compare the confusion matrix, precision, recall and F1 score of the two classifiers (the one trained by GD and the other trained by SGD). Please report your result.
(a) The training set is imbalanced as the majority class has roughly ten times more images than the minority class. Imbalanced data can hurt the performance of the classifiers badly. Thus, please undersample the majority class such that the numbers of images in the two classes are roughly the same.

(b) Use GD to train the logistic regression classifier on the new training set after undersampling. Stop the iteration when Accuracy ≥ 90% or total steps are more than 5000.

(c) Evaluate the two classifiers (the one trained with GD on the original training set and the other trained on the new training set after undersampling) on the test set. Compare the confusion matrix, precision, recall and F1 score of the two classifiers. Please report your result.

Solution

1.

(a).

First ,the original data point is a pixel (28*28), I consider flattening all the graphs and then I get a vector (784,1).After this transformation, the input training data becomes a matrix with size (60000，784）

Since all the pixel values are in the range [0,255], I just divide the data matrix X by 255 , and then insert one column (1,1,1,…,1).T into the matrix.

(b)

$∇^2L(w) = XΣ_wX,\, where\, Σ_w \,is \,a\, diagonal\, matrix \,with\, its\, i_{th} \,entry \,on\, its\, diagonal\, being\\ h_i=\frac{exp(−w, \bar{x}_i)}{(1 + exp(−w, \bar{x}_i))^2}$

Since h_i<=1 , we can get that

$||∇^2L(w)||<\frac{||X^T*X||}{4}$

As total function we use is L(w)/n, thus the Lipschitz constant can get is :

$L=\frac{X^T*X}{4n}\\ (Note: we \, use\quad \frac{L(w)}{n} \quad as\, our \, target\,loss \,fuction )$

The result we get is

1	9.77436284224097

2.

Since in SGD it’s easy to jump to an extreme point(it’s easy to achieve high accuracy such as 0.9, but it performs rather bad on the test sets), I let the stopping point be p > 0.925.

The output is as follows:

the call GD_of_LR():
start time is 2021-05-07 14:21:00.763145
end time is 2021-05-07 14:25:20.313422
the total cost ：0:04:19.550277



the call SGD_of_LR():
start time is 2021-05-07 14:25:20.314141
end time is 2021-05-07 14:25:27.904874
the total cost ：0:00:07.590733

#	accuracy, 			precision, 			recall, 			F1_score,			steps
GD (0.3736372646184341, 0.6104751994450226, 0.9258285113098369, 0.7357859531772576) 420

SGD (0.2869177403369673, 0.5755395683453237, 0.9258285113098369, 0.7098205283323251) 303

We can see that SGD runs much faster than GD . Their effect on the training set is not much different(accuracy, precision, recall,F1_score).

The iteration plot is as follows. The accuracy of GD increase almost all the time, but the accuracy of SGD fluctuates a lot.

3.

In this session, I choose all the 5949 data points whose label is 9 and I randomly select other 5949 samples . Finally I use all the about 12000 data points to train my logistic regression model. The results I get are as follows:

the call GD_of_LR():
start time is 2021-05-07 14:50:16.691031
end time is 2021-05-07 14:56:39.733480
the total cost ：0:06:23.042449



the call SGD_of_LR():
start time is 2021-05-07 14:56:39.733480
end time is 2021-05-07 14:57:33.642944
the total cost ：0:00:53.909464

#	accuracy, 			precision, 			recall, 			F1_score,			steps
GD (0.5584737363726462, 0.7008750994431185, 0.9268805891635981, 0.7981879954699886) 2215
SGD (0.5505450941526263, 0.6967537608867775, 0.9258285113098369, 0.7951208493336346) 10643

In this training model, its step size is small, and it cost a lot steps. From the point of the time cost , SGD is still rather faster than GD, and its steps are about five times of GD. We can see that SGD is really faster than GD in real application, but it’s not stable which means that SGD will not converge monotonously to the best solution (or even worse it won’t), but it’s effect is usually very good and we don’t want a overfitting model either.

The plot of p with iterations are as follows:

UNDER

Codes

import numpy as np
import math
import matplotlib.pyplot as plt
from functools import wraps
from datetime import datetime

np.set_printoptions(threshold=np.inf)


def timer(func):
    @wraps(func)
    def inner(*args, **kwargs):
        start_time = datetime.now()
        print('the call %s():' % func.__name__)
        print("start time is {}".format(datetime.now()))
        result = func(*args, **kwargs)
        end_time = datetime.now()
        print("end time is {}".format(datetime.now()))
        print("the total cost ：{}".format(end_time - start_time))
        print("\n\n")
        return result

    return inner


def load_data(data_path="mnist.npy"):
    """example for loading data"""
    mnist_data = np.load(data_path, allow_pickle=True).item()
    return mnist_data


def logistic_func(w, X, y):
    Y = 0
    for i in range(X.shape[0]):
        t = X[i]@w
        # print(t.shape)
        if y[i] == 0:
            if t < 0:
                t1 = np.log(1+np.exp(t))
            else:
                t1 = -np.log(np.exp(-t)/(1+np.exp(-t)))
        else:
            if t < 0:
                t1 = -np.log(np.exp(t)/(1+np.exp(t)))
            else:
                t1 = np.log(1+np.exp(-t))
        Y = Y+t1
    return Y


def logistic_der(w, X, y):
    # print(w.shape,X.shape,y.shape)
    H = np.ones((X.shape[0], 1))
    for i in range(H.shape[0]):
        t = np.vdot(w.T, X[i])
        if t > 0:
            H[i] = 1/(1+np.exp(-t))-y[i]
        else:
            H[i] = np.exp(t)/(1+np.exp(t))-y[i]
    return X.T@H/X.shape[0]


def transform_X(X):
    X = X/255
    X = np.insert(X, 0, values=1, axis=1)
    return X


def get_Lipschitz(X):
    L = np.linalg.norm(X.T@X)
    # for i in range(X.shape[0]):
    # np.linalg.norm(X.T@X)
    return L/X.shape[0]/4


@timer
def GD_of_LR(X, y, w0, stepsize, precesion=0.9):
    w = w0
    p_arr = []
    while True:
        # f=logistic_func(w,X,y)
        # print("f",f)
        p = get_accurancy(w, X, y)
        p_arr.append(p)
        # print('p', p)
        if(p > precesion):
            break
        w = w-stepsize*logistic_der(w, X, y)
        # print(w)
    return w, p_arr


@timer
def SGD_of_LR(X, y, w0, stepsize, precesion=0.9):
    w = w0
    p_arr = []
    while True:
        p = get_accurancy(w, X, y)
        p_arr.append(p)
        # print('p', p)
        if p > precesion:
            break
        num = np.random.randint(0, X.shape[0])
        # print(num)
        t = np.vdot(w.T, X[num])
        # print(t)
        if t > 0:
            h = 1/(1+math.exp(-t))-y[num]
        else:
            h = math.exp(t)/(1+math.exp(t))-y[num]
        # print(X[num,:].shape,X.shape)
        w = w-h*X[num].reshape(-1, 1)
    return w, p_arr


def get_all_measures(w, X, y):
    y_head = np.multiply(X@w, 2*y-1)
    t = np.sum(y > 0)
    num = np.sum(y_head > 0)
    f_n = 0
    f_p = 0
    for i in range(y_head.shape[0]):
        if y_head[i] > 0:
            continue
        else:
            if y[i] == 1:
                f_n = f_n+1
            else:
                f_p = f_p+1
    precision = (t-f_n)/(t-f_n+f_p)
    recall = (t-f_n)/t
    F_score = 2/(1/precision+1/recall)
    return num/y.shape[0], precision, recall, F_score


def get_accurancy(w, X, y):
    y_head = np.multiply(X@w, 2*y-1)
    num = np.sum(y_head > 0)
    return num/y.shape[0]


def undersample(X, y):
    label = np.array(np.where(y == 1))
    # sample=X[y]
    # print(label.shape)
    for _ in range(label.shape[1]):
        num = np.random.randint(0, X.shape[0])
        # print(type(num),type(y))
        while num in label:
            num = np.random.randint(0, X.shape[0])
        label = np.append(label, num)
    # print(label.shape)
    sample = X[label]
    sample_label = y[label]
    return sample, sample_label.reshape(-1, 1)


def plot_p(p1, p2):
    plt.figure(1, figsize=(8, 5))
    # print(p1.shape)
    x = np.linspace(0, len(p1), num=len(p1))
    # print(x)
    plt.plot(x, p1, color="r", label="GD training line")

    x = np.linspace(0, len(p2), num=len(p2))
    plt.plot(x, p2, color="b", label="SGD training line")
    plt.tick_params(labelsize=15)
    # 曲线风格，label样式
    font = {'family': 'Times New Roman', 'size': 20}
    plt.xlabel('x/steps', font)
    plt.ylabel('y/accuracy', font, rotation='horizontal')
    plt.legend()
    plt.show()


if __name__ == '__main__':
    mnist_data = load_data()
    for key, value in mnist_data.items():
        print(f"{key} shape: {value.shape}")
        # print(type(mnist_data))
    train_img = mnist_data['train_images']
    # print(type(train_img[0]),train_img[0].shape)
    test_img = mnist_data['test_images']
    train_label = mnist_data['train_labels']
    test_label = mnist_data['test_labels']
    testdata = np.reshape(test_img, (-1, 28*28))
    testlabel = test_label.reshape(-1, 1)//9
    for i in range(10):
        print(np.sum(train_label == i))
    X = np.reshape(train_img, (-1, 28*28))
    X_head = transform_X(X)
    L = get_Lipschitz(X_head)
    print(L)

    # w0 = np.ones([np.shape(X_head)[1], 1])
    w0 = np.random.rand(np.shape(X_head)[1], 1)
    y = train_label//9
    sample, sample_label = undersample(X_head, y)
    # GD
    L = get_Lipschitz(sample)
    w1, p1 = GD_of_LR(sample, sample_label, w0, 1/L, 0.925)
    w2, p2 = SGD_of_LR(sample, sample_label, w0, 1/(12000*L), 0.925)
    # w1, p1 = GD_of_LR(X_head, y.reshape(-1, 1), w0, 1/L, 0.925)
    # w2, p2 = SGD_of_LR(X_head, y.reshape(-1, 1), w0, 2/(L*600000), 0.925)
    # data = np.load("data.npz", allow_pickle=True)
    # w1 = data['w1']
    # w2 = data['w2']
    # p1 = data['p1']
    # p2 = data['p2']
    np.savez("data", w1=w1, p1=p1, w2=w2, p2=p2)
    test = transform_X(testdata)
    p = get_all_measures(w1, test, test_label.reshape(-1, 1))
    print("GD", p, len(p1))
    p = get_all_measures(w2, test, test_label.reshape(-1, 1))
    print("SGD", p, len(p2))
    plot_p(p1, p2)