We provide you with a data set where the number of samples is $n = 16087$ and the number of features is $d = 10013$. Suppose that $X \in \mathbb{R}^{n \times d}$ is the input feature matrix and $y \in \mathbb{R}^n$ is the corresponding response vector. We use a linear model to fit the data, and thus we can formulate the optimization problem as

$$\min_{w \in \mathbb{R}^{d+1}} f(w) = \frac{1}{2n} \left\| \tilde{X} w - y \right\|_2^2,$$

where $\tilde{X} = (\mathbf{1}, X) \in \mathbb{R}^{n \times (d+1)}$ and $w = (w_0, w_1, \dots, w_d)^{\top} \in \mathbb{R}^{d+1}$.
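For later reference, the gradient and Hessian of this objective (assuming the $\frac{1}{2n}$ scaling above) are

$$\nabla f(w) = \frac{1}{n} \tilde{X}^{\top} (\tilde{X} w - y), \qquad \nabla^2 f(w) = \frac{1}{n} \tilde{X}^{\top} \tilde{X},$$

and the normalized problem $g(u)$ has the same form with the normalized matrix in place of $\tilde{X}$.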
(Figure: a preview of the training data.)
Normalization:
- We can normalize our training data column-wise to zero mean and unit variance (the standard z-score scheme, which is consistent with the recovery formula in the last section):

  $$x_{ij} \leftarrow \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \quad \sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2.$$

  Writing $\hat{X}$ for the normalized matrix with the bias column attached, the normalized objective is $g(u) = \frac{1}{2n} \lVert \hat{X} u - y \rVert_2^2$.
- The Python code to normalize the training data is as follows (we also prepend the extra all-ones column $(1, 1, \dots, 1)^{\top}$ to $X$; a minimal sketch, assuming the z-score scheme above):
```python
import numpy as np

def transform_X(X):
    # Normalize each feature column to zero mean and unit variance
    # (assumed z-score scheme; assumes no constant columns),
    # then prepend the all-ones bias column.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X_norm]), mu, sigma
```
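For example (variable names hypothetical), applying the transform yields the augmented normalized matrix:

```python
X_norm, mu, sigma = transform_X(X)  # X is the raw 16087 x 10013 matrix
print(X_norm.shape)                 # (16087, 10014): bias column + d features
```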
Get the Lipschitz Constant
- Please find the Lipschitz constants of f(w) and g(u) respectively.
Since the objective is twice differentiable with a constant Hessian, we can compute $L$ as the maximum eigenvalue of the Hessian, which is also the spectral norm $\lVert \nabla^2 f(w) \rVert_2$. We use numpy.linalg.norm() in Python to calculate the constant; the code is as follows:
```python
def lipschitz(A):
    # For (1/2n)||Aw - y||^2 the Hessian is A^T A / n; its largest
    # eigenvalue (the spectral norm) is the Lipschitz constant of the gradient.
    return np.linalg.norm(A.T @ A, 2) / A.shape[0]

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # raw data with bias column
L_f = lipschitz(X_aug)   # Lipschitz constant for f(w)
L_g = lipschitz(X_norm)  # Lipschitz constant for g(u)
```
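As a quick sanity check on a small random instance (not the full data set), the spectral norm of the symmetric PSD matrix $A^{\top} A / n$ indeed equals its largest eigenvalue:

```python
A = np.random.randn(50, 10)
H = A.T @ A / A.shape[0]
print(np.isclose(np.linalg.norm(H, 2), np.linalg.eigvalsh(H).max()))  # True
```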
Closed-Form Solution
- Use the closed-form solution to solve the problem, and get the solution $u$ and the corresponding optimal value g_optimal = g(u).
In this section we solve the optimization problem directly by differentiation: for a strongly convex function, the unique optimum lies where the gradient vanishes, so $\nabla g(u) = 0$ gives the normal equations $\hat{X}^{\top} \hat{X} u = \hat{X}^{\top} y$. From this we have the following Python code:
```python
n = X_norm.shape[0]
# Solve the normal equations directly rather than forming an explicit inverse
u = np.linalg.solve(X_norm.T @ X_norm, X_norm.T @ y)
g_optimal = np.linalg.norm(X_norm @ u - y) ** 2 / (2 * n)
```
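One way to sanity-check the result (a usage sketch) is to confirm that the gradient vanishes at $u$:

```python
grad_at_u = X_norm.T @ (X_norm @ u - y) / n
print(np.linalg.norm(grad_at_u))  # should be ~0 up to numerical error
```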
The GD Algorithm
In this section we solve the optimization problem by gradient descent. Before running GD, we need the Lipschitz constant to set an appropriate stepsize (also called the learning rate).
- If the stepsize is larger than 2/L, GD may fail to converge to the optimal point.
- If the stepsize lies in (0, 2/L), GD converges monotonically to the optimal point (see the small demonstration after this list).
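A tiny demonstration of the $2/L$ threshold on the one-dimensional quadratic $f(x) = \frac{1}{2} x^2$ (so $L = 1$ and the critical stepsize is 2):

```python
def gd_1d(eta, steps=50, x=1.0):
    # f(x) = x^2 / 2 has gradient x and Lipschitz constant L = 1,
    # so the iteration is x <- (1 - eta) * x
    for _ in range(steps):
        x -= eta * x
    return x

print(gd_1d(eta=1.9))  # contraction factor |1 - 1.9| = 0.9: converges to 0
print(gd_1d(eta=2.1))  # factor |1 - 2.1| = 1.1: blows up
```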
The GD algorithm itself is simple; the Python code is as follows:
```python
def gradient_descent(A, y, eta, tol=0.05, u0=None):
    # Plain gradient descent on (1/2n)||Au - y||^2; stops when the
    # gradient norm drops below tol (assumed meaning of "precision").
    n, d = A.shape
    u = np.zeros(d) if u0 is None else u0
    steps = 0
    grad = A.T @ (A @ u - y) / n
    while np.linalg.norm(grad) > tol:
        u -= eta * grad
        grad = A.T @ (A @ u - y) / n
        steps += 1
    return u, steps
```
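A usage sketch, with the stepsize chosen from the Lipschitz constant computed above:

```python
eta = 1.0 / L_g                         # safely inside the (0, 2/L) range
u_gd, steps = gradient_descent(X_norm, y, eta, tol=0.05)
print(steps, np.linalg.norm(u_gd - u))  # compare against the closed form
```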
Since training takes quite a long time, I trained the model in two runs. In the first run the precision reached 0.1; then my laptop shut down due to low battery, so I resumed training from the saved w and finally reached the target precision of 0.05. In total it took about 2.5 hours (using the GPU in my PC with parallel computing, although the GPU utilization seemed rather low during the run).
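Resuming from a saved $w$ just means checkpointing the iterate; a minimal sketch (file name hypothetical):

```python
np.save("w_checkpoint.npy", u_gd)  # save the current iterate
u0 = np.load("w_checkpoint.npy")   # ...later, resume training from it
u_gd, steps = gradient_descent(X_norm, y, eta, tol=0.05, u0=u0)
```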
Training Results
The results are as follows:
- Get Lipschitz: (output omitted)
- Get closed-form solution: (output omitted)
- Get solution by the GD algorithm: (output omitted)
The total number of rounds is 12423, and the time cost is about 2.5 hours. This is much slower than the closed-form solution.
I found that most of the time is spent computing the gradient, so I optimized my GD code as follows:
```python
def gradient_descent_fast(A, y, eta, tol=0.05, u0=None):
    # Precompute A^T A and A^T y once, so each iteration costs a single
    # (d+1) x (d+1) matrix-vector product instead of two passes over A.
    # (A sketch of the optimization described above.)
    n, d = A.shape
    AtA, Aty = A.T @ A, A.T @ y
    u = np.zeros(d) if u0 is None else u0
    steps = 0
    grad = (AtA @ u - Aty) / n
    while np.linalg.norm(grad) > tol:
        u -= eta * grad
        grad = (AtA @ u - Aty) / n
        steps += 1
    return u, steps
```
This time I trained the model in a single run, and it ran faster than before:
(output omitted)
It took about half an hour. This is faster than before, and the number of steps is similar, but it is still much slower than the closed-form solution.
The Training Plot
The plot over all iterations is as follows (the plots of f and g are identical):
Since the first point is much larger than the others, I drop it; the resulting plot is much smoother:
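A minimal matplotlib sketch of this plot, assuming the objective values were recorded per iteration in a list g_history:

```python
import matplotlib.pyplot as plt

plt.plot(g_history[1:])  # drop the first (very large) point
plt.xlabel("iteration")
plt.ylabel("g(u_k)")
plt.title("GD training curve")
plt.show()
```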
Recover w_k from u_k
Since we let

$$u_0 = w_0 + \sum_{j=1}^{d} \mu_j w_j, \qquad u_j = \sigma_j w_j \quad (j = 1, \dots, d),$$

we have $\hat{X} u = \tilde{X} w$ and hence $g(u_k) = f(w_k)$, so the plots of $f(w_k)$ and $g(u_k)$ are the same. We can recover the original solution by inverting this substitution; the code is as follows:
```python
def recover_w(u, mu, sigma):
    # Invert the substitution above: w_j = u_j / sigma_j for j >= 1,
    # and w_0 = u_0 - sum_j mu_j * w_j.
    w = np.empty_like(u)
    w[1:] = u[1:] / sigma
    w[0] = u[0] - mu @ w[1:]
    return w
```
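As a check (usage sketch), recovering $w$ from the closed-form $u$ should reproduce the same objective value on the raw data:

```python
w = recover_w(u, mu, sigma)
f_value = np.linalg.norm(X_aug @ w - y) ** 2 / (2 * n)
print(np.isclose(f_value, g_optimal))  # True: f(w) = g(u)
```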
Note: I have uploaded all my output data, code, and experiment records to my blog: https://tl2cents.github.io
All Codes and Output
(The complete script and output logs are available on my blog, linked above.)
Some Records