We provide you with a data set where the number of samples is $n = 16087$ and the number of features is $d = 10013$. Suppose that $X \in \mathbb{R}^{n \times d}$ is the input feature matrix and $y \in \mathbb{R}^n$ is the corresponding response vector. We use a linear model to fit the data, and thus we can formulate the optimization problem as

$$\min_{w \in \mathbb{R}^{d+1}} f(w) = \frac{1}{2n} \left\| \tilde{X} w - y \right\|_2^2,$$

where $\tilde{X} = (\mathbf{1}, X) \in \mathbb{R}^{n \times (d+1)}$ and $w = (w_0, w_1, \dots, w_d)^{\top} \in \mathbb{R}^{d+1}$.
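For later reference, the gradient and Hessian of this objective (assuming the $\frac{1}{2n}$ scaling above) are

$$\nabla f(w) = \frac{1}{n} \tilde{X}^{\top} (\tilde{X} w - y), \qquad \nabla^2 f(w) = \frac{1}{n} \tilde{X}^{\top} \tilde{X},$$

and the normalized problem $g(u)$ has the same form with the normalized matrix in place of $\tilde{X}$.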
(Figure: a preview of the training data.)
Normalization:
- We can normalize our training data column-wise to zero mean and unit variance (the standard z-score scheme, which is consistent with the recovery formula in the last section):

  $$x_{ij} \leftarrow \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \quad \sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \mu_j)^2.$$

  Writing $\hat{X}$ for the normalized matrix with the bias column attached, the normalized objective is $g(u) = \frac{1}{2n} \lVert \hat{X} u - y \rVert_2^2$.
- The Python code to normalize the training data is as follows (we also prepend the extra all-ones column $(1, 1, \dots, 1)^{\top}$ to $X$; a minimal sketch, assuming the z-score scheme above):
```python
import numpy as np

def transform_X(X):
    # Normalize each feature column to zero mean and unit variance
    # (assumed z-score scheme; assumes no constant columns),
    # then prepend the all-ones bias column.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X_norm]), mu, sigma
```
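For example (variable names hypothetical), applying the transform yields the augmented normalized matrix:

```python
X_norm, mu, sigma = transform_X(X)  # X is the raw 16087 x 10013 matrix
print(X_norm.shape)                 # (16087, 10014): bias column + d features
```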
Get the Lipschitz Constant
- Please find the Lipschitz constants of f(w) and g(u) respectively.
Since the objective is twice differentiable with a constant Hessian, we can compute $L$ as the maximum eigenvalue of the Hessian, which is also the spectral norm $\lVert \nabla^2 f(w) \rVert_2$. We use numpy.linalg.norm() in Python to calculate the constant; the code is as follows:
```python
def lipschitz(A):
    # For (1/2n)||Aw - y||^2 the Hessian is A^T A / n; its largest
    # eigenvalue (the spectral norm) is the Lipschitz constant of the gradient.
    return np.linalg.norm(A.T @ A, 2) / A.shape[0]

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # raw data with bias column
L_f = lipschitz(X_aug)   # Lipschitz constant for f(w)
L_g = lipschitz(X_norm)  # Lipschitz constant for g(u)
```
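As a quick sanity check on a small random instance (not the full data set), the spectral norm of the symmetric PSD matrix $A^{\top} A / n$ indeed equals its largest eigenvalue:

```python
A = np.random.randn(50, 10)
H = A.T @ A / A.shape[0]
print(np.isclose(np.linalg.norm(H, 2), np.linalg.eigvalsh(H).max()))  # True
```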
Closed-Form Solution
- Use the closed-form solution to solve the problem, and get the solution $u$ and the corresponding optimal value g_optimal = g(u).
In this section we solve the optimization problem directly by differentiation: for a strongly convex function, the unique optimum lies where the gradient vanishes, so $\nabla g(u) = 0$ gives the normal equations $\hat{X}^{\top} \hat{X} u = \hat{X}^{\top} y$. From this we have the following Python code:
```python
n = X_norm.shape[0]
# Solve the normal equations directly rather than forming an explicit inverse
u = np.linalg.solve(X_norm.T @ X_norm, X_norm.T @ y)
g_optimal = np.linalg.norm(X_norm @ u - y) ** 2 / (2 * n)
```
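One way to sanity-check the result (a usage sketch) is to confirm that the gradient vanishes at $u$:

```python
grad_at_u = X_norm.T @ (X_norm @ u - y) / n
print(np.linalg.norm(grad_at_u))  # should be ~0 up to numerical error
```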
The GD Algorithm
In this section we solve the optimization problem by gradient descent. Before running GD, we need the Lipschitz constant to set an appropriate stepsize (also called the learning rate).
- If the stepsize is larger than 2/L, GD may fail to converge to the optimal point.
- If the stepsize lies in (0, 2/L), GD converges monotonically to the optimal point (see the small demonstration after this list).
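A tiny demonstration of the $2/L$ threshold on the one-dimensional quadratic $f(x) = \frac{1}{2} x^2$ (so $L = 1$ and the critical stepsize is 2):

```python
def gd_1d(eta, steps=50, x=1.0):
    # f(x) = x^2 / 2 has gradient x and Lipschitz constant L = 1,
    # so the iteration is x <- (1 - eta) * x
    for _ in range(steps):
        x -= eta * x
    return x

print(gd_1d(eta=1.9))  # contraction factor |1 - 1.9| = 0.9: converges to 0
print(gd_1d(eta=2.1))  # factor |1 - 2.1| = 1.1: blows up
```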
The GD algorithm itself is simple; the Python code is as follows:
```python
def gradient_descent(A, y, eta, tol=0.05, u0=None):
    # Plain gradient descent on (1/2n)||Au - y||^2; stops when the
    # gradient norm drops below tol (assumed meaning of "precision").
    n, d = A.shape
    u = np.zeros(d) if u0 is None else u0
    steps = 0
    grad = A.T @ (A @ u - y) / n
    while np.linalg.norm(grad) > tol:
        u -= eta * grad
        grad = A.T @ (A @ u - y) / n
        steps += 1
    return u, steps
```
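A usage sketch, with the stepsize chosen from the Lipschitz constant computed above:

```python
eta = 1.0 / L_g                         # safely inside the (0, 2/L) range
u_gd, steps = gradient_descent(X_norm, y, eta, tol=0.05)
print(steps, np.linalg.norm(u_gd - u))  # compare against the closed form
```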
Since training takes quite a long time, I trained the model in two runs. In the first run the precision reached 0.1; then my laptop shut down due to low battery, so I resumed training from the saved w and finally reached the target precision of 0.05. In total it took about 2.5 hours (using the GPU in my PC with parallel computing, although the GPU utilization seemed rather low during the run).
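Resuming from a saved $w$ just means checkpointing the iterate; a minimal sketch (file name hypothetical):

```python
np.save("w_checkpoint.npy", u_gd)  # save the current iterate
u0 = np.load("w_checkpoint.npy")   # ...later, resume training from it
u_gd, steps = gradient_descent(X_norm, y, eta, tol=0.05, u0=u0)
```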
Training Results
The results are as follows:
- Get Lipschitz: (output omitted)
- Get closed-form solution: (output omitted)
- Get solution by the GD algorithm: (output omitted)
The total number of rounds is 12423, and the time cost is about 2.5 hours. This is much slower than the closed-form solution.
I found that most of the time is spent computing the gradient, so I optimized my GD code as follows:
```python
def gradient_descent_fast(A, y, eta, tol=0.05, u0=None):
    # Precompute A^T A and A^T y once, so each iteration costs a single
    # (d+1) x (d+1) matrix-vector product instead of two passes over A.
    # (A sketch of the optimization described above.)
    n, d = A.shape
    AtA, Aty = A.T @ A, A.T @ y
    u = np.zeros(d) if u0 is None else u0
    steps = 0
    grad = (AtA @ u - Aty) / n
    while np.linalg.norm(grad) > tol:
        u -= eta * grad
        grad = (AtA @ u - Aty) / n
        steps += 1
    return u, steps
```
This time I trained the model in a single run, and it ran faster than before:
(output omitted)
It took about half an hour. This is faster than before, and the number of steps is similar, but it is still much slower than the closed-form solution.
The Training Plot
The plot over all iterations is as follows (the plots of f and g are identical):
Since the first point is much larger than the others, I drop it; the resulting plot is much smoother:
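A minimal matplotlib sketch of this plot, assuming the objective values were recorded per iteration in a list g_history:

```python
import matplotlib.pyplot as plt

plt.plot(g_history[1:])  # drop the first (very large) point
plt.xlabel("iteration")
plt.ylabel("g(u_k)")
plt.title("GD training curve")
plt.show()
```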
Recover w_k from u_k
Since we let

$$u_0 = w_0 + \sum_{j=1}^{d} \mu_j w_j, \qquad u_j = \sigma_j w_j \quad (j = 1, \dots, d),$$

we have $\hat{X} u = \tilde{X} w$ and hence $g(u_k) = f(w_k)$, so the plots of $f(w_k)$ and $g(u_k)$ are the same. We can recover the original solution by inverting this substitution; the code is as follows:
```python
def recover_w(u, mu, sigma):
    # Invert the substitution above: w_j = u_j / sigma_j for j >= 1,
    # and w_0 = u_0 - sum_j mu_j * w_j.
    w = np.empty_like(u)
    w[1:] = u[1:] / sigma
    w[0] = u[0] - mu @ w[1:]
    return w
```
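As a check (usage sketch), recovering $w$ from the closed-form $u$ should reproduce the same objective value on the raw data:

```python
w = recover_w(u, mu, sigma)
f_value = np.linalg.norm(X_aug @ w - y) ** 2 / (2 * n)
print(np.isclose(f_value, g_optimal))  # True: f(w) = g(u)
```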
Note: I have uploaded all my output data, code, and experiment records to my blog: https://tl2cents.github.io
All Codes and Output
(The complete script and output logs are available on my blog, linked above.)
Some Records