Stochastic Gradient Descent
In earlier chapters we kept using stochastic gradient descent in our training procedure without explaining why it works. To shed some light on it, Section 11.3 described the basic principles of gradient descent. This section goes on to discuss stochastic gradient descent (SGD) in greater detail.

In deep learning, the objective function is usually the average of the loss functions over the examples in the training dataset.
Given a training dataset of $n$ examples, we assume that $f_i(\mathbf{x})$ is the loss function with respect to the training example of index $i$, where $\mathbf{x}$ is the parameter vector. Then we arrive at the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$

The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$

If gradient descent is used, the computational cost of each independent variable iteration is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the cost of gradient descent for each iteration will be higher.

Stochastic gradient descent (SGD) reduces the computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ for data examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}),$$

where $\eta$ is the learning rate. We can see that the computational cost for each iteration drops from $\mathcal{O}(n)$ for gradient descent to the constant $\mathcal{O}(1)$. Moreover, we want to emphasize that the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient of $f(\mathbf{x})$ because

$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$
This means that, on average, the stochastic gradient is a good estimate of the gradient. Now, we will compare it to gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient, to simulate stochastic gradient descent. The random noise stands in for the variability of picking one example at a time instead of averaging over all of them; put simply, each gradient component gets a random perturbation at every step.

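The code block that originally sat here did not survive the export. Below is a minimal sketch of the same experiment, assuming (as in the d2l chapter this post follows) the objective $f(x_1, x_2) = x_1^2 + 2 x_2^2$ with minimum at the origin; the names `f_grad`, `sgd`, `eta`, and `constant_lr` are this sketch's own stand-ins for whatever the original code used, and the original presumably also plotted the trajectory:

```python
import random

eta = 0.1  # base learning rate

def f_grad(x1, x2):
    # Gradient of f(x1, x2) = x1**2 + 2 * x2**2
    return 2 * x1, 4 * x2

def constant_lr():
    # Scheduling lies dormant: the multiplier is always 1
    return 1

def sgd(x1, x2, f_grad, lr):
    # Simulate a stochastic gradient: the true gradient plus N(0, 1)
    # noise per coordinate
    g1, g2 = f_grad(x1, x2)
    g1 += random.gauss(0, 1)
    g2 += random.gauss(0, 1)
    eta_t = eta * lr()
    return x1 - eta_t * g1, x2 - eta_t * g2

x1, x2 = -5.0, -2.0  # a starting point away from the minimum
for step in range(50):
    x1, x2 = sgd(x1, x2, f_grad, lr=constant_lr)
print(f"epoch 50, x1: {x1:.6f}, x2: {x2:.6f}")
```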
As we can see, the trajectory of the variables in stochastic gradient descent is much more noisy than the one we observed in gradient descent in Section 11.3. This is due to the stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$. Even after 50 steps the quality is still not so good. Even worse, it will not improve after additional steps (we encourage you to experiment with a larger number of steps to confirm this). This leaves us with the only alternative: change the learning rate $\eta$. However, if we pick it too small, we will not make any meaningful progress initially. On the other hand, if we pick it too large, we will not get a good solution, as seen above. The only way to resolve these conflicting goals is to reduce the learning rate dynamically as optimization progresses. This is also the reason for adding a learning rate function `lr` into the `sgd` step function. In the example above any functionality for learning rate scheduling lies dormant, as we set the associated `lr` function to be constant. In short, a fixed rate performs poorly; a quick analysis suggests the learning rate is the right knob to turn.
Dynamic Learning Rate
Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling convergence of an optimization algorithm. In particular, we need to figure out how rapidly $\eta(t)$ should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization. The following are a few basic strategies that are used in adjusting $\eta$ over time (we will discuss more advanced strategies later):

$$
\begin{aligned}
\eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1} && \text{piecewise constant} \\
\eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\
\eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay}
\end{aligned}
$$
In the first, piecewise constant, scenario we decrease the learning rate, e.g., whenever progress in optimization stalls. This is a common strategy for training deep networks. Alternatively, we could decrease it much more aggressively by an exponential decay. Unfortunately, this often leads to premature stopping before the algorithm has converged. A popular choice is polynomial decay with $\alpha = 0.5$. In the case of convex optimization there are a number of proofs that show that this rate is well behaved.
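As a sketch, the three schedules can be written as plain functions of the step counter $t$; the constants below ($\eta_0$, $\lambda$, $\alpha$, $\beta$, and the boundaries) are illustrative choices, not values from the original post:

```python
import math

def piecewise_constant_lr(t, boundaries=(10, 30), etas=(0.5, 0.1, 0.02)):
    # Drop to the next constant rate each time t crosses a boundary
    for boundary, eta_i in zip(boundaries, etas):
        if t < boundary:
            return eta_i
    return etas[-1]

def exponential_lr(t, eta0=0.5, lam=0.1):
    # eta(t) = eta0 * exp(-lam * t)
    return eta0 * math.exp(-lam * t)

def polynomial_lr(t, eta0=0.5, beta=1.0, alpha=0.5):
    # eta(t) = eta0 * (beta * t + 1) ** (-alpha); alpha = 0.5 gives the
    # inverse-square-root decay discussed below
    return eta0 * (beta * t + 1) ** (-alpha)
```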
Let's see what the exponential decay looks like in practice.

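The demonstration code is also missing here; a sketch of what it plausibly did, reusing `f_grad` and `random` from the first snippet and `exponential_lr` from the list above:

```python
x1, x2 = -5.0, -2.0
for step in range(1000):
    g1, g2 = f_grad(x1, x2)
    g1 += random.gauss(0, 1)  # same noisy-gradient simulation as before
    g2 += random.gauss(0, 1)
    eta_t = exponential_lr(step)  # rate shrinks like exp(-0.1 * t)
    x1, x2 = x1 - eta_t * g1, x2 - eta_t * g2
print(f"epoch 1000, x1: {x1:.6f}, x2: {x2:.6f}")
```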
As expected, the variance in the parameters is significantly reduced. However, this comes at the expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$. Even after 1000 iteration steps we are still very far away from the optimal solution. Indeed, the algorithm fails to converge at all: the rate shrinks so fast that the iterates freeze wherever the decay catches up with them. On the other hand, if we use a polynomial decay where the learning rate decays with the inverse square root of the number of steps, convergence gets better after only 50 steps.
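The same loop with `polynomial_lr` from the sketch above ($\alpha = 0.5$, the inverse-square-root schedule) is one way to reproduce that behavior:

```python
x1, x2 = -5.0, -2.0
for step in range(50):
    g1, g2 = f_grad(x1, x2)
    g1 += random.gauss(0, 1)
    g2 += random.gauss(0, 1)
    eta_t = polynomial_lr(step)  # eta ~ 1 / sqrt(t): decays, but slowly
    x1, x2 = x1 - eta_t * g1, x2 - eta_t * g2
print(f"epoch 50, x1: {x1:.6f}, x2: {x2:.6f}")
```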
Convergence Analysis for Convex Objectives
I couldn't follow this part. What can I say.
- Author: tom-ci
- URL: https://www.tomciheng.com//article/d2l-13
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!