Vectorization and Caches
Beyond computational efficiency, the overhead introduced by Python and by the deep learning framework itself is considerable. Recall that every time we execute a line of code, the Python interpreter sends a command to the framework, which needs to insert it into the computational graph and deal with it during scheduling. Such overhead can be quite detrimental. In short, we are well advised to use vectorization (and matrices) whenever possible.
Since we will benchmark the running time frequently in the rest of the book, let's define a timer.
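A minimal sketch of such a timer (the class name and methods here are illustrative, not the library's exact API):

```python
import time

class Timer:
    """Record and report running times (minimal sketch)."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the clock."""
        self.tik = time.time()

    def stop(self):
        """Stop the clock, record the elapsed time, and return it."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def sum(self):
        """Return the total of all recorded times."""
        return sum(self.times)
```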
Element-wise assignment simply iterates over all rows and columns of $\mathbf{B}$ and $\mathbf{C}$ respectively to assign the value to $\mathbf{A}$.
A faster strategy is to perform column-wise assignment.
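A rough sketch of what the two strategies might look like, using the Timer above and matrices that are smaller than in the text so the nested loop finishes quickly (names and sizes are illustrative, not taken from the original code):

```python
import torch

n = 256
B = torch.randn(n, n)
C = torch.randn(n, n)

# Strategy 1: element-wise assignment, one dot product per entry of A.
A = torch.zeros(n, n)
timer = Timer()  # Timer class sketched above
for i in range(n):
    for j in range(n):
        A[i, j] = torch.dot(B[i, :], C[:, j])
print(f'element-wise: {timer.stop():.3f} sec')

# Strategy 2: column-wise assignment, one matrix-vector product per column of A.
A = torch.zeros(n, n)
timer.start()
for j in range(n):
    A[:, j] = torch.mv(B, C[:, j])
print(f'column-wise: {timer.stop():.3f} sec')
```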
- Efficient memory access patterns: in the first program, the elements of matrix A are computed one by one, which requires repeatedly accessing the rows of matrix B and the columns of matrix C. The elements are not accessed in a contiguous manner.
- Exploiting vector operations: the second program uses the torch.mv() function, which performs a matrix-vector multiplication. This operation is more efficient than the torch.dot() function used in the first program, which computes the dot product of two vectors. Vector operations can exploit features of the underlying hardware, such as SIMD (single instruction, multiple data) instructions, to carry out the computation more efficiently.
- Reduced loop overhead: the second program has a single loop over the columns of matrix A, whereas the first program has nested loops over both the rows and the columns of matrix A.
Last, the most effective manner is to perform the entire operation in one block. Note that multiplying any two matrices $\mathbf{B} \in \mathbb{R}^{m \times n}$ and $\mathbf{C} \in \mathbb{R}^{n \times p}$ takes approximately $2mnp$ floating point operations, when scalar multiplication and addition are counted as separate operations (fused in practice). Thus, multiplying two $1024 \times 1024$ matrices takes $2.15$ billion floating point operations. Let's see what the respective speed of the operations is.
Exploiting highly optimized matrix multiplication: the torch.mm() function is a highly optimized matrix multiplication implementation provided by the PyTorch library. It relies on low-level, hardware-specific optimizations, for example by calling into BLAS (Basic Linear Algebra Subprograms) libraries that are fine-tuned for particular hardware and CPU architectures. These optimizations make torch.mm() much faster than a manually implemented matrix multiplication.
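The single-block version, reusing B, C, and the Timer from the sketches above:

```python
# Strategy 3: one block, a single matrix-matrix product (reuses B, C, Timer from above).
timer = Timer()
A = torch.mm(B, C)
print(f'one block: {timer.stop():.5f} sec')
```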
Minibatches
In the past we took it for granted that we would read *minibatches* of data rather than single observations to update parameters. We now give a brief justification for it. Processing single observations requires us to perform many single matrix-vector (or even vector-vector) multiplications, which is quite expensive and which incurs a significant overhead on behalf of the underlying deep learning framework. This applies both to evaluating a network when applied to data (often referred to as inference) and when computing gradients to update parameters. That is, this applies whenever we perform $\mathbf{w} \leftarrow \mathbf{w} - \eta_t \mathbf{g}_t$, where

$$\mathbf{g}_t = \partial_{\mathbf{w}} f(\mathbf{x}_{t}, \mathbf{w}).$$
We can increase the *computational* efficiency of this operation by applying it to a minibatch of observations at a time. That is, we replace the gradient $\mathbf{g}_t$ over a single observation by one over a small batch:

$$\mathbf{g}_t = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}).$$
Let's see what this does to the statistical properties of $\mathbf{g}_t$:
- Since both $\mathbf{x}_t$ and also all elements of the minibatch $\mathcal{B}_t$ are drawn uniformly at random from the training set, the expectation of the gradient remains unchanged.
- The variance, on the other hand, is reduced significantly. Since the minibatch gradient is composed of $b := |\mathcal{B}_t|$ independent gradients which are being averaged, its standard deviation is reduced by a factor of $b^{-\frac{1}{2}}$. This, by itself, is a good thing, since it means that the updates are more reliably aligned with the full gradient. A short derivation of this scaling follows the list.
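To see where the $b^{-\frac{1}{2}}$ factor comes from, treat the per-observation gradients in the minibatch as independent with coordinate-wise variance at most $\sigma^2$ (an assumption made here purely for illustration). Then

$$\operatorname{Var}\left[\mathbf{g}_t\right] = \operatorname{Var}\left[\frac{1}{b} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} f(\mathbf{x}_{i}, \mathbf{w})\right] = \frac{1}{b^2} \sum_{i \in \mathcal{B}_t} \operatorname{Var}\left[\partial_{\mathbf{w}} f(\mathbf{x}_{i}, \mathbf{w})\right] \leq \frac{\sigma^2}{b},$$

so the standard deviation of each coordinate shrinks like $\sigma\, b^{-\frac{1}{2}}$ while the mean stays the same.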
Naively this would indicate that choosing a large minibatch would be universally desirable. Alas, after some point, the additional reduction in standard deviation is minimal when compared to the linear increase in computational cost. In practice we pick a minibatch that is large enough to offer good computational efficiency while still fitting into the memory of a GPU. To illustrate the savings let's have a look at some code. In it we perform the same matrix-matrix multiplication, but this time broken up into "minibatches" of 64 columns at a time.
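A sketch of the blocked variant, again reusing n, B, C, and the Timer from the earlier sketches (only the block width of 64 comes from the text):

```python
# Multiply B and C in blocks of 64 columns at a time (reuses n, B, C, Timer from above).
A = torch.zeros(n, n)
timer = Timer()
for j in range(0, n, 64):
    A[:, j:j + 64] = torch.mm(B, C[:, j:j + 64])
print(f'block-wise, 64 columns: {timer.stop():.5f} sec')
```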
Reading the Dataset
Let's have a look at how minibatches are efficiently generated from data. In the following we use a dataset developed by NASA to test the wing noise from different aircraft to compare these optimization algorithms. For convenience we only use the first $1,500$ examples. The data is whitened for preprocessing, i.e., we remove the mean and rescale the variance to $1$ per coordinate.
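A sketch of how that preprocessing might be wired up, assuming the airfoil self-noise data sits in a local whitespace-separated file; the file name, helper name, and defaults here are assumptions for illustration:

```python
import numpy as np
import torch

def get_data(path='airfoil_self_noise.dat', n=1500, batch_size=10):
    """Load the first n examples, whiten them, and return a minibatch iterator."""
    data = np.genfromtxt(path, dtype=np.float32)           # whitespace-separated rows
    data = (data - data.mean(axis=0)) / data.std(axis=0)   # zero mean, unit variance
    data = torch.from_numpy(data[:n])
    # Last column is the target, the remaining columns are features.
    dataset = torch.utils.data.TensorDataset(data[:, :-1], data[:, -1])
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
```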
Implementation from Scratch
Recall the minibatch stochastic gradient descent implementation from the linear regression chapter.
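A minimal sketch of what such an update step might look like in PyTorch (the signature is illustrative, not the book's exact code):

```python
import torch

def sgd(params, lr):
    """One minibatch SGD step: move each parameter against its gradient."""
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad   # descend along the gradient
            p.grad.zero_()     # reset gradients for the next minibatch
```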
Let's see how optimization proceeds for batch gradient descent. This can be achieved by setting the minibatch size to 1500, i.e., the total number of examples, so the model parameters are updated only once per epoch. There is little progress; in fact, after 6 steps progress stalls.

When the batch size equals 1, we optimize with stochastic gradient descent. To keep the implementation simple we pick a constant (albeit small) learning rate. In stochastic gradient descent, the model parameters are updated whenever an example is processed; in our case this amounts to 1500 updates per epoch. We can see that the decline in the value of the objective function slows down after one epoch. Although both procedures process 1500 examples within one epoch, stochastic gradient descent consumes more time than gradient descent in our experiment, because it updates the parameters more frequently and processing single observations one at a time is less efficient.

Finally, when the batch size equals 100, we use minibatch stochastic gradient descent for optimization. The time required per epoch is shorter than the time needed for stochastic gradient descent and for batch gradient descent.
Reducing the batch size to 10 increases the time per epoch, because the workload for each batch is executed less efficiently.
Now we can compare the time versus loss for the previous four experiments. As can be seen, although stochastic gradient descent converges faster than GD in terms of the number of examples processed, it uses more time than GD to reach the same loss, because computing the gradient example by example is not as efficient. Minibatch stochastic gradient descent is able to trade off convergence speed and computational efficiency: a minibatch size of 10 is more efficient than stochastic gradient descent, and a minibatch size of 100 even outperforms GD in terms of runtime.
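The four experiments could be run along the following lines, building on the get_data and sgd sketches from earlier; the helper name, the linear model, and the learning rates are illustrative, not the original code:

```python
import torch

def train_sgd(lr, batch_size, num_epochs=2, path='airfoil_self_noise.dat'):
    """Fit a linear model with minibatch SGD; only lr and batch_size change per experiment."""
    data_iter = get_data(path, n=1500, batch_size=batch_size)  # iterator sketched earlier
    w = torch.zeros(5, 1, requires_grad=True)   # the airfoil data has 5 features
    b = torch.zeros(1, requires_grad=True)
    loss = torch.nn.MSELoss()
    for epoch in range(num_epochs):
        for X, y in data_iter:
            l = loss(X @ w + b, y.reshape(-1, 1))
            l.backward()
            sgd([w, b], lr)                     # update step sketched earlier
    return w, b

train_sgd(lr=1, batch_size=1500)      # batch gradient descent: one update per epoch
train_sgd(lr=0.005, batch_size=1)     # stochastic gradient descent
train_sgd(lr=0.4, batch_size=100)     # minibatch SGD, batch size 100
train_sgd(lr=0.05, batch_size=10)     # minibatch SGD, batch size 10
```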
Concise Implementation
In Gluon, we can use the Trainer class to call optimization algorithms. This is used to implement a generic training function. We will use this throughout the current chapter. Using Gluon to repeat the last experiment shows identical behavior.
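Since the code in this post is PyTorch rather than Gluon, a rough equivalent of the same idea uses a built-in optimizer such as torch.optim.SGD in place of the Trainer class; the sketch below reuses the get_data helper from earlier and is illustrative, not the book's training function:

```python
import torch
from torch import nn

def train_concise(lr, batch_size, num_epochs=2, path='airfoil_self_noise.dat'):
    """Same experiment as above, but with nn.Linear and a built-in optimizer."""
    data_iter = get_data(path, n=1500, batch_size=batch_size)
    net = nn.Linear(5, 1)                                   # 5 features -> 1 prediction
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.MSELoss()
    for epoch in range(num_epochs):
        for X, y in data_iter:
            optimizer.zero_grad()
            l = loss(net(X), y.reshape(-1, 1))
            l.backward()
            optimizer.step()
    return net

train_concise(lr=0.05, batch_size=10)   # repeat the last experiment
```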
Compare minibatch stochastic gradient descent with a variant that actually samples with replacement from the training set. What happens?
Comparing minibatch stochastic gradient descent (SGD) with a variant that samples from the training set with replacement, the main differences are (a small sketch contrasting the two sampling schemes follows this list):
- Sampling method:
  - Minibatch SGD: the training set is split into smaller batches, and each iteration uses a different batch of examples, without replacement.
  - Sampling with replacement: each iteration draws a batch of examples at random from the entire training set, with replacement. This means that some examples may be repeated within a batch while others may be left out.
- Convergence properties:
  - Minibatch SGD: the gradient computed from a minibatch provides an unbiased estimate of the true gradient, and as the batch size and the number of iterations increase, the algorithm converges towards the optimal solution.
  - Sampling with replacement: the gradient computed from a randomly sampled batch is still an unbiased estimate of the true gradient. However, because of the replacement, the variance of the gradient is higher, which can slow down convergence.
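A small sketch of the two ways of drawing minibatch indices (purely illustrative):

```python
import torch

n, batch_size = 1500, 100

# Without replacement: shuffle once, then walk through disjoint batches (minibatch SGD).
perm = torch.randperm(n)
without_replacement = [perm[i:i + batch_size] for i in range(0, n, batch_size)]

# With replacement: draw every batch independently from the full training set.
with_replacement = [torch.randint(0, n, (batch_size,)) for _ in range(n // batch_size)]

# With replacement, only about 63% of the examples are expected to appear in one "epoch".
seen = torch.unique(torch.cat(with_replacement)).numel()
print(f'distinct examples seen with replacement: {seen} of {n}')
```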