Why Batch Normalization?
Training deep neural networks is difficult. Getting them to converge in a reasonable amount of time can be tricky. In this section, we describe batch normalization, a popular and effective technique that consistently accelerates the convergence of deep networks.
Denote by $\mathcal{B}$ a minibatch and let $\mathbf{x} \in \mathcal{B}$ be an input to batch normalization ($\mathrm{BN}$). In this case the batch normalization is defined as follows:

$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}$$

- $\hat{\boldsymbol{\mu}}_\mathcal{B}$ is the sample mean
- $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ is the sample standard deviation of the minibatch $\mathcal{B}$. After applying standardization, the resulting minibatch has zero mean and unit variance. The choice of unit variance (rather than some other magic number) is arbitrary.
- We recover this degree of freedom by including an elementwise scale parameter $\boldsymbol{\gamma}$ and shift parameter $\boldsymbol{\beta}$ that have the same shape as $\mathbf{x}$. Both are parameters that need to be learned as part of model training.
The variable magnitudes for intermediate layers cannot diverge during training since batch normalization actively centers and rescales them back to a given mean and size (via $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$). Practical experience confirms that, as alluded to when discussing feature rescaling, batch normalization seems to allow for more aggressive learning rates.
We calculate $\hat{\boldsymbol{\mu}}_\mathcal{B}$ and $\hat{\boldsymbol{\sigma}}_\mathcal{B}$ in the formula above as follows:

$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x}, \qquad \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}\right)^2 + \epsilon$$
Simply put, we add a small constant $\epsilon > 0$ to the variance estimate to ensure that:
- We never attempt division by zero, even in cases where the empirical variance estimate might be very small or vanish.
- The scaling issue is counteracted by using noisy estimates of the mean and variance.
- As it turns out, this is a recurring theme in deep learning. For reasons that are not yet well characterized theoretically, various sources of noise in optimization often lead to faster training and less overfitting: this variation appears to act as a form of regularization. Teye et al. (2018) and Luo et al. (2018) related the properties of batch normalization to Bayesian priors and penalties, respectively. In particular, this sheds some light on the puzzle of why batch normalization works best for moderate minibatch sizes in the 50–100 range. This particular size of minibatch seems to inject just the "right amount" of noise per layer, both in terms of scale via $\hat{\boldsymbol{\sigma}}$ and in terms of offset via $\hat{\boldsymbol{\mu}}$: larger minibatches regularize less because of their more stable estimates, whereas tiny minibatches destroy useful signal because of their high variance. Exploring this direction further, considering alternative types of preprocessing and filtering, may yet lead to other effective types of regularization. …
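To make the preceding formulas concrete, here is a minimal from-scratch sketch of the training-time computation in PyTorch. The function name `batch_norm` and the 2D input layout are my own choices; a production layer would also track running statistics for use at inference time.

```python
import torch

def batch_norm(X, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a 2D input of shape (batch, features)."""
    mu = X.mean(dim=0)                        # sample mean over the minibatch
    var = ((X - mu) ** 2).mean(dim=0)         # sample variance over the minibatch
    X_hat = (X - mu) / torch.sqrt(var + eps)  # standardize; eps guards against division by zero
    return gamma * X_hat + beta               # learnable elementwise scale and shift

X = torch.randn(8, 4)                         # a minibatch of 8 examples with 4 features
gamma, beta = torch.ones(4), torch.zeros(4)
Y = batch_norm(X, gamma, beta)
print(Y.mean(dim=0))                          # close to 0 for every feature
print(Y.std(dim=0, unbiased=False))           # close to 1 for every feature
```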
BatchNorm vs LayerNorm: A Classic Interview Question
The difference between batch normalization and layer normalization, using Xiao Ming's exam scores in Class A as an example (a numerical sketch follows the list):
- BatchNorm compares everyone with everyone else: it normalizes each feature across a batch of samples, e.g. standardizing the math scores of all students in Class A.
- LayerNorm compares a sample with itself: it normalizes all the features within each sample, e.g. standardizing all of Xiao Ming's own scores across subjects.
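The sketch below illustrates the two normalization axes on a tiny score matrix; the numbers and the use of `nn.BatchNorm1d`/`nn.LayerNorm` without affine parameters are just for illustration.

```python
import torch
from torch import nn

# Rows = students in Class A, columns = subjects (math, English, physics)
scores = torch.tensor([[90., 60., 75.],
                       [70., 80., 65.],
                       [85., 55., 95.],
                       [60., 90., 70.]])

bn = nn.BatchNorm1d(3, affine=False)            # per subject, across all students
ln = nn.LayerNorm(3, elementwise_affine=False)  # per student, across that student's subjects

print(bn(scores))   # each column now has roughly zero mean and unit variance
print(ln(scores))   # each row now has roughly zero mean and unit variance
```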


Fully Connected Layers
When applying batch normalization to fully connected layers, the original paper (Ioffe and Szegedy, 2015) inserted batch normalization after the affine transformation and before the nonlinear activation function. Later applications experimented with inserting batch normalization right after activation functions. Denoting the input to the fully connected layer by $\mathbf{x}$, the affine transformation by $\mathbf{W}\mathbf{x} + \mathbf{b}$ (with the weight parameter $\mathbf{W}$ and the bias parameter $\mathbf{b}$), and the activation function by $\phi$, we can express the computation of a batch-normalization-enabled, fully connected layer output $\mathbf{h}$ as follows:

$$\mathbf{h} = \phi(\mathrm{BN}(\mathbf{W}\mathbf{x} + \mathbf{b}))$$
Normalize
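A minimal sketch of such a layer in PyTorch, with batch normalization placed between the affine transformation and the activation; the layer sizes (20 and 256) are arbitrary choices for illustration.

```python
import torch
from torch import nn

# Affine transformation -> batch normalization -> nonlinearity,
# matching h = phi(BN(Wx + b)) from the formula above
fc_bn = nn.Sequential(
    nn.Linear(20, 256),    # Wx + b
    nn.BatchNorm1d(256),   # BN(Wx + b)
    nn.ReLU(),             # phi(.)
)

X = torch.randn(8, 20)     # a minibatch of 8 examples
print(fc_bn(X).shape)      # torch.Size([8, 256])
```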
ResNet
Residual Blocks
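No code survives in this part of the notes, so here is a minimal residual-block sketch realizing $f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$; the class name `Residual`, the `use_1x1conv` option for reshaping the shortcut, and the layer sizes are modeled on common implementations and should be treated as assumptions.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    """Residual block: output = ReLU(g(X) + X)."""
    def __init__(self, in_channels, out_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               padding=1, stride=strides)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        # Optional 1x1 convolution so the shortcut matches the new shape
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                stride=strides) if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        return F.relu(Y + X)

blk = Residual(3, 3)
print(blk(torch.randn(4, 3, 6, 6)).shape)   # torch.Size([4, 3, 6, 6])
```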



Densely Connected Networks (DenseNet)
Recall the Taylor expansion of an arbitrary function, which decomposes it into terms of increasingly higher order:

$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \ldots$$

Similarly, ResNet decomposes the function into a simple linear term and a more complex nonlinear one:

$$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x})$$
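Where ResNet adds $g(\mathbf{x})$ back onto $\mathbf{x}$, DenseNet concatenates the outputs along the channel dimension, so each convolution sees all earlier feature maps. A minimal dense-block sketch under that idea; the BN→ReLU→Conv layout of `conv_block`, the names, and the channel sizes are assumptions chosen to mirror common DenseNet implementations.

```python
import torch
from torch import nn

def conv_block(in_channels, out_channels):
    # BN -> ReLU -> 3x3 convolution: the basic unit inside a dense block
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))

class DenseBlock(nn.Module):
    def __init__(self, num_convs, in_channels, growth_rate):
        super().__init__()
        self.net = nn.ModuleList([
            conv_block(in_channels + i * growth_rate, growth_rate)
            for i in range(num_convs)])

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate input and output along the channel dimension
            X = torch.cat((X, Y), dim=1)
        return X

blk = DenseBlock(2, 3, 10)
print(blk(torch.randn(4, 3, 8, 8)).shape)   # torch.Size([4, 23, 8, 8])
```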


Transition Layers
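No notes survive under this heading, but the standard role of a transition layer is to tame the channel growth caused by concatenation: a 1×1 convolution reduces the number of channels and a stride-2 average pooling halves the height and width. A sketch under those assumptions:

```python
import torch
from torch import nn

def transition_block(in_channels, out_channels):
    # 1x1 convolution shrinks the channel count; stride-2 pooling halves H and W
    return nn.Sequential(
        nn.BatchNorm2d(in_channels), nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))

blk = transition_block(23, 10)
print(blk(torch.randn(4, 23, 8, 8)).shape)   # torch.Size([4, 10, 4, 4])
```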
The DenseNet Model
Similar to the 4 residual blocks used by ResNet, DenseNet uses 4 dense blocks.
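A sketch of how the four dense blocks might be assembled, reusing the `DenseBlock` and `transition_block` sketches above; the stem, block sizes, growth rate, and 10-class head are illustrative assumptions rather than a fixed recipe.

```python
import torch
from torch import nn

num_channels, growth_rate = 64, 32
blocks = []
for i, num_convs in enumerate([4, 4, 4, 4]):          # four dense blocks
    blocks.append(DenseBlock(num_convs, num_channels, growth_rate))
    num_channels += num_convs * growth_rate
    if i != 3:                                        # transition layer between blocks halves the channels
        blocks.append(transition_block(num_channels, num_channels // 2))
        num_channels //= 2

net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    *blocks,
    nn.BatchNorm2d(num_channels), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
    nn.Linear(num_channels, 10))

print(net(torch.randn(1, 1, 96, 96)).shape)   # torch.Size([1, 10])
```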
- Author:tom-ci
- URL:https://www.tomciheng.com//article/d2lv-4
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!