Overall Structure


Although the Transformer architecture was originally proposed for sequence-to-sequence learning, either the Transformer encoder or the Transformer decoder is often used individually for different deep learning tasks. The Transformer is an instance of the encoder-decoder architecture. In the Transformer architecture, multi-head self-attention is used for representing both the input sequence and the output sequence, though the decoder has to preserve the autoregressive property via a masked version. Both the residual connections and the layer normalization in the Transformer are important for training a very deep model. The positionwise feed-forward network in the Transformer model transforms the representation at all the sequence positions using the same MLP.

Positionwise Feed-Forward Networks

The positionwise feed-forward network transforms the representation at every sequence position using the same two-layer MLP, which is why we call it positionwise. In the implementation below, the input X with shape (batch size, number of time steps or sequence length in tokens, number of hidden units or feature dimension) is transformed by a two-layer MLP into an output tensor of shape (batch size, number of time steps, ffn_num_outputs). The following example shows that the innermost dimension of a tensor changes to the number of outputs of the positionwise feed-forward network. Since the same MLP transforms the representation at every position, when the inputs at all these positions are the same, their outputs are also identical.
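As a concrete illustration, here is a minimal PyTorch sketch of such a positionwise feed-forward network, in the spirit of the implementation referenced above; the class name PositionWiseFFN and the use of nn.LazyLinear are choices made for this sketch rather than a verbatim copy.

```python
import torch
from torch import nn

class PositionWiseFFN(nn.Module):
    """Two-layer MLP applied identically at every sequence position."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        # X: (batch size, number of time steps, feature dimension).
        # nn.Linear acts on the last dimension, so every position is
        # transformed by the same weights.
        return self.dense2(self.relu(self.dense1(X)))

ffn = PositionWiseFFN(4, 8)
ffn.eval()
# All three positions of the first example share the same input,
# so the three output rows are identical.
print(ffn(torch.ones((2, 3, 4)))[0])
```

Feeding a tensor whose positions are all identical through the network shows identical outputs at every position, while the innermost dimension changes from 4 to 8.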
Residual Connection and Layer Normalization
Now let's focus on the "add & norm" component. As we described at the beginning of this section, this is a residual connection immediately followed by layer normalization. Both are key to effective deep architectures. In :numref:`sec_batch_norm`, we explained how batch normalization recenters and rescales across the examples within a minibatch. As discussed in :numref:`subsec_layer-normalization-in-bn`, layer normalization is the same as batch normalization except that the former normalizes across the feature dimension, thus enjoying the benefits of scale independence and batch size independence. Despite its pervasive use in computer vision, batch normalization is usually empirically less effective than layer normalization in natural language processing tasks, where the inputs are often variable-length sequences. The following code snippet compares normalization across different dimensions by layer normalization and batch normalization.
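To make the difference concrete, here is a small sketch (assuming PyTorch's nn.LayerNorm and nn.BatchNorm1d) that normalizes the same two-example minibatch across the feature dimension and across the batch dimension.

```python
import torch
from torch import nn

# Two examples with two features each.
X = torch.tensor([[1.0, 2.0], [2.0, 3.0]])

ln = nn.LayerNorm(2)    # normalizes each example across its features
bn = nn.BatchNorm1d(2)  # normalizes each feature across the minibatch

# In training mode, batch normalization uses the statistics of this minibatch.
print('layer norm:', ln(X))
print('batch norm:', bn(X))
```

Layer normalization produces the same result for an example no matter what else is in the minibatch, which is part of what makes it attractive for batches of variable-length sequences.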
Encoder
The following TransformerEncoderBlock class contains two sublayers: multi-head self-attention and a positionwise feed-forward network, where a residual connection followed by layer normalization is employed around both sublayers.
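A simplified sketch of such an encoder block is shown below. It uses PyTorch's built-in nn.MultiheadAttention in place of a custom multi-head attention module, and the AddNorm helper and its constructor arguments are illustrative names for this sketch, not the exact interface of the original implementation.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # Add the sublayer output Y to its input X, then layer-normalize.
        return self.ln(self.dropout(Y) + X)

class TransformerEncoderBlock(nn.Module):
    """Self-attention and positionwise FFN, each wrapped in add & norm."""
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            num_hiddens, num_heads, dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        # Same role as the PositionWiseFFN sketched earlier.
        self.ffn = nn.Sequential(
            nn.Linear(num_hiddens, ffn_num_hiddens), nn.ReLU(),
            nn.Linear(ffn_num_hiddens, num_hiddens))
        self.addnorm2 = AddNorm(num_hiddens, dropout)

    def forward(self, X):
        Y, _ = self.attention(X, X, X, need_weights=False)
        Y = self.addnorm1(X, Y)
        return self.addnorm2(Y, self.ffn(Y))

X = torch.ones((2, 100, 24))
blk = TransformerEncoderBlock(num_hiddens=24, ffn_num_hiddens=48,
                              num_heads=8, dropout=0.5)
blk.eval()
print(blk(X).shape)  # torch.Size([2, 100, 24]): the shape is preserved
```

As the printed shape shows, the encoder block does not change the shape of its input, which is what allows many such blocks to be stacked.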
Decoder

Training

After training, we can visualize the Transformer attention weights.
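As a rough illustration of what such a visualization can look like, the sketch below plots per-head attention weight matrices as heatmaps with matplotlib; the weights here are random placeholders standing in for the weights recorded from a trained model.

```python
import torch
import matplotlib.pyplot as plt

# Placeholder attention weights for one encoder layer:
# shape (num_heads, num_queries, num_keys); random values stand in for
# the weights recorded from a trained model.
num_heads, seq_len = 4, 10
attention_weights = torch.softmax(
    torch.randn(num_heads, seq_len, seq_len), dim=-1)

fig, axes = plt.subplots(1, num_heads, figsize=(12, 3), sharey=True)
for head, ax in enumerate(axes):
    im = ax.imshow(attention_weights[head].numpy(), cmap='Reds',
                   vmin=0.0, vmax=1.0)
    ax.set_title(f'Head {head + 1}')
    ax.set_xlabel('Key positions')
axes[0].set_ylabel('Query positions')
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```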
- Author: tom-ci
- URL: https://www.tomciheng.com//article/Transformer-Architecture