Overall Structure


Although the Transformer architecture was originally proposed for sequence-to-sequence learning, either the Transformer encoder or the Transformer decoder is often used individually for different deep learning tasks. The Transformer is an instance of the encoder-decoder architecture. In the Transformer architecture, multi-head self-attention is used for representing both the input sequence and the output sequence, though the decoder has to preserve the autoregressive property via a masked version. Both the residual connections and the layer normalization in the Transformer are important for training a very deep model. The positionwise feed-forward network in the Transformer model transforms the representation at all the sequence positions using the same MLP.

Positionwise Feed-Forward Networks

The positionwise feed-forward network transforms the representation at every sequence position using the same two-layer MLP, which is why we call it positionwise. In the implementation below, the input X with shape (batch size, number of time steps or sequence length in tokens, number of hidden units or feature dimension) is transformed by a two-layer MLP into an output tensor of shape (batch size, number of time steps, ffn_num_outputs). The following example shows that the innermost dimension of a tensor changes to the number of outputs of the positionwise feed-forward network. Since the same MLP transforms the representation at every position, when the inputs at all these positions are the same, their outputs are also identical.
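As a concrete illustration, here is a minimal PyTorch sketch of such a positionwise feed-forward network, in the spirit of the implementation referenced above; the class name PositionWiseFFN and the use of nn.LazyLinear are choices made for this sketch rather than a verbatim copy.

```python
import torch
from torch import nn

class PositionWiseFFN(nn.Module):
    """Two-layer MLP applied identically at every sequence position."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        # X: (batch size, number of time steps, feature dimension).
        # nn.Linear acts on the last dimension, so every position is
        # transformed by the same weights.
        return self.dense2(self.relu(self.dense1(X)))

ffn = PositionWiseFFN(4, 8)
ffn.eval()
# All three positions of the first example share the same input,
# so the three output rows are identical.
print(ffn(torch.ones((2, 3, 4)))[0])
```

Feeding a tensor whose positions are all identical through the network shows identical outputs at every position, while the innermost dimension changes from 4 to 8.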
Residual Connection and Layer Normalization
Now let's focus on the "add & norm" component. As we described at the beginning of this section, this is a residual connection immediately followed by layer normalization. Both are key to effective deep architectures. In :numref:`sec_batch_norm`, we explained how batch normalization recenters and rescales across the examples within a minibatch. As discussed in :numref:`subsec_layer-normalization-in-bn`, layer normalization is the same as batch normalization except that the former normalizes across the feature dimension, thus enjoying the benefits of scale independence and batch size independence. Despite its pervasive use in computer vision, batch normalization is usually empirically less effective than layer normalization in natural language processing tasks, where the inputs are often variable-length sequences. The following code snippet compares normalization across different dimensions by layer normalization and batch normalization.
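To make the difference concrete, here is a small sketch (assuming PyTorch's nn.LayerNorm and nn.BatchNorm1d) that normalizes the same two-example minibatch across the feature dimension and across the batch dimension.

```python
import torch
from torch import nn

# Two examples with two features each.
X = torch.tensor([[1.0, 2.0], [2.0, 3.0]])

ln = nn.LayerNorm(2)    # normalizes each example across its features
bn = nn.BatchNorm1d(2)  # normalizes each feature across the minibatch

# In training mode, batch normalization uses the statistics of this minibatch.
print('layer norm:', ln(X))
print('batch norm:', bn(X))
```

Layer normalization produces the same result for an example no matter what else is in the minibatch, which is part of what makes it attractive for batches of variable-length sequences.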
Encoder
The following TransformerEncoderBlock class contains two sublayers: multi-head self-attention and a positionwise feed-forward network, where a residual connection followed by layer normalization is employed around both sublayers.
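A simplified sketch of such an encoder block is shown below. It uses PyTorch's built-in nn.MultiheadAttention in place of a custom multi-head attention module, and the AddNorm helper and its constructor arguments are illustrative names for this sketch, not the exact interface of the original implementation.

```python
import torch
from torch import nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # Add the sublayer output Y to its input X, then layer-normalize.
        return self.ln(self.dropout(Y) + X)

class TransformerEncoderBlock(nn.Module):
    """Self-attention and positionwise FFN, each wrapped in add & norm."""
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            num_hiddens, num_heads, dropout=dropout, batch_first=True)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        # Same role as the PositionWiseFFN sketched earlier.
        self.ffn = nn.Sequential(
            nn.Linear(num_hiddens, ffn_num_hiddens), nn.ReLU(),
            nn.Linear(ffn_num_hiddens, num_hiddens))
        self.addnorm2 = AddNorm(num_hiddens, dropout)

    def forward(self, X):
        Y, _ = self.attention(X, X, X, need_weights=False)
        Y = self.addnorm1(X, Y)
        return self.addnorm2(Y, self.ffn(Y))

X = torch.ones((2, 100, 24))
blk = TransformerEncoderBlock(num_hiddens=24, ffn_num_hiddens=48,
                              num_heads=8, dropout=0.5)
blk.eval()
print(blk(X).shape)  # torch.Size([2, 100, 24]): the shape is preserved
```

As the printed shape shows, the encoder block does not change the shape of its input, which is what allows many such blocks to be stacked.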
Decoder

Training

After training, we can visualize the Transformer attention weights.
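As a rough illustration of what such a visualization can look like, the sketch below plots per-head attention weight matrices as heatmaps with matplotlib; the weights here are random placeholders standing in for the weights recorded from a trained model.

```python
import torch
import matplotlib.pyplot as plt

# Placeholder attention weights for one encoder layer:
# shape (num_heads, num_queries, num_keys); random values stand in for
# the weights recorded from a trained model.
num_heads, seq_len = 4, 10
attention_weights = torch.softmax(
    torch.randn(num_heads, seq_len, seq_len), dim=-1)

fig, axes = plt.subplots(1, num_heads, figsize=(12, 3), sharey=True)
for head, ax in enumerate(axes):
    im = ax.imshow(attention_weights[head].numpy(), cmap='Reds',
                   vmin=0.0, vmax=1.0)
    ax.set_title(f'Head {head + 1}')
    ax.set_xlabel('Key positions')
axes[0].set_ylabel('Query positions')
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```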
- Author: tom-ci
- URL: https://www.tomciheng.com//article/Transformer-Architecture