Model
Before providing the implementation of multi-head attention, let's formalize this model mathematically. Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$, each attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) is computed as

$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q}, \mathbf{W}_i^{(k)} \mathbf{k}, \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v},$$

where $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameters and $f$ is attention pooling, such as additive attention or scaled dot product attention. The multi-head attention output is another linear transformation via learnable parameters $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$ of the concatenation of $h$ heads:

$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}.$$

MultiHeadAttention
In our implementation, we choose the scaled dot product attention for each head of the multi-head attention. To avoid significant growth of computational cost and parametrization cost, we set $p_q = p_k = p_v = p_o / h$. Note that $h$ heads can be computed in parallel if we set the number of outputs of the linear transformations for the query, key, and value to $p_q h = p_k h = p_v h = p_o$. In the following implementation, $p_o$ is specified via the argument `num_hiddens`.
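Since the code itself is not reproduced here, below is a minimal PyTorch sketch along these lines; the `transpose_qkv`/`transpose_output` helpers and the use of `nn.LazyLinear` are assumptions for illustration, not necessarily the original implementation.

```python
import math
import torch
from torch import nn

def transpose_qkv(X, num_heads):
    """(batch, seq, num_hiddens) -> (batch * num_heads, seq, num_hiddens / num_heads)."""
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    X = X.permute(0, 2, 1, 3)
    return X.reshape(-1, X.shape[2], X.shape[3])

def transpose_output(X, num_heads):
    """Reverse transpose_qkv: (batch * num_heads, seq, head_dim) -> (batch, seq, num_hiddens)."""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)

class MultiHeadAttention(nn.Module):
    """Multi-head attention with scaled dot product attention for each head."""
    def __init__(self, num_hiddens, num_heads, dropout, bias=False):
        super().__init__()
        self.num_heads = num_heads
        self.dropout = nn.Dropout(dropout)
        # p_q h = p_k h = p_v h = p_o = num_hiddens, so all heads run in parallel
        self.W_q = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_k = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_v = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_o = nn.LazyLinear(num_hiddens, bias=bias)

    def forward(self, queries, keys, values):
        # Split into heads: (batch * num_heads, seq, num_hiddens / num_heads)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)
        # Scaled dot product attention (no masking in this sketch)
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        attention_weights = nn.functional.softmax(scores, dim=-1)
        output = torch.bmm(self.dropout(attention_weights), values)
        # Concatenate the heads and apply the final linear transformation W_o
        return self.W_o(transpose_output(output, self.num_heads))
```

Reshaping to `(batch_size * num_heads, ...)` lets all heads share one batched matrix multiplication instead of looping over heads, which is what makes the parallel computation cheap.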
Self-Attention
In deep learning, we often use CNNs or RNNs to encode sequences.
Now with attention mechanisms in mind, imagine feeding a sequence of tokens into an attention mechanism such that at every step, each token has its own query, keys, and values.
Here, when computing the value of a token's representation at the next layer, the token can attend (via its query vector) to any other token (matching based on their key vectors). Using the full set of query-key compatibility scores, we can compute, for each token, a representation by building the appropriate weighted sum over the other tokens.
Because every token is attending to every other token (unlike the case where decoder steps attend to encoder steps), such architectures are typically described as self-attention models (Lin et al., 2017; Vaswani et al., 2017), and elsewhere described as intra-attention models (Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017).
In this section, we will discuss sequence encoding using self-attention, including the use of additional information to represent the sequence order.
Self-Attention
Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$ where any $\mathbf{x}_i \in \mathbb{R}^d$ ($1 \leq i \leq n$), its self-attention outputs a sequence of the same length $\mathbf{y}_1, \ldots, \mathbf{y}_n$, where

$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d$$

according to the definition of attention pooling $f$.
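As a quick check of this definition, the sketch below reuses the `MultiHeadAttention` class from above (the concrete sizes are arbitrary) and feeds the same tensor in as queries, keys, and values; the output sequence has the same length and dimension as the input.

```python
import torch

# Self-attention: queries, keys, and values all come from the same sequence.
attention = MultiHeadAttention(num_hiddens=100, num_heads=5, dropout=0.5)
attention.eval()  # disable dropout for a deterministic check
X = torch.ones((2, 4, 100))  # (batch_size, number of tokens n, dimension d)
print(attention(X, X, X).shape)  # torch.Size([2, 4, 100]): same length n, same dimension d
```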

- Text sequences as one-dimensional images
- A one-dimensional CNN can process local features in text such as n-grams.
- Computational complexity and hierarchy of CNNs
- A convolutional layer with kernel size k and d input/output channels has computational complexity O(knd^2).
- CNNs are hierarchical, with a constant number of sequential operations and a maximum path length of O(n/k).
- Computational complexity and recurrence of RNNs
- Updating the hidden state of an RNN has computational complexity O(nd^2).
- RNNs require O(n) sequential operations, and their maximum path length is also O(n).
- Computational complexity and parallelism of self-attention
- Self-attention has computational complexity O(n^2 d).
- With self-attention, every token is directly connected to every other token, computation can be parallelized, and the maximum path length is O(1).
- Both CNNs and self-attention benefit from parallel computation, and self-attention has the shortest maximum path length.
- However, the computational complexity of self-attention is quadratic in the sequence length, so it becomes very slow for very long sequences (a tiny sketch with concrete numbers follows this list).
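To make these orders concrete, the snippet below simply plugs numbers into them; the constants are dropped and the choices of n, d, and k are arbitrary assumptions for illustration.

```python
# Rough operation counts matching the orders stated above (constant factors dropped).
def op_counts(n, d, k):
    return {
        "CNN, O(k n d^2)": k * n * d ** 2,
        "RNN, O(n d^2)": n * d ** 2,
        "self-attention, O(n^2 d)": n ** 2 * d,
    }

for n in (100, 1000, 10000):
    print(n, op_counts(n, d=512, k=3))
# Self-attention overtakes the RNN once n exceeds d, and then keeps growing quadratically.
```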

Positional Encoding
- Relative positional encoding (RPE) comes in three main flavors: (1) add a trainable parameter representing the relative position when computing the attention score and the weighted value; (2) when forming multi-head attention, convert the key's absolute position into a position relative to the query; (3) complex-domain functions: given a word's vector at one position, its vector at any other position can be computed. The first two methods are "word embedding + positional encoding", an after-the-fact patch, whereas the complex-domain approach generates the positional information at the same time as the word vector itself.
- **Learned positional encoding**: similar to how word vectors are generated, an independent vector is learned for each position (a minimal sketch follows this list).
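A minimal sketch of the learned variant: one trainable vector per position, added to the token embeddings. The names `max_len` and `num_hiddens` are placeholders, not from the original.

```python
import torch
from torch import nn

class LearnedPositionalEncoding(nn.Module):
    """Learned positional encoding: one trainable vector per position."""
    def __init__(self, num_hiddens, max_len=1000):
        super().__init__()
        # Trainable table of shape (1, max_len, num_hiddens), learned like word embeddings.
        self.P = nn.Parameter(torch.randn(1, max_len, num_hiddens) * 0.02)

    def forward(self, X):  # X: (batch_size, seq_len, num_hiddens)
        return X + self.P[:, :X.shape[1], :]
```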
Unlike RNNs, which recurrently process tokens of a sequence one-by-one, self-attention ditches sequential operations in favor of parallel computation. Note that self-attention by itself does not preserve the order of the sequence. What do we do if it really matters that the model knows in which order the input sequence arrived?
The dominant approach for preserving information about the order of tokens is to represent this to the model as an additional input associated with each token. These inputs are called positional encodings, and they can either be learned or fixed a priori.
Suppose that the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the $d$-dimensional embeddings for $n$ tokens of a sequence. The positional encoding outputs $\mathbf{X} + \mathbf{P}$ using a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$ of the same shape, whose element on the $i^\textrm{th}$ row and the $(2j)^\textrm{th}$ or the $(2j+1)^\textrm{th}$ column is

$$p_{i, 2j} = \sin\left(\frac{i}{10000^{2j/d}}\right), \qquad p_{i, 2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right).$$
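A sketch of this fixed sinusoidal encoding, roughly following the shape of the D2L implementation; it assumes `num_hiddens` is even, and `max_len` and the `dropout` argument are illustrative choices.

```python
import torch
from torch import nn

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal positional encoding (assumes num_hiddens is even)."""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # P[0, i, 2j] = sin(i / 10000^(2j/d)), P[0, i, 2j+1] = cos(i / 10000^(2j/d))
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(-1, 1) / torch.pow(
            10000, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):  # X: (batch_size, seq_len, num_hiddens)
        # Add the encodings for the first seq_len positions, then apply dropout.
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)
```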
Absolute Positional Information