😋Deep learning Guide 9: Attention Mechanisms

Attention Mechanisms

For example, suppose a kitchen contains an apple, some greens, a tomato, a pair of agate chopsticks, and a cinnabar bowl.
  • Each item has a key (a $d_k$-dimensional vector) and a value (a $d_v$-dimensional vector).
  • Now we issue a "red" query (a $d_q$-dimensional vector). The attention mechanism first computes how strongly this query is related to the key of the apple, the greens, the tomato, the agate chopsticks, and the cinnabar bowl.
  • From these relevance scores it derives a weight for each item, and the final output is
  • (weight of the apple × value of the apple + weight of the greens × value of the greens + weight of the tomato × value of the tomato + weight of the agate chopsticks × value of the agate chopsticks + weight of the cinnabar bowl × value of the cinnabar bowl).
  • The output therefore contains information from every item, but because the apple and the tomato receive larger weights (they are more strongly related to "red"), their values influence the output the most.
Compare this to databases. In their simplest form they are collections of keys ($k$) and values ($v$). For instance, our database $\mathcal{D}$ might consist of tuples
  • ("Zhang", "Aston"), ("Lipton", "Zachary"), ("Li", "Mu"), ("Smola", "Alex"), ("Hu", "Rachel"), ("Werness", "Brent")
with the last name being the key and the first name being the value. We can operate on $\mathcal{D}$, for instance with the exact query ($q$) for "Li", which would return the value "Mu". If ("Li", "Mu") was not a record in $\mathcal{D}$, there would be no valid answer. If we also allowed for approximate matches, we would retrieve ("Lipton", "Zachary") instead. This quite simple and trivial example nonetheless teaches us a number of useful things:
  • We can design queries $q$ that operate on ($k$, $v$) pairs in such a manner as to be valid regardless of the database size.
  • The same query can receive different answers, according to the contents of the database.
  • The "code" being executed for operating on a large state space (the database) can be quite simple (e.g., exact match, approximate match, top-).
  • There is no need to compress or simplify the database to make the operations effective.
Clearly we would not have introduced a simple database here if it wasn't for the purpose of explaining deep learning. Indeed, this leads to one of the most exciting concepts introduced in deep learning in the past decade: the *attention mechanism*. We will cover the specifics of its application to machine translation later. For now, simply consider the following: denote by $\mathcal{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}$ a database of $m$ tuples of *keys* and *values*. Moreover, denote by $\mathbf{q}$ a *query*. Then we can define the *attention* over $\mathcal{D}$ as
$$\textrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,$$
where $\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}$ ($i = 1, \ldots, m$) are scalar attention weights. The operation itself is typically referred to as *attention pooling*. The name *attention* derives from the fact that the operation pays particular attention to the terms for which the weight $\alpha$ is significant (i.e., large). As such, the attention over $\mathcal{D}$ generates a linear combination of values contained in the database. In fact, this contains the above example as a special case where all but one weight is zero. We have a number of special cases (see the sketch after this list):
  • The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ are nonnegative. In this case the output of the attention mechanism is contained in the convex cone spanned by the values $\mathbf{v}_1, \ldots, \mathbf{v}_m$.
  • The weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ form a convex combination, i.e., $\sum_i \alpha(\mathbf{q}, \mathbf{k}_i) = 1$ and $\alpha(\mathbf{q}, \mathbf{k}_i) \geq 0$ for all $i$. This is the most common setting in deep learning.
  • Exactly one of the weights $\alpha(\mathbf{q}, \mathbf{k}_i)$ is $1$, while all others are $0$. This is akin to a traditional database query.
  • All weights are equal, i.e., $\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{1}{m}$ for all $i$. This amounts to averaging across the entire database, also called average pooling in deep learning.
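To make these special cases concrete, here is a minimal numerical sketch (the four value vectors are made up for illustration): attention pooling is just a weighted sum of values, a one-hot weight vector reduces it to an exact database lookup, and uniform weights reduce it to average pooling.

```python
import torch

# Toy database: m = 4 key-value pairs with two-dimensional values.
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 2.0],
                       [4.0, 0.0]])

def attention_pooling(weights, values):
    # Weighted sum of values: sum_i alpha_i * v_i.
    return weights @ values

# Exactly one weight is 1: behaves like an exact database lookup.
print(attention_pooling(torch.tensor([0.0, 0.0, 1.0, 0.0]), values))  # [2., 2.]

# All weights equal 1/m: average pooling over the whole database.
print(attention_pooling(torch.full((4,), 0.25), values))              # [1.75, 0.75]

# A general convex combination blends information from all values.
print(attention_pooling(torch.tensor([0.1, 0.2, 0.3, 0.4]), values))
```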
A common strategy for ensuring that the weights sum up to $1$ is to normalize them via
$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{\sum_j \alpha(\mathbf{q}, \mathbf{k}_j)}.$$
In particular, to ensure that the weights are also nonnegative, one can resort to exponentiation. This means that we can now pick any function $a(\mathbf{q}, \mathbf{k})$ and then apply the softmax operation used for multinomial models to it via
$$\alpha(\mathbf{q}, \mathbf{k}_i) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(a(\mathbf{q}, \mathbf{k}_j))}.$$
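A quick sketch of this normalization, assuming some arbitrary made-up scores $a(\mathbf{q}, \mathbf{k}_i)$; the softmax exponentiates and normalizes them, yielding nonnegative weights that sum to one:

```python
import torch
from torch.nn import functional as F

# Hypothetical unnormalized scores a(q, k_i) for one query and m = 5 keys;
# any real-valued scoring function would do.
scores = torch.tensor([2.0, -1.0, 0.5, 0.0, 3.0])

# Softmax turns arbitrary real-valued scores into nonnegative weights summing to 1.
alpha = F.softmax(scores, dim=-1)
print(alpha, alpha.sum())
```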

Attention Pooling by Similarity

Now that we have introduced the primary components of the attention mechanism, let’s use them in a rather classical setting, namely regression and classification via kernel density estimation (Nadaraya, 1964; Watson, 1964). This detour simply provides additional background: it is entirely optional and can be skipped if needed. At their core, Nadaraya–Watson estimators rely on some similarity kernel $\alpha(\mathbf{q}, \mathbf{k})$ relating queries $\mathbf{q}$ to keys $\mathbf{k}$. Some common choices are
$$\begin{aligned}
\alpha(\mathbf{q}, \mathbf{k}) & = \exp\left(-\tfrac{1}{2} \|\mathbf{q} - \mathbf{k}\|^2\right) && \textrm{(Gaussian);} \\
\alpha(\mathbf{q}, \mathbf{k}) & = 1 \textrm{ if } \|\mathbf{q} - \mathbf{k}\| \leq 1 && \textrm{(Boxcar);} \\
\alpha(\mathbf{q}, \mathbf{k}) & = \max\left(0, 1 - \|\mathbf{q} - \mathbf{k}\|\right) && \textrm{(Epanechikov).}
\end{aligned}$$
There are many more choices that we could pick. See a Wikipedia article for a more extensive review and for how the choice of kernels is related to kernel density estimation, sometimes also called *Parzen Windows*. All of the kernels are heuristic and can be tuned. For instance, we can adjust the width, not only on a global basis but even on a per-coordinate basis. Regardless, all of them lead to the following equation for regression and classification alike:
$$f(\mathbf{q}) = \sum_i \mathbf{v}_i \frac{\alpha(\mathbf{q}, \mathbf{k}_i)}{\sum_j \alpha(\mathbf{q}, \mathbf{k}_j)}.$$
Different kernels correspond to different notions of range and smoothness. For instance, the boxcar kernel only attends to observations within a distance of 1 (or some otherwise defined hyperparameter) and does so indiscriminately.
To see Nadaraya–Watson estimation in action, let’s define some training data. In the following we use the dependency
$$y_i = 2 \sin(x_i) + x_i + \epsilon_i,$$
where $\epsilon_i$ is drawn from a normal distribution with zero mean and unit variance. We draw 40 training examples.
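A minimal data generator along these lines might look as follows (the interval $[0, 5]$ and the grid spacing are assumptions made for illustration; the names x_train, y_train, and x_val match the ones used below):

```python
import torch

n = 40
# Covariates sorted on an interval (the range [0, 5] is an arbitrary choice here).
x_train, _ = torch.sort(torch.rand(n) * 5)
# Noisy labels following y = 2 sin(x) + x + eps with standard normal noise.
y_train = 2 * torch.sin(x_train) + x_train + torch.randn(n)
# A grid of validation locations and the noise-free target for comparison.
x_val = torch.arange(0, 5, 0.1)
y_val = 2 * torch.sin(x_val) + x_val
```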

Attention Pooling via Nadaraya–Watson Regression

Now that we have data and kernels, all we need is a function that computes the kernel regression estimates. Note that we also want to obtain the relative kernel weights in order to perform some minor diagnostics. Hence we first compute the kernel between all training features (covariates) x_train and all validation features x_val. This yields a matrix, which we subsequently normalize. When multiplied with the training labels y_train we obtain the estimates.
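One possible sketch of such a function, assuming the tensors from the previous snippet. This is not the book's exact implementation, but it follows the recipe just described: compute the kernel matrix between x_val and x_train, normalize it row-wise, and multiply it with y_train.

```python
import torch

def gaussian(d):
    # Gaussian similarity kernel applied to a (signed) distance d.
    return torch.exp(-d ** 2 / 2)

def nadaraya_watson(x_train, y_train, x_val, kernel=gaussian):
    # Pairwise differences: one row per validation point, one column per training point.
    dists = x_val.reshape(-1, 1) - x_train.reshape(1, -1)
    k = kernel(dists)
    # Normalize each row so the weights form a convex combination.
    attention_w = k / k.sum(dim=1, keepdim=True)
    # The estimate is an attention-weighted average of the training labels.
    y_hat = attention_w @ y_train
    return y_hat, attention_w

y_hat, attention_w = nadaraya_watson(x_train, y_train, x_val)
```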

Adapting Attention Pooling

We could replace the Gaussian kernel with one of a different width. That is, we could use $\alpha(\mathbf{q}, \mathbf{k}) = \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{q} - \mathbf{k}\|^2\right)$, where $\sigma^2$ determines the width of the kernel. Let's see whether this affects the outcomes.
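A sketch of how one might sweep the width, reusing the hypothetical nadaraya_watson helper from above (the particular values of $\sigma$ are arbitrary choices):

```python
import torch

def gaussian_with_width(sigma):
    # Gaussian kernel whose width is controlled by sigma.
    return lambda d: torch.exp(-d ** 2 / (2 * sigma ** 2))

# Re-run the kernel regression with a few different widths.
for sigma in (0.1, 0.2, 0.5, 1.0):
    y_hat, attention_w = nadaraya_watson(
        x_train, y_train, x_val, kernel=gaussian_with_width(sigma))
```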
As we would expect, the narrower the kernel, the narrower the range over which the attention weights are large. It is equally clear that picking the same width everywhere may not be ideal. In fact, Silverman (1986) proposed a heuristic that depends on the local density, and many more such "tricks" have been proposed. For example, Norelli et al. (2022) used a similar nearest-neighbor interpolation technique to design cross-modal image and text representations.
The astute reader may wonder why we are taking such a deep dive into a method that is more than half a century old. First, it is one of the earliest precursors of modern attention mechanisms. Second, it lends itself well to visualization. Third, and just as importantly, it demonstrates the limits of hand-crafted attention mechanisms. A much better strategy is to learn the mechanism, by learning the representations for queries and keys. This is what we set out to do in the following sections.

Attention Scoring Functions


Dot Product Attention

Let's review the attention function (without exponentiation) from the Gaussian kernel for a moment:
$$a(\mathbf{q}, \mathbf{k}_i) = -\frac{1}{2} \|\mathbf{q} - \mathbf{k}_i\|^2 = \mathbf{q}^\top \mathbf{k}_i - \frac{1}{2} \|\mathbf{k}_i\|^2 - \frac{1}{2} \|\mathbf{q}\|^2.$$
The last term is the same for every key and so disappears once the weights are normalized, and for keys with (approximately) constant norm the second term can be dropped as well, leaving just the dot product $\mathbf{q}^\top \mathbf{k}_i$; dividing it by $\sqrt{d}$ keeps its scale independent of the vector length $d$. Note that the attention weights $\alpha$ still need normalizing. We can simplify this further by using the softmax operation:
$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}\left(a(\mathbf{q}, \mathbf{k}_i)\right) = \frac{\exp\left(\mathbf{q}^\top \mathbf{k}_i / \sqrt{d}\right)}{\sum_{j} \exp\left(\mathbf{q}^\top \mathbf{k}_j / \sqrt{d}\right)}.$$
One of the most popular applications of the attention mechanism is to sequence models. Hence we need to be able to deal with sequences of different lengths. In some cases, such sequences may end up in the same minibatch, necessitating padding with dummy tokens for shorter sequences. These special tokens do not carry meaning. For instance, assume that we have the following three sentences:
Dive into Deep Learning
Learn to code `<blank>`
Hello world `<blank>` `<blank>`
Since we do not want blanks in our attention model we simply need to limit $\sum_{i=1}^n$ to $\sum_{i=1}^l$ for however long, $l \leq n$, the actual sentence is. Since it is such a common problem, it has a name: the *masked softmax operation*.
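A minimal sketch of a masked softmax, simplified to 2D score matrices (a full implementation would also handle higher-dimensional score tensors); the scores here are random placeholders:

```python
import torch
from torch.nn import functional as F

def masked_softmax(scores, valid_lens):
    # scores: (batch, n) attention scores; valid_lens: (batch,) true sequence lengths.
    n = scores.shape[-1]
    # True for real tokens, False for padding positions beyond each valid length.
    mask = torch.arange(n, device=scores.device)[None, :] < valid_lens[:, None]
    # Padded positions get a score of -inf and hence zero weight after softmax.
    return F.softmax(scores.masked_fill(~mask, float('-inf')), dim=-1)

scores = torch.randn(3, 4)                       # three "sentences" of padded length 4
print(masked_softmax(scores, torch.tensor([4, 3, 2])))
```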

Batch Matrix Multiplication

Another commonly used operation is to multiply batches of matrices by one another. This comes in handy when we have minibatches of queries, keys, and values. More specifically, assume that
$$\mathbf{Q} = [\mathbf{Q}_1, \mathbf{Q}_2, \ldots, \mathbf{Q}_n] \in \mathbb{R}^{n \times a \times b}, \qquad
\mathbf{K} = [\mathbf{K}_1, \mathbf{K}_2, \ldots, \mathbf{K}_n] \in \mathbb{R}^{n \times b \times c}.$$
Then the batch matrix multiplication (BMM) computes the elementwise product
$$\textrm{BMM}(\mathbf{Q}, \mathbf{K}) = [\mathbf{Q}_1 \mathbf{K}_1, \mathbf{Q}_2 \mathbf{K}_2, \ldots, \mathbf{Q}_n \mathbf{K}_n] \in \mathbb{R}^{n \times a \times c}.$$
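For instance, with PyTorch's torch.bmm (the shapes are chosen arbitrarily for illustration):

```python
import torch

# A minibatch of n = 2 matrix pairs with shapes (a, b) = (3, 4) and (b, c) = (4, 6).
Q = torch.ones(2, 3, 4)
K = torch.ones(2, 4, 6)
# bmm multiplies the i-th matrix in Q with the i-th matrix in K for each batch element.
print(torch.bmm(Q, K).shape)   # torch.Size([2, 3, 6])
```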

Scaled Dot Product Attention

In practice, we often think of minibatches for efficiency, such as computing attention for $n$ queries and $m$ key-value pairs, where queries and keys are of length $d$ and values are of length $v$. The scaled dot product attention of queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, keys $\mathbf{K} \in \mathbb{R}^{m \times d}$, and values $\mathbf{V} \in \mathbb{R}^{m \times v}$ thus can be written as
$$\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}.$$
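A compact sketch of this computation (it omits the masking and dropout that a full implementation would typically add):

```python
import torch
from torch.nn import functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q: (batch, n, d), K: (batch, m, d), V: (batch, m, v).
    d = Q.shape[-1]
    # Scores (batch, n, m), scaled by sqrt(d) so their magnitude stays bounded.
    scores = torch.bmm(Q, K.transpose(1, 2)) / d ** 0.5
    weights = F.softmax(scores, dim=-1)
    # Each query gets a convex combination of the values: output (batch, n, v).
    return torch.bmm(weights, V)

Q = torch.randn(2, 1, 8)    # 2 batch elements, 1 query of length d = 8
K = torch.randn(2, 10, 8)   # 10 keys of length d = 8
V = torch.randn(2, 10, 4)   # 10 values of length v = 4
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 1, 4])
```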

Additive Attention

When queries and keys are vectors of different dimension, we can either use a matrix $\mathbf{M}$ to address the mismatch via $\mathbf{q}^\top \mathbf{M} \mathbf{k}$, or we can use additive attention as the scoring function. Another benefit is that, as its name indicates, the attention is additive. This can lead to some minor computational savings. Given a query $\mathbf{q} \in \mathbb{R}^q$ and a key $\mathbf{k} \in \mathbb{R}^k$, the additive attention scoring function (Bahdanau et al., 2014) is given by
$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh\left(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}\right) \in \mathbb{R},$$
where $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^{h}$ are the learnable parameters. This term is then fed into a softmax to ensure both nonnegativity and normalization.
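A minimal sketch of this scoring function as a PyTorch module (the class name AdditiveScore, the dimensions, and num_hiddens are illustrative choices, not fixed by the text; batching and masking are left out):

```python
import torch
from torch import nn

class AdditiveScore(nn.Module):
    # a(q, k) = w_v^T tanh(W_q q + W_k k), with h = num_hiddens hidden units.
    def __init__(self, q_dim, k_dim, num_hiddens):
        super().__init__()
        self.W_q = nn.Linear(q_dim, num_hiddens, bias=False)
        self.W_k = nn.Linear(k_dim, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)

    def forward(self, q, k):
        # q: (n, q_dim), k: (m, k_dim) -> scores of shape (n, m).
        features = torch.tanh(self.W_q(q).unsqueeze(1) + self.W_k(k).unsqueeze(0))
        return self.w_v(features).squeeze(-1)

scorer = AdditiveScore(q_dim=20, k_dim=2, num_hiddens=8)
scores = scorer(torch.randn(3, 20), torch.randn(5, 2))
print(scores.shape)   # torch.Size([3, 5]); these scores then go through a (masked) softmax
```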
