
1. Suppose that you wanted to reimplement approximate (key, query) matches as used in classical databases, which attention function would you pick?
Cosine similarity would be a natural pick, since approximate (key, query) matching in a database amounts to asking how similar a query is to each stored key. Using cosine similarity (i.e., with the rows of $\mathbf{Q}$ and $\mathbf{K}$ L2-normalized, as noted below), the attention weights can be expressed as follows:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)$$
Where:
- $\mathbf{Q}$ is the matrix of query vectors,
- $\mathbf{K}$ is the matrix of key vectors,
- $d_k$ is the dimension of the key vectors.
This choice has several advantages:
- Efficient Computation: Cosine similarity can be computed efficiently using matrix multiplications, which can take advantage of highly optimized linear algebra libraries.
- Intuitive Interpretation: Cosine similarity has a straightforward interpretation as the cosine of the angle between two vectors, which aligns well with the notion of approximate (key, query) matching.
- Robustness to Magnitude: Cosine similarity is invariant to the magnitudes of the key and query vectors, which can be useful when the absolute values of the features are not as important as their relative values.
- Scalability: Cosine similarity attention can be applied to large-scale problems due to its computational efficiency and the availability of highly optimized linear algebra libraries.
One important consideration when using cosine similarity attention is that the vectors should be normalized (e.g., L2-normalized) to ensure that the magnitude of the vectors does not affect the attention weights.
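As a concrete illustration, below is a minimal sketch of cosine-similarity attention, assuming a PyTorch setting; the function name, tensor shapes, and toy inputs are illustrative choices. Since the queries and keys are L2-normalized first, their dot products are exactly the cosine similarities, and the $\sqrt{d_k}$ scaling is dropped here because the scores are already bounded in $[-1, 1]$.

```python
# A minimal sketch of cosine-similarity attention, assuming a PyTorch setting.
# L2-normalizing the queries and keys makes each dot product the cosine of the
# angle between a query and a key, so vector magnitudes no longer affect the
# attention weights.
import torch
import torch.nn.functional as F


def cosine_similarity_attention(queries, keys, values):
    # queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v)
    q = F.normalize(queries, dim=-1)         # unit-length query vectors
    k = F.normalize(keys, dim=-1)            # unit-length key vectors
    scores = q @ k.T                         # cosine similarities in [-1, 1]
    weights = torch.softmax(scores, dim=-1)  # attention distribution over keys
    return weights @ values                  # weighted combination of values


if __name__ == "__main__":
    torch.manual_seed(0)
    queries = torch.randn(2, 8)
    keys = torch.randn(5, 8)
    values = torch.randn(5, 4)
    print(cosine_similarity_attention(queries, keys, values).shape)  # (2, 4)
```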
2. Suppose that the attention function is given by $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$ and that $\mathbf{k}_i = \mathbf{v}_i$ for $i = 1, \ldots, m$. Denote by $p(\mathbf{k}_i; \mathbf{q})$ the probability distribution over keys when using the softmax normalization. Prove that $\nabla_{\mathbf{q}} \mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \mathrm{Cov}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i]$.
- Define the attention function as $a(\mathbf{q}, \mathbf{k}_i) = \mathbf{q}^\top \mathbf{k}_i$, where:
  - $\mathbf{k}_1, \ldots, \mathbf{k}_m$ are the key vectors, with $\mathbf{k}_i = \mathbf{v}_i$;
  - $p(\mathbf{k}_i; \mathbf{q})$ is the probability distribution over the keys, given the query vector $\mathbf{q}$, obtained using the softmax normalization: $p(\mathbf{k}_i; \mathbf{q}) = \dfrac{\exp(\mathbf{q}^\top \mathbf{k}_i)}{\sum_{j=1}^{m} \exp(\mathbf{q}^\top \mathbf{k}_j)}$;
  - $\bar{\mathbf{v}} = \mathbb{E}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i] = \sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,\mathbf{v}_i$ is the expected value of the key (equivalently, value) vectors under the distribution $p(\mathbf{k}_i; \mathbf{q})$.
- The term $\mathrm{Cov}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i]$ represents the covariance of the key vectors under the probability distribution $p(\mathbf{k}_i; \mathbf{q})$. Specifically, the covariance is defined as:
$$\mathrm{Cov}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i] = \sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})^\top.$$
- The attention output is the expectation of the values under the distribution $p(\mathbf{k}_i; \mathbf{q})$:
$$\mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,\mathbf{v}_i = \mathbb{E}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i].$$
- To compute the gradient $\nabla_{\mathbf{q}} \mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \sum_{i=1}^{m} \mathbf{v}_i\,\nabla_{\mathbf{q}} p(\mathbf{k}_i; \mathbf{q})^\top$, we can use the log-derivative trick: $\nabla_{\mathbf{q}} p(\mathbf{k}_i; \mathbf{q}) = p(\mathbf{k}_i; \mathbf{q})\,\nabla_{\mathbf{q}} \log p(\mathbf{k}_i; \mathbf{q})$.
- Observe that
$$\nabla_{\mathbf{q}} \log p(\mathbf{k}_i; \mathbf{q}) = \mathbf{k}_i - \sum_{j=1}^{m} p(\mathbf{k}_j; \mathbf{q})\,\mathbf{k}_j = \mathbf{v}_i - \bar{\mathbf{v}},$$
where the last equality uses $\mathbf{k}_i = \mathbf{v}_i$.
- Substituting this expression into the gradient, we get:
$$\nabla_{\mathbf{q}} \mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,\mathbf{v}_i\,(\mathbf{v}_i - \bar{\mathbf{v}})^\top = \sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,(\mathbf{v}_i - \bar{\mathbf{v}})(\mathbf{v}_i - \bar{\mathbf{v}})^\top,$$
where the second equality holds because $\sum_{i=1}^{m} p(\mathbf{k}_i; \mathbf{q})\,(\mathbf{v}_i - \bar{\mathbf{v}})^\top = \mathbf{0}^\top$.
- Recognizing the final expression as the covariance of the vectors $\mathbf{v}_i$ under the distribution $p(\mathbf{k}_i; \mathbf{q})$, we can conclude that:
$$\nabla_{\mathbf{q}} \mathrm{Attention}(\mathbf{q}, \mathcal{D}) = \mathrm{Cov}_{p(\mathbf{k}_i; \mathbf{q})}[\mathbf{v}_i].$$
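As a sanity check, the identity can also be verified numerically with automatic differentiation. The sketch below (assuming PyTorch; the sizes and seed are arbitrary) compares the autograd Jacobian of the attention output with the covariance computed directly from $p(\mathbf{k}_i; \mathbf{q})$.

```python
# Numerical check that the Jacobian of the attention output w.r.t. q equals the
# covariance of the values under p(k_i; q) when k_i = v_i. Sizes and the seed
# are arbitrary illustrative choices.
import torch

torch.manual_seed(0)
m, d = 6, 4
k = torch.randn(m, d)  # keys, which also serve as the values
q = torch.randn(d)     # query


def attention(q_):
    p = torch.softmax(k @ q_, dim=0)  # p(k_i; q)
    return p @ k                      # E_p[v_i]


# d x d Jacobian of the attention output with respect to q, via autograd.
jac = torch.autograd.functional.jacobian(attention, q)

# Covariance of the keys/values under p(k_i; q), computed directly.
p = torch.softmax(k @ q, dim=0)
centered = k - p @ k                         # v_i - E_p[v], row-wise
cov = centered.T @ (p.unsqueeze(1) * centered)

print(torch.allclose(jac, cov, atol=1e-6))  # expected: True
```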
3. Design a differentiable search engine using the attention mechanism.
The design is built around a `DifferentiableSearchEngine` class that takes a document encoder, a query encoder, and an attention module as input. The `forward` method of this class computes the relevance scores between the input documents and queries using the attention mechanism. A `train` function demonstrates how to train the search engine model using a pairwise ranking loss (implemented here as `BCEWithLogitsLoss`). The model is trained for a specified number of epochs, and the training and validation losses are reported after each epoch; a minimal sketch is given after the notes below.
Note that this is a simplified example, and in a real-world implementation you would need to handle additional components, such as:
- Efficient encoding of large-scale document and query corpora
- Handling variable-length documents and queries
- Incorporating additional signals (e.g., user feedback, document metadata) to improve relevance scoring
- Efficient retrieval of top-k relevant documents
- Evaluation metrics and strategies for model selection
The key aspects of this design are the use of the attention mechanism to compute relevance scores in a differentiable way, and the end-to-end training of the entire system using gradient-based optimization. This allows the model to learn the most effective way to rank and retrieve documents based on the user's query.
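Below is a minimal sketch of this design, assuming a PyTorch setting. Since the original code listing is not reproduced here, the encoder modules, the toy data, and the simplified training loop are illustrative assumptions; in particular, the loss is `BCEWithLogitsLoss` applied pointwise to binary relevance labels, a simplification of the pairwise ranking objective described above, and validation is omitted for brevity.

```python
# A minimal sketch of a differentiable search engine, assuming a PyTorch
# setting. Encoders, toy data, and the single-split training loop are
# illustrative placeholders, not the article's original code.
import torch
import torch.nn as nn


class DotProductAttention(nn.Module):
    """Scaled dot-product scoring between query and document embeddings."""

    def forward(self, queries, docs):
        # queries: (batch, d), docs: (num_docs, d) -> logits: (batch, num_docs)
        d = queries.shape[-1]
        return queries @ docs.T / d ** 0.5


class DifferentiableSearchEngine(nn.Module):
    """Scores documents against queries in a fully differentiable way."""

    def __init__(self, doc_encoder, query_encoder, attention):
        super().__init__()
        self.doc_encoder = doc_encoder
        self.query_encoder = query_encoder
        self.attention = attention

    def forward(self, query_feats, doc_feats):
        q = self.query_encoder(query_feats)  # (batch, d)
        k = self.doc_encoder(doc_feats)      # (num_docs, d)
        return self.attention(q, k)          # relevance logits


def train(model, query_feats, doc_feats, labels, epochs=5, lr=1e-2):
    """Trains on binary relevance labels with BCEWithLogitsLoss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        optimizer.zero_grad()
        logits = model(query_feats, doc_feats)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")


if __name__ == "__main__":
    torch.manual_seed(0)
    num_queries, num_docs, feat_dim, emb_dim = 8, 16, 32, 16
    # Toy features and binary relevance labels (1 = relevant).
    query_feats = torch.randn(num_queries, feat_dim)
    doc_feats = torch.randn(num_docs, feat_dim)
    labels = torch.randint(0, 2, (num_queries, num_docs)).float()

    model = DifferentiableSearchEngine(
        doc_encoder=nn.Linear(feat_dim, emb_dim),
        query_encoder=nn.Linear(feat_dim, emb_dim),
        attention=DotProductAttention(),
    )
    train(model, query_feats, doc_feats, labels)
```

Because the relevance scores are produced end to end by differentiable modules, swapping in stronger encoders (e.g., pretrained text encoders) or a learned attention module only changes the constructor arguments; the gradient-based training loop stays the same.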
4. Review the design of the Squeeze and Excitation Networks (Hu et al., 2018) and interpret them through the lens of the attention mechanism.
Squeeze-and-Excitation (SE) Networks, proposed by Hu et al. in 2018, are a neural network architecture that introduces a novel attention-based mechanism to enhance the representational power of convolutional neural networks (CNNs). The SE module can be easily integrated into a wide range of CNN architectures to improve their performance on various computer vision tasks.
The key idea behind SE networks is to explicitly model the interdependencies between the channels of a feature map, allowing the network to adaptively recalibrate channel-wise feature responses. This is achieved through an attention mechanism, which is the core of the SE module.
The SE module consists of two main components (a minimal code sketch is given at the end of this answer):
- Squeeze: The squeeze operation applies global average pooling, which collapses the spatial dimensions of the input feature map into a single value per channel. The resulting vector represents the global distribution of channel-wise feature responses.
- Excitation: The excitation component takes the squeezed feature vector and applies two fully connected layers with a ReLU activation in between. This allows the module to learn a non-linear transformation that models the interdependencies between channels. The output of the excitation component is a set of per-channel weights used to recalibrate the original feature map.
The attention mechanism in the SE module can be interpreted as follows:
- Channel-wise attention: The SE module learns to assign different importance (weights) to the different channels of the feature map. This allows the network to focus on the most informative channels and suppress the less relevant ones, effectively enhancing the representational power of the feature map.
- Adaptive recalibration: The channel weights produced by the excitation component are used to scale the original feature map, adaptively recalibrating the feature responses. This enables the network to emphasize the most informative features and suppress the less relevant ones, improving performance.
The attention mechanism in the SE module can thus be seen as a form of "channel attention", in which the model learns to focus on the most relevant channels of the feature map. This contrasts with the more common "spatial attention", where the model learns to focus on the most relevant spatial locations within the feature map.
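For concreteness, here is a minimal sketch of an SE block, assuming a PyTorch setting. It is an illustrative re-implementation rather than the authors' code; the reduction ratio `r=16` is the typical default reported in the paper.

```python
# A minimal sketch of a Squeeze-and-Excitation block, assuming a PyTorch
# setting. This is an illustrative re-implementation, not the authors' code;
# the reduction ratio r=16 is the typical default reported in the paper.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Squeeze: global average pooling collapses each H x W feature map
        # to a single value per channel.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: two fully connected layers with a ReLU in between,
        # followed by a sigmoid that yields per-channel weights in (0, 1).
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)           # (B, C) channel descriptors
        w = self.excitation(w).view(b, c, 1, 1)  # (B, C, 1, 1) channel weights
        return x * w                             # recalibrated feature map


if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    print(SEBlock(channels=64)(x).shape)  # torch.Size([2, 64, 32, 32])
```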
- Author:tom-ci
- URL:https://www.tomciheng.com//article/11-1-Queries
- Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!