🧐 Exercise Answers 11.3. Attention Scoring Functions

1. Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.

Original function:
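A minimal sketch of the scaled dot product attention along the lines of the book's DotProductAttention (masking is delegated to d2l.masked_softmax):

```python
import math
import torch
from torch import nn
from d2l import torch as d2l

class DotProductAttention(nn.Module):
    """Scaled dot product attention: score(q, k) = q . k / sqrt(d)."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # queries: (batch, num_queries, d); keys: (batch, num_keys, d)
        # values: (batch, num_keys, value_dim)
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```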

New function:
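Expanding the squared distance gives -0.5 * ||q - k||^2 = q . k - 0.5 * ||k||^2 - 0.5 * ||q||^2. The ||q||^2 term is the same for every key of a given query, so it cancels inside the softmax; only the squared norms of the keys are needed. A sketch of the modified class (the name DistanceBasedAttention is my own; it reuses the imports above):

```python
class DistanceBasedAttention(nn.Module):
    """Attention with distance-based scores -0.5 * ||q - k||^2 (up to a per-query constant)."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # q . k term, shape (batch, num_queries, num_keys)
        scores = torch.bmm(queries, keys.transpose(1, 2))
        # Subtract half the squared key norms; broadcasts over the query axis.
        # The -0.5 * ||q||^2 term is dropped because it is constant per query
        # and cancels in the softmax.
        scores = scores - 0.5 * (keys ** 2).sum(dim=-1).unsqueeze(1)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```

As a quick shape check, queries of shape (2, 1, 2), keys of shape (2, 10, 2), and values of shape (2, 10, 4) produce an output of shape (2, 1, 4), the same as with DotProductAttention.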


2. Modify the dot product attention to allow for queries and keys of different dimensionalities by employing a matrix to adjust dimensions.
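One possible answer (a sketch; the class and parameter names are my own): insert a learnable matrix W that projects queries of dimension query_dim into the key space of dimension key_dim, so the score becomes (q W) . k, i.e. the "general" scoring function of Luong et al. The class below reuses the imports from Question 1.

```python
class GeneralDotProductAttention(nn.Module):
    """Dot product attention where queries (query_dim) and keys (key_dim) differ:
    score(q, k) = (q W) . k / sqrt(key_dim), with W a learnable (query_dim x key_dim) matrix."""
    def __init__(self, query_dim, key_dim, dropout):
        super().__init__()
        self.W_q = nn.Linear(query_dim, key_dim, bias=False)  # adjusts the query dimension
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # queries: (batch, num_queries, query_dim); keys: (batch, num_keys, key_dim)
        queries = self.W_q(queries)                            # -> (batch, num_queries, key_dim)
        d = keys.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```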

3. How does the computational cost scale with the dimensionality of the keys, queries, values, and their number? What about the memory bandwidth requirements?

  1. Computational Cost:
      • Matrix Multiplication: The key step is the matrix multiplication between the queries and the transposed keys, which has a time complexity of O(batch_size * num_queries * num_keys * d), where d is the dimensionality of the queries and keys.
      • Softmax: The softmax over the attention scores has a time complexity of O(batch_size * num_queries * num_keys).
      • Weighted Sum: The final step multiplies the (num_queries x num_keys) attention weights by the (num_keys x value_dim) values, which has a time complexity of O(batch_size * num_queries * num_keys * value_dim), where value_dim is the dimensionality of the values.
      • Overall Complexity: The total cost is O(batch_size * num_queries * num_keys * (d + value_dim)): linear in each dimensionality, but quadratic in sequence length through the num_queries * num_keys factor.
  2. Memory Bandwidth Requirements:
      • Input Tensors: The queries, keys, and values must be read from memory, requiring bandwidth proportional to their total size, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim)).
      • Intermediate Tensors: The attention scores and attention weights, each of size batch_size * num_queries * num_keys, must be written and read back, which adds to the memory bandwidth requirements.
      • Output Tensor: The final output tensor, the weighted sum of the values, is written back to memory, requiring bandwidth proportional to O(batch_size * num_queries * value_dim).
      • Overall Bandwidth: The total memory traffic is proportional to the sum of the input, intermediate, and output tensor sizes, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim + num_queries * num_keys + num_queries * value_dim)). A rough estimator is sketched after this list.
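To make the scaling concrete, a rough back-of-the-envelope helper (the function name and the softmax constant are my own approximations; it counts a multiply-add as 2 FLOPs and assumes float32):

```python
def attention_flops_and_bytes(batch, num_q, num_k, d, value_dim, bytes_per_elt=4):
    """Rough FLOP and memory-traffic estimates for one dot product attention pass."""
    score_flops = 2 * batch * num_q * num_k * d                   # Q @ K^T
    softmax_flops = 5 * batch * num_q * num_k                     # exp, sum, divide (approx.)
    weighted_sum_flops = 2 * batch * num_q * num_k * value_dim    # attention_weights @ V
    flops = score_flops + softmax_flops + weighted_sum_flops

    io_elems = batch * (num_q * d + num_k * d + num_k * value_dim  # read Q, K, V
                        + 2 * num_q * num_k                        # write/read scores and weights
                        + num_q * value_dim)                       # write the output
    return flops, io_elems * bytes_per_elt

# Doubling num_k roughly doubles both FLOPs and memory traffic; doubling d only
# grows the Q @ K^T term and the reads of Q and K.
print(attention_flops_and_bytes(batch=1, num_q=512, num_k=512, d=64, value_dim=64))
```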