1. Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.
The original function:
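A minimal sketch of the d2l-style scaled dot product attention, assuming the `d2l.masked_softmax` helper from the book (the exact code in the original post may differ):

```python
import math
import torch
from torch import nn
from d2l import torch as d2l  # provides masked_softmax

class DotProductAttention(nn.Module):
    """Scaled dot product attention."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    # queries: (batch_size, num_queries, d)
    # keys:    (batch_size, num_keys, d)
    # values:  (batch_size, num_keys, value_dim)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Score every query against every key: q^T k / sqrt(d)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```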
The new function:
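One possible distance-based version (a sketch with the same interface, keeping the 1/sqrt(d) scaling by assumption). Expanding -||q - k||^2 / 2 = q^T k - ||k||^2 / 2 - ||q||^2 / 2 shows that the ||q||^2 / 2 term is the same for every key of a given query and cancels inside the softmax, so only the squared norms of the keys are needed:

```python
class DistanceBasedAttention(nn.Module):
    """Attention scored by the negative squared Euclidean distance."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # -||q - k||^2 / 2 = q^T k - ||k||^2 / 2 - ||q||^2 / 2; the last term
        # is shared by all keys of a query and cancels in the softmax, so
        # only the squared norms of the keys enter the scores.
        key_sq_norms = (keys ** 2).sum(dim=-1)              # (batch_size, num_keys)
        scores = (torch.bmm(queries, keys.transpose(1, 2))
                  - 0.5 * key_sq_norms.unsqueeze(1)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```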

2. Modify the dot product attention to allow for queries and keys of different dimensionalities by employing a matrix to adjust dimensions.
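A minimal sketch of this idea: a single learned, bias-free linear map (`W_q`, a hypothetical name) projects the queries to the dimensionality of the keys before the usual scaled dot product. The class name and constructor arguments are illustrative, not from the original post:

```python
class ProjectedDotProductAttention(nn.Module):
    """Dot product attention for queries and keys of different sizes."""
    def __init__(self, key_dim, dropout):
        super().__init__()
        # Adjustment matrix: maps queries (of any width) to key_dim.
        # LazyLinear infers the query dimensionality on the first call.
        self.W_q = nn.LazyLinear(key_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        d = keys.shape[-1]
        queries = self.W_q(queries)   # (batch_size, num_queries, key_dim)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```

For example, `ProjectedDotProductAttention(key_dim=keys.shape[-1], dropout=0.1)` can then accept queries whose last dimension differs from that of the keys.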
3. How does the computational cost scale with the dimensionality of the keys, queries, values, and their number? What about the memory bandwidth requirements?
- Computational Cost:
- Matrix Multiplication: The key step in the attention mechanism is the matrix multiplication between the queries and the transposed keys, which has a time complexity of O(batch_size * num_queries * num_keys * d), where d is the dimensionality of the keys and queries.
- Softmax: The softmax operation on the attention scores has a time complexity of O(batch_size * num_queries * num_keys).
- Weighted Sum: The final weighted sum of the values is another batched matrix multiplication, between the attention weights and the values, with a time complexity of O(batch_size * num_queries * num_keys * value_dim), where value_dim is the dimensionality of the values.
- Overall Complexity: The overall computational cost of the attention mechanism is O(batch_size * (num_queries * num_keys * d + num_queries * num_keys + num_queries * num_keys * value_dim)), i.e., it grows linearly in each of d, value_dim, the number of queries, and the number of keys.
- Memory Bandwidth Requirements:
- Input Tensors: The input tensors (queries, keys, values) need to be fetched from memory, which requires bandwidth proportional to the total size of these tensors, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim)).
- Intermediate Tensors: The intermediate tensors, such as the attention scores and the attention weights, also need to be stored and accessed, which adds to the memory bandwidth requirements.
- Output Tensor: The final output tensor, which is the weighted sum of the values, needs to be written back to memory, requiring bandwidth proportional to O(batch_size * num_queries * value_dim).
- Overall Bandwidth: The overall memory bandwidth requirement is proportional to the sum of the input, intermediate, and output tensor sizes, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim + num_queries * num_keys + num_queries * value_dim)).
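As a rough sanity check of the scaling above, here is a small helper (hypothetical, not from the original post) that tallies the dominant multiply-accumulates and the number of elements moved for given shapes:

```python
def attention_cost(batch_size, num_queries, num_keys, d, value_dim):
    """Back-of-the-envelope FLOP and memory-traffic counts for dot product attention."""
    # Multiply-accumulates for QK^T, the softmax, and the weighted sum of values.
    flops = batch_size * (num_queries * num_keys * d              # scores
                          + num_queries * num_keys                # softmax (up to a constant)
                          + num_queries * num_keys * value_dim)   # weighted sum
    # Elements read or written; multiply by bytes per element (e.g. 4 for float32).
    elements = batch_size * (num_queries * d + num_keys * d + num_keys * value_dim  # inputs
                             + num_queries * num_keys                               # scores/weights
                             + num_queries * value_dim)                             # output
    return flops, elements

# Doubling num_keys roughly doubles both the compute and the score-matrix traffic.
print(attention_cost(2, 128, 128, 64, 64))
print(attention_cost(2, 128, 256, 64, 64))
```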
- Author:tom-ci
- URL:https://www.tomciheng.com//article/11-3-Att-scoring
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please credit the source!