🧐 Exercise Answers 11.3. Attention Scoring Functions

1. Implement distance-based attention by modifying the DotProductAttention code. Note that you only need the squared norms of the keys for an efficient implementation.

Original function:
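A minimal sketch of the scaled dot product attention along the lines of the book's DotProductAttention (masking is delegated to d2l.masked_softmax):

```python
import math
import torch
from torch import nn
from d2l import torch as d2l

class DotProductAttention(nn.Module):
    """Scaled dot product attention: score(q, k) = q . k / sqrt(d)."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # queries: (batch, num_queries, d); keys: (batch, num_keys, d)
        # values: (batch, num_keys, value_dim)
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```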

New function:
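Expanding the squared distance gives -0.5 * ||q - k||^2 = q . k - 0.5 * ||k||^2 - 0.5 * ||q||^2. The ||q||^2 term is the same for every key of a given query, so it cancels inside the softmax; only the squared norms of the keys are needed. A sketch of the modified class (the name DistanceBasedAttention is my own; it reuses the imports above):

```python
class DistanceBasedAttention(nn.Module):
    """Attention with distance-based scores -0.5 * ||q - k||^2 (up to a per-query constant)."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # q . k term, shape (batch, num_queries, num_keys)
        scores = torch.bmm(queries, keys.transpose(1, 2))
        # Subtract half the squared key norms; broadcasts over the query axis.
        # The -0.5 * ||q||^2 term is dropped because it is constant per query
        # and cancels in the softmax.
        scores = scores - 0.5 * (keys ** 2).sum(dim=-1).unsqueeze(1)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```

As a quick shape check, queries of shape (2, 1, 2), keys of shape (2, 10, 2), and values of shape (2, 10, 4) produce an output of shape (2, 1, 4), the same as with DotProductAttention.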


2. Modify the dot product attention to allow for queries and keys of different dimensionalities by employing a matrix to adjust dimensions.
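One possible answer (a sketch; the class and parameter names are my own): insert a learnable matrix W that projects queries of dimension query_dim into the key space of dimension key_dim, so the score becomes (q W) . k, i.e. the "general" scoring function of Luong et al. The class below reuses the imports from Question 1.

```python
class GeneralDotProductAttention(nn.Module):
    """Dot product attention where queries (query_dim) and keys (key_dim) differ:
    score(q, k) = (q W) . k / sqrt(key_dim), with W a learnable (query_dim x key_dim) matrix."""
    def __init__(self, query_dim, key_dim, dropout):
        super().__init__()
        self.W_q = nn.Linear(query_dim, key_dim, bias=False)  # adjusts the query dimension
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens=None):
        # queries: (batch, num_queries, query_dim); keys: (batch, num_keys, key_dim)
        queries = self.W_q(queries)                            # -> (batch, num_queries, key_dim)
        d = keys.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = d2l.masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
```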

3. How does the computational cost scale with the dimensionality of the keys, queries, values, and their number? What about the memory bandwidth requirements?

  1. Computational Cost:
      • Matrix Multiplication: The key step is the matrix multiplication between the queries and the transposed keys, which has a time complexity of O(batch_size * num_queries * num_keys * d), where d is the dimensionality of the queries and keys.
      • Softmax: The softmax over the attention scores has a time complexity of O(batch_size * num_queries * num_keys).
      • Weighted Sum: The final step multiplies the (num_queries x num_keys) attention weights by the (num_keys x value_dim) values, which has a time complexity of O(batch_size * num_queries * num_keys * value_dim), where value_dim is the dimensionality of the values.
      • Overall Complexity: The total cost is O(batch_size * num_queries * num_keys * (d + value_dim)): linear in each dimensionality, but quadratic in sequence length through the num_queries * num_keys factor.
  2. Memory Bandwidth Requirements:
      • Input Tensors: The queries, keys, and values must be read from memory, requiring bandwidth proportional to their total size, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim)).
      • Intermediate Tensors: The attention scores and attention weights, each of size batch_size * num_queries * num_keys, must be written and read back, which adds to the memory bandwidth requirements.
      • Output Tensor: The final output tensor, the weighted sum of the values, is written back to memory, requiring bandwidth proportional to O(batch_size * num_queries * value_dim).
      • Overall Bandwidth: The total memory traffic is proportional to the sum of the input, intermediate, and output tensor sizes, i.e., O(batch_size * (num_queries * d + num_keys * d + num_keys * value_dim + num_queries * num_keys + num_queries * value_dim)). A rough estimator is sketched after this list.
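To make the scaling concrete, a rough back-of-the-envelope helper (the function name and the softmax constant are my own approximations; it counts a multiply-add as 2 FLOPs and assumes float32):

```python
def attention_flops_and_bytes(batch, num_q, num_k, d, value_dim, bytes_per_elt=4):
    """Rough FLOP and memory-traffic estimates for one dot product attention pass."""
    score_flops = 2 * batch * num_q * num_k * d                   # Q @ K^T
    softmax_flops = 5 * batch * num_q * num_k                     # exp, sum, divide (approx.)
    weighted_sum_flops = 2 * batch * num_q * num_k * value_dim    # attention_weights @ V
    flops = score_flops + softmax_flops + weighted_sum_flops

    io_elems = batch * (num_q * d + num_k * d + num_k * value_dim  # read Q, K, V
                        + 2 * num_q * num_k                        # write/read scores and weights
                        + num_q * value_dim)                       # write the output
    return flops, io_elems * bytes_per_elt

# Doubling num_k roughly doubles both FLOPs and memory traffic; doubling d only
# grows the Q @ K^T term and the reads of Q and K.
print(attention_flops_and_bytes(batch=1, num_q=512, num_k=512, d=64, value_dim=64))
```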