1. Suppose that we design a deep architecture to represent a sequence by stacking self-attention layers with positional encoding. What could the possible issues be?
- Excessive computational complexity: the cost of self-attention grows quadratically with the sequence length, so the amount of computation rises sharply as sequences get longer. For long sequences in practice this makes both training and inference inefficient (a rough cost estimate is sketched after this list).
- Training difficulty: because the stacked architecture is deep, training can suffer from vanishing or exploding gradients, which may slow convergence and hurt final performance.
- Poor generalization: an overly complex architecture may overfit the training data and fail to generalize well to new data, lowering performance.
- Poor interpretability: with many stacked attention layers it becomes hard to explain which positions the model actually relies on.
- Parameter redundancy: stacking too many self-attention layers adds parameters that may bring little additional benefit beyond a certain depth.
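To make the first point concrete, here is a rough back-of-the-envelope sketch (the layer count and dimensions below are illustrative numbers I chose, not values from this article): the attention score matrix alone is $n \times n$, so doubling the sequence length roughly quadruples the cost of every layer.

```python
def attention_flops(seq_len, d_model, num_layers):
    """Very rough multiply-add count for the attention part of a stack:
    each layer forms a (seq_len x seq_len) score matrix from queries and
    keys (~seq_len^2 * d_model ops) and applies it to the values
    (another ~seq_len^2 * d_model ops)."""
    return num_layers * 2 * seq_len ** 2 * d_model


# Doubling the sequence length roughly quadruples the attention cost.
for n in (512, 1024, 2048):
    print(n, f"{attention_flops(n, d_model=512, num_layers=12):.3e}")
```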
2. Can you design a learnable positional encoding method?
Learnable positional encoding method
Let $\mathbf{X} \in \mathbb{R}^{n \times d}$ be the input sequence of length $n$, and $d$ be the dimension of the input embeddings. The learnable positional encoding can be defined as:

$$\mathbf{p}_i = \mathbf{W}_{\text{pos}}[i], \qquad \mathbf{x}'_i = \mathbf{x}_i + \mathbf{p}_i,$$

where $\mathbf{W}_{\text{pos}} \in \mathbb{R}^{L_{\max} \times d}$ and $L_{\max}$ (the maximum sequence length) are the learnable parameters and a design choice respectively, and $i$ represents the position in the sequence.
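A minimal PyTorch sketch of this idea is given below; the class name, the `max_len` default, and the dropout layer are my own assumptions for illustration, not details taken from the article. The module simply looks up one learnable $d$-dimensional vector per position and adds it to the token embeddings.

```python
import torch
from torch import nn


class LearnablePositionalEncoding(nn.Module):
    """Adds a learnable position embedding to the input sequence.

    A minimal sketch: the name, max_len, and dropout are assumptions."""

    def __init__(self, d_model, max_len=512, dropout=0.1):
        super().__init__()
        # One learnable d_model-dimensional vector per position.
        self.pos_embedding = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)  # (seq_len,)
        # Broadcast the (seq_len, d_model) position table over the batch.
        return self.dropout(x + self.pos_embedding(positions))


# Usage example (shapes only; the numbers are illustrative).
x = torch.randn(2, 10, 64)           # 2 sequences, length 10, dim 64
pe = LearnablePositionalEncoding(64)
print(pe(x).shape)                   # torch.Size([2, 10, 64])
```

Unlike the fixed sinusoidal encoding, this table is updated by backpropagation, but it cannot represent positions beyond the `max_len` chosen at construction time.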
- Author: tom-ci
- URL: https://www.tomciheng.com//article/11-5-self-attention
- Copyright: Unless otherwise stated, all articles in this blog are licensed under the BY-NC-SA agreement. Please indicate the source when reposting!