
  1. Self-attention loses rank doubly exponentially with depth at initialization (i.e., the deeper the network, the more similar the token outputs become; a toy check is sketched after this list).
  2. Skip connections and MLPs mitigate this rank collapse.
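
A minimal NumPy sketch of claim 1, under my own toy assumptions (random weights, single head, no skip connections, no MLP); the shapes, depths, and the rank-1 residual measure are illustrative choices, not the paper's setup. Qualitatively, the relative distance of the token matrix to its nearest "all rows identical" (rank-1) matrix should shrink quickly as depth grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 8, 16                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_tokens, d))       # token representations at initialization

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_only_layer(H, Wq, Wk, Wv):
    # single-head self-attention with no skip connection and no MLP
    A = softmax((H @ Wq) @ (H @ Wk).T / np.sqrt(d))
    return A @ (H @ Wv)

def rank1_residual(H):
    # relative distance to the nearest matrix whose rows are all identical
    return np.linalg.norm(H - H.mean(axis=0, keepdims=True)) / np.linalg.norm(H)

for depth in [1, 2, 4, 8]:
    H = X.copy()
    for _ in range(depth):
        Wq, Wk, Wv = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
        H = attention_only_layer(H, Wq, Wk, Wv)
    print(f"depth {depth}: residual to rank-1 = {rank1_residual(H):.2e}")
```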

Why:

Convex combination

$V'_i=\sum_{n}w_nV_n$, where $w_n\ge0$ and $\sum_{n}w_n=1$

As illustrated below, any convex combination of the input points must lie inside their convex hull.
![[Pasted image 20230904205822.png|575]]
Moreover, the largest angle spanned by the convex combinations must be equal to or less than the largest angle spanned by the original input points (features), assuming the input points do not contain the origin.
![[Pasted image 20230904210239.png|575]]
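
A quick numerical check of the two facts above, using assumed toy 2D points (not taken from any figure): the mixtures are inside the hull by construction, and their largest pairwise angle from the origin never exceeds that of the original points:

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0.5, 2.0, size=(5, 2))     # 2D input points away from the origin

def random_convex_weights(k, n):
    # k weight vectors: non-negative entries that sum to 1 (convex-combination weights)
    w = rng.uniform(size=(k, n))
    return w / w.sum(axis=1, keepdims=True)

def max_pairwise_angle_deg(P):
    # largest angle (seen from the origin) between any two points in P
    U = P / np.linalg.norm(P, axis=1, keepdims=True)
    cos = np.clip(U @ U.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos)).max()

mixed = random_convex_weights(200, len(points)) @ points   # 200 convex combinations

print("max angle among inputs  :", max_pairwise_angle_deg(points))
print("max angle among mixtures:", max_pairwise_angle_deg(mixed))  # never exceeds the line above
```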

[[AI/otherTech/humanRepresentation/action_detection/TriDet#The second question Attention in Temporal Dimension.|Self-attention is a convex combination]]
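
The link back to self-attention, as a tiny sketch with assumed random scores: each row of the softmaxed attention matrix is non-negative and sums to 1, so each output row is exactly $\sum_{n}w_nV_n$, a convex combination of the value vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d = 4, 3
scores = rng.normal(size=(n_tokens, n_tokens))                   # stand-in for Q K^T / sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax

# each row of A is a valid set of convex-combination weights
print(np.all(A >= 0), np.allclose(A.sum(axis=1), 1.0))           # True True

V = rng.normal(size=(n_tokens, d))                               # value vectors V_n
out = A @ V   # row i equals sum_n A[i, n] * V[n]: a convex combination of the rows of V
```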

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth