Attention Visualization: Preliminaries | Qianming Blog

Attention Visualization: Preliminaries

Qianming Huang Three

2025-05-03 08:40:18 2025-07-17 10:08:37

Graph Theory Prerequisites

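Degree: the number of edges connected to a node.
- For a directed graph:
- Out-degree: the number of edges leaving the node.
- In-degree: the number of edges entering the node.
Adjacency matrix:
- Each entry: (a, b) = 1 means there is an edge from a to b.
- Sum of each row: the total out-degree of the row's node (a).
- Sum of each column: the total in-degree of the column's node (b).
- Matrix powers: the number of times the matrix is multiplied by itself gives the path length, and the value at (a, b) is the number of paths of that length from a to b.
  
  Note
  When the adjacency matrix holds weights instead of 0/1 entries, its powers no longer have this counting interpretation.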
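The properties above can be checked on a small example. The following is a minimal sketch using NumPy; the 3-node graph and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0.
# A[a, b] = 1 means there is an edge from a to b.
A = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])

out_degree = A.sum(axis=1)  # row sums    -> [2, 1, 1]
in_degree = A.sum(axis=0)   # column sums -> [1, 1, 2]

# A2[a, b] is the number of length-2 paths from a to b,
# e.g. A2[0, 2] == 1 because of the single path 0 -> 1 -> 2.
A2 = np.linalg.matrix_power(A, 2)

# Weighted variant: normalize each row so the weights sum to 1.
# Entries of W @ W are sums of products of weights along length-2 paths,
# so they are no longer path counts.
W = A / A.sum(axis=1, keepdims=True)
W2 = W @ W
```

Here `W2[0, 2]` is 0.5 even though exactly one length-2 path exists from 0 to 2, which illustrates why the counting interpretation is lost once weights are used.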

Paper References

Quantifying Attention Flow in Transformers

Since in Transformer decoder, future tokens are masked, naturally there is more attention toward initial tokens in the input sequence, and both attention rollout and attention flow will be biased toward these tokens. Hence, to apply these methods on a Transformer decoder, we should first normalize based on the receptive field of attention.

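Problem: with masking, this method may break down.
Solution: normalize based on the receptive field.
Attention matrix: the transpose of the adjacency matrix.
To address the method's own limitations, the paper suggests:
- Use effective attention weights in place of raw attention ( $A = 0.5 W_{att} + 0.5 I$ ) in the recursive computation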
  
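  Effective attention weights aim to isolate the part of raw attention that actually influences the output, removing the bias that raw attention introduces.
  - Origin: when the number of input tokens is larger than the attention head dimension, the matrix $T$ that the attention matrix $A$ multiplies has a non-trivial left null space, which means $A$ is not unique. When the number of input tokens is less than or equal to the head dimension, effective attention weights = raw attention.
  - Core idea: keep the part that actually affects the output; the remainder is exactly the part that makes $A$ non-unique (similar to denoising).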
    $A \cdot T = (A + \bar{A}) \cdot T$
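  - Preliminary derivation of the algorithm:
    
    First construct $T$:
    
    $T = E W^{V} H$ , where:
    - $E$ is the input embedding matrix, of size $760 \times 512$ ;
    - $W^{V}$ is the value projection matrix, of size $512 \times d_{v}$ , assuming $d_{v} = 64$ in multi-head attention;
    - $H$ is the head mixing matrix, of size $d_{v} \times 512$ .
      $\operatorname{rank}(T) \leq \min(d_{s}, d_{v}) = 64$ , so the left null space of $T$ has dimension at least $d_{s} - d_{v} = 760 - 64 = 696$ .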
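    $A^{\perp} = A - A^{\parallel}$
    
    where $A^{\perp}$ is the final result (the effective attention) and $A^{\parallel}$ is the component of $A$ determined by the condition $A^{\parallel} T = 0$ .
- Compute attributions from gradients instead -> Gradient-Based Attribution Methods | SpringerLink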
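The derivation above can be sketched numerically. The code below is an illustrative sketch, not the papers' reference implementation: it assumes $A^{\parallel}$ is the row-wise projection of $A$ onto the left null space of $T$ (computed here via SVD), uses small made-up sizes instead of $760 \times 512$, and the function names `effective_attention`, `attention_rollout`, and `random_attention` are hypothetical. The rollout function applies the recursion with $A = 0.5 W_{att} + 0.5 I$ to a list of per-layer attention matrices.

```python
import numpy as np

def effective_attention(A, T, tol=1e-10):
    """Return A_perp = A - A_par, where A_par @ T = 0."""
    # Full SVD: the columns of U beyond rank(T) span the left null space of T.
    U, s, _ = np.linalg.svd(T, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))
    U_null = U[:, rank:]
    # Project every row of A onto the left null space of T.
    A_par = A @ U_null @ U_null.T
    return A - A_par

def attention_rollout(attentions):
    """attentions: list of (d_s, d_s) matrices, first layer first."""
    d_s = attentions[0].shape[0]
    rollout = np.eye(d_s)
    for W_att in attentions:
        A_layer = 0.5 * W_att + 0.5 * np.eye(d_s)  # account for the residual connection
        rollout = A_layer @ rollout
    return rollout

def random_attention(rng, d_s):
    """Random row-stochastic matrix standing in for real attention weights."""
    logits = rng.normal(size=(d_s, d_s))
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d_s, d, d_v = 12, 8, 4            # toy sizes with d_s > d_v
E = rng.normal(size=(d_s, d))     # input embeddings
W_V = rng.normal(size=(d, d_v))   # value projection
H = rng.normal(size=(d_v, d))     # head mixing matrix
T = E @ W_V @ H                   # rank(T) <= d_v

A = random_attention(rng, d_s)
A_eff = effective_attention(A, T)
rollout = attention_rollout([random_attention(rng, d_s) for _ in range(3)])
```

In this sketch `A_eff @ T` matches `A @ T` up to floating-point error even though `A_eff` differs from `A`, which is the non-uniqueness described above. To use effective attention in the rollout, one would pass the per-layer `effective_attention` outputs to `attention_rollout` in place of the raw matrices.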

ON IDENTIFIABILITY IN TRANSFORMERS

query: $Q \in R^{d_{s} \times d_{q}}$

key: $K \in R^{d_{s} \times d_{q}}$

value: $V \in R^{d_{s} \times d_{v}}$

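The output of Equation 1 is called the contextual word embedding.

$x_{i} \in R^{d}$ denotes an input token; $e_{i}^{l}$ is the $i$-th row of the contextual word embedding output by layer $l$.

The $x_{i}$ together form the input matrix $X \in R^{d_{s} \times d}$ ; the $e_{i}^{l}$ together form the embedding matrix $E \in R^{d_{s} \times d}$ .

$H \in R^{d_{v} \times d}$ . The left-hand expression in Equation 2 means that each head has its own set of Q/K/V, and each head's output has size $d_{s} \times d_{v}$ , where $d_{v} = \frac{d}{h}$ and $h$ is the number of heads.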
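The shapes above can be traced in a short sketch. This is an illustration under stated assumptions, not the paper's code: it assumes Equation 1 is standard scaled dot-product attention, sets $d_{q} = d_{v}$, gives each head its own $H$ of size $d_{v} \times d$, and uses made-up sizes and random weights. Summing the per-head products is algebraically the same as concatenating the heads and multiplying by the stacked $H$ matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d, h = 6, 16, 4      # hypothetical: 6 tokens, model width 16, 4 heads
d_v = d // h              # d_v = d / h
d_q = d_v                 # assumption: query/key width equals value width

X = rng.normal(size=(d_s, d))   # input matrix, one row per token

head_outputs, H_blocks = [], []
for _ in range(h):
    # Each head has its own projections for Q, K, V and its own mixing block H.
    W_Q = rng.normal(size=(d, d_q))
    W_K = rng.normal(size=(d, d_q))
    W_V = rng.normal(size=(d, d_v))
    H_i = rng.normal(size=(d_v, d))

    Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (d_s, d_q), (d_s, d_q), (d_s, d_v)
    scores = Q @ K.T / np.sqrt(d_q)              # (d_s, d_s)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # each row sums to 1

    head_outputs.append(A @ V)                   # (d_s, d_v)
    H_blocks.append(H_i)

# Mix the heads back to width d: sum of per-head products ...
out = sum(head @ H_i for head, H_i in zip(head_outputs, H_blocks))
# ... which equals concat(heads) @ stacked H.
out_concat = np.concatenate(head_outputs, axis=1) @ np.vstack(H_blocks)
```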
