ML101 - Self Attention
One more good resource for this introduction is here
What could the output be when a sequence of vectors is fed into a model? In "sequence labeling", the number of outputs equals the number of inputs (#input = #output).
A fully-connected (FC) layer can only consider a context of limited length.
Self-attention is the solution: it uses the full input sequence as context.
Self-attention can be applied directly to the input or to intermediate layers, and it can be processed in parallel!
First, obtain the Query (Q) and Key (K) matrices.
Get the attention scores by multiplying Q and K.
Apply softmax (it can be replaced by ReLU) for normalization. A masked version can be used here to achieve causality.
Use the attention scores as weights; the output is the weighted sum over the Value (V) matrix.
Matrix view of the steps above. The projection matrices $W^q$, $W^k$, $W^v$ that produce Q, K, and V are the parameters to be learned.
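A minimal NumPy sketch of these steps in matrix form; the variable names, the $1/\sqrt{d_k}$ scaling from the Transformer paper, and the `causal` flag for the masked variant are illustrative assumptions, not details from the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Single-head self-attention over a sequence X of shape (L, d).

    Wq, Wk, Wv are the learnable projection matrices (names are placeholders).
    """
    Q = X @ Wq                          # queries, (L, d_k)
    K = X @ Wk                          # keys,    (L, d_k)
    V = X @ Wv                          # values,  (L, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (L, L) attention scores
    if causal:
        # mask future positions so position i only attends to positions <= i
        L = scores.shape[0]
        future = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    A = softmax(scores, axis=-1)        # attention weights, each row sums to 1
    return A @ V                        # weighted sum of values, (L, d_v)

# usage: 4 tokens of dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv, causal=True).shape)   # (4, 8)
```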
A single head limits the ability of self-attention to focus on multiple positions within the sequence: the probability distribution can easily be dominated by one (or a few) words. Now let's expand it to multi-head attention. Use the previous results as $b^{i,1}$.
Then use another set of matrices to get $b^{i,2}$.
Concatenate the multi-head results and multiply by a matrix to get the final output $b^i$. Because each attention head outputs token vectors of dimension $d/H$, the concatenated output of all $H$ attention heads has dimension $d$.
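A sketch of the multi-head version under the same assumptions: each head works in dimension $d/H$, and `Wo` stands for the extra matrix applied after concatenation (the name is illustrative).

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention; X has shape (L, d) with d divisible by num_heads."""
    L, d = X.shape
    d_h = d // num_heads                                  # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split the last dimension into heads: (num_heads, L, d_h)
    split = lambda M: M.reshape(L, num_heads, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)    # (H, L, L)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                    # softmax within each head
    heads = A @ Vh                                        # (H, L, d_h)
    concat = heads.transpose(1, 0, 2).reshape(L, d)       # concatenate heads -> (L, d)
    return concat @ Wo                                    # final output projection

rng = np.random.default_rng(1)
L, d, H = 5, 16, 4
X = rng.normal(size=(L, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, H).shape)   # (5, 16)
```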
Add a position vector created by positional encoding. The sinusoidal ($\sin$) encoding is hand-crafted; positional encodings can also be learned.
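A sketch of the hand-crafted sinusoidal encoding; the $10000$ base follows the original Transformer formula, and a learned alternative would simply be a trainable table of position vectors.

```python
import numpy as np

def sinusoidal_positional_encoding(L, d):
    """Sin/cos positional encoding of shape (L, d); d is assumed even."""
    pos = np.arange(L)[:, None]                     # positions 0 .. L-1
    i = np.arange(d // 2)[None, :]                  # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions:  cos
    return pe

# the position vector is simply added to each input token vector
X = np.zeros((4, 8))
X_with_pos = X + sinusoidal_positional_encoding(4, 8)
```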
For speech, the vector sequence can be VERY long: 1 second of signal is about 100 vectors, and the complexity is quadratic in the sequence length.
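As a rough worked example based on the 100-vectors-per-second figure above: a 10-second utterance gives $L \approx 1000$ vectors, so the $L \times L$ attention matrix already has about $10^6$ entries, and doubling the length quadruples that cost.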
An image can be seen as a long vector sequence as well.
CNN is a simplified form of self-attention (SA) with a limited receptive field.
SA is a CNN with a learnable receptive field.
SA needs more data to train than CNN (results from the ViT paper, "An Image is Worth 16x16 Words").
RNN cannot be processed in parallel and easily forgets early inputs.
Use SA on a graph by only considering connected edges, i.e. attention is computed only between nodes joined by an edge; this is one type of GNN.
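A sketch of this graph-restricted variant, masking unconnected node pairs before the softmax; the self-loops in the adjacency matrix are an added assumption so that every node attends at least to itself.

```python
import numpy as np

def graph_self_attention(X, adj, Wq, Wk, Wv):
    """Self-attention restricted to a graph: node i attends to node j only
    when adj[i, j] == 1. Variable names are illustrative."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(adj.astype(bool), scores, -np.inf)   # drop non-edges
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                      # softmax over neighbours only
    return A @ V

# 3-node path graph 0-1-2, with self-loops
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(graph_self_attention(X, adj, Wq, Wk, Wv).shape)       # (3, 4)
```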