ML101 - Self Attention

One more good resource for this introduction is here

What the output could be for a vector sequence fed to a model: in "sequence labeling", the number of outputs equals the number of inputs (#input = #output).

A fully connected (FC) layer can only consider a context of limited length.

Self-attention is the solution: it uses the full input sequence as context.

Self-attention can be applied directly to the input or to intermediate layers, and it can be processed in parallel.

First, compute the Query and Key matrices from the input.
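
A minimal sketch of this step (NumPy, with made-up dimensions and random matrices standing in for the learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))     # five input vectors a^1..a^5, each of dimension 8
W_q = rng.normal(size=(8, 4))   # query projection (learned in practice, random here)
W_k = rng.normal(size=(8, 4))   # key projection

Q = a @ W_q   # row i is the query vector q^i
K = a @ W_k   # row i is the key vector k^i
```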

Get the attention scores by multiplying Q and K.
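
A tiny sketch of the score computation (toy numbers, assuming row i of `Q`/`K` is $q^i$/$k^i$). Note that the scores come from Q and K; V only enters later, in the weighted sum.

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy query vectors
K = np.array([[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]])   # toy key vectors

# alpha[i, j] = q^i . k^j, the attention score of position i on position j
alpha = Q @ K.T   # shape (3, 3)
```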

Apply a softmax (which can be replaced by ReLU) and normalization. A masked version can be used here to achieve causality.
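
A sketch of this normalization step (NumPy; the $1/\sqrt{d_k}$ scaling follows the common scaled dot-product formulation, and the mask shown is the causal one):

```python
import numpy as np

alpha = np.array([[2.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 1.0]])      # raw scores from the previous step
d_k = 2
alpha = alpha / np.sqrt(d_k)             # scale before the softmax

# causal mask: position i may only attend to positions j <= i
mask = np.triu(np.ones_like(alpha, dtype=bool), k=1)
alpha = np.where(mask, -np.inf, alpha)

# row-wise softmax so each row of weights sums to 1
alpha_prime = np.exp(alpha - alpha.max(axis=-1, keepdims=True))
alpha_prime = alpha_prime / alpha_prime.sum(axis=-1, keepdims=True)
```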

Use the attention scores as weights; the output is the weighted sum over the V (value) matrix.
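
And the final step as a sketch (toy weights and values):

```python
import numpy as np

alpha_prime = np.array([[0.7, 0.2, 0.1],   # normalized attention weights
                        [0.1, 0.8, 0.1],   # (each row sums to 1)
                        [0.3, 0.3, 0.4]])
V = np.array([[1.0, 0.0],                  # value vectors v^1..v^3
              [0.0, 1.0],
              [2.0, 2.0]])

# b^i is the sum of all value vectors weighted by row i of alpha'
B = alpha_prime @ V
```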

Matrix view of the steps above. The Q/K/V projection matrices are the parameters to be learned.
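
Putting the steps together, here is a minimal single-head self-attention in matrix form (a NumPy sketch; the function name, dimensions, and the $1/\sqrt{d_k}$ scaling are illustrative choices, not from the lecture):

```python
import numpy as np

def self_attention(A, W_q, W_k, W_v):
    # A: (n, d_in) input vectors; W_q, W_k, W_v are the learned parameters
    Q, K, V = A @ W_q, A @ W_k, A @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])                 # attention scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)      # row-wise softmax
    return weights @ V                                      # weighted sum over values

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(8, 4))
W_v = rng.normal(size=(8, 4))
B = self_attention(A, W_q, W_k, W_v)   # one output vector b^i per input vector
```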

A single head limits the ability of self-attention to focus on multiple positions within the sequence, because the probability distribution can easily be dominated by one (or a few) words. Now let's expand it to multi-head attention. Use the previous results as $b^{i,1}$.

And use another set of matrices to get $b^{i,2}$.

Concatenate the multi-head results and multiply by a matrix to get the final output $b^i$. Because each attention head outputs token vectors of dimension $d / H$, the concatenated output of all attention heads has dimension $d$.
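
A sketch of multi-head attention under the same assumptions ($H$ heads, each of dimension $d / H$, with a final matrix `W_o` mixing the concatenated heads; all names and sizes are illustrative):

```python
import numpy as np

def multi_head_attention(A, W_q, W_k, W_v, W_o, H):
    # A: (n, d_model) inputs; W_q/W_k/W_v/W_o: (d_model, d_model); H: number of heads
    n, d_model = A.shape
    d_head = d_model // H                        # each head works in dimension d / H
    Q, K, V = A @ W_q, A @ W_k, A @ W_v
    # split the last dimension into H heads: shape (H, n, d_head)
    split = lambda X: X.reshape(n, H, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)             # softmax per head, per row
    heads = w @ Vh                               # (H, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads -> (n, d)
    return concat @ W_o                          # final mixing matrix gives the rows b^i

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
B = multi_head_attention(A, W_q, W_k, W_v, W_o, H=2)   # shape (6, 8)
```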

Add a position vector created by positional encoding. The $\sin$ encoding is hand-crafted; the positional encoding can also be learned.
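
A sketch of the hand-crafted sinusoidal encoding (dimensions are illustrative; the 10000 base follows the standard Transformer formulation):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # returns an (n_positions, d_model) matrix; pe[i] is added to input vector a^i
    pos = np.arange(n_positions)[:, None]        # positions 0..n-1 as a column
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                  # sin on even dimensions
    pe[:, 1::2] = np.cos(angle)                  # cos on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=8)
```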

For speech, the vector sequence can be very long: one second of signal is 100 vectors, and the complexity is quadratic in the sequence length (a 10-second utterance already means $1000 \times 1000$ attention scores).

An image can be seen as a long vector sequence as well.

A CNN is simplified self-attention (SA) with a limited receptive field;
SA is a CNN with a learnable receptive field.

SA needs more data to train than a CNN (results from the ViT paper, "An Image is Worth 16x16 Words").

An RNN cannot be processed in parallel and easily forgets early inputs.

Use SA on a graph, considering only the connected edges; this is one type of GNN.
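
A sketch of this idea (NumPy; `adjacency[i, j]` marks an edge, and scores on non-edges are set to negative infinity before the softmax, so each node only attends to its neighbours; self-loops are included so every row has at least one finite score):

```python
import numpy as np

def graph_attention(A, adjacency, W_q, W_k, W_v):
    # self-attention restricted to graph edges: node i only attends to j where adjacency[i, j] is True
    Q, K, V = A @ W_q, A @ W_k, A @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(adjacency, scores, -np.inf)   # drop non-edges before the softmax
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))                         # four node vectors
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
adjacency = np.array([[1, 1, 0, 0],                 # small chain graph with self-loops
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 1]], dtype=bool)
B = graph_attention(A, adjacency, W_q, W_k, W_v)
```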
