BERT


I think I need to review BERT more closely to understand the encoder structure for dLLM, so I checked out this video.

0 Encoder

It’s called an encoder structure because it outputs an embedding code, like a CAE. The embeddings help cluster not just words and sentences, but whole documents.
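As a concrete illustration of "outputs an embedding code," here is a minimal sketch of pooling a BERT encoder's outputs into one vector per text, which could then be clustered. It assumes the Hugging Face `transformers` library and mean pooling; the post itself doesn't name a library or a pooling method.

```python
# Minimal sketch: embed sentences/documents with a BERT encoder for clustering.
# Assumes Hugging Face `transformers`; mean pooling is an illustrative choice.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

texts = ["BERT is an encoder-only model.", "The cat sat on the mat."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# Mean-pool the token vectors (ignoring padding) to get one embedding per text.
mask = batch["attention_mask"].unsqueeze(-1)             # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
embeddings = summed / mask.sum(dim=1)                     # (batch, hidden)
print(embeddings.shape)  # torch.Size([2, 768])
```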

Now let’s take a closer look at the Transformer encoder architecture used in BERT (video).

1 BERT Pre-Training

The pre-training stage uses two tasks.

  1. Masked LM: fill in the blanks.
  2. Next Sentence Prediction: decide whether two sentences follow each other.

The input is the sum of three embeddings:

  1. Token embedding, from a WordPiece vocabulary of 30k tokens.
  2. Segment embedding, A or B.
  3. Position embedding.

The output vector for each masked word is mapped to 30k neurons (the token vocabulary size) and compared with the one-hot encoding of the true token for the loss calculation (see the sketch below).
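Here is a rough sketch of the embedding sum and the masked-LM loss. The sizes match bert-base (30522-token WordPiece vocab, hidden size 768), but the encoder itself is omitted and, unlike real BERT, every position is scored rather than only the masked ones.

```python
# Sketch of BERT's input embedding sum and the masked-LM loss (encoder omitted).
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)    # WordPiece token embedding
segment_emb = nn.Embedding(2, hidden)           # segment A (0) or B (1)
position_emb = nn.Embedding(max_len, hidden)    # learned position embedding

def bert_input(token_ids, segment_ids):
    # The encoder input is the element-wise sum of the three embeddings.
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

# Masked-LM head: map each output vector to vocab_size logits and compare with
# the true token id (equivalent to the one-hot target) via cross-entropy.
mlm_head = nn.Linear(hidden, vocab_size)

batch, seq_len = 2, 16
token_ids = torch.randint(0, vocab_size, (batch, seq_len))
segment_ids = torch.zeros(batch, seq_len, dtype=torch.long)

hidden_states = bert_input(token_ids, segment_ids)  # stand-in for the encoder output
logits = mlm_head(hidden_states)                    # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), token_ids.view(-1))
print(loss.item())
```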

2 BERT Fine Tuning

  1. The fine-tuning example is a question-answering task.
  2. Only the output layer is fine-tuned, so the process is fast.
  3. The input is a question plus a passage that contains the answer.
  4. The output is the start/end tokens of the answer span within the passage (see the sketch below).
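A minimal sketch of the QA setup, assuming Hugging Face `transformers`: the question and passage are packed into one input (segments A and B), and a small output layer scores every token as a possible start or end of the answer. The QA head here is freshly initialized, not trained, so the decoded span is meaningless until fine-tuning.

```python
# Sketch of BERT fine-tuning for QA: predict start/end positions of the answer.
# Assumes Hugging Face `transformers`; the QA head below is untrained.
import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Who wrote BERT?"
passage = "BERT was introduced by researchers at Google in 2018."
# Segment A holds the question, segment B holds the passage.
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Each token gets a start score and an end score; the answer is the best span.
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
answer_tokens = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_tokens))
```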
