Diffusion LLM for real


Video source for dLLM

1 The progress of dLLM

  1. Token-level dLLM. Noise is added directly to the token embeddings. The drawback is that a small numerical error can lead to wrong decoding: in the example from the video, the output was meant to be “no matter”, but “matter” sits close to “topic” in embedding space, so the decoded translation came out as the entirely different “no topic”. (A minimal sketch of this failure mode follows the list.)

  2. Segment-level dLLM. An encoder compresses n tokens into a fixed-length latent of k vectors, noise is added to that latent, and an autoregressive decoder is used to decode it back into tokens. (See the sketch after this list.)

  3. Masked-word dLLM. To make the process more like image diffusion, a masked variant is used. Each word is a one-hot encoding, with one extra position reserved for a special [MASK] token. Generation then runs over several steps, with a subset of masked words getting de-masked at each step. One issue is that a word can no longer be fixed once it has been unmasked, so newer methods remask words when needed. This is, at a high level, how LLaDA works. (A sampling-loop sketch follows the list.)

  4. The overall training efficiency of dLLMs is better than AR models, because each training sequence can be masked in many different ways, instead of contributing only next-token-prediction targets as in AR training. (A sketch of this is below.)
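
To make item 1 concrete, here is a minimal NumPy sketch of the failure mode, decoding a noisy embedding back to its nearest vocabulary token. The toy vocabulary, embedding table, and perturbation size are all my own illustrative assumptions, not numbers from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and random embedding table (illustrative assumptions).
vocab = ["no", "matter", "topic", "what"]
emb = rng.normal(size=(len(vocab), 8))        # (vocab_size, dim)
emb[2] = emb[1] + 0.1 * rng.normal(size=8)    # place "topic" close to "matter"

def nearest_token(vec):
    """Decode a (possibly noisy) embedding to the nearest vocabulary token."""
    return vocab[int(np.argmin(np.linalg.norm(emb - vec, axis=1)))]

clean = emb[vocab.index("matter")]
# A perturbation just past the midpoint between the two nearby embeddings
# is enough to flip the decoded word: "no matter" becomes "no topic".
noisy = clean + 0.6 * (emb[2] - emb[1])
print(nearest_token(clean), "->", nearest_token(noisy))  # matter -> topic
```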
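
For item 2, here is a rough PyTorch sketch of the segment-level pipeline, compressing to a single latent vector rather than k of them for brevity. The linear encoder, GRU decoder, and all sizes here are stand-ins chosen for illustration, not the architecture from the video.

```python
import torch
import torch.nn as nn

class SegmentDiffusionLM(nn.Module):
    """Sketch: encode n tokens into a fixed-size latent, add noise in
    latent space, then decode autoregressively. Sizes are illustrative."""

    def __init__(self, vocab_size=1000, dim=64, n_tokens=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.Linear(n_tokens * dim, dim)   # n tokens -> one latent vector
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, tokens, noise_scale=0.1):
        b, n = tokens.shape
        z = self.encoder(self.embed(tokens).reshape(b, -1))
        z = z + noise_scale * torch.randn_like(z)       # diffusion-style noising
        # Teacher-forced AR decoding conditioned on the noisy latent
        # (torch.roll is a crude input shift; a real model would use a BOS token).
        shifted = self.embed(torch.roll(tokens, shifts=1, dims=1))
        out, _ = self.decoder(shifted, z.unsqueeze(0))  # latent as initial hidden state
        return self.lm_head(out)                        # (b, n, vocab_size) logits

logits = SegmentDiffusionLM()(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```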
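
For item 3, a sketch of the iterative unmask-and-remask sampling loop. The `predict` function is a random stand-in for the trained denoiser, and the confidence-based unmasking schedule and remasking fraction are my own assumptions in the spirit of LLaDA-style samplers, not LLaDA's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 100, 100          # token ids 0..99; id 100 is the special [MASK]

def predict(seq):
    """Stand-in for the denoiser: per-position (token, confidence) guesses."""
    return rng.integers(0, VOCAB, len(seq)), rng.random(len(seq))

def sample(length=8, steps=4, remask_frac=0.25):
    seq = np.full(length, MASK)
    for step in range(steps):
        tokens, conf = predict(seq)
        masked = seq == MASK
        # De-mask the most confident still-masked positions this step.
        budget = int(np.ceil(masked.sum() / (steps - step)))
        for i in np.argsort(-conf * masked)[:budget]:
            seq[i] = tokens[i]
        # Remask a few low-confidence committed tokens, so that earlier
        # mistakes are not frozen forever (skipped on the final step).
        committed = np.flatnonzero(~masked)
        if step < steps - 1 and committed.size:
            worst = committed[np.argsort(conf[committed])]
            seq[worst[: int(remask_frac * length)]] = MASK
        print(step, seq)
    return seq

sample()
```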
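
For item 4, a sketch of why masked training extracts more supervision from the same data: every pass over a sequence can sample a fresh mask ratio (LLaDA draws it uniformly from [0, 1], if I recall the paper correctly), so one sequence yields many different prediction problems rather than a single next-token view. The function name and view count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = 100

def masked_training_views(tokens, n_views=3):
    """One sequence -> many (input, target) training views via random masking."""
    views = []
    for _ in range(n_views):
        ratio = rng.uniform(0, 1)             # fresh mask ratio per view
        mask = rng.random(len(tokens)) < ratio
        inp = np.where(mask, MASK, tokens)    # loss is taken on masked positions only
        views.append((inp, tokens, mask))
    return views

for inp, tgt, mask in masked_training_views(np.arange(8)):
    print(inp, "-> predict positions", np.flatnonzero(mask))
```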
