Deepseek Conditional Memory Engram

N-gram is coming back to life with this paper. There is already a detailed code walkthrough of Engram, but I may defer reading it to a later time. Here is a high-level overview.

0 N-gram

  • N-gram is good at catching local, fixed textual patterns.
  • But it is bad at long-distance dependencies and suffers from data sparsity issues (see the sketch below).
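To make that trade-off concrete, here is a minimal count-based n-gram sketch of my own (a toy illustration, not Engram's actual mechanism): it predicts the next token from only the previous n-1 tokens, which is exactly why it captures local patterns but cannot see long-range context and breaks on unseen contexts.

```python
from collections import defaultdict, Counter

def train_ngram(tokens, n=3):
    """Count next-token frequencies for every (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict(counts, context):
    """Most frequent continuation of a context, or None if the context
    was never seen (the data sparsity problem)."""
    dist = counts.get(tuple(context))
    return dist.most_common(1)[0][0] if dist else None

tokens = "the cat sat on the mat because the cat was tired".split()
model = train_ngram(tokens, n=3)
print(predict(model, ["the", "cat"]))   # 'sat' -- local pattern is captured
print(predict(model, ["was", "very"]))  # None -- unseen context, sparsity issue
```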

1 Over-Tokenized Transformer

This is from ByteDance's Seed team: decoupling the input and output vocabularies to improve language modeling performance. Over-Encoding + Over-Decoding (MTP) = Over-Tokenized Transformer.
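Below is a rough sketch of the decoupling idea, assuming PyTorch; the class name `OverTokenizedLM` and all sizes are my own placeholders, not the paper's implementation, and only the over-encoding (input) side is shown, with the output head left on the base vocabulary.

```python
import torch
import torch.nn as nn

class OverTokenizedLM(nn.Module):
    def __init__(self, base_vocab=32_000, input_vocab=256_000, d_model=512):
        super().__init__()
        # Over-encoding: a separate, much larger input embedding table
        # for (e.g. n-gram style) input token ids.
        self.input_embed = nn.Embedding(input_vocab, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Output head stays on the small base vocabulary,
        # so the softmax cost is unchanged.
        self.lm_head = nn.Linear(d_model, base_vocab)

    def forward(self, input_ids):          # (batch, seq) ids in the large input vocab
        h = self.backbone(self.input_embed(input_ids))
        return self.lm_head(h)             # (batch, seq, base_vocab) logits

model = OverTokenizedLM()
logits = model(torch.randint(0, 256_000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 32000])
```

The point of the split is that the input vocabulary can be scaled up aggressively (embedding lookups are cheap) without inflating the output softmax.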
