Engram - The history


EZ encoder gives a review of Engram

0 Overview

  1. The FFN layer of a Transformer is essentially a key-value memory
  2. Sparsity is the key to breaking the "impossible triangle" of Performance / Compute / Model Size
  3. N-gram retrieval in O(1) via hash-table lookup
  4. Engram lets the model "remember" common knowledge without having to "compute" it
  5. DeepSeek's work is a synthesis of this earlier line of research

1 Memory branch

Engram is actually a term coined by Richard Semon, who introduced both engram (memory trace) and ecphory, and you can actually see engram cells in this Science paper. Let's take a look at papers on LLM memories.

  1. Facebook 2019 - Language Models as Knowledge Bases?
    • KNN was used as a KB in traditional ML; this paper argues that an LM itself is effectively a KB
  2. Google 2020 - T5 as a Knowledge Base
    • The model can "look up information" stored in its parameters
  3. 2021 - Transformer FFN Layers Are Key-Value Memories
    • The critical paper that frames the FFN as a key-value memory.
    • $\mathrm{FFN}(x) = f(xK^T)V$ is a confusing formula at first glance, isn't it just self-attention? (a small numerical sketch follows this list)
    • Multiplying the input by the keys produces the memory coefficients that weight the values
  4. Microsoft 2022 - Knowledge Neurons
    • The active neurons that store knowledge are referred to as knowledge neurons (KN)
    • $\mathrm{FFN}(H)=\mathrm{GELU}(HW_1)W_2$, where the hidden state $H$ is the concatenation of all attention-head outputs
    • $\mathrm{SelfAttn}(x)=\mathrm{softmax}(QK^T)V$; seen side by side with the FFN formula above, the key-value-memory view makes more sense
  5. The notion of "concept depth" was introduced in the following paper
  6. Facebook 2019 - Product Key Memory (PKM)
    • k-nearest-neighbour search is used to find the matching keys for each query
    • The product-key trick splits each key into two halves because the full key set is too large (also sketched after this list)
  7. Meta - Memory Layers scaled to 128B parameters
    • Memory layers are shown to be scalable
    • A gating mechanism is added on top of PKM for better control
  8. DeepMind 2022 - Retrieval-Enhanced Transformer (RETRO)
    • Instead of storing knowledge in parameters, external documents serve as the knowledge store, similar to RAG
    • Chunked Cross-Attention (CCA) was introduced here, similar in spirit to the N-gram lookup in Engram
  9. Google 2023 - External Memory improves model capability
    • Additional parameters for the input representation (N-grams)
    • Expert functions (MoE)
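
To see why the FFN formula in item 3 resembles attention, here is a minimal numpy sketch (dimensions and variable names are my own, purely illustrative): the first FFN weight matrix plays the role of keys, the second plays the role of values, and the non-linearity replaces the softmax.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_ff, seq_len = 8, 32, 4
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_len, d_model))   # token representations

# FFN viewed as a key-value memory (the Geva et al. reading, simplified):
#   FFN(x) = f(x K^T) V
# K: d_ff "keys" (rows of W1 transposed), V: d_ff "values" (rows of W2)
K = rng.normal(size=(d_ff, d_model))
V = rng.normal(size=(d_ff, d_model))
mem_coeff = relu(x @ K.T)                 # memory coefficients, (seq_len, d_ff)
ffn_out = mem_coeff @ V                   # weighted sum of value vectors

# Self-attention for comparison: Attn(x) = softmax(Q K^T / sqrt(d)) V.
# Same "match keys, weight values" pattern, but here keys/values come from
# the input tokens, not from learned parameter rows.
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))
Q, Ka, Va = x @ Wq, x @ Wk, x @ Wv
attn_out = softmax(Q @ Ka.T / np.sqrt(d_model)) @ Va

print(ffn_out.shape, attn_out.shape)      # both (4, 8)
```

The important difference: attention's keys and values are computed from the current context, while the FFN's keys and values are fixed parameters, i.e. a memory written during training.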
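
And a toy version of the product-key trick from item 6 (again my own simplification, with invented sizes): the query is split in half, each half is scored against a small codebook of sub-keys, and the Cartesian product of the top candidates recovers the best full keys without ever materializing all of them.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                       # key dimension, split into two halves of d // 2
n_sub = 32                   # sub-keys per half -> n_sub**2 = 1024 full keys
k = 4                        # memory slots to retrieve per query

# Two small codebooks; the implicit full key set is their Cartesian product.
sub_keys_1 = rng.normal(size=(n_sub, d // 2))
sub_keys_2 = rng.normal(size=(n_sub, d // 2))
values = rng.normal(size=(n_sub * n_sub, 64))   # one value per full key

def pkm_lookup(query):
    q1, q2 = query[: d // 2], query[d // 2:]
    s1 = sub_keys_1 @ q1                 # scores against first half-keys
    s2 = sub_keys_2 @ q2                 # scores against second half-keys
    top1 = np.argsort(-s1)[:k]           # top-k candidates per half
    top2 = np.argsort(-s2)[:k]
    # Full-key scores factorize: score(i, j) = s1[i] + s2[j], so the global
    # top-k lives inside this k x k candidate grid.
    cand_scores = s1[top1][:, None] + s2[top2][None, :]
    flat = np.argsort(-cand_scores.ravel())[:k]
    i, j = np.unravel_index(flat, (k, k))
    idx = top1[i] * n_sub + top2[j]      # indices into the full key set
    w = np.exp(cand_scores.ravel()[flat])
    w /= w.sum()                         # softmax weights over the k slots
    return w @ values[idx]               # sparse weighted read of the values

out = pkm_lookup(rng.normal(size=d))
print(out.shape)                         # (64,)
```

So each query touches only 2·√N sub-keys plus k values, instead of all N keys.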

2 N-Gram branch

The original bag-of-words representation loses word order ("the dog bit the man" and "the man bit the dog" look identical), and N-grams can be used to recover some of it

  1. Google 2022 - N-Grammer: augmenting the Transformer with N-gram embeddings
    • A codebook of IDs is used to shrink the effective vocabulary, which would otherwise be $V^2$ for 2-grams.
    • k-means clustering is used to build the codebook and reduce the dimensionality. (sketched after this list)
  2. Google 2025 - Scaling Embedding Layers
    • Introduces f-grams, i.e. frequently occurring n-grams
    • The embedding table can be precomputed and offloaded to disk!
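
A rough sketch of the N-gram-embedding idea behind the two papers above (the hashing scheme and all sizes here are placeholders, not taken from either paper): since there are $V^2$ possible bigrams, each bigram ID is mapped into a fixed-size table, and the retrieved bigram embedding is concatenated to the ordinary token embedding.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size    = 50_000
ngram_buckets = 200_000        # fixed table size << vocab_size**2 bigrams
d_tok, d_ngram = 64, 16

tok_emb   = rng.normal(size=(vocab_size, d_tok))
ngram_emb = rng.normal(size=(ngram_buckets, d_ngram))

def bigram_id(prev_tok, tok):
    # Hash the (previous, current) token pair into a bucket. A simple
    # multiplicative hash stands in for N-Grammer's codebook built over
    # k-means-clustered unigram embeddings.
    return (prev_tok * 1_000_003 + tok) % ngram_buckets

def embed(token_ids):
    """Per-token embeddings augmented with hashed bigram embeddings."""
    outs = []
    prev = 0                               # pad id for the first position
    for tok in token_ids:
        bg = bigram_id(prev, tok)
        outs.append(np.concatenate([tok_emb[tok], ngram_emb[bg]]))
        prev = tok
    return np.stack(outs)                  # (seq_len, d_tok + d_ngram)

x = embed([17, 9021, 433, 9021, 17])
print(x.shape)                             # (5, 80)
```

Because the lookup depends only on token IDs, the whole table can in principle be precomputed and served from disk, which is the point of the f-gram paper.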

3 Engram paper

The Engram paper is easy to understand now, with all of the background above

  • A lookup hash table for 2-grams and 3-grams
  • Multiple hash functions are used to mitigate collisions
  • A gating mechanism, similar to Q/K/V memory matching followed by a weighted sum (sketched below)
  • Tokenizer compression is applied (e.g. merging upper/lower-case variants)
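
Putting the bullets together, here is how I picture the memory path, as a hedged sketch with invented names and sizes (not DeepSeek's actual implementation): hash the trailing 2-gram and 3-gram with a few independent hash functions into embedding tables, sum the retrieved vectors, and mix the result into the hidden state through a learned gate.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model    = 64
table_size = 1 << 14            # tiny for the demo; real tables are far larger
num_hashes = 2                  # several independent hashes soften collisions

# One embedding table per (n-gram order, hash function). Illustrative only.
tables = {
    (n, h): rng.normal(scale=0.02, size=(table_size, d_model))
    for n in (2, 3) for h in range(num_hashes)
}
W_gate = rng.normal(scale=0.02, size=(2 * d_model, d_model))

def _hash(ngram, seed):
    acc = seed
    for t in ngram:
        acc = (acc * 1_000_003 + t) & 0xFFFFFFFF
    return acc % table_size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def engram_memory(token_ids, hidden):
    """Add gated n-gram memory to hidden states. hidden: (seq_len, d_model)."""
    out = hidden.copy()
    for i in range(len(token_ids)):
        mem = np.zeros(d_model)
        for n in (2, 3):
            if i + 1 < n:
                continue                          # not enough left context yet
            ngram = tuple(token_ids[i + 1 - n: i + 1])
            for h in range(num_hashes):
                mem += tables[(n, h)][_hash(ngram, seed=h + 1)]
        # Gate: decide, per dimension, how much retrieved memory to mix in,
        # analogous to a key/value match followed by a weighted sum.
        gate = sigmoid(np.concatenate([hidden[i], mem]) @ W_gate)
        out[i] = hidden[i] + gate * mem
    return out

tokens = [11, 42, 42, 7, 99]
h = rng.normal(size=(len(tokens), d_model))
print(engram_memory(tokens, h).shape)             # (5, 64)
```

The retrieval is a pure table lookup, so the memory read stays O(1) per token regardless of how large the tables grow.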
