DeepSeek mHC


DeepSeek published two papers, and the techniques they describe are expected to be used in the upcoming V4 model, which is rumored to bring significant improvements in coding ability.

1 Residuals and Hyper-Connection

  • Residual connections (RC) were proposed by Kaiming He et al. and are used in almost every neural network: \(X_{l+1}=X_l+f(X_l)\)
  • Hyper-Connections (HC) were proposed by a ByteDance team. The key change is using a matrix (n×C) instead of a vector (1×C) for both the residual connection and the network layers, which significantly increases topological complexity without altering the computational overhead. In the HC formulation, the H terms are learnable matrices. The AGM metric is used to show how easily the signal can explode.
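As a rough illustration of the two update rules, here is a minimal plain-Python sketch. It assumes the HC update has the form \(X_{l+1}=H_{res}X_l+H_{post}^{T}f(H_{pre}X_l)\), which is my reading of the formulation. The function names and the toy layer are hypothetical and only serve the example.

```python
def residual_step(x, f):
    """Plain residual connection: x_next = x + f(x), with x a length-C list."""
    return [xi + fi for xi, fi in zip(x, f(x))]


def hyper_connection_step(streams, f, h_res, h_pre, h_post):
    """One HC-style step on n residual streams (streams is n x C).

    Assumed form: X_next = H_res @ X + H_post^T * f(H_pre @ X)
    h_res is n x n, h_pre and h_post are length-n lists (1 x n).
    """
    n, c = len(streams), len(streams[0])
    # Read: mix the n streams into one layer input of width C.
    layer_in = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(c)]
    layer_out = f(layer_in)
    nxt = []
    for i in range(n):
        # Residual path: stream i becomes a learned mixture of all streams.
        mixed = [sum(h_res[i][j] * streams[j][k] for j in range(n)) for k in range(c)]
        # Write: add the layer output back to stream i with weight h_post[i].
        nxt.append([m + h_post[i] * o for m, o in zip(mixed, layer_out)])
    return nxt


def toy_layer(x):
    """Stand-in for a network layer f; it just doubles its input."""
    return [2.0 * v for v in x]


x = [1.0, -2.0, 3.0]
plain = residual_step(x, toy_layer)
# With n = 1 and all H equal to 1, the HC step reduces to the residual step.
hc = hyper_connection_step([x], toy_layer, h_res=[[1.0]], h_pre=[1.0], h_post=[1.0])
```

With a single stream and all H set to 1, the two functions agree, which is one way to see HC as a generalization of RC. With n > 1 and unconstrained `h_res`, nothing stops the mixing matrices from amplifying the signal layer after layer.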

Spectral norm is a new concept that I heard of here for the first time, and the manifold-related discussion that follows is beyond my knowledge as well.

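And here is the simple version, as an analogy:

  • RC: the CEO gives the CTO an order. The CTO passes their own interpretation on to the VP, and also passes along the CEO's original order.
  • HC: the CEO gives the CTO an order, and the CTO passes it on selectively. Anything technical is given more weight when passed to the engineering VP, and anything non-technical is given less weight.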

2 Manifold-Constrained Hyper-Connections

An (N-1)-dimensional object is actually the projection of an N-dimensional one; that is the idea of an N-dimensional manifold. More specifically, mHC is about constraining the spectral norm to be 1.
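To make the constraint more concrete: the spectral norm of a matrix is its largest singular value, i.e. the largest factor by which it can stretch a vector. The sketch below estimates it with power iteration on \(A^{T}A\) in plain Python. The helper names and the two example matrices are my own and are not taken from the paper.

```python
import math


def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def spectral_norm(a, iters=200):
    """Estimate the largest singular value of a by power iteration on A^T A."""
    at = transpose(a)
    # Arbitrary nonzero start vector; it must not be orthogonal to the top singular vector.
    v = [1.0 + 0.1 * i for i in range(len(a[0]))]
    for _ in range(iters):
        w = matvec(at, matvec(a, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # v is now a unit vector, so ||A v|| approximates the largest singular value.
    return math.sqrt(sum(x * x for x in matvec(a, v)))


averaging = [[0.5, 0.5], [0.5, 0.5]]   # rows and columns each sum to 1
stretching = [[1.2, 0.0], [0.0, 0.8]]  # amplifies one direction by 1.2

norm_avg = spectral_norm(averaging)
norm_stretch = spectral_norm(stretching)
# Repeating the stretching matrix over 50 layers compounds the gain.
gain_50_layers = norm_stretch ** 50
```

A matrix with spectral norm 1 cannot amplify the signal, while even a modest norm such as 1.2 compounds to a gain in the thousands over 50 layers. That compounding is the explosion the constraint is meant to rule out.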

3 Methodology

  1. Doubly stochastic matrix (DSM): a nonnegative matrix whose rows and columns each sum to 1
  2. The properties of a DSM
  3. Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967): an iterative algorithm that can be used to obtain a DSM
  4. This only applies to $H_{res}$, which is n×n, but not to $H_{pre}$/$H_{post}$, which are 1×n
  5. Engineering improvements
    • RC uses a residual stream of width C while mHC uses nC, so the question is how to offset the increased I/O cost
    • Kernel fusion, TileLang, and a dual pipeline are all used here
  6. Macro design
    • MQA/GQA/MLA in attention layers, or MoE/GLU, are micro design changes
    • Network connections are macro design. What is next for LayerNorm and softmax?
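The Sinkhorn-Knopp step from item 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation. It assumes the raw n×n parameters are first exponentiated to make every entry positive, and the iteration count and example values are arbitrary choices of mine.

```python
import math


def sinkhorn_knopp(logits, iters=100):
    """Alternately normalize rows and columns of exp(logits).

    For a square matrix with strictly positive entries this converges
    to a doubly stochastic matrix.
    """
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]  # assumption: exp for positivity
    for _ in range(iters):
        # Row normalization: make every row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column normalization: make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


raw = [[0.2, -1.0, 0.5],
       [1.5, 0.3, -0.7],
       [-0.4, 0.8, 0.1]]
dsm = sinkhorn_knopp(raw)
row_sums = [sum(row) for row in dsm]
col_sums = [sum(dsm[i][j] for i in range(3)) for j in range(3)]

# One useful property of DSMs: the product of two DSMs is again a DSM,
# so stacking many such layers keeps the residual mixing well behaved.
product = matmul(dsm, dsm)
product_row_sums = [sum(row) for row in product]
```

After enough iterations both `row_sums` and `col_sums` are close to 1, and the same holds for the product, which is why constraining $H_{res}$ this way keeps the composite mapping across layers from exploding.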
