DeepSeek mHC
DeepSeek published two papers describing new technologies that are expected to be used in the upcoming V4 model, which is rumored to have significant improvements in coding ability.
1 Residuals and Hyper-Connection
- Residual connections (RC) were proposed by Kaiming He et al. and are used in almost every neural network: \(X_{l+1}=X_l+f(X_l)\)
- HC was proposed by a ByteDance team. The key change is to use a matrix (n×C) instead of a vector (1×C) for the residual stream, with learnable mappings that connect the streams to each other and to the network layers. This significantly increases topological complexity without altering the computational overhead of the layers themselves.
Here the H are learnable matrices. The AGM metric is used to show how easily the signal can explode.
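The two update rules can be sketched in NumPy to make the shapes concrete. This is only a minimal sketch under assumptions, not the papers' implementation: the names `residual_step`, `hyper_connection_step` and `max_abs_row_sum` are hypothetical, `np.tanh` stands in for a real attention or MLP block, the H matrices are fixed random values rather than learned ones, and the largest absolute row sum is just a simple gain measure that may not match the exact AGM definition.

```python
import numpy as np

def residual_step(x, f):
    """Plain residual connection: x_{l+1} = x_l + f(x_l)."""
    return x + f(x)

def hyper_connection_step(X, f, H_res, H_pre, H_post):
    """One hyper-connection step on an n-stream residual state.

    X:      (n, C) residual streams
    H_res:  (n, n) mixes the streams with each other
    H_pre:  (1, n) collapses the n streams into one layer input
    H_post: (1, n) spreads the layer output back over the n streams
    """
    layer_in = H_pre @ X                     # (1, C)
    layer_out = f(layer_in)                  # (1, C)
    return H_res @ X + H_post.T @ layer_out  # (n, C)

def max_abs_row_sum(M):
    """Largest absolute row sum, used here as a simple gain measure."""
    return np.abs(M).sum(axis=1).max()

rng = np.random.default_rng(0)
n, C, depth = 4, 8, 32
f = np.tanh  # assumption: stand-in for the real layer function

X = rng.normal(size=(n, C))
H_res = np.eye(n) + 0.1 * rng.normal(size=(n, n))
H_pre = np.full((1, n), 1.0 / n)
H_post = np.ones((1, n))
Y = hyper_connection_step(X, f, H_res, H_pre, H_post)
print(Y.shape)  # (4, 8)

# Composite residual mapping over many layers: a product of
# unconstrained (n, n) matrices, whose gain is free to drift away from 1.
composite = np.eye(n)
for _ in range(depth):
    composite = (np.eye(n) + 0.5 * rng.normal(size=(n, n))) @ composite
print(max_abs_row_sum(composite))
```

With n = 1 and all three H matrices equal to `[[1.0]]`, `hyper_connection_step` reduces to `residual_step`, which is a quick way to see HC as a generalization of RC.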

The spectral norm is a new concept that I heard of for the first time here, and the manifold-related discussion that follows is beyond my knowledge as well.
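And here is the simple version, as an analogy:
- RC: the CEO gives the CTO an order. The CTO passes his own understanding on to the VP, and also passes along the CEO's original order.
- HC: the CEO gives the CTO an order, and the CTO passes it on selectively. Technology-related parts get a higher weight when passed to the Engineering VP, while non-technical parts get a lower weight.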
2 Manifold-Constrained Hyper-Connections
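A rough intuition: an (N-1)-dimensional object is actually the projection of an N-dimensional one, and this is the idea of an N-dimensional manifold.
More specifically, it is about constraining the spectral norm to be 1.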
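For reference, the standard definition of the spectral norm of a matrix $A$ is its largest singular value, which is the most it can stretch any vector:

$$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \sigma_{\max}(A)$$

The spectral norm is submultiplicative, which suggests why a norm of 1 matters for deep stacks. Writing $H_l$ for the residual mapping at layer $l$ (my notation, assumed here for illustration):

$$\Big\|\prod_{l=1}^{L} H_l\Big\|_2 \le \prod_{l=1}^{L} \|H_l\|_2 = 1 \quad \text{if every } \|H_l\|_2 = 1$$

So the composite mapping across $L$ layers cannot amplify the signal, no matter how deep the network is.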

3 Methodology
- Doubly stochastic matrix (DSM): a non-negative square matrix in which every row and every column sums to 1

- The properties of DSM

- Sinkhorn-Knopp algorithm (Sinkhorn and Knopp, 1967): an iterative algorithm that can be used to obtain a DSM by alternately normalizing the rows and columns of a positive matrix

- This only applies to $H_{res}$, which is of size n×n, but not to $H_{pre}$ and $H_{post}$, which are of size 1×n
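The DSM and Sinkhorn-Knopp items above can be checked numerically with a small sketch. The assumptions here are mine, not the paper's: `sinkhorn_knopp` is a hypothetical helper, `exp` is used only to get strictly positive entries, and `n_iters=20` is an arbitrary choice. The script prints the row and column sums, the spectral norm of the result, and the spectral norm of a product of two such matrices.

```python
import numpy as np

def sinkhorn_knopp(logits, n_iters=20):
    """Approximately map an (n, n) matrix to a doubly stochastic one.

    exp() makes every entry positive; alternating row and column
    normalization then drives all row and column sums toward 1.
    """
    M = np.exp(logits - logits.max())  # subtract max for numerical stability
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make columns sum to 1
    return M

rng = np.random.default_rng(0)
n = 4
A = sinkhorn_knopp(rng.normal(size=(n, n)))
B = sinkhorn_knopp(rng.normal(size=(n, n)))

print(A.sum(axis=0), A.sum(axis=1))  # both should be close to all ones
print(np.linalg.norm(A, 2))          # spectral norm, should be close to 1
print(np.linalg.norm(A @ B, 2))      # product of two DSMs, also close to 1
```

The last line is the point for deep networks: multiplying doubly stochastic matrices gives another doubly stochastic matrix, so the composite mapping keeps a spectral norm of 1 instead of growing with depth.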

- Engineering improvements
- RC uses a residual stream of width C and mHC uses nC, so the question is how to offset the increased I/O cost
- Kernel fusion, TileLang, and a dual-pipeline schedule are all used here

- Macro design
- MQA/GQA/MLA in attention layers, and MoE/GLU, are micro-design changes
- Network connections are macro designs. What's next for LayerNorm and softmax?