Knowledge Distillation
Distillation was introduced by Hinton, Vinyals, and Dean in 2015, another masterpiece from Google.
The fundamental idea is that training and inference have different requirements, so a large model trained for accuracy can be compressed into a smaller model for deployment. This builds on model compression, the earlier work by Rich Caruana and colleagues, which the paper cites as the precursor to this idea.
The implementation lets the student learn from the teacher's logits, on top of learning from the ground-truth labels.
1 Softmax with Temperature
\(q_i=\frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}\)
The larger $T$ is, the smaller the differences between the output probabilities, i.e., the softer the distribution.
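A minimal NumPy sketch of temperature-scaled softmax (the function name and the example logits are illustrative, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Compute q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# The same logits become "softer" (closer to uniform) as T grows.
logits = [8.0, 2.0, 0.5]
print(softmax_with_temperature(logits, T=1.0))   # sharply peaked
print(softmax_with_temperature(logits, T=5.0))   # differences shrink
```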
2 Soft Targets and Hard Targets
The teacher's softened outputs are called soft targets, and the true labels are hard targets. The total loss is obtained by adding the two losses: the loss against the soft targets and the cross-entropy against the hard targets, typically as a weighted sum.
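A minimal PyTorch sketch of combining the two losses (the function name, the weight `alpha`, and the use of KL divergence for the soft-target term are assumptions for illustration; the \(T^2\) scaling follows the paper's suggestion to keep gradient magnitudes comparable across temperatures):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of the soft-target loss and the hard-target loss.

    Soft loss: KL divergence between teacher and student distributions
               softened by temperature T, scaled by T^2.
    Hard loss: ordinary cross-entropy against the ground-truth labels.
    """
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Random tensors stand in for a real batch of student/teacher outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```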