## Distilling the Knowledge in a Neural Network (arXiv:1503.02531)

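The softmax with temperature $T$ converts each logit $z_i$ into a class probability $q_i$ ($T=1$ gives the ordinary softmax):

$q_i=\frac{e^{z_i/T}}{\sum_j{e^{z_j/T}}}$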

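Quick check on the logits `[1, 2, 3, 4]` (with `numpy` imported as `np` and `mxnet` as `mx`): `In [6]` applies the formula with T = 2, while `mx.nd.softmax` in `In [7]` is the ordinary softmax, i.e. T = 1.

    In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))
    Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

    In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))
    Out[7]:

    [0.0320586 0.08714432 0.23688284 0.6439143 ]
    <NDArray 4 @cpu(0)>

The largest probability drops from about 0.64 at T = 1 to about 0.46 at T = 2.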

Using a higher value for T produces a softer probability distribution over classes.
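
To see the softening effect for several temperatures at once, here is a minimal NumPy sketch of the formula above. The function name `softmax_with_temperature` and the chosen temperatures are illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T (hypothetical helper name)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [1, 2, 3, 4]
for T in (1.0, 2.0, 10.0):
    # Higher T pushes the distribution toward uniform.
    print(T, softmax_with_temperature(logits, T))
```

With T = 2 this reproduces the NumPy result shown earlier; as T grows, the probabilities approach 1/4 each.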

Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.
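
The "special case" claim can be sketched as follows, as I recall the argument from the paper. Let $v_i$ be the logits of the cumbersome model, $p_i$ its soft targets, $z_i$ and $q_i$ the logits and probabilities of the distilled model, $N$ the number of classes, and $C$ the cross-entropy between $p$ and $q$, all at temperature $T$:

$$\frac{\partial C}{\partial z_i}=\frac{1}{T}\left(q_i-p_i\right)=\frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}}-\frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$

When $T$ is large compared with the logits, $e^{x/T}\approx 1+x/T$, so

$$\frac{\partial C}{\partial z_i}\approx\frac{1}{T}\left(\frac{1+z_i/T}{N+\sum_j z_j/T}-\frac{1+v_i/T}{N+\sum_j v_j/T}\right)$$

Assuming the logits are zero-mean for each case ($\sum_j z_j=\sum_j v_j=0$), this simplifies to

$$\frac{\partial C}{\partial z_i}\approx\frac{1}{NT^2}\left(z_i-v_i\right)$$

which, up to the constant factor, is the gradient of $\frac{1}{2}(z_i-v_i)^2$, i.e. of matching the logits directly.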

In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
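
The simplest form described above can be sketched in NumPy as a cross-entropy between the two temperature-softened distributions. This is only an illustration under stated assumptions: the names `log_softmax` and `soft_target_loss`, the example logits, and T = 4 are made up here, and a real training loop would minimize this loss over the distilled model's parameters.

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Log of the temperature softmax, computed stably (hypothetical helper)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_target_loss(student_logits, teacher_logits, T):
    """Mean cross-entropy between teacher soft targets and student outputs at temperature T."""
    p = np.exp(log_softmax(teacher_logits, T))   # soft targets from the cumbersome model
    log_q = log_softmax(student_logits, T)       # distilled model at the same T
    return float(-(p * log_q).sum(axis=-1).mean())

# One transfer-set case with 4 classes (made-up logits).
teacher = np.array([[1.0, 2.0, 3.0, 4.0]])
student = np.array([[0.5, 1.0, 2.0, 5.0]])

loss = soft_target_loss(student, teacher, T=4.0)          # used during training
pred = np.exp(log_softmax(student, T=1.0)).argmax(axis=-1)  # after training: T = 1
print(loss, pred)
```

The loss is smallest when the student's softened distribution equals the teacher's, which is why the same T must be used on both sides during training.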

A confusion matrix can be used to find out which classes the model confuses most often.
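
As a small illustration of that use of a confusion matrix, the sketch below counts (true class, predicted class) pairs and reports the largest off-diagonal entry. The helper name `confusion_matrix` and the toy labels are assumptions made up for this example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)   # unbuffered add handles repeated index pairs
    return cm

# Toy labels (made up): class 2 is often mistaken for class 1.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 1, 1])

cm = confusion_matrix(y_true, y_pred, 3)

# Zero the diagonal so only mistakes remain, then find the most frequent one.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = np.unravel_index(errors.argmax(), errors.shape)
print(cm)
print("most confused: true", true_cls, "predicted as", pred_cls)
```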

