ML101 - 3
Start with regression vs. classification, then introduce softmax. There seems to be a longer story behind softmax than just normalizing the outputs. (Answer: use sigmoid for binary classification, which is equivalent to softmax in that case.)
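A minimal numeric check of that equivalence (assuming a plain numpy setup, not code from the course): a sigmoid over a single logit z gives the same probability as a 2-class softmax over the logits [z, 0].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return e / e.sum()

z = 1.7  # arbitrary logit
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]  # probability of class 1
print(p_sigmoid, p_softmax)  # both ~0.8455, identical up to floating point
```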
1 Loss functions
Cross-entropy actually comes from the maximum likelihood method: minimizing it is the same as maximizing the likelihood of the observed labels.
The cross-entropy error surface is also smoother (compared with MSE for classification), so gradient descent is less likely to get stuck in flat regions or local minima.
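A small sketch of the likelihood connection (assumed numpy-only, toy numbers): for binary labels, the cross-entropy loss equals the negative log-likelihood of the data under the predicted Bernoulli probabilities.

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # ground-truth labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # predicted P(y=1) from the model

# Likelihood of the observed labels, and its negative log
likelihood = np.prod(np.where(y == 1, p, 1 - p))
neg_log_likelihood = -np.log(likelihood)

# Binary cross-entropy summed over samples
cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(neg_log_likelihood, cross_entropy)  # identical up to floating point
```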
2 Batch Normalization
Feature normalization is important when different features have different ranges. In general, it makes gradient descent converge faster.
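A minimal sketch of per-feature normalization (assumed numpy-only, toy data): subtract each feature's mean and divide by its standard deviation over the training set, so every dimension has a comparable scale.

```python
import numpy as np

X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])   # two features with very different ranges

mu = X_train.mean(axis=0)            # per-feature mean
sigma = X_train.std(axis=0) + 1e-8   # per-feature std (epsilon avoids division by zero)
X_norm = (X_train - mu) / sigma      # each column now has mean ~0, std ~1

print(X_norm.mean(axis=0), X_norm.std(axis=0))
```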
Even if the input features are normalized, after the first layer of the network the outputs are no longer normalized. So we'd better normalize them again, this time computing the statistics over a batch of samples. That's where batch normalization applies.
During inference we don't always have a batch, and unlike feature normalization we can't simply precompute the mean/variance over the whole training data, so the solution is to keep a moving (weighted) average of the batch mean/variance during training and reuse it at inference.
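A minimal sketch of both modes (assumed numpy-only; the class name and parameters are hypothetical, not from the course): training normalizes with the current batch statistics and updates a moving average, inference reuses the accumulated statistics so no batch is needed.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)        # learnable scale
        self.beta = np.zeros(dim)        # learnable shift
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)   # batch statistics
            # accumulate moving averages for use at inference time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var  # no batch needed
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(dim=2)
batch = np.random.randn(8, 2) * 3 + 5
print(bn(batch, training=True).mean(axis=0))       # roughly 0 per feature
print(bn(np.array([[5.0, 5.0]]), training=False))  # uses the running statistics
```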
Why does batch norm help? The "internal covariate shift" explanation is NOT supported by the data, even though it sounds plausible.
Experimental data and theoretical analysis suggest the real benefit comes from changing the landscape of the error surface (making it smoother), and this effect is somewhat serendipitous.