Trainability¶
Problems with Gradients¶
Dead ReLU: a neuron whose pre-activation value is negative outputs 0 and receives zero gradient, so its weights are not updated. It can recover if the updated outputs from the previous layer push its pre-activation back to positive.
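The zero gradient can be checked directly. Below is a minimal sketch (PyTorch, my own example, not from the notes): a neuron with a negative pre-activation produces a ReLU output of 0, and its weights and bias get exactly zero gradient.

```python
import torch

x = torch.tensor([1.0, 2.0])                       # fixed input from the previous layer
w = torch.tensor([-3.0, 0.5], requires_grad=True)  # chosen so the pre-activation is negative
b = torch.tensor(0.0, requires_grad=True)

z = w @ x + b          # pre-activation: -3*1 + 0.5*2 = -2 < 0
a = torch.relu(z)      # ReLU output is 0
loss = (a - 1.0) ** 2  # any downstream loss
loss.backward()

print(w.grad, b.grad)  # both zero: the neuron is "dead" for this input
```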
Vanishing Gradients¶
Exploding Gradients¶
Initialization¶
Exercise: what happens if we initialize all parameters to 0? No update: the activations and back-propagated gradients are all zero, so the weights never move and the symmetry between neurons is never broken.
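A small sketch of the exercise (PyTorch, my own illustration; biases are omitted so the claim holds exactly): with every weight initialized to 0, all gradients are 0 and gradient descent changes nothing.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(4, 8, bias=False),
    nn.ReLU(),
    nn.Linear(8, 1, bias=False),
)
for p in net.parameters():
    nn.init.zeros_(p)  # every weight starts at exactly 0

x = torch.randn(16, 4)
y = torch.randn(16, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

# Hidden activations are relu(0) = 0 and the output weights are 0, so every
# gradient vanishes and the parameters stay at 0 forever.
print([p.grad.abs().max().item() for p in net.parameters()])  # [0.0, 0.0]
```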
The variance of a node's value (before activation) grows with the number of incoming connections (its fan-in).
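A quick simulation of this claim (NumPy, my own illustration): with unit-variance inputs and unit-variance weights, the variance of a pre-activation z = w · x grows roughly linearly with the fan-in.

```python
import numpy as np

rng = np.random.default_rng(0)
for fan_in in (10, 100, 1000):
    x = rng.standard_normal((10_000, fan_in))  # inputs with variance 1
    w = rng.standard_normal(fan_in)            # weights with variance 1
    z = x @ w                                  # pre-activation of one node
    print(fan_in, z.var())                     # variance is roughly fan_in
```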
We want to keep the variance of all neurons roughly the same at initialization.
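One standard way to do this, sketched below (NumPy, my own illustration), is to scale each weight's standard deviation by 1/sqrt(fan_in) (LeCun/Xavier-style; He initialization adds a factor of sqrt(2) to compensate for ReLU), which keeps the pre-activation variance near 1 from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((10_000, 512))  # layer input with variance ~1
for layer in range(5):
    fan_in = x.shape[1]
    # Scale the weights so that Var(z) = fan_in * Var(w) stays near 1.
    W = rng.standard_normal((fan_in, 512)) / np.sqrt(fan_in)
    x = x @ W                           # pre-activations (nonlinearity omitted here)
    print(layer, x.var())               # stays close to 1 at every layer
```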