This paper shows how the initialization of neural network weights affects the success of training, and that larger networks are more likely to have subnetworks within them with the “lucky” initial weight numbers.
What can we learn from this paper?
The results of this paper are theoretical and do not provide, as far as I can tell, any specific practical insights. However, I believe that reading it is very important for a better overall understanding of how deep neural networks operate.
Prerequisites (to understand the paper, what does one need to be familiar with?)
General structure of deep neural networks, initialization of weights, gradient descent algorithm
It is well known that the majority of connections (and corresponding weights) in modern trained deep neural networks can be removed without seriously affecting their predictive performance. This is known as neural network pruning, and there are established techniques for doing it, the simplest being simple removal of a given percentage of connections with the smallest absolute magnitudes of weights.
Since after pruning the full neural network we end up with a much more computationally efficient configuration, the natural question is, why not train a smaller network with this configuration to begin with? The reason is that, when the chosen subnetwork is much smaller than the original full network (perhaps below 20% of the size), its accuracy when trained this way is usually inferior to that of the one pruned from a trained full network. This begs the question, why is this happening?
This paper shows that it’s not just the configuration of the smaller network, but the specific values of the initial weights (normally chosen randomly under certain constraints) that, for this configuration, make the gradient-descent based training process end up in a good minimum of the objective function. If the smaller network is trained from the start with random weights instead, it likely will get stuck in local minima, resulting in lower accuracy.
In the case of a much larger network, there may be millions (or perhaps many more than that, depending on the network size) of candidate subnetworks, and thus a good chance that some of them will be initialized in a way that the gradient descent algorithm can guide them toward the better minima without ending up in the inferior ones. These subnetworks have “won the lottery” due to the fortunate initialization. The authors show that the pruned subnetworks retrained without the larger network but with the same weights train faster and achieve similar accuracy to that of the full network, in some cases exceeding it.
In the discussion and in Appendix F.5, the paper notes that the weights of the winning subnetwork move more during the training than other weights, which means that these weights did not just get lucky by being in the neighborhood of the global minimum to begin with. To me, this makes perfect sense though, as the weights of a smaller neural network will generally be larger in magnitude to compensate for the decrease in size.
Mathematically, neural networks are usually initialized using Xavier (Glorot) initialization, according to which the weights are chosen randomly from an interval with size proportional to the inverse of a square root of the number of incoming and outgoing network connections for each layer. If a subnetwork is picked that is only 1/10 of the size of the original network, it means that its weights should be larger by an average factor of the square root of 10, so the weights of the winning subnetwork, when chosen from the initialization based on the larger full network, have ways to go to increase their magnitude. Finding a way to make the initial weights larger (as they should be in the intended pruned subnetwork) without compromising the training of the full network could definitely be an interesting topic of future research.