Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

Review of paper by Pedro Domingos, University of Washington, 2020

In this paper, the author shows that a neural network trained by first-order gradient descent with a small learning rate can be approximately represented as a kernel machine: the network effectively memorizes the training points and then uses a similarity function (kernel) between them and the test input to make predictions, much as SVMs and other kernel methods do. This insight should lead to a better general understanding of how deep neural networks operate and, hopefully, will help improve future algorithms.

What can we learn from this paper?

That the output of a trained neural network can be viewed as a superposition of its training examples.

Prerequisites (what should one be familiar with to better understand the paper?)

  • Neural networks
  • Kernel methods
  • Neural tangent kernels

Discussion

Kernel machines are models that predict an output value for a given input x by calculating a weighted sum of the values of a symmetric positive-definite similarity function (kernel) K between x and each of the training inputs xi, possibly with an added bias constant and a final nonlinear transformation. The most common examples of kernel machines are support vector machines (SVMs), which typically use kernels that are linear (the dot product), polynomial (polynomial expressions of the dot product), or RBF (radial basis functions, i.e. Gaussian similarity).
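To make this concrete (in standard notation rather than a quote from the paper), a kernel machine predicts

    \hat{y}(x) = g\Big( \sum_i a_i \, K(x, x_i) + b \Big),

where the x_i are the training inputs, the a_i and b are learned coefficients, and g is an optional output nonlinearity; for an RBF kernel, for instance, K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2).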

It has recently been shown that the evolution of a deep neural network during training with first-order gradient descent can be described by its neural tangent kernel (NTK), which is calculated as the dot product of the gradients of the network output with respect to the weights, evaluated at the two inputs (i.e., the per-weight products of the gradients summed over all weights). The contribution of each training input to the change in the output of the network in response to a test input is proportional to the NTK calculated between these two inputs (with the current weight values).
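In formulas (again standard notation, as I read it), the tangent kernel of a model y = f_w(x) is

    K_{f,w}(x, x') = \nabla_w f_w(x) \cdot \nabla_w f_w(x') = \sum_j \frac{\partial f_w(x)}{\partial w_j} \, \frac{\partial f_w(x')}{\partial w_j},

and under continuous-time gradient descent on a loss \sum_i L(y_i^*, y_i) the output for a test input x evolves as

    \frac{dy(x)}{dt} = -\sum_i \frac{\partial L(y_i^*, y_i)}{\partial y_i} \, K_{f,w}(x, x_i),

which is exactly the statement that each training input contributes in proportion to its tangent kernel with x.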

In overparameterized networks, the weights do not change significantly during training. Thus, by calculating the NTK between a test input and all training inputs for the fully trained network, we can estimate how strongly each training input influenced the prediction for this test input, that is, how much it contributed to the change of the output relative to the original untrained network.
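As an illustration of what this computation looks like in practice, here is a minimal sketch (my own code, assuming JAX; the tiny network, layer sizes, and inputs are made up for illustration) of the empirical tangent kernel between two inputs for a small fully-connected network:

    import jax
    import jax.numpy as jnp

    def init_params(key, sizes=(2, 16, 1)):
        # One (weights, bias) pair per layer, with simple 1/sqrt(fan_in) scaling.
        keys = jax.random.split(key, len(sizes) - 1)
        return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
                for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

    def f(params, x):
        # A tiny tanh MLP with a single scalar output.
        for W, b in params[:-1]:
            x = jnp.tanh(x @ W + b)
        W, b = params[-1]
        return (x @ W + b)[0]

    def tangent_kernel(params, x1, x2):
        # K(x1, x2): per-weight products of the two gradients, summed over all weights.
        g1 = jax.grad(f)(params, x1)
        g2 = jax.grad(f)(params, x2)
        return sum(jnp.vdot(a, b)
                   for a, b in zip(jax.tree_util.tree_leaves(g1),
                                   jax.tree_util.tree_leaves(g2)))

    params = init_params(jax.random.PRNGKey(0))
    print(tangent_kernel(params, jnp.array([0.1, 0.2]), jnp.array([0.3, -0.4])))

Evaluating this kernel between a test input and every training input (for the trained weights) gives the per-example influence estimates described above.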

The present paper takes this line of reasoning one step further, defining the path kernel as the integral of the neural tangent kernel over the path taken by the weights during training (without assuming a small change in weights), thus producing a more accurate estimate of the influence of each training input on the output for our test input. The paper proves that, in the limit of an infinitesimally small learning rate, the change in this output during training can be described as a linear combination of path kernels with respect to each training input. This observation makes deep neural network training more interpretable and explains why, like kernel methods, neural networks often generalize well while appearing to memorize all of the training instances, but can also quickly lose accuracy as the test input moves away from the nearest training input.
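Schematically (restated from memory, so the notation may differ slightly from the paper's), with c denoting the path taken by the weights during training, the path kernel and the main result read

    K^p_{f,c}(x, x') = \int_{c} K_{f,w(t)}(x, x') \, dt, \qquad y(x) = \sum_i a_i \, K^p_{f,c}(x, x_i) + b,

where each a_i is a path-averaged derivative of the loss on training example i (weighted by the tangent kernel) and b is the output of the initial model.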

In the author’s view, the most significant implication of these results is that neural networks may not be as unique as they are often thought to be as a way of discovering new representations of the data, and that in fact, similar to other ML techniques, they rely on predefined features (the gradients of a function determined by the architecture of the network). Understanding this mechanism may lead to optimization methods superior to gradient descent.

In terms of kernel methods, the path kernel overcomes the scalability issues that existing approaches face on large training sets, since it is no longer necessary to compute and store the Gram matrix during training (the matrix of kernel values for all pairs of training examples, whose size is quadratic in the number of examples). And since all training inputs are stored as a superposition in the model parameters, the size of the model and its inference time do not depend on the size of the training dataset.

The author also discusses the similarities between gradient descent and gradient boosting, suggests that the solution of every convex learning problem is a kernel machine, and discusses some of the potential research ideas stemming from the paper.

To me, this is definitely a very interesting paper that follows a novel kernel-based approach to neural network analysis and is likely to foster new ideas and better future models.

P.S. (December 19, 2020). In a comment on Medium, Alan Fern brings up the point that, since the coefficients a and b in the author’s expression for the neural network output depend on the input, it is not a true kernel machine, and that, since a is in fact obtained by dividing by the same kernel it multiplies, the expression may be just a mathematical trick that makes the result look like one. This is a valid critique, and the author mentions this issue in Remark 1 (which I probably should have discussed in my review).
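For reference, the coefficient in question has (as far as I can restate it) the form

    a_i = -\overline{L'_i}, \qquad \overline{L'_i} = \frac{\int_{c} L'_i(t) \, K_{f,w(t)}(x, x_i) \, dt}{\int_{c} K_{f,w(t)}(x, x_i) \, dt},

so when a_i is multiplied by the path kernel K^p_{f,c}(x, x_i) = \int_c K_{f,w(t)}(x, x_i) \, dt, the denominator cancels; this is exactly the division Fern points to.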

However, if the derivative of the loss does not change a lot during training (I would need to look at some practical examples to see how valid this assumption is), the dependence on the input does (approximately) go away. Furthermore, since 1) the coefficients do have a clear physical meaning (each a is the average of the derivative of the loss with respect to the network output along the gradient descent path, weighted by the tangent kernel, and b is the output of the initial network), and 2) the kernel representation nicely explains some similarities observed between neural networks and kernel methods, I believe that the paper is still valid and interesting.

Original paper link

Suggested further reading
