This paper suggests a new algorithm for training deep neural networks that can be run efficiently without a GPU.
What can we learn from this paper?
That increasing the performance of dedicated hardware (such as GPUs) is not the only way to speed up training and inference of large neural networks.
Prerequisites (to understand the paper, what does one need to be familiar with?)
- Deep neural networks
- Hash functions and sparse networks, for those who want to understand the details of the algorithm
The paper's motivation is to find alternatives to hardware acceleration for improving the time performance of deep neural networks.
In this article, the authors build on adaptive dropout and locality-sensitive hashing (LSH), ideas previously developed by themselves and other researchers. The core of the strategy is to drop out the majority of neurons in the network at every training step, which greatly reduces the number of calculations. The neurons kept active at each step are picked from those with the largest activations, and LSH makes this selection efficient: likely high-activation neurons are looked up in hash tables instead of being found by computing every activation first. For those interested in the details of this process, I recommend reading the LSH paper (Spring and Shrivastava, 2017).
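The selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions, not the authors' implementation: it uses SimHash (signed random projections) as the LSH family, and the layer sizes, table count, bit count, and all function names are made up for the example. SimHash favors neurons whose weight vectors point in a similar direction to the input, which is only a proxy for a large activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the paper.
n_in, n_neurons = 64, 1000
n_tables, n_bits = 8, 6

W = rng.standard_normal((n_neurons, n_in))  # one weight row per neuron
b = np.zeros(n_neurons)

# SimHash: each table hashes a vector by the signs of n_bits random projections.
planes = rng.standard_normal((n_tables, n_bits, n_in))
powers = 1 << np.arange(n_bits)

def simhash(v):
    """Return one integer bucket id per table for vector v."""
    bits = (planes @ v > 0).astype(np.int64)  # shape (n_tables, n_bits)
    return bits @ powers                      # shape (n_tables,)

def build_tables(weights):
    """Insert every neuron id into one bucket of each table, keyed by its weight hash."""
    tables = [{} for _ in range(n_tables)]
    for j, w in enumerate(weights):
        for t, code in enumerate(simhash(w)):
            tables[t].setdefault(int(code), []).append(j)
    return tables

def active_neurons(x, tables):
    """Union of the buckets that the input x falls into."""
    ids = set()
    for t, code in enumerate(simhash(x)):
        ids.update(tables[t].get(int(code), []))
    return np.array(sorted(ids), dtype=np.int64)

def sparse_forward(x, tables):
    """ReLU layer output computed only for the retrieved neurons; the rest stay at zero."""
    ids = active_neurons(x, tables)
    out = np.zeros(n_neurons)
    out[ids] = np.maximum(W[ids] @ x + b[ids], 0.0)
    return out, ids

tables = build_tables(W)
x = rng.standard_normal(n_in)
out, ids = sparse_forward(x, tables)
print(f"computed {ids.size} of {n_neurons} neurons")
```

In a real training loop the weights change after every update, so the tables would have to be refreshed from time to time. The sketch leaves that out.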
The sparsity of the resulting network allows for asynchronous parallelization, since two concurrent updates rarely touch the same neurons and conflicts are therefore unlikely. This gives better performance on systems with many CPU cores than standard TensorFlow-based training.
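A rough way to see why conflicts are rare: if two training samples each activate k out of n neurons, and the active sets are assumed to be uniformly random (a simplification, since LSH deliberately picks similar neurons for similar inputs), the expected number of shared neurons is k²/n. The function names and the values of n and k below are illustrative choices of mine, not figures from the paper.

```python
import numpy as np

def expected_overlap(n_neurons, k_active):
    """Expected number of neurons shared by two uniformly random active sets of size k."""
    return k_active * k_active / n_neurons

def simulated_overlap(n_neurons, k_active, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(trials):
        a = rng.choice(n_neurons, size=k_active, replace=False)
        b = rng.choice(n_neurons, size=k_active, replace=False)
        total += np.intersect1d(a, b).size
    return total / trials

n, k = 10_000, 50  # hypothetical layer width and active-set size
print("analytic :", expected_overlap(n, k))
print("simulated:", simulated_overlap(n, k))
```

With these numbers two samples share, on average, well under one neuron, so unsynchronized updates from different threads would seldom write to the same weights.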
The authors discuss various tweaks that can improve the performance of the suggested algorithm and evaluate it against TensorFlow on both GPU and CPU systems. The two large datasets used for evaluation were Delicious-200K (bookmarks) and Amazon-670K (product-to-product recommendations) from the Extreme Classification Repository.
For both datasets, the same fully-connected network with more than 100 million parameters was used. All three setups (SLIDE, TensorFlow-GPU, and TensorFlow-CPU) achieved nearly identical accuracy on the validation set. However, the new SLIDE approach was about twice as fast to train as TensorFlow-GPU and almost an order of magnitude faster than TensorFlow-CPU.
Since the authors only consider a fully-connected network, it is not clear from the paper whether other, currently more popular architectures (e.g. convolutional networks) can be made more efficient using a similar approach. If the suggested technique were extended to arbitrary network structures, it would be an important step in making modern deep learning more efficient.
Original paper link
SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems by B. Chen, T. Medini, J. Farwell, S. Gobriel, C. Tai, and A. Shrivastava, 2020
Scalable and Sustainable Deep Learning via Randomized Hashing by R. Spring and A. Shrivastava, 2017
Adaptive Sampled Softmax with Kernel Based Sampling by G. Blanc and S. Rendle, 2018