Inspired by the widespread use of standard MNIST as a playground dataset for deep learning, the author has developed a new dataset, MNIST-1D, that is even smaller (each sample is a one-dimensional sequence of just 40 numbers) but harder to predict on. It shows a more obvious difference in performance across network architectures and is more conducive to exploring topics such as “lottery tickets” and the double descent phenomenon.
What can we learn from this paper?
That, by looking at a dataset with smaller inputs, we can hope to better understand and interpret some fundamental concepts of deep learning.
Prerequisites (to better understand the paper, what should one be familiar with?)
- Dense and convolutional neural networks
- MNIST dataset
- For part 4 of the paper: the lottery ticket hypothesis, double descent, gradient-based meta-learning, and activation functions.
To advance deep learning research, a lot of experimentation is required. It is essential to be able to first try new ideas on small enough problems so that a single run of the train-predict cycle doesn’t take hours or even days. That’s why the MNIST handwritten digits dataset, consisting of 60,000 training and 10,000 test grayscale images of size 28×28, is so popular among researchers. However, one of the shortcomings of MNIST is that many modern approaches can achieve 98-99% prediction accuracy on it, which makes it harder to differentiate between them.
In this paper, the author proposes a new, artificially generated one-dimensional dataset that has even smaller inputs (each sample is of size 40, about 20 times smaller than the 28×28=784 for MNIST), can be easily recreated with a different number of samples and different noise levels, and provides greater variability in performance between architectures (e.g. convolutional and dense). The basic sample generation technique, as illustrated by the picture, has two steps. First, a short base class pattern of length 12 is designed by hand for each digit between 0 and 9 (as shown in the middle row, where the horizontal axis represents the values and the vertical axis the index of each value 0 to 11). Then a series of randomized padding, dilation, scaling, and translation operations is applied to generate multiple distorted samples of each class. The default version of the dataset contains 4000 training and 1000 test samples, although these numbers can be changed by adjusting the code.
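The generation steps described above can be sketched in a few lines of NumPy. This is only a simplified illustration, not the author's code: the function names, the parameter ranges (`max_pad`, `max_shift`, the scaling interval, `noise_std`), the use of a circular shift for translation, and the random placeholder templates are all assumptions made for the example.

```python
import numpy as np

def make_sample(template, rng, final_len=40, max_pad=24, max_shift=10, noise_std=0.1):
    """Distort one length-12 class template into a length-`final_len` sample.

    All ranges below are illustrative assumptions, not the paper's settings.
    """
    template = np.asarray(template, dtype=float)
    # Padding: append a random number of zeros to the template.
    pad = int(rng.integers(0, max_pad + 1))
    x = np.concatenate([template, np.zeros(pad)])
    # Dilation: linearly resample the padded signal to the output length.
    old_grid = np.linspace(0.0, 1.0, num=len(x))
    new_grid = np.linspace(0.0, 1.0, num=final_len)
    x = np.interp(new_grid, old_grid, x)
    # Scaling: multiply the amplitude by a random factor.
    x = x * rng.uniform(0.5, 1.5)
    # Translation: circularly shift the signal by a random offset.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    x = np.roll(x, shift)
    # Noise: add i.i.d. Gaussian noise to every position.
    return x + rng.normal(0.0, noise_std, size=final_len)

def make_dataset(templates, n_per_class, seed=0):
    """Generate `n_per_class` distorted samples for every class template."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for label, template in enumerate(templates):
        for _ in range(n_per_class):
            xs.append(make_sample(template, rng))
            ys.append(label)
    return np.stack(xs), np.array(ys)

# Placeholder templates: random values, NOT the hand-designed digit patterns.
templates = np.random.default_rng(42).normal(size=(10, 12))
X, y = make_dataset(templates, n_per_class=5)
```

Changing `n_per_class` or `noise_std` here mirrors the point made above: a synthetic dataset of this kind can be regenerated at a different size or noise level by changing a couple of arguments.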
According to the paper, human prediction accuracy on this dataset is about 96% (I will take the author’s word for it), while a fully-connected neural network achieves about 68% (+/-2%) accuracy and a convolutional model about 94% (+/-2%). Appendix B discusses a human outperforming a CNN on this task. This seems dubious to me, however. By making a few small changes (increasing the size of the network, adding a small weight decay penalty, and using PReLU activations), I was able to achieve, with the same training data and the same test set, an average accuracy of 96.8% (+/-0.5%) for a convolutional network and 73.3% (+/-1.4%) for a dense network. These figures are averages over 100 runs with different random seeds, each run holding out a random 10% of the training samples as a validation set and keeping the model with the best validation accuracy.
The accuracy of the dense network on MNIST-1D improves quite substantially with more training data, reaching over 93% for a training set that is 10 times bigger (40,000 samples). This suggests that the dense model’s accuracy on the original dataset with 4,000 samples could probably be increased a lot as well by using appropriate augmentations (compensating for the lack of spatial awareness), similarly to MNIST, where augmentations let dense networks achieve 99.65% accuracy (almost the same as the best CNNs) as opposed to just above 98% when trained on base images only.
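As a rough illustration of what such an augmentation might look like for one-dimensional inputs, the sketch below enlarges a training set tenfold with random circular shifts. It is hypothetical: the function name, the shift range, and the assumption that small shifts preserve the label are all mine, and I have not measured how much this particular augmentation would help a dense network.

```python
import numpy as np

def augment_with_shifts(X, y, copies=9, max_shift=5, seed=0):
    """Return the original samples plus `copies` randomly shifted versions.

    Assumes that a small circular shift does not change a sample's label.
    """
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        # Draw an independent shift for every sample in this copy.
        shifts = rng.integers(-max_shift, max_shift + 1, size=len(X))
        shifted = np.stack([np.roll(row, int(s)) for row, s in zip(X, shifts)])
        X_parts.append(shifted)
        y_parts.append(y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Stand-in data with the MNIST-1D input size (40 values per sample).
X_train = np.random.default_rng(1).normal(size=(100, 40))
y_train = np.arange(100) % 10
X_aug, y_aug = augment_with_shifts(X_train, y_train)
```

With `copies=9`, the augmented set is 10 times the size of the original, matching the ratio between the 4,000-sample and 40,000-sample training sets discussed above.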
In any case, none of the above invalidates the main point in the paper that the performance of convolutional and dense networks on the new MNIST-1D dataset, when they are trained in a straightforward way, is substantially different.
The paper goes on to show possible applications of the new dataset by analyzing different deep learning problems, with some pretty interesting results. For the lottery ticket hypothesis, as formulated in the paper by Frankle and Carbin (2019), the analysis on MNIST-1D suggests that, contrary to the conclusions of the prior paper that were based on MNIST, the performance of the smaller “winning ticket” network depends more on its sparsity pattern (which may, according to the author, reflect the spatial inductive bias) and less on the initial weights, which were the primary factor in the original lottery ticket hypothesis paper. While more analysis is clearly needed, this is not necessarily a contradiction: the structure of the MNIST-1D dataset is quite different and seems to be affected by spatial features a lot more (as evidenced by the inferior performance of dense networks), which may influence the behavior of “winning tickets”. To me, this example shows the need for deeper analysis and understanding, since all datasets have their own unique characteristics, and poorly understood phenomena may manifest on them in different, hard-to-interpret ways (thus providing more data for analysis).
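To make the distinction between the sparsity pattern and the initialization concrete, here is a minimal NumPy sketch of one-shot magnitude pruning and the kind of controls one could compare. It is not the paper's experimental code: the weight arrays are random stand-ins, the pruning is done in a single step rather than iteratively with retraining, and all names are hypothetical.

```python
import numpy as np

def magnitude_mask(trained_weights, sparsity):
    """Boolean mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    flat = np.abs(trained_weights).ravel()
    n_prune = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=bool)
    # argsort is ascending, so the first n_prune indices are the smallest weights.
    mask[np.argsort(flat)[:n_prune]] = False
    return mask.reshape(trained_weights.shape)

rng = np.random.default_rng(0)
# Random stand-ins for one layer's weights before and after training.
w_init = rng.normal(size=(40, 100))
w_trained = w_init + rng.normal(scale=0.5, size=w_init.shape)

mask = magnitude_mask(w_trained, sparsity=0.9)

# "Winning ticket": the pruned sparsity pattern with the original initial weights.
ticket_original_init = w_init * mask
# Control 1: same sparsity pattern, fresh random initial weights.
ticket_random_init = rng.normal(size=w_init.shape) * mask
# Control 2: original initial weights, but a shuffled pattern with equal sparsity.
shuffled_mask = rng.permutation(mask.ravel()).reshape(mask.shape)
ticket_shuffled_mask = w_init * shuffled_mask
```

Retraining each of the three sparse networks and comparing their accuracy would indicate whether the mask or the initial values matter more. If the first control stays close to the winning ticket while the second does not, the sparsity pattern is doing most of the work.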
The author also examines the deep “double descent” phenomenon (a temporary growth of the test loss as the network’s size increases, followed by a decrease once the interpolation threshold in the number of network parameters is reached) with a negative log-likelihood (NLL) loss function. He finds that for this type of loss the threshold equals the number of training examples and, unlike in the case of mean squared error (MSE) loss, is not proportional to the number of classes. This is a great example of how the new dataset can be used to look at fundamental deep learning issues. I guess the caveat here is that, as opposed to MSE loss, the NLL loss is not particularly meaningful (in fact, the metric that we are likely to be interested in, the prediction accuracy, doesn’t experience any noticeable decline as the network grows), but it’s unquestionably a behavior worth studying nonetheless.
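To get a feel for what these thresholds mean in terms of network size, the sketch below counts the parameters of a dense network with one hidden layer and finds the smallest hidden width whose parameter count reaches a given threshold. Several things here are assumptions for illustration only: the single-hidden-layer architecture, the use of the default training set size of 4,000 (the paper's double descent experiment may use a different size), and taking the MSE threshold to be exactly the number of examples times the number of classes.

```python
def mlp_param_count(n_inputs, hidden_width, n_classes):
    """Weights and biases of a dense network with one hidden layer."""
    hidden_layer = (n_inputs + 1) * hidden_width
    output_layer = (hidden_width + 1) * n_classes
    return hidden_layer + output_layer

def smallest_width_reaching(threshold, n_inputs=40, n_classes=10):
    """Smallest hidden width whose parameter count is at least `threshold`."""
    width = 1
    while mlp_param_count(n_inputs, width, n_classes) < threshold:
        width += 1
    return width

n_train, n_classes = 4000, 10                      # assumed sizes, see the note above
width_nll = smallest_width_reaching(n_train)       # threshold = number of examples
width_mse = smallest_width_reaching(n_train * n_classes)  # assumed threshold = n * K
print(width_nll, width_mse)
```

Under these assumptions, the NLL threshold is reached by a far narrower network than the MSE one, which is why the two loss functions would place the double descent peak at very different model sizes.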
There are some other thought-provoking results in the paper related to hyperparameter optimization via meta-learning, activation function optimization, etc. Overall, this is definitely an interesting dataset to examine. As mentioned above, one needs to be careful when generalizing the results obtained on MNIST-1D data to other, bigger examples, but the dataset has the potential to provide many new insights into modern deep learning.
Original paper link
Scaling *down* Deep Learning by S. Greydanus, 2020
The author’s blog post about the paper
Suggested further reading
Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search by A. Rawal, J. Lehman, F.P. Such, et al., 2020