Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data

Review of paper by Felipe Petroski Such, Aditya Rawal, Joel Lehman, Kenneth O. Stanley, and Jeff Clune, Uber AI Labs, 2019

This paper develops a framework for faster training of deep neural networks by adding a trainable generator of synthetic data.

What can we learn from this paper?

A novel generative structure for accelerated neural network training and architecture search.

Prerequisites (to understand the paper, what does one need to be familiar with?)

  • General neural network training concepts
  • Generative networks

Helpful to know but can also be learned from the paper and its references

  • Neural architecture search
  • Meta-learning and meta-gradients
  • Curriculum learning

Authors’ own review

I would like to note that there is an excellent review of this paper by its authors, which is available here. The goal of this article is neither to rehash that review nor to compete with it, but to draw more attention to this great paper and to explain some of its ideas in simpler terms, from the point of view of someone not intimately familiar with the topic.


The process of training a neural network involves:

1) Finding enough training data with assigned ground truth labels. For many real-life tasks this requires manual labeling, which is slow and potentially expensive.
2) Choosing the right architecture (number, type, and size of layers) for the task, which, especially when performed automatically, is referred to as neural architecture search (NAS). As there are many possible architectures and hyperparameters to choose from, this step may be quite time-consuming.

Developing methods to automate and speed up these two steps is an important task in modern deep learning.


The authors develop a framework called a Generative Teaching Network (GTN), in which new (synthetic) training data is generated in such a way that training and network architecture selection become more efficient. Unlike in generative adversarial networks (GANs), the generator and the learner cooperate rather than compete, working together to achieve better training. The general setup of the system is shown in the figure.

One of the main ideas this paper relies upon is that while using as much quality training data as possible is generally beneficial for training the final version of the neural network, the performance of candidate network architectures can often be adequately evaluated on an intelligently chosen small subset of the available data. Taken one step further, instead of real data, synthetic data that is properly generated can be equally or even more efficient for this task.

To train the generator of the synthetic data, the authors use a technique developed by Maclaurin et al. in 2015, for which code is publicly available. Its essence is to run backpropagation, which is normally used to train the neural network parameters, through the entire training process. This makes it possible to apply gradient descent to various hyperparameters, including the training data itself or the parameters of the network used to generate it.
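To make the idea of a meta-gradient more concrete, here is a minimal sketch in plain Python. It is not the authors' code, and every name in it (`train_learner`, `meta_gradient`, `y_syn`, and so on) is hypothetical. The learner is a one-parameter model `y = w * x` trained for a few SGD steps on a single synthetic example, and the synthetic label `y_syn` is optimized directly, without a generator network. For brevity, the derivative of the trained weight with respect to `y_syn` is carried forward alongside training instead of being backpropagated through the unrolled run; in this scalar case both approaches yield the same meta-gradient.

```python
def train_learner(y_syn, x_syn=1.0, lr=0.1, steps=3):
    """Inner loop: a few SGD steps on one synthetic example (x_syn, y_syn).

    Returns the trained weight w and its derivative dw/dy_syn, which is
    tracked through every update (differentiating through training).
    """
    w, dw_dy = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the inner loss (w * x_syn - y_syn)^2 with respect to w.
        grad_w = 2.0 * x_syn * (w * x_syn - y_syn)
        # Derivative of that gradient with respect to y_syn (chain rule).
        dgrad_dy = 2.0 * x_syn * (x_syn * dw_dy - 1.0)
        w -= lr * grad_w
        dw_dy -= lr * dgrad_dy
    return w, dw_dy


def meta_gradient(y_syn, real_x, real_y):
    """Outer loss on real data after inner training, and its gradient w.r.t. y_syn."""
    w, dw_dy = train_learner(y_syn)
    n = len(real_x)
    meta_loss = sum((w * x - y) ** 2 for x, y in zip(real_x, real_y)) / n
    dloss_dw = sum(2.0 * (w * x - y) * x for x, y in zip(real_x, real_y)) / n
    return meta_loss, dloss_dw * dw_dy


# Toy "real" data following y = 3x (an assumption made for this sketch).
real_x = [1.0, 2.0, 3.0]
real_y = [3.0 * x for x in real_x]

# Outer loop: gradient descent on the synthetic label itself.
y_syn = 0.0
for _ in range(200):
    _, g = meta_gradient(y_syn, real_x, real_y)
    y_syn -= 0.1 * g

final_w, _ = train_learner(y_syn)
print(f"learned synthetic label: {y_syn:.4f}, trained weight: {final_w:.4f}")
```

In this toy setup the learned synthetic label ends up at roughly 6.15 rather than 3: because the learner only gets three small steps, the most useful training target overshoots the real one. The trained weight nevertheless lands very close to the true slope of 3.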

Since optimization through meta-gradients is often unstable, the authors apply weight normalization to the networks, which they report significantly improves stability.
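For readers who have not seen weight normalization before, the sketch below shows the reparameterization as it is commonly defined (following Salimans and Kingma, 2016): a weight vector is expressed as a direction `v` rescaled to a learned length `g`. Whether the paper uses exactly this form is an assumption on my part, and the function name `weight_norm` is hypothetical.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||: direction taken from v, length set by g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


w = weight_norm([3.0, 4.0], 2.0)

# Rescaling v does not change w, so the length of w is controlled by g alone.
w_scaled = weight_norm([30.0, 40.0], 2.0)
print(w, w_scaled)
```

The point of the design is that the magnitude of the weights is decoupled from their direction, which keeps the scale of the weights under explicit control during optimization.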

As shown in the paper, GTNs can be used for developing a curriculum, that is, a set of training examples combined with the order in which they are presented to the network to maximize training efficiency.

The authors apply their approach to two well-known real-world datasets, MNIST and CIFAR-10. In both cases, GTNs allow for much faster initial training than other existing methods, which results in a significantly more efficient architecture search. The authors note that once the best neural architecture is chosen, the final network still needs to be trained on the actual full dataset to achieve state-of-the-art results.

It is interesting that the synthetic data generated by GTNs does not particularly resemble the real data, even though the neural networks can use this data to be trained to recognize real images. The authors propose several possible explanations of this phenomenon in the appendix, although they do not know which one of these explanations is correct.

Original paper link
