Generative Language Modeling for Automated Theorem Proving

Review of paper by Stanislas Polu and Ilya Sutskever, Open AI, 2020

The authors use GPT-3-like language models to develop GPT-f, a tool that automatically generates proofs of mathematical theorems.

What can we learn from this paper?

That Transformer models, when applied to the Metamath formal language, can be used for mathematical reasoning.

Prerequisites (to better understand the paper, what should one be familiar with?)

  • Would be helpful to be familiar with formal symbolic languages and logic.
  • Basic understanding of deep learning and Transformer/GPT models.


Deep neural networks have recently been applied with success to a variety of cognitive tasks, including computer vision, speech recognition, language generation, etc. However, reasoning challenges have not been actively addressed (except possibly AlphaGo and AlphaZero), largely due to the lack of tasks with sufficient amounts of labeled training data. Automated generation of such data (which made AlphaZero and AlphaGo successful, as in their cases new labeled data is generated by the program playing against itself) could be the key to advancement in neural reasoning.

In this paper, the authors apply deep networks to automated theorem proving within the framework of Metamath formal language, in which it is possible to quickly verify the correctness of generated proofs, thus generating more labeled data for subsequent training. As authors note, Metamath is a very low-level and verbose language, which is the reason why it is better suited for computer analysis than human use.

The theorems are proven by starting from the desired statement and making substitutions until a statement that is known to be true is encountered. The current training dataset contains about 38,000 proven theorems with approximately 3 million proof steps. The training objective was to predict the next step of the proof given the current statement (goal).

At each step of the proof, a list of current unexplored goals was maintained, and the one with the highest probability (lowest cumulative absolute logarithmic probability value) was explored. This made the search generally breadth-first since deeper branches of the search tree would typically have higher absolute log probability values due to more summation steps. The maximum depth was set at 128.

The verifier, which checks that all of the steps in a given proof complied with the formal rules of the Metamath language and which, together with the proof search procedure, constitutes the GPT-f prover, was implemented by the authors in Python in a few hundred lines of code.

When the model was trained exclusively on the available Metamath data, the version with 700 million parameters was able to prove 31.58% of theorems in the test dataset, which was already a significant improvement over 21.16% of MetaGen-IL, the previous state-of-the-art result. This was further improved by pre-training the model on external data such as the CommonCrawl language dataset, resulting in 39.61% performance, or a combination of more reasoning-focused GitHub, Arxiv Math, and Math StackExchange data, which gave 42.56%. Additional retraining on the data generated by the prover itself, combined with increasing the number of attempts per proposition, led to 56.22% best performance.

Overall, this paper seems very interesting in terms of advancing reasoning capabilities of deep neural networks, not to mention that, although the current given examples are trivial for a human mathematician, this approach could potentially be used to prove new mathematical theorems as well.

Original paper link

Suggested further reading

Leave a Reply