Point Transformer

Review of the paper by Hengshuang Zhao (University of Oxford), Li Jiang and Jiaya Jia (The Chinese University of Hong Kong), et al., 2020

The authors develop a neural attention layer for working with 3D point data and implement it in a Point Transformer network that shows new state-of-the-art results (in some cases by a significant margin) on a number of standard 3D benchmarks.

What can we learn from this paper?

That self-attention is a natural fit for 3D data. A point cloud is essentially a sparse, unordered set of points, so a model operating on it should be invariant to permutations of its input, and the data lacks the dense, regular local structure that makes convolutional approaches so effective for 2D images.

Prerequisites (to better understand the paper, what should one be familiar with?)

  • Neural attention
  • 3D point clouds


It is well known that in the case of dense 2-dimensional data (such as images), modern state-of-the-art approaches for standard ML tasks (object detection, instance segmentation, semantic segmentation, etc.) are mostly based on convolutional networks that take advantage of the local structure of the data. Although there have been some recent efforts to apply attention models to 2D data, such models have not been widely adopted yet.

Three-dimensional data, however, is quite different, as it normally comes in the form of a point cloud, which is simply a set of points with specified x, y, z coordinates, usually obtained by scanning the visible surfaces of objects with a 3D scanner. Any point cloud can be voxelized, that is, mapped onto a regular 3D grid, but the result is a very sparse structure in which most of the voxels (those not lying on a visible surface) are empty. Convolutional approaches can still be applied, but they are not nearly as efficient for sparse data.
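To get a feel for how sparse a voxelized surface scan is, here is a minimal NumPy sketch. The "scan" is a hypothetical one (random points on a unit sphere), and the function name and the 64x64x64 grid are illustrative choices, not anything taken from the paper.

```python
import numpy as np

def voxel_occupancy(points, grid_size):
    """Fraction of cells in a grid_size^3 grid that contain at least one point."""
    mins = points.min(axis=0)
    maxs = points.max(axis=0)
    # Scale coordinates into [0, grid_size] and clamp the upper boundary
    # so that the maximum coordinate falls into the last cell.
    scaled = (points - mins) / (maxs - mins) * grid_size
    idx = np.minimum(scaled.astype(int), grid_size - 1)
    occupied = np.unique(idx, axis=0)  # distinct occupied voxels
    return len(occupied) / grid_size**3

rng = np.random.default_rng(0)
# Hypothetical scan: 20,000 points on the surface of a unit sphere.
directions = rng.normal(size=(20000, 3))
surface_points = directions / np.linalg.norm(directions, axis=1, keepdims=True)

occupancy = voxel_occupancy(surface_points, grid_size=64)
print(f"occupied voxels: {occupancy:.2%}")
```

Because the points lie on a 2D surface, the number of occupied voxels grows roughly with the square of the grid resolution while the grid itself grows with the cube, so the occupied fraction shrinks as the grid gets finer.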

Another popular approach to 3D point cloud tasks is PointNet (2017), as well as other, more recent architectures that rely on the same idea. PointNet is a deep MLP-based network that only uses operations that are symmetric in their inputs, that is, invariant to permutations within the set of 3D points. These operations include fully-connected layers with weights shared between all points, as well as global max pooling. The main disadvantage of PointNet is that it does not directly take advantage of the local structure of the data, which makes learning more difficult.
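The permutation invariance described above is easy to check numerically. The following is a simplified sketch of the idea only (a two-layer shared MLP with random weights followed by global max pooling), not the actual PointNet architecture; the layer sizes and function names are arbitrary.

```python
import numpy as np

def shared_mlp(points, w1, b1, w2, b2):
    """Apply the same two-layer MLP to every point independently."""
    hidden = np.maximum(points @ w1 + b1, 0.0)  # ReLU
    return np.maximum(hidden @ w2 + b2, 0.0)

def global_feature(points, params):
    """Per-point features followed by a symmetric reduction (max over points)."""
    return shared_mlp(points, *params).max(axis=0)

rng = np.random.default_rng(0)
params = (
    rng.normal(size=(3, 16)), np.zeros(16),
    rng.normal(size=(16, 32)), np.zeros(32),
)

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(len(cloud))]  # same points, different order

feat_a = global_feature(cloud, params)
feat_b = global_feature(shuffled, params)
same = np.allclose(feat_a, feat_b)
print("order-independent:", same)
```

The shared weights make the per-point computation independent of position in the array, and the max over points discards the ordering entirely. The same max also discards all information about which points were near each other, which is the weakness noted above.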

Recently, a number of methods have been proposed that try to make better use of the local structure. They usually sample a subset of representative points, use k-nearest neighbors to form a group around each sampled point, and then apply a PointNet-like network to each group. The resulting local representations are later combined into a global feature vector by subsequent layers.
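The sample-and-group step can be sketched in a few lines of NumPy. Farthest point sampling is assumed here as the sampling strategy; it is one common choice, but the text above does not say which sampler a given method uses, and the function names are illustrative.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy sampling: each new centroid is the point farthest from those already chosen."""
    chosen = [0]  # arbitrary starting point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        # Track the distance from every point to its nearest chosen centroid.
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def knn_groups(points, centroid_idx, k):
    """Indices of the k nearest points to each centroid (brute force)."""
    diffs = points[centroid_idx][:, None, :] - points[None, :, :]
    sq_dist = (diffs**2).sum(axis=-1)           # (n_centroids, n_points)
    return np.argsort(sq_dist, axis=1)[:, :k]   # (n_centroids, k)

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(256, 3))

centroids = farthest_point_sampling(cloud, n_samples=32)
groups = knn_groups(cloud, centroids, k=8)
print(groups.shape)  # one group of 8 neighbor indices per centroid
```

Each row of `groups` could then be fed to a small PointNet-like network to produce one local feature per centroid. The brute-force distance matrix is fine for a toy example but would normally be replaced by a spatial index for large clouds.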

The present paper takes a different approach, building on the neural attention mechanism that is already immensely popular in natural language processing and is quickly spreading to other areas of machine learning. The authors use vector attention, which they introduced in earlier work and which generalizes the usual scalar attention: instead of a single weight per pair of elements, it produces a separate weight for each feature channel.
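For reference, the difference between the two forms can be written out as follows. The notation is my reading of the paper and should be checked against it: φ, ψ and α are pointwise feature transformations (such as linear layers), δ is a position encoding, ρ is a normalization such as softmax, β is a relation function (for example, subtraction), and γ is a mapping function (such as an MLP) that outputs an attention vector.

```latex
% Scalar (dot-product) attention: one scalar weight per pair (i, j)
y_i = \sum_{x_j \in \mathcal{X}} \rho\left( \varphi(x_i)^{\top} \psi(x_j) + \delta \right) \alpha(x_j)

% Vector attention: one weight per pair (i, j) and per feature channel
y_i = \sum_{x_j \in \mathcal{X}} \rho\left( \gamma\left( \beta(\varphi(x_i), \psi(x_j)) + \delta \right) \right) \odot \alpha(x_j)
```

In the scalar form the whole value vector α(x_j) is scaled by a single number, while in the vector form the elementwise product ⊙ lets each channel be modulated separately.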

In the resulting Point Transformer layer, the attention is local, as it is calculated over the k nearest neighbors of each point. The mapping function that is used to compute vector attention is an MLP with two linear layers and a ReLU nonlinearity. Trainable positional encodings are computed by a second MLP of the same type (two linear layers and a ReLU) applied to the differences between the position vectors of a point and its neighbors. This encoding is added both to the input of the attention mapping and to the value vectors; the normalized attention weights and the values are then multiplied elementwise and summed over the neighborhood to generate the output.
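Below is a minimal NumPy sketch of such a layer, forward pass only, with random untrained weights. It is not the authors' implementation: the relation between query and key is assumed to be a subtraction and the normalization a softmax over the neighbors (my reading of the paper), and all names, shapes and initializations are hypothetical.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_mlp(rng, d_in, d_hidden, d_out):
    return (rng.normal(scale=0.1, size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(scale=0.1, size=(d_hidden, d_out)), np.zeros(d_out))

def point_transformer_layer(pos, feats, k, p):
    # k nearest neighbors of every point (brute force; a point is its own nearest neighbor)
    sq_dist = ((pos[:, None, :] - pos[None, :, :])**2).sum(axis=-1)
    nbr = np.argsort(sq_dist, axis=1)[:, :k]               # (n, k)

    query = feats @ p["w_q"]                               # (n, d)
    key = (feats @ p["w_k"])[nbr]                          # (n, k, d)
    value = (feats @ p["w_v"])[nbr]                        # (n, k, d)

    # Positional encoding from relative positions p_i - p_j
    delta = mlp(pos[:, None, :] - pos[nbr], *p["theta"])   # (n, k, d)

    # Vector attention: one weight per neighbor AND per channel,
    # normalized over the k neighbors (assumed softmax)
    attn = softmax(mlp(query[:, None, :] - key + delta, *p["gamma"]), axis=1)

    # Elementwise product with (value + encoding), summed over the neighborhood
    return (attn * (value + delta)).sum(axis=1)            # (n, d)

rng = np.random.default_rng(0)
n, d, k = 64, 8, 16
params = {
    "w_q": rng.normal(scale=0.1, size=(d, d)),
    "w_k": rng.normal(scale=0.1, size=(d, d)),
    "w_v": rng.normal(scale=0.1, size=(d, d)),
    "theta": init_mlp(rng, 3, d, d),   # positional-encoding MLP
    "gamma": init_mlp(rng, d, d, d),   # attention mapping MLP
}
pos = rng.uniform(size=(n, 3))
feats = rng.normal(size=(n, d))

out = point_transformer_layer(pos, feats, k, params)
print(out.shape)
```

Since the layer only looks at relative positions and sums over each neighborhood, shuffling the input points simply shuffles the output rows in the same way, which is the property one wants for unordered point sets.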

The Point Transformer network for segmentation tasks follows the U-Net structure, consisting of five downsampling point attention layers followed by five similar upsampling layers and a final MLP layer, with skip connections between the down and up layers of the same output size. For classification tasks, only the downsampling part is used, followed by a global average pooling layer and an MLP layer at the end.
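To make the U-Net shape concrete, the sketch below tabulates how many points and feature channels each stage would carry. The downsampling rates (1, 4, 4, 4, 4) and the channel widths starting at 32 and doubling per stage are what I recall from the paper's configuration; they are not stated in this review, so treat them as assumptions. The function names are hypothetical.

```python
def encoder_stages(n_points, rates=(1, 4, 4, 4, 4), base_channels=32):
    """Return (points, channels) for each downsampling stage."""
    stages = []
    channels = base_channels
    for rate in rates:
        n_points = n_points // rate       # fewer points after each stage
        stages.append((n_points, channels))
        channels *= 2                     # wider features at coarser resolution
    return stages

def decoder_stages(enc):
    """Mirror of the encoder: each upsampling stage matches the resolution
    of the encoder stage it receives a skip connection from."""
    return enc[::-1]

enc = encoder_stages(4096)
dec = decoder_stages(enc)

for i, (n, c) in enumerate(enc, start=1):
    print(f"down {i}: {n:5d} points x {c:3d} channels")
for i, (n, c) in enumerate(dec, start=1):
    print(f"up   {i}: {n:5d} points x {c:3d} channels")
```

For classification, only the encoder half would be used, with the features of the coarsest stage averaged over its points before the final MLP.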

The resulting network was applied to several popular 3D data benchmarks: the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset for semantic segmentation, the ModelNet40 dataset for 3D shape classification, and the ShapeNetPart dataset for object part segmentation. The specific training parameters are presented in the paper.

On the S3DIS dataset of 271 annotated indoor room scans, the Point Transformer set a new state-of-the-art mean IoU of 73.5% for semantic segmentation (based on 6-fold cross-validation), well above the previous best of 70.6%. On ModelNet40, a dataset containing 12,311 CAD models from 40 object categories, the Point Transformer also achieved a new state of the art with 93.7% accuracy, 0.1 percentage points above the next best. On the ShapeNetPart dataset, Point Transformer’s performance was not the absolute best (86.6% instance average IoU vs 88.8% of Multi-scale U-Net with Delaunay triangulation, and 83.7% class average IoU vs 85.1% of KPConv deformable convolutions), but close to the state of the art for both metrics.
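For readers unfamiliar with the metric, here is a minimal sketch of class-wise mean IoU (intersection over union) computed from per-point labels. It illustrates the general definition on a toy example; the benchmarks' exact evaluation protocols (for instance, how instance-average and class-average IoU are aggregated on ShapeNetPart) are not reproduced here.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean over classes of |pred AND target| / |pred OR target|."""
    conf = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(conf, (target, pred), 1)      # rows: ground truth, columns: prediction
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0                        # skip classes absent from both
    return (intersection[valid] / union[valid]).mean()

# Toy example: six points, three classes
target = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([0, 1, 1, 1, 2, 0])

score = mean_iou(pred, target, n_classes=3)
print(f"mean IoU: {score:.3f}")  # per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.500
```

Unlike plain accuracy, IoU penalizes both missed points and false positives for each class, and averaging over classes keeps rare classes from being drowned out by frequent ones.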

The authors tune the hyperparameters of the Point Transformer network on the S3DIS segmentation task, evaluating on the withheld Area 5 (one of the standard ways of measuring performance on this dataset). The best performance was achieved with k=16 nearest neighbors in the attention layer, and relative positional encoding and vector attention (both as described above) outperformed the alternatives that were tested.

Overall, this definitely looks like an important paper that, based on the authors’ concept of vector attention, shows a way to use modern attention networks on 3D point clouds with state-of-the-art results on multiple standard benchmarks.

Original paper link

GitHub repository with PyTorch code for the attention layer (not by the authors; the attention currently seems to be global rather than limited to the k nearest neighbors as described in the paper)

