Abstract

Understanding point clouds remains challenging for classification and segmentation tasks due to their irregular and sparse structure. As is well known, the PointNet architecture, a ground-breaking work for point cloud processing, learns shape features directly on unordered 3D point clouds and achieves favorable performance, e.g., 86% mean accuracy and 89.2% overall accuracy on the classification task. However, this model fails to consider the fine-grained semantic information of the local structure of a point cloud. In this paper, we propose a multiscale receptive fields graph attention network (named MRFGAT) that exploits the semantic features of local patches of a point cloud; the feature map learned by our network captures rich feature information of the point cloud. The proposed MRFGAT architecture is tested on the ModelNet datasets, and the results show that it achieves state-of-the-art performance on shape classification tasks: it outperforms the GAPNet (Chen et al.) model by 0.1% in terms of OA and competes with the DGCNN (Wang et al.) model in terms of MA.

1. Introduction

The point cloud, as a simple and efficient representation of 3D shapes and scenes, has become more and more popular in both academia and industry, for example, in autonomous driving [1–4], robotic mapping and navigation [5–7], 3D shape representation and modelling [8, 9], and other relevant applications [10–15]. 3D point cloud data can be obtained in many ways, for instance, with 3D scanners based on physical touch or on noncontact measurements with light, sound, LiDAR, etc.

Up to now, a variety of approaches have been developed to handle this kind of data, such as the commonly used traditional handcrafted algorithms [16–18]. In these methods, point clouds are classified or segmented by choosing salient features of the point cloud, such as normals, curvatures, and colors. Handcrafted features are usually designed for specific problems and are hard to transfer to new tasks. How to overcome these shortcomings of traditional methods has therefore been a hot topic in the last decades.

With the development of deep learning, some end-to-end neural networks have overcome many challenges stemming from 3D data and made great breakthroughs for point clouds, see Figure 1. In particular, modified convolutional neural networks (CNNs) have achieved significant success on point cloud data in computer vision tasks, such as PointNet [19] and its improved version [20], PointCNN [21, 22], and PointSift [23]. Unfortunately, many neural networks for point clouds only capture global features and ignore local information, which is also an important semantic feature of a point cloud. Hence, reasonably exploiting the local information of point clouds has become a new research hotspot, and some valuable works have sprung up recently. PointNet++ [20] extends the PointNet model by constructing a hierarchical neural network that recursively applies PointNet with designed sampling and grouping layers to extract local features. Graph neural networks [24, 25] can not only directly address a more general class of graphs, e.g., cyclic, directed, and undirected graphs, but can also be applied to point cloud data. Recently, DGCNN [26] and its variant [27] exploited a graph network with edge convolutions on points and thereby obtained local edge information of the point cloud. Other relevant works applying the graph structure of point clouds can be found in [28–30].

Attention mechanisms play a significant role in machine translation [31], vision-based tasks [32], and graph-based tasks [33]. Combining graph structures and attention mechanisms, several favorable network architectures have been constructed that leverage the local semantic features of point clouds well; readers can refer to [34–36].

However, the scale of the graphs in existing graph networks is fixed, which limits the semantic expressiveness of each point. Hence, in this work, inspired by the graph attention network [33], the graph convolution network [37], and local contextual information networks, we design a multiscale receptive fields graph attention network for point cloud classification. Unlike previous models that only consider the attribute information of each single point, such as its coordinates, or only exploit local semantic information, we pay attention to the spatial context information of both the local and the global structure of the point cloud. Finally, like standard convolution in the grid domain, our model can be efficiently implemented on the graph representation of a point cloud.

The key contributions of our work are summarized as follows:
(i) We construct a graph of local patches for the point cloud and then enhance the feature representation of each point by combining edge information and neighbor information.
(ii) We introduce a multiscale receptive fields mechanism to capture local semantic features at various ranges for the point cloud.
(iii) We balance the influence between neighbors and the centroid in the local graph by means of an attention mechanism.
(iv) We release our code to facilitate reproducibility and future research (https://github.com/Blue-Giant/MRFGAT–NET).

The rest of this paper is structured as follows. In Section 2, we review the literature most closely related to point cloud processing. In Section 3, we introduce the proposed MRFGAT architecture and provide the details of our framework for shape classification of point clouds. We describe the dataset and the comparison algorithms in Section 4, followed by the experimental results and discussion. Finally, some concluding remarks are made in Section 5.

2. Related Work

2.1. Pointwise MLP and Point Convolution Networks

Utilizing deep learning techniques, the classical PointNet [19] was proposed to directly process unordered point clouds without any volumetric or grid-mesh representation. The main idea of this network is as follows. First, a Spatial Transformer Network (STN) module, similar to a feature-extracting process, is constructed to guarantee invariance to transformations. Then, a shared pointwise Multilayer Perceptron (MLP) module is introduced to extract semantic features from the point set. Finally, the semantic information of the point cloud is aggregated by a max pooling layer. Owing to the favorable ability of MLPs to approximate any continuous function and their easy implementation via point convolutions, several related works were presented based on the PointNet architecture [38, 39].
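To make the aggregation step concrete, the following minimal NumPy sketch illustrates the shared pointwise MLP followed by symmetric max pooling; it is an illustration of the idea only, not PointNet's actual implementation, and the layer sizes are arbitrary.

```python
import numpy as np

def shared_mlp_max_pool(points, weights, biases):
    """Shared pointwise MLP followed by max pooling (PointNet-style sketch).

    points:  (N, 3) array of xyz coordinates.
    weights: list of (d_in, d_out) matrices shared across all points.
    biases:  list of (d_out,) vectors matching `weights`.
    Returns a single global feature vector for the whole cloud.
    """
    features = points
    for W, b in zip(weights, biases):
        # The same weights are applied to every point (pointwise convolution).
        features = np.maximum(features @ W + b, 0.0)  # ReLU
    # Max pooling over the point dimension is a symmetric function,
    # so the global descriptor is invariant to point ordering.
    return features.max(axis=0)

# Toy usage: 1024 points, MLP 3 -> 64 -> 1024, then max pool.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))
Ws = [rng.normal(scale=0.1, size=(3, 64)), rng.normal(scale=0.1, size=(64, 1024))]
bs = [np.zeros(64), np.zeros(1024)]
global_feature = shared_mlp_max_pool(pts, Ws, bs)  # shape (1024,)
```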

Similar to convolution operators in 2D space, convolution kernels for points in 3D space have been designed to capture the abundant information of point clouds. PointCNN [21] uses a learned X-transformation to achieve permutation invariance of the points and then generalizes this technique to a hierarchical form in analogy to image CNNs. The authors in [40–42] extended the 2D convolution operator so that it is applied to individual points in a local region of the point cloud and then aggregated the neighbors' information into the center point in hierarchical convolution layers. Kernel Point Convolution (KPConv) [43] consists of a set of local 3D filters and overcomes the limitations of standard point convolutions; this novel kernel structure is very flexible for learning local geometric patterns.

2.2. Learning Local Features

In order to overcome the shortcoming of PointNet-like networks, which fail to exploit local features, hierarchical architectures have been developed, for example, PointNet++ [20] and So-Net [38], which aggregate local information with MLP operations by considering the local spatial relationships of 3D data. In contrast to the previous category, these methods can avoid sparsity and update dynamically in different feature dimensions. Building on Capsule Networks, 3D capsule convolutional networks were also developed and can learn the local features of point clouds well, see [44–46].

2.3. Graph Convolutional Networks

Graph Convolutional Neural Networks (GCNNs) have gained more and more attention for addressing irregularly structured data, such as citation networks and social networks. For 3D point cloud data, GCNNs have shown powerful abilities on classification and segmentation tasks. Using convolution operators defined on the graph in the spectral domain is one important approach [47–49], but it requires computing a large number of parameters for polynomial or rational spectral filters [50]. Recently, many researchers have constructed local graphs of point clouds by utilizing each point's neighbors in a low-dimensional manifold, found via Euclidean distance, and then grouped each point's neighbors into high-dimensional feature vectors, such as EdgeConv-like works [26, 27, 51] and graph convolutions [37, 52]. Compared with spectral methods, the main merit of these approaches is that they are more consistent with the characteristics of the data distribution. Specifically, EdgeConv extracts edge features through the relationship between the central point and its neighbor points by successively constructing graphs in a hierarchical model. To sum up, graph convolution networks combine features on local surface patches that are invariant to deformations of the patches in Euclidean space.
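As an illustration of the edge features mentioned above, the sketch below builds EdgeConv-style edge features from a point array and precomputed neighbor indices; the concrete form (centroid feature concatenated with the offset to each neighbor) follows the common EdgeConv formulation and is not taken verbatim from [26].

```python
import numpy as np

def edge_features(points, neighbor_idx):
    """EdgeConv-style edge features for a local graph (sketch).

    points:       (N, F) point features.
    neighbor_idx: (N, k) indices of the k nearest neighbors of each point.
    Returns (N, k, 2F): each edge carries the centroid feature together with
    the relative offset from the centroid to the neighbor.
    """
    k = neighbor_idx.shape[1]
    neighbors = points[neighbor_idx]                    # (N, k, F)
    centers = np.repeat(points[:, None, :], k, axis=1)  # (N, k, F)
    return np.concatenate([centers, neighbors - centers], axis=-1)
```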

2.4. Attention Mechanism

The idea of attention has been successfully used in natural language processing (NLP) [31] and graph-based work [33, 53]. An attention module can balance the weights of different nodes in graph-structured data or of different parts in sequence data.

Recently, the attention idea has gained more and more attraction and made great contributions to point cloud processing [34, 35]. In these works, point or edge features are aggregated by means of attention modules. Unlike the existing methods, we try to enhance the high-level representation of the point cloud by capturing the relations of points and local information along the channels.

3. Our Approach

The point cloud classification framework involves two aspects: taking a 3D point cloud as input and assigning a semantic class label to the cloud (or to each point, for segmentation). Based on extracting features from local directed graphs and on attention mechanisms, a new architecture for the shape classification task is proposed to better learn point representations for unstructured point clouds. This architecture is composed of three components: point enhancement, feature representation, and prediction. These three components are fully coupled together, which leads to an end-to-end training pipeline.

3.1. Problem Statement

At first, we let $P = \{p_1, p_2, \dots, p_N\} \subset \mathbb{R}^F$ represent a raw set of unordered points as the input of our model, where $N$ is the number of points and $p_i$ is a feature vector with dimension $F$. In actual applications, the feature vector might contain 3D space coordinates $(x, y, z)$, color, intensity, surface normals, etc. For the sake of simplicity, we set $F = 3$ in our work and only take the 3D coordinates of each point as its feature representation. A classification or a semantic segmentation of a point cloud is a map $\Phi_c$ or $\Phi_s$, respectively, which assigns a semantic label to the whole point cloud or to each individual point, respectively, i.e.,

$$\Phi_c: P \mapsto l, \qquad \Phi_s: P \mapsto \{l_1, l_2, \dots, l_N\}.$$

Here, $\Phi$ denotes the map $\Phi_c$ or $\Phi_s$. The objective of our model is to find the optimal map that yields accurate semantic labels.

The above map should satisfy some constraints, including the following. (1) Permutation invariance: the order of the points may vary but does not influence the category of the point or of the point cloud. (2) Transformation invariance: for arbitrary translations and rotations of the point cloud, the classification or segmentation results of the points or the point cloud should not change.

3.2. Graph Generation for Point Cloud

Some works indicate that local features of a point cloud can be used to improve the discriminability of points; hence, exploring the relationships among points in the whole set or in a local patch is a key point of our work. A graph neural network is a feasible approach to process point clouds because it propagates over each node of the whole set or of a local patch individually, ignores the permutation order of the nodes, and then extracts the local information between nodes. To apply a graph neural network to the point cloud, we first convert the cloud into a directed graph. Like DGCNN [26, 27] and GAPNet [34], we obtain the neighbors (including the point itself) of each point in the point cloud by means of the k-NN algorithm and then construct a local directed graph $G = (V, E)$ in Euclidean space. Figure 2 depicts the directed graph of a local patch of the point cloud, where $V$ is the vertex set of $G$, namely the nodes of the local patch, $E$ stands for the edge set of $G$, and each edge connects a centroid $p_i$ with one of its neighbors $p_{ij}$.
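The neighborhood search itself can be written in a few lines; the sketch below is a brute-force NumPy version of the k-NN construction described above (the released TensorFlow code presumably does the same on the GPU), assuming distinct points so that each point appears as its own first neighbor.

```python
import numpy as np

def knn_graph(points, k):
    """Directed local graph: connect each point to its k nearest neighbors
    (including itself) in Euclidean space.

    points: (N, 3) array of xyz coordinates.
    Returns (N, k) neighbor indices; column 0 is the point itself
    because its self-distance is zero.
    """
    # Pairwise squared Euclidean distances, shape (N, N).
    sq_norms = (points ** 2).sum(axis=1)
    dists = sq_norms[:, None] - 2.0 * points @ points.T + sq_norms[None, :]
    # Indices of the k smallest distances per row.
    return np.argsort(dists, axis=1)[:, :k]
```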

3.3. Single Receptive Field Graph Attention Layer (SRFGAT)

In order to aggregate the information of neighbors, a neighbor-attention mechanism is introduced to obtain attention coefficients of the neighbors of each point, see Figure 3. Additionally, edge features are important local features that can enhance the semantic expression of a point, so an edge-attention mechanism is also introduced to aggregate the information of different edges, see Figure 3. In light of the attention mechanism [33, 34], we first transform the neighbors and edges into a high-level feature space to obtain sufficient expressive power. To this end, as an initial step, a parametric nonlinear function $h(\cdot, \theta)$ is applied to every neighbor and edge,

$$\tilde{x}_{ij} = h(p_{ij}, \theta), \qquad \tilde{y}_{ij} = h(e_{ij}, \theta),$$

respectively, where $\theta$ is the set of learnable parameters of the filter and $F'$ is the output dimension. In our method, the function $h$ is set to a single-layer neural network.

It is worth noting that edges in Euclidean space not only represent local features but also indicate the dependency between the centroid and its neighbors. We then obtain attentional coefficients of edges and neighbors,

$$a_{ij} = \mathrm{LeakyReLU}\big(g(\tilde{y}_{ij}, \phi)\big), \qquad b_{ij} = \mathrm{LeakyReLU}\big(g(\tilde{x}_{ij}, \psi)\big),$$

respectively, where $g(\cdot, \phi)$ and $g(\cdot, \psi)$ are single-layer neural networks with 1-dimensional output and $\mathrm{LeakyReLU}$ denotes the leaky nonlinear activation function. To make the coefficients easily comparable across different neighbors and edges, we use a softmax operation to normalize them,

$$\alpha_{ij} = \frac{\exp(a_{ij})}{\sum_{k} \exp(a_{ik})}, \qquad \beta_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})},$$

respectively; then, the normalized coefficients are used to compute a contextual feature for every point,

$$\hat{x}_i = f\Big(\sum_{j} \alpha_{ij}\,\tilde{y}_{ij}\Big) \,\big\|\, f\Big(\sum_{j} \beta_{ij}\,\tilde{x}_{ij}\Big),$$

where $f$ is a nonlinear activation function and $\|$ is the concatenation operation. In our model, we choose $f$ to be the ReLU function.
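Putting the three steps together (feature transform, attention scoring, softmax-weighted aggregation), a single receptive field layer can be sketched as below. This is a plausible NumPy reading of the equations above rather than the authors' implementation; the 0.2 negative slope of the leaky activation and the exact aggregation order are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def srfgat_layer(neighbor_feats, edge_feats, W, a_edge, a_nbr):
    """One single-receptive-field graph attention step (sketch).

    neighbor_feats: (N, k, F) features of the k neighbors of each point.
    edge_feats:     (N, k, F) edge features (e.g., neighbor minus centroid).
    W:              (F, F')   shared filter h(., theta).
    a_edge, a_nbr:  (F',)     single-layer scoring vectors with 1-d output.
    Returns (N, 2*F'): edge- and neighbor-attended features, concatenated.
    """
    x = np.maximum(neighbor_feats @ W, 0.0)        # transformed neighbors
    y = np.maximum(edge_feats @ W, 0.0)            # transformed edges
    a_raw, b_raw = y @ a_edge, x @ a_nbr           # unnormalized scores, (N, k)
    a = np.where(a_raw > 0, a_raw, 0.2 * a_raw)    # leaky ReLU (slope assumed)
    b = np.where(b_raw > 0, b_raw, 0.2 * b_raw)
    alpha = softmax(a, axis=1)[..., None]          # normalize over the k neighbors
    beta = softmax(b, axis=1)[..., None]
    # Attention-weighted aggregation with f = ReLU, then concatenation.
    agg_edge = np.maximum((alpha * y).sum(axis=1), 0.0)
    agg_nbr = np.maximum((beta * x).sum(axis=1), 0.0)
    return np.concatenate([agg_edge, agg_nbr], axis=-1)
```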

3.4. Multiscale Receptive Fields Graph Attention Layer (MRFGAT)

In order to obtain sufficient feature information and stabilize the network, a multiscale receptive field strategy analogous to the multihead mechanism is proposed, see Figure 4. Unlike previous works, the sizes of the receptive fields in our model differ across branches. Therefore, we concatenate $M$ independent SRFGAT modules and generate a semantic feature whose channels gather all branches:

$$\hat{x}_i = \big\|_{m=1}^{M} \hat{x}_i^{(m)},$$

where $\hat{x}_i^{(m)}$ is the receptive-field feature of the $m$th branch, $M$ is the total number of branches, and $\|$ is the concatenation operation over feature channels.
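A hypothetical wiring of this multiscale step, reusing the `knn_graph`, `edge_features`, and `srfgat_layer` sketches above: each branch uses its own neighborhood size and its own parameters, and the branch outputs are concatenated along the channel axis. The parameter layout is an assumption made for illustration.

```python
import numpy as np

def mrfgat_features(points, ks, branch_params):
    """Multiscale receptive fields feature (sketch).

    points:        (N, F) input point features (F = 3 here).
    ks:            neighborhood size per branch, e.g. (10, 20, 30), illustrative values.
    branch_params: one (W, a_edge, a_nbr) tuple per branch.
    Returns (N, sum of branch output channels).
    """
    F = points.shape[1]
    outputs = []
    for k, (W, a_edge, a_nbr) in zip(ks, branch_params):
        idx = knn_graph(points, k)                    # a different receptive field per branch
        neighbors = points[idx]                       # (N, k, F)
        edges = edge_features(points, idx)[..., F:]   # keep the neighbor-minus-centroid part
        outputs.append(srfgat_layer(neighbors, edges, W, a_edge, a_nbr))
    return np.concatenate(outputs, axis=-1)           # channel-wise concatenation over branches
```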

3.5. MRFGAT Architecture

Our MRFGAT model, shown in Figure 5, addresses the shape classification task for point clouds. The architecture is similar to PointNet [19]. However, there are three main differences between the architectures of MRFGAT and PointNet. First, following the analysis of the LinkDGCNN model, we remove the transformation network that is used in many architectures such as PointNet, DGCNN, and GAPNet. Second, instead of only processing individual points of the point cloud, we also exploit local features with SRFGAT layers before the stacked MLP layers. Third, an attention pooling layer is used to obtain local feature information, which is connected to the intermediate layer to form a global descriptor. In addition, we aggregate the original edge features of every SRFGAT channel individually and thereby obtain local features that enhance the semantic feature of MRFGAT.
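For orientation, the classification pipeline of Figure 5 can be summarized as the following shape walkthrough built on the sketches above. The parameter container and layer sizes are placeholders chosen for illustration (the attention pooling branch is omitted); the released code should be consulted for the exact architecture.

```python
import numpy as np

def mrfgat_classify(points, params):
    """End-to-end sketch of the MRFGAT classification pipeline (illustrative only).

    points: (N, 3) point cloud.
    params: dict with "ks", "branches", "mlps" (shared pointwise layers),
            and "fc" (fully connected head ending in the class scores).
    """
    feats = mrfgat_features(points, params["ks"], params["branches"])  # multiscale local features
    for W, b in params["mlps"]:                  # shared MLP layers, e.g. 128-64-64-64 then 1024
        feats = np.maximum(feats @ W + b, 0.0)
    logits = feats.max(axis=0)                   # max pooling -> global descriptor
    for W, b in params["fc"][:-1]:               # fully connected layers, e.g. 512 and 256
        logits = np.maximum(logits @ W + b, 0.0)
    W, b = params["fc"][-1]
    return logits @ W + b                        # scores over the 40 ModelNet40 categories
```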

4. Experiments

In this section, we evaluate our MRFGAT model on 3D point cloud analysis for the classification task. To demonstrate the effectiveness of our model, we compare its performance with recent state-of-the-art methods and perform an ablation study to investigate different design variations.

4.1. Classification

4.1.1. Dataset

We demonstrate the feasibility and effectiveness of our model on the ModelNet dataset, specifically the ModelNet40 benchmark [54], for shape classification. The ModelNet40 dataset contains 12,311 meshed CAD models classified into 40 man-made categories. In this work, we divide the ModelNet40 dataset into two parts: a training set of 9,843 models and a testing set of 2,468 models. We then normalize the models into the unit sphere and uniformly sample 1,024 points over each model surface. Besides, we further augment the training data by randomly rotating and scaling each point cloud and by jittering the location of every point with Gaussian noise with zero mean and 0.01 standard deviation.
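A minimal version of this augmentation pipeline is sketched below. The paper specifies the jitter noise (zero mean, 0.01 standard deviation) but not the rotation axis or scale range, so those are assumptions marked in the comments.

```python
import numpy as np

def augment(points, rng):
    """Random rotation, random scaling, and Gaussian jitter (sketch).

    points: (N, 3) point cloud normalized to the unit sphere.
    rng:    a numpy Generator, e.g. np.random.default_rng(0).
    """
    theta = rng.uniform(0.0, 2.0 * np.pi)           # rotation about the vertical axis (assumed)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.8, 1.25)                  # scale range is an assumption
    jitter = rng.normal(loc=0.0, scale=0.01, size=points.shape)  # zero mean, 0.01 std as stated
    return (points @ R.T) * scale + jitter
```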

4.1.2. Implementation Details

Following the analysis of the LinkDGCNN model [27], we omit the spatial transformation network that aligns the point cloud to a canonical space. The network employs four SRFGAT modules with (8, 16, 16, 24) channels to capture attention features. Then, four shared MLP layers with sizes (128, 64, 64, 64), respectively, are used to aggregate the feature information. Next, the output features are fed into an aggregation operation followed by an MLP layer with 1024 neurons. At the end of the network, a max pooling operation and two fully connected layers (512, 256) are used to obtain the final classification scores. Training is carried out using the Adam optimizer with minibatch training (batch size of 16) and an initial learning rate of 0.001. The ReLU activation function and Batch Normalization (BN) are used in both the SRFGAT modules and the MLP layers. Finally, the network was implemented in TensorFlow and executed on a server equipped with four NVIDIA RTX 2080 Ti GPUs.
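The training hyperparameters above can be wired up in TensorFlow roughly as follows; `model` stands in for a Keras implementation of MRFGAT, and the loss choice and epoch count are assumptions rather than details reported in the paper.

```python
import tensorflow as tf

def compile_and_train(model: tf.keras.Model, train_ds: tf.data.Dataset, epochs: int):
    """Adam optimizer, initial learning rate 0.001, minibatch size 16."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model.fit(train_ds.batch(16), epochs=epochs)
```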

4.1.3. Results

Figures 6–8 depict the training and testing processes. From the figures, we see that our model quickly attains a stage of high accuracy, which means our model is highly efficient. Table 1 lists the results of our method and several recent state-of-the-art works. The methods listed in Table 1 have one thing in common: the input is only a raw point cloud with 3D coordinates. Based on these results, we can conclude that our model performs better than other methods and obtains excellent performance on the ModelNet40 benchmark. Compared to other point-based methods, the performance of our model is only slightly weaker than that of DGCNN in terms of MA on ModelNet40. However, it outperforms the previous state-of-the-art model GAPNet by 0.1% accuracy in terms of OA. These results show that the strategy of employing local and global features in different receptive fields is efficient and helps to capture the prominent semantic features of a point cloud. Moreover, since our model exploits the structure of the data by providing local interconnections between points and explores graph features at different scale levels through the localized graph convolutional layers, it guarantees the exploration of more distinctive latent representations for each object class.

5. Conclusion

Enlightened by graph convolutional networks for the classification task in 3D computer vision, we design novel MRFGAT-based modules for point feature and context aggregation. Utilizing different receptive fields and attention strategies, the MRFGAT pipeline can capture finer features of point clouds for the classification task. In addition, we report comparisons with recent works which show that our model achieves state-of-the-art performance on the ModelNet dataset for point cloud classification; it outperforms the GAPNet model by 0.1% in terms of OA and competes with the DGCNN model in terms of MA. It is necessary to point out that our model incurs some overhead for constructing graphs at varying scales. Given the state-of-the-art Graph Convolution Networks (GCN) for semantic segmentation of point clouds, it would be interesting to extend our model to this problem for unstructured data in the future.

Data Availability

The dataset used in this manuscript is available at https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the GDAS' Project of Science and Technology Development (no. 2018GDASCX-0804) and the Project of Guangdong Engineering Technology Research Center (no. 810115228131).