Abstract

Human pose recognition based on machine vision usually suffers from a low recognition rate, poor robustness, and low operating efficiency, mainly because of the complexity of the background as well as the diversity of human poses, occlusion, and self-occlusion. To solve this problem, a feature extraction method combining the directional gradient of depth feature (DGoD) and the local difference of depth feature (LDoD) is proposed in this paper, which uses a novel strategy that compares the eight neighborhood points around a pixel with one another to calculate interpixel differences. A new data set is then established to train a random forest classifier, and a random forest two-way voting mechanism is adopted to classify the pixels of different parts of the human body in depth images. Finally, the gravity center of each part is calculated and a reasonable point is selected as the joint to extract the human skeleton. Evaluated on the proposed data set, the experimental results show that robustness and accuracy are significantly improved while operating efficiency remains competitive.

1. Introduction

Human perception of the external world relies mainly on sense organs such as sight, touch, hearing, and smell, and about 80% of this information is obtained through vision. Equipping next-generation intelligent computers with visual functions, so that they can automatically recognize and analyze the activities of people in the surrounding environment, is therefore an important goal [1–3].

At present, pose and action recognition is widely used in many fields, such as advanced human-computer interaction, intelligent monitoring systems, motion analysis, and medical rehabilitation [4–6]. Pose recognition is a challenging problem in motion analysis. Its core objective is to infer the posture parameters of the various parts of the human body from image sequences, such as their actual positions in three-dimensional space or the angles between the joints. Human body motion can then be reconstructed in three-dimensional space from these posture parameters. Current machine-vision-based human pose recognition algorithms fall into two main categories: those based on traditional RGB images and those based on depth images. The biggest difference between them is that pixels in an RGB image record the color information of the object, while pixels in a depth image record the distance between the object and the camera. Human pose recognition based on RGB images mainly exploits apparent features of the image, such as HOG (histogram of oriented gradients) features [7] and contour features [8]. However, these methods are usually affected by the external environment and are particularly vulnerable to lighting, resulting in low detection accuracy. In addition, because of large variations in human body size, these algorithms are only suitable for limited environments and subjects. In recent years, with the development of depth sensors, especially Microsoft's Kinect, which provides both color and depth information (RGB-D), the recognition rate of human pose has improved greatly compared with ordinary sensors [9–13]. The main reason is that depth images have several advantages over RGB images. First, depth images are robust to changes in color and illumination. Second, a depth image carries 3D information and therefore contains more information than an RGB image. Human pose recognition methods can be divided into two categories: model-based methods and feature-learning methods. In model-based human pose detection, the human body is divided into multiple components that are combined into a model, and the human pose is then estimated by inverse kinematics or by solving optimization problems. Pishchulin et al. proposed a new articulated posture model based on image morphology [14]. Sun and Savarese proposed the APM (articulated part-based model) based on joint detection [15], and Sharma et al. proposed the EPM (expanded parts model) based on a collection of body parts [16]. Siddiqui and Medioni used a Markov chain Monte Carlo (MCMC) framework with head, hand, and forearm detectors to estimate the body pose [17].

Feature-learning methods try to obtain high-level features from depth images by analyzing each pixel and use various machine learning algorithms to recognize human pose [12, 18–23]. Shotton et al. proposed two different methods for estimating human body poses [18]. One method uses a random forest to classify each pixel in the depth image; the other predicts the positions of human joints. Both are based on random forest classifiers trained on a large number of synthetic and real human depth images. Hernández-Vela et al. proposed a graph-cut optimization based on Shotton's method [24]. Kim et al. proposed another human pose estimation method based on SVMs (support vector machines) and superpixels [25]. In addition, deep learning algorithms have also been applied to pose estimation [26–28], and convolutional neural networks (CNNs) have been used for large-scale data set processing [29–32].

In general, the advantage of model-based human pose recognition is that there is no need to build a large data set; only a set of models has to be established. It achieves a high recognition rate for poses that match the model. However, this approach also has disadvantages: it is difficult to construct complex human body models, mainly because of the diversity of human postures in real situations.

The main merit of feature learning is that it does not need a complex human body model, so it is not restricted by a model and can be applied to various situations. However, this approach also has disadvantages. On the one hand, a huge data set must be built to cover different environments. On the other hand, many feature extraction methods have high computational complexity and cannot meet real-time requirements. Therefore, a human pose recognition method based on depth-image multifeature fusion is proposed in this paper. First, the body parts in depth images are encoded with numeric labels and a data set is constructed. Afterwards, the LDoD and DGoD features are extracted and used to train a random forest classifier. Finally, the gravity center of each part is calculated and candidate joints are screened out. LDoD and DGoD have lower computational complexity than other feature extraction algorithms, so they satisfy the real-time requirement. Moreover, combining LDoD and DGoD improves the recognition rate of human pose.

The rest of this paper is organized as follows: Section 2 outlines the algorithm flow of depth-image multifeature fusion for recognizing human pose. Section 3 details each step of pose recognition and the related algorithms. Section 4 describes the construction of the random forest classifier. Section 5 describes the positioning of the joints in the human body image. Section 6 analyzes the experimental results. Section 7 concludes the paper.

2. Algorithm Overview

The flowchart of human pose recognition based on depth-image multifeature fusion is shown in Figure 1. First, the original depth image is segmented to extract the human target so that the different parts of the segmented body can easily be tagged with a specific code. Then, LDoD and DGoD features are extracted to train multiple decision trees and obtain a random forest classifier. The classifier is used to classify the body parts of the test samples. Finally, the positions of the joints in the human body image are calculated.
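This flow can be summarized in a short sketch (Python is used for illustration here and throughout; every function name below is a hypothetical placeholder for a step detailed in the following sections, not the paper's C++ implementation):

# High-level pipeline corresponding to Figure 1. All functions are
# hypothetical stand-ins for the steps described in Sections 3-5.
def recognize_pose(depth_frame, background, forest):
    body = segment_foreground(depth_frame, background)  # Section 3.1
    features = extract_ldod_dgod(body)                  # Sections 3.3 and 3.4
    part_labels = forest.classify(features)             # Section 4
    return locate_joints(body, part_labels)             # Section 5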

3. Human Pose Recognition

3.1. Depth Image Segmentation

In image processing, we often focus on special areas called regions of interest (ROIs) [33–35]. These areas usually contain rich feature information. Therefore, in order to identify and analyze the target, the area where the target is located needs to be separated from the background. On this basis, feature extraction and human body recognition can be performed.

Because the actual scene is fixed in this paper, depth background difference is used to segment the human body. The depth values are quantized to a grayscale space of 0–255, where a smaller value corresponds to a larger depth. The 3D depth data can therefore be displayed as a 2D gray image, in which the pixel values carry a different meaning from those of a conventional RGB image.

Because the camera shoots downward from above the head, the leg information of the human body is not considered. The depth range is controlled between $d_{\min}$ and $d_{\max}$, that is, $d \in [d_{\min}, d_{\max}]$. First, Gaussian filtering is performed on the original depth data to filter out noise and suppress the drift of the depth data. Then, the background image is subtracted from the original depth image, and the foreground target is extracted according to the threshold $T$:

$$M(x, y) = \begin{cases} 1, & \left| I(x, y) - B(x, y) \right| > T, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $B(x, y)$ is the background image, $I(x, y)$ is the original image, and $M(x, y)$ is the binary image. Then, the depth image of the corresponding area is extracted:

$$D(x, y) = I(x, y) \cdot M(x, y), \tag{2}$$

where $D(x, y)$ is the effective depth area and $D(x, y) \in [d_{\min}, d_{\max}]$.
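A minimal sketch of this segmentation step, assuming 8-bit quantized depth maps (the kernel size and the threshold value T=10 below are illustrative assumptions, not the paper's settings):

import cv2
import numpy as np

def segment_foreground(frame, background, T=10):
    """Extract the human target by depth background difference.

    frame, background: uint8 depth maps quantized to 0-255
    (smaller values correspond to larger depths).
    """
    # Gaussian filtering suppresses sensor noise and depth drift.
    smoothed = cv2.GaussianBlur(frame, (5, 5), 0)
    # Background difference followed by thresholding gives the
    # binary mask M of formula (1).
    mask = (cv2.absdiff(smoothed, background) > T).astype(np.uint8)
    # Formula (2): keep depth values only inside the foreground.
    return smoothed * mask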

3.2. Tagging Body Parts

Since there is no standard depth image library for human pose, we build a data set that includes common human actions such as running, jumping, lifting, bending, knee flexion, and interaction. The random forest learning algorithm is a supervised learning method: the category of each data sample must be known, so the samples need to be tagged [36–39]. The tagging method divides the human body into 11 parts, with the rest treated as background; the approximate position of each part of the human body in the depth image is observed, and that position is then tagged with the corresponding color. As shown in Figure 2, the valid points inside the rectangle of the head area are all marked in red.

The tagging result is shown in Table 1. This paper divides the human body above the waist into the head, the left shoulder, the right shoulder, the left upper arm, the right upper arm, the left lower arm, the right lower arm, the left hand, the right hand, the left body, the right body, and the background.
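For concreteness, one possible numeric encoding of these 12 classes is sketched below (the codes are illustrative assumptions; the actual assignment is defined by the colors in Figure 2 and Table 1):

# Hypothetical label codes for the 12 classes; the paper's own
# assignment is given in Table 1.
PART_LABELS = {
    0: "background",     1: "head",
    2: "left shoulder",  3: "right shoulder",
    4: "left upper arm", 5: "right upper arm",
    6: "left lower arm", 7: "right lower arm",
    8: "left hand",      9: "right hand",
    10: "left body",     11: "right body",
}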

3.3. LDoD Feature Extraction

From the manually tagged depth images of the human body, the features of the 12 classes need to be extracted. This paper uses the local difference feature as the feature representation of a pixel, which reflects the neighborhood information of the pixel. It uses the difference between pairs of pixels among the eight neighborhood points to represent the characteristics of the pixel. The locations of the eight neighborhood pixels are shown in Figure 3.

The LDoD feature can be represented as

$$F_{i,j}(x) = d(p_i) - d(p_j), \tag{3}$$

where $i, j \in \{1, 2, \dots, 8\}$, $i \neq j$, and $d(p_i)$ is the depth value of $p_i$.

Assuming $i < j$, each pair $(p_i, p_j)$ is compared once. The feature vector of a point $x$ can then be expressed as

$$F(x) = \left( F_{1,2}(x), F_{1,3}(x), \dots, F_{7,8}(x) \right). \tag{4}$$
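A sketch of this computation under the definitions above (the neighbor ordering and the one-pixel window radius are assumptions; the paper fixes them in Figure 3):

import numpy as np
from itertools import combinations

# Offsets of the eight neighborhood points p1..p8 around a pixel;
# the ordering here is illustrative.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def ldod_feature(depth, x, y):
    """28-dimensional LDoD vector: d(p_i) - d(p_j) for all i < j."""
    d = [float(depth[x + dx, y + dy]) for dx, dy in OFFSETS]
    return np.array([d[i] - d[j] for i, j in combinations(range(8), 2)])

As formula (3) indicates, each component costs a single subtraction, which is the source of the feature's low computational complexity.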

According to the LDoD feature, the features of pixels of the same class are mostly similar, while the features of pixels of different classes differ greatly. Therefore, this feature discriminates well between the various parts of the human body. Figure 4(a) shows the segmented depth image, and Figure 4(b) is an enlarged image of the left lower arm. As can be seen from the figures, two of the marked neighborhood pixels lie inside the body area, while the third lies outside it and its depth value is 0. Therefore, the difference between the two in-body pixels is small, while the difference involving the out-of-body pixel is large. Figure 4(c) is an enlarged image of the right lower arm, where the relationship is mirrored: the first difference is large and the second is small. These two values can therefore distinguish the left and right lower arms of the human body.

The computational complexity of this feature is very low, since formula (3) uses only subtractions between depth values. In addition, the feature is invariant to spatial translation and rotation and can accommodate changes in people's postures.

3.4. DGoD Feature Extraction

Because the depth information represents the distance between the object and the depth camera, the angle between the plane in which a pixel lies and the plane of the depth camera can be obtained simply by calculating the arctangent of the depth gradient. This is the DGoD feature, which can be calculated as

$$\theta_k = \arctan \frac{d(p_{k+1}) - d(p_k)}{\left\| p_{k+1} - p_k \right\|}, \tag{5}$$

where $k = 1, 2, 3$. Three DGoD features are selected, represented as $\theta_1$, $\theta_2$, and $\theta_3$.

The range of the directional gradient is $\left( -\frac{\pi}{2}, \frac{\pi}{2} \right)$. When the pixel points lie on the same plane, their directional gradients are similar in size; otherwise, the directional gradients differ considerably. A diagram of the DGoD feature is shown in Figure 5. The green dots from left to right are $p_1$, $p_2$, $p_3$, and $p_4$. It can be seen from formula (5) that $\theta_1$ is small while $\theta_3$ is large, which means that $p_1$ and $p_2$ lie in the same plane, while $p_3$ and $p_4$ lie in different planes.
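A sketch of this computation (assuming, as in formula (5), that the gradient is taken between consecutive sample points and normalized by their pixel distance):

import numpy as np

def dgod_features(depth, points):
    """DGoD features theta_1..theta_3 for four sample points p1..p4
    (the green dots in Figure 5).

    points: list of four (x, y) pixel coordinates.
    """
    thetas = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dz = float(depth[x1, y1]) - float(depth[x0, y0])
        ds = np.hypot(x1 - x0, y1 - y0)  # spatial distance in pixels
        # arctan of the depth gradient; lies in (-pi/2, pi/2).
        thetas.append(np.arctan2(dz, ds))
    return thetas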

4. The Design of the Random Forest Classifier

4.1. Random Forest Model Construction

The decision tree is one of the most widely used inductive inference algorithms today. The rules it generates are simple and easy to understand, and pixels of depth images can be classified quickly and efficiently by a decision tree, so it is widely used in target detection and recognition. However, a single decision tree easily overfits, causing wrong classifications. A random forest is composed of multiple decision trees [40, 41], each trained with different data sets and parameters, which not only reduces overfitting but also improves classification accuracy, because the output is voted on by multiple decision trees.

The classification performance of a random forest classifier is affected by many factors, including the size of the training set $N$, the dimension $M$ of the sample feature vector, the number of decision trees $T$, the maximum depth $D$ of each tree, the number $m$ of features considered at each split, and the termination condition for the growth of each tree.

In the previous sections, the human body was divided into 12 different classes, and the LDoD and DGoD features were extracted as the input of the random forest classifier. All of this preliminary work prepares for the design of the classifier model. The set of attributes can be represented as

$$A = \left\{ F_{1,2}, F_{1,3}, \dots, F_{7,8}, \theta_1, \theta_2, \theta_3 \right\}. \tag{6}$$

The ID3 decision tree algorithm is used to train each decision tree in the random forest. The training sample set can be expressed as

$$S = \left\{ (x_i, c_i) \mid x_i \in X, \; c_i \in C \right\}, \tag{7}$$

where $X$ is the set of training pixels and $C$ is the collection of categories to which a pixel can belong, that is, the 12 classes of the human body.

The set of parameters can be expressed as

$$\Phi = (\phi, \tau_1, \tau_2), \tag{8}$$

where $\phi$ is the attribute parameter and $\tau_1$ and $\tau_2$ are the thresholds.

The flowchart for constructing a single decision tree is shown in Figure 6. First, sampling with replacement is adopted, and a training subset $S_t$ of the same size as $S$ is extracted from $S$; repeating this yields $T$ subsets. Then, a tree node is created; if the termination condition is reached, the process stops and the current node is set as a leaf node. Otherwise, $m$ features are extracted from the $M$-dimensional feature set by fixed-scale sampling without replacement. The one-dimensional feature is chosen according to the metric of the feature attribute, and the current node $Q$ is split into the left subset $Q_l$ and the right subset $Q_r$:

$$Q_l = \left\{ x \mid \tau_1 \le f_\phi(x) < \tau_2 \right\}, \qquad Q_r = Q \setminus Q_l. \tag{9}$$

The information gain is used to select the partitioning attribute of the decision tree and can be calculated as

$$G(Q, \Phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s|}{|Q|} H(Q_s), \tag{10}$$

where $H(\cdot)$ is the information entropy.
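A minimal sketch of this split criterion (standard information gain, as named in the text; the boolean-mask interface is an assumption):

import numpy as np

def entropy(labels):
    """Shannon entropy H of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Gain of splitting a node's samples into left/right subsets,
    as in formula (10); left_mask marks the samples sent left."""
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    w_l = len(left) / len(labels)
    return entropy(labels) - w_l * entropy(left) - (1 - w_l) * entropy(right)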

4.2. Random Forest Two-Way Voting

In traditional random forest classification [42–44], a sample is judged by every decision tree and voted on by every tree, and every tree has an equal decision right. In this paper, a random forest two-way voting mechanism with different decision rights is adopted. The data set is divided into in-of-bag and out-of-bag data: a data subset is called in-of-bag when it is used to build the random forest; otherwise, it is called out-of-bag. The decision right of a tree is earned according to its results on the out-of-bag data: whenever the classification result is correct, the tree receives a vote. A decision tree with more votes gets a higher weight. The basic steps of the two-way voting algorithm are as follows.

Step 1. Create the decision trees, generating in-of-bag data and out-of-bag data for every tree.

Step 2. Perform a performance evaluation; that is, evaluate each tree on a certain amount of out-of-bag data. Whenever the decision tree's classification result is correct, the tree receives a vote.

Step 3. Assign each decision tree's total number of votes as its weight, and normalize the weights of all decision trees.

Step 4. Input the test sample into the trained random forest model, and multiply each tree's classification result by its weight to obtain the final classification result:

$$C = \sum_{i=1}^{T} w_i c_i, \tag{11}$$

where $C$ is the final classification result, $w_i$ is the weight coefficient corresponding to the $i$-th decision tree with $\sum_{i=1}^{T} w_i = 1$, and $c_i$ is the evaluation result of the $i$-th decision tree.
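One way to read Steps 1-4 in code (a sketch assuming scikit-learn-style trees and integer class labels 0..11; counting each correct out-of-bag prediction as one vote is our interpretation of Step 2):

import numpy as np

def two_way_vote(trees, oob_sets, sample, n_classes=12):
    """Weighted random-forest vote with OOB-derived decision rights.

    trees: fitted decision trees exposing .predict(X)
    oob_sets: per-tree out-of-bag data as (X_oob, y_oob) pairs
    """
    # Steps 2-3: each correct out-of-bag prediction is a vote;
    # normalized vote counts become the trees' weights w_i.
    votes = np.array([np.sum(t.predict(X) == y)
                      for t, (X, y) in zip(trees, oob_sets)], dtype=float)
    weights = votes / votes.sum()
    # Step 4: weighted tally of each tree's prediction (formula (11)).
    scores = np.zeros(n_classes)
    for t, w in zip(trees, weights):
        scores[int(t.predict([sample])[0])] += w
    return int(np.argmax(scores))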

5. Human Joint Positioning

Determining the human body joints is the final step of human pose recognition [45]. The preceding sections used the random forest classifier to classify the 12 parts in the human body image, but the joint positions have not yet been determined. In this paper, the joints are determined by calculating the gravity centers of the 12 body parts.

For a depth image of size $M \times N$, the moment $m_{pq}$ and the central moment $\mu_{pq}$ can be calculated by formulas (12) and (13), respectively:

$$m_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} x^p y^q I(x, y), \tag{12}$$

$$\mu_{pq} = \sum_{x=1}^{M} \sum_{y=1}^{N} (x - \bar{x})^p (y - \bar{y})^q I(x, y), \tag{13}$$

where $(\bar{x}, \bar{y})$ is the gravity center, which can be calculated by formulas (14) and (15):

$$\bar{x} = \frac{m_{10}}{m_{00}}, \tag{14}$$

$$\bar{y} = \frac{m_{01}}{m_{00}}. \tag{15}$$

The gravity center $(\bar{x}_u, \bar{y}_u)$ of the upper arm and the gravity center $(\bar{x}_l, \bar{y}_l)$ of the lower arm are combined to obtain the joint of the left or right elbow, as shown in formula (16):

$$(x_e, y_e) = \frac{S_u (\bar{x}_u, \bar{y}_u) + S_l (\bar{x}_l, \bar{y}_l)}{S_u + S_l} + (\Delta x, \Delta y), \tag{16}$$

where $S_u$ is the area of the upper arm, $S_l$ is the area of the lower arm, and $\Delta x$ and $\Delta y$ are the offsets.
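A sketch of this joint-positioning step (the area-weighted combination in elbow_joint is one reading of formula (16), with the offsets left as free parameters):

import numpy as np

def gravity_center(depth, mask):
    """Gravity center (x_bar, y_bar) of one body part via the image
    moments m00, m10, m01 of formulas (12), (14), and (15)."""
    xs, ys = np.nonzero(mask)           # pixels labeled as this part
    w = depth[xs, ys].astype(float)     # depth values I(x, y) as weights
    m00 = w.sum()
    return (xs * w).sum() / m00, (ys * w).sum() / m00

def elbow_joint(c_upper, s_upper, c_lower, s_lower, offset=(0.0, 0.0)):
    """Area-weighted combination of the upper- and lower-arm gravity
    centers plus an offset, following formula (16)."""
    cu, cl = np.asarray(c_upper), np.asarray(c_lower)
    return (s_upper * cu + s_lower * cl) / (s_upper + s_lower) + np.asarray(offset)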

6. Experimental Results and Analysis

In this paper, 1000 depth images are used to train the classifier model and 100 images are used for testing, covering the poses of 10 different people. The algorithm is programmed in C++ and compiled in Visual Studio 2013. The test computer uses an Intel Core i5-4570 processor clocked at 3.20 GHz. The depth images are captured with a ToF (time-of-flight) depth camera.

6.1. Qualitative Analysis

The results of human body part recognition and joint positioning for 6 postures are shown in Figure 7. The first column shows the segmented depth images, the second column the outputs of the random forest classifier, the third column the gravity centers of each part, and the last column the skeletons composed of the joints. As can be seen from Figure 7, the random forest classifier correctly classifies most of the pixels in the human body image, such as the body and the head. Incorrect classification occurs mostly at the intersection of two parts. Fortunately, the joints are still positioned almost accurately, and a reasonable human skeleton can be obtained. Finally, in the sixth picture, one of the hands occludes the body, and the positioning result shows that the pose recognition based on the fusion of DGoD and LDoD features proposed in this paper can handle occlusion between body parts.

6.2. Quantitative Analysis
6.2.1. Comparison of Experimental Results with Different Classifier Parameters

When constructing the random forest model, the number of decision trees $T$, the maximum tree depth $D$, and the minimum number of samples in the leaf nodes $S_{\min}$ can all affect classifier performance. The experiment first determines the optimal classifier parameters by training on five sample images. Figures 8–10 compare the classification accuracy and the training time of the algorithm under different parameters.

From Figure 8 we can see that, with the other parameters fixed, both the training time and the classification accuracy show an increasing trend as the number of decision trees $T$ increases. When $T$ is 20, the classification accuracy on the test sample reaches 77.2% and the training time is 100 s. When $T$ is 25, the classification accuracy increases by only 1%, but the required training time grows to 140 s. Therefore, the optimal $T$ in this paper is 20.

As shown in Figure 9, with the other parameters fixed, the greater the depth $D$ of the tree, the higher the accuracy. When $D$ reaches 30, the accuracy reaches its maximum and then remains almost constant as the depth increases further. So the optimal depth is 30.

The minimum number of samples in the leaf nodes, $S_{\min}$, serves as the termination condition for the growth of the decision tree. When it is too large, tree growth stops prematurely, which hurts classification accuracy; when it is too small, the tree structure becomes more complicated and consumes too much time. As Figure 10 shows, with the other parameters fixed, the classification accuracy of the test sample peaks at 78.4% at the chosen value of $S_{\min}$ and drops to 77.6% away from it; this value of $S_{\min}$ is therefore used in this paper.
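An illustrative parameter sweep of this kind (scikit-learn stands in for the paper's C++ implementation; the parameter grids and the randomly generated stand-in data are assumptions, with 31 = 28 LDoD + 3 DGoD feature dimensions):

import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier

# Dummy stand-ins for the extracted features and the 12 class labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 31)), rng.integers(0, 12, 1000)
X_test, y_test = rng.normal(size=(200, 31)), rng.integers(0, 12, 200)

# Sweep the three parameters studied in Figures 8-10.
for n_trees, depth, leaf in product((10, 20, 25), (20, 30), (1, 5, 10)):
    clf = RandomForestClassifier(n_estimators=n_trees, max_depth=depth,
                                 min_samples_leaf=leaf).fit(X_train, y_train)
    print(n_trees, depth, leaf, round(clf.score(X_test, y_test), 3))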

6.2.2. Comparison of the Recognition Rate of Various Algorithms

This paper compares the recognition rate of each part using the single LDoD feature with that using the combination of the DGoD and LDoD features, as shown in Figure 11. The recognition rate of the random forest algorithm with multifeature fusion is clearly improved, reaching about 80%. Among the 12 parts, the recognition rates of the left and right arms are lower, mainly because of the complex movements of the upper limbs. In addition, as the number of collected samples increases, the recognition rate will rise further.

The traditional voting mechanism of random forests and the two-way voting mechanism are also compared, as shown in Figure 12. It can be seen from the figure that the classification accuracy of the random forest two-way voting mechanism is significantly higher than that of the traditional one-way voting mechanism.

Finally, we compare our algorithm with popular algorithms from the literature, as shown in Table 2; our classification method outperforms those of Shotton and Kim. In addition, our computation time is about 54.9% of that of Shotton's algorithm. Therefore, the proposed method is better suited to applications demanding high real-time performance and high recognition rates.

7. Conclusion

In this work, we propose a human pose recognition algorithm based on the fusion of LDoD and DGoD features. We first establish our own sample data set of depth images tagged with specific codes. Then, we extract the LDoD and DGoD features from the samples; both features are simple to calculate, so the computational cost is greatly reduced. Next, these two features are used to train the random forest classifier. To improve classification accuracy, a random forest two-way voting mechanism is used to detect and classify the different parts of the human body. Finally, according to the classification results, the gravity centers of the different body parts are calculated so that accurate joints and a skeleton can be obtained.

The experimental results show that the random forest classifier achieves high classification accuracy and robustness. In addition, our method has a low computational cost compared with other methods and meets real-time requirements. However, no method is perfect for human body pose recognition, so the following aspects merit further research:
(i) Extracting better features for body part recognition and classification
(ii) Using other classification algorithms or classifiers for body part recognition and classification, such as efficient deep learning methods
(iii) Studying body part recognition with more complex human poses, such as lying on the ground

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61877065 and 61473182).