
2019, 1(5): 511–524

Published Date: 2019-10-20　DOI: 10.1016/j.vrih.2019.08.002

Abstract

Background
Generally, it is difficult to obtain accurate pose and depth for a non-rigid moving object from a single RGB camera to create augmented reality (AR). In this study, we build an AR system from a single RGB camera for a non-rigid moving human by accurately computing pose and depth, for which two key tasks are segmentation and monocular Simultaneous Localization and Mapping (SLAM). Most existing monocular SLAM systems are designed for static scenes, whereas in this AR system the human body is always moving and non-rigid.
Methods
In order to make the SLAM system suitable for a moving human, we first segment the rigid part of the human in each frame. A segmented moving body part can be regarded as a static object, and the relative motions between each moving body part and the camera can be considered the motion of the camera. Typical SLAM systems designed for static scenes can then be applied. In the segmentation step of this AR system, we first employ the proposed BowtieNet, which adds the atrous spatial pyramid pooling (ASPP) of DeepLab between the encoder and decoder of SegNet to segment the human in the original frame, and then we use color information to extract the face from the segmented human area.
Results
Based on the human segmentation results and a monocular SLAM, this system can change the video background and add a virtual object to humans.
Conclusions
The experiments on the human image segmentation datasets show that BowtieNet obtains state-of-the-art human image segmentation performance and enough speed for real-time segmentation. The experiments on videos show that the proposed AR system can robustly add a virtual object to humans and can accurately change the video background.

Content

1 Introduction
Augmented reality (AR) is a technology that fuses virtual objects into real images or videos. To make virtual objects look like real objects, AR systems need to reconstruct 3D models and calculate the camera pose for each frame. Simultaneous Localization and Mapping (SLAM) technology can calculate camera poses and reconstruct 3D models at the same time. Therefore, SLAM provides the foundation for AR. In this paper, we build an augmented reality system that can add virtual objects on a rigid human part while changing the video background.
At present, most applications that can add virtual accessories to a human head are based on tracking key points and applying 2D stickers. In these applications, users have a poor sense of reality. Unlike the applications described above, we employ our SLAM-based AR system to add a virtual object on a human head. Several comparison examples are shown in Figure 1.
Our AR system is based on SLAM technology. Specifically, the location of the 3D virtual hat is determined by the location of the reconstructed 3D face, and the projected picture of the virtual hat in each frame is determined by the computed camera pose. Most of the existing SLAM systems[1,2,3] are designed for static and rigid scenes, while in our system, the human body is non-rigid and always moving. Therefore, these SLAM systems[1,2,3] are not suitable for our task. Park et al.[4] and Ren et al.[5] proposed methods for 3D object online reconstruction and tracking; however, their methods needed RGB-D images. Feng et al.[6] proposed an online moving object reconstruction and tracking method (OBRAT) that needed only RGB images. However, this method[6] was suitable only for moving rigid objects, and a simple color histogram-based traditional method, fast local kernel density estimation (FLKDE), was used in its segmentation step. In recent years, deep learning-based image segmentation methods have developed quickly[7,8,9,10], and they clearly outperform the traditional methods. Therefore, in our system, a deep learning-based segmentation network is employed.
Deep learning-based image segmentation methods were first implemented with region-based classification[11,12]. However, these methods run slowly and cannot be trained end to end. In contrast, fully convolutional networks (FCN)[7] can be trained end to end and directly output dense segmentation results that are the same size as the inputs. Because of their excellent performance, FCN-based methods have rapidly become mainstream in the field of image segmentation. Many excellent FCN-based methods have since been proposed, such as SegNet[8], DeepLab[10], and U-Net[9]. SegNet[8] and U-Net[9] are encoder-decoder networks with elegant symmetrical structures; they generally have higher accuracy around object boundaries. DeepLab[10] does not have a decoder structure. It uses atrous convolution to enlarge the receptive field, and employs atrous spatial pyramid pooling (ASPP) to incorporate multi-scale features. Therefore, DeepLab shows better performance in segmenting multi-scale objects. To integrate the advantages of encoder-decoder networks and DeepLab, we added the ASPP of DeepLab between the encoder and decoder of SegNet. We used SegNet rather than U-Net because SegNet is faster. We call our segmentation network BowtieNet, because its encoder-ASPP-decoder structure looks like a bowtie. The recently proposed DeepLab v3+[13] also combines spatial pyramid pooling and an encoder-decoder model. Our experimental comparisons show that BowtieNet has higher accuracy around human boundaries than DeepLab v3+, because of its more detailed upsampling steps.
In the segmentation step of the proposed AR system, BowtieNet was trained by a human image segmentation dataset. Then, when processing videos, the trained BowtieNet was used to segment the humans in each frame. Having obtained the segmented humans, different parts of the human could be extracted using color information and optical flows. Based on the human segmentation results, the proposed AR system could change the video background and add virtual objects.
In summary, the contributions in this study are:
(1) We built a novel AR system for humans from only a single RGB camera that could add a virtual object to a human while changing the video background.
(2) We employed a deep learning-based segmentation method and an optical flow method to segment humans and track their rigid parts. Each extracted rigid human part could be regarded as a static object, and the relative motion between the human parts and the camera could be considered camera motion. The monocular SLAM system could then be used.
(3) We built BowtieNet for segmentation, which added the ASPP of DeepLab between the encoder and decoder of SegNet.
2 Methods
Generally, without prior knowledge, it is difficult to obtain accurate pose and depth for a non-rigid object from a single RGB camera to create AR. However, we can build an AR system from a single RGB camera for non-rigid humans by accurately computing pose and depth, for which two key tasks are segmentation and monocular SLAM. Fortunately, the human body is a chain structure linked by joints, and a part between two successive joints can be regarded as approximately rigid. Through the proposed AR system, virtual objects can be added to a rigid human part while the video background is changed. A flowchart of the proposed AR system, using a human head as an example, is shown in Figure 2. As shown in Figure 2, the entire system contains four parts: segmentation, reconstruction, tracking, and projection. The segmentation part removes the background and produces the images of the human and the face. Based on the images of the extracted human, new frames with a new background can be generated. Based on the images of the extracted face, reconstruction and tracking are implemented. The reconstruction part reconstructs a 3D face model using key frames. Then, a 3D model of a virtual hat, such as a Christmas hat (as shown in Figure 2), can be added to the 3D face model. The tracking part is designed to predict the camera pose of each frame. Finally, the projection part projects the virtual object onto the new frames with a new background according to the camera pose provided by the tracking part. In our system, the camera coordinate system of the first frame is used as the world coordinate system. Next, we describe each part of the system in detail.
2.1　Segmentation
We segmented the human using a CNN-based image segmentation network, BowtieNet. In each video frame, the human is located within a small area. Therefore, it is not necessary to segment the human from the whole frame. In our experiment, we segmented the whole image region for the first frame, and then segmented only the regions of interest (ROIs) in the other frames. By using this strategy, we successfully decreased the time consumption for segmentation. The segmentation processing steps are shown in Table 1.
Table 1　Segmentation processing steps for videos
Begin:
(1) Segment human through the whole image region in the first frame.
While the video has not ended:
(2) Capture a new frame.
(3) Expand the bounding box of the segmentation result in the previous frame and use it as the segmentation ROI in the current frame. If there is no human in the previous frame, use the whole current frame as the segmentation ROI.
(4) Use BowtieNet to segment the ROI.
(5) Generate the segmentation result of the current frame according to the ROI’s location and the ROI’s segmentation result.
End
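The steps above can be sketched in Python. This is a hedged illustration, not the authors' implementation: `segment_fn` stands in for a trained BowtieNet forward pass, and the `margin` used to expand the bounding box is our own assumption.

```python
import numpy as np

def expand_bbox(mask, margin, shape):
    """Expand the bounding box of a binary mask by a margin, clipped to the frame."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None  # no human found; caller falls back to the whole frame
    h, w = shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, h)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, w)
    return y0, y1, x0, x1

def segment_video(frames, segment_fn, margin=32):
    """Run segment_fn on ROIs derived from the previous frame's result (Table 1)."""
    results = []
    prev_mask = None
    for frame in frames:
        h, w = frame.shape[:2]
        roi = expand_bbox(prev_mask, margin, (h, w)) if prev_mask is not None else None
        if roi is None:
            mask = segment_fn(frame)        # steps (1)/(3): whole frame
        else:
            y0, y1, x0, x1 = roi            # steps (4)-(5): segment only the ROI
            mask = np.zeros((h, w), dtype=bool)
            mask[y0:y1, x0:x1] = segment_fn(frame[y0:y1, x0:x1])
        results.append(mask)
        prev_mask = mask
    return results
```

Cropping to the expanded box keeps the network input small while tolerating moderate inter-frame motion.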
BowtieNet was built by combining the advantages of encoder-decoder networks and DeepLab[10]. Benefiting from the encoder-decoder structure, networks such as SegNet[8] and U-Net[9] generally have better performance around object boundaries. In contrast, DeepLab is generally more robust for segmenting objects with different scales because of its employment of ASPP. BowtieNet adds ASPP between its encoder and decoder to combine the advantages of these two kinds of network. After comparing the typical encoder-decoder networks SegNet[8] and U-Net[9], we found that SegNet had a lighter decoder structure and obviously faster speed. Therefore, the encoder and decoder of BowtieNet were built with reference to SegNet[8]. The recently proposed DeepLab v3+[13] also combines spatial pyramid pooling and the encoder-decoder model. However, DeepLab v3+ recovers the scale of feature maps using only two upsampling steps and one skip connection, while the proposed BowtieNet recovers the scale of feature maps using three upsampling steps and three skip connections. Our experimental results show that BowtieNet has higher accuracy around human boundaries than DeepLab v3+. The architecture of BowtieNet is shown in Figure 3.
As shown in Figure 3, the proposed BowtieNet can be divided into three parts: encoder, ASPP, and decoder. The encoder design is based on VGG-16, in which the resolution of the feature maps becomes smaller with the increase in the number of pooling layers. To keep the feature maps from becoming too small, we make some changes by referring to DeepLab: the strides of the last two pooling layers are changed from 2 to 1, and atrous convolutional layers are applied in the last convolutional block. We prevent the feature maps from becoming too small, partly because it is more difficult to recover spatial information from smaller feature maps, and also because if the feature maps are too small, some branches of ASPP will degenerate to simple convolutional layers with 1×1 filters[14]. Atrous convolutional layers upsample normal convolutional filters by inserting holes between the nonzero filter taps[10]. With the help of atrous convolution, when we change the strides of the pooling layers from 2 to 1, the size of the receptive field will not decrease.
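To see why the stride change does not shrink the receptive field, the arithmetic can be checked with a small helper. This is a sketch; the layer lists approximate the tail of a VGG-16-style encoder and are our assumption, not the exact BowtieNet configuration.

```python
def receptive_field(layers):
    """Compute receptive field size and output stride for a layer stack.
    Each layer is (kernel, stride, dilation)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1          # effective kernel after inserting holes
        rf += (k_eff - 1) * jump         # growth scaled by the accumulated stride
        jump *= s
    return rf, jump                      # (receptive field, output stride)

# Original tail: a stride-2 pool followed by three 3x3 convolutions
original = [(2, 2, 1), (3, 1, 1), (3, 1, 1), (3, 1, 1)]
# Modified tail: pool stride changed to 1, convolutions with atrous rate 2
modified = [(2, 1, 1), (3, 1, 2), (3, 1, 2), (3, 1, 2)]
```

Both stacks reach the same receptive field, but the modified one keeps the spatial resolution (output stride 1 instead of 2) exactly as the text describes.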
ASPP is added after the encoder. We use a similar ASPP architecture to that in [10]. In the ASPP of [10], the number of output channels is equal to the number of segmentation classes. However, in our ASPP, the number of output channels is set to 256 to reserve more features. Different branches of ASPP have different atrous rates. Therefore, different branches have different sizes of receptive fields, and can extract features at different scales. Thus, the utilization of ASPP increases the robustness of BowtieNet when segmenting humans at various scales.
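The parallel-branch structure of ASPP can be illustrated in plain NumPy. This is a single-channel sketch under our own simplifications; the real ASPP operates on multi-channel feature maps with learned filters and, in the paper, 256 output channels per branch.

```python
import numpy as np

def atrous_conv2d(x, f, rate):
    """'Same'-padded 3x3 atrous convolution on a single-channel map.
    The filter taps are spread apart by `rate`, enlarging the receptive field."""
    pad = rate
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w))
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += f[di + 1, dj + 1] * xp[pad + di * rate: pad + di * rate + h,
                                          pad + dj * rate: pad + dj * rate + w]
    return out

def aspp(x, filters, rates=(6, 12, 18, 24)):
    """Sum of parallel atrous branches with different rates, as in DeepLab v2:
    each branch sees a different scale, and the outputs are fused."""
    return sum(atrous_conv2d(x, f, r) for f, r in zip(filters, rates))
```

Because each branch uses a different rate, the fused output mixes features computed over small and large neighborhoods of every pixel.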
In the BowtieNet decoder, feature maps are upsampled according to the max pooling indices used in the encoder, similar to SegNet. This upsampling method is called unpooling. It upsamples feature maps and recovers spatial information at the same time. Compared with the decoder network used in U-Net, which upsamples feature maps using deconvolutional layers and recovers spatial information by feature map skip connections, decoder networks composed of unpooling layers contain fewer parameters and are faster.
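Unpooling can be illustrated in NumPy (a single-channel sketch of the idea, not the authors' code): the encoder's max pooling records where each maximum came from, and the decoder writes values back to exactly those positions, with no learned parameters.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the flat index of each maximum."""
    h, w = x.shape
    out = np.zeros((h // k, w // k))
    idx = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            win = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            a = int(np.argmax(win))
            out[i, j] = win.flat[a]
            idx[i, j] = (i * k + a // k) * w + (j * k + a % k)
    return out, idx

def unpool(pooled, idx, shape):
    """SegNet-style unpooling: each value returns to its recorded position;
    all other locations stay zero. No parameters are learned."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
pooled, idx = max_pool_with_indices(x)
recovered = unpool(pooled, idx, x.shape)   # the 4 returns to its original corner
```

This is why unpooling decoders are lighter than deconvolutional ones: spatial placement comes from the stored indices rather than from learned upsampling weights.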
It is worth mentioning that even though BowtieNet is built based on SegNet, it recovers object boundaries even more accurately than SegNet. This is because the output stride of the SegNet encoder equals 32, while the output stride of the BowtieNet encoder equals 8, and a smaller encoder output stride reduces the difficulty of recovering spatial information.
In each frame, humans are segmented by the CNN-based network described above. Then, from the segmented human area, human faces can be segmented according to color information. For other human parts, the segmentation can be realized by skeleton extraction and optical flow tracking.
2.2　Reconstruction
The reconstruction part is designed to reconstruct a 3D model of the segmented body parts. It runs in a different thread from the other parts and processes key frames rather than all frames. Thus, even though the reconstruction runs slowly, it can still update its 3D model in a timely manner. Key frames are selected according to the uncertainty measure $Q$[6]. The 3D model is initialized by the first two key frames. Then, when a new key frame is entered, the 3D model is updated.
During initialization, Shi-Tomasi feature points[15] are extracted from the first key frame. Then, a revised Lucas-Kanade optical flow tracker[6] is applied to track these feature points with 3D aided information. Finally, based on the feature points in the first key frame and their corresponding feature points in the second key frame, a sparse 3D model of each segmented rigid human part is built by the triangulation reconstruction process[16].
After initialization, when a new key frame is entered, the Shi-Tomasi feature points are also extracted first. Then the feature points, which are new for the current 3D model, are selected and tracked to find their corresponding feature points in the nearest key frame. Next, the matching feature point pairs are used for reconstruction, which generates new 3D points in the 3D model. After the above processes, bundle adjustment is applied for optimization.
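The triangulation step[16] can be sketched as a linear (DLT) solve. This is a generic textbook version under a toy two-camera setup of our own; the system's actual implementation and calibration values are not given in the text.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point pair.
    P1, P2: 3x4 projection matrices; x1, x2: 2D image points."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # null vector of A is the homogeneous 3D point
    X = Vt[-1]
    return X[:3] / X[3]              # dehomogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy check: two cameras translated along x, a known 3D point re-projected
K = np.diag([500.0, 500.0, 1.0])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noiseless matches the solve is exact; in practice, the matched Shi-Tomasi feature pairs are noisy, which is why the system follows reconstruction with bundle adjustment.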
2.3　Tracking
The tracking part is designed to predict the camera pose for each frame. Suppose there is a point in the 3D world with coordinates $X$. The imaging process is $x = K[R|t]X$, where $K$ contains the intrinsic parameters of the camera, which can be calibrated beforehand, and $[R|t]$ contains the extrinsic parameters of the camera: it reflects the rotation and the translation of the camera. $[R|t]$ is the camera pose that we aim to calculate. If we have a series of 3D points $X_i$ and their corresponding points in images $x_i$, where $N$ denotes the total number of 3D points, $[R|t]$ can be calculated by minimizing the error function $E$[6]:

$[R|t] = \mathop{\arg\min}_{[R|t]} E$, where $E = \sum_{i=1}^{N} \rho(d(x_i, K[R|t]X_i))$
In the above equation, $d(\cdot,\cdot)$ calculates the distance between the real projection $x_i$ and the reprojection $K[R|t]X_i$ of a 3D point, and $\rho(\cdot)$ is the robust Tukey M-estimator[17].
According to the above description, if we have a series of point pairs $(X_i, x_i)$, we can estimate the camera pose $[R|t]$. The next problem is therefore to find these point pairs. Finding point pairs starts with selecting 3D point candidates $X_i$ from the reconstructed 3D model. Then, the revised Lucas-Kanade tracker is applied to search for $x_i$ for each $X_i$ in the current frame. If it fails to find $x_i$, it removes the 3D point candidate $X_i$. Selecting the 3D point candidates $X_i$ follows two rules: the 3D point candidate should be visible in the current frame; and 3D points that are also visible in the nearby key frames should preferentially be selected.
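The error function above can be written out directly. This is a minimal sketch: the Tukey biweight's standard tuning constant (4.685) is our assumption, and the toy intrinsics below are invented for the check.

```python
import numpy as np

def tukey_rho(r, c=4.685):
    """Tukey's biweight M-estimator: grows like a quadratic near zero,
    then saturates at c^2/6 so large outliers stop influencing the fit."""
    r = np.asarray(r, dtype=float)
    inside = np.abs(r) <= c
    rho = np.full_like(r, c * c / 6.0)
    rho[inside] = (c * c / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return rho

def reprojection_error(K, R, t, X, x):
    """E = sum_i rho(d(x_i, K[R|t]X_i)) over N point pairs.
    X: Nx3 3D points; x: Nx2 observed image points."""
    proj = (K @ (R @ X.T + t.reshape(3, 1))).T   # reproject through K[R|t]
    proj = proj[:, :2] / proj[:, 2:3]            # dehomogenize to pixels
    d = np.linalg.norm(proj - x, axis=1)         # per-point distance d(.,.)
    return tukey_rho(d).sum()
```

The saturation of $\rho$ is what lets the pose estimate tolerate a fraction of bad matches: a few wildly wrong pairs each contribute at most a bounded constant to $E$.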
2.4　Projection
Now, we have obtained the 3D model by reconstruction and the camera pose by tracking. A 3D model of a virtual object, such as a Christmas hat, can be added to the reconstructed 3D model. The virtual object can then be projected onto our new frame with a new background by the camera projection function $x = K[R|t]X$, where $X$ represents the coordinates of a 3D point that is ready to be projected, $K$ contains the intrinsic parameters of the camera, $[R|t]$ is the camera pose predicted by the tracking part, and $x$ is the projection result of the 3D point $X$.
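As a minimal sketch of this projection function (with toy intrinsics of our own, not the calibrated camera used in the paper):

```python
import numpy as np

def project_points(K, R, t, X):
    """Apply x = K[R|t]X to Nx3 world points; returns Nx2 pixel coordinates."""
    Xc = R @ X.T + t.reshape(3, 1)   # world -> camera coordinates
    x = K @ Xc                       # camera -> homogeneous image coordinates
    return (x[:2] / x[2]).T          # dehomogenize

# Toy example: identity pose; a point on the optical axis lands at the principal point
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
uv = project_points(K, np.eye(3), np.zeros(3), np.array([[0.0, 0.0, 2.0]]))
```

Each vertex of the virtual hat model is pushed through this function with the per-frame pose, which is why the rendered hat rotates and translates consistently with the head.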
3 Experiment
A human head is a good example of an approximately rigid object. The proposed AR system is tested on a human head in this section. Meanwhile, the system can change the video background. The task of changing the background is implemented based on the human segmentation results provided by the proposed CNN-based segmentation network, BowtieNet. The task of adding a virtual hat is implemented based on the moving object reconstruction and tracking system described in Section 2. In the following subsections, we first show our experiments on the Baidu Person dataset[18,19,20], which is a human image segmentation dataset, to demonstrate the performance of the proposed BowtieNet. Then, we display our video experimental results of adding a Christmas hat on a human head and changing the background. Our experiments are implemented under Caffe[21], with an Nvidia Titan X Pascal GPU and an Intel E5-2460 CPU.
3.1　Experiments on a human image segmentation dataset
In our system, we propose BowtieNet, which adds the ASPP between its encoder and decoder to segment the humans in each frame. BowtieNet is trained on the Baidu Person dataset. In the following subsections, we give the experimental details of this dataset.
3.1.1 　 Dataset and evaluation method
The Baidu Person dataset contains 5387 training images and 1316 testing images. The ground truth of the testing images is not released and there is no testing website. Therefore, we used only the 5387 training images to train and test our models. We randomly separated the 5387 training images into training, validation, and testing subsets, which contained 3000, 1000, and 1387 images, respectively. Only the 3000 training images were used for training, and there was no extra training dataset.
The performance of different networks was evaluated using intersection over union (IOU). Let $P$ denote the predicted result and $G$ denote the ground truth. IOU is calculated by

$IOU = \frac{|P \cap G|}{|P \cup G|}$

where $|P \cap G|$ denotes the intersection area of $P$ and $G$, and $|P \cup G|$ denotes the union area of $P$ and $G$.
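The metric is straightforward to compute for binary masks; a small sketch with made-up masks:

```python
import numpy as np

def iou(pred, gt):
    """IOU = |P ∩ G| / |P ∪ G| for two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0   # both empty: treat as perfect match

p = np.zeros((4, 4), dtype=bool); p[:2, :2] = True   # 4-pixel prediction
g = np.zeros((4, 4), dtype=bool); g[:2, :4] = True   # 8-pixel ground truth
iou(p, g)   # 4 / 8 = 0.5
```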
3.1.2 　 Training details
All our training and testing images were resized to 256×256 pixels. During training, the learning rate was set to 0.001, the momentum to 0.9, the weight decay to 0.005, and the batch size to 6. The training ended after 50000 iterations; every 10000 iterations cost approximately 2-3h. To augment the training dataset, random mirroring, random resizing between 0.5 and 1.5, and random rotation between −10 and 10 degrees were adopted. In addition, Dice loss[22] was applied during training. The training process was monitored using the validation subset.
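The Dice loss[22] used during training can be sketched as follows. This is a soft, single-image binary version under our own simplifications; the small `eps` smoothing term is our addition for numerical stability.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss (V-Net style): 1 - 2|P∩G| / (|P| + |G|).
    pred: predicted foreground probabilities in [0, 1]; target: binary mask."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

perfect = np.ones((4, 4))
empty = np.zeros((4, 4))
dice_loss(perfect, perfect)   # near 0: perfect overlap
dice_loss(empty, perfect)     # near 1: no overlap
```

Unlike per-pixel cross entropy, the Dice loss directly optimizes an overlap ratio, which helps when the foreground (the human) occupies only part of the frame.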
3.1.3 　 Experimental results and analysis
In this subsection, we evaluate the proposed BowtieNet and compare it with other segmentation networks. The evaluation results are shown in Table 2. We first compare BowtieNet with Pixel-by-Pixel[18], VGG-seg-net[19], and DCAN[20], which were also trained and tested on the Baidu Person dataset. They randomly chose 500 images as the validation dataset and used the remaining 4887 images as the training dataset; VGG-seg-net[19] also used thousands of extra training images. Compared with Pixel-by-Pixel, VGG-seg-net, and DCAN, we therefore used considerably fewer training images and many more testing images. Under this condition, BowtieNet still achieved much better segmentation accuracy. We then compared BowtieNet with other state-of-the-art segmentation networks, such as SegNet, DeepLab, and U-Net. For a fair comparison, we fine-tuned these networks on the Baidu Person dataset with Dice loss. Of these models, DeepLab v3+ was implemented under TensorFlow, and the other models were implemented under Caffe. While fine-tuning DeepLab v3+, we kept the default parameters and data augmentation method provided by its authors; for the other models, we used the same parameters and data augmentation method as in Section 3.1.2. Some examples of segmentation on the testing subset are shown in Figure 4. From Table 2, we can see that BowtieNet is fast enough to run in real time and has the highest accuracy, and from Figure 4, we can see that BowtieNet has the most accurate boundaries.
Table 2　Evaluation results on the Baidu Person dataset

Methods                    Mean IOU (%)            Speed (fps)
                           Validation   Testing
Pixel-by-Pixel[18]         86.70        86.83      0.033
VGG-seg-net[19]            83.57        -          1000
DCAN[20]                   90.89        -          -
SegNet[8]                  90.12        90.00      49.5
DeepLab v2-VGG[10]         91.55        91.37      43.7
U-Net[9]                   90.85        90.87      36.4
DeepLab v2-ResNet[10]      92.66        92.64      22.2
DeepLab v3+[13]            92.83        92.53      34.5
BowtieNet (ours)           93.64        93.42      39.1
Below, we compare BowtieNet with each of the other networks in detail.
(1) BowtieNet vs. SegNet. Compared with SegNet, BowtieNet has an ASPP structure between its encoder and decoder, which improves the size of the receptive field and improves the robustness when segmenting objects at various scales. In addition, compared with SegNet, the output stride of the encoder of BowtieNet is smaller. This allows BowtieNet to recover image details more easily. Therefore, around boundaries, the segmentation accuracy of BowtieNet is higher than that of SegNet.
(2) BowtieNet vs. DeepLab v2-VGG. Compared with DeepLab v2-VGG, BowtieNet has an additional decoder structure, which helps to regain spatial information. Therefore, BowtieNet has better performance than DeepLab v2-VGG around boundaries.
(3) BowtieNet vs. U-Net. Compared with U-Net, BowtieNet has a lighter decoder structure and smaller pooling strides. Therefore, BowtieNet is faster and has more accurate boundaries.
(4) BowtieNet vs. DeepLab v2-ResNet. DeepLab v2-ResNet has deeper layers and can extract higher-level features. BowtieNet has a decoder structure and can obtain more accurate boundaries. In our experimental dataset, the main foregrounds are humans. This is a much easier task than semantic segmentation. Under this condition, networks that can obtain more accurate boundaries have higher segmentation accuracy. Therefore, in our task, BowtieNet is more accurate than DeepLab v2-ResNet.
(5) BowtieNet vs. DeepLab v3+. DeepLab v3+ also combines spatial pyramid pooling and a decoder. However, it upsamples feature maps using 2 steps and adds low-level features only once. In contrast, BowtieNet upsamples feature maps using 3 steps and recovers spatial information in each step. Therefore, BowtieNet can obtain more accurate boundaries than DeepLab v3+.
3.2　Experiments on videos
In this section, we test the proposed system on videos captured with a common RGB camera. The size of each frame was 640×480 pixels. As described in Section 2.1, during segmentation, only ROIs were fed into the CNN segmentation network for acceleration. The ROI in the current frame was the expanded bounding box of the segmentation result of the previous frame. All ROIs were resized to 256×256 pixels before segmentation. BowtieNet outputs human segmentation results, based on which the video background can be changed. After we obtained the human segmentation results, the human faces could be easily extracted using color information. The images of the extracted face were then used for reconstruction and tracking. Based on the reconstructed 3D face model and the computed camera pose, a virtual Christmas hat 3D model was added to the new image with a new background. In Figure 5, we show some of our experimental results, including the human segmentation results provided by BowtieNet, the face and Christmas hat 3D models with camera poses (green box), and finally the new frames with a new background and the virtual Christmas hat.
From the experimental examples shown in Figure 5, we can see that our system performs well in changing the video background and adding a virtual Christmas hat. Particularly, in the first three examples of Figure 5, the human heads obviously have different angles. In the corresponding newly generated frames, which are shown in the last column, the projection results of the Christmas hat are obviously different. This demonstrates that the virtual Christmas hat can rotate with the human head and project different sides into the new frames like a real hat. In addition, to show the robustness of the segmentation part and the tracking part, we show additional examples with different human gestures in Figure 5. The experimental results show that when the human makes large movements, the segmentation can still output accurate results, and the tracking can still track the human face accurately.
To further demonstrate the robustness of our AR system, some more challenging examples are shown in Figure 6. In the first row of examples, frames are captured while the human moves quickly. The videos contain obvious blurs and jitters. Even so, the experimental results show that the proposed AR system can accurately track a human face. In the second row of examples, human faces have obviously different angles. The experimental results demonstrate that the virtual Christmas hat can rotate with the human head. In the third row of examples, part of the human face is covered by a hand or moves out of frame. Under these conditions, the proposed AR system can still add the virtual hat accurately and naturally.
Figure 7 shows the performance of our AR system when the light intensity changes and the camera is moving. It is obvious that the light intensity decreases from left to right. In addition, it is evident that the background of each frame is different from the others, which demonstrates that the camera is not static. The new frames with a new background and virtual hat displayed in the second row of Figure 7 show that our AR system is robust to variations in illumination and camera movement.
In Figure 8, more examples of our AR system are shown. In the first row of this figure, different kinds of virtual hats, such as a straw hat and a baseball hat, are put on the human head. In the second and third rows of this figure, a Christmas hat is put on the head of different people using different gestures, positioned at different distances from the camera, moving at different speeds, and wearing clothes of different colors.
In our AR system, the segmented moving body part can be regarded as a static object, and the relative motion between each moving body part and the camera can be considered the motion of the camera. Thus, tracking the moving body part is equivalent to calculating the relative camera poses. As described in Section 2.3, camera poses are estimated by minimizing an error function calculated over a large number of selected high-quality 3D-2D matching point pairs. If most of the matching point pairs are correct, the estimated camera poses are correct; our tracking system can tolerate a certain degree of mismatches, so it is stable and robust. From the examples shown in Figure 5, Figure 6, Figure 7, and Figure 8, we can see that the virtual hats can be correctly placed on human heads when humans make different gestures, are positioned at different distances from the camera, move at different speeds, wear clothes of different colors, and so on, even when the illumination changes and the cameras are moving. This demonstrates that our tracking system can always provide stable and accurate camera poses, and that its robustness is quite high.
4 Conclusion and future work
In this study, we built a novel AR system for non-rigid humans, using only a monocular camera, that could add a 3D virtual object on an approximately rigid human part while changing the video background. Currently, most applications add virtual accessories on the face of a human by applying 2D stickers. Compared with those applications, our system provides a better sense of reality. An AR system is typically built based on SLAM. However, most current SLAM systems are designed for static scenes or rigid objects, while in our system, we reconstruct and track moving human parts using segmentation by BowtieNet and an optical flow method. The experimental results have shown that the proposed BowtieNet obtains state-of-the-art human image segmentation performance, and is fast enough for real-time segmentation. In addition, experiments on videos have shown that the proposed AR system can robustly add a virtual hat on a human head, and accurately change the video background.
In the present study, we added only a virtual hat. Adding a virtual hat does not require the accurate location of key points on the face. In the future, we will try to accurately locate 3D key points on the human face in the reconstructed 3D face model in order to add additional kinds of virtual accessories, such as glasses, earrings, gauze masks, and so on.

References

1.

Klein G, Murray D. Parallel tracking and mapping for small AR workspaces. In: Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE Computer Society, 2007, 1–10 DOI:10.1109/ismar.2007.4538852

2.

Mur-Artal R, Montiel J M M, Tardos J D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015, 31(5): 1147–1163 DOI:10.1109/tro.2015.2463671

3.

Mur-Artal R, Tardos J D. ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017, 33(5): 1255–1262 DOI:10.1109/tro.2017.2705103

4.

Park Y, Lepetit V, Woo W. Texture-less object tracking with online training using an RGB-D camera. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. New York, USA, IEEE, 2011 DOI:10.1109/ismar.2011.6162879

5.

Ren C Y, Prisacariu V, Murray D, Reid I. STAR3D: simultaneous tracking and reconstruction of 3D objects using RGB-D data. In: 2013 IEEE International Conference on Computer Vision. Sydney, Australia. New York, USA, IEEE, 2013 DOI:10.1109/iccv.2013.197

6.

Feng Y J, Wu Y H, Fan L X. On-line object reconstruction and tracking for 3D interaction. In: 2012 IEEE International Conference on Multimedia and Expo. Melbourne, Australia, IEEE, 2012, 711–716 DOI:10.1109/icme.2012.144

7.

Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 640–651 DOI:10.1109/tpami.2016.2572683

8.

Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481–2495 DOI:10.1109/tpami.2016.2644615

9.

Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation// Lecture Notes in Computer Science. Cham: Springer International Publishing, 2015, 234–241 DOI:10.1007/978-3-319-24574-4_28

10.

Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848 DOI:10.1109/tpami.2017.2699184

11.

Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA, IEEE, 2014, 580–587 DOI:10.1109/cvpr.2014.81

12.

Mostajabi M, Yadollahpour P, Shakhnarovich G. Feedforward semantic segmentation with zoom-out features. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015, 3376–3385 DOI:10.1109/cvpr.2015.7298959

13.

Chen L C, Zhu Y K, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation//Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 833–851 DOI:10.1007/978-3-030-01234-2_49

14.

Chen L C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. 2017, arXiv preprint arXiv:1706.05587

15.

Jianbo S, Tomasi C. Good features to track. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Seattle, WA, 1994, 593–600 DOI:10.1109/cvpr.1994.323794

16.

Hartley R, Zisserman A. Multiple view geometry in computer vision. Cambridge university press, 2003

17.

Huber P J. Robust statistics. Springer Berlin Heidelberg, 2011

18.

Wu Z, Huang Y, Yu Y, Wang L, Tan T. Early hierarchical contexts learned by convolutional networks for image segmentation. In: Proceedings of the 22nd International Conference on Pattern Recognition. Stockholm, Sweden, 2014, 1538–1543 DOI:10.1109/icpr.2014.273

19.

Song C F, Huang Y Z, Wang Z Y, Wang L. 1000fps human segmentation with deep convolutional neural networks. In: Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition. Kuala Lumpur, Malaysia, 2015, 474–478 DOI:10.1109/acpr.2015.7486548

20.

Tesema F B, Wu H, Zhu W. Human segmentation with deep contour-aware network. In: Proceedings of the 2018 International Conference on Computing and Artificial Intelligence. Medan, Indonesia, 2018, 98–103 DOI:10.1145/3194452.3194471

21.

Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T. Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, Florida, USA, ACM, 2014, 675–678 DOI:10.1145/2647868.2654889

22.

Milletari F, Navab N, Ahmadi S A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV). Stanford, CA, 2016, 565–571 DOI:10.1109/3DV.2016.79