# Survey of 3D modeling using depth cameras

State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China

Abstract

Keywords: 3D modeling; depth camera; camera tracking; signed distance function; surfel


Existing 3D modeling approaches fall into two broad categories. The first is image-based modeling, such as multi-view stereo^{[1]} and structure from motion^{[2,3]}: as image sensors are low-cost and easy to deploy, 3D modeling from multi-view 2D images can significantly automate and simplify the modeling process; however, it relies heavily on image quality and is easily affected by lighting and camera resolution.

The second is modeling with active 3D sensors^{[4]}, including 3D scanners and depth cameras: the sensors actively project structured light or laser stripes onto the surface of the object and then reconstruct the spatial shape of objects and scenes, i.e., they sample a large number of 3D points from the surface according to the spatial information decoded from the structured light. The advantage of 3D sensors is that they are robust to interference in the environment and can scan 3D models automatically with high precision. However, professional 3D scanners are expensive, which limits their applicability for ordinary users.

Consumer-level depth cameras sit between these extremes: 3D modeling based on such cameras is capable of generating 3D models with high-resolution details in real time using the captured color and depth images^{[5,6]}, which has attracted the attention of researchers in the field of 3D reconstruction. Researchers in computer graphics and computer vision have continuously innovated algorithms and systems for 3D modeling based on depth cameras and achieved satisfactory results.

**1.1 Preliminaries of depth cameras**

**1.2 Pipeline of 3D modeling based on depth cameras**

**2.1 Volumetric representation**

The volumetric representation dates back to the work of Curless and Levoy^{[7]}. They introduced the concept of representing 3D surfaces using a signed distance function (SDF) sampled at voxels, where the SDF value is defined as the distance of a 3D point to the surface. Thus, the surface of an object or a scene, according to this definition, is the iso-distance surface whose SDF is zero. The free space, i.e., the space outside of the object, is indicated by positive SDF values, because the distance increases as a point moves along the surface normal into the free space. Conversely, the occupied space (the space inside the object) is indicated by negative SDF values. These SDF values are stored in a cube surrounding the object, and the cube is rasterized into voxels. The reconstruction algorithm must define the voxel size and the spatial extent of the cube. Other data, such as colors, are usually stored as voxel attributes. As only voxels close to the actual surface are important, researchers often use the truncated SDF (TSDF) to represent the model. Figure 2 shows the general pipeline of reconstruction based on the volumetric representation. As the TSDF implicitly represents the surface of the object, the surface must be extracted by ray casting before estimating the pose of the camera. The fusion step is a simple weighted average at each voxel center, which is very effective for eliminating depth camera noise.
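The sign convention and truncation described above can be sketched in a few lines. This is a minimal illustration, not any cited system: it samples the analytic SDF of a sphere (a stand-in for a scanned surface) at the voxel centers of a rasterized cube and truncates it; the grid dimension, extent, and truncation band are illustrative values.

```python
import numpy as np

def sdf_sphere(points, center, radius):
    # Signed distance to a sphere: positive outside (free space),
    # negative inside (occupied space), zero on the surface.
    return np.linalg.norm(points - center, axis=-1) - radius

def make_tsdf_volume(dim=32, extent=1.0, trunc=0.1):
    # Rasterize a cube of side `extent` into dim^3 voxels and sample
    # the truncated SDF of a sphere at each voxel center.
    step = extent / dim
    coords = (np.arange(dim) + 0.5) * step
    grid = np.stack(np.meshgrid(coords, coords, coords, indexing="ij"), axis=-1)
    sdf = sdf_sphere(grid, center=np.full(3, extent / 2), radius=extent / 4)
    return np.clip(sdf, -trunc, trunc)  # truncation keeps only near-surface detail

tsdf = make_tsdf_volume()
# Voxels where the TSDF changes sign straddle the zero iso-surface.
```

Ray casting for surface extraction then amounts to marching along each view ray until the sampled TSDF changes sign.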

Whelan et al.^{[8]} proposed a method of dynamically shifting the voxel grid such that it follows the motion of the depth camera. When the voxel grid moves, the method extracts the parts that fall outside the current grid and stores them separately, adding their camera pose to the global pose. Although this achieves a larger scan volume, it requires a large amount of external memory; moreover, the scanned surface cannot be arbitrarily revisited. Roth and Vona^{[9]} proposed a similar idea.

A common strategy for reducing memory consumption is to organize the volume in an octree, as in adaptively sampled distance fields^{[10]}. Although its definition is simple, the sparsity of the nodes makes it difficult to exploit the parallelism of the GPU. Sun et al.^{[11]} established an octree with only leaf nodes to store voxel data and accelerated tracking based on the octree. Zhou et al.^{[12]} used the GPU to construct a complete octree structure to perform Poisson reconstruction on scenes with 300,000 vertices at interactive rates. Zeng et al.^{[13]} implemented a 9- to 10-level octree for KinectFusion and extended the reconstruction results to a medium-sized office. Chen et al.^{[14]} proposed a similar three-level hierarchy with equally good resolution. Steinbrücker et al.^{[15]} represented the scene in a multi-resolution data structure for real-time accumulation on the CPU and output a grid model in real time at approximately 1 Hz.

Nießner et al.^{[16]} proposed a voxel hashing structure. In this method, the index corresponding to the spatial position of a voxel block is stored in a linearized spatial hash table and addressed by a spatial hash function. As only voxels containing spatial information are stored in memory, this strategy significantly reduces memory consumption and theoretically allows voxel blocks of a certain size and resolution to represent infinitely large spaces. Compared with voxel hierarchies, the hashing structure has a great advantage in data insertion and access, whose time complexity is O(1). However, the hash function inevitably maps multiple different spatial voxel blocks to the same location in the hash table. Kahler et al.^{[17]} proposed a different hashing method to reduce the number of hash collisions. Hash-based 3D reconstruction has the advantages of a small memory footprint and efficient computation, making it suitable even for mobile devices, such as Google Tango^{[18]}.
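The scheme can be sketched as follows. This is a simplified illustration in the spirit of voxel hashing^{[16]}, not the paper's implementation: the prime multipliers are the commonly used ones from Teschner et al.'s spatial hash, collisions are resolved by simple chaining, and the block/voxel sizes are illustrative.

```python
import numpy as np

P1, P2, P3 = 73856093, 19349669, 83492791

def block_hash(bx, by, bz, n_buckets=2**20):
    # Spatial hash of an integer block coordinate.
    return ((bx * P1) ^ (by * P2) ^ (bz * P3)) % n_buckets

class VoxelHashMap:
    """Sparse TSDF storage: a dense block is allocated only when touched."""

    def __init__(self, block_size=8, voxel_size=0.01, n_buckets=2**20):
        self.block_size = block_size
        self.voxel_size = voxel_size
        self.n_buckets = n_buckets
        self.table = {}  # bucket index -> list of (block_coord, tsdf_block)

    def block_coord(self, point):
        # World-space point -> integer coordinate of its voxel block.
        s = self.voxel_size * self.block_size
        return tuple(int(np.floor(c / s)) for c in point)

    def allocate(self, point):
        # O(1) expected insertion/lookup; chaining resolves hash collisions.
        key = self.block_coord(point)
        bucket = self.table.setdefault(block_hash(*key, n_buckets=self.n_buckets), [])
        for k, blk in bucket:
            if k == key:
                return blk
        blk = np.ones((self.block_size,) * 3, dtype=np.float32)  # "empty" TSDF
        bucket.append((key, blk))
        return blk

vmap = VoxelHashMap()
vmap.allocate((0.05, 0.02, 0.30))
vmap.allocate((0.05, 0.02, 0.31))  # same block: no new allocation
```

Only surface-adjacent blocks ever enter the table, which is what keeps the memory footprint proportional to the surface area rather than the scanned volume.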

**2.2 Surfel**

The surfel representation originates from the work of Pfister et al.^{[19]}. As shown in Figure 3, a surfel usually contains the following attributes: space coordinates **p** ∈ ℝ^{3}, normal vector **n** ∈ ℝ^{3}, color information **c** ∈ ℝ^{3}, weight (or confidence) **w**, radius **r**, and timestamp **t**. The weight of a surfel is used for the weighted averaging of neighboring 3D points and for distinguishing stable from unstable points; it is generally initialized following the formulation of Keller et al.^{[20]}. The radius of a surfel is determined by the distance of the scene surface from the optical center of the camera: the farther the distance, the larger the radius of the surfel. Compared to the volumetric representation, surfel-based 3D reconstruction is more compact and intuitive for input from depth cameras, eliminating the need to switch back and forth between different representations. Similar to the volumetric representation, octrees can be used to improve surfel-based data structures^{[21,22]}.
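The attribute list above maps directly onto a small record type. The field names and default values below are illustrative, not taken from any cited system:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Surfel:
    # The attributes listed in the text; the 3-vectors live in R^3.
    position: np.ndarray   # p: space coordinates
    normal: np.ndarray     # n: unit normal vector
    color: np.ndarray      # c: RGB color
    weight: float = 1.0    # w: confidence used for weighted averaging
    radius: float = 0.005  # r: local surface extent, grows with depth
    timestamp: int = 0     # t: frame index of the last observation

s = Surfel(position=np.zeros(3),
           normal=np.array([0.0, 0.0, 1.0]),
           color=np.array([0.5, 0.5, 0.5]))
```

A reconstruction is then simply a flat array of such records, which is why the representation is described as compact and close to the sensor data.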

**Pros and cons:** The volumetric representation is more convenient for depth denoising and surface rendering, as the TSDF computed from each frame can be efficiently fused at each voxel, as discussed in the next section. In contrast, the surfel representation can be viewed as an extension of the point cloud representation. When vertex correspondences are needed during registration, they can be computed directly from surfels, whereas the volumetric representation must first be converted to a 3D mesh.

**3.1 Depth map preprocessing**

Mallick et al.^{[23]} classified these noises into the following three categories. The first is missing depth: when the object is too close to or too far from the depth camera, the camera sets the depth value to zero because it cannot measure the depth. Note that even within the measurement range of the camera, zero depth may be produced owing to surface discontinuities, highlights, shadows, or other causes. The second is depth error: when a depth camera records a depth value, its accuracy depends on the depth itself. The third is depth inconsistency: even if the depth of the scene is constant, the measured depth value changes over time.

Typically, bilateral filtering^{[24]} is used to reduce the noise of the depth image, as it smooths the depth while preserving depth discontinuities. After noise reduction, KinectFusion^{[5]} downsamples the depth map into a three-layer pyramid to estimate the pose of the camera from coarse to fine in the subsequent step. For the denoised depth image, the 3D coordinates of each pixel in the camera coordinate system can be calculated by inverse projection according to the intrinsic parameters of the depth camera, forming a vertex map. The normal of each vertex can be obtained as the cross product of vectors formed by the vertex and its adjacent vertices. For the surfel representation, it is also necessary to calculate the radius of each point, which represents the local surface area around the point while minimizing the visible holes between adjacent points, as shown in Figure 3. Reference [25] calculates the vertex radius **r** from the depth value **d** of the vertex and the focal length **f** of the depth camera.
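The inverse projection and cross-product normal computation can be sketched as follows. This is a generic illustration with illustrative intrinsics (fx, fy, cx, cy), not the code of any cited system; border pixels and zero-depth pixels are left unhandled for brevity.

```python
import numpy as np

def depth_to_vertex_map(depth, fx, fy, cx, cy):
    # Inverse projection: pixel (u, v) with depth d maps to the camera-space
    # point (d*(u-cx)/fx, d*(v-cy)/fy, d). Zero depth marks missing data.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = depth * (u - cx) / fx
    y = depth * (v - cy) / fy
    return np.dstack([x, y, depth])

def vertex_to_normal_map(vertices):
    # Normal at each pixel from the cross product of the vectors to the
    # right and lower neighbours; borders are left as zero vectors.
    normals = np.zeros_like(vertices)
    dx = vertices[1:-1, 2:] - vertices[1:-1, 1:-1]
    dy = vertices[2:, 1:-1] - vertices[1:-1, 1:-1]
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    normals[1:-1, 1:-1] = n / np.maximum(norm, 1e-12)
    return normals

depth = np.full((4, 4), 2.0)  # synthetic fronto-parallel plane at 2 m
verts = depth_to_vertex_map(depth, fx=525.0, fy=525.0, cx=2.0, cy=2.0)
norms = vertex_to_normal_map(verts)
```

For the synthetic planar depth map, every interior normal comes out parallel to the optical axis, as expected.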

**3.2 Camera tracking**

The camera pose of each frame is represented by a rigid transformation matrix **T**, which maps the camera coordinate system to the world coordinate system. Given the position of a point **p**_{l} in the local camera coordinate system, it can be converted to its 3D coordinate in the world coordinate system, denoted by **p**_{g}, by the following equation:

**p**_{g} = **T** **p**_{l}.

A classical way to estimate this rigid transformation between adjacent frames is the iterative closest point (ICP) algorithm^{[26,27]}: matching points between the two frames are determined, and the sum of squared Euclidean distances between these pairs is minimized. This step is iterated until a convergence criterion is satisfied. A serious problem with this frame-to-frame strategy is that the registration errors between consecutive frames accumulate as the scan progresses.
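The rigid-body mapping above can be written out in homogeneous coordinates; the rotation and translation values in this sketch are illustrative.

```python
import numpy as np

def make_pose(R, t):
    # Rigid transformation T = [R | t; 0 0 0 1] mapping camera to world.
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def to_world(T, p_local):
    # p_g = T p_l in homogeneous coordinates.
    p = T @ np.append(p_local, 1.0)
    return p[:3]

# 90-degree rotation about the z-axis plus a translation along x.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T = make_pose(Rz, t=np.array([1.0, 0.0, 0.0]))
p_world = to_world(T, np.array([1.0, 0.0, 0.0]))  # -> (1, 1, 0)
```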

To reduce drift, frame-to-model tracking registers each input frame against the reconstructed model instead of the previous frame^{[6,8,16]}. Although frame-to-model tracking significantly reduces per-frame tracking drift, it does not completely solve the problem of error accumulation, as tracking errors still accumulate over time. The accumulated drift eventually leads to misalignment of the reconstructed surface at the loop closure of the trajectory. Therefore, researchers have introduced global pose optimization. Henry et al.^{[29]} proposed the concept of the keyframe: a keyframe is produced whenever the accumulated camera pose change exceeds a threshold. Loop closure detection is performed only when a keyframe occurs, and the random sample consensus algorithm^{[30]} is used to match feature points between the current and previous keyframes. After a loop closure is detected, there are two global optimization strategies to choose from. The first is pose graph optimization, where a pose graph represents the constraints between frames and each edge corresponds to a geometric constraint. The second is sparse bundle adjustment, which globally minimizes the re-projection error of the feature points matched across all frames^{[31,32,33]}. Zhou et al.^{[34]} proposed the use of interest points to preserve local details, combined with global pose optimization to evenly distribute registration errors over the scene. Zhou et al.^{[35]} also proposed segmenting a video sequence into multiple segments using frame-to-model integration, reconstructing a locally accurate scene fragment from each segment, establishing dense correspondences between overlapping fragments, and optimizing a global energy function to align the fragments. The optimization can subtly deform the fragments to correct inconsistencies caused by low-frequency distortion in the input images. Whelan et al.^{[36]} used depth and color information to perform global pose estimation for each input frame. The depth and color images used for registration are predicted from the global model using the surface splatting algorithm^{[19]}. They divided the loop closure problem into global and local loop closures. Global loop closure detection uses the randomized fern encoding method^{[37]}. For loop closure optimization, they used a spatial deformation method based on the embedded deformation graph^{[38]}. A deformation graph consists of a set of nodes and edges distributed over the model to be deformed. Each surfel is influenced by a set of nodes, and the influence weight is inversely proportional to the distance of the surfel to the node.
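The node-influence idea can be sketched as follows. This is an illustration of inverse-distance weighting as described in the text, not the exact weighting of the cited systems (which use their own falloff functions); the choice of k and the node positions are illustrative.

```python
import numpy as np

def node_influence_weights(surfel, nodes, k=4, eps=1e-8):
    # Each surfel is influenced by its k nearest deformation-graph nodes,
    # with weights decreasing with distance and normalized to sum to 1.
    d = np.linalg.norm(nodes - surfel, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + eps)  # inverse-distance weighting
    return idx, w / w.sum()

nodes = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [2.0, 0.0, 0.0]])
idx, w = node_influence_weights(np.array([0.1, 0.0, 0.0]), nodes, k=2)
```

Deforming the model then amounts to blending the rigid transformations of the selected nodes with these weights.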

For feature-based registration, the scale-invariant feature transform (SIFT)^{[39]} is a popular choice. The speeded-up robust features (SURF) algorithm^{[40]} improves upon SIFT in the speed of detecting feature points. The ORB feature descriptor is an alternative to SIFT and SURF that is faster but correspondingly less stable. Endres et al.^{[41]} exploited these three feature descriptors when extracting feature points and compared their effects. Whelan et al.^{[8]} explored fast feature-based visual odometry instead of the ICP method in KinectFusion and adopted SURF descriptors for loop closure detection. Zhou et al.^{[42]} proposed a method to track the camera based on object contours, introducing corresponding points of contour features to ensure more stable tracking. The recent BundleFusion method^{[43]} first uses SIFT features for coarse registration and subsequently dense methods for fine registration.

Searching for corresponding points over the full model is expensive; projective data association^{[44]} significantly speeds up this process. This strategy projects the 3D coordinates of an input point to a pixel on the target depth map according to the camera pose and the intrinsic parameters of the camera; it then obtains the best corresponding point by searching the neighborhood of that pixel. There are many ways to measure the error between corresponding points, such as the point-to-point metric^{[26]} and the point-to-plane metric^{[27]}. Compared to the point-to-point metric, the point-to-plane metric results in faster convergence of the camera tracking and is thus widely used. In addition to geometric distances, many methods use photometric differences (i.e., color differences) between corresponding points as constraints^{[43,44,45]}. Lefloch et al.^{[46]} considered curvature information as an independent surface property for real-time reconstruction; they took the curvature of the input depth map and of the global model into account when determining dense correspondences.
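The two ingredients above, projective association and the point-to-plane residual, can be sketched as follows. This is a simplified illustration with illustrative intrinsics; it reads only the hit pixel rather than searching its neighborhood, and omits the normal- and distance-compatibility checks real systems apply.

```python
import numpy as np

def projective_association(src_points, target_depth, fx, fy, cx, cy):
    # Project each source point into the target depth map and take the
    # candidate correspondence at the hit pixel: an O(1) lookup per point
    # instead of a costly nearest-neighbour search.
    matches = []
    h, w = target_depth.shape
    for i, (x, y, z) in enumerate(src_points):
        if z <= 0:
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < w and 0 <= v < h and target_depth[v, u] > 0:
            d = target_depth[v, u]
            q = np.array([d * (u - cx) / fx, d * (v - cy) / fy, d])
            matches.append((i, q))
    return matches

def point_to_plane_error(p, q, n):
    # Point-to-plane metric: distance of p from the tangent plane at q,
    # which converges faster than the point-to-point metric |p - q|.
    return float(np.dot(p - q, n))

target_depth = np.full((4, 4), 2.0)
matches = projective_association([np.array([0.0, 0.0, 2.0])],
                                 target_depth, 525.0, 525.0, 2.0, 2.0)
err = point_to_plane_error(np.array([0.0, 0.0, 2.1]),
                           matches[0][1], np.array([0.0, 0.0, 1.0]))
```

Minimizing the sum of squared point-to-plane errors over all matches, and iterating, yields the ICP pose update.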

**3.3 Fusion of depth images**

**Volumetric representation:** Compared to computing a complete SDF for all voxels, the TSDF is usually calculated because its computational cost is more affordable. After camera tracking, each voxel center is first projected onto the input depth map to compute the TSDF value corresponding to the current frame. The new TSDF value is then fused into the TSDF of the global model by weighted averaging, which is effective in removing depth noise.
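The weighted-average update can be written in a few lines. This is a generic sketch of the running-average rule; the per-frame weight of 1 and the weight cap (a common practical detail that keeps the model responsive to new observations) are illustrative choices, not mandated by any cited system.

```python
import numpy as np

def fuse_tsdf(tsdf, weight, new_tsdf, new_weight=1.0, max_weight=64.0):
    # Running weighted average at each voxel:
    #   D <- (W*D + w*d) / (W + w),   W <- min(W + w, max_weight)
    fused = (weight * tsdf + new_weight * new_tsdf) / (weight + new_weight)
    return fused, np.minimum(weight + new_weight, max_weight)

tsdf = np.zeros((2, 2, 2))
weight = np.zeros((2, 2, 2))
for observation in [0.4, 0.2, 0.3]:  # noisy per-frame TSDF observations
    tsdf, weight = fuse_tsdf(tsdf, weight, np.full((2, 2, 2), observation))
# tsdf now holds the running mean 0.3 at every voxel
```

Averaging independent per-frame estimates at each voxel is exactly why this step suppresses depth noise so effectively.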

**Surfel:** After estimating the camera pose of the current input frame, each vertex with its normal and radius is integrated into the global model. The fusion of depth maps using surfels has three steps: data association, weighted averaging, and removal of invalid points^{[20]}. First, corresponding points are determined by projecting the vertices of the current 3D model onto the image plane of the current camera. As several model points might be projected onto the same pixel, supersampling is used to improve the accuracy. If a corresponding point is found, the most reliable point is merged with the new point estimate using a weighted average; otherwise, the new point estimate is added to the global model as an unstable point. Over time, the global model is cleaned up to eliminate outliers according to visibility and time constraints.
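The weighted-average merge for a single surfel can be sketched as follows (field layout as in Section 2.2; the numeric values are illustrative, and the normal is simply renormalized after blending):

```python
import numpy as np

def merge_surfel(pos, nrm, w, new_pos, new_nrm, new_w):
    # Confidence-weighted average of a model surfel with a new measurement;
    # the updated weight accumulates confidence over time.
    total = w + new_w
    pos = (w * pos + new_w * new_pos) / total
    nrm = w * nrm + new_w * new_nrm
    nrm = nrm / np.linalg.norm(nrm)
    return pos, nrm, total

pos, nrm, w = merge_surfel(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), 3.0,
                           np.array([0.0, 0.0, 1.04]), np.array([0.0, 0.0, 1.0]), 1.0)
```

Because the model surfel carries more accumulated weight, the merged position moves only slightly toward the new, noisier measurement.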

**3.4 3D reconstruction of dynamic scenes**

**Volumetric representation:** DynamicFusion^{[47]}, proposed by Newcombe et al., was the first method to reconstruct a non-rigidly deforming scene in real time from the input of a depth camera. It reconstructs an implicit volumetric representation, similar to KinectFusion, and simultaneously optimizes the rigid and non-rigid motion of the scene based on a warp field parameterized by a sparse deformation graph^{[38]}. The main limitation of DynamicFusion is that it cannot handle topology changes that occur during capture. Dou et al. proposed the Fusion4D method^{[48]}, which uses eight depth cameras to acquire multi-view input, combines volumetric fusion with the estimation of a smooth deformation field, and fuses depth information from multiple cameras in a way that applies to complex scenes. In addition, it introduced a machine-learning-based correspondence detection method that robustly handles fast frame-to-frame motion. Recently, Yu et al. proposed a real-time reconstruction system called DoubleFusion^{[49]}. This system combines volumetric dynamic reconstruction with the skeletal body model SMPL of reference [50] and can simultaneously reconstruct detailed geometry, non-rigid motion, and the inner body shape from a single depth camera. One of its key contributions is a two-layer representation consisting of a fully parametric inner shape and a gradually fused outer surface layer. A predefined node graph on the body surface parameterizes non-rigid deformations near the body, while a free-form dynamically changing graph parameterizes the outer surface layer far from the body for more general reconstruction. Furthermore, a joint motion tracking method based on the two-layer representation was proposed to make motion tracking robust and fast. Guo et al.^{[51]} proposed a shading-based scheme that leverages appearance information for motion estimation and fuses the computed albedo into the volume. The incorporation of albedo information significantly improves the accuracy of the reconstructed motion of dynamic objects.

**Surfel:** Keller et al.^{[20]} used a foreground segmentation method to process dynamic scenes. Their algorithm removes moving surfaces from the global model so that the dynamic parts do not affect camera tracking. As shown in Figure 5, a person seated on a chair is initially reconstructed and then begins to move. The system segments the moving person in the ICP step by ignoring the dynamic part (A) in Figure 5, thus keeping the camera tracking robust to dynamic motion.

Reference [52] proposed using motion or semantic cues to segment the scene into a background and several rigid foreground objects, simultaneously tracking and reconstructing their 3D geometry over time; it is one of the earliest real-time methods for dense tracking and fusion of multiple objects. Two segmentation strategies were proposed: motion segmentation (grouping points whose 3D motion is consistent) and object instance segmentation (detecting and segmenting individual objects in RGB images for a given semantic tag). Once detected and segmented, objects are added to an active model list and tracked, and their 3D shape models are updated by fusing the data belonging to each object. Gao et al.^{[53]} proposed using a deformation field to align the depth input of the current frame with a reference geometry (template). The deformation field is initialized with that of the previous frame and then optimized by the ICP algorithm, similar to DynamicFusion. Once the deformation field is updated, data fusion accumulates the current depth observations into the global geometric model. From the updated global model, the global deformation field is inverted for the registration of the next frame.

**3.5 Scene understanding**

Reference [55] proposed a region-growing method to detect and label new planes, using planarity and orthogonality constraints to reduce camera tracking drift and the bending of flat regions. Dzitsiuk et al.^{[56]} fit planes to the signed distance function values of the volume using robust least squares and merged the detected local planes into globally continuous planes. By using the detected planes to correct the SDF values, the noise on flat surfaces can be significantly reduced. Furthermore, they used plane priors to directly obtain object-level and semantic segmentations (such as walls) of the indoor environment. Shi et al.^{[57]} proposed a method for predicting the co-planarity of image patches for RGB-D image registration. They trained a deep network in a self-supervised manner to predict whether two image patches are coplanar and incorporated coplanarity constraints when solving for camera poses. Experiments demonstrated that this use of co-planarity remains effective when feature-point-based methods fail to achieve loop closure.

The system proposed by Salas-Moreno et al.^{[58]} is one of the earliest object-oriented mapping methods. They used point-pair features to detect objects and reconstructed 3D maps at the object level. Although this scheme extends the SLAM system to object-level loop closure processing, it requires detailed 3D models of the objects to be preprocessed in advance to learn the 3D feature descriptors. Tateno et al.^{[59]} incrementally segmented the reconstructed regions in TSDF volumes and used the oriented, unique, and repeatable clustered viewpoint feature histogram^{[60]} descriptor to directly match objects in a database. Xu et al.^{[61]} proposed a recursive 3D attention model for active 3D object recognition by a robot. The model performs both instance-level shape classification and continuous next-best-view regression. They inserted the retrieved 3D models into the scene and built the 3D scene model step by step. In reference [63], the 3D indoor scene scanning task was decomposed into concurrent subtasks for multiple robots to explore unknown regions simultaneously, where the target regions of the subtasks were computed via optimal transportation optimization.

The method of reference [64] adopted the convolutional neural network (CNN) model^{[65]} proposed by Noh et al. to perform 2D semantic segmentation. A Bayesian framework for 2D-3D label transfer was used to fuse the 2D semantic segmentation labels into the 3D model, which was then post-processed with a conditional random field. Rünz et al.^{[54]} adopted a mask region-based CNN model^{[66]} to detect object instances, as shown in Figure 6. They applied ElasticFusion^{[36]}, using surfels to represent each object and the static background, for real-time tracking and dense reconstruction of dynamic instances. Differing from the above two methods, which semantically label the scene at the object level, Hu et al.^{[67]} proposed a method for reconstructing a single object while labeling each of its parts. To minimize the labeling effort of the user, they used an active self-learning method to generate the data required by the neural network. Schönberger et al.^{[68]} obtained 3D descriptors by jointly encoding high-level 3D geometric and semantic information. Experiments have shown that such descriptors enable reliable localization even under extreme variations in viewpoint, illumination, and geometry.

**4.1 Offline texture reconstruction**

Classical offline approaches align images to geometry via various strategies^{[69,70]}, including the alignment of image and geometric features^{[71,72]} and mutual information maximization between the projected images^{[73,74]}. While these methods can effectively handle camera calibration inaccuracies, they cannot handle inaccurate geometry and the optical distortions in RGB images, which are common in data captured by consumer-level depth cameras. Zhou and Koltun^{[75]} addressed these problems by optimizing the RGB camera poses together with a non-rigid warping of each associated image to maximize photo-consistency. In contrast, Bi et al.^{[76]} proposed a patch-based optimization method that synthesizes a set of feature-aligned color images, producing high-quality texture mapping even with large geometric errors. Huang et al.^{[66]} proposed several improvements to the method of Zhou and Koltun^{[75]} for 3D indoor scene reconstruction (see Figure 7). After computing a primitive abstraction of the scene, they first corrected the color values by compensating for varying exposure and white balance; then, the RGB images were aligned by an optimization based on sparse features and dense photo-consistency. Finally, a temporally coherent sharpening algorithm was developed to obtain clean textures for the reconstructed 3D model.

**4.2 Online texture reconstruction**

The method of reference [77] declines to fuse the color values at object edges into the model, as doing so results in artifacts or discontinuities. In subsequent work^{[78]}, they estimated the position and orientation of the light source in the scene to further reject samples containing specular highlights from the fusion. Although this significantly improves the visual quality of the texture, artifacts may still occur. As high dynamic range (HDR) images provide more detail than low dynamic range images, they have also been adopted in 3D reconstruction based on RGB-D cameras^{[79,80,81]}. First, a pre-calibrated response curve is used to linearize the observed intensity values. Then, HDR pixel colors are computed based on the exposure times and fused into the texture of the 3D model in real time. A recent contribution by Xu et al.^{[82]} proposed a dynamic atlas texturing scheme that warps and updates dynamic textures on the fused geometry in real time, significantly improving the visual quality of 3D human modeling.
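The linearize-then-fuse pipeline can be sketched as follows. This is a toy illustration, not the cited systems' code: the response curve here is exactly linear (so linearization is the identity), and observations are averaged uniformly, whereas practical HDR fusion weights samples by how well-exposed they are.

```python
import numpy as np

def linearize(intensity, response):
    # Invert the pre-calibrated camera response curve: map observed 8-bit
    # intensities back to (relative) linear irradiance times exposure.
    return response[intensity]

def fuse_hdr(observations, exposures, response):
    # Divide by exposure time to recover relative radiance, then average
    # the per-frame estimates into one HDR color per pixel.
    radiances = [linearize(o, response) / t
                 for o, t in zip(observations, exposures)]
    return np.mean(radiances, axis=0)

# Toy linear response curve over 8-bit intensities.
response = np.linspace(0.0, 1.0, 256)
obs = [np.array([[128, 64]], dtype=np.uint8),   # short exposure
       np.array([[255, 128]], dtype=np.uint8)]  # double exposure
hdr = fuse_hdr(obs, exposures=[1.0, 2.0], response=response)
```

Both frames agree on the underlying radiance once exposure is factored out, which is precisely what makes the fused texture consistent across differently exposed views.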

^{[83]}. Second, moving objects in a scene are challenging for depth camera tracking. It is necessary to explore how to accurately separate moving objects from the static scene in the captured depth images to improve the accuracy of camera tracking against the static scene. One possible solution is to integrate object detection with motion flow, where objects whose movement conforms to the camera motion should be static with respect to the camera. Third, to support depth cameras integrated into mobile phones, real-time and energy-efficient depth data processing and camera tracking algorithms must be developed for mobile phone hardware. Convergence acceleration algorithms, such as Anderson acceleration^{[84]}, can be explored to speed up the camera tracking.

Reference

Seitz S M, Curless B, Diebel J, Scharstein D, Szeliski R. A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 1 (CVPR'06). New York, NY, USA, 2006, 519–528 DOI:10.1109/cvpr.2006.19

Davison A J. Real-time simultaneous localisation and mapping with a single camera. In: Proceedings of the Ninth IEEE International Conference on Computer Vision. Nice, France, IEEE, 2003 DOI:10.1109/iccv.2003.1238654

Klein G, Murray D. Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. Nara, Japan, IEEE, 2007 DOI:10.1109/ismar.2007.4538852

Rusinkiewicz S, Hall-Holt O, Levoy M. Real-time 3D model acquisition. ACM Transactions on Graphics, 2002, 21(3): 438–446 DOI:10.1145/566654.566600

Izadi S, Kim D, Hilliges O, Molyneaux D, Newcombe R, Kohli P, Shotton J, Hodges S, Freeman D, Davison A, Fitzgibbon A. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011, 559–568 DOI:10.1145/2047196.2047270

Newcombe R A, Davison A J, Izadi S, Kohli P, Hilliges O, Shotton J, Molyneaux D, Hodges S, Kim D, Fitzgibbon A. KinectFusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. Basel, New York, USA, IEEE, 2011, 127–136 DOI:10.1109/ismar.2011.6092378

Curless B, Levoy M. A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques-SIGGRAPH '96. New York, USA, ACM Press, 1996, 303–312 DOI:10.1145/237170.237269

Whelan T, Kaess M, Fallon M, Johannsson H, Leonard J, McDonald J. Kintinuous: Spatially extended KinectFusion. RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012

Roth H, Vona M. Moving volume KinectFusion. In: Proceedings of the British Machine Vision Conference. British Machine Vision Association, 2012, 1–11 DOI:10.5244/c.26.112

Frisken S F, Perry R N, Rockwood A P, Jones T R. Adaptively sampled distance fields. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques-SIGGRAPH '00. New York, USA, ACM Press, 2000, 249–254 DOI:10.1145/344779.344899

Sun X, Zhou K, Stollnitz E, Shi J Y, Guo B N. Interactive relighting of dynamic refractive objects. ACM Transactions on Graphics, 2008, 27(3): 1–9 DOI:10.1145/1360612.1360634

Zhou K, Gong M M, Huang X, Guo B N. Data-parallel octrees for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 2011, 17(5): 669–681 DOI:10.1109/tvcg.2010.75

Zeng M, Zhao F K, Zheng J X, Liu X G. Octree-based fusion for realtime 3D reconstruction. Graphical Models, 2013, 75(3): 126–136 DOI:10.1016/j.gmod.2012.09.002

Chen J W, Bautembach D, Izadi S. Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics, 2013, 32(4): 1–16 DOI:10.1145/2461912.2461940

Steinbrucker F, Sturm J, Cremers D. Volumetric 3D mapping in real-time on a CPU. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, IEEE, 2014, 2021–2028 DOI:10.1109/icra.2014.6907

Nießner M, Zollhöfer M, Izadi S, Stamminger M. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 2013, 32(6): 1–11 DOI:10.1145/2508363.2508374

Kahler O, Adrian Prisacariu V, Yuheng Ren C, Sun X, Torr P, Murray D. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(11): 1241–1250 DOI:10.1109/tvcg.2015.2459891

Dryanovski I, Klingensmith M, Srinivasa S S, Xiao J Z. Large-scale, real-time 3D scene reconstruction on a mobile device. Autonomous Robots, 2017, 41(6): 1423–1445 DOI:10.1007/s10514-017-9624-2

Pfister H, Zwicker M, van Baar J, Gross M. Surfels: Surface elements as rendering primitives. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques-SIGGRAPH '00. ACM Press, 2000, 335–342 DOI:10.1145/344779.344936

Keller M, Lefloch D, Lambers M, Izadi S, Weyrich T, Kolb A. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In: 2013 International Conference on 3D Vision. Seattle, WA, USA, IEEE, 2013, 1–8 DOI:10.1109/3dv.2013.9

Stuckler J, Behnke S. Integrating depth and color cues for dense multi-resolution scene mapping using RGB-D cameras. In: 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). Hamburg, Germany, IEEE, 2012 DOI:10.1109/mfi.2012.6343050

Stückler J, Behnke S. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation, 2014, 25(1): 137–147 DOI:10.1016/j.jvcir.2013.02.008

Mallick T, Das P P, Majumdar A K. Characterizations of noise in kinect depth images: A review. IEEE Sensors Journal, 2014, 14(6): 1731–1740 DOI:10.1109/jsen.2014.2309987

Tomasi C, Manduchi R. Bilateral filtering for gray and color images. In: Proceedings of the IEEE International Conference on Computer Vision, 1998, 839–846 DOI:10.1109/iccv.1998.710815

Salas-Moreno R F, Glocken B, Kelly P H J, Davison A J. Dense planar SLAM. In: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany, IEEE, 2014 DOI:10.1109/ismar.2014.6948422

Besl P J, McKay N D. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence,1992, 14(2): 239–256 DOI:10.1109/34.121791

Chen Y, Medioni G. Object modeling by registration of multiple range images. In: Proceedings of 1991 IEEE International Conference on Robotics and Automation. Sacramento, CA, IEEE Computer Society Press, 1991, 2724–2729 DOI:10.1109/robot.1991.132043

Kerl C, Sturm J, Cremers D. Dense visual SLAM for RGB-D cameras. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. Tokyo, IEEE, 2013 DOI:10.1109/iros.2013.6696650

Henry P, Krainin M, Herbst E, Ren X F, Fox D. RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments. In: Experimental Robotics. Berlin, Heidelberg, Springer, 2014, 477–491 DOI:10.1007/978-3-642-28572-1_33

Fischler M A, Bolles R C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981, 24(6): 381–395 DOI:10.1145/358669.358692

Triggs B, McLauchlan P F, Hartley R I, Fitzgibbon A W. Bundle adjustment: a modern synthesis. In: Vision Algorithms: Theory and Practice. Berlin, Heidelberg, Springer, 2000, 298–372 DOI:10.1007/3-540-44480-7_21

Lourakis M I A, Argyros A A. SBA: a software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 2009, 36(1): 1–30

Konolige K. Sparse sparse bundle adjustment. In: Proceedings of the British Machine Vision Conference. Aberystwyth, British Machine Vision Association, 2010 DOI:10.5244/c.24.102

Zhou Q Y, Miller S, Koltun V. Elastic fragments for dense scene reconstruction. In: 2013 IEEE International Conference on Computer Vision. Sydney, Australia, IEEE, 2013 DOI:10.1109/iccv.2013.65

Zhou Q Y, Koltun V. Dense scene reconstruction with points of interest. ACM Transactions on Graphics, 2013, 32(4): 1–8 DOI:10.1145/2461912.2461919

Whelan T, Leutenegger S, Salas-Moreno R F, Glocker B, Davison A. ElasticFusion: dense SLAM without a pose graph. In: Robotics: Science and Systems XI. Robotics: Science and Systems Foundation, 2015 DOI:10.15607/rss.2015.xi.001

Glocker B, Shotton J, Criminisi A, Izadi S. Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(5): 571–583 DOI:10.1109/tvcg.2014.2360403

Sumner R W, Schmid J, Pauly M. Embedded deformation for shape manipulation. ACM Transactions on Graphics, 2007, 26(3): 80 DOI:10.1145/1276377.1276478

Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91–110 DOI:10.1023/b:visi.0000029664.99615.94

Bay H, Ess A, Tuytelaars T, van Gool L. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 2008, 110(3): 346–359 DOI:10.1016/j.cviu.2007.09.014

Endres F, Hess J, Engelhard N, Sturm J, Cremers D, Burgard W. An evaluation of the RGB-D SLAM system. In: 2012 IEEE International Conference on Robotics and Automation. St Paul, MN, USA, IEEE, 2012 DOI:10.1109/icra.2012.6225199

Zhou Q Y, Koltun V. Depth camera tracking with contour cues. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015, 632–638 DOI:10.1109/cvpr.2015.7298662

Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel H P, Xu W, Casas D, Theobalt C. VNect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 2017, 36(4): 1–14 DOI:10.1145/3072959.3073596

Blais G, Levine M D. Registering multiview range data to create 3D computer objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995, 17(8): 820–824 DOI:10.1109/34.400574

Steinbrücker F, Kerl C, Cremers D. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In: 2013 IEEE International Conference on Computer Vision. Sydney, Australia, IEEE, 2013, 3264–3271 DOI:10.1109/iccv.2013.405

Lefloch D, Kluge M, Sarbolandi H, Weyrich T, Kolb A. Comprehensive use of curvature for robust and accurate online surface reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2349–2365 DOI:10.1109/tpami.2017.2648803

Newcombe R A, Fox D, Seitz S M. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, IEEE, 2015 DOI:10.1109/cvpr.2015.7298631

Orts-Escolano S, Rhemann C, Fanello S, Chang W, Kowdle A, Degtyarev Y, Kim D, Davidson P L, Khamis S, Dou M, Tankovich V, Loop C, Cai Q, Chou P A, Mennicken S, Valentin J, Pradeep V, Wang S, Kang S B, Kohli P, Lutchyn Y, Keskin C, Izadi S. Holoportation: virtual 3D teleportation in real-time. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology. Tokyo, Japan, ACM, 2016, 741–754 DOI:10.1145/2984511.2984517

Yu T, Zheng Z R, Guo K W, Zhao J H, Dai Q H, Li H, Pons-Moll G, Liu Y B. DoubleFusion: real-time capture of human performances with inner body shapes from a single depth sensor. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, IEEE, 2018 DOI:10.1109/cvpr.2018.00761

Loper M, Mahmood N, Romero J, Pons-Moll G, Black M J. SMPL: a skinned multi-person linear model. ACM Transactions on Graphics, 2015, 34(6): 1–16 DOI:10.1145/2816795.2818013

Guo K W, Xu F, Yu T, Liu X Y, Dai Q H, Liu Y B. Real-time geometry, albedo and motion reconstruction using a single RGBD camera. ACM Transactions on Graphics, 2017, 36(4): 1 DOI:10.1145/3072959.3126786

Runz M, Agapito L. Co-fusion: Real-time segmentation, tracking and fusion of multiple objects. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017, 4471–4478 DOI:10.1109/icra.2017.7989518

Gao W, Tedrake R. SurfelWarp: efficient non-volumetric single view dynamic reconstruction. In: Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, 2018

Runz M, Buffier M, Agapito L. MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects. In: 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany, IEEE, 2018 DOI:10.1109/ismar.2018.00024

Zhang Y Z, Xu W W, Tong Y Y, Zhou K. Online structure analysis for real-time indoor scene reconstruction. ACM Transactions on Graphics, 2015, 34(5): 1–13 DOI:10.1145/2768821

Dzitsiuk M, Sturm J, Maier R, Ma L N, Cremers D. De-noising, stabilizing and completing 3D reconstructions on-the-go using plane priors. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017 DOI:10.1109/icra.2017.7989457

Shi Y F, Xu K, Nießner M, Rusinkiewicz S, Funkhouser T. PlaneMatch: patch coplanarity prediction for robust RGB-D reconstruction. In: Computer Vision–ECCV 2018. Cham, Springer International Publishing, 2018, 767–784 DOI:10.1007/978-3-030-01237-3_46

Salas-Moreno R F, Newcombe R A, Strasdat H, Kelly P H J, Davison A J. SLAM++: simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013, 1352–1359 DOI:10.1109/cvpr.2013.178

Tateno K, Tombari F, Navab N. When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). Stockholm, Sweden, IEEE, 2016, 2295–2302 DOI:10.1109/icra.2016.7487378

Aldoma A, Tombari F, Rusu R B, Vincze M. OUR-CVFH: oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation. In: Lecture Notes in Computer Science. Berlin, Heidelberg, Springer, 2012, 113–122 DOI:10.1007/978-3-642-32717-9_12

Xu K, Shi Y F, Zheng L T, Zhang J Y, Liu M, Huang H, Su H, Cohen-Or D, Chen B Q. 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics, 2016, 35(6): 1–14 DOI:10.1145/2980179.2980224

Huang J W, Dai A, Guibas L, Nießner M. 3DLite: towards commodity 3D scanning for content creation. ACM Transactions on Graphics, 2017, 36(6): 1–14 DOI:10.1145/3130800.3130824

Dong S Y, Xu K, Zhou Q, Tagliasacchi A, Xin S Q, Nießner M, Chen B Q. Multi-robot collaborative dense scene reconstruction. ACM Transactions on Graphics, 2019, 38(4): 1–16 DOI:10.1145/3306346.3322942

McCormac J, Handa A, Davison A, Leutenegger S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017, 4628–4635 DOI:10.1109/icra.2017.7989538

Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015, 1520–1528 DOI:10.1109/iccv.2015.178

He K M, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, IEEE, 2017, 2980–2988 DOI:10.1109/iccv.2017.322

Hu R Z, Wen C, van Kaick O, Chen L M, Lin D, Cohen-Or D, Huang H. Semantic object reconstruction via casual handheld scanning. ACM Transactions on Graphics, 2019, 37(6): 1–12 DOI:10.1145/3272127.3275024

Schonberger J L, Pollefeys M, Geiger A, Sattler T. Semantic visual localization. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, IEEE, 2018

Bernardini F, Martin I M, Rushmeier H. High-quality texture reconstruction from multiple scans. IEEE Transactions on Visualization and Computer Graphics, 2001, 7(4): 318–332 DOI:10.1109/2945.965346

Pulli K, Shapiro L G. Surface reconstruction and display from range and color data. Graphical Models, 2000, 62(3): 165–201 DOI:10.1006/gmod.1999.0519

Lensch H P A, Heidrich W, Seidel H P. A silhouette-based algorithm for texture registration and stitching. Graphical Models, 2001, 63(4): 245–262 DOI:10.1006/gmod.2001.0554

Stamos I, Allen P K. Geometry and texture recovery of scenes of large scale. Computer Vision and Image Understanding, 2002, 88(2): 94–118 DOI:10.1006/cviu.2002.0963

Corsini M, Dellepiane M, Ponchio F, Scopigno R. Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties. Computer Graphics Forum, 2009, 28(7): 1755–1764 DOI:10.1111/j.1467-8659.2009.01552.x

Corsini M, Dellepiane M, Ganovelli F, Gherardi R, Fusiello A, Scopigno R. Fully automatic registration of image sets on approximate geometry. International Journal of Computer Vision, 2013, 102(1/2/3): 91–111 DOI:10.1007/s11263-012-0552-5

Zhou Q Y, Koltun V. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics, 2014, 33(4): 1–10 DOI:10.1145/2601097.2601134

Whelan T, Kaess M, Johannsson H, Fallon M, Leonard J J, McDonald J. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research, 2015, 34(4/5): 598–626 DOI:10.1177/0278364914551008

Bi S, Kalantari N K, Ramamoorthi R. Patch-based optimization for image-based texture mapping. ACM Transactions on Graphics, 2017, 36(4): 1–11 DOI:10.1145/3072959.3073610

Whelan T, Salas-Moreno R F, Glocker B, Davison A J, Leutenegger S. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 2016, 35(14): 1697–1716 DOI:10.1177/0278364916669237

Meilland M, Barat C, Comport A. 3D High Dynamic Range dense visual SLAM and its application to real-time object re-lighting. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Adelaide, Australia, IEEE, 2013, 143–152 DOI:10.1109/ismar.2013.6671774

Li S D, Handa A, Zhang Y, Calway A. HDRFusion: HDR SLAM using a low-cost auto-exposure RGB-D sensor. In: 2016 Fourth International Conference on 3D Vision (3DV). CA, USA, IEEE, 2016, 314–322 DOI:10.1109/3dv.2016.40

Alexandrov S V, Prankl J, Zillich M, Vincze M. Towards dense SLAM with high dynamic range colors. In: Computer Vision Winter Workshop (CVWW), 2017

Xu L, Su Z, Han L, Yu T, Liu Y B, Fang L. UnstructuredFusion: realtime 4D geometry and texture reconstruction using commercial RGBD cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 1 DOI:10.1109/tpami.2019.2915229

Xu K, Zheng L T, Yan Z H, Yan G H, Zhang E, Nießner M, Deussen O, Cohen-Or D, Huang H. Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics, 2017, 36(6): 1–15 DOI:10.1145/3130800.3130812

Peng Y, Deng B L, Zhang J Y, Geng F Y, Qin W J, Liu L G. Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics, 2018, 37(4): 1–14 DOI:10.1145/3197517.3201290