
2019,  1 (5):   483 - 499   Published Date:2019-10-20

DOI: 10.1016/j.vrih.2019.09.003
1 Introduction 1.1 Preliminaries of depth cameras 1.2 Pipeline of 3D modeling based on depth cameras 2 Data structure for 3D surfaces 2.1 Volumetric representation 2.2 Surfel 3 3D Reconstruction based on depth cameras 3.1 Depth map preprocessing 3.2 Camera tracking 3.3 Fusion of depth images 3.4 3D reconstruction of dynamic scene 3.5 Scene understanding 4 Texture reconstruction 4.1 Offline texture reconstruction 4.2 Online texture reconstruction 5 Conclusions and future work


Three-dimensional (3D) modeling is an important topic in computer graphics and computer vision. In recent years, the introduction of consumer-grade depth cameras has resulted in profound advances in 3D modeling. Starting with the basic data structure, this survey reviews the latest developments of 3D modeling based on depth cameras, including research works on camera tracking, 3D object and scene reconstruction, and high-quality texture reconstruction. We also discuss the future work and possible solutions for 3D modeling based on the depth camera.


1 Introduction
Three-dimensional (3D) modeling is an important research area in computer graphics and computer vision. Its target is to capture the 3D shape and appearance of objects and scenes in high quality to simulate 3D interaction and perception in digital space. The 3D modeling methods can be primarily categorized into three types:
(1) Modeling based on professional software, such as 3DS Max and Maya: It requires the user to be proficient with the software. Thus, the learning curve is usually long; further, the creation of 3D content is usually time consuming.
(2) Modeling based on two-dimensional (2D) color (RGB) images, such as multiview stereo[1] and structure from motion[2,3]: Because image sensors are low-cost and easy to deploy, 3D modeling from multi-view 2D images can significantly automate and simplify the modeling process; however, it relies heavily on image quality and is easily affected by lighting and camera resolution.
(3) Modeling based on professional, active 3D sensors[4], including 3D scanners and depth cameras: The sensors actively project structured light or laser stripes onto the surface of the object and then reconstruct the spatial shape of objects and scenes, i.e., they sample a large number of 3D points from the 3D surface according to the spatial information decoded from the structured light. The advantage of 3D sensors is that they are robust to interference in the environment and are able to scan 3D models in high precision automatically. However, professional 3D scanners are expensive, which limits their applicability for ordinary users.
In recent years, the development of depth cameras, i.e., an RGB camera combined with a depth sensing device, also known as an RGB-D camera, has led to the rapid development of 3D modeling. The depth camera is not only low-cost and compact, but also capable of capturing pixel-level color and depth information with sufficient resolution and frame rate. Because of these characteristics, depth cameras have a huge advantage in consumer-oriented applications when compared to expensive scanning devices. The KinectFusion algorithm, proposed by Imperial College London and Microsoft Research in 2011, performs 3D modeling with depth cameras[5,6]. It can generate 3D models with high-resolution details in real time from the color and depth images captured by depth cameras, which has attracted the attention of researchers in the field of 3D reconstruction. Researchers in computer graphics and computer vision have since continuously improved the algorithms and systems for 3D modeling based on depth cameras and achieved satisfactory results.
This survey will review and compare the latest methods of 3D modeling based on depth cameras. We first present two basic data structures that are predominantly used for 3D modeling using depth cameras in Section 2. In Section 3, we focus on the progress of geometric reconstruction based on depth cameras. Then, in Section 4, the reconstruction of high-quality texture is reviewed. Finally, we conclude and discuss the future work in Section 5.
1.1 Preliminaries of depth cameras
Each frame of data obtained by the depth camera contains the distance from each point in the real scene to the vertical plane where the depth camera is located. This distance is called the depth value; moreover, these depth values at sampled pixels constitute the depth image of a depth frame. Usually, the color image and the depth image are registered; thus, there is a one-to-one correspondence between the pixels (Figure 1).
Currently, most depth cameras are based on either structured light, such as Kinect V1 by Microsoft and Structure Sensor by Occipital, or time of flight (ToF) technology, such as Microsoft Kinect 2.0. A structured-light depth camera first projects infrared structured light onto the surface of an object and then receives the light pattern reflected by that surface. As the received pattern is modulated by the 3D shape of the object, the spatial information of the 3D surface can be calculated from the position and degree of modulation of the pattern in the image. The structured-light camera is suitable for scenes with insufficient illumination; moreover, it can achieve high precision within a certain range and produce high-resolution depth images. The disadvantage of structured light is that it is easily affected by strong natural light outdoors, which washes out the projected coded light and makes it unusable. It is also susceptible to reflections from mirror-like planar surfaces. The ToF camera continuously emits light pulses (generally invisible light) onto the object to be observed and then receives the pulses reflected back from the object. The distance from the measured object to the camera is calculated from the round-trip flight time of the light pulse. Overall, the measurement error of the ToF depth camera does not increase with the measurement distance and its anti-interference ability is strong, which makes it suitable for relatively long measurement distances (such as autonomous driving). The disadvantage is that the resulting depth image is not high in resolution; moreover, the measurement error is larger than that of other depth cameras when measuring at close range.
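The round-trip principle of the ToF camera can be illustrated with a minimal Python sketch. The function name `tof_distance` is hypothetical, and the sketch assumes an idealized direct time measurement; real sensors may instead infer the flight time indirectly, for example from the phase shift of modulated light, which is not modeled here.

```python
# Speed of light in vacuum, in metres per second.
SPEED_OF_LIGHT = 299_792_458.0


def tof_distance(round_trip_seconds: float) -> float:
    """Distance to a surface given the round-trip time of a light pulse.

    Hypothetical helper for illustration only.
    """
    if round_trip_seconds < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the object and back, so the path is halved.
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


# A pulse that returns after 20 nanoseconds corresponds to roughly 3 m.
print(tof_distance(20e-9))
```

The sketch also indicates why close-range measurement is demanding: a depth difference of 1 cm changes the round-trip time by only about 67 picoseconds.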
1.2 Pipeline of 3D modeling based on depth cameras
As the depth data obtained by the depth camera inevitably contains noise, the depth map is first preprocessed, i.e., the noise is reduced and outliers are removed. Next, the camera poses are estimated to register the captured depth images into a consistent coordinate frame. The classic approach is to determine the corresponding points between the depth maps (i.e., data associations) and then estimate the motion of the camera by registering the depth images. Finally, the current depth image is merged into the global model based on the estimated camera pose. By repeating these steps, the geometry of the scene can be obtained. However, capturing the geometry alone is not sufficient to completely reconstruct the appearance of the scene; the model must also be provided with color and texture.
According to the data structure used to represent the 3D modeling result, we can approximately categorize 3D modeling into volumetric representation-based modeling and surfel-based modeling. The steps of 3D reconstruction will vary based on the different data structures. This review clarifies how each method solves the classic problems in 3D reconstruction by reviewing and comparing the different strategies adopted by the various modeling methods.
2 Data structure for 3D surfaces
2.1 Volumetric representation
The concept of volumetric representation was first proposed in 1996 by Curless and Levoy[7]. They introduced the concept of representing 3D surfaces using signed distance function (SDF) sampled at voxels, where SDF values are defined as the distance of a 3D point to the surface. Thus, the surface of an object or a scene, according to this definition, is an iso-distance surface whose SDF is zero. The free space, i.e., the space outside of the object, is indicated by a positive SDF value, because the distance increases as a point moves along the surface normal towards the free space. Conversely, the occupied space (the space inside of the object) is indicated by negative SDF values. These SDF values are stored in a cube surrounding the object and the cube is rasterized into voxels. The reconstruction algorithm must define the voxel size and spatial extent of the cube. Other data, such as colors, are usually stored as voxel attributes. As only voxels close to the actual surface are important, researchers often use the truncated SDF (TSDF) to represent the model. Figure 2 shows the general pipeline of reconstruction based on volumetric representation. As the TSDF value implicitly represents the surface of the object, it is necessary to extract the surface of the object using ray casting before estimating the pose of the camera. The fusion step is a simple weighted average operation at each voxel center, which is very effective for eliminating the noise of the depth camera.
Uniform voxel grids are significantly inefficient in terms of memory consumption and are limited by predefined volume and resolutions. In the context of real-time scene reconstruction, most methods rely heavily on the processing power of modern GPUs. The spatial extent and resolution of voxel grids are often limited by GPU memory. To support a larger spatial extent, researchers have proposed a variety of methods to improve the memory efficiency of volumetric representation-based algorithms. To prevent data loss caused by depth images acquired outside of a predefined cube, Whelan et al.[8] proposed a method of dynamically shifting voxel grids such that the voxel grids follow the motion of the depth camera. When the voxel grids move, the method extracts the parts outside the current voxel mesh and stores them separately, adding their camera pose to the global pose. Although this can achieve a larger scan volume, it requires a large amount of external memory usage; moreover, the scanned surface cannot be arbitrarily revisited. Henry et al.[9] also proposed a similar idea.
Voxel hierarchies, such as octrees, are another way to efficiently store the geometry of 3D surfaces. The octree can adaptively subdivide the scene space according to the complexity of the scene, effectively utilizing the memory of the computer[10]. Although its definition is simple, owing to the sparsity of nodes, it is considerably difficult to utilize the parallelism of the GPU. Sun et al.[11] established an octree with only leaf nodes to store voxel data and accelerate tracking based on octree. Zhou et al.[12] used the GPU to construct a complete octree structure to perform Poisson reconstruction on scenes with 300,000 vertices at an interactive rate. Zeng et al.[13] implemented a 9- to 10-level octree for KinectFusion and extended the reconstruction results to a medium-sized office. Chen et al.[14] proposed a similar three-level hierarchy with equally good resolution. Steinbrücker et al.[15] represented the scene in a multi-resolution data structure for real-time accumulation on the CPU; further, they outputted a real-time grid model at approximately 1 Hz.
Nießner et al.[16] proposed a voxel hashing structure. In this method, the index corresponding to the spatial position of a voxel block is stored in a linearized spatial hash table and addressed by a spatial hash function. As only voxels containing spatial information are stored in memory, this strategy significantly reduces memory consumption and theoretically allows voxel blocks of a certain size and resolution to represent infinitely large spaces. When compared with voxel hierarchies, the hashing structure has a great advantage in data insertion and access, both of which have O(1) time complexity. However, the hashing function inevitably maps multiple different spatial voxel blocks to the same location in the hash table, i.e., hash collisions occur. Kahler et al.[17] proposed the use of a different hashing method to reduce the number of hash collisions. Hash-based 3D reconstruction has the advantage of small memory requirements and efficient calculations, making it suitable even for mobile devices, such as Google Tango[18].
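The idea of addressing sparse voxel blocks through a spatial hash can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the data structure of any cited system: the class `VoxelBlockTable`, the bucket layout, the block payload, and the three large primes (a choice commonly seen in spatial hashing) are all assumptions of this sketch, and collisions are resolved here by simple per-bucket lists.

```python
import math
from typing import Dict, List, Tuple

BlockCoord = Tuple[int, int, int]

# Large primes commonly used for spatial hashing (an assumption of this sketch).
P1, P2, P3 = 73856093, 19349669, 83492791


def spatial_hash(block: BlockCoord, num_buckets: int) -> int:
    """Map integer block coordinates to a bucket index."""
    x, y, z = block
    return ((x * P1) ^ (y * P2) ^ (z * P3)) % num_buckets


class VoxelBlockTable:
    """Sparse table that allocates voxel blocks only where data is observed."""

    def __init__(self, num_buckets: int = 1024, block_edge_m: float = 0.08) -> None:
        self.num_buckets = num_buckets
        self.block_edge_m = block_edge_m  # metric edge length of one voxel block
        # Each bucket is a short list, so colliding blocks can coexist.
        self.buckets: List[List[Tuple[BlockCoord, Dict[str, float]]]] = [
            [] for _ in range(num_buckets)
        ]

    def block_of(self, point: Tuple[float, float, float]) -> BlockCoord:
        """Integer coordinates of the block containing a 3D point."""
        x, y, z = (math.floor(c / self.block_edge_m) for c in point)
        return (x, y, z)

    def get_or_allocate(self, block: BlockCoord) -> Dict[str, float]:
        """Return the block's data, allocating it on first access."""
        bucket = self.buckets[spatial_hash(block, self.num_buckets)]
        for key, data in bucket:
            if key == block:
                return data
        data = {"tsdf": 1.0, "weight": 0.0}  # placeholder per-block payload
        bucket.append((block, data))
        return data


table = VoxelBlockTable()
block = table.block_of((0.05, -0.01, 0.2))
table.get_or_allocate(block)["weight"] = 1.0
```

Because a block is allocated only when a point falls inside it, memory grows with the observed surface rather than with the bounding volume, and lookups stay close to constant time as long as the buckets remain short.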
2.2 Surfel
Surfel is another important surface representation, which was originally used as the primitive for model rendering[19]. As shown in Figure 3, a surfel usually contains the following attributes: space coordinates $p \in \mathbb{R}^3$, normal vector $n \in \mathbb{R}^3$, color information $c \in \mathbb{R}^3$, weight (or confidence) $w$, radius $r$, and timestamp $t$. The weight of a surfel is used for the weighted average of neighboring 3D points and the judgment of the stable and unstable points. It is generally initialized by the following formula:
$w = e^{-\gamma^{2} / 2\sigma^{2}}$,
where γ is the normalized radial distance from the current depth measurement to the center of the camera; further, σ is usually set to 0.6[20]. The radius of a surfel is determined by the distance of the surface of the scene from the optical center of the camera. The greater the distance, the larger the radius of the surfel. When compared to the volumetric representation, the surfel-based 3D reconstruction is more compact and intuitive for input from depth cameras, eliminating the need to switch back and forth between different representations. Similar to volumetric representation, octrees can be used to improve the surfel-based data structures[21,22].
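The weight initialization described above can be sketched as follows. This is a minimal example under an explicit assumption: γ is interpreted as the distance of a pixel from the image center, divided by the distance from the center to an image corner, so that it lies in [0, 1]. The function name and the pixel-center convention are hypothetical.

```python
import math


def initial_surfel_weight(u: float, v: float, width: int, height: int,
                          sigma: float = 0.6) -> float:
    """Gaussian confidence weight of a depth sample at pixel (u, v).

    Assumption of this sketch: gamma is the pixel's distance to the image
    center, normalized by the center-to-corner distance.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    max_radius = math.hypot(cx, cy)
    gamma = math.hypot(u - cx, v - cy) / max_radius
    return math.exp(-(gamma ** 2) / (2.0 * sigma ** 2))


# Samples near the image center receive a weight close to 1,
# samples near the corners a noticeably lower one.
print(initial_surfel_weight(319.5, 239.5, 640, 480))
print(initial_surfel_weight(0.0, 0.0, 640, 480))
```

Under this interpretation, measurements near the image border, where depth data are typically less reliable, contribute less to the weighted average.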
Pros and cons: Volumetric representation is more convenient in depth denoising and surface rendering as the TSDF computed according to each frame can be efficiently fused at each voxel, as discussed in the next section. In contrast, the surfel representation can be viewed as an extension of the point cloud representation. When the vertex correspondences are necessary in the registration, such information can be directly computed using surfels, while the volumetric representation has to be converted to 3D meshes in this case.
3 3D Reconstruction based on depth cameras
3.1 Depth map preprocessing
The depth image obtained by the depth camera inevitably contains noise or even outliers owing to complex lighting conditions, geometric variations, and spatially varying materials of the objects. Mallick et al.[23] classified this noise into the following three categories. The first is missing depth. When the object is too close to or too far from the depth camera, the camera sets the depth value to zero because it cannot measure the depth. It should be noted that even within the measurement range of the camera, zero depth may be generated owing to surface discontinuities, highlights, shadows, or other reasons. The second is depth error. When a depth camera records a depth value, its accuracy depends on the depth itself. The third is depth inconsistency. Even if the depth of the scene is constant, the depth value obtained will change with time.
In most cases, bilateral filtering[24] is used to reduce the noise of the depth image. After noise reduction, KinectFusion[5] downsamples the depth map to build a three-level depth map pyramid, which is used to estimate the camera pose from coarse to fine in the subsequent step. Given the noise-reduced depth image and the intrinsic parameters of the depth camera, the 3D coordinates of each pixel in the camera coordinate system can be calculated by back-projection to form a vertex map. The normal of each vertex can be obtained by the cross product of vectors formed by the vertex and its adjacent vertices. For the surfel representation, it is also necessary to calculate the radius of each point to represent the local surface area around a given point while minimizing the visible holes between adjacent points, as shown in Figure 3. Reference [25] calculates the vertex radius r using the following formula:
$r = 2d / f$,
where d represents the depth value of the vertex and f represents the focal length of the depth camera.
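The preprocessing steps described above (back-projection to a vertex map, normals from neighboring vertices, and the per-vertex radius) can be sketched in Python. This is a minimal illustration assuming a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; the function names are not from any cited system, and the radius follows the formula as printed in the text.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """3D point in the camera frame for pixel (u, v) with the given depth."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def vertex_normal(v: Vec3, v_right: Vec3, v_down: Vec3) -> Vec3:
    """Unit normal from the cross product of the vectors to two neighbors."""
    a = tuple(r - c for r, c in zip(v_right, v))
    b = tuple(d - c for d, c in zip(v_down, v))
    n = (a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0])
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("degenerate neighborhood: normal is undefined")
    return (n[0] / length, n[1] / length, n[2] / length)


def surfel_radius(depth: float, focal_length: float) -> float:
    """Vertex radius r = 2 * d / f, as the formula is printed in the text."""
    return 2.0 * depth / focal_length


# Hypothetical intrinsics for a 640 x 480 depth image.
fx = fy = 500.0
cx, cy = 320.0, 240.0
p = back_project(320, 240, 1.0, fx, fy, cx, cy)
p_right = back_project(321, 240, 1.0, fx, fy, cx, cy)
p_down = back_project(320, 241, 1.0, fx, fy, cx, cy)
print(vertex_normal(p, p_right, p_down), surfel_radius(1.0, fx))
```

With this neighbor ordering the normal of a fronto-parallel plane points along the positive z axis, i.e., away from the camera; a system that expects normals facing the camera would flip the sign.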
3.2 Camera tracking
Usually, the six-degree-of-freedom camera pose of a captured depth image is represented by a rigid transformation matrix $T_g$. This transformation matrix maps the camera coordinate system to the world coordinate system. Given the position of a point $p_l$ in the local camera coordinate system, its 3D coordinate in the world coordinate system, denoted by $p_g$, is obtained by the following equation: $p_g = T_g\,p_l$.
Camera tracking, i.e., estimating the camera pose, is a prerequisite for fusing the depth image of the current frame into the global model, and the fused model is represented by either a volumetric representation or surfels, depending on the choice of each 3D reconstruction system (Figure 4). The iterative closest point (ICP) algorithm might be the most important algorithm in relative camera pose estimation. In the early days, the ICP algorithm was applied to 3D shape registration[26,27]. The rigid transformation is calculated by determining matching points between two adjacent frames and minimizing the sum of squares of the Euclidean distances between these pairs. This step is iterated until a certain convergence criterion is satisfied. A serious problem with this frame-to-frame strategy is that the registration error between two frames accumulates as the scan progresses.
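The mapping from camera to world coordinates can be sketched as a 4 x 4 homogeneous matrix applied to a point. This is a minimal example with hypothetical helper names; for brevity the pose is built from a rotation about the z axis only, whereas a general pose uses a full 3D rotation.

```python
import math
from typing import List, Sequence, Tuple

Matrix4 = List[List[float]]


def make_pose(yaw_rad: float, translation: Sequence[float]) -> Matrix4:
    """Rigid transform: rotation about the z axis followed by a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def to_world(T_g: Matrix4, p_l: Sequence[float]) -> Tuple[float, float, float]:
    """Apply p_g = T_g * p_l using homogeneous coordinates."""
    h = (p_l[0], p_l[1], p_l[2], 1.0)
    x, y, z = (sum(T_g[r][k] * h[k] for k in range(4)) for r in range(3))
    return (x, y, z)


# A camera rotated by 90 degrees about z and shifted by 1 m along x.
T_g = make_pose(math.pi / 2.0, (1.0, 0.0, 0.0))
print(to_world(T_g, (1.0, 0.0, 0.0)))
```

In the example, the local point (1, 0, 0) is first rotated onto the y axis and then translated, which places it at approximately (1, 1, 0) in world coordinates.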
To alleviate this problem, depth camera-based reconstruction methods adopted a frame-to-model camera tracking strategy[6,8,16], in which each new frame is registered against the accumulated global model rather than against the previous frame. Although frame-to-model tracking significantly reduces the tracking drift of each frame, it does not completely solve the problem of error accumulation, as tracking errors still accumulate over time. The accumulated drift eventually leads to misalignment of the reconstructed surface at the loop closure of the trajectory. Therefore, researchers have introduced the idea of global pose optimization. Henry et al.[29] proposed the concept of the keyframe: a keyframe is produced whenever the accumulated camera motion exceeds a threshold. Loop closure detection is performed only when a keyframe is created; moreover, the random sample consensus algorithm[30] is used to perform feature point matching between the current and previous keyframes. After a loop closure is detected, there are two global optimization strategies to choose from. The first is pose graph optimization, where the pose graph represents the constraints between frames and each edge between frames corresponds to a geometric constraint. The second is sparse bundle adjustment, which globally minimizes the re-projection error of the matched feature points over all frames[31,32,33]. Zhou et al.[34] proposed the use of interest points to preserve local details and combined them with global pose optimization to evenly distribute registration errors in the scene. Zhou et al.[35] also proposed segmenting a video sequence into multiple segments, reconstructing a locally accurate scene fragment from each segment using frame-to-model integration, establishing dense correspondences between overlapping fragments, and optimizing a global energy function to align the fragments.
The optimization can subtly deform the fragments to correct inconsistencies caused by low-frequency distortion in the input images. Whelan et al.[36] used depth and color information to perform global pose estimation for each input frame. The predicted depth and color images used for registration are obtained by applying the surface splatting algorithm[19] to the global model. They divided the loop closure problem into global and partial loop closures. Global loop closure detection uses the random fern coding method[37]. For loop closure optimization, they used a spatial deformation method based on the embedded deformation graph[38]. A deformation graph consists of a set of nodes and edges that are distributed over the model to be deformed. Each surfel is affected by a set of nodes; moreover, the influence weight is inversely proportional to the distance of the surfel to the node.
Regardless of the registration strategy, identifying corresponding points between two frames is an essential step in camera tracking. The set of corresponding point pairs is brought into the objective function to be optimized, thereby calculating the optimal solution of the camera pose. The process of determining the corresponding points can be divided into sparse and dense methods according to the pair of points used. The sparse method only obtains the corresponding feature points, while the dense method searches for the correspondences for all points.
For feature extraction and matching, the scale-invariant feature transform (SIFT) algorithm[39] is a popular choice. The speeded-up robust features (SURF) algorithm[40] improves on SIFT by increasing the speed of feature point detection. The ORB feature descriptor is an alternative to SIFT and SURF that is faster but correspondingly less stable. Endres et al.[41] exploited these three feature descriptors when extracting feature points and compared their effects. Whelan et al.[8] explored feature point-based fast odometry from vision instead of the ICP method in KinectFusion and adopted SURF feature descriptors for loop closure detection. Zhou et al.[42] proposed a method to track the camera based on object contours and introduced corresponding points of contour features to ensure more stable tracking. The recent BundleFusion method[43] first used SIFT features for coarse registration and subsequently dense methods for fine registration.
For dense methods, the traditional methods of identifying corresponding points are time consuming. The projection data association algorithm[44] significantly speeds up this process. This strategy projects the three-dimensional coordinates of the input point to a pixel on the target depth map according to the camera pose and the internal parameters of the camera; then, it obtains the best corresponding point through the search in the neighborhood of the pixel. There are many ways to measure the error of the corresponding point, such as point-to-point metric[26] and point-to-plane metric[27]. When compared to the point-to-point metric, the point-to-plane metric results in faster convergence in the camera tracking; thus, it is widely used. In addition to using geometric distance as constraints, many methods use optical differences (i.e., color differences) between corresponding points as constraints, such as[43,44,45]. Lefloch et al.[46] considered curvature information as an independent surface property for real-time reconstruction. They considered the curvature of the input depth map and global model when determining dense corresponding points.
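The difference between the two error metrics can be made concrete with a small Python sketch. It assumes that correspondences are already given (for example by projective data association) and that each target point has a unit normal; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def point_to_point_error(src: Sequence[Vec3], dst: Sequence[Vec3]) -> float:
    """Sum of squared Euclidean distances between corresponding points."""
    return sum(
        sum((s - d) ** 2 for s, d in zip(p, q))
        for p, q in zip(src, dst)
    )


def point_to_plane_error(src: Sequence[Vec3], dst: Sequence[Vec3],
                         normals: Sequence[Vec3]) -> float:
    """Sum of squared distances from each source point to the tangent
    plane at its corresponding target point (unit normals assumed)."""
    total = 0.0
    for p, q, n in zip(src, dst, normals):
        residual = sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n))
        total += residual ** 2
    return total


# A source point that has slid 10 cm along a flat surface (normal = +z).
src = [(0.1, 0.0, 0.0)]
dst = [(0.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0)]
print(point_to_point_error(src, dst))
print(point_to_plane_error(src, dst, normals))
```

In this example the point-to-plane error is zero because the offset lies entirely within the tangent plane. One common intuition for the faster convergence is that the point-to-plane metric does not penalize such sliding along flat regions, whereas the point-to-point metric does.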
3.3 Fusion of depth images
Volumetric representation: When compared to computing a complete SDF for all voxels, the TSDF is usually calculated because its computational cost is more affordable. The voxel center is first projected to the input depth map after camera tracking to compute the TSDF value corresponding to the current input depth map. Then, the new TSDF value is fused to the TSDF of the global model by weighted averaging, which is effective in removing depth noises.
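The per-voxel fusion step can be sketched as a running weighted average. This is a minimal Python illustration under stated assumptions: depth is measured along the camera ray through the voxel center, the TSDF is normalized to [-1, 1], and the helper names, the unit weight per observation, and the weight cap `max_weight` are hypothetical choices rather than values from any cited system.

```python
from typing import Optional, Tuple


def truncated_sdf(measured_depth: float, voxel_depth: float,
                  truncation: float) -> Optional[float]:
    """Normalized TSDF of a voxel for one depth measurement.

    Positive values lie in free space in front of the surface, negative
    values behind it. Voxels far behind the surface are unobserved, so
    None is returned for them.
    """
    sdf = measured_depth - voxel_depth
    if sdf < -truncation:
        return None
    return min(1.0, sdf / truncation)


def fuse_voxel(tsdf: float, weight: float, new_tsdf: float,
               new_weight: float = 1.0,
               max_weight: float = 100.0) -> Tuple[float, float]:
    """Weighted running average of the stored and the new TSDF value."""
    total = weight + new_weight
    fused = (tsdf * weight + new_tsdf * new_weight) / total
    # Capping the weight keeps the model responsive to later changes.
    return fused, min(total, max_weight)


# Noisy depth readings of a surface near 2.0 m, seen from a voxel at 1.98 m.
tsdf, weight = 0.0, 0.0
for depth in (2.01, 1.99, 2.00, 2.02, 1.98):
    value = truncated_sdf(depth, voxel_depth=1.98, truncation=0.1)
    if value is not None:
        tsdf, weight = fuse_voxel(tsdf, weight, value)
print(tsdf, weight)
```

Averaging the five readings yields a TSDF close to 0.2, i.e., the voxel lies about 2 cm in front of the surface for a 10 cm truncation band, which illustrates how the fusion suppresses per-frame depth noise.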
Surfel: After estimating the camera pose of the current input frame, each vertex and its normal and radius are integrated into the global model. The fusion of depth maps using surfels has three steps: data association, weighted average, and removal of invalid points[20]. First, corresponding points are determined by projecting the vertices of the current 3D model to the image plane of the current camera. As some model points might be projected onto the same pixel, the upsampling method is used to improve the accuracy. If the corresponding point is identified, the most reliable point is merged with the new point estimate using a weighted average. If it is not determined, the new point estimate is added to the global model as an unstable point. Over time, the global model is cleaned up to eliminate outliers according to the visibility and time constraints.
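The weighted-average and clean-up steps of surfel fusion can be sketched as follows. This minimal Python example takes the result of data association as given (the index of the matched model surfel, or `None`); the `Surfel` fields follow the attributes listed in Section 2.2, while the stability threshold `STABLE_WEIGHT`, the `max_age` parameter, and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

STABLE_WEIGHT = 10.0  # hypothetical confidence needed to count as stable


@dataclass
class Surfel:
    position: Vec3
    normal: Vec3
    radius: float
    weight: float
    timestamp: int


def merge_surfels(model: Surfel, meas: Surfel) -> Surfel:
    """Confidence-weighted average of a model surfel and a new measurement."""
    total = model.weight + meas.weight

    def blend(a: Vec3, b: Vec3) -> Vec3:
        x, y, z = ((model.weight * p + meas.weight * q) / total
                   for p, q in zip(a, b))
        return (x, y, z)

    nx, ny, nz = blend(model.normal, meas.normal)
    # Guard against a zero-length result when opposite normals cancel.
    length = math.sqrt(nx * nx + ny * ny + nz * nz) or 1.0
    radius = (model.weight * model.radius + meas.weight * meas.radius) / total
    return Surfel(blend(model.position, meas.position),
                  (nx / length, ny / length, nz / length),
                  radius, total, meas.timestamp)


def integrate(model: List[Surfel], match: Optional[int], meas: Surfel) -> None:
    """Merge into the matched surfel, or add a new (unstable) surfel."""
    if match is None:
        model.append(meas)
    else:
        model[match] = merge_surfels(model[match], meas)


def remove_stale(model: List[Surfel], now: int, max_age: int) -> List[Surfel]:
    """Drop surfels that stayed unstable for longer than max_age frames."""
    return [s for s in model
            if s.weight >= STABLE_WEIGHT or now - s.timestamp <= max_age]
```

A surfel that is observed repeatedly accumulates weight and becomes stable, whereas an isolated outlier never gains enough weight and is eventually removed by the time constraint.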
3.4 3D reconstruction of dynamic scene
The reconstruction of dynamic scenes is more challenging than that of static scenes because of deforming objects. We have to deal with not only the overall movement of the objects but also their non-rigid deformations. The key issue in dynamic reconstruction is to develop a fast and robust non-rigid registration algorithm to handle non-rigid motions in the scene.
Volumetric representation: DynamicFusion[47], proposed by Newcombe et al., was the first method to reconstruct a non-rigidly deforming scene in real time using the input obtained from a depth camera. It reconstructs an implicit volumetric representation, similar to the KinectFusion method, and simultaneously optimizes the rigid and non-rigid motion of the scene based on a warp field parameterized by a sparse deformation graph[38]. The main problem with DynamicFusion is that it cannot handle topology changes that occur during capture. Dou et al. proposed the Fusion4D method[48]. This method used eight depth cameras to acquire a multi-view input, combined the concept of volumetric fusion with the estimation of a smooth deformation field, and proposed a method for fusing depth information from multiple cameras that can be applied to complex scenes. In addition, it proposed a machine-learning-based method for detecting corresponding points that can robustly handle fast frame-to-frame motion. Recently, Yu et al. proposed a new real-time reconstruction system, called DoubleFusion[49]. This system combined dynamic reconstruction based on volumetric representation with the SMPL body model[50], and it can reconstruct detailed geometry, non-rigid body motion, and the inner body shape simultaneously from a single depth camera. One of the key contributions of this method is the two-layer representation, which consists of a fully parametric inner body shape and a gradually fused outer surface layer. A predefined node graph on the body surface parameterizes the non-rigid deformation near the body; moreover, a free-form dynamically changing graph parameterizes the outer surface layer far from the body for more general reconstruction. Furthermore, a joint motion tracking method based on the two-layer representation was proposed to make joint motion tracking robust and fast.
Guo et al.[51] proposed a shading-based scheme to leverage appearance information for motion estimation and the computed albedo information was fused into the volume. The accuracy of the reconstructed motion of the dynamic object was significantly improved owing to the incorporation of the albedo information.
Surfel: Keller et al.[20] used a foreground segmentation method to process dynamic scenes. The proposed algorithm removes objects with moving surfaces from the global model so that the dynamic part does not affect the camera tracking. As shown in Figure 5, a person seated on a chair is initially reconstructed and then begins to move. The system segments the moving person in the ICP step and ignores the dynamic part of the scene (A in Figure 5); thus, it ensures the robustness of the camera tracking to dynamic motion.
Rünz et al.[52] proposed using motion or semantic cues to segment the scene into the background and different rigid foreground objects, simultaneously tracking and reconstructing their 3D geometry over time; this is one of the earliest real-time methods for dense tracking and fusion of multiple objects. They proposed two segmentation strategies: motion segmentation (grouping points that move consistently in 3D) and object instance segmentation (detecting and segmenting single objects in RGB images for a given semantic label). Once detected and segmented, objects are added to the active model list and then tracked, and their 3D shape models are updated by merging the data that belong to each object. Gao et al.[53] proposed the utilization of a deformation field to align the depth input of the current moment with a reference geometry (template). The reference geometry is initialized with the deformation field of the previous frame and then optimized by the ICP algorithm, similar to DynamicFusion. Once the deformation field is updated, data fusion is performed to accumulate the current depth observations into the global geometric model. According to the updated global model, the global deformation field is inversely computed for the registration of the next frame.
3.5 Scene understanding
Scene understanding refers to the process of analyzing the scene by considering the geometry and semantic context of the scene content and the intrinsic relationship between them, including object detection and recognition, the relationship between objects, and semantic segmentation. Research in recent years has shown that scene understanding, such as plane detection and semantic segmentation, can provide significant assistance for the 3D modeling of scenes.
Observing that there are many planes in indoor scenes, such as walls, floors, and desktops, Zhang et al.[55] proposed a region-growing method to detect and label new planes, using planarity and orthogonality constraints to reduce camera tracking drift and the bending of flat regions. Dzitsiuk et al.[56] fitted planes to the signed distance function values of the volume using a robust least squares method and merged the detected local planes to determine globally continuous planes. By using the detected planes to correct the SDF values, the noise on flat surfaces can be significantly reduced. Further, they used plane priors to directly obtain object-level and semantic segmentation (such as walls) of the indoor environment. Shi et al.[57] proposed a method for predicting the co-planarity of image patches for RGB-D image registration. They used a self-supervised approach to train a deep network to predict whether two image patches are coplanar, and they incorporated coplanarity constraints when solving for the camera poses. Experiments demonstrated that this use of co-planarity remains effective when feature point-based methods cannot achieve loop closure.
SLAM++[58], proposed by Salas-Moreno et al., is one of the earliest object-oriented mapping methods. They used point-to-point features to detect objects and reconstruct 3D maps at the object level. Although this scheme enables the SLAM system to be extended to object-level loop closure processing, detailed 3D models of the objects must be preprocessed in advance to learn the 3D feature descriptors. Tateno et al.[59] incrementally segmented the reconstructed region in TSDF volumes and used the oriented, unique, and repeatable clustered viewpoint feature histogram[60] feature descriptor to directly match other objects in the database. Xu et al.[61] proposed a recursive 3D attention model for active 3D object recognition by a robot. The model can perform both instance-level shape classification and continuous next-best-view regression. They inserted the retrieved 3D model into the scene and built the 3D scene model step by step. In reference [63], the 3D indoor scene scanning task was decomposed into concurrent subtasks for multiple robots to explore the unknown regions simultaneously, where the target regions of subtasks were computed via optimal transportation optimization.
In recent years, with the development of 2D semantic image segmentation algorithms, many methods have combined them with 3D modeling methods to improve the reconstruction quality. McCormac et al.[64] adopted the convolutional neural network (CNN) model[65] proposed by Noh et al. to perform 2D semantic segmentation. A Bayesian framework for 2D-3D label transfer was used to fuse 2D semantic segmentation labels into 3D models. Then, the model was post-processed using a conditional random field. Rünz et al.[54] adopted a mask region-based CNN model[66] to detect object instances, as can be seen in Figure 6. They applied ElasticFusion[36], using surfels to represent each object and the static background, for real-time tracking and dense reconstruction of dynamic instances. Differing from the above two methods, which semantically label the scene at the object level, Hu et al.[67] proposed a method for reconstructing a single object based on the labeling of each part of the object. To minimize the labeling work of the user, they used an active self-learning method to generate the data required by the neural network. Schönberger et al.[68] obtained 3D descriptors by jointly encoding high-level 3D geometric information and semantic information. Experiments have shown that such descriptors enable reliable localization even under extreme variations in viewing angle, illumination, and geometry.
4 Texture reconstruction
In addition to the 3D geometry of objects and scenes, which is of great interest in many applications, surface color and general appearance information can significantly enhance the realism of 3D geometry in augmented reality/virtual reality applications, which is important for simulating the real world in virtual environments. In the following sections, offline and online texture reconstruction methods are reviewed and discussed.
4.1 Offline texture reconstruction
The goal of these techniques is to reconstruct a high-quality, globally consistent texture for the 3D model from the multi-view RGB images after the capturing process is completed. They rely on optimization algorithms to remove the ghosting and over-smoothing effects that arise when the captured images are blended directly. The frequently explored optimization objectives include color consistency[69,70], the alignment of image and geometric features[71,72], and mutual information maximization between the projected images[73,74]. While these methods can effectively handle camera calibration inaccuracies, they cannot handle inaccurate geometry and optical distortions in the RGB images, which are common problems in data captured by consumer-level depth cameras. Zhou and Koltun[75] solved the above problems by optimizing the RGB camera poses and a non-rigid warping of each associated image to maximize the photo consistency. In contrast, Bi et al.[76] proposed a patch-based optimization method, which synthesizes a set of feature-aligned color images to produce high-quality texture mapping even with large geometric errors. Huang et al.[62] proposed several improvements based on the method presented by Zhou and Koltun[75] for 3D indoor scene reconstruction (see Figure 7). After calculating a primitive abstraction of the scene, they first corrected the color values by compensating for the varying exposure and white balance; then, the RGB images were aligned by an optimization based on sparse features and dense photo consistency. Finally, a temporally coherent sharpening algorithm was developed to obtain a clean texture for the reconstructed 3D model.
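As a rough illustration of the photo-consistency objective mentioned above, one common way to write it is given below. The notation is an assumption introduced here for illustration: $C(\mathbf{p})$ is the color assigned to surface point $\mathbf{p}$, $\mathbf{T}_i$ is the pose of the $i$-th RGB image, $P_i$ is the set of points visible in that image, and $\Gamma_i(\mathbf{p}, \mathbf{T}_i)$ is the color obtained by projecting $\mathbf{p}$ into image $i$.

```latex
E(\mathbf{C}, \mathbf{T}) = \sum_{i} \sum_{\mathbf{p} \in P_i}
  \bigl( C(\mathbf{p}) - \Gamma_i(\mathbf{p}, \mathbf{T}_i) \bigr)^2
```

Minimizing $E$ alternately over the colors $\mathbf{C}$ and the poses $\mathbf{T}$ pulls the projected images into agreement on the surface; a non-rigid image correction can be added as further unknowns inside $\Gamma_i$.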
4.2 Online texture reconstruction
Online texture reconstruction generates high-quality textures in parallel with the scanning of 3D objects. Its advantage is that the user can observe the textured model in real time during scanning; however, the texture quality may be inferior because images that have not yet been captured cannot be exploited. Whelan et al.[77] did not fuse the color values at object edges into the model, as these would result in artifacts or discontinuities. In the subsequent work[78], they estimated the position and orientation of the light sources in the scene to further reject the fusion of samples containing only highlights. Although this significantly improves the visual quality of the texture, artifacts may still occur. As high dynamic range (HDR) images provide more image detail than low dynamic range images, they have also been adopted in 3D reconstruction based on RGB-D cameras[79,80,81]. First, a pre-calibrated response curve is used to linearize the observed intensity values. Then, the HDR pixel colors are computed based on the exposure time and fused into the texture of the 3D model in real time. Recently, Xu et al.[82] proposed a dynamic atlas texturing scheme for warping and updating the dynamic texture on the fused geometry in real time, which significantly improves the visual quality of 3D human modeling.
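The two HDR steps above (linearization with a response curve, then exposure-aware fusion) can be sketched as follows. This is a minimal illustration under stated assumptions: the inverse response is a 256-entry lookup table, the confidence weight is a simple hat function over the 8-bit range, and fusion is a running weighted average. None of these choices is taken from the cited systems, and the function name is hypothetical.

```python
import numpy as np

def fuse_hdr_sample(radiance, weight, pixel_values, exposure_time, inverse_response):
    """Fuse one low dynamic range observation into a running HDR estimate.

    radiance, weight: (N,) current HDR estimate and accumulated weight per texel
    pixel_values:     (N,) observed 8-bit intensities for the same texels
    exposure_time:    exposure time of the frame in seconds
    inverse_response: (256,) lookup table mapping pixel value to linear exposure
    """
    pixel_values = np.asarray(pixel_values, dtype=np.uint8)
    # Step 1: linearize the observed intensities with the response curve.
    exposure = inverse_response[pixel_values]
    # Step 2: divide by the exposure time to obtain a radiance estimate.
    observed = exposure / exposure_time
    # ASSUMPTION: hat weighting, so that under- and over-exposed pixels count less.
    w = 1.0 - np.abs(pixel_values.astype(float) - 127.5) / 127.5
    new_weight = weight + w
    # Running weighted average; texels with zero total weight keep their old value.
    fused = np.where(new_weight > 0,
                     (weight * radiance + w * observed) / np.maximum(new_weight, 1e-12),
                     radiance)
    return fused, new_weight

# Hypothetical linear response curve and two frames observing the same texel.
inverse_response = np.arange(256) / 255.0
radiance, weight = np.zeros(1), np.zeros(1)
for value, dt in (([100], 0.01), ([200], 0.02)):
    radiance, weight = fuse_hdr_sample(radiance, weight, value, dt, inverse_response)
```

In this toy example the second frame has twice the exposure time and twice the pixel value, so both frames agree on the same radiance and the fused estimate stays unchanged.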
5 Conclusions and future work
In the past few years, 3D modeling using depth cameras has evolved rapidly. Its current development covers the entire reconstruction pipeline and brings together innovations at various levels, from depth camera hardware to high-level applications such as augmented reality and teleconferencing. Research works on 3D modeling using depth cameras, including camera tracking, 3D object and scene reconstruction, and high-quality texture reconstruction, are reviewed and discussed in this paper.
Despite the achieved success, 3D modeling based on depth cameras still encounters many challenges. First, depth cameras are short-range 3D sensors, so applying them to large-scale scene reconstruction remains time consuming. Thus, it is necessary to investigate how to integrate the data of long-range sensors, such as laser scanners, with depth data to expedite 3D scene reconstruction. A multi-source, large-scale 3D data registration algorithm, including correspondence detection and subsequent distance minimization, should be developed in this case. We hypothesize that the domain decomposition technique, which supports alternating minimization of the objective function over a subset of the optimization variables in each iteration, can be applied to handle such large-scale registration problems. It is also interesting to investigate active vision approaches that drive robots equipped with depth cameras through a scene for complete scanning[83]. Second, moving objects in the scene degrade depth camera tracking. We need to explore how to accurately separate moving objects from the static scene in the captured depth data to improve the accuracy of camera tracking, which assumes a static scene. One possible solution is to integrate object detection with motion flow: objects whose observed motion is consistent with the camera motion can be regarded as static. Third, to support depth cameras integrated into mobile phones, it is necessary to develop real-time and energy-efficient depth data processing and camera tracking algorithms for mobile phone hardware. Convergence acceleration algorithms, such as Anderson acceleration, can be explored to speed up camera tracking[84].
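To make the last suggestion more concrete, the following is a minimal sketch of Anderson acceleration for a generic fixed-point iteration x = g(x). It is not the implementation of [84]; the function name, the window size `m`, and the toy map `g` are assumptions for illustration. In a tracking setting, `g` would stand for one update of an iterative pose solver.

```python
import numpy as np

def anderson_fixed_point(g, x0, m=5, max_iter=50, tol=1e-10):
    """Solve x = g(x) with Anderson acceleration using a window of m past steps."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    g_hist, f_hist = [], []  # histories of g(x_k) and residuals f_k = g(x_k) - x_k
    for k in range(max_iter):
        gx = np.atleast_1d(g(x))
        f = gx - x
        if np.linalg.norm(f) < tol:
            return x, k
        g_hist.append(gx)
        f_hist.append(f)
        g_hist, f_hist = g_hist[-(m + 1):], f_hist[-(m + 1):]
        if len(f_hist) == 1:
            x = gx  # no history yet: plain fixed-point step
        else:
            # Differences of consecutive residuals and map values (one column each).
            dF = np.stack([f_hist[i + 1] - f_hist[i] for i in range(len(f_hist) - 1)], axis=1)
            dG = np.stack([g_hist[i + 1] - g_hist[i] for i in range(len(g_hist) - 1)], axis=1)
            # Least-squares coefficients that best cancel the current residual.
            gamma, *_ = np.linalg.lstsq(dF, f, rcond=None)
            x = gx - dG @ gamma
    return x, max_iter

# Toy example (hypothetical stand-in for a tracking update): solve x = cos(x).
solution, iterations = anderson_fixed_point(np.cos, 1.0)
```

The accelerated step reuses only quantities that the plain iteration already computes, so the extra cost per iteration is a small least-squares problem whose size is bounded by the window `m`.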



Seitz S M, Curless B, Diebel J, Scharstein D, Szeliski R. A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Volume 1 (CVPR'06), New York, NY, USA, 519–528 DOI:10.1109/cvpr.2006.19


Davison A J. Real-time simultaneous localisation and mapping with a single camera. In: Proceedings of Ninth IEEE International Conference on Computer Vision. Nice, France, IEEE, 2003 DOI:10.1109/iccv.2003.1238654


Klein G, Murray D. Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. Nara, Japan, IEEE, 2007 DOI:10.1109/ismar.2007.4538852


Rusinkiewicz S, Hall-Holt O, Levoy M. Real-time 3D model acquisition. ACM Transactions on Graphics, 2002, 21(3): 438–446 DOI:10.1145/566654.566600


Izadi S, Kim D, Hilliges O, Molyneaux D, Newcombe R, Kohli P, Shotton J, Hodges S, Freeman D, Davison A, Fitzgibbon A. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th annual ACM symposium on User interface software and technology, 2011, 559–568 DOI:10.1145/2047196.2047270


Newcombe R A, Davison A J, Izadi S, Kohli P, Hilliges O, Shotton J, Molyneaux D, Hodges S, Kim D, Fitzgibbon A. KinectFusion: Real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. Basel, Switzerland, IEEE, 2011, 127–136 DOI:10.1109/ismar.2011.6092378


Curless B, Levoy M. A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques-SIGGRAPH '96. New York, USA, ACM Press, 1996, 303–312 DOI:10.1145/237170.237269


Whelan T, Kaess M, Fallon M, Johannsson H, Leonard J, McDonald J. Kintinuous: Spatially extended KinectFusion. RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012


Roth H, Vona M. Moving volume KinectFusion. In: Proceedings of the British Machine Vision Conference. British Machine Vision Association, 2012, 1–11 DOI:10.5244/c.26.112


Frisken S F, Perry R N, Rockwood A P, Jones T R. Adaptively sampled distance fields. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques-SIGGRAPH '00. New York, USA, ACM Press, 2000, 249–254 DOI:10.1145/344779.344899


Sun X, Zhou K, Stollnitz E, Shi J Y, Guo B N. Interactive relighting of dynamic refractive objects. ACM Transactions on Graphics, 2008, 27(3): 1–9 DOI:10.1145/1360612.1360634


Zhou K, Gong M M, Huang X, Guo B N. Data-parallel octrees for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 2011, 17(5): 669–681 DOI:10.1109/tvcg.2010.75


Zeng M, Zhao F K, Zheng J X, Liu X G. Octree-based fusion for realtime 3D reconstruction. Graphical Models, 2013, 75(3): 126–136 DOI:10.1016/j.gmod.2012.09.002


Chen J W, Bautembach D, Izadi S. Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics, 2013, 32(4): 1–16 DOI:10.1145/2461912.2461940


Steinbrücker F, Sturm J, Cremers D. Volumetric 3D mapping in real-time on a CPU. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, IEEE, 2014, 2021–2028 DOI:10.1109/icra.2014.6907


Nießner M, Zollhöfer M, Izadi S, Stamminger M. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 2013, 32(6): 1–11 DOI:10.1145/2508363.2508374


Kähler O, Prisacariu V A, Ren C Y, Sun X, Torr P, Murray D. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(11): 1241–1250 DOI:10.1109/tvcg.2015.2459891


Dryanovski I, Klingensmith M, Srinivasa S S, Xiao J Z. Large-scale, real-time 3D scene reconstruction on a mobile device. Autonomous Robots, 2017, 41(6): 1423–1445 DOI:10.1007/s10514-017-9624-2


Pfister H, Zwicker M, van Baar J, Gross M. Surfels: Surface elements as rendering primitives. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques-SIGGRAPH '00. ACM Press, 2000, 335–342 DOI:10.1145/344779.344936


Keller M, Lefloch D, Lambers M, Izadi S, Weyrich T, Kolb A. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In: 2013 International Conference on 3D Vision. Seattle, WA, USA, IEEE, 2013, 1–8 DOI:10.1109/3dv.2013.9


Stückler J, Behnke S. Integrating depth and color cues for dense multi-resolution scene mapping using RGB-D cameras. In: 2012 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI). Hamburg, Germany, IEEE, 2012 DOI:10.1109/mfi.2012.6343050


Stückler J, Behnke S. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation, 2014, 25(1): 137–147 DOI:10.1016/j.jvcir.2013.02.008


Mallick T, Das P P, Majumdar A K. Characterizations of noise in kinect depth images: A review. IEEE Sensors Journal, 2014, 14(6): 1731–1740 DOI:10.1109/jsen.2014.2309987


Tomasi C, Manduchi R. Bilateral filtering for gray and color images. In: Proceedings of the IEEE International Conference on Computer Vision, 1998, 839–846 DOI:10.1109/iccv.1998.710815


Salas-Moreno R F, Glocker B, Kelly P H J, Davison A J. Dense planar SLAM. In: 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany, IEEE, 2014 DOI:10.1109/ismar.2014.6948422


Besl P J, McKay N D. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992, 14(2): 239–256 DOI:10.1109/34.121791


Chen Y, Medioni G. Object modeling by registration of multiple range images. In: Proceedings of 1991 IEEE International Conference on Robotics and Automation. Sacramento, CA, IEEE Comput. Soc. Press, 2724–2729 DOI:10.1109/robot.1991.132043


Kerl C, Sturm J, Cremers D. Dense visual SLAM for RGB-D cameras. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. Tokyo, IEEE, 2013 DOI:10.1109/iros.2013.6696650


Henry P, Krainin M, Herbst E, Ren X F, Fox D. RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments//Experimental Robotics. Berlin, Heidelberg, Springer Berlin Heidelberg, 2014, 477–491 DOI:10.1007/978-3-642-28572-1_33


Fischler M A, Bolles R C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981, 24(6): 381–395 DOI:10.1145/358669.358692


Triggs B, McLauchlan P F, Hartley R I, Fitzgibbon A W. Bundle adjustment—A modern synthesis//Vision Algorithms: Theory and Practice. Berlin, Heidelberg, Springer Berlin Heidelberg, 2000, 298–372 DOI:10.1007/3-540-44480-7_21


Lourakis M, Argyros A. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 2009, 36: 1–30


Konolige K. Sparse sparse bundle adjustment. In: Proceedings of the British Machine Vision Conference. Aberystwyth, British Machine Vision Association, 2010 DOI:10.5244/c.24.102


Zhou Q Y, Miller S, Koltun V. Elastic fragments for dense scene reconstruction. In: 2013 IEEE International Conference on Computer Vision. Sydney, Australia, IEEE, 2013 DOI:10.1109/iccv.2013.65


Zhou Q Y, Koltun V. Dense scene reconstruction with points of interest. ACM Transactions on Graphics, 2013, 32(4): 1–8 DOI:10.1145/2461912.2461919


Whelan T, Leutenegger S, Salas-Moreno R, Glocker B, Davison A. ElasticFusion: dense SLAM without a pose graph. In: Robotics: Science and Systems XI, Robotics: Science and Systems Foundation, 2015 DOI:10.15607/rss.2015.xi.001


Glocker B, Shotton J, Criminisi A, Izadi S. Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(5): 571–583 DOI:10.1109/tvcg.2014.2360403


Sumner R W, Schmid J, Pauly M. Embedded deformation for shape manipulation. ACM Transactions on Graphics, 2007, 26(3): 80 DOI:10.1145/1276377.1276478


Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91–110 DOI:10.1023/b:visi.0000029664.99615.94


Bay H, Ess A, Tuytelaars T, van Gool L. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 2008, 110(3): 346–359 DOI:10.1016/j.cviu.2007.09.014


Endres F, Hess J, Engelhard N, Sturm J, Cremers D, Burgard W. An evaluation of the RGB-D SLAM system. In: 2012 IEEE International Conference on Robotics and Automation. St Paul, MN, USA, IEEE, 2012 DOI:10.1109/icra.2012.6225199


Zhou Q Y, Koltun V. Depth camera tracking with contour cues. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA, IEEE, 2015, 632–638 DOI:10.1109/cvpr.2015.7298662


Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel H-P, Xu W, Casas D, Theobalt C. VNect: real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 2017, 36(4): 1–14 DOI:10.1145/3072959.3073596


Blais G, Levine M D. Registering multiview range data to create 3D computer objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995, 17(8): 820–824 DOI:10.1109/34.400574


Steinbrücker F, Kerl C, Cremers D. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In: 2013 IEEE International Conference on Computer Vision. Sydney, Australia, IEEE, 2013, 3264–3271 DOI:10.1109/iccv.2013.405


Lefloch D, Kluge M, Sarbolandi H, Weyrich T, Kolb A. Comprehensive use of curvature for robust and accurate online surface reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2349–2365 DOI:10.1109/tpami.2017.2648803


Newcombe R A, Fox D, Seitz S M. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, IEEE, 2015 DOI:10.1109/cvpr.2015.7298631


Orts-Escolano S, Rhemann C, Fanello S, Chang W, Kowdle A, Degtyarev Y, Kim D, Davidson P L, Khamis S, Dou M, Tankovich V, Loop C, Cai Q, Chou P A, Mennicken S, Valentin J, Pradeep V, Wang S, Kang S B, Kohli P, Lutchyn Y, Keskin C, Izadi S. Holoportation: virtual 3D teleportation in real-time. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology. Tokyo, Japan, ACM, 2016, 741–754 DOI:10.1145/2984511.2984517


Yu T, Zheng Z R, Guo K W, Zhao J H, Dai Q H, Li H, Pons-Moll G, Liu Y B. DoubleFusion: real-time capture of human performances with inner body shapes from a single depth sensor. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, IEEE, 2018 DOI:10.1109/cvpr.2018.00761


Loper M, Mahmood N, Romero J, Pons-Moll G, Black M J. SMPL: a skinned multi-person linear model. ACM Transactions on Graphics, 2015, 34(6): 1–16 DOI:10.1145/2816795.2818013


Guo K W, Xu F, Yu T, Liu X Y, Dai Q H, Liu Y B. Real-time geometry, albedo and motion reconstruction using a single RGBD camera. ACM Transactions on Graphics, 2017, 36(4): 1 DOI:10.1145/3072959.3126786


Rünz M, Agapito L. Co-fusion: Real-time segmentation, tracking and fusion of multiple objects. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017, 4471–4478 DOI:10.1109/icra.2017.7989518


Gao W, Tedrake R. SurfelWarp: efficient non-volumetric single view dynamic reconstruction. In: Robotics: Science and Systems XIV. Robotics: Science and Systems Foundation, 2018


Rünz M, Buffier M, Agapito L. MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects. In: 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany, IEEE, 2018 DOI:10.1109/ismar.2018.00024


Zhang Y Z, Xu W W, Tong Y Y, Zhou K. Online structure analysis for real-time indoor scene reconstruction. ACM Transactions on Graphics, 2015, 34(5): 1–13 DOI:10.1145/2768821


Dzitsiuk M, Sturm J, Maier R, Ma L N, Cremers D. De-noising, stabilizing and completing 3D reconstructions on-the-go using plane priors. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017 DOI:10.1109/icra.2017.7989457


Shi Y F, Xu K, Nießner M, Rusinkiewicz S, Funkhouser T. PlaneMatch: patch coplanarity prediction for robust RGB-D reconstruction//Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 767–784 DOI:10.1007/978-3-030-01237-3_46


Salas-Moreno R F, Newcombe R A, Strasdat H, Kelly P H J, Davison A J. SLAM++: simultaneous localisation and mapping at the level of objects. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA, IEEE, 2013, 1352–1359 DOI:10.1109/cvpr.2013.178


Tateno K, Tombari F, Navab N. When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM. In: 2016 IEEE International Conference on Robotics and Automation (ICRA). Stockholm, Sweden, IEEE, 2016, 2295–2302 DOI:10.1109/icra.2016.7487378


Aldoma A, Tombari F, Rusu R B, Vincze M. OUR-CVFH–oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation//Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, 113–122 DOI:10.1007/978-3-642-32717-9_12


Xu K, Shi Y F, Zheng L T, Zhang J Y, Liu M, Huang H, Su H, Cohen-Or D, Chen B Q. 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics, 2016, 35(6): 1–14 DOI:10.1145/2980179.2980224


Huang J W, Dai A, Guibas L, Niessner M. 3DLite: towards commodity 3D scanning for content creation. ACM Transactions on Graphics, 2017, 36(6): 1–14 DOI:10.1145/3130800.3130824


Dong S Y, Xu K, Zhou Q, Tagliasacchi A, Xin S Q, Nießner M, Chen B Q. Multi-robot collaborative dense scene reconstruction. ACM Transactions on Graphics, 2019, 38(4): 1–16 DOI:10.1145/3306346.3322942


McCormac J, Handa A, Davison A, Leutenegger S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). Singapore, IEEE, 2017, 4628–4635 DOI:10.1109/icra.2017.7989538


Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile, IEEE, 2015, 1520–1528 DOI:10.1109/iccv.2015.178


He K M, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice, IEEE, 2017, 2980–2988 DOI:10.1109/iccv.2017.322


Hu R Z, Wen C, van Kaick O, Chen L M, Lin D, Cohen-Or D, Huang H. Semantic object reconstruction via casual handheld scanning. ACM Transactions on Graphics, 2018, 37(6): 1–12 DOI:10.1145/3272127.3275024


Schönberger J L, Pollefeys M, Geiger A, Sattler T. Semantic visual localization. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, IEEE, 2018


Bernardini F, Martin I M, Rushmeier H. High-quality texture reconstruction from multiple scans. IEEE Transactions on Visualization and Computer Graphics, 2001, 7(4): 318–332 DOI:10.1109/2945.965346


Pulli K, Shapiro L G. Surface reconstruction and display from range and color data. Graphical Models, 2000, 62(3): 165–201 DOI:10.1006/gmod.1999.0519


Lensch H P A, Heidrich W, Seidel H P. A silhouette-based algorithm for texture registration and stitching. Graphical Models, 2001, 63(4): 245–262 DOI:10.1006/gmod.2001.0554


Stamos I, Allen P K. Geometry and texture recovery of scenes of large scale. Computer Vision and Image Understanding, 2002, 88(2): 94–118 DOI:10.1006/cviu.2002.0963


Corsini M, Dellepiane M, Ponchio F, Scopigno R. Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties. Computer Graphics Forum, 2009, 28(7): 1755–1764 DOI:10.1111/j.1467-8659.2009.01552.x


Corsini M, Dellepiane M, Ganovelli F, Gherardi R, Fusiello A, Scopigno R. Fully automatic registration of image sets on approximate geometry. International Journal of Computer Vision, 2013, 102(1/2/3): 91–111 DOI:10.1007/s11263-012-0552-5


Zhou Q Y, Koltun V. Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics, 2014, 33(4): 1–10 DOI:10.1145/2601097.2601134


Whelan T, Kaess M, Johannsson H, Fallon M, Leonard J J, McDonald J. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research, 2015, 34(4/5): 598–626 DOI:10.1177/0278364914551008


Bi S, Kalantari N K, Ramamoorthi R. Patch-based optimization for image-based texture mapping. ACM Transactions on Graphics, 2017, 36(4): 1–11 DOI:10.1145/3072959.3073610


Whelan T, Salas-Moreno R F, Glocker B, Davison A J, Leutenegger S. ElasticFusion: Real-time dense SLAM and light source estimation. The International Journal of Robotics Research, 2016, 35(14): 1697–1716 DOI:10.1177/0278364916669237


Meilland M, Barat C, Comport A. 3D High Dynamic Range dense visual SLAM and its application to real-time object re-lighting. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Adelaide, Australia, IEEE, 2013, 143–152 DOI:10.1109/ismar.2013.6671774


Li S D, Handa A, Zhang Y, Calway A. HDRFusion: HDR SLAM using a low-cost auto-exposure RGB-D sensor. In: 2016 Fourth International Conference on 3D Vision (3DV). CA, USA, IEEE, 2016, 314–322 DOI:10.1109/3dv.2016.40


Alexandrov S V, Prankl J, Zillich M, Vincze M. Towards dense SLAM with high dynamic range colors. In: Computer Vision Winter Workshop (CVWW), 2017


Xu L, Su Z, Han L, Yu T, Liu Y B, Fang L. UnstructuredFusion: realtime 4D geometry and texture reconstruction using commercial RGBD cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 1 DOI:10.1109/tpami.2019.2915229


Xu K, Zheng L T, Yan Z H, Yan G H, Zhang E, Niessner M, Deussen O, Cohen-Or D, Huang H. Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics, 2017, 36(6): 1–15 DOI:10.1145/3130800.3130812


Peng Y, Deng B L, Zhang J Y, Geng F Y, Qin W J, Liu L G. Anderson acceleration for geometry optimization and physics simulation. ACM Transactions on Graphics, 2018, 37(4): 1–14 DOI:10.1145/3197517.3201290