

2019, 1(5): 500–510
Published date: 2019-10-20    DOI: 10.1016/j.vrih.2019.09.004

Abstract

Background
Building on the seminal work of Zhou et al., much of the recent progress in learning monocular visual odometry, i.e., depth and camera motion from monocular videos, can be attributed to tricks in the training procedure, such as data augmentation and learning objectives.
Methods
Herein, we categorize a collection of such tricks through the theoretical examination and empirical evaluation of their effects on the final accuracy of the visual odometry.
Results/Conclusions
By combining the aforementioned tricks, we were able to significantly improve a baseline model adapted from SfMLearner without additional inference costs. Furthermore, we analyzed the principles of these tricks and the reason for their success. Practical guidelines for future research are also presented.

Content

1 Introduction
Emerging deep learning models, especially deep convolutional neural networks (CNNs), have provided remarkable insights into the field of computer vision. Additionally, they have enhanced the performance of various applications, from high-level vision tasks such as image classification and object detection to low-level vision tasks such as image enhancement and semantic editing. Specifically, monocular visual odometry, a fundamental and interdisciplinary application across computer vision, robotics, and virtual reality, has successfully employed data-driven deep models to improve its quality. As evidenced by the increasing accuracy of current methods on well-known autonomous driving benchmarks such as KITTI and Cityscapes, these methods have achieved an impressive level of reliability and accuracy in an unsupervised learning manner. However, what has led to this progress? The majority of recent approaches strongly resemble the original network structure and learning pipeline of SfMLearner[1] proposed by Zhou et al. They combine a monocular depth predictor $D$, which estimates a plausible depth map for an input RGB frame, with a pairwise camera pose estimator $P$, which regresses relative camera poses between adjacent RGB frames. These two modules are designed as two independent CNNs. An objective function that strengthens the photometric consistency between adjacent frames is optimized to learn the parameters of both modules. As this basic structure has remained unchanged since SfMLearner, what has enabled the performance gains of modern approaches?
We herein focus on investigating effective and general tricks that can improve the performance of monocular visual odometry, i.e., depth prediction and camera pose estimation, without introducing additional computational cost during inference. Many recent approaches rely on iterative refinement modules that involve postprocessing to progressively enhance the regressed depth maps and ego-motions, for instance, DDVO[2] and BA-Net[3]. We examine a collection of tricks, including network architectures, input preprocessing, output postprocessing, and loss functions, that already allow the baseline model to match the quality of methods that require time-consuming postprocessing or additional modules. These tricks have been proposed in recent years but have received relatively little attention; many are only briefly mentioned as implementation details or can only be found in the released source code. We evaluate them on multiple test scenarios and datasets and report their effects on the final accuracy through a thorough ablation study.
Our major contributions are as follows: (1) A comprehensive analysis of the tricks and the principles contributing to their success, along with practical guidelines for future studies. (2) A systematic evaluation of various tricks applied to the unsupervised training pipelines and network structures of monocular visual odometry. (3) To the best of our knowledge, this is the first study investigating valuable tricks for designing visual odometry models learned from monocular videos in an unsupervised manner. These improvements do not incur additional inference costs.
The remainder of this paper is organized as follows. First, we briefly introduce the baseline method in Section 2. In Section 3, the proposed tricks are discussed in detail. The experimental results are benchmarked in Section 4. Finally, Section 5 concludes this study.
2 Baseline model
We describe the baseline model following the same framework as SfMLearner[1], i.e., the model includes two components: a monocular depth predictor $D_t = \mathcal{D}_{\phi_D}(I_t)$ that estimates a depth map $D_t$ for an RGB frame $I_t$, and a camera pose estimator $\theta_{a \to b} = \mathcal{P}_{\phi_P}(I_a, I_b)$ that estimates the relative camera pose $\theta_{a \to b}$ from frame $I_a$ to frame $I_b$. The outcome $\theta = (\omega, t)$ is a vector of 6-DOF pose parameters, where $\omega$ is the 3D Euler rotation angle and $t$ is the 3D translation. Moreover, the depth map $D$ has the same spatial resolution as the input frame $I$. Additionally, $\phi_D$ and $\phi_P$ denote the network parameters of the depth predictor and pose estimator, respectively; they are learned during the unsupervised training stage.
2.1 Network structures
(1) The monocular depth predictor $\mathcal{D}_{\phi_D}$ follows the DispNet structure suggested for SfMLearner, which includes an encoder-decoder backbone and multiple shortcut links fusing encoded image features with decoded multiscale depth features. The decoder simultaneously generates multiscale disparity maps as byproducts of the final depth maps. All convolutional layers are followed by a ReLU activation except for the prediction layers, which follow $1/(10 \times \mathrm{sigmoid}(x) + 0.1)$ to bound the outcome disparity maps. The depth map is simply the reciprocal of the disparity map. (2) The camera pose estimator $\mathcal{P}_{\phi_P}$ takes concatenated adjacent frames as input and directly regresses the relative pose parameters in a predefined order. For example, when the input frames are concatenated in the order $(I_{t-1}, I_t, I_{t+1})$, SfMLearner outputs the poses in the order $(\theta_{t \to t-1}, \theta_{t \to t+1})$. Thus, $N$ input frames induce an $(N-1) \times 6$ output pose matrix, where each row contains the parameters of one pose. Its network structure is similar to the PoseNet in SfMLearner. See Figure 1 for a brief illustration of both modules.
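As a concrete illustration of the prediction-layer activation described above, the following is a minimal PyTorch-style sketch; the module and tensor names are ours rather than from the released code, the constants follow the formula quoted in the text, and actual implementations may differ slightly.

```python
import torch
import torch.nn as nn

class DispHead(nn.Module):
    """Prediction layer of the depth decoder: a 3x3 convolution whose raw output
    is bounded by 1 / (10 * sigmoid(x) + 0.1), as quoted in the text."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, feat):
        disp = 1.0 / (10.0 * torch.sigmoid(self.conv(feat)) + 0.1)  # bounded disparity
        depth = 1.0 / disp  # depth is simply the reciprocal of the disparity
        return disp, depth
```

A corresponding pose head would analogously reshape its regression output to the $(N-1) \times 6$ layout described above.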
2.2 Unsupervised training
For simplicity, as suggested by SfMLearner, we extract a large number of $(2N+1)$-frame clips from videos in the KITTI dataset[4] and define the $N$th (center) frame of each clip as the reference frame; the remaining frames in the clip are the target frames. During training, all input frames are resized to $128 \times 416$ and augmented by random horizontal flipping, random scaling, and color jittering. Zhou et al. suggested $N = 2$, and we adopt the list of training samples proposed in SfMLearner[1] to fairly compare related approaches and conduct component-wise analysis.
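For concreteness, a hedged sketch of how such a snippet could be assembled and augmented is given below; the function names, scaling range, and jitter strength are illustrative assumptions rather than the exact values used in the paper.

```python
import random
import torchvision.transforms.functional as TF

def make_snippet(frames, center, n=2):
    """Build a (2N+1)-frame clip centered at `center`; the center frame is the
    reference and the others are targets (N = 2 gives a 5-frame snippet)."""
    return [frames[center + off] for off in range(-n, n + 1)]

def augment(clip):
    """Resize to 128x416, then apply random horizontal flip, random scaling
    (with a random crop back to the training size), and a simple color jitter,
    consistently across all frames of the clip."""
    clip = [TF.resize(im, [128, 416]) for im in clip]
    if random.random() < 0.5:
        clip = [TF.hflip(im) for im in clip]
    scale = random.uniform(1.0, 1.15)                      # illustrative scaling range
    h, w = int(128 * scale), int(416 * scale)
    top, left = random.randint(0, h - 128), random.randint(0, w - 416)
    clip = [TF.crop(TF.resize(im, [h, w]), top, left, 128, 416) for im in clip]
    brightness = random.uniform(0.8, 1.2)                  # illustrative jitter strength
    clip = [TF.adjust_brightness(im, brightness) for im in clip]
    return clip
```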
The baseline model employs photometric consistency and disparity smoothness as its objective, aiming to optimize the quality of view synthesis with respect to the disparity and the rigid camera pose. (1) Photometric consistency is obtained by minimizing the visual dissimilarity between each warped target frame and the reference frame, according to the estimated rigid pose and the depth map of the reference frame. This warping operation is differentiable, and a warped frame $I_{t-1 \to t}$ from the target frame $I_{t-1}$ to the reference frame $I_t$ can be written as $I_{t-1 \to t}(x) = I_{t-1}(W(x; D_t, \theta_{t \to t-1}))$, where $W(x; D_t, \theta_{t \to t-1})$ is the rigid warping field from the reference frame $I_t$ to the target frame $I_{t-1}$. Therefore, the photometric consistency is
$$L_{pc} = \sum_{t \in T} \sum_{\tau \in \{1, \ldots, N\}} \rho(I_t, I_{t-\tau \to t}) + \rho(I_t, I_{t+\tau \to t})$$
where $\rho(\cdot, \cdot)$ is a metric calculating the pixel-wise dissimilarity between two quantities (typically the $l_1$-norm is sufficient, but more advanced metrics may be used), and $T$ indicates the set of indices of the reference frames used in the training stage. This term compares all backwarped target frames with the reference frame in the image coordinates of the reference frame. (2) The disparity smoothness term applies the TV-$l_1$ norm to encourage the piecewise-smooth and edge-preserving properties that most disparity maps obey. We denote it as $L_{ds} = \sum_{t \in T} \|\nabla d_t\|_{l_1}$, where $d_t = 1/D_t$. The combined objective is therefore
$$L_{baseline} = L_{pc} + \lambda_{ds} L_{ds}$$
where the weighting factor $\lambda_{ds}$ balances the contributions of the two terms.
The learning objective is optimized via stochastic gradient descent and its variants. We apply the Adam optimizer in this study and set the learning rate to 0.0002.
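The combined objective can be summarized in a short PyTorch-style sketch; `warped_targets` stands in for the result of the differentiable rigid warping described above, and the weight $\lambda_{ds}$ shown here is illustrative rather than the paper's value.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(disp):
    """TV-l1 penalty on the disparity map (L_ds)."""
    dx = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs().mean()
    dy = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs().mean()
    return dx + dy

def baseline_loss(ref, warped_targets, disp, lambda_ds=0.5):
    """L_baseline = L_pc + lambda_ds * L_ds with an l1 photometric term.
    `warped_targets` are the target frames backwarped to the reference view
    using the predicted depth and relative poses (the warp itself is omitted);
    lambda_ds = 0.5 is an illustrative value."""
    l_pc = sum(F.l1_loss(w, ref) for w in warped_targets) / len(warped_targets)
    l_ds = smoothness_loss(disp)
    return l_pc + lambda_ds * l_ds

# Both networks are optimized jointly with Adam at the learning rate stated above:
# optimizer = torch.optim.Adam(list(depth_net.parameters()) + list(pose_net.parameters()), lr=2e-4)
```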
2.3 Evaluation metrics
The depth prediction is evaluated according to the metrics introduced by Eigen et al.[5]. The metrics include (1) the percentage of pixels whose ratio $\delta = \max(D_t / D_t^{*}, D_t^{*} / D_t)$ is below a predefined threshold; (2) the RMSE in linear and logarithmic space; and (3) the absolute relative difference (abs rel) and squared relative difference (sq rel).
The camera pose estimation is evaluated using the scale-normalized absolute trajectory error (ATE) on 5-frame snippets in the testing dataset, as suggested by Zhou et al.[1].
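For reference, these depth metrics can be computed as follows; this is a NumPy sketch over valid ground-truth pixels, and the variable names are ours.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics of Eigen et al.: threshold accuracies, RMSE in
    linear/log space, and absolute/squared relative differences."""
    ratio = np.maximum(gt / pred, pred / gt)
    a1 = (ratio < 1.25).mean()
    a2 = (ratio < 1.25 ** 2).mean()
    a3 = (ratio < 1.25 ** 3).mean()
    rmse = np.sqrt(((gt - pred) ** 2).mean())
    rmse_log = np.sqrt(((np.log(gt) - np.log(pred)) ** 2).mean())
    abs_rel = (np.abs(gt - pred) / gt).mean()
    sq_rel = ((gt - pred) ** 2 / gt).mean()
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```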
3 Bag of tricks
In this section, we introduce a bag of tricks that improves the unsupervised learning of monocular visual odometry without changing the main structure of the baseline model or adding computationally demanding postprocessing modules.
3.1 Scale stabilization
Visual odometry learned from monocular videos typically suffers from the scale ambiguity of the depth maps and poses. This is because the warping operation is scale-invariant: the scale of the depth map can be compensated by the scale of the corresponding pose parameters, so the same warping field is applicable to arbitrary depth scales. Thus, under the optimized photometric consistency $L_{pc}$, the scale of the predicted disparity map remains unconstrained. However, because the disparity smoothness term $L_{ds}$ minimizes the gradients of the disparity map, if the disparity scale is arbitrary, then disparity maps with smaller scales always yield a smaller $L_{ds}$. This phenomenon has been discussed in DDVO[2] and results in catastrophic training failures, in both the baseline model and the seminal SfMLearner[1], whenever the disparity map approaches zero.
A simple trick helps regularize this scale ambiguity. We self-normalize the disparity map after the raw output and before the depth-map conversion, i.e., $\eta(d) = d / \bar{d}$, where $d$ is the raw disparity map and $\bar{d}$ is its spatial mean. This simple amendment does not yield an accurate scale of the trajectory or the scene geometry, but it prevents the disparity smoothness term $L_{ds}$ from shrinking the disparity scale and leads to a converged scale after sufficient training epochs. As a byproduct, pose scale stabilization is ensured (Figure 2).
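A minimal sketch of this self-normalization, assuming NCHW disparity tensors, is given below.

```python
def normalize_disparity(disp, eps=1e-7):
    """Self-normalize the raw disparity map by its spatial mean, eta(d) = d / d_bar,
    before converting it to depth; this pins the disparity scale and prevents the
    smoothness term from shrinking it toward zero."""
    mean = disp.mean(dim=(2, 3), keepdim=True)  # per-sample spatial mean over H and W
    return disp / (mean + eps)
```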
3.2 Overcoming the gradient locality
The backpropagation of the warping-based photometric consistency $L_{pc}$ can only flow gradients to the first-order local vicinity, which inevitably hampers training if the correct rigid correspondence lies in a textureless region or far from the current estimation[1].
Following the coarse-to-fine optimization strategy frequently applied in stereo and optical flow estimation, SfMLearner[1] uses deep supervision to explicitly optimize multilevel photometric consistency and disparity smoothness losses at their own resolutions. However, as with the typical drawbacks of coarse-to-fine optimization, estimation failures at the coarser levels (typically caused by the non-ideal sampling of the input frames) provide negative guidance to the finer levels, resulting in consistent depth errors throughout all levels. One trick to overcome this problem is to upsample the coarser-level depth maps to the resolution of the final depth map and then minimize the multilevel learning objectives in a common image coordinate corresponding to the input frames, as sketched below. This significantly downweights the confidence of isolated erratic predictions at the coarser levels.
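A sketch of this upsampling trick is shown below; `warp_and_compare` is a placeholder for the per-scale photometric/smoothness loss of Section 2.2, and the finest scale is assumed to be at index 0.

```python
import torch.nn.functional as F

def upsampled_multiscale_loss(disparities, ref, targets, warp_and_compare):
    """Upsample every coarse-level disparity map to the full input resolution and
    evaluate all loss terms in that common image coordinate, instead of supervising
    each scale at its own resolution."""
    full_size = disparities[0].shape[-2:]   # finest-scale spatial resolution
    total = 0.0
    for disp in disparities:
        disp_up = F.interpolate(disp, size=full_size, mode='bilinear', align_corners=False)
        total += warp_and_compare(disp_up, ref, targets)
    return total / len(disparities)
```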
Another useful trick is to replace the point-wise metric of the photometric consistency with an area-aware metric. An effective approach is to add SSIM[6] to enforce consistency with respect to local statistics in overlapping $3 \times 3$ windows, such as
$$\rho(I_t, I_{t-\tau}) = \alpha_1 \|I_t - I_{t-\tau}\|_{l_1} + \alpha_2 \left(1 - \mathrm{SSIM}(I_t, I_{t-\tau})\right)$$
where $\alpha_1$ and $\alpha_2$ are weighting factors. Moreover, recent methods calculate the pixel-wise distance between CNN features (e.g., VGG features[7]) of the warped target frame and the reference frame, also known as the perceptual loss[8], thereby encouraging gradients to flow over large receptive fields.
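A common way to realize this metric is the average-pooling approximation of SSIM over 3×3 windows sketched below; the weights α1 and α2 shown here are illustrative, not necessarily the values used in the paper.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 windows, computed with average pooling."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def dissimilarity(a, b, alpha1=0.15, alpha2=0.85):
    """rho(a, b) = alpha1 * |a - b|_1 + alpha2 * (1 - SSIM(a, b)); illustrative weights."""
    return alpha1 * (a - b).abs().mean() + alpha2 * (1.0 - ssim(a, b)).mean()
```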
3.3 Management of invalid area
The photometric consistency for learning visual odometry implicitly assumes that in the target area, (1) the scene is static without object motions; (2) no occlusion/disocclusion exists between the reference and target frames; and (3) the surface is Lambertian and the ambient lighting is consistent without sudden changes. In most autonomous driving scenarios, lighting changes or imaging noise can be filtered out by robust metrics such as SSIM and the perceptual loss. However, violations of the first two assumptions in the training data contaminate the gradients during backpropagation, thus inhibiting training.
SfMLearner[1] applies a unified explainability prediction network as the decoder part of its PoseNet, which outputs a soft mask to remove potential invalid areas (i.e., occlusion/disocclusion and dynamic objects) from the calculation of the photometric consistency $L_{pc}$. However, because the dissimilarity metric $\rho(\cdot, \cdot)$ is bounded in the intensity space, Zhou et al. discovered that this explainability mask does not necessarily facilitate the learning of monocular visual odometry, as suggested by the updated results in their GitHub repository.
3.3.1   Occlusion/Disocclusion
We assume that the scenarios convey only rigid motions. Recent studies such as Struct2Depth[9] and that of Godard et al.[10] handle occlusion in the photometric consistency in a nonparametric manner. They compute the minimum reconstruction loss between the warping from either the previous target frame $I_{t-\tau}$ or the next target frame $I_{t+\tau}$ into the reference frame $I_t$, written as
$$L_{pc} = \sum_{t \in T} \sum_{\tau \in \{1, \ldots, N\}} \sum_{x} \min\left(\|I_t(x) - I_{t-\tau \to t}(x)\|_{l_1},\; \|I_t(x) - I_{t+\tau \to t}(x)\|_{l_1}\right)$$
Neither reconstruction loss w.r.t. a pixel $x$ in the reference frame will be consistently better than the other, because the depth predictor $\mathcal{D}_{\phi_D}$ and the camera pose estimator $\mathcal{P}_{\phi_P}$ are applied homogeneously to any pair of consecutive frames; minimizing this loss encourages reasonable view synthesis in at least one direction and repels unconstrained warping inside the occlusion/disocclusion area.
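A per-pixel minimum reprojection term in this spirit can be sketched as follows; the tensor names are ours, and the warped frames are assumed to be produced by the differentiable warping of Section 2.2.

```python
import torch

def min_reprojection_loss(ref, warped_prev, warped_next):
    """Per-pixel minimum of the two reconstruction errors: the warp from the
    previous target frame and the warp from the next target frame, both
    resampled into the reference view."""
    err_prev = (ref - warped_prev).abs().mean(dim=1, keepdim=True)  # per-pixel l1 over channels
    err_next = (ref - warped_next).abs().mean(dim=1, keepdim=True)
    return torch.minimum(err_prev, err_next).mean()
```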
Meanwhile, another category of approaches applies the consistency between bi-directional warping fields to indicate occlusion/disocclusion areas. For example, a pixel $x$ whose inconsistency $\left\| W(x; D_t, \theta_{t \to t-\tau}) + W(W(x; D_t, \theta_{t \to t-\tau}); D_{t-\tau}, \theta_{t-\tau \to t}) \right\|$ is above a scale-dependent threshold[11,12] is marked as invalid. Using this hard thresholding to replace the learnable explainability mask in SfMLearner[1] results in much less blur in the predicted depth maps, and the estimated camera poses become more reliable.
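One way to implement such a check, assuming the warping fields are expressed as pixel displacements, is the UnFlow-style forward-backward test sketched below; the threshold constants are illustrative.

```python
import torch
import torch.nn.functional as F

def occlusion_mask(flow_fwd, flow_bwd, alpha1=0.01, alpha2=0.5):
    """Valid-pixel mask from forward/backward warping-field consistency: a pixel
    is marked invalid when the forward field plus the backward field (sampled at
    the forward-warped location) exceeds a scale-dependent threshold.
    Flows are [B, 2, H, W] displacements in pixels."""
    b, _, h, w = flow_fwd.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().to(flow_fwd.device)      # [2, H, W]
    target = grid.unsqueeze(0) + flow_fwd                                # forward-warped coordinates
    # Normalize to [-1, 1] for grid_sample, which expects (x, y) order.
    norm = torch.stack((2.0 * target[:, 0] / (w - 1) - 1.0,
                        2.0 * target[:, 1] / (h - 1) - 1.0), dim=-1)     # [B, H, W, 2]
    bwd_at_fwd = F.grid_sample(flow_bwd, norm, align_corners=True)
    diff = (flow_fwd + bwd_at_fwd).pow(2).sum(dim=1, keepdim=True)
    bound = alpha1 * (flow_fwd.pow(2).sum(1, keepdim=True)
                      + bwd_at_fwd.pow(2).sum(1, keepdim=True)) + alpha2
    return (diff < bound).float()    # 1 = valid, 0 = occluded/disoccluded
```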
3.3.2   Dynamic objects
The previous tricks are only valid when the scenes are static. It is thus beneficial to handle dynamic objects outside a static background and to learn the camera pose (and in turn the depth maps) in a robust manner. We can reformulate our view-synthesis-based objective with dynamic objects as a scene flow estimation problem that simultaneously estimates 3D scene structures, rigid camera motions, and 3D object flows. Unlike supervised methods that learn to segment moving objects based on explicit segmentation masks[9,13,14], unsupervised methods typically rely on an auxiliary optical flow estimation task to softly detect the dynamic areas. In this case, the optical flow of a pixel $x$ combines a pixel-wise object flow (if it exists) with a rigid flow generated from the depth map and the rigid camera motion.
(1) Optical flow as input
EPC[15], Ranjan et al.[16], Lv et al.[14], and DF-Net[17] employ a well-trained optical flow estimation network such as PWC-Net[18] to provide self-supervisory optical flows that softly detect moving objects after a consensus verification against the rigid warping fields. Given appropriate priors on the object flows, such as object-level smoothness and bi-directional consistency, the learning of depth prediction and camera pose estimation can be well constrained.
(2) Optical flow as output
Another set of methods does not employ optical flow as an input to augment the training. For example, GeoNet[11] formulates optical flows as the final outcomes, with the predicted depth maps and camera poses serving as intermediate outputs that feed a subsequent flow generator. The object flows become motion residuals that the flow generator compensates for in the final optical flow. However, the predicted motion residuals are sensitive to warping noise and may thus hamper the discovery of dynamic objects, which in turn may harm the visual odometry.
3.4 Improving the physical reliability
Photometric consistency learns appearance flows rather than physical motions (for example, warping fields in textureless areas typically satisfy the photometric consistency but are not reliable in the physical world) and therefore may not encode physically reliable scene structures. Because physically reliable optical flows always conform to the cycle consistency of bi-directional flows, valid flows from the reference frame always have corresponding inverse flows from the target frame back to their original locations in the reference frame. It is thus helpful to use this consistency as an effective prior to constrain the distribution of the generated warping fields. Specifically, the consistency contains two vital components: a bi-directional photometric consistency term
$$L_{bi\text{-}vc} = \sum_{t \in T} \sum_{\tau \in \{-N, \ldots, N\} \setminus \{0\}} M_{\tau \to t}\, \rho(I_t, I_{\tau \to t}) + \sum_{t \in T} \sum_{\tau \in \{-N, \ldots, N\} \setminus \{0\}} M_{t \to \tau}\, \rho(I_{t \to \tau}, I_\tau)$$
and a cycle warping consistency term
$$L_{cc} = \sum_{t \in T} \sum_{\tau \in \{-N, \ldots, N\} \setminus \{0\}} \sum_{x} M_{\tau \to t}(x) \left\| W(x; D_t, \theta_{t \to \tau}) + W(W(x; D_t, \theta_{t \to \tau}); D_\tau, \theta_{\tau \to t}) \right\| + \sum_{t \in T} \sum_{\tau \in \{-N, \ldots, N\} \setminus \{0\}} \sum_{x} M_{t \to \tau}(x) \left\| W(x; D_\tau, \theta_{\tau \to t}) + W(W(x; D_\tau, \theta_{\tau \to t}); D_t, \theta_{t \to \tau}) \right\|$$
where $M_{\tau \to t}$ and $M_{t \to \tau}$ are valid masks generated by the approaches described in Section 3.3: $M_{\tau \to t}$ indicates the valid area from the target frame $I_\tau$ to the reference frame $I_t$, and $M_{t \to \tau}$ vice versa. It is noteworthy that the depth values in the invalid regions are otherwise naively estimated by the depth smoothness prior $L_{ds}$ within their neighborhood (Figure 3).
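The sketch below evaluates both photometric directions and one of the two symmetric cycle terms for a single (reference, target) pair; the full losses sum the symmetric terms over every reference frame $t$ and offset $\tau$, and a per-pixel $l_1$ stands in for the dissimilarity metric $\rho$. The tensor names are ours.

```python
def bidirectional_losses(ref, tgt, warped_tgt_to_ref, warped_ref_to_tgt,
                         flow_ref_to_tgt, flow_tgt_back_at_ref,
                         mask_tgt_to_ref, mask_ref_to_tgt):
    """One (reference, target) pair of the bi-directional photometric term
    (L_bi-vc) and one direction of the cycle warping term (L_cc).
    `flow_tgt_back_at_ref` is the target->reference warping field resampled at
    the reference->target warped locations (see occlusion_mask above)."""
    err_fwd = (ref - warped_tgt_to_ref).abs().mean(dim=1, keepdim=True)
    err_bwd = (tgt - warped_ref_to_tgt).abs().mean(dim=1, keepdim=True)
    l_bi_vc = (mask_tgt_to_ref * err_fwd).mean() + (mask_ref_to_tgt * err_bwd).mean()
    # Cycle consistency: a valid forward field plus the resampled backward field
    # should cancel out; invalid pixels are excluded by the mask.
    cycle = (flow_ref_to_tgt + flow_tgt_back_at_ref).abs().sum(dim=1, keepdim=True)
    l_cc = (mask_tgt_to_ref * cycle).mean()
    return l_bi_vc, l_cc
```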
3.5 Enforcing long-term robustness
The baseline model, like most unsupervised visual odometry systems, uses short video sequences (typically 5-frame snippets) as training examples, which limits the learned system's ability to handle large or rare camera motions. Instead of increasing the length of the training snippets, which would significantly increase the computational costs in both the training and testing stages, an effective alternative is to randomly sample neighboring frames at longer temporal distances. In this case, the length of the training snippet remains the same as in the baseline model, but the capacity of the trained model to handle larger and more complex motions is augmented. Moreover, complex motions between adjacent frames render triangulation (or view synthesis, from another perspective) more challenging; therefore, optimizing the photometric consistency requires the learned depth maps to better represent the scene structures.
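A possible implementation of this augmentation is sketched below; the maximum temporal stride is an illustrative choice, not a value reported in the paper.

```python
import random

def sample_long_range_snippet(frames, center, n=2, max_gap=3):
    """Assemble a (2N+1)-frame training snippet whose neighbors are drawn at a
    random temporal stride up to `max_gap`, instead of always using adjacent
    frames; the snippet length stays the same, but the motions it contains grow
    larger and more varied."""
    gap = random.randint(1, max_gap)
    indices = [center + off * gap for off in range(-n, n + 1)]
    if indices[0] < 0 or indices[-1] >= len(frames):
        gap = 1                                   # fall back to adjacent frames near clip edges
        indices = [center + off for off in range(-n, n + 1)]
    return [frames[i] for i in indices]
```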
To conclude this section, we list a set of existing unsupervised monocular visual odometry methods and the tricks they apply in Table 1. It is noteworthy that different specific treatments of each trick result in different performance gains, even when methods use the same set of tricks.
Table 1  Tricks applied in existing unsupervised monocular visual odometry methods
Methods | Scale Normalization | Overcoming Gradient Locality | Handling Invalid Area | Dynamic Objects | Physical Reliability | Long-term Robustness

SfMLearner[1] × × × ×
GeoNet[11] ×
DDVO[2] × × × ×
Struct2Depth[9] × ×
EPC[15] ×
Ranjan et al. [16] ×
DF-Net[17] ×
Ours ×
4 Experimental results
We evaluate our combination of tricks on the baseline model, i.e., scale normalization, multiscale depth upsampling, occlusion handling, cycle consistency, and long-term data augmentation. Dynamic object handling requires external modules in the training phase; thus, it is omitted in our test scenario. The training configurations follow the unsupervised learning paradigm in Section 2. The evaluation of depth prediction uses the protocol and testing list of Eigen et al.[5]. The evaluation of pose estimation applies the protocol of SfMLearner[1], and the testing list is the KITTI odometry split[4]. It is noteworthy that the inherent scale ambiguity of monocular visual odometry requires the normalization of the depth maps and trajectories.
4.1 Depth prediction
As shown in Table 2, when truncating the depth predictions at 80m, the combined tricks enable the unsupervised baseline model to outperform all the compared monocular depth prediction methods on most evaluation metrics, even recent methods trained with calibrated stereo data. Our method combines these tricks in the training process and achieves superior performance without additional training or testing modules (e.g., DDVO[2] and GeoNet[11]). When truncating the predictions at 50m, our model achieves the best performance on all metrics. Please refer to the qualitative results in Figure 4 as well as the ablation studies regarding the removal of cycle consistency and long-term data augmentation. These tricks, especially cycle consistency, are important for obtaining a reliable spatial structure in the predicted depth maps. Some recent methods, such as GeoNet, apply cycle consistency in their flow estimation; however, it is not applied directly to the depth prediction, which results in less effective learning compared with our strategy.
Table 2  Depth prediction results on the KITTI dataset[4] using the split of Eigen et al.[5]
Method | Setting | abs rel | sq rel | RMSE | RMSE(log) | δ<1.25 | δ<1.25² | δ<1.25³
Cap 80m
Eigen et al.[5] | Depth | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958
Liu et al.[19] | Depth | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965
Godard et al.[10] | Stereo | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964
Zhou et al.[1] | Mono | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Zhou et al.[1]* | Mono | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959
GeoNet[11] | Mono | 0.164 | 1.303 | 6.090 | 0.247 | 0.765 | 0.919 | 0.968
DDVO[2] | Mono | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974
Ours w/o CC | Mono | 0.168 | 1.259 | 5.937 | 0.247 | 0.755 | 0.920 | 0.969
Ours w/o Aug | Mono | 0.144 | 1.175 | 5.430 | 0.220 | 0.819 | 0.942 | 0.976
Ours | Mono | 0.139 | 1.021 | 5.418 | 0.209 | 0.803 | 0.937 | 0.976
Cap 50m
Godard et al.[10] | Stereo | 0.140 | 0.976 | 4.471 | 0.232 | 0.818 | 0.931 | 0.969
Garg et al.[20] | Pose | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962
Zhou et al.[1] | Mono | 0.201 | 1.391 | 5.181 | 0.264 | 0.696 | 0.900 | 0.966
GeoNet[11] | Mono | 0.157 | 0.990 | 4.600 | 0.231 | 0.781 | 0.931 | 0.974
Ours | Mono | 0.131 | 0.805 | 4.021 | 0.202 | 0.820 | 0.947 | 0.982
Notes: We report seven metrics as suggested by Eigen et al.[5]. Bold indicates the overall best results. "w/o CC" and "w/o Aug" denote our visual odometry module trained without cycle consistency and without long-term data augmentation, respectively. *Updated results provided on the website of Zhou et al.[1].
4.2 Camera pose estimation
We compare our method with the traditional ORB-SLAM (full), whose results were obtained from the website of Zhou et al.[1]. Furthermore, we compare our method with SfMLearner[1] and GeoNet[11]. As shown in Table 3, our method outperforms "mean odometry" and the conventional ORB-SLAM (full). Moreover, among deep-learning-based methods, our estimated poses are more accurate than those of almost all reference methods, demonstrating the significance of the applied tricks. It is noteworthy that EPC++ (mono) is the extension of EPC[15] learned from monocular videos. Furthermore, GeoNet[11] and DF-Net[17], which are inferior to our method, apply cycle consistency to enforce the reliability of the camera motion. Thus, the performance gain of our camera pose estimation is attributed to the long-term data augmentation, which demonstrates the value of this simple trick.
Table 3  Camera pose evaluation on the KITTI visual odometry split, based on the ATE
Method | Seq-09 | Seq-10
Mean Odometry | 0.032 ± 0.026 | 0.028 ± 0.023
ORB-SLAM (full) | 0.014 ± 0.008 | 0.012 ± 0.011
Zhou et al.[1] | 0.021 ± 0.017 | 0.020 ± 0.015
GeoNet[11] | 0.012 ± 0.007 | 0.012 ± 0.009
DF-Net[17] | 0.017 ± 0.007 | 0.015 ± 0.009
EPC++ (mono)[21] | 0.013 ± 0.007 | 0.012 ± 0.008
Struct2Depth[9] | 0.011 ± 0.006 | 0.011 ± 0.010
Ours | 0.010 ± 0.007 | 0.008 ± 0.007
5 Conclusion
In conclusion, this is the first study to theoretically analyze and categorize the tricks applied for learning visual odometry from monocular videos without additional refinement modules augmenting the basic depth prediction and camera pose estimation networks. We showed that an appropriate combination of these tricks improves the baseline model significantly and occasionally outperforms sophisticated reference methods on well-known autonomous driving benchmarks such as the KITTI dataset.

References

1. Zhou T H, Brown M, Snavely N, Lowe D G. Unsupervised learning of depth and ego-motion from video. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.700
2. Wang C Y, Buenaposada J M, Zhu R, Lucey S. Learning depth from monocular videos using direct methods. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00216
3. Tang C, Tan P. BA-Net: dense bundle adjustment networks. In: International Conference on Learning Representations (ICLR), 2019
4. Geiger A, Lenz P, Stiller C, Urtasun R. Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research, 2013, 32(11): 1231–1237 DOI:10.1177/0278364913491297
5. Eigen D, Puhrsch C, Fergus R. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems (NIPS), 2014, 2366–2374
6. Wang Z, Bovik A C, Sheikh H R, Simoncelli E P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004, 13(4): 600–612 DOI:10.1109/tip.2003.819861
7. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014
8. Johnson J, Alahi A, Li F F. Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 694–711 DOI:10.1007/978-3-319-46475-6_43
9. Casser V, Pirk S, Mahjourian R, Angelova A. Depth prediction without the sensors: leveraging structure for unsupervised learning from monocular videos. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33, 8001–8008 DOI:10.1609/aaai.v33i01.33018001
10. Godard C, Aodha O M, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA, IEEE, 2017 DOI:10.1109/cvpr.2017.699
11. Yin Z C, Shi J P. GeoNet: unsupervised learning of dense depth, optical flow and camera pose. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00212
12. Meister S, Hur J, Roth S. UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In: Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018
13. Cao Z, Kar A, Hane C, Malik J. Learning independent object motion from unlabelled stereoscopic videos. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 5594–5603
14. Lv Z, Kim K, Troccoli A, Sun D Q, Rehg J M, Kautz J. Learning rigidity in dynamic scenes with a moving camera for 3D motion field estimation. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 484–501 DOI:10.1007/978-3-030-01228-1_29
15. Yang Z H, Wang P, Wang Y, Xu W, Nevatia R. Every pixel counts: unsupervised geometry learning with holistic 3D motion understanding. In: Lecture Notes in Computer Science. Cham: Springer International Publishing, 2019, 691–709 DOI:10.1007/978-3-030-11021-5_43
16. Ranjan A, Jampani V, Balles L, Kim K, Sun D, Wulff J, Black M J. Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, 12240–12249
17. Zou Y L, Luo Z L, Huang J B. DF-Net: unsupervised joint learning of depth and flow using cross-task consistency. In: Computer Vision–ECCV 2018. Cham: Springer International Publishing, 2018, 38–55 DOI:10.1007/978-3-030-01228-1_3
18. Sun D Q, Yang X D, Liu M Y, Kautz J. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA, IEEE, 2018 DOI:10.1109/cvpr.2018.00931
19. Liu F Y, Shen C H, Lin G S, Reid I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2016, 38(10), 2024–2039
20. Garg R, Vijay K B G, Carneiro G, Reid I. Unsupervised CNN for single view depth estimation: geometry to the rescue. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 740–756 DOI:10.1007/978-3-319-46484-8_45
21. Luo H, Yang Z, Wang Y, Xu W, Nevatia R, Yuille A L. Every pixel counts++: joint learning of geometry and motion with 3D holistic understanding. arXiv preprint arXiv:1810.06125, 2018