Unsupervised monocular depth estimation with aggregating image features and wavelet SSIM (Structural SIMilarity) loss

Unsupervised learning has been shown to be effective for image depth prediction. However, its accuracy is limited by moving objects in the scene and by the lack of other suitable constraints. This paper focuses on improving the accuracy of depth prediction without increasing the computational burden of the depth network. Aggregated residual transformations are embedded in the depth network to extract high-dimensional image features, so that a more accurate mapping between the feature map and the depth map can be built without extra computational cost. Additionally, the 2D discrete wavelet transform is applied to the structural similarity (SSIM) loss to reduce the photometric loss effectively; it divides the entire image into various patches and yields high-quality image information. Finally, the effectiveness of the proposed method is demonstrated. The trained model improves the performance of the depth network on the KITTI dataset and decreases the domain gap on the Make3D dataset.


INTRODUCTION
Predicting depth from a single 2D image is a fundamental task in computer vision. It has been studied for many years and has widespread practical applications, such as visual navigation [1], object tracking [2,3], and surgery [4]. Moreover, accurate depth information is vital to the performance of autonomous driving, where expensive laser sensors are usually used. Recent advances in convolutional neural networks (CNNs) show their powerful ability to learn an image's high-dimensional features. In particular, a mapping between image features and image depth can be built. Generally, monocular depth estimation approaches can be classified into three categories: supervised [5][6][7][8][9], semi-supervised [10], and unsupervised [11][12][13][14][15][16][17][18][19]. Both supervised and semi-supervised learning rely on depth ground truth. Using a laser sensor to obtain the depth ground truth of many images is expensive and difficult. Unsupervised learning, in contrast, eliminates the dependency on depth ground truth. Therefore, more and more studies train monocular depth estimation networks with unsupervised methods from monocular images or stereo pairs. Compared with stereo pairs, a monocular dataset is more general as network input. However, it requires estimating the pose transformation between consecutive frames simultaneously. As a result, a pose estimation network is needed that takes a sequence of frames as input and outputs the relative 6-DoF pose.
Most unsupervised depth estimation networks [5,8,11] are constructed using typical CNN structures. On the one hand, a series of max-pooling and stride operations may reduce the network's ability to learn image features and lower the quality of the depth map. On the other hand, to improve the performance of the network, deeper convolution layers are designed in depth CNNs. They increase the computational burden of the network and bring extra hardware cost. In most cases, the cost of the network outweighs the benefits it generates. To improve depth estimation performance without increasing the network burden, an end-to-end unsupervised monocular depth network framework is proposed in this paper. Inspired by previous work [20] on the image classification task, aggregated residual transformations (ResNeXt) are migrated to the depth estimation field. Based on typical depth CNNs, the ResNeXt block is embedded to extract more delicate image features in the encoder network. As a result, a more accurate mapping between the feature map and the depth map can be built without extra network burden. In addition, the accuracy of the depth network suffers from noise (e.g., haze and rain) in complex images. To reduce the influence of noise, the 2D discrete wavelet transform [21] is applied to the SSIM loss, which can recover high-quality clear images. A sample of depth prediction is shown in Figure 1.
In summary, our proposed network can improve depth prediction accuracy without increasing network computational complexity. The contributions of this paper can be summarized as follows:
(1) Based on a ResNeXt block, a novel feature extraction module for the depth network is developed to improve the accuracy of depth prediction. It not only extracts high-dimensional image features but also guides the network to learn the scene more deeply and predict the depth of more distant pixels.
(2) A wavelet SSIM loss is applied to the photometric loss to help the training converge. Patches with clearer image information computed by the DWT, rather than the whole image, are used as input to the loss function, which can remove some noise (haze, rain, etc.) from the image.
The rest of this paper is organized as follows. The related work on depth estimation is discussed in Section 2. Section 3 presents an overview of the proposed network architecture and the loss function. Then, some experiments based on different datasets are presented to verify the performance of the proposed network in Section 4. Finally, the conclusions and future work are introduced in Section 5.

Supervised depth estimation
Based on vast training datasets with depth ground truth, depth estimation networks have shown great performance in recent years. Eigen et al. [5] first demonstrated the huge potential of CNNs in depth prediction from a single image. They obtained reliable depth estimation results by using a coarse-to-fine depth network. Further, Liu et al. [7] combined CNNs with Markov random fields (MRF) to learn intermediate features, acquiring visually clearer local details in the depth map. Laina et al. [8] changed the structure of the depth network and proposed a residual CNN to model the mapping between a monocular image and its corresponding depth map. Instead of using absolute depth ground truth, Chen et al. [9] acquired relative depth labels between random pixel pairs from the image to train the depth network. In addition, to obtain a dense depth map, Kuznietsov et al. [10] proposed a semi-supervised method that used both sparse ground truth depth for supervised learning and a photo-consistent loss in stereo images for unsupervised learning.
Even though the works mentioned above significantly contributed to depth estimation, these methods still suffer from the limitation of depth ground truth.

Unsupervised depth estimation from stereo images
Using stereo images is a feasible unsupervised way to train a monocular depth network. A depth network can be trained by predicting the left-right pixel disparities between stereo pairs, and it can then be applied to predict depth from a monocular image. Garg et al. [11] first used stereo pairs to train a depth network with known disparities between left and right images and achieved great performance. Inspired by [11], Godard et al. [12] designed a novel loss function that enforces consistency between the left-right and right-left disparities produced from stereo images. Zhan et al. [13] extended the stereo-based network architecture by adding a visual odometry (VO) network. The performance of Zhan's network was superior to other unsupervised methods at that time. To recover an absolute-scale depth map from stereo pairs, Li et al. [14] proposed a visual odometry system (UnDeepVO), which is capable of estimating the 6-DoF camera pose and recovering the absolute depth value.

Unsupervised depth estimation from monocular images
For monocular depth estimation, it is necessary to design an extra pose network to obtain the pose transformation between consecutive frames. The depth and pose networks are trained together with a shared loss function. Zhou et al. [16] pioneered the training of depth networks with monocular video. They proposed two separate networks (SfMLearner) to learn image depth and inter-frame pose transformation. However, the accuracy of the depth network was often limited by the influence of moving objects and occlusion. Their work motivated other researchers to address these shortcomings. Subsequently, Casser et al. [17] developed a separate network (struct2depth) to learn the motion of each moving object, but their work required the number of moving objects to be hypothesized in advance. In addition, researchers found that optical flow could be employed to deal with moving object motion. Yin et al. [18] developed a cascading network framework (GeoNet) to adaptively learn rigid and non-rigid object motion. Recently, multi-task training methods have been proposed. Luo et al. [19] trained depth, camera pose, and optical flow networks jointly with 3D holistic understanding (EPC++). Similarly, Ranjan et al. [24] proposed a competitive collaboration mechanism (CC) that handles depth, camera motion, optical flow, and motion segmentation together. The joint networks of both Luo and Ranjan inevitably increased the training difficulty and the computational burden of the network.
From the above works, we can see that most studies aim to improve the accuracy of the depth network by changing the network structure or building a more robust supervisory signal. It is worth noting that these methods add network complexity and computational burden while improving accuracy. This motivates us to study how to balance the two. Poggi et al. [15] presented an effective pyramid feature extraction network, which can run in real time on a CPU. However, the accuracy of the network cannot satisfy the requirements of practical applications. Xie et al. [20] provided a template with aggregated residual transformations (ResNeXt), which achieved a better classification result without increasing network computation. Because of these advantages, we apply ResNeXt to image depth prediction. The ResNeXt block serves as a feature extraction module of the depth network to learn the image's high-dimensional features. The proposed approach is independent of depth ground truth and does not increase the computational burden.

METHOD
The proposed method contains two parts: an end-to-end network framework and a loss function. The network framework consists of a depth network and a pose network, as shown in Figure 2. Given unlabeled monocular sequences, the depth network outputs the predicted depth map, while the pose network outputs the 6-DoF relative pose transformation between adjacent frames. The loss function is made up of the basic photometric loss and the depth smoothness loss, and it couples both networks into the end-to-end network.

Problem statement
The aim of the unsupervised monocular depth network is to develop a mapping relationship $\Gamma : I(p) \to D(p)$, where $I(p)$ is an arbitrary image, $D(p)$ is the predicted depth map of the image $I(p)$, and $p$ is a pixel in the image $I(p)$. Establishing a more accurate mapping function $\Gamma$ is considered in this paper, which includes: (a) a simple and effective network pipeline without increasing network computational complexity; and (b) a high-quality depth map $D(p)$ with subtle details for a given input image $I(p)$.
For Item (a), our focus is to change the basic building blocks of the depth CNN structure using aggregated residual transformations (ResNeXt). In the depth network, ResNeXt serves as the feature extraction module to learn the image's high-dimensional features without increasing the computational burden. For Item (b), low-texture regions in the low-scale depth map are weakened, which leads to inaccurate image reconstruction. Inspired by [22], four images are reconstructed at full resolution instead of at four different resolutions. Before the four images are reconstructed, the predicted four-scale depth maps are resized to the input resolution with bilinear interpolation.
A single image $I(p)$ is considered as the input of the depth network. The designed depth network outputs a five-scale feature map $F_i$ ($i \in \{1, 2, 3, 4, 5\}$) in the encoder network and a four-scale depth map $D_k$ in the decoder network. The mapping function is designed as Equation (1), where $n$ denotes the number of feature maps, $n = 5$, and $k$ represents the scale factor of the depth map, $k \in \{0, 1, 2, 3\}$. The resolution of feature map $F_i$ is $1/2^i$ of the input resolution.
Then, bilinear interpolation is applied to each predicted depth map $D_k$ to acquire the full-resolution depth map $D(I(p))$, which is defined as follows:
$$D(I(p)) = \mathcal{B}(D_k), \quad k \in \{0, 1, 2, 3\} \qquad (2)$$
where $\mathcal{B}$ represents bilinear interpolation, which recovers the resolution $1/2^k$ of $D_k$ to the full input resolution.
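The resizing step described above can be illustrated with a small NumPy sketch. The function name `bilinear_resize`, the corner-aligned sampling grid, and the random stand-in depth maps are assumptions made for illustration; the paper does not state which interpolation convention its implementation uses.

```python
import numpy as np

def bilinear_resize(depth, out_h, out_w):
    """Resize a 2D map to (out_h, out_w) with bilinear interpolation.

    Corner-aligned sampling is an assumption of this sketch.
    """
    depth = np.asarray(depth, dtype=float)
    in_h, in_w = depth.shape
    ys = np.linspace(0, in_h - 1, out_h)   # sample rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # sample columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical interpolation weights
    wx = (xs - x0)[None, :]                # horizontal interpolation weights
    top = depth[y0][:, x0] * (1 - wx) + depth[y0][:, x1] * wx
    bottom = depth[y1][:, x0] * (1 - wx) + depth[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Four hypothetical depth maps at 1, 1/2, 1/4, and 1/8 of a 192 x 640 input.
full_h, full_w = 192, 640
rng = np.random.default_rng(0)
multi_scale = [rng.random((full_h // 2**k, full_w // 2**k)) for k in range(4)]
full_res = [bilinear_resize(d, full_h, full_w) for d in multi_scale]
```

After this step every scale has the input resolution, so the same full-resolution photometric loss can be evaluated on each of the four predictions.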
The full-resolution depth map $D(I(p))$ is necessary to reconstruct the input image. Given two adjacent images with a target view and a source view $\langle I_t(p), I_s(p) \rangle$, and the predicted 6-DoF pose transformation $T$, the homogeneous coordinate $p_{t \to s}$ in the source image $I_s$ onto which a pixel $p_t$ of the target image $I_t$ is mapped is computed as
$$p_{t \to s} \sim K \, T_{t \to s} \, D(I_t(p_t)) \, K^{-1} p_t \qquad (3)$$
where $K$ is the camera intrinsic matrix, $p_t$ is the normalized coordinate in the target image $I_t$, and $T_{t \to s}$ is the $4 \times 4$ matrix obtained from $T$.
Therefore, the reconstructed target image $\hat{I}_t$ can be obtained by Equation (3), using the differentiable bilinear sampling mechanism [16] to sample the corresponding pixel $p_{t \to s}$ on the source image $I_s$. The reconstructed target image $\hat{I}_t$ is used to calculate the photometric loss in Part D.
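To make the projection step concrete, the following NumPy sketch maps a single target pixel into the source image using the standard pinhole reprojection used in [16]. The function name `reproject_pixel`, the intrinsics, and the example pose are hypothetical values chosen for illustration; the differentiable bilinear sampling itself is not shown.

```python
import numpy as np

def reproject_pixel(p_t, depth, K, T_t2s):
    """Project target pixel p_t = (u, v) with known depth into the source view.

    K is a 3x3 intrinsic matrix and T_t2s a 4x4 rigid transform
    (target camera frame to source camera frame).
    """
    u, v = p_t
    p_h = np.array([u, v, 1.0])                      # homogeneous pixel
    cam_t = depth * (np.linalg.inv(K) @ p_h)         # back-project to 3D
    cam_s = (T_t2s @ np.append(cam_t, 1.0))[:3]      # move to source frame
    proj = K @ cam_s                                 # project with intrinsics
    return proj[:2] / proj[2]                        # divide by depth

# Hypothetical camera: focal length 100 px, principal point (64, 32).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.5   # a 0.5-unit translation along x, no rotation

same = reproject_pixel((50.0, 40.0), 10.0, K, np.eye(4))
shifted = reproject_pixel((50.0, 40.0), 10.0, K, T)
```

With an identity pose the pixel maps to itself; with the pure x-translation the pixel shifts horizontally by focal length times translation divided by depth, which is the parallax that the photometric loss exploits.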

Feature extraction module
Equation (1) is applied to exploit higher-dimensional features and acquire a feature map with more details. Since the ResNeXt block performs well on the classification task, the feature extraction module is constructed from the ResNeXt block. In contrast to the ResNet used in most depth CNNs, the ResNeXt block aggregates more image features without adding network parameters, as shown in Figure 3.
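The claim that aggregation does not add parameters can be checked with a short weight count. The sketch below compares a plain bottleneck with a grouped one; the channel widths (256 input channels, width 64 versus 32 groups of width 4) are illustrative values taken from the template in the ResNeXt paper [20], not figures reported for the depth network in this paper, and biases and normalization layers are ignored.

```python
def bottleneck_params(in_ch, width, out_ch, groups=1):
    """Count convolution weights of a 1x1 -> 3x3 -> 1x1 bottleneck.

    The 3x3 layer is a grouped convolution with `groups` groups.
    """
    reduce = in_ch * width                       # 1x1 convolution
    conv3x3 = 3 * 3 * (width // groups) * width  # grouped 3x3 convolution
    expand = width * out_ch                      # 1x1 convolution
    return reduce + conv3x3 + expand

# Plain ResNet bottleneck: 256 -> 64 -> 64 -> 256.
resnet_block = bottleneck_params(256, 64, 256)

# ResNeXt bottleneck with cardinality 32 and group width 4 (total width 128).
resnext_block = bottleneck_params(256, 128, 256, groups=32)

relative_change = (resnext_block - resnet_block) / resnet_block
```

Under these assumed widths the two blocks differ by well under one percent in weight count, while the grouped block applies 32 parallel transformations instead of one.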
The ResNeXt block feeds the input into 32 parallel groups, each of which learns image features separately. All groups share the same hyperparameters, and each is designed as a bottleneck structure that cascades three convolution layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1. The first 1 × 1 convolution layer extracts high-dimensional abstract features by reducing (or increasing) output channels. Given an input image $x$ with $W \times H \times C'$ resolution, the transformation function $\mathcal{T}_i$ of the $i$th group maps image $x$ to the high-dimensional feature map $\mathcal{T}_i(x)$. The aggregated output $\mathcal{F}(x)$ is the summation of the outputs of all groups, defined as follows:
$$\mathcal{F}(x) = \sum_{i=1}^{C} \mathcal{T}_i(x)$$
where $C$ is the number of groups, known as the cardinality; here $C = 32$.
Then, to be closely connected with the input, a residual operation is used. The aggregated output feature map $y$ for each module is
$$y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)$$

Network architecture
The proposed depth estimation network employs a U-Net structure consisting of an encoder network and a decoder network. The encoder network is built by embedding the ResNeXt block [20]. It transforms the three-dimensional monocular image into a multi-channel feature map. The decoder network builds the relationship between the extracted feature map and the depth map through a series of upsample and convolution (Up-convolution) operations, as shown in Figure 4.
(1) To eliminate texture copy artifacts in the depth map, the Up-convolution operation [22] is used instead of deconvolution to reshape the feature map.
(2) Because max-pooling and stride operations ignore some local features and cause details to be lost in the depth image, skip connections are used to merge the corresponding feature maps of the encoder network into the decoder network and obtain fine image details.
(3) Inspired by [22], we resize all depth maps to the same resolution as the input using bilinear interpolation (the interpolation operation in Equation (2)).
The structure of the pose network is a standard ResNet18 encoder, similar to the one in [22]. More input images in the pose network bring more accurate depth estimation under certain conditions. However, to reduce the number of training parameters of the pose network, the pose network takes $N$ ($N = 3$) adjacent images as input. Therefore, the shape of the convolutional weights in the first layer is $(3 \times N) \times 64 \times 3 \times 3$ rather than the default $3 \times 64 \times 3 \times 3$. The output of the pose network has $6 \times (N - 1)$ channels. In addition, our pose network is trained without pre-training. All convolution layers are activated by the ReLU function [25] except for the last layer. When the pose result is evaluated, an image pair is fed into the pose network to produce six output channels: the first three channels are rotation and the last three are translation.
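The shape bookkeeping in this paragraph can be written down in a few lines. This is only an illustrative sketch: the helper names `pose_network_shapes` and `split_pose` are hypothetical, and the 3 × 3 kernel and 64 output channels are simply the values quoted in the text.

```python
def pose_network_shapes(num_frames, base_channels=64, kernel=3):
    """Return (first-layer weight shape, number of output channels).

    The frames are stacked along the channel axis, so the first layer sees
    3 * num_frames input channels; one 6-DoF pose is predicted for each
    adjacent pair of frames.
    """
    first_conv = (3 * num_frames, base_channels, kernel, kernel)
    out_channels = 6 * (num_frames - 1)
    return first_conv, out_channels

def split_pose(pose6):
    """Split one 6-channel pose into (rotation, translation) triples."""
    return tuple(pose6[:3]), tuple(pose6[3:])

train_shapes = pose_network_shapes(3)   # three adjacent frames in training
eval_shapes = pose_network_shapes(2)    # an image pair at evaluation time
rotation, translation = split_pose([0.01, -0.02, 0.0, 0.1, 0.0, 1.2])
```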

Wavelet SSIM loss
In general, the SSIM [26] loss is included in the photometric loss to measure the degree of similarity between images. In this paper, the 2D discrete wavelet transform (DWT) is applied to SSIM to decrease the photometric loss. First, the DWT divides an image into patches with different frequencies. Then, the SSIM of each patch is computed. To preserve high-frequency image details and avoid producing "holes" or artifacts in low-texture regions, the weight of each patch in the SSIM loss can be adjusted flexibly.
In the 2D discrete wavelet transform (DWT), low-pass and high-pass filters are applied to an image to obtain the convolution results. For instance, four filters, $f_{LL}$, $f_{LH}$, $f_{HL}$, and $f_{HH}$, are obtained by multiplying the low-pass and high-pass filters. The DWT divides an image into four small patches with different frequencies through these four filters, which can remove unnecessary interference from the images (e.g., haze and rain). Iteratively, the DWT can be formulated as follows:
$$I_{i+1}^{LL},\; I_{i+1}^{LH},\; I_{i+1}^{HL},\; I_{i+1}^{HH} = \mathrm{DWT}(I_i^{LL})$$
where $i$ is the number of DWT iterations and $I_0^{LL}$ is the original image. In this paper, $i = 2$. $I^{LL}$ is the down-sampled image, $I^{HL}$ and $I^{LH}$ are the horizontal and vertical edge detection images, respectively, and $I^{HH}$ is the corner detection image. To preserve high-frequency image details and avoid producing image artifacts, a coarse-to-fine manner is adopted to change the image resolution in the SSIM loss. The DWT divides the image into four patches, whose ratios are
$$I^{LL} : I^{LH} : I^{HL} : I^{HH} = r^2 : r(1-r) : r(1-r) : (1-r)^2$$
where $r$ is the weight of each patch. The initial value of $r$ is 0.7. $x$ is the target image and $y$ is the source image.
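The iterative decomposition can be sketched with a Haar wavelet in NumPy. Several points here are assumptions rather than details confirmed by the paper: the Haar basis, the sub-band naming (conventions for LH and HL differ between libraries), the helper names, and the ratio form $r^2 : r(1-r) : r(1-r) : (1-r)^2$ for the patch weights, which follows the wavelet SSIM formulation of [21]. Image dimensions must be divisible by two at every level.

```python
import numpy as np

def haar_dwt2(img):
    """One level of an orthonormal 2D Haar DWT on an even-sized image."""
    img = np.asarray(img, dtype=float)
    a = img[0::2, 0::2]   # top-left pixel of each 2x2 cell
    b = img[0::2, 1::2]   # top-right
    c = img[1::2, 0::2]   # bottom-left
    d = img[1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2.0   # low-pass in both directions
    lh = (a - b + c - d) / 2.0   # difference between left and right columns
    hl = (a + b - c - d) / 2.0   # difference between top and bottom rows
    hh = (a - b - c + d) / 2.0   # diagonal difference
    return ll, lh, hl, hh

def wavelet_patches(img, iterations=2):
    """Apply the DWT repeatedly to the LL band, as in the iterative formula."""
    patches = []
    ll = np.asarray(img, dtype=float)
    for _ in range(iterations):
        ll, lh, hl, hh = haar_dwt2(ll)
        patches.append({"LL": ll, "LH": lh, "HL": hl, "HH": hh})
    return patches

def patch_ratios(r=0.7):
    """Assumed weight ratio of the four patches for a given r."""
    return {"LL": r * r, "LH": r * (1 - r), "HL": r * (1 - r), "HH": (1 - r) ** 2}

image = np.arange(64, dtype=float).reshape(8, 8)
levels = wavelet_patches(image, iterations=2)
weights = patch_ratios(0.7)
```

In a wavelet SSIM loss, an SSIM term would be computed between the corresponding patches of the two images at each level and combined with these weights; that combination is not shown here because its exact form is not recoverable from the text.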
Initially, before the DWT divides the image, the SSIM loss between the target image and source image is calculated. The total wavelet SSIM ( ) loss is

Total loss function
There are two main parts in the loss function. Given the input target image $I_t$ and its reconstructed image $\hat{I}_t$ (see Equation (3)), the photometric loss measures how well the target image is reconstructed, while the smoothness loss compels the predicted depth map to be smooth. To make the photometric loss effective and meaningful, two assumptions are needed: (1) the scenes are Lambertian; and (2) the scenes are static and unoccluded.
In general, the image photometric loss contains the structural similarity metric (SSIM) [26] and the $L_1$ regularization loss. The wavelet SSIM loss is used to replace the SSIM loss in the photometric loss. Therefore, the image photometric loss is defined as
$$pe(I_t, \hat{I}_t) = \alpha \, L_{\text{W-SSIM}}(I_t, \hat{I}_t) + (1 - \alpha) \, \lVert I_t - \hat{I}_t \rVert_1$$
where we empirically set $\alpha = 0.85$.
When computing the photometric loss from different source images, most previous approaches average the photometric loss over all available source images. However, the second assumption requires that each pixel in the target image is also visible in the source image, and this assumption is easily broken. Moving objects and occlusions inevitably exist in the scene; thus, some pixels are visible in one image but not in the next. This causes inaccurate pixel reconstruction and photometric error. Following the work in [22], the minimum photometric loss at each pixel in the target image is computed instead of the average photometric loss. Note that this method can only reduce the photometric error, not eliminate it. Therefore, the final per-pixel photometric loss is
$$L_p = \min_{s} \, pe(I_t, \hat{I}_{s \to t})$$
where $pe$ denotes the per-pixel photometric error and $\hat{I}_{s \to t}$ is the target image reconstructed from the source image $I_s$.
In addition, the performance of the depth network suffers from the influence of moving objects in the image. These moving pixels should not be involved in computing the photometric loss. Therefore, the binary per-pixel mask $\mu$ in [22] is applied to automatically recognize moving pixels ($\mu = 0$) and static pixels ($\mu = 1$). The mask only includes pixels for which the photometric error of the reconstructed image $\hat{I}_{s \to t}$ is lower than the photometric error between the target image $I_t$ and the source image $I_s$. The mask is defined as
$$\mu = \left[ \min_{s} pe(I_t, I_s) > \min_{s} pe(I_t, \hat{I}_{s \to t}) \right] \qquad (12)$$
where $[\,\cdot\,]$ is the Iverson bracket. The auto-masking photometric loss [22] is
$$L_M = \mu \cdot L_p$$
The second-order gradients of the depth map are used to make the depth map smooth. Because edges and corners in the depth map should be less smooth than flat regions, the gradient of the depth map should be locally smooth rather than fully smooth. Therefore, a Laplacian [23] is applied to automatically perceive the position of each pixel. Different from the method in [23], it is used at every scale instead of a specific scale. The Laplacian template is second-order differencing with four neighborhoods.
It can reinforce object edges and weaken the region of slowly varying intensity. The smoothness loss of this pixel receives a lower weight when the Laplacian is higher. The smoothness loss is defined as follows: where ∇ is the Laplacian operator.
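Therefore, the total loss function is $L = L_M + \lambda L_s$, where $L_M$ is the auto-masking photometric loss, $L_s$ is the smoothness loss, and $\lambda$ weights the smoothness term. The final total loss is averaged per pixel, batch, and scale.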
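The per-pixel minimum and the auto-mask described in this section can be sketched in NumPy as follows. The function name, the array layout (one error map per source image), and the plain mean over all pixels are assumptions for illustration; the photometric error maps are taken as given inputs rather than computed from images.

```python
import numpy as np

def min_reprojection_with_automask(pe_warped, pe_identity):
    """Combine per-source photometric errors as in the auto-masking scheme of [22].

    pe_warped:   (S, H, W) error between the target and each reconstructed target.
    pe_identity: (S, H, W) error between the target and each unwarped source.
    Returns the per-pixel minimum error, the binary mask, and the mean masked loss.
    """
    min_warped = pe_warped.min(axis=0)        # best source image per pixel
    min_identity = pe_identity.min(axis=0)
    # mask = 1 where warping helps (static pixel), 0 otherwise (moving pixel)
    mask = (min_identity > min_warped).astype(float)
    loss = float((mask * min_warped).mean())  # averaged per pixel
    return min_warped, mask, loss

# Two source images, a 1 x 2 target image, hypothetical error values.
pe_warped = np.array([[[0.1, 0.5]], [[0.3, 0.2]]])
pe_identity = np.array([[[0.4, 0.1]], [[0.6, 0.3]]])
min_err, mask, loss = min_reprojection_with_automask(pe_warped, pe_identity)
```

In this toy input the second pixel looks better without warping than with it, so it is treated as moving and excluded from the loss.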

EXPERIMENTS
To evaluate the effectiveness of our approach, qualitative and quantitative results on depth and pose prediction are provided. The KITTI dataset was the main data source used to train and test the depth network. The KITTI odometry split was used to train and test our pose network. Meanwhile, the Make3D dataset was used to evaluate the adaptability and generalization of the proposed network.

Implementation details
The proposed depth network has dense skip connections, which allow it to fully learn deep abstract features. The network was trained from scratch, without pre-trained model weights and without post-processing. The sigmoid output $\sigma$ of the depth network is converted to depth by $D = 1/(a\sigma + b)$, where $a$ and $b$ constrain the depth value $D$ between 0.1 and 100 units. In our experiments, MonoDepth2 [22] was configured with a standard ResNet50 encoder for the monocular depth network and ResNet18 for the pose network, without pre-training. We refer to it as MD2 in the rest of the paper.
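The conversion from the sigmoid output to depth can be illustrated as below. The paper does not give the values of the two constants; this sketch assumes they are chosen so that a sigmoid output of 1 maps to the minimum depth and an output of 0 maps to the maximum depth, and the function name is hypothetical.

```python
def sigmoid_to_depth(sigma, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in [0, 1] to a depth in [min_depth, max_depth].

    Assumes D = 1 / (a * sigma + b) with b = 1 / max_depth and
    a = 1 / min_depth - 1 / max_depth.
    """
    b = 1.0 / max_depth
    a = 1.0 / min_depth - 1.0 / max_depth
    return 1.0 / (a * sigma + b)

far = sigmoid_to_depth(0.0)    # smallest sigmoid output, largest depth
near = sigmoid_to_depth(1.0)   # largest sigmoid output, smallest depth
mid = sigmoid_to_depth(0.5)
```

Because the mapping is linear in inverse depth, the network output behaves like a scaled disparity, which gives finer resolution to nearby structure.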
The deep learning framework PyTorch [27] was used to implement our model. For comparison, the KITTI images were resized and downsampled to 640 × 192. The proposed network was trained for 22 epochs using the Adam [28] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size was set to 4 and the smoothness term was set to 0.001. The learning rate was set to $10^{-4}$ for the first 20 epochs and reduced by a factor of 10 for the remaining epochs. The settings for the pose network were the same as in [22]. In addition, a single NVIDIA GeForce TITAN X with 12 GB of GPU memory was used in our experiments.
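The learning rate schedule stated above can be written as a small step function. The function name and the 0-indexed epoch convention are assumptions of this sketch; in a PyTorch training loop the same behavior would normally come from a step scheduler rather than a hand-written function.

```python
def learning_rate(epoch, base_lr=1e-4, drop_epoch=20, factor=10.0):
    """Step schedule: base_lr for the first `drop_epoch` epochs, then base_lr / factor.

    `epoch` is 0-indexed, so epochs 0..19 use the base rate.
    """
    return base_lr if epoch < drop_epoch else base_lr / factor

# Learning rate for each of the 22 training epochs.
schedule = [learning_rate(epoch) for epoch in range(22)]
```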

Evaluation metrics
To evaluate our method, we used some standard evaluation metrics, as shown in Table 1.
$|I|$ is the number of pixels in image $I$. $d$ is the depth predicted by the model, and $d^*$ is the depth ground truth. $thr$ represents the threshold between the depth ground truth and the predicted depth, which is set to 1.25, $1.25^2$, and $1.25^3$, respectively.
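Table 1 is not reproduced here, so the following sketch assumes it lists the metrics commonly used for this benchmark: absolute relative error, squared relative error, RMSE, RMSE of log depth, and the three threshold accuracies. The function name and the flat-list input format are illustrative choices.

```python
import math

def depth_metrics(pred, gt):
    """Compute common depth metrics for paired lists of positive depths."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    # Threshold accuracy: fraction of pixels with max(p/g, g/p) below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    deltas = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return {
        "abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "rmse_log": rmse_log,
        "delta1": deltas[0], "delta2": deltas[1], "delta3": deltas[2],
    }

perfect = depth_metrics([2.0, 4.0], [2.0, 4.0])
off = depth_metrics([1.5], [1.0])
```

The four error metrics are lower-is-better and the three threshold accuracies are higher-is-better.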

KITTI Eigen split
The KITTI Eigen split [16] was used to train the proposed network. Before training, Zhou's [16] preprocessing was used to remove static images. As a result, the training dataset had 39,810 monocular triplets covering 29 different scenes. The validation dataset had 4424 images, and there were 697 test images. The depth ground truth of the KITTI dataset was captured by a Velodyne laser. Following the work in [22], the same intrinsics were used for all images: the principal point of the camera was set to the image center, and the focal length was defined as the average of all focal lengths in the KITTI dataset. In addition, the predicted depth results were obtained by using the per-image median ground truth scaling proposed in [16]. When the results were evaluated, the maximum depth value was set to 80 m and the minimum to 0.1 m.

Figure 5. Qualitative results on the KITTI Eigen split. The results are compared with some existing unsupervised methods. Rows, from top to bottom: Input, MonoDepth [12], Zhou et al. [16], DDVO [23], GeoNet [18], Zhan et al. [13], EPC++(M) [19], MD2 [22], Ours.

Figure 5 shows some visual examples of predicted depth maps. Our proposed model in the last row generates higher quality depth maps and clearer object edges than the other models. Quantitative results are provided in Table 2. The evaluation metrics are defined in Table 1. For the first four indices, lower scores are better. For the last three indices, higher scores are better. In Table 2, all results are shown without post-processing [12]. The last row is the result of our proposed method. The accuracy of depth prediction is improved compared with other methods trained on monocular images, which demonstrates that the proposed method is effective. Generally, fewer input images in the pose network have a negative impact on the accuracy of the depth network. Even though only three frames are used to train the pose network at a time, our depth prediction results still outperform the other methods. Note that some methods in Table 2 [18,19,24] were trained with multiple tasks.
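The per-image median scaling and the depth range used at evaluation time can be sketched as follows. The function name and the flat-list inputs are illustrative; treating ground truth values outside the 0.1 m to 80 m range (including zeros where no laser return exists) as invalid is an assumption based on common practice for this benchmark.

```python
from statistics import median

def median_scale(pred, gt, min_depth=0.1, max_depth=80.0):
    """Scale predictions by median(gt) / median(pred) over valid pixels.

    Returns the scaled and clipped predictions, the valid ground truth
    values, and the scale factor.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if min_depth < g < max_depth]
    scale = median(g for _, g in valid) / median(p for p, _ in valid)
    scaled = [min(max(p * scale, min_depth), max_depth) for p, _ in valid]
    return scaled, [g for _, g in valid], scale

# The last pixel has no ground truth (0.0) and is ignored.
scaled, gt_valid, scale = median_scale([1.0, 2.0, 3.0, 5.0], [2.0, 4.0, 6.0, 0.0])
```

This scaling is needed because a network trained on monocular video predicts depth only up to an unknown scale factor.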

Make3D dataset
The scenes in the Make3D dataset differ from those in the KITTI dataset. Therefore, the Make3D dataset is often used to evaluate the adaptability of a network model. Our depth model trained on the KITTI dataset was tested on the Make3D dataset to evaluate its adaptability. The qualitative results are shown in Figure 6. The second column is the depth ground truth. Compared with MD2 [22], the visual results of our model capture the global scene information and more object details. It can be seen that our method is useful and has great scene adaptability.

Table 3 shows the depth prediction results for different components of the proposed method. "Basic" is the MD2 mentioned above. The results show the contribution of each proposed term to the overall performance. They indicate that the discrete wavelet transform (DWT) can recover a high-quality clear image and improve the accuracy of depth prediction. The accuracy of depth prediction for both single-scale and multi-scale supervision is shown. Compared with the multi-scale method, the single-scale method gives better results. The reason is hypothesized to be that the low-resolution image has over-smoothed pixel colors, which can easily cause inaccurate photometric loss.

Table 5. Absolute trajectory error (mean ± standard deviation, in meters) on Sequences 09 and 10 of the KITTI odometry split.

| Method | Sequence 09 | Sequence 10 | Frames |
|---|---|---|---|
| ORB-SLAM [33] | 0.014 ± 0.008 | 0.012 ± 0.011 | - |
| DDVO [26] | 0.045 ± 0.108 | 0.033 ± 0.074 | 3 |
| Zhou* [16] | 0.05 ± 0.039 | 0.034 ± 0.028 | 5→2 |
| Mahjourian [30] | 0.013 ± 0.010 | 0.012 ± 0.011 | 3 |
| GeoNet [18] | 0.012 ± 0.007 | 0.012 ± 0.009 | 5 |
| EPC++(M) [19] | 0.013 ± 0.007 | 0.012 ± 0.008 | 3 |
| Ranjan [24] | 0.012 ± 0.007 | 0.012 ± 0.008 | 5 |
| MD2(M) | 0.018 ± 0.009 | 0.015 ± 0.010 | 2 |
| Ours | 0.017 ± 0.010 | 0.015 ± 0.010 | 2 |

Network capacity
To show that our proposed network can improve accuracy without increasing network capacity, the number of network parameters and the number of floating-point operations (FLOPs) were computed to evaluate the capacity of the proposed network. The quantitative results are shown in Table 4. For a fair comparison, the pose networks of both MD2 and our method were set to ResNet50. Note that ResNet50 serves as our pose network only for this comparison; the pose network adopted in the proposed overall framework is still ResNet18. Compared with MD2, our proposed method improves the accuracy of the depth network without adding extra computational burden, as expected.

Pose estimation
Our pose model was evaluated on the standard KITTI odometry split [16]. This dataset includes 11 driving sequences. Sequences 00-08 were used to train our pose network without using pose ground truth, while Sequences 09 and 10 were used to evaluate our pose model. The average absolute trajectory error with standard deviation (in meters) was used as the evaluation metric. Godard's [22] handling strategy was followed to evaluate the result of the two-frame model on the five-frame snippets. Because Godard's [22] pose estimation results for this configuration (M, ResNet50 for the depth network, and ResNet18 for the pose network) are not provided, we retrained the model and report the retrained result (MD2).
Only two adjacent frames were taken in our pose model at a time, as shown in Table 5. The output was the relative 6-DoF pose between images. Even though our pose network structure is the same as that in MD2, our pose model obtains better performance than MD2. In addition, the results are comparable to other previous methods. Thus, it is observed that the proposed depth network has a positive effect on pose network.
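As an illustration of the evaluation metric, the sketch below computes an absolute trajectory error for one short snippet. The paper does not describe the alignment procedure, so the steps used here (aligning the first frame, fitting a single least-squares scale factor, then taking the RMSE of the position error) are assumptions modeled on common practice for monocular methods, and the function name is hypothetical.

```python
import numpy as np

def snippet_ate(gt_xyz, pred_xyz):
    """Absolute trajectory error of one snippet after scale alignment.

    gt_xyz, pred_xyz: (N, 3) camera positions. Assumes the predicted
    trajectory is not identically zero after first-frame alignment.
    """
    gt = np.asarray(gt_xyz, dtype=float)
    pred = np.asarray(pred_xyz, dtype=float)
    pred = pred + (gt[0] - pred[0])                  # align the first frame
    scale = np.sum(gt * pred) / np.sum(pred ** 2)    # least-squares scale
    err = pred * scale - gt
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))

gt = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
half_scale = [[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]]
drifting = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 1.0, 0.0]]

ate_scaled = snippet_ate(gt, half_scale)   # scale ambiguity alone gives no error
ate_drift = snippet_ate(gt, drifting)      # lateral drift gives a positive error
```

The mean and standard deviation of this value over all snippets of a sequence would then give entries in the format reported in Table 5.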

CONCLUSIONS
A versatile end-to-end unsupervised learning framework for monocular depth and pose estimation is developed and evaluated on the KITTI and Make3D datasets in this paper. Aggregated residual transformations (ResNeXt) are embedded in the depth network to extract the input image's high-dimensional features. In addition, the proposed wavelet SSIM loss is based on the 2D discrete wavelet transform (DWT). Patches with different frequencies computed by the DWT are used as the input to the SSIM loss to help the network converge, which can recover high-quality clear image patches. The evaluation results show that the performance of depth prediction is improved while the computational burden is reduced. In addition, the proposed method adapts well to the Make3D dataset and can decrease the domain gap between different datasets. In future work, further optimization of the whole system will be considered.

Authors' contributions
Made substantial contributions to conception and design of the study and performed data analysis, data acquisition and interpretation: Li B
Provided administrative, technical guidance and material support: Zhang H, Wang Z, Hu L