PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage

ACCV 2024

Denis Zavadski*, Damjan Kalšan*, Carsten Rother
Computer Vision and Learning Lab, IWR, Heidelberg University
Teaser image demonstrating PrimeDepth depth estimation.

Overview

We present PrimeDepth, a method that utilises pre-trained generative models for downstream tasks such as monocular depth estimation. We exploit the final image representation of Stable Diffusion and propose a simple method with a natural architectural bias towards processing it. PrimeDepth is pre-trained on unlabelled data and subsequently trained on a small set of synthetic labelled data, all while leaving Stable Diffusion untouched. As a result, we estimate monocular depth in a single step, yielding competitive and detailed predictions. The complementary nature of PrimeDepth and the data-driven approach Depth Anything shows in the pixel-wise average of the depth estimates of both methods, which sets a new state of the art in zero-shot monocular depth estimation.

The gallery below presents images sourced from the internet and the corresponding depth estimates of our PrimeDepth in comparison to the current state-of-the-art approaches Depth Anything and Marigold. Utilise the slider and gestures to reveal details on both sides. All results are generated with the default parameters of the corresponding model. The diffusion-based approach Marigold requires 50 denoising steps and ensembles over 10 predictions; in contrast, PrimeDepth uses a single step and no ensembling, allowing for over 100x faster prediction.

Method

Utilising the Generative Preimage

PrimeDepth Preimage Extraction

Instead of the RGB input image, we use the image representation extracted from the last denoising step of Stable Diffusion, the so-called Stable Diffusion preimage, as it contains a richer feature composition than the RGB image itself. The preimage comprises all the intermediate, multi-scale feature maps as well as the cross- and self-attention maps of every neural block. After each block, the extracted feature and attention maps are fused and kept at their respective resolution, as illustrated in the figure above.
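The following is a minimal PyTorch sketch of how such a preimage could be collected with forward hooks, assuming a diffusers-style Stable Diffusion UNet. The PreimageExtractor class and the choice of tapped modules are illustrative assumptions, not the exact PrimeDepth implementation; in particular, hooking the outputs of attention blocks is a simplification of extracting the attention maps themselves.

```python
import torch
import torch.nn as nn

class PreimageExtractor:
    """Captures multi-scale feature and attention-block outputs of a UNet."""

    def __init__(self, unet: nn.Module):
        self.maps = []  # (module name, tensor) pairs from one forward pass
        self._handles = []
        for name, module in unet.named_modules():
            # Tap every residual and attention block; this module selection
            # is an assumption, not the exact PrimeDepth choice.
            if type(module).__name__ in ("ResnetBlock2D", "Attention", "CrossAttention"):
                self._handles.append(module.register_forward_hook(self._hook(name)))

    def _hook(self, name):
        def fn(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            self.maps.append((name, out.detach()))
        return fn

    @torch.no_grad()
    def extract(self, unet, latent, timestep, text_emb):
        """Run the final denoising step and return the captured side
        outputs, each kept at its native resolution."""
        self.maps.clear()
        unet(latent, timestep, encoder_hidden_states=text_emb)
        return list(self.maps)

    def remove(self):
        for handle in self._handles:
            handle.remove()
```

In the method itself, the captured maps of each block are additionally fused per resolution before being handed to the refiner; how that fusion is parameterised is not assumed here.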

Architectural Bias for Preimage Integration

PrimeDepth Network Design

To exploit the full potential of the preimage, the Preimage Refiner mirrors the architecture of the Stable Diffusion decoder at a smaller size. Similar to U-Net skip connections, the refiner receives the fused preimage parts at its respective blocks via concatenation. The mirrored architecture and the successive incorporation of the preimage at the matching blocks yield a natural architectural bias for processing the preimage.
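Below is a hedged sketch of what one refiner block could look like under this design: a decoder-style block at reduced width that receives the fused preimage part of matching resolution via concatenation. The RefinerBlock name, channel sizes and normalisation are illustrative assumptions, not the exact PrimeDepth configuration.

```python
import torch
import torch.nn as nn

class RefinerBlock(nn.Module):
    """One illustrative Preimage Refiner block: a mirrored Stable Diffusion
    decoder block at reduced width. All hyperparameters are assumptions."""

    def __init__(self, in_ch: int, preimage_ch: int, out_ch: int):
        super().__init__()
        # Fuse the incoming features with the preimage part of this
        # resolution, concatenated as in a U-Net skip connection.
        self.fuse = nn.Conv2d(in_ch + preimage_ch, out_ch, kernel_size=1)
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor, preimage_part: torch.Tensor) -> torch.Tensor:
        x = self.fuse(torch.cat([x, preimage_part], dim=1))
        return self.up(self.body(x) + x)  # residual, then move up one scale
```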

Training Protocol

To avoid the risk of catastrophic forgetting and of altering the image representation, we keep the weights of Stable Diffusion frozen. The refiner, the depth branch and the segmentation branch are trained simultaneously.
Since the refiner has to be initialised randomly, we pre-train it on pseudo-label pairs gathered by running Depth Anything on unlabelled images. After pre-training, we train on labelled synthetic data composed of Hypersim and Virtual KITTI.
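A minimal sketch of this setup is given below: Stable Diffusion is frozen and a single optimiser updates the refiner together with both task heads. The function names, the learning rate and the loss interfaces are assumptions for illustration, not taken from the paper.

```python
import itertools
import torch

def build_optimizer(unet, refiner, depth_head, seg_head, lr=1e-4):
    """Freeze Stable Diffusion and optimise only the refiner and the two
    task branches. The learning rate is an assumed value."""
    for p in unet.parameters():
        p.requires_grad_(False)  # leave the image representation untouched
    unet.eval()
    trainable = itertools.chain(
        refiner.parameters(), depth_head.parameters(), seg_head.parameters()
    )
    return torch.optim.AdamW(trainable, lr=lr)

def train_step(optimizer, refiner, depth_head, seg_head,
               preimage, depth_gt, seg_gt, depth_loss, seg_loss):
    """One joint update of refiner, depth branch and segmentation branch.
    The loss functions are passed in; the paper's exact losses are not
    assumed. In stage one, depth_gt holds Depth Anything pseudo labels;
    in stage two, ground truth from Hypersim and Virtual KITTI."""
    feats = refiner(preimage)
    loss = depth_loss(depth_head(feats), depth_gt) + seg_loss(seg_head(feats), seg_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```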

Comparison with Other Methods

Quantitative comparison of PrimeDepth with depth estimators on several zero-shot benchmarks. Bold numbers are the best, underscored second best. Our method is either the best or second best throughout and achieves the best overall rank, despite only little training on purely synthetic datasets.

Comparison with other methods

Comparison to State-of-the-Art

Quantitative comparison of PrimeDepth with state-of-the-art depth estimators on several zero-shot benchmarks including difficult scenarios using a standardised evaluation protocol. Bold numbers are the best, underscored second best.
Unlike Marigold, our method remains stable under difficult conditions (rabbitai, nuScenes-C), arguably because of the unchanged Stable Diffusion representation. Overall, Depth Anything, which uses 1.5M labelled training images versus 74K for Marigold and PrimeDepth, performs best, with PrimeDepth being the clear runner-up. Despite the pre-training on Depth Anything pseudo labels, the predictions of PrimeDepth and Depth Anything are complementary. To demonstrate this, we show results of a simple pixel-wise average combining the predictions of Depth Anything and PrimeDepth. It improves over both individual methods, meaning that the two signals are not fully correlated, and sets a new state of the art in zero-shot monocular depth estimation.
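As a rough illustration of this combination, the snippet below averages two affine-invariant depth maps pixel-wise. Aligning each map to zero median and unit scale before averaging is our assumption of a sensible normalisation; the paper reports a simple pixel-wise average.

```python
import torch

def average_predictions(depth_a: torch.Tensor, depth_b: torch.Tensor) -> torch.Tensor:
    """Pixel-wise average of two affine-invariant depth predictions,
    e.g. from Depth Anything and PrimeDepth for the same image."""
    def normalise(d: torch.Tensor) -> torch.Tensor:
        # Assumed alignment: zero median, unit mean absolute deviation.
        med = d.median()
        scale = (d - med).abs().mean().clamp(min=1e-6)
        return (d - med) / scale
    return 0.5 * (normalise(depth_a) + normalise(depth_b))
```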

Comparison with state-of-the-art

For additional qualitative and quantitative results, as well as ablation studies, please refer to the PDF paper linked at the top of this page.

Citation

@misc{zavadski2024primedepth,
      title={PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage}, 
      author={Denis Zavadski and Damjan Kalšan and Carsten Rother},
      year={2024},
      eprint={2409.09144},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}