DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

¹Tsinghua University, ²ARC Lab, Tencent PCG
ICCV 2025


DepthSync is a framework that uses diffusion guidance to perform cross-window depth scale synchronization and intra-window geometry alignment, producing depth predictions for long videos with improved scale and geometry consistency.

Abstract

Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions.
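
As a rough illustration of the sliding-window setup described above, the following Python sketch splits a video into overlapping windows, fits one scale factor on an overlap, and shows how a small per-window scale error compounds along a chain of windows. The function names, the window and overlap sizes, and the closed-form least-squares scale fit are assumptions made for illustration; they are not taken from the DepthSync implementation or from any specific baseline.

```python
def split_into_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges, end exclusive, that cover the video."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


def fit_scale(prev_overlap, cur_overlap):
    """Least-squares scale s that minimizes sum((prev - s * cur) ** 2).

    Both arguments are flat lists of depth values for the same overlapping
    frames, as predicted by the previous and the current window.
    """
    denom = sum(c * c for c in cur_overlap)
    if denom == 0.0:
        raise ValueError("current overlap depths are all zero")
    return sum(p * c for p, c in zip(prev_overlap, cur_overlap)) / denom


def compounded_drift(per_window_error, num_windows):
    """Relative scale of the last window if every pairwise fit is off by the
    same relative error and the errors multiply along the chain."""
    return (1.0 + per_window_error) ** (num_windows - 1)


if __name__ == "__main__":
    print(split_into_windows(num_frames=10, window=4, overlap=2))
    print(fit_scale([2.0, 4.0], [1.0, 2.0]))
    print(compounded_drift(0.01, 11))
```

Under these assumptions, a 1% error in each pairwise scale estimate grows to roughly 10% after ten window transitions, which is the kind of accumulation the paragraph above refers to.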

In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
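
To make the idea of a guidance term concrete, here is a minimal Python sketch of a single guided update that pulls the current window's depths on the overlapping frames toward the depths already predicted by the previous window. The mean-squared-error loss, the `weight` parameter, and all function names are hypothetical stand-ins; the page does not specify the actual scale or geometry losses, and in a real diffusion pipeline the gradient would typically be propagated to the noisy latent rather than applied directly to depth values.

```python
def scale_loss(cur_overlap, ref_overlap):
    """Mean squared difference between the two windows on shared frames
    (a hypothetical stand-in for a scale-consistency loss)."""
    n = len(cur_overlap)
    return sum((c - r) ** 2 for c, r in zip(cur_overlap, ref_overlap)) / n


def scale_loss_grad(cur_overlap, ref_overlap):
    """Analytic gradient of scale_loss with respect to cur_overlap."""
    n = len(cur_overlap)
    return [2.0 * (c - r) / n for c, r in zip(cur_overlap, ref_overlap)]


def guided_step(cur_overlap, ref_overlap, weight):
    """Move the current estimate one gradient step down the loss."""
    grad = scale_loss_grad(cur_overlap, ref_overlap)
    return [c - weight * g for c, g in zip(cur_overlap, grad)]


if __name__ == "__main__":
    ref = [1.0, 2.0]   # depths from the previous window on the overlap
    cur = [2.0, 4.0]   # current window's estimate on the same frames
    updated = guided_step(cur, ref, weight=0.5)
    print(scale_loss(cur, ref), scale_loss(updated, ref))
```

Repeating such a step at each denoising iteration is one way guidance can steer sampling without retraining the model, which is consistent with the training-free framing above.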


Pipeline. Our DepthSync framework predicts depth for a monocular video by applying guidance to a pre-trained diffusion-based depth estimation model, ensuring scale- and geometry-consistent depth predictions. Following common practice, the input video is split into overlapping windows that are processed sequentially. During denoising, we derive depths from the noise prediction, then apply geometry guidance to align frames within each window using geometric constraints and scale guidance to synchronize depth scales across windows.
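
The step of deriving depths from the noise prediction can be sketched as follows, assuming a DDPM-style epsilon parameterization in which the noisy sample is `sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps`. This parameterization, the `alpha_bar_t` argument, and the function name are assumptions for illustration; the pre-trained model used by DepthSync may use a different formulation, and decoding the clean latent into a depth map is omitted here.

```python
import math


def predict_clean(x_t, eps_hat, alpha_bar_t):
    """Estimate the clean sample x0 from a noisy sample and predicted noise.

    Inverts x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    elementwise over flat lists of values.
    """
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise * e) / signal for x, e in zip(x_t, eps_hat)]


if __name__ == "__main__":
    x0 = [1.0, -2.0]
    eps = [0.5, 0.25]
    alpha_bar = 0.64
    # Forward-noise x0, then recover it with a perfect noise prediction.
    x_t = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * b
           for a, b in zip(x0, eps)]
    print(predict_clean(x_t, eps, alpha_bar))
```

Guidance losses are usually evaluated on this clean estimate because it is available at every denoising step, whereas the noisy sample itself is not directly interpretable as depth. For models trained with v-prediction or an EDM-style formulation, the conversion to a clean estimate takes a different form.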

BibTeX

@article{dong2025depthsync,
  title={DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation},
  author={Dong, Yue-Jiang and Zhao, Wang and Xu, Jiale and Shan, Ying and Zhang, Song-Hai},
  journal   = {ICCV},
  year      = {2025},
}