Transparent objects remain notoriously hard for perception systems: refraction, reflection, and transmission break the assumptions behind stereo, time-of-flight (ToF), and purely discriminative monocular depth estimation, causing holes and temporally unstable depth. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the underlying optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes comprising 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets, paired with glass, plastic, and metal materials. We render RGB, depth, and normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for input videos of arbitrary length. The resulting model, DKT, achieves zero-shot state-of-the-art results on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image and video baselines, and a normal-estimation variant achieves the best video normal estimation results on ClearPose. A compact 1.3B-parameter version runs at ~0.17 s per frame. Integrated into a grasping stack, DKT's depth improves grasp success rates across translucent, reflective, and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and without real-world labels, into robust, temporally coherent perception for challenging real-world manipulation.
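To make the conditioning scheme concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code) of the latent-concatenation idea described above: clean RGB video latents and noisy depth latents are concatenated along the channel axis and passed through a toy transformer block whose linear projections carry LoRA adapters, with the base weights frozen so only the low-rank adapters train. All shapes, sizes, and module names (`LoRALinear`, `ToyDiTBlock`) are illustrative assumptions, not the DKT architecture.

```python
# Minimal sketch of RGB + noisy-depth latent concatenation with LoRA adapters.
# Everything here (shapes, module names, the toy transformer block) is an assumption
# made for illustration; it is not the authors' implementation of DKT.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen; only the adapter trains
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a zero perturbation
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale


class ToyDiTBlock(nn.Module):
    """A single pre-norm attention + MLP block standing in for the DiT backbone."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            LoRALinear(nn.Linear(dim, 4 * dim)), nn.GELU(),
            LoRALinear(nn.Linear(4 * dim, dim)),
        )

    def forward(self, tokens):
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return tokens + self.mlp(self.norm2(tokens))


# Toy latent shapes: (batch, frames, tokens, channels) after a VAE / patchifier.
B, F, N, C = 1, 8, 64, 128
rgb_latents = torch.randn(B, F, N, C)          # clean RGB video latents (conditioning)
noisy_depth_latents = torch.randn(B, F, N, C)  # depth latents with diffusion noise added

# Channel-wise concatenation of the two streams, then flatten frames x tokens into one
# sequence so the block can attend jointly across space and time.
x = torch.cat([rgb_latents, noisy_depth_latents], dim=-1)  # (B, F, N, 2C)
x = x.reshape(B, F * N, 2 * C)

proj_in = LoRALinear(nn.Linear(2 * C, 256))  # input projection widened to accept 2C channels
block = ToyDiTBlock(dim=256)
proj_out = nn.Linear(256, C)                 # predicts the denoised depth latent

pred_depth_latents = proj_out(block(proj_in(x))).reshape(B, F, N, C)
print(pred_depth_latents.shape)  # torch.Size([1, 8, 64, 128])
```

In this sketch, adapting to the extra depth channels only requires widening the input projection and training the low-rank adapters, which is why the approach stays lightweight relative to fine-tuning the full video diffusion backbone.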