We present a novel method for precise 3D object localization from single images taken by a single calibrated camera, trained using only 2D labels. Instead of expensive 3D labels, the model is trained with easy-to-annotate 2D labels together with physical knowledge of the object's motion. From this information, the model learns to infer the latent third dimension, even though it never sees 3D ground truth during training. We evaluate the method on both synthetic and real-world datasets and achieve a mean distance error of just 6 cm on real data. The results indicate the method's potential as a step towards learning 3D object location estimation in settings where collecting 3D training data is not feasible.
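The core intuition, that 2D observations plus known physics can pin down depth, can be illustrated with a toy example. The sketch below is not the method described above, which trains a model. It is a closed-form least-squares fit under assumptions chosen purely for illustration: an ideal pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a ballistic object whose only acceleration is a known gravity vector expressed in the camera frame, and noise-free 2D labels. The function names `project` and `fit_ballistic_3d` are likewise hypothetical.

```python
import numpy as np


def project(points, fx, fy, cx, cy):
    """Pinhole projection of N x 3 camera-frame points to N x 2 pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)


def fit_ballistic_3d(pixels, times, gravity, fx, fy, cx, cy):
    """Recover initial 3D position and velocity from 2D labels only.

    Assumed motion model: p(t) = p0 + v0 * t + 0.5 * gravity * t**2.
    Each 2D label gives two constraints, a * Z(t) = X(t) and b * Z(t) = Y(t),
    which are linear in the unknowns [x0, y0, z0, vx, vy, vz].
    The known gravity term supplies the metric scale.
    """
    a = (pixels[:, 0] - cx) / fx  # normalized image x, equals X / Z
    b = (pixels[:, 1] - cy) / fy  # normalized image y, equals Y / Z
    t = times
    zeros, ones = np.zeros_like(t), np.ones_like(t)

    # Rows of the linear system for the x and y constraints.
    rows_x = np.stack([-ones, zeros, a, -t, zeros, a * t], axis=1)
    rows_y = np.stack([zeros, -ones, b, zeros, -t, b * t], axis=1)

    # Right-hand side collects the known gravity contribution.
    half_t2 = 0.5 * t**2
    rhs_x = half_t2 * (gravity[0] - a * gravity[2])
    rhs_y = half_t2 * (gravity[1] - b * gravity[2])

    A = np.concatenate([rows_x, rows_y], axis=0)
    rhs = np.concatenate([rhs_x, rhs_y], axis=0)
    theta, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return theta[:3], theta[3:]


# Hypothetical setup: camera y-axis points down, so gravity is +y in the camera frame.
fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0
gravity = np.array([0.0, 9.81, 0.0])
p0_true = np.array([-1.0, -0.5, 5.0])
v0_true = np.array([1.0, -3.0, 0.5])

# Simulate the trajectory and keep only its 2D projections as "labels".
times = np.linspace(0.0, 0.6, 10)
traj = p0_true + np.outer(times, v0_true) + 0.5 * np.outer(times**2, gravity)
labels_2d = project(traj, fx, fy, cx, cy)

p0_est, v0_est = fit_ballistic_3d(labels_2d, times, gravity, fx, fy, cx, cy)
print("estimated initial position:", p0_est)
print("estimated initial velocity:", v0_est)
```

Without the gravity term the right-hand side would be zero and the trajectory could only be recovered up to scale. The known physical acceleration is what makes the depth identifiable in this toy setting. Real 2D labels are noisy, so a practical system would need a more robust formulation than this direct solve.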