We introduce GazeD, a new method that jointly estimates 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to handle uncertainty, it generates multiple plausible 3D gaze and pose hypotheses from the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD, we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that gaze is usually closely related to pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at https://aimagelab.ing.unimore.it/go/gazed.
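To make the gaze-as-joint representation concrete, the following is a minimal sketch, assuming the gaze joint is obtained by offsetting the midpoint of the eyes by a fixed distance along the unit gaze direction. All names here (`gaze_to_joint`, `GAZE_JOINT_DIST`, the 17-joint pose layout) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the gaze-as-extra-joint representation (assumed, not the paper's code).
import numpy as np

GAZE_JOINT_DIST = 1.0  # fixed distance from the eyes; the actual value is an assumption


def gaze_to_joint(eye_midpoint: np.ndarray, gaze_dir: np.ndarray) -> np.ndarray:
    """Place the gaze as an extra 3D 'joint' at a fixed distance
    from the eyes along the unit-normalized gaze direction."""
    unit = gaze_dir / np.linalg.norm(gaze_dir)
    return eye_midpoint + GAZE_JOINT_DIST * unit


def joint_to_gaze(eye_midpoint: np.ndarray, gaze_joint: np.ndarray) -> np.ndarray:
    """Recover the unit gaze direction from a (denoised) gaze joint."""
    offset = gaze_joint - eye_midpoint
    return offset / np.linalg.norm(offset)


# Example: append the gaze joint to the pose so both can be denoised together.
pose = np.random.randn(17, 3)                  # 17 body joints (assumed layout)
eye_mid = pose[0]                              # assume joint 0 sits near the eyes
gaze_joint = gaze_to_joint(eye_mid, np.array([0.0, 0.0, 1.0]))
extended_pose = np.vstack([pose, gaze_joint])  # shape (18, 3): pose + gaze joint
recovered_dir = joint_to_gaze(eye_mid, extended_pose[-1])
```

Treating the gaze point as one more row of the pose tensor lets a single diffusion model denoise gaze and body joints in the same coordinate space, which is the stated rationale for the representation.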