We introduce GazeD, a new method that jointly estimates 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to handle uncertainty, it generates multiple plausible 3D gaze and pose hypotheses from the 2D context extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel representation of 3D gaze, positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that gaze is usually closely related to pose, and thus benefits from being jointly denoised during the diffusion process. Evaluations on three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at https://aimagelab.ing.unimore.it/go/gazed.