Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Mass prediction from pixels is therefore ill-posed and benefits from physically meaningful representations that constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume, and we extract coarse material semantics with a vision-language model to guide density-related reasoning. The geometric, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors, one volume-related and one density-related, are predicted by separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
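
The fusion and two-head design described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything the abstract leaves unspecified is an assumption made for illustration: the softmax gate over the three feature streams, the linear heads, the log-space composition log m = log V + log ρ (which follows from m = ρV), and the squared log-mass loss. All names (`GatedMassSketch`, `mass_only_loss`, `W_gate`, and so on) are hypothetical, and the random vectors stand in for the real depth, vision-language, and appearance encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class GatedMassSketch:
    """Hypothetical gated fusion with volume- and density-related heads."""

    def __init__(self, dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed gate: one linear map from the concatenated streams to 3 logits.
        self.W_gate = rng.normal(scale=0.1, size=(3 * dim, 3))
        # Assumed heads: one linear regressor per latent factor.
        self.w_vol = rng.normal(scale=0.1, size=dim)
        self.w_rho = rng.normal(scale=0.1, size=dim)

    def forward(self, geom, sem, app):
        # geom, sem, app: (batch, dim) feature vectors for each stream.
        feats = np.stack([geom, sem, app], axis=1)        # (B, 3, D)
        gate_in = feats.reshape(feats.shape[0], -1)       # (B, 3D)
        # Per-instance weights over the three streams; each row sums to 1.
        gates = softmax(gate_in @ self.W_gate, axis=-1)   # (B, 3)
        fused = (gates[:, :, None] * feats).sum(axis=1)   # (B, D)
        log_vol = fused @ self.w_vol                      # volume-related factor
        log_rho = fused @ self.w_rho                      # density-related factor
        # Assumption: the factors combine as m = rho * V, i.e. a sum in log space.
        log_mass = log_vol + log_rho
        return log_mass, log_vol, log_rho, gates


def mass_only_loss(log_mass_pred, mass_true):
    """Squared error in log space; only the mass label is used (assumed loss)."""
    return np.mean((log_mass_pred - np.log(mass_true)) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    geom, sem, app = (rng.normal(size=(4, 16)) for _ in range(3))
    model = GatedMassSketch(dim=16)
    log_mass, log_vol, log_rho, gates = model.forward(geom, sem, app)
    mass_true = np.array([0.5, 1.2, 3.0, 0.08])  # made-up masses in kg
    print("gates per instance:\n", gates)
    print("loss:", mass_only_loss(log_mass, mass_true))
```

Because the loss sees only the product of the two factors, nothing in this sketch forces `log_vol` and `log_rho` to match true volume and density individually. In the proposed framework that role falls to the physically guided inputs (depth-derived geometry for volume, material semantics for density), which this toy example does not model.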