Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Mass prediction from pixels is therefore ill-posed and benefits from physically meaningful representations that constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume estimation, and we extract coarse material semantics using a vision-language model to guide density-related reasoning. The resulting geometric, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (one volume-related, one density-related) are predicted by separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms existing state-of-the-art methods.
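To make the fusion and two-head design concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the three representations are vectors of equal dimension, that the gate is a softmax over per-branch scores computed from the concatenated features, and that the two latent factors are combined additively in log space (since mass = volume × density implies log m = log V + log ρ). The abstract does not specify any of these details; all names (`gated_fusion`, `predict_log_mass`, `w_volume`, `w_density`) and all numeric values are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, vector):
    """Inner product of two equal-length lists."""
    return sum(w * v for w, v in zip(weights, vector))


def gated_fusion(features, gate_weights):
    """Fuse per-branch feature vectors with instance-dependent gates.

    Assumption: one scalar gate per branch, computed from the
    concatenation of all branch features for this particular image.
    """
    names = sorted(features)
    concat = [value for name in names for value in features[name]]
    gates = softmax([dot(gate_weights[name], concat) for name in names])
    dim = len(features[names[0]])
    fused = [
        sum(g * features[name][i] for g, name in zip(gates, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, gates))


def predict_log_mass(fused, w_volume, w_density):
    """Two separate linear heads; only their sum is supervised.

    Assumption: the heads output log-volume-like and log-density-like
    latent factors, combined as log m = log V + log rho.
    """
    log_volume = dot(w_volume, fused)
    log_density = dot(w_density, fused)
    return log_volume + log_density, log_volume, log_density


# Toy example with random (untrained) parameters.
random.seed(0)
d = 4
features = {
    name: [random.uniform(-1, 1) for _ in range(d)]
    for name in ("geometry", "semantics", "appearance")
}
gate_weights = {
    name: [random.uniform(-1, 1) for _ in range(3 * d)] for name in features
}
w_volume = [random.uniform(-1, 1) for _ in range(d)]
w_density = [random.uniform(-1, 1) for _ in range(d)]

fused, gates = gated_fusion(features, gate_weights)
log_mass, log_volume, log_density = predict_log_mass(fused, w_volume, w_density)

# Mass-only supervision: the loss touches the combined prediction only.
true_mass_kg = 0.35  # hypothetical ground-truth label
loss = (log_mass - math.log(true_mass_kg)) ** 2
```

Because the loss is computed only on the combined prediction, neither head receives a direct volume or density label in this sketch; any physical meaning of the two factors would have to come from the inputs routed to each head and from the additive structure itself.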