We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods rely largely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observation. To reflect real-world conditions, explicit physical reference objects and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving a MAPE of 0.21 in volume estimation and an L1 Chamfer Distance of 5.7 in geometric accuracy.
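For readers unfamiliar with the two reported metrics, the equations below give the definitions commonly used for volume MAPE and L1 Chamfer Distance. These are standard formulations assumed here for illustration; the benchmark's exact evaluation protocol (e.g., point-sampling density and any normalization of the Chamfer Distance) may differ.

% Assumed standard definitions, not necessarily the benchmark's official protocol.
% Volume MAPE over N food items, with predicted volume \hat{V}_i and ground-truth volume V_i:
\[
\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{V}_i - V_i \rvert}{V_i}
\]
% L1 Chamfer Distance between a predicted point set P and a ground-truth point set Q,
% matching each point to its nearest neighbor under the L1 norm:
\[
d_{\mathrm{CD}}(P, Q) = \frac{1}{\lvert P \rvert} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_1
 + \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_1
\]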