Food image segmentation is a critical task in dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables & Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset, and then assess them, alone and combined with video-memory modules, on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best model, which combines SeTR-MLA with XMem2, outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. We release BenchSeg to foster future research. The project page, including the dataset annotations and the food segmentation models, can be found at https://amughrabi.github.io/benchseg.