Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples

Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure from an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation framework with tracking-by-matching, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate into stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. Conventional image-based metrics largely overlook these failures and therefore substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than in model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally aware learning and evaluation protocols for video-based food analysis.
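To make the link between mask flickering, identity fragmentation, and counting error concrete, the following is a minimal sketch of tracking-by-matching on toy masks. The abstract does not specify the matching rule, so everything here is an assumption for illustration: greedy matching on mask IoU, the `iou_thresh` value, the `max_gap` parameter, and the helper names `rect`, `mask_iou`, and `track_by_matching` are all hypothetical. Bridging short detection gaps with `max_gap` stands in for one simple form of post-hoc temporal regularization and is not necessarily the remedy studied in the paper.

```python
def rect(r0, c0, h, w):
    """Toy instance mask: the set of (row, col) pixels in a rectangle."""
    return frozenset((r, c) for r in range(r0, r0 + h) for c in range(c0, c0 + w))


def mask_iou(a, b):
    """Intersection over union of two pixel-set masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def track_by_matching(frames, iou_thresh=0.5, max_gap=0):
    """Assign track IDs by greedy IoU matching against recently seen tracks.

    frames: list of frames, each a list of masks (sets of pixels).
    max_gap: number of consecutive missed frames a track may survive
             (0 = match against the previous frame only).
    Returns (per-frame ID lists, total number of distinct IDs).
    """
    tracks = {}  # track_id -> (last mask, index of the frame it was last seen in)
    ids_per_frame = []
    for t, masks in enumerate(frames):
        # Tracks are eligible only if seen within the last max_gap + 1 frames.
        live = {tid: m for tid, (m, seen) in tracks.items() if t - seen <= max_gap + 1}
        # All candidate (track, detection) pairs, best overlap first.
        pairs = sorted(
            ((mask_iou(prev, m), tid, mi)
             for tid, prev in live.items()
             for mi, m in enumerate(masks)),
            reverse=True,
        )
        assigned, used = {}, set()
        for iou, tid, mi in pairs:
            if iou < iou_thresh:
                break  # remaining pairs overlap even less
            if tid in used or mi in assigned:
                continue
            assigned[mi] = tid
            used.add(tid)
        frame_ids = []
        for mi, m in enumerate(masks):
            tid = assigned.get(mi)
            if tid is None:
                tid = len(tracks)  # unmatched detection starts a new identity
            tracks[tid] = (m, t)
            frame_ids.append(tid)
        ids_per_frame.append(frame_ids)
    return ids_per_frame, len(tracks)


# One apple drifting right; its mask is missing in frame 2 (a "flicker").
frames = [[rect(2, 2, 4, 4)], [rect(2, 3, 4, 4)], [], [rect(2, 4, 4, 4)]]

ids_plain, count_plain = track_by_matching(frames)                  # no gap bridging
ids_bridged, count_bridged = track_by_matching(frames, max_gap=1)   # bridge 1 missed frame
print(count_plain, count_bridged)
```

In this toy sequence every frame that contains a detection segments the apple perfectly, yet the single dropped frame splits the track in two when matching only against the previous frame, so the apple is counted twice. Allowing a track to survive one missed frame restores a single identity. A frame-wise metric computed only over the detected masks would not register the difference.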
