The demand for accurate food quantification has increased in recent years, driven by the needs of dietary monitoring applications. At the same time, computer vision approaches have shown great potential for automating tasks in the food domain. Traditionally, machine learning models for these problems are trained on data sets with pixel-level class annotations. However, data collection and ground truth generation quickly become costly and error-prone, since they must be performed in multiple settings and for thousands of classes. To overcome these challenges, this paper presents a weakly supervised methodology for training food image classification and semantic segmentation models without relying on pixel-level annotations. The proposed methodology combines multiple instance learning with an attention-based mechanism. At test time, the models are used for classification and, concurrently, the attention mechanism generates semantic heat maps that are used for food class segmentation. We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach, and we examine how the attention mechanism behaves.
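The abstract does not specify the model architecture, so the following is only a minimal sketch of how attention-based multiple instance learning can produce both an image-level representation and a patch-level heat map. It assumes a common attention pooling form (a tanh projection followed by a softmax over patches), which may differ from what the paper actually uses. The names `attention_mil_pool`, `attention_to_heatmap`, `V`, and `w`, the toy sizes, and the random parameters are all hypothetical and stand in for learned weights and real patch embeddings.

```python
import math
import random


def attention_mil_pool(patches, V, w):
    """Pool patch embeddings into one bag embedding using attention weights.

    patches: list of K patch embeddings, each a list of D floats
    V:       projection matrix, L rows of D floats (hypothetical parameter)
    w:       scoring vector of L floats (hypothetical parameter)
    Returns (bag_embedding, attention), where attention sums to 1 over patches.
    """
    scores = []
    for h in patches:
        # Hidden representation: tanh(V h), one value per row of V.
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        # Scalar relevance score for this patch: w . tanh(V h).
        scores.append(sum(wj * uj for wj, uj in zip(w, hidden)))

    # Softmax over patches, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]

    # Bag embedding: attention-weighted sum of the patch embeddings.
    dim = len(patches[0])
    bag = [sum(a * h[d] for a, h in zip(attention, patches)) for d in range(dim)]
    return bag, attention


def attention_to_heatmap(attention, rows, cols):
    """Arrange per-patch attention on a rows x cols grid, min-max scaled to [0, 1]."""
    lo, hi = min(attention), max(attention)
    span = (hi - lo) or 1.0  # avoid division by zero if all weights are equal
    scaled = [(a - lo) / span for a in attention]
    return [scaled[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy example: a 4 x 4 grid of patches with random embeddings and parameters.
rng = random.Random(0)
ROWS, COLS, D, L = 4, 4, 8, 5  # illustrative sizes, not taken from the paper
patches = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(ROWS * COLS)]
V = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(L)]
w = [rng.gauss(0, 1) for _ in range(L)]

bag, attention = attention_mil_pool(patches, V, w)
heatmap = attention_to_heatmap(attention, ROWS, COLS)
```

In a trained model, `bag` would feed an image-level classifier supervised only by image labels, while `heatmap` could be upsampled and thresholded to obtain a coarse segmentation mask. That is the sense in which no pixel-level annotation is needed during training.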