Food image classification is a fundamental step in image-based dietary assessment, which aims to estimate participants' nutrient intake from eating occasion images. A common challenge with food images is their high intra-class diversity and inter-class similarity, which can significantly hinder classification performance. To address this issue, we introduce a novel multi-modal contrastive learning framework called FMiFood, which learns more discriminative features by integrating additional contextual information, such as food category text descriptions, to enhance classification accuracy. Specifically, we propose a flexible matching technique that improves the similarity matching between text and image embeddings so that it focuses on multiple pieces of key information. Furthermore, we incorporate the classification objectives into the framework and explore the use of GPT-4 to enrich the text descriptions and provide more detailed context. Our method demonstrates improved performance on both the UPMC-101 and VFN datasets compared to existing methods.
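The abstract does not spell out the exact FMiFood objective, but the baseline it builds on — contrastive matching between image and text embeddings — can be sketched with a minimal CLIP-style symmetric InfoNCE loss. Everything below (function names, the temperature value, the use of NumPy) is illustrative, not the paper's implementation:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project each embedding onto the unit sphere so dot products are cosines.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched image-text pairs sit on the diagonal of the similarity matrix;
    all other entries in the same row or column serve as negatives.
    Illustrative sketch only -- not the FMiFood flexible-matching objective.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(logits.shape[0])

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()  # diagonal = matched pairs

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With aligned pairs (each image embedding identical to its text embedding) the loss approaches zero, while mismatched pairings drive it up; a flexible-matching variant as described in the abstract would relax the strict one-to-one diagonal correspondence assumed here.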