Food image segmentation is an important task with wide-ranging applications, such as estimating the nutritional value of a plate of food. Although machine learning models have been used for segmentation in this domain, food images pose several challenges. One challenge is that food items can overlap and mix, making them difficult to distinguish. Another is the high degree of inter-class similarity and intra-class variability, caused by the varied preparation methods and dishes in which a food item may be served. Additionally, class imbalance is an inevitable issue in food datasets. To address these issues, two models are trained and compared, one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT). The models are trained and evaluated using the FoodSeg103 dataset, which is identified as a robust benchmark for food image segmentation. The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103. This study provides insights into transferring knowledge using convolution and Transformer-based approaches in the food image domain.
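For readers unfamiliar with the reported metric, mean intersection over union (mIoU) averages, over classes, the ratio of correctly labelled pixels to all pixels assigned to that class by either the prediction or the ground truth. The following is a minimal sketch only: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both inputs are illustrative assumptions, not the evaluation code used in this study. The function returns a fraction, whereas the figure of 49.4 appears to be expressed on a 0 to 100 scale.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over two equal-length sequences of integer class labels.

    Hypothetical helper for illustration; assumes every label lies in
    range(num_classes). Classes absent from both inputs are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and six "pixels".
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5. Because every class contributes equally to the average regardless of how many pixels it covers, rare classes weigh as much as common ones, which is why class imbalance matters for this metric.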