Models based on convolutional neural networks (CNNs) and transformers have steadily improved and have been applied to a variety of downstream computer vision tasks. However, in object detection, accurately localizing and classifying the nearly unlimited variety of food categories in images remains challenging. To address this problem, we first segmented the food as the region of interest (ROI) using the Segment Anything Model (SAM) and masked the region outside the ROI with black pixels. This reduced the problem to a single classification task, for which annotation and training are much simpler than for object detection. The ROI-only images were then used to fine-tune various off-the-shelf models, each encoding its own inductive biases. Among them, Data-efficient image Transformers (DeiTs) achieved the best classification performance. Nonetheless, when foods had similar shapes and textures, the contextual features of the ROI-only images were insufficient for accurate classification. We therefore introduced a novel combined architecture, RveRNet, consisting of ROI, extra-ROI, and integration modules, which allows it to account for both the ROI's and the global context. RveRNet's F1 score was 10% higher than that of individual models when classifying ambiguous food images. RveRNet performed best when its modules were DeiTs with knowledge distillation from a CNN. We also investigated how architectures can be made robust against input noise caused by permutation and translocation. The results indicated a trade-off between how much of the CNN teacher's knowledge could be distilled into DeiT and DeiT's innate strengths. Code is publicly available at: https://github.com/Seonwhee-Genome/RveRNet.
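The ROI-masking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a binary mask (such as one produced by SAM) is already available, and simply zeroes out every pixel outside it.

```python
import numpy as np

def mask_non_roi(image: np.ndarray, roi_mask: np.ndarray) -> np.ndarray:
    """Keep ROI pixels and set everything else to black.

    image:    H x W x 3 uint8 array
    roi_mask: H x W boolean array, e.g. from a segmentation
              model such as SAM (assumed given here)
    """
    out = np.zeros_like(image)          # start from an all-black image
    out[roi_mask] = image[roi_mask]     # copy back only the ROI pixels
    return out

# toy example: uniform 4x4 image, ROI is the top-left 2x2 block
img = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
masked = mask_non_roi(img, mask)
```

The masked image retains the original values inside the ROI and is zero (black) everywhere else, which is the input format fed to the classification backbones.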
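The two-branch design (ROI module, extra-ROI module, and an integration module that fuses them) can be sketched schematically. The code below is a hedged toy, not the actual RveRNet: the random linear "encoders" stand in for the real backbones (e.g., DeiT or a CNN), and the integration module is assumed here to be simple feature concatenation followed by a linear classification head.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    # stand-in for a backbone module: flatten the image, then project
    return np.tanh(x.reshape(-1) @ w)

d_in, d_feat, n_classes = 4 * 4 * 3, 8, 5
w_roi = rng.normal(size=(d_in, d_feat))   # hypothetical ROI-branch weights
w_ext = rng.normal(size=(d_in, d_feat))   # hypothetical extra-ROI-branch weights
w_cls = rng.normal(size=(2 * d_feat, n_classes))  # integration head

roi_img = rng.random((4, 4, 3))   # ROI-only image (non-ROI blacked out)
ext_img = rng.random((4, 4, 3))   # complementary extra-ROI (context) image

# integration module: concatenate branch features, then classify
fused = np.concatenate([encode(roi_img, w_roi), encode(ext_img, w_ext)])
logits = fused @ w_cls
pred = int(np.argmax(logits))
```

The point of the sketch is only the information flow: each branch sees a different view of the same scene, and the integration module combines both before the final decision, letting the classifier use global context when ROI features alone are ambiguous.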