Food classification is an important task in health care. In this work, we propose a multimodal classification framework that uses a modified version of EfficientNet with the Mish activation function for image classification and the traditional BERT transformer-based network for text classification. The proposed network and other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101. The experimental results show that the proposed network outperforms the other methods: compared with the second-best performing method, it improves accuracy by 11.57% for image classification and by 6.34% for text classification. We also compare accuracy, precision, and recall for text classification across both machine learning and deep learning-based models. The comparative analysis of the prediction results for both images and text demonstrates the efficiency and robustness of the proposed approach.
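For reference, the Mish activation mentioned above is commonly defined as mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x). The following is a minimal Python sketch of that formula only. The function names and the numerically stable form of softplus are illustrative assumptions, not taken from this work, and the sketch does not show how the activation is integrated into the modified EfficientNet.

```python
import math


def softplus(x: float) -> float:
    """Softplus ln(1 + e^x), written so that exp() never overflows."""
    # max(x, 0) + log1p(exp(-|x|)) is algebraically equal to ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Print the activation at a few sample inputs (illustrative values only).
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, this function is smooth and lets small negative values pass through, approaching the identity for large positive inputs and zero for large negative inputs.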