Research on food image understanding using recipe data has long attracted attention because of the diversity and complexity of the data. Moreover, food is inextricably linked to people's daily lives, making it a vital research area for practical applications such as dietary management. Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, drawing not only on vast knowledge but also on an ability to handle language naturally; although English is their predominant language, they also support others, including Japanese. MLLMs are therefore expected to substantially improve performance on food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked them against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, on 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation shows that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation: our model achieved an F1 score of 0.531, surpassing GPT-4o's 0.481. Furthermore, our model performed comparably to GPT-4o in generating cooking procedure text.
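The ingredient-generation comparison above is reported as an F1 score. The abstract does not specify the exact matching protocol, but a common way to score a predicted ingredient list against a reference list is set-based precision/recall over normalized ingredient names. The sketch below illustrates that scoring scheme; the function name and exact-match normalization are assumptions, not the paper's definition.

```python
def ingredient_f1(predicted, reference):
    """Set-based F1 between predicted and reference ingredient lists.

    Assumed scoring sketch: ingredients are compared by exact string
    match after whitespace stripping and lowercasing. The actual paper
    may use a different matching protocol.
    """
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in reference}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)              # ingredients present in both lists
    precision = tp / len(pred)         # fraction of predictions that are correct
    recall = tp / len(gold)            # fraction of reference items recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 of 3 predictions correct, 3 of 4 references found.
score = ingredient_f1(["egg", "rice", "soy sauce"],
                      ["egg", "rice", "onion", "soy sauce"])
```

A corpus-level score would then average this per-recipe F1 over all evaluation samples (or accumulate counts for a micro-averaged variant).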