Research on food image understanding using recipe data has long attracted attention because of the diversity and complexity of the data. Moreover, food is inextricably linked to people's daily lives, making it a vital research area for practical applications such as dietary management. Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, drawing not only on vast knowledge but also on an ability to handle language naturally; although English is their predominant language, they also support others, including Japanese. MLLMs are therefore expected to substantially improve performance on food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked them against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, on 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation shows that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation: our model achieved an F1 score of 0.531, surpassing GPT-4o's 0.481. Furthermore, our model performed comparably to GPT-4o in generating cooking procedure text.
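The ingredient-generation comparison above is reported as an F1 score. The abstract does not specify the exact matching protocol, but a common way to score a predicted ingredient list against a reference list is set-based precision/recall over normalized ingredient names. The sketch below illustrates that scoring scheme; the function name and exact-match normalization are assumptions, not the paper's definition.

```python
def ingredient_f1(predicted, reference):
    """Set-based F1 between predicted and reference ingredient lists.

    Assumed scoring sketch: ingredients are compared by exact string
    match after whitespace stripping and lowercasing. The actual paper
    may use a different matching protocol.
    """
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in reference}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)              # ingredients present in both lists
    precision = tp / len(pred)         # fraction of predictions that are correct
    recall = tp / len(gold)            # fraction of reference items recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 of 3 predictions correct, 3 of 4 references found.
score = ingredient_f1(["egg", "rice", "soy sauce"],
                      ["egg", "rice", "onion", "soy sauce"])
```

A corpus-level score would then average this per-recipe F1 over all evaluation samples (or accumulate counts for a micro-averaged variant).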