Obesity, often dubbed the "heavy issue," is a leading cause of preventable chronic diseases worldwide. Traditional calorie-estimation tools often depend on specific data formats or complex pipelines, limiting their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled at understanding real-world contexts and supporting conversational interaction, making them well suited to downstream tasks such as ingredient analysis. Applying VLMs to calorie estimation, however, requires domain-specific data and alignment strategies. To this end, we curated CalData, a 330K image-text pair dataset tailored for ingredient recognition and calorie estimation, which combines a large-scale recipe corpus with detailed nutritional instructions for robust vision-language training. Built on this dataset, we present CaLoRAify, a VLM framework that aligns ingredient recognition with calorie estimation through training on visual-text pairs. At inference time, users need only a single monocular food image to estimate calories, while retaining the flexibility of agent-based conversational interaction. Using Low-Rank Adaptation (LoRA) and Retrieval-Augmented Generation (RAG), our system improves the performance of foundation VLMs in the vertical domain of calorie estimation. Our code and data are fully open source at https://github.com/KennyYao2001/16824-CaLORAify.
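To make the LoRA component concrete, the following is a minimal sketch of low-rank adaptation applied to a vision-language model, assuming the HuggingFace `transformers` and `peft` libraries. The base model name and LoRA hyperparameters are illustrative placeholders, not CaLoRAify's actual configuration.

```python
# Hedged sketch: LoRA fine-tuning of a generic VLM (not the authors' exact setup).
# Assumes `transformers` and `peft`; the base checkpoint and hyperparameters
# below are illustrative placeholders.
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = "llava-hf/llava-1.5-7b-hf"  # placeholder base VLM
model = LlavaForConditionalGeneration.from_pretrained(base)
processor = AutoProcessor.from_pretrained(base)

# LoRA injects small trainable low-rank matrices into the attention
# projections while the frozen base weights stay untouched.
lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Freezing the base weights and training only the adapters is what makes vertical-domain alignment on a 330K-pair dataset tractable on modest hardware.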
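Likewise, a minimal sketch of the RAG step: recognized ingredients become retrieval queries against a nutrition-fact store, and the retrieved facts ground the calorie-estimation prompt. The encoder, the toy nutrition table, and the `retrieve` helper below are all illustrative stand-ins, not CaLoRAify's data or API.

```python
# Hedged sketch: retrieval-augmented generation over a nutrition-fact store.
# The embedding model and nutrition entries are placeholders; a real system
# would index a full nutrition database.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

# Toy knowledge base: ingredient -> nutrition fact (kcal per 100 g).
facts = {
    "chicken breast": "chicken breast: 165 kcal per 100 g",
    "white rice": "white rice (cooked): 130 kcal per 100 g",
    "broccoli": "broccoli: 34 kcal per 100 g",
}
keys = list(facts)
emb = encoder.encode(keys, normalize_embeddings=True)

def retrieve(ingredient: str, k: int = 1) -> list[str]:
    """Return the k nutrition facts most similar to the query ingredient."""
    q = encoder.encode([ingredient], normalize_embeddings=True)
    scores = emb @ q[0]                    # cosine similarity (normalized)
    top = np.argsort(-scores)[:k]
    return [facts[keys[i]] for i in top]

# Ingredients recognized by the VLM become queries; retrieved facts are
# prepended to the prompt so the calorie estimate is grounded in data.
context = "\n".join(retrieve("grilled chicken") + retrieve("steamed rice"))
prompt = f"Nutrition facts:\n{context}\nEstimate the total calories of the dish."
```

Grounding the generation step in retrieved nutrition facts, rather than relying on the model's parametric knowledge alone, is what the abstract's RAG component refers to.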