RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

Large Multi-modal Models (LMMs) have significantly advanced a variety of vision-language tasks, and the scale and availability of high-quality training data play a pivotal role in their success. In the food domain, comprehensive datasets such as Recipe1M offer abundant ingredient and recipe information but often lack sufficient data for nutritional analysis. The Recipe1M+ dataset does include a subset for nutritional evaluation, but its nutrition information is limited in both scale and accuracy. To bridge this gap, we introduce Uni-Food, a unified food dataset of over 100,000 images annotated with several kinds of food labels: categories, ingredients, recipes, and ingredient-level nutritional information. Uni-Food is designed to support a more holistic approach to food data analysis and thereby to improve the performance and capabilities of LMMs in this domain. To mitigate the conflicts that arise from multi-task supervision when fine-tuning LMMs, we introduce a novel Linear Rectified Mixture of Diverse Experts (RoDE) approach. RoDE uses a diverse set of experts to handle tasks of varying complexity and thereby coordinates how trainable parameters are allocated: more complex tasks receive more parameters, and simpler tasks receive fewer. RoDE also applies linear rectification (a rectified linear unit) in its router, which improves the efficiency of sparse task allocation. These design choices make RoDE memory-efficient on GPUs and easy to optimize. Our experimental results validate the effectiveness of the proposed approach in addressing the inherent challenges of food-related multitasking.
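The abstract describes RoDE only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each "diverse" expert is a LoRA-style low-rank adapter whose rank (and therefore parameter count) differs from the others, and that "linear rectification" means the router passes its logits through a ReLU instead of a softmax, so experts with negative logits get a gate of exactly zero and are skipped. The class names `LowRankExpert` and `ReLURoutedMixture`, the chosen ranks, and the random initialization are all hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small Gaussian-initialized matrix (assumed init, for illustration only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class LowRankExpert:
    """Hypothetical LoRA-style expert: delta(x) = B @ (A @ x) with rank r."""

    def __init__(self, dim: int, rank: int, rng: random.Random):
        self.rank = rank
        self.A = random_matrix(rank, dim, rng)  # down-projection, r x d
        self.B = random_matrix(dim, rank, rng)  # up-projection, d x r

    def num_params(self) -> int:
        # A has r*d entries and B has d*r, so a larger rank means more parameters.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def __call__(self, x: Vector) -> Vector:
        return matvec(self.B, matvec(self.A, x))


class ReLURoutedMixture:
    """Hypothetical mixture of diverse-rank experts with a ReLU router."""

    def __init__(self, dim: int, ranks: List[int], seed: int = 0):
        rng = random.Random(seed)
        self.experts = [LowRankExpert(dim, r, rng) for r in ranks]
        # Linear router: one logit per expert.
        self.router = random_matrix(len(ranks), dim, rng)

    def gates(self, x: Vector) -> Vector:
        # ReLU instead of softmax: negative logits become exactly 0.0,
        # which switches the corresponding experts off.
        return [max(0.0, z) for z in matvec(self.router, x)]

    def __call__(self, x: Vector) -> Vector:
        out = [0.0] * len(x)
        for g, expert in zip(self.gates(x), self.experts):
            if g == 0.0:
                continue  # inactive expert: its adapter is never evaluated
            out = [o + g * d for o, d in zip(out, expert(x))]
        return out


if __name__ == "__main__":
    moe = ReLURoutedMixture(dim=16, ranks=[2, 4, 8])
    print([e.num_params() for e in moe.experts])  # [64, 128, 256]
    x = [1.0] * 16
    print("gates:", moe.gates(x))
    print("adapter output (first 4):", moe(x)[:4])
```

Because the ReLU gates are not normalized to sum to one, any number of experts (including none) can be active for a given input, whereas a softmax router assigns every expert a nonzero weight and therefore never skips computation. Under these assumptions, the rank of each expert is what lets harder tasks be routed to higher-capacity adapters.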

翻译：大型多模态模型（LMMs）在多种视觉-语言任务中取得了显著进展。高质量训练数据的可扩展性与可获得性对LMMs的成功具有关键作用。在食品领域，虽然如Recipe1M等综合性食品数据集提供了丰富的食材与食谱信息，但其营养分析数据往往不足。Recipe1M+数据集虽提供了营养评估子集，但其营养信息的规模与准确性仍存在局限。为弥补这一缺口，我们提出了Uni-Food——一个包含超过10万张图像的统一食品数据集，涵盖类别、食材、食谱及食材级营养信息等多维度标签。Uni-Food旨在为食品数据分析提供更全面的解决方案，从而提升LMMs在该领域的性能与能力。为缓解LMMs微调过程中多任务监督引发的冲突，我们提出了一种新颖的线性整流多样化专家混合（RoDE）方法。RoDE利用多样化的专家网络处理不同复杂度的任务，从而协调可训练参数的分配——即对更复杂的任务分配更多参数，反之对简单任务分配较少参数。RoDE通过线性整流联合机制优化路由器的功能，从而提升稀疏任务分配的效率。这些设计使RoDE兼具GPU内存高效性与优化便捷性。实验结果验证了所提方法在应对食品相关多任务固有挑战方面的有效性。