This paper addresses the challenge of learning representations for recipes and food images in cross-modal retrieval. Because the relationship between a recipe and its cooked dish is cause-and-effect, treating a recipe merely as a text description of a dish's visual appearance, as existing approaches do, introduces a bias that misleads image-recipe similarity judgment. Specifically, a food image may not equally capture every detail in a recipe, owing to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture the dominant visual-text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model this bias in cross-modal representation learning using causal theory. The causal view of the problem identifies ingredients as one source of confounding and shows that a simple backdoor adjustment can alleviate the bias. Through causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term that removes the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset reaches MedR=1 at testing-set sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, essentially a multi-label ingredient classifier, for debiasing. New state-of-the-art search performance is reported on the Recipe1M dataset.
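The backdoor adjustment mentioned above can be sketched in its standard textbook form; here $V$ denotes the food image, $R$ the recipe, and $Z$ the ingredient confounder (these symbol names are illustrative and may differ from the paper's own notation):

```latex
% Standard backdoor adjustment over the ingredient confounder Z
% (symbols V, R, Z are illustrative, not necessarily the paper's notation)
P\bigl(R \mid \mathrm{do}(V)\bigr)
  \;=\; \sum_{z} P\bigl(R \mid V, Z = z\bigr)\, P\bigl(Z = z\bigr)
```

Intuitively, instead of scoring recipes against the observed image alone, the model averages the image-conditioned similarity over the ingredient distribution, which is what the proposed multi-label ingredient classifier is intended to estimate.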