Multi-turn textual feedback-based fashion image retrieval addresses a real-world setting in which users iteratively provide information to refine retrieval results until they find an item that meets all their requirements. In this work, we present FashionNTM, a novel memory-based method for such a multi-turn system. Our framework incorporates a new Cascaded Memory Neural Turing Machine (CM-NTM) approach for implicit state management, learning to integrate information across all past turns in order to retrieve new images at a given turn. Unlike the vanilla Neural Turing Machine (NTM), our CM-NTM operates on multiple inputs, each of which interacts with its own memory via individual read and write heads, to learn complex relationships. Extensive evaluation shows that our method outperforms the previous state-of-the-art algorithm by 50.5% on Multi-turn FashionIQ, currently the only existing multi-turn fashion dataset, and achieves a relative improvement of 12.6% on Multi-turn Shoes, an extension of the single-turn Shoes dataset that we created in this work. Further analysis in a real-world interactive setting demonstrates two important capabilities of our model: memory retention across turns, and insensitivity to turn order for non-contradictory feedback. Finally, user study results show that images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. Project page: https://sites.google.com/eng.ucsd.edu/fashionntm
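To make the idea of "multiple inputs, each with its own memory and its own read and write heads" concrete, the following is a minimal sketch in plain Python. It uses the standard NTM content-based addressing (softmax over scaled cosine similarity) and the standard erase/add write rule. Everything else is an illustrative assumption rather than the FashionNTM implementation: the class name `MultiMemory`, the use of the raw input as both key and add vector, the fixed erase value, and the sharpness `beta` are hypothetical. The sketch does not reproduce the cascade between memories or the learned controller, which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small epsilon so all-zero rows are safe."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def address(memory, key, beta=5.0):
    """Content-based addressing: softmax over beta-scaled cosine similarity."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, weights):
    """Read head: weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory))
            for j in range(width)]

def write(memory, weights, erase, add):
    """Write head: NTM erase-then-add update, returns a new memory."""
    return [[m * (1.0 - w * e) + w * a for m, e, a in zip(row, erase, add)]
            for w, row in zip(weights, memory)]

class MultiMemory:
    """Hypothetical container: one memory matrix per input stream."""

    def __init__(self, num_inputs, slots, width):
        self.width = width
        self.memories = [[[0.0] * width for _ in range(slots)]
                         for _ in range(num_inputs)]

    def step(self, inputs):
        """One turn: each input reads from, then writes to, its own memory."""
        reads = []
        for i, x in enumerate(inputs):
            w = address(self.memories[i], x)
            reads.append(read(self.memories[i], w))
            # Assumption: the input itself serves as the add vector and the
            # erase vector is fixed; a real NTM controller would emit both.
            erase = [0.5] * self.width
            self.memories[i] = write(self.memories[i], w, erase, x)
        return reads

if __name__ == "__main__":
    mem = MultiMemory(num_inputs=2, slots=3, width=2)
    turn = [[1.0, 0.0], [0.0, 1.0]]
    print(mem.step(turn))  # first turn reads from empty memories
    print(mem.step(turn))  # second turn reads back what turn one wrote
```

Because each memory persists across calls to `step`, the read at a later turn reflects what earlier turns wrote, which is the sense in which such a memory can carry state between turns.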