We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images that comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss that uses confidence scores to account for unlabeled images that may also align with a given description. To evaluate NaiLIA, we constructed a benchmark of 10,625 images collected from people with diverse cultural backgrounds, annotated with long, dense intent descriptions written by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.