Due to the availability of increasingly large amounts of visual data, there is a growing need for tools that can help users find relevant images. While existing tools can perform image retrieval based on similarity or metadata, they fall short in scenarios that require semantic reasoning about image content. This paper explores a new multi-modal image search approach that allows users to conveniently specify and perform semantic image search tasks. With our tool, PhotoScout, the user interactively provides natural language descriptions, positive and negative examples, and object tags to specify their search tasks. Under the hood, PhotoScout is powered by a program synthesis engine that generates visual queries in a domain-specific language and executes the synthesized program to retrieve the desired images. In a study with 25 participants, we observed that PhotoScout allows users to perform image retrieval tasks more accurately and with less manual effort.
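The abstract does not specify PhotoScout's DSL or synthesis algorithm. The following is only a toy sketch of the general idea of example-guided query synthesis, under stated assumptions: each image is reduced to the set of object tags detected in it, the DSL is a hypothetical one with just `Has`, `Not`, and `And` predicates, and the search is plain bottom-up enumeration that returns the first program accepting every positive example and rejecting every negative one. It ignores the natural language input entirely, and all names (`synthesize`, `run_query`, and so on) are illustrative.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Iterable, Iterator, List, Optional, Sequence, Set, Union

# Assumption: an image is reduced to the set of object tags detected in it.
Image = FrozenSet[str]


@dataclass(frozen=True)
class Has:
    """Hypothetical primitive predicate: the image contains an object with this tag."""
    tag: str

    def eval(self, img: Image) -> bool:
        return self.tag in img


@dataclass(frozen=True)
class Not:
    arg: "Prog"

    def eval(self, img: Image) -> bool:
        return not self.arg.eval(img)


@dataclass(frozen=True)
class And:
    left: "Prog"
    right: "Prog"

    def eval(self, img: Image) -> bool:
        return self.left.eval(img) and self.right.eval(img)


Prog = Union[Has, Not, And]


def enumerate_programs(tags: Iterable[str], max_depth: int) -> Iterator[Prog]:
    """Yield toy-DSL programs layer by layer, smallest first (bottom-up enumeration)."""
    seen: Set[Prog] = set()
    pool: List[Prog] = []
    layer: List[Prog] = [Has(t) for t in sorted(set(tags))]
    for _ in range(max_depth):
        fresh = [p for p in layer if p not in seen]
        for p in fresh:
            seen.add(p)
            yield p
        pool.extend(fresh)
        # Next layer: negate or conjoin any two programs enumerated so far.
        layer = [Not(p) for p in pool] + [And(a, b) for a, b in product(pool, pool) if a != b]


def synthesize(positives: Sequence[Image], negatives: Sequence[Image],
               tags: Iterable[str], max_depth: int = 2) -> Optional[Prog]:
    """Return the first program that accepts all positives and rejects all negatives."""
    for prog in enumerate_programs(tags, max_depth):
        if all(prog.eval(i) for i in positives) and not any(prog.eval(i) for i in negatives):
            return prog
    return None  # no consistent program within the depth bound


def run_query(prog: Prog, collection: Sequence[Image]) -> List[int]:
    """Execute a synthesized query and return the indices of matching images."""
    return [idx for idx, img in enumerate(collection) if prog.eval(img)]


if __name__ == "__main__":
    pos = [frozenset({"dog", "frisbee", "grass"}), frozenset({"dog", "frisbee", "beach"})]
    neg = [frozenset({"dog", "ball"}), frozenset({"cat", "frisbee"})]
    vocab = set().union(*pos, *neg)
    query = synthesize(pos, neg, vocab)
    print(query)  # And(left=Has(tag='dog'), right=Has(tag='frisbee'))
    photos = [frozenset({"dog", "frisbee", "park"}), frozenset({"dog", "park"}),
              frozenset({"cat", "frisbee"}), frozenset({"frisbee", "dog"})]
    print(run_query(query, photos))  # [0, 3]
```

Returning the smallest consistent program is one common way to avoid overfitting the examples; a real system would also need to rank candidates, use richer visual predicates (e.g. spatial relations or attributes), and incorporate the natural language description, none of which this sketch attempts.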