Zero-shot image recognition (ZSIR) aims to enable models to recognize and reason about unseen domains by learning generalizable knowledge from limited data in the seen domain. The core of ZSIR is element-wise representation and reasoning from the input visual space to the target semantic space, a bottom-up modeling paradigm inspired by how humans observe the world: new concepts are captured by learning and combining basic components or shared characteristics. In recent years, element-wise learning techniques have made significant progress in ZSIR and have been widely applied. However, to the best of our knowledge, a systematic overview of this topic is still lacking. To enrich the literature and provide a sound basis for future development, this paper presents a broad review of recent advances in element-wise ZSIR. Concretely, we first attempt to integrate the three basic ZSIR tasks (object recognition, compositional recognition, and foundation model-based open-world recognition) into a unified element-wise perspective, and provide a detailed taxonomy and analysis of the main research approaches. Then, we summarize key information and benchmarks, including technical implementation details and common datasets. Finally, we outline the wide range of related applications, discuss vital challenges, and suggest potential future directions.