For artificial agents to perform tasks successfully in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research is often evaluated only on repurposed datasets such as CIFAR-10, which were originally intended for object classification and whose images focus on one distinct, well-centered object. New benchmarks are needed to represent the challenges of navigating the complex scenes of an open world. Our new NovelCraft dataset contains multimodal episodic data of the images and symbolic world-states seen by an agent completing a pogo stick assembly task within a modified Minecraft environment. In some episodes, we insert novel objects into the complex 3D scene; these objects may impact gameplay and appear in a variety of sizes and positions. Our visual novelty detection benchmark finds that methods that rank best on popular area-under-the-curve metrics may be outperformed by simpler alternatives when controlling false positives matters most. Further multimodal novelty detection experiments suggest that methods that fuse visual and symbolic information can improve time until detection as well as overall discrimination. Finally, our evaluation of recent generalized category discovery methods suggests that adapting to new imbalanced categories in complex scenes remains an exciting open problem.
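To make concrete the point that a method ranking best on an area-under-the-curve metric can still lose when false positives must be tightly controlled, the following is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the two detectors, their scores, the helper names (`roc_points`, `auroc`, `tpr_at_fpr`), and the 5% false positive budget are invented toy values, not data, code, or results from NovelCraft.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """ROC points from sweeping a threshold from high to low scores.

    labels: 1 = novel, 0 = normal. A higher score means "more novel".
    Tied scores are processed together so each tie yields one point.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def tpr_at_fpr(points: Sequence[Point], max_fpr: float) -> float:
    """Best true positive rate reachable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data (hypothetical): 10 normal images followed by 10 novel images.
labels = [0] * 10 + [1] * 10

# Detector A separates the classes well overall, but gives one normal
# image the highest score of all.
scores_a = [0.99] + [0.1] * 9 + [0.8] * 10

# Detector B misses four novel images entirely, but never scores a
# normal image above the six novel images it does catch.
scores_b = [0.5] * 10 + [0.9] * 6 + [0.05] * 4

for name, scores in (("A", scores_a), ("B", scores_b)):
    pts = roc_points(scores, labels)
    print(name, "AUROC:", round(auroc(pts), 3),
          "TPR at FPR <= 5%:", round(tpr_at_fpr(pts, 0.05), 3))
```

In this toy setup, detector A has the higher AUROC (about 0.9 versus 0.6) yet detects nothing within the 5% false positive budget, while detector B detects 60% of the novel images. Which metric matters depends on how costly false alarms are for the agent.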