For artificial agents to perform tasks successfully in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research is often evaluated only on repurposed datasets such as CIFAR-10, which were originally intended for object classification and whose images focus on one distinct, well-centered object. New benchmarks are needed to represent the challenges of navigating the complex scenes of an open world. Our new NovelCraft dataset contains multimodal episodic data of the images and symbolic world-states seen by an agent completing a pogo stick assembly task within a modified Minecraft environment. In some episodes, we insert into the complex 3D scene novel objects that vary in size and may impact gameplay. Our visual novelty detection benchmark finds that methods that rank best on popular area-under-the-curve metrics may be outperformed by simpler alternatives when controlling false positives matters most. Further multimodal novelty detection experiments suggest that methods that fuse visual and symbolic information can improve both time until detection and overall discrimination. Finally, our evaluation of recent generalized category discovery methods suggests that adapting to new imbalanced categories in complex scenes remains an exciting open problem.
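The claim that a detector ranking best on an area-under-the-curve metric can still lose when false positives must be tightly controlled can be illustrated with a small sketch. The scores below are invented toy values, not NovelCraft results, and the helper names `auroc` and `tpr_at_fpr` are hypothetical; the sketch assumes a higher score means "more likely novel" and that a frame is flagged when its score exceeds a threshold.

```python
import numpy as np


def auroc(scores_normal, scores_novel):
    """Probability that a random novel example outscores a random normal one.

    Ties count as half, which matches the usual rank-based AUROC definition.
    """
    normal = np.asarray(scores_normal, dtype=float)
    novel = np.asarray(scores_novel, dtype=float)
    greater = (novel[:, None] > normal[None, :]).sum()
    ties = (novel[:, None] == normal[None, :]).sum()
    return float((greater + 0.5 * ties) / (novel.size * normal.size))


def tpr_at_fpr(scores_normal, scores_novel, max_fpr):
    """Best true positive rate among thresholds whose false positive rate <= max_fpr."""
    normal = np.asarray(scores_normal, dtype=float)
    novel = np.asarray(scores_novel, dtype=float)
    best = 0.0
    # Every observed score is a candidate threshold; flag when score > threshold.
    for threshold in np.unique(np.concatenate([normal, novel])):
        fpr = np.mean(normal > threshold)
        if fpr <= max_fpr:
            best = max(best, float(np.mean(novel > threshold)))
    return best


# Toy detector A (assumed values): ranks most novel examples above most normal
# ones, but two normal examples receive the highest scores of all.
a_normal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.95, 1.00]
a_novel = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.90]

# Toy detector B (assumed values): misses some novel examples entirely, but the
# ones it catches score above every normal example.
b_normal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
b_novel = [0.90] * 6 + [0.00] * 4

auroc_a = auroc(a_normal, a_novel)
auroc_b = auroc(b_normal, b_novel)
tpr_a = tpr_at_fpr(a_normal, a_novel, max_fpr=0.1)
tpr_b = tpr_at_fpr(b_normal, b_novel, max_fpr=0.1)

print(f"A: AUROC={auroc_a:.2f}, TPR at FPR<=0.1: {tpr_a:.2f}")
print(f"B: AUROC={auroc_b:.2f}, TPR at FPR<=0.1: {tpr_b:.2f}")
```

In this toy setup detector A has the higher AUROC, yet detector B detects more novel examples once the false positive rate is capped, which is the kind of ranking reversal the benchmark reports.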