Intelligent robots need to interact with diverse objects across various environments. The appearance and state of an object can undergo complex transformations depending on its properties, e.g., phase transitions. However, segmenting dynamic objects that undergo phase transitions has been overlooked in the vision community. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and their potential morphological and appearance changes. We then present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M3-VOS), to verify the ability of models to understand object phases. It consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios and provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M3-VOS, yielding several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent change in disorder suggests that the predictive performance of the forward, entropy-increasing process can be improved through a reverse, entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves performance through reversal refinement. Our data and code will be publicly available.