Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on object properties, e.g., phase transitions. However, the vision community has largely overlooked the segmentation of dynamic objects undergoing phase transitions. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and their potential morphological and appearance changes. We then present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models to understand object phases. It consists of 479 high-resolution videos spanning over 10 distinct everyday scenarios and provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M$^3$-VOS, yielding several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent changes in disorder suggest that the predictive performance of the forward entropy-increasing process can be improved through a reverse entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves performance through reversal refinement. Our data and code will be publicly available at https://zixuan-chen.github.io/M-cubeVOS.github.io/.