We introduce a method to segment the visual field into independently moving regions, trained with no ground truth or supervision. It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention, modified to use the image as context to decode optical flow without attempting to reconstruct the image itself. In the resulting multi-modal representation, one modality (flow) feeds the encoder to produce separate latent codes (slots), whereas the other modality (image) conditions the decoder to reconstruct the flow from the slots. This design frees the representation from having to encode complex nuisance variability in the image due to, for instance, illumination and reflectance properties of the scene. Since customary autoencoding based on minimizing the reconstruction error does not preclude the entire flow from being encoded into a single slot, we replace the loss with an adversarial criterion based on Contextual Information Separation. The resulting min-max optimization fosters the separation of objects and their assignment to different attention slots, leading to Divided Attention, or DivA. DivA outperforms recent unsupervised multi-object motion segmentation methods while tripling run-time speed, to up to 104 FPS, and reducing the performance gap to supervised methods to 12% or less. DivA can handle different numbers of objects and different image sizes at training and test time, is invariant to permutation of object labels, and does not require explicit regularization.
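To make the min-max criterion more concrete, the following is a minimal NumPy sketch of one plausible form of such a loss. It is an illustration under stated assumptions, not the loss used in the paper: the function name `diva_style_loss`, the weight `lam`, the per-pixel L2 error, and the softmax over slot masks are all hypothetical choices. The inputs `slot_flows` (flow decoded from each slot, conditioned on the image) and `adv_flows` (an adversary's guess of the flow inside each mask from the context outside it) are assumed to come from networks that are not shown here.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diva_style_loss(flow, slot_flows, mask_logits, adv_flows, lam=0.1):
    """Hypothetical adversarial reconstruction loss (illustration only).

    flow:        (H, W, 2)    input optical flow
    slot_flows:  (K, H, W, 2) flow decoded from each of K slots
    mask_logits: (K, H, W)    unnormalized per-slot mask scores
    adv_flows:   (K, H, W, 2) adversary's prediction of the flow inside
                              mask k from the context outside mask k
    """
    h, w = flow.shape[:2]
    # Soft masks compete for each pixel and sum to 1 over the slots.
    masks = softmax(mask_logits, axis=0)
    # Per-pixel reconstruction error of each slot's decoded flow.
    rec_err = np.linalg.norm(flow[None] - slot_flows, axis=-1)
    # Per-pixel error of the adversary's context-based prediction.
    adv_err = np.linalg.norm(flow[None] - adv_flows, axis=-1)
    rec = (masks * rec_err).sum() / (h * w)
    adv = (masks * adv_err).sum() / (h * w)
    # The segmenter minimizes this value; the adversary would separately
    # minimize `adv`. A large adversary error means that the flow inside
    # a mask is hard to predict from its surroundings.
    return rec - lam * adv, masks

# Toy usage with random data (shapes only; no real flow is involved).
rng = np.random.default_rng(0)
K, H, W = 3, 4, 5
flow = rng.normal(size=(H, W, 2))
slot_flows = np.repeat(flow[None], K, axis=0)  # perfect reconstruction
mask_logits = rng.normal(size=(K, H, W))
adv_flows = np.zeros((K, H, W, 2))             # uninformed adversary
loss, masks = diva_style_loss(flow, slot_flows, mask_logits, adv_flows)
```

Because the loss sums over slots, reordering the slots leaves its value unchanged, which is consistent with the permutation invariance claimed above. The reconstruction term alone would be minimized equally well by a single slot covering the whole flow field; in this sketch, the adversarial term is what penalizes masks whose content is predictable from their context.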