We introduce a method to segment the visual field into independently moving regions, trained without ground truth or supervision. It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention, modified to use the image as context to decode optical flow without attempting to reconstruct the image itself. In the resulting multi-modal representation, one modality (flow) feeds the encoder to produce separate latent codes (slots), whereas the other modality (image) conditions the decoder to generate the flow from the slots. This design frees the representation from having to encode complex nuisance variability in the image due to, for instance, illumination and reflectance properties of the scene. Since customary autoencoding based on minimizing the reconstruction error does not preclude the entire flow from being encoded into a single slot, we replace the loss with an adversarial criterion based on Contextual Information Separation. The resulting min-max optimization fosters the separation of objects and their assignment to different attention slots, leading to Divided Attention, or DivA. DivA outperforms recent unsupervised multi-object motion segmentation methods while tripling run-time speed, up to 104 FPS, and narrowing the performance gap to supervised methods to 12% or less. DivA can handle different numbers of objects and different image sizes at training and test time, is invariant to permutation of object labels, and does not require explicit regularization.
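To illustrate why reconstruction error alone does not prevent the flow from collapsing into a single slot, the following toy sketch compares two slot configurations on a four-pixel flow field. The loss form used here (masked reconstruction error minus a weighted out-of-mask error) is an assumption chosen for illustration, not the exact objective of the method; the names `separation_loss`, `sq_err`, and the weight `lam` are hypothetical, and no encoder, decoder, or min-max training loop is modeled.

```python
# Toy illustration only: this loss form is an assumption, not the paper's objective.

def sq_err(a, b):
    """Squared Euclidean distance between two 2D flow vectors."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def separation_loss(flow, slot_flows, masks, lam=0.5):
    """Return (reconstruction, adversarial, total) for one flow field.

    flow:       list of N (dx, dy) vectors, the observed optical flow
    slot_flows: K lists of N (dx, dy) vectors, the flow decoded from each slot
    masks:      K lists of N weights in [0, 1], one soft mask per slot
    lam:        hypothetical weight on the adversarial term
    """
    n_pixels = len(flow)
    n_slots = len(slot_flows)
    recon = 0.0
    adv = 0.0
    for decoded, mask in zip(slot_flows, masks):
        for u, u_hat, m in zip(flow, decoded, mask):
            err = sq_err(u, u_hat)
            recon += m * err          # error inside the slot's own mask
            adv += (1.0 - m) * err    # error outside the slot's mask
    recon /= n_pixels
    adv /= n_pixels * n_slots
    # Lower total is better: small error inside the mask, large error outside it.
    return recon, adv, recon - lam * adv


# Two pixels move right, two pixels move up: two independently moving regions.
flow = [(1, 0), (1, 0), (0, 2), (0, 2)]
zero = (0, 0)

# Collapsed: every slot decodes the whole field and one mask covers everything.
collapsed = separation_loss(
    flow,
    slot_flows=[flow, flow],
    masks=[[1, 1, 1, 1], [0, 0, 0, 0]],
)

# Separated: each slot decodes only the motion of its own region.
separated = separation_loss(
    flow,
    slot_flows=[[(1, 0), (1, 0), zero, zero], [zero, zero, (0, 2), (0, 2)]],
    masks=[[1, 1, 0, 0], [0, 0, 1, 1]],
)

print("collapsed (recon, adv, total):", collapsed)
print("separated (recon, adv, total):", separated)
```

In this toy setting both configurations reconstruct the flow exactly, so a pure reconstruction loss cannot tell them apart. Only the out-of-mask term distinguishes them, and it favors the configuration in which each slot carries no information about the flow outside its own region.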