Ablation Study to Clarify the Mechanism of Object Segmentation in Multi-Object Representation Learning

Multi-object representation learning aims to represent complex real-world visual input as a composition of multiple objects. Such methods typically use unsupervised learning to segment an input image into individual objects and to encode each object into its own latent vector. However, it is not clear how previous methods achieve an appropriate segmentation of individual objects. In addition, most previous methods regularize the latent vectors using a Variational Autoencoder (VAE), and it is not clear whether this VAE regularization contributes to appropriate object segmentation. To elucidate the mechanism of object segmentation in multi-object representation learning, we conducted an ablation study on MONet, a representative method. MONet represents multiple objects as pairs, each consisting of an attention mask and the latent vector corresponding to that mask. Each latent vector is encoded from the input image and its attention mask, and a component image and an attention mask are then decoded from each latent vector. The loss function of MONet consists of 1) the sum of reconstruction losses between the input image and the decoded component images, 2) the VAE regularization loss on the latent vectors, and 3) the reconstruction loss of the attention masks, which explicitly encodes shape information. We ablated each of these three loss terms to investigate its effect on segmentation performance. Our results showed that the VAE regularization loss did not affect segmentation performance, whereas the other two losses did. Based on this result, we hypothesize that what matters is maximizing each attention mask over the image region that is best represented by the single latent vector corresponding to that mask. We confirmed this hypothesis by evaluating a new loss function that implements the same mechanism.
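The three-term loss and its ablation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes flattened grayscale images held in plain Python lists, uses a mask-weighted squared error as a stand-in for the reconstruction term, a diagonal-Gaussian KL divergence for the VAE term, and a per-pixel KL divergence between attention masks and decoded masks for the mask term. The function names, the weights `beta` and `gamma`, and the `use_*` switches are hypothetical and exist only for illustration.

```python
import math


def reconstruction_loss(image, components, masks):
    """Mask-weighted squared error, summed over slots and pixels (assumed form)."""
    return sum(
        m * (x - c) ** 2
        for comp, mask in zip(components, masks)
        for x, c, m in zip(image, comp, mask)
    )


def vae_kl_loss(means, log_vars):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over slots and latent dimensions."""
    return sum(
        0.5 * (mu ** 2 + math.exp(lv) - 1.0 - lv)
        for slot_mu, slot_lv in zip(means, log_vars)
        for mu, lv in zip(slot_mu, slot_lv)
    )


def mask_loss(masks, decoded_masks, eps=1e-8):
    """KL between attention masks and decoded masks, summed over slots and pixels."""
    return sum(
        m * math.log((m + eps) / (d + eps))
        for mask, dec in zip(masks, decoded_masks)
        for m, d in zip(mask, dec)
    )


def monet_style_loss(image, components, masks, decoded_masks, means, log_vars,
                     beta=1.0, gamma=1.0,
                     use_recon=True, use_vae=True, use_mask=True):
    """Sum of the three terms; each use_* flag ablates one term.

    beta and gamma are placeholder weights, not values from the paper.
    """
    total = 0.0
    if use_recon:
        total += reconstruction_loss(image, components, masks)
    if use_vae:
        total += beta * vae_kl_loss(means, log_vars)
    if use_mask:
        total += gamma * mask_loss(masks, decoded_masks)
    return total


if __name__ == "__main__":
    # Toy example: 2 pixels, 2 slots, 1 latent dimension per slot.
    image = [1.0, 0.0]
    components = [[1.0, 1.0], [0.0, 0.0]]   # slot 0 explains pixel 0, slot 1 pixel 1
    good_masks = [[1.0, 0.0], [0.0, 1.0]]   # each mask attends to "its" pixel
    bad_masks = [[0.0, 1.0], [1.0, 0.0]]    # masks swapped
    means, log_vars = [[1.0], [0.0]], [[0.0], [0.0]]

    for name, masks in (("good", good_masks), ("bad", bad_masks)):
        full = monet_style_loss(image, components, masks, masks, means, log_vars)
        no_vae = monet_style_loss(image, components, masks, masks, means, log_vars,
                                  use_vae=False)
        print(name, "full:", full, "without VAE term:", no_vae)
```

In this toy setting the reconstruction term is smallest when each mask is concentrated on the pixels its own component reconstructs well, which mirrors the hypothesis stated above. Switching off the VAE term changes the loss value but does not change which mask assignment is preferred.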
