Latent Noise Segmentation: How Neural Noise Leads to the Emergence of Segmentation and Grouping

Deep Neural Networks (DNNs) that achieve human-level performance on general tasks such as object segmentation typically require supervised labels. In contrast, humans perform these tasks effortlessly without supervision. To accomplish this, the human visual system makes use of perceptual grouping. Understanding how perceptual grouping arises in an unsupervised manner is critical for improving both models of the visual system and computer vision models. In this work, we propose a counterintuitive approach to unsupervised perceptual grouping and segmentation: that they arise because of neural noise, rather than in spite of it. We (1) mathematically demonstrate that, under realistic assumptions, neural noise can be used to separate objects from each other, and (2) show that adding noise in a DNN enables the network to segment images even though it was never trained on any segmentation labels. Interestingly, we find that (3) segmenting objects using noise results in segmentation performance that aligns with the perceptual grouping phenomena observed in humans. We introduce the Good Gestalt (GG) datasets, six datasets designed specifically to test perceptual grouping, and show that our DNN models reproduce many important phenomena in human perception, such as illusory contours, closure, continuity, proximity, and occlusion. Finally, we (4) demonstrate the ecological plausibility of the method by analyzing the sensitivity of the DNN to different magnitudes of noise. We find that some model variants consistently succeed with remarkably low levels of neural noise ($\sigma<0.001$) and, surprisingly, that segmenting this way requires only a handful of samples. Together, our results suggest a novel unsupervised segmentation method requiring few assumptions, a new explanation for the formation of perceptual grouping, and a potential benefit of neural noise in the visual system.
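To make the core idea concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes a hypothetical two-unit latent code in which each unit drives the pixels of one object, a hand-built `toy_decoder` standing in for a trained DNN, and a simple correlation threshold standing in for whatever clustering step the full method uses. Small Gaussian noise ($\sigma = 0.001$) is added to the latent code, and pixels whose outputs fluctuate together are assigned to the same group. The sample count of 200 is chosen only to make the fixed threshold robust in this toy setting.

```python
import numpy as np

def segment_by_noise(decoder, z, sigma=1e-3, n_samples=200, seed=0):
    """Group pixels by how their outputs co-vary under latent noise.

    Hypothetical helper for illustration only: returns 1 for pixels that
    fluctuate together with pixel 0, and 0 for all other pixels.
    """
    rng = np.random.default_rng(seed)
    # Perturb the latent code with small, independent Gaussian noise.
    noisy_z = z + sigma * rng.standard_normal((n_samples, z.size))
    # Decode every noisy latent: shape (n_samples, n_pixels).
    outputs = np.stack([decoder(zi) for zi in noisy_z])
    # Pixel-by-pixel correlation of the noise-induced fluctuations.
    corr = np.corrcoef(outputs.T)
    # Assumed grouping rule: threshold the correlation with pixel 0.
    return (corr[0] > 0.5).astype(int)

# Toy "image" of 16 pixels: pixels 0-7 form object A, pixels 8-15 object B.
# Assumption: latent unit 0 drives object A only, latent unit 1 object B only.
weight_rng = np.random.default_rng(1)
n_pixels = 16
W = np.zeros((n_pixels, 2))
W[:8, 0] = weight_rng.uniform(0.5, 1.5, size=8)
W[8:, 1] = weight_rng.uniform(0.5, 1.5, size=8)

def toy_decoder(z):
    # Stand-in for a trained decoder network.
    return np.tanh(W @ z)

labels = segment_by_noise(toy_decoder, np.array([0.3, -0.2]))
print(labels)
```

In this sketch the two objects separate because the noise in each latent unit is independent, so pixels driven by different units fluctuate independently while pixels driven by the same unit fluctuate in lockstep. No segmentation labels are used at any point.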
