Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to small, quantifiable shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We analyze in detail how the error depends on each of these variables and run synthetic experiments that validate our theoretical predictions to single-pixel precision. We also run experiments on CLEVR-based data and show that, unlike current state-of-the-art object detection methods (SAM, CutLER), our method's prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.