Current visual detectors, though impressive within their training distribution, often fail to parse out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods adapt the network parameters to each test example independently using auxiliary self-supervised losses, and have shown promising out-of-distribution generalization for image classification. We find evidence that these losses alone are insufficient for scene decomposition unless architectural inductive biases are also considered. Recent slot-centric generative models attempt to decompose scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Slot-TTA, a semi-supervised slot-centric scene decomposition model that is adapted per scene at test time through gradient descent on reconstruction or cross-view synthesis objectives. We evaluate Slot-TTA across multiple input modalities (images and 3D point clouds) and show substantial out-of-distribution performance improvements over state-of-the-art supervised feed-forward detectors and alternative test-time adaptation methods.
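To make the idea of per-scene adaptation concrete, the following is a minimal sketch of test-time adaptation by gradient descent on a reconstruction objective. It is not the Slot-TTA architecture: a tiny linear autoencoder stands in for the model, and there are no slots, no segmentation head, and no cross-view synthesis. All names (`reconstruction_loss`, `adapt_to_example`, `W_enc`, `W_dec`) and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np


def reconstruction_loss(W_enc, W_dec, x):
    """Mean squared error between the input x and its reconstruction."""
    z = W_enc @ x          # encode: latent code for this example
    x_hat = W_dec @ z      # decode: reconstructed input
    return float(np.mean((x_hat - x) ** 2))


def adapt_to_example(W_enc, W_dec, x, steps=200, lr=0.02):
    """Adapt copies of the weights to a single test example.

    Runs plain gradient descent on the reconstruction loss only, so no
    labels are needed at test time. The weights passed in are left
    untouched, which lets every test example start from the same
    trained parameters (adaptation is independent per example).
    """
    W_enc = W_enc.copy()
    W_dec = W_dec.copy()
    n = x.size
    for _ in range(steps):
        z = W_enc @ x
        residual = W_dec @ z - x
        grad_out = 2.0 * residual / n               # dL/dx_hat
        grad_dec = np.outer(grad_out, z)            # dL/dW_dec
        grad_enc = np.outer(W_dec.T @ grad_out, x)  # dL/dW_enc
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec


# Hypothetical "trained" weights and one test example (random stand-ins).
rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 3
W_enc0 = 0.1 * rng.standard_normal((latent_dim, input_dim))
W_dec0 = 0.1 * rng.standard_normal((input_dim, latent_dim))
x_test = rng.standard_normal(input_dim)

loss_before = reconstruction_loss(W_enc0, W_dec0, x_test)
W_enc1, W_dec1 = adapt_to_example(W_enc0, W_dec0, x_test)
loss_after = reconstruction_loss(W_enc1, W_dec1, x_test)
print(f"reconstruction loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In a full model, the adapted encoder would then feed whatever head produces the scene decomposition; the sketch only shows the adaptation loop itself, and that each test example is adapted starting from the same unmodified weights.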