In recent years, test-time adaptive object detection has attracted increasing attention for its unique advantages in online domain adaptation, a setting that aligns closely with real-world applications. However, existing approaches rely heavily on statistics derived from the source domain and make the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method, which eliminates the need for source data entirely and overcomes the traditional closed-set limitation. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for test-time adaptation of vision-language detectors, which incorporates text and visual prompt tuning to adapt both the language and vision representation spaces to the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored to the visual prompts that effectively preserves the representation capability of the vision branch. Furthermore, to ensure high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies, Memory Enhancement and Memory Hallucination, which leverage IDM's high-quality instances to enhance the original predictions and to hallucinate labeled instances for images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
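As a concrete illustration of the adaptation loop sketched above, the following PyTorch snippet shows how a mean-teacher scheme might update only the text and visual prompt tokens of a frozen vision-language detector at test time. Everything here is an illustrative assumption rather than the paper's implementation: the `detector(images, prompts=..., targets=...)` interface, the prompt shapes, the score threshold, and the hyperparameters are all placeholders.

```python
# Minimal sketch of a multi-modal prompt-based mean-teacher TTA loop.
# All interfaces (the detector call signature, prompt wiring) are assumed,
# not taken from the paper's code.
import copy
import torch


def ema_update(teacher, student, momentum=0.999):
    # Teacher prompts track an exponential moving average of student prompts.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1.0 - momentum)


class PromptTTA:
    def __init__(self, detector, num_tokens=4, dim=512, lr=1e-3):
        self.detector = detector  # frozen vision-language detector (assumed API)
        for p in self.detector.parameters():
            p.requires_grad_(False)
        # Only the prompt tokens are adapted (parameter-efficient).
        # Near-zero initialization of the visual prompts stands in for the
        # paper's Test-time Warm-start; the actual strategy is paper-specific.
        self.student_prompts = torch.nn.ParameterDict({
            "text":   torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02),
            "visual": torch.nn.Parameter(torch.randn(num_tokens, dim) * 0.02),
        })
        self.teacher_prompts = copy.deepcopy(self.student_prompts)
        for p in self.teacher_prompts.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.SGD(self.student_prompts.parameters(), lr=lr)

    @torch.no_grad()
    def pseudo_label(self, weak_batch, score_thresh=0.5):
        # Teacher predicts on weakly augmented images; keep confident boxes.
        preds = self.detector(weak_batch, prompts=self.teacher_prompts)
        labeled = []
        for p in preds:  # each p: {"boxes", "scores", "labels"} tensors (assumed)
            keep = p["scores"] > score_thresh
            labeled.append({k: v[keep] for k, v in p.items()})
        return labeled

    def adapt_step(self, weak_batch, strong_batch):
        targets = self.pseudo_label(weak_batch)
        # Student is supervised on strongly augmented views with teacher
        # pseudo-labels; the loss-returning call is an assumed interface.
        loss = self.detector(strong_batch, prompts=self.student_prompts,
                             targets=targets)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        ema_update(self.teacher_prompts, self.student_prompts)
        return targets  # predictions for the current test batch
```

Restricting gradients to the prompt tokens keeps adaptation parameter-efficient, while the EMA teacher supplies comparatively stable pseudo-labels across the test stream.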
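The Instance Dynamic Memory can likewise be sketched as a bounded buffer of confident instance crops. The capacity, thresholds, eviction policy, and paste logic below are assumptions made for illustration; only the high-level roles of Memory Enhancement (supplementing sparse predictions) and Memory Hallucination (synthesizing supervision for images with no usable pseudo-labels) follow the description above.

```python
# Illustrative sketch of an Instance Dynamic Memory (IDM); the storage and
# paste policies are assumptions, not the paper's exact design.
from collections import deque
import torch


class InstanceDynamicMemory:
    def __init__(self, capacity=100, score_thresh=0.8):
        self.buffer = deque(maxlen=capacity)  # (crop, label) pairs, FIFO eviction
        self.score_thresh = score_thresh

    def update(self, image, pred):
        # Store crops of confident detections from the current test sample.
        # image: CHW tensor; pred: {"boxes", "scores", "labels"} (assumed layout).
        for box, score, label in zip(pred["boxes"], pred["scores"], pred["labels"]):
            if score >= self.score_thresh:
                x1, y1, x2, y2 = box.int().tolist()
                if x2 > x1 and y2 > y1:
                    self.buffer.append((image[:, y1:y2, x1:x2].clone(), int(label)))

    def enhance(self, pred, min_boxes=3, k=2):
        # Memory Enhancement (illustrative): if the current prediction yields
        # too few confident boxes, supplement it with k stored instances.
        if len(pred["boxes"]) >= min_boxes or not self.buffer:
            return []
        idx = torch.randperm(len(self.buffer))[:k].tolist()
        return [self.buffer[i] for i in idx]

    def hallucinate(self, image, pred, k=2):
        # Memory Hallucination (illustrative): when an image has no usable
        # pseudo-labels, paste stored instances onto it so it still provides
        # supervision; returns the composited image and synthetic boxes.
        if len(pred["boxes"]) > 0 or not self.buffer:
            return image, []
        image = image.clone()  # avoid mutating the caller's tensor
        boxes = []
        for i in torch.randperm(len(self.buffer))[:k].tolist():
            crop, label = self.buffer[i]
            _, h, w = crop.shape
            _, H, W = image.shape
            if h >= H or w >= W:
                continue
            y = torch.randint(0, H - h, (1,)).item()
            x = torch.randint(0, W - w, (1,)).item()
            image[:, y:y + h, x:x + w] = crop
            boxes.append((x, y, x + w, y + h, label))
        return image, boxes
```

In this reading, the memory is refreshed from every test batch's confident detections, so later batches can borrow supervision from earlier ones even when their own predictions are empty or sparse.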