Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day, all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm depends heavily on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) for bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder that extracts domain-specific representations rich in modality and identity attributes from the raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. The transformed representations are then injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. Extensive experimental results demonstrate the superiority of our method, which achieves State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9\% and 60.5\% mAP using only 1.54M and 7.05M trainable parameters, respectively. The code is available at https://github.com/TingfengXian/DRI.
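The injection mechanism described above can be illustrated with a minimal NumPy sketch. All names, dimensions, and the specific gating form below are illustrative assumptions, not the paper's actual architecture: frozen random linear maps stand in for the VFM's transformer layers, a single linear map plays the Offset Encoder, and a sigmoid gate conditioned on each intermediate feature plays the Modulator.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy feature dimension (real VFMs use e.g. 768)

# Frozen VFM blocks: fixed random linear maps standing in for pre-trained layers.
frozen_blocks = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

# Learnable Offset Encoder (hypothetical single linear layer): extracts a
# domain-specific representation directly from the raw input.
W_offset = rng.standard_normal((D, D)) / np.sqrt(D)

# Per-layer Modulator weights (hypothetical): transform the offset using
# contextual information from each layer's intermediate feature.
W_mod = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3)]

def forward(x):
    offset = np.tanh(x @ W_offset)                # domain representation from raw input
    h = x
    for blk, Wm in zip(frozen_blocks, W_mod):
        h = np.maximum(h @ blk, 0.0)              # frozen layer; weights never updated
        gate = 1.0 / (1.0 + np.exp(-(h @ Wm)))    # context-dependent modulation
        h = h + gate * offset                     # additive injection reshapes features
    return h

x = rng.standard_normal((2, D))                   # batch of 2 toy inputs
print(forward(x).shape)                           # → (2, 16)
```

Only `W_offset` and `W_mod` would be trained in such a scheme; the gradient never touches `frozen_blocks`, which mirrors how DRI adapts the feature distribution without modifying the VFM's pre-trained weights.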