Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families on SpaceNet9 and two additional cross-modal benchmarks in a zero-shot setting, with no fine-tuning or domain adaptation on satellite or SAR data, under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer: matchers with explicit cross-modal training do not uniformly outperform those without it. XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error, $3.0$ px, on the labeled SpaceNet9 training scenes, but RoMa does so without any cross-modal training, and MatchAnything-ELoFTR ($3.4$ px), trained on synthetic cross-modal pairs, follows closely. This suggests, as a working hypothesis, that foundation-model features (DINOv2) may contribute modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to $33\times$ for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep. Affine geometry alone reduces mean error from $12.34$ to $9.74$ px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
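To make the "tie-point-grounded" mean error concrete, the following is a minimal sketch of how such a metric could be computed once a matcher's correspondences have been reduced to an affine transform. It rests on assumptions not stated in the abstract: that the error is the mean Euclidean distance in pixels, that SAR points are mapped into the optical frame, and that the transform is stored as a 2x3 matrix. The function names `apply_affine` and `mean_tie_point_error` are hypothetical and not taken from the paper.

```python
import math


def apply_affine(A, pt):
    """Map a point (x, y) through a 2x3 affine matrix [[a, b, tx], [c, d, ty]]."""
    x, y = pt
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])


def mean_tie_point_error(A, tie_points):
    """Mean Euclidean error in pixels over labeled tie points.

    tie_points is a list of ((x_sar, y_sar), (x_opt, y_opt)) pairs.
    Assumption: SAR coordinates are warped into the optical frame by A.
    """
    if not tie_points:
        raise ValueError("at least one tie point is required")
    errors = []
    for src, dst in tie_points:
        px, py = apply_affine(A, src)
        # Distance between the warped SAR point and its optical counterpart.
        errors.append(math.hypot(px - dst[0], py - dst[1]))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical estimate: identity plus a small residual shift.
    A_est = [[1.0, 0.0, 3.0],
             [0.0, 1.0, 4.0]]
    ties = [((0.0, 0.0), (0.0, 0.0)), ((10.0, 5.0), (10.0, 5.0))]
    print(mean_tie_point_error(A_est, ties))
```

Under these assumptions, comparing geometry models (for example affine versus a more flexible model) amounts to estimating a different transform from the same matches and re-evaluating it against the same held-out tie points.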
