Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to significant domain shifts in design and context. Class representations need to be adapted to reflect an object concept more accurately under these shifts. In the absence of training data from target geographies, we hypothesize that geography-specific descriptive knowledge of object categories can be leveraged to enhance robustness. To this end, we explore the feasibility of probing a large language model for geography-specific object knowledge, and we investigate integrating this knowledge into zero-shot and learnable soft prompting with the CLIP vision-language model. In particular, we propose a geography knowledge regularization method to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set of geographies. When generalizing from a model trained only on data from Europe, our gains on DollarStreet are as large as +2.8 on countries from Africa and +4.6 on the hardest classes. We further show competitive performance against few-shot target training, and provide insights into how descriptive knowledge captures geographical differences.
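As a rough illustration of the two ideas named above, the sketch below shows how descriptor-based class embeddings could drive zero-shot prediction, and how a regularizer could keep learned soft-prompt embeddings close to knowledge-derived ones. It is a toy under stated assumptions, not the method of the paper: the random arrays stand in for CLIP text and image encoder outputs, the function names are hypothetical, and the cosine-distance form of the regularizer is assumed because the abstract does not specify the loss.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def class_embeddings_from_descriptors(descriptor_embs):
    """Average per-class descriptor embeddings into one unit vector per class.

    descriptor_embs: array of shape (num_classes, num_descriptors, dim),
    standing in for text embeddings of geography-specific descriptions.
    """
    return l2_normalize(l2_normalize(descriptor_embs).mean(axis=1))


def zero_shot_predict(image_embs, class_embs):
    """Assign each image to the class with the highest cosine similarity."""
    sims = l2_normalize(image_embs) @ l2_normalize(class_embs).T
    return sims.argmax(axis=1)


def knowledge_regularizer(soft_prompt_embs, knowledge_embs):
    """Mean cosine distance between learned and knowledge-based class embeddings.

    Assumed form: the abstract only says soft prompts are regularized
    with geography knowledge, not how.
    """
    cos = np.sum(
        l2_normalize(soft_prompt_embs) * l2_normalize(knowledge_embs), axis=1
    )
    return float(np.mean(1.0 - cos))


# Toy data: 3 classes, 4 descriptors per class, 8-dimensional embeddings.
rng = np.random.default_rng(0)
descriptor_embs = rng.normal(size=(3, 4, 8))
class_embs = class_embeddings_from_descriptors(descriptor_embs)

# Images that point in exactly the class directions (scale is irrelevant).
image_embs = 2.0 * class_embs
preds = zero_shot_predict(image_embs, class_embs)

# The penalty vanishes when soft prompts match the knowledge embeddings.
reg_same = knowledge_regularizer(class_embs, class_embs)
reg_opposite = knowledge_regularizer(class_embs, -class_embs)
```

In a training loop, a term like `knowledge_regularizer` would be added to the classification loss with a weighting coefficient, so that prompts fit to source geographies do not drift far from the descriptive knowledge available for unseen ones.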