Learning domain-invariant representations is important for training a model that generalizes well to unseen target task domains. Text descriptions inherently contain the semantic structure of concepts, and such auxiliary semantic cues can serve as effective pivot embeddings for domain generalization problems. Here, we use multimodal graph representations that fuse images and text to obtain domain-invariant pivot embeddings by considering the inherent semantic structure between local image and text descriptors. Specifically, we aim to learn domain-invariant features by (i) representing the image and text descriptions as graphs, and (ii) simultaneously clustering the graph-based image node features and matching them to the textual graphs. We experiment with large-scale public datasets, such as CUB-DG and DomainBed, and our model matches or exceeds state-of-the-art performance on these datasets. Our code will be made publicly available upon publication.
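Steps (i) and (ii) can be illustrated with a toy sketch. This is not the authors' implementation, which has not been released at the time of writing. It assumes that image nodes are local feature vectors linked by cosine-similarity k-nearest neighbours, that clustering is a temperature-scaled softmax assignment of image nodes to text nodes, and that matching is a mean cosine distance between the pooled clusters and the text nodes. All function names, the temperature value, and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def knn_graph(nodes, k):
    """Step (i), assumed form: link each node to its k most similar nodes."""
    edges = {}
    for i, u in enumerate(nodes):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(nodes) if j != i),
            reverse=True,
        )
        edges[i] = [j for _, j in sims[:k]]
    return edges


def soft_assign(image_nodes, text_nodes, temperature=0.1):
    """Step (ii), clustering: softmax over text nodes for each image node."""
    rows = []
    for u in image_nodes:
        logits = [cosine(u, t) / temperature for t in text_nodes]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def pool_clusters(image_nodes, assign):
    """Assignment-weighted mean of the image nodes, one vector per text node."""
    dim = len(image_nodes[0])
    pooled = []
    for c in range(len(assign[0])):
        weight = sum(row[c] for row in assign) + 1e-12
        pooled.append([
            sum(row[c] * u[d] for row, u in zip(assign, image_nodes)) / weight
            for d in range(dim)
        ])
    return pooled


def matching_loss(pooled, text_nodes):
    """Step (ii), matching: mean cosine distance to the paired text nodes."""
    dists = [1.0 - cosine(p, t) for p, t in zip(pooled, text_nodes)]
    return sum(dists) / len(dists)


# Hypothetical toy data: four local image descriptors, two text descriptors.
image_nodes = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
text_nodes = [[1.0, 0.0], [0.0, 1.0]]

image_edges = knn_graph(image_nodes, k=1)
assign = soft_assign(image_nodes, text_nodes)
loss = matching_loss(pool_clusters(image_nodes, assign), text_nodes)
print(image_edges, loss)
```

In a trained model the node features would come from learned image and text encoders, and a loss of this kind would be minimised jointly with the classification objective. Here the vectors are fixed, so the sketch only shows how clustering and matching can share a single assignment matrix.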