The objective of Domain Generalization (DG) is to devise algorithms and models that achieve high performance on previously unseen test distributions. In pursuit of this objective, existing DG studies have relied on the average measure as the prevalent measure for evaluating models and comparing algorithms. Despite its importance, the average measure has not been explored comprehensively, and its suitability for approximating the true domain generalization performance remains questionable. In this study, we carefully investigate the limitations inherent in the average measure and propose the worst+gap measure as a robust alternative. We establish the theoretical grounds of the proposed measure by deriving two theorems, each starting from a different assumption. We conduct extensive experiments to compare the proposed worst+gap measure with the conventional average measure. Because studying measures requires access to the true DG performance, we modify five existing datasets to create the SR-CMNIST, C-Cats&Dogs, L-CIFAR10, PACS-corrupted, and VLCS-corrupted datasets. The experimental results show that the average measure approximates the true DG performance poorly and confirm the robustness of the theoretically supported worst+gap measure.