"Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameter and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on the OOD error of a linear classification model, characterizing the conditions on the noise and nuisance features under which the OOD error is large. Finally, we demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.