In this paper we demonstrate that, in binary classification tasks, a trend reversal can occur between the dataset and the classification score obtained from a trained model. This trend reversal arises for certain choices of the regularization parameter used in model training, namely when the parameter lies in what we call the pathological regularization regime. For ridge regression, we give necessary and sufficient algebraic conditions on the dataset for the existence of a pathological regularization regime. Moreover, our results provide data science practitioners with a hands-on tool for avoiding hyperparameter choices that suffer from trend reversal. We furthermore present numerical results on pathological regularization regimes for logistic regression. Finally, we draw connections to datasets exhibiting Simpson's paradox, which provide a natural source of pathological datasets.
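The phenomenon described above can be illustrated with a minimal numerical sketch. The toy data below is our own construction, not taken from the paper: a two-feature dataset on which the closed-form ridge solution assigns the first feature a negative coefficient for small regularization but a positive one for larger regularization, i.e. the model's apparent trend in that feature reverses depending on where the hyperparameter falls.

```python
import numpy as np

# Synthetic two-feature dataset (illustrative only; not from the paper).
X = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y = np.array([1.0, -0.2, 1.0])

def ridge_coefficients(X, y, lmbda):
    """Closed-form ridge solution (X^T X + lmbda * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)

# Weak regularization: feature 1 gets a negative weight.
beta_small = ridge_coefficients(X, y, 0.1)

# Stronger regularization: the same feature's weight turns positive.
# (For this dataset the sign flip happens at lmbda = 0.5.)
beta_large = ridge_coefficients(X, y, 1.0)

print(beta_small[0], beta_large[0])  # opposite signs
```

The flip occurs because as `lmbda` grows, the ridge solution is pulled toward the direction of the raw correlations `X.T @ y`, which here disagree in sign with the unregularized least-squares coefficient for the first feature. Whether a reader interprets the trained model as saying feature 1 helps or hurts the score thus depends entirely on the chosen regularization strength.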