In this paper we discuss how to evaluate the differences between fitted logistic regression models across sub-populations. Our motivating example is the computerized diagnosis of learning disabilities, where sub-populations defined by gender may or may not require separate models. In this context, significance tests of the hypothesis of no difference between populations may create perverse incentives, because larger variances and smaller samples increase the probability of not rejecting the null. We argue that equivalence testing against a prespecified tolerance level on population differences instead incentivizes accuracy in the inference. We develop a cascading set of equivalence tests, each addressing a different aspect of the model: the way the phenomenon is encoded in the regression coefficients, the individual predictions as per-example log odds ratios, and the overall accuracy as measured by the mean squared prediction error. For each equivalence test, we propose a strategy for setting the equivalence thresholds. The large-sample approximations are validated using simulations. For the diagnosis data, we show examples of equivalent and non-equivalent models.
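To make the contrast between the two testing paradigms concrete, the following is a minimal sketch and not the paper's procedure. It assumes a single coefficient estimated independently in two sub-populations, a large-sample normal approximation, and a generic two one-sided tests (TOST) construction. The function names, the numeric estimates, and the tolerance `delta` are all hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the large-sample approximation


def difference_test_p(b1, se1, b2, se2):
    """Two-sided Wald p-value for H0: b1 == b2 (the usual significance test)."""
    se = (se1 ** 2 + se2 ** 2) ** 0.5  # assumes independent sub-population samples
    z = (b1 - b2) / se
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


def tost_p(b1, se1, b2, se2, delta):
    """TOST p-value for H0: |b1 - b2| >= delta (non-equivalence).

    A small p-value supports equivalence within the tolerance delta.
    """
    se = (se1 ** 2 + se2 ** 2) ** 0.5
    d = b1 - b2
    p_lower = 1 - _STD_NORMAL.cdf((d + delta) / se)  # one-sided test of H0: d <= -delta
    p_upper = _STD_NORMAL.cdf((d - delta) / se)      # one-sided test of H0: d >= +delta
    return max(p_lower, p_upper)


# Hypothetical estimates: the same coefficient fitted in two sub-populations.
b1, b2, delta = 1.0, 0.9, 0.5

for label, se in [("precise", 0.1), ("noisy", 0.5)]:
    p_diff = difference_test_p(b1, se, b2, se)
    p_equiv = tost_p(b1, se, b2, se, delta)
    print(f"{label}: difference-test p = {p_diff:.3f}, TOST p = {p_equiv:.3f}")
```

With the noisy estimates, the difference test is even further from rejecting the null, which a reader might mistake for evidence of "no difference", while the TOST p-value is large and equivalence is not established. With the precise estimates, the TOST p-value falls below conventional levels. Under these assumptions, only the equivalence formulation rewards smaller standard errors.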