In statistics and machine learning, when we fit a model to available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model -- that is, we would like an upper bound on the model class risk (the lowest risk attainable by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or so that we can decide whether the model class is unsuitable for the task at hand. Particularly in the setting of interpolation learning, where machine learning models are trained to reach zero error on the training data, we might ask whether a positive lower bound on the model class risk is possible at all -- or are we unable to detect that "all models are wrong"? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and we examine its implications for specific model classes such as tree-based methods and linear regression.