Machine learning research, like most of statistics, relies heavily on the concept of a data-generating probability distribution. Because data points are assumed to be sampled from such a distribution, one can learn about it from observed data and thus predict future data points drawn from it (with some probability of success). Drawing on scholarship across disciplines, we argue here that this framework is not always a good model. Not only do such true probability distributions not exist; the framework can also be misleading, obscuring both the choices made and the goals pursued in machine learning practice. We suggest an alternative framework that focuses on finite populations rather than abstract distributions; while classical learning theory can be left almost unchanged, the shift opens new opportunities, especially for modelling sampling. We compile these considerations into five reasons for modelling machine learning -- in some settings -- with finite populations rather than generative distributions, both to be more faithful to practice and to provide novel theoretical insights.