Experimental studies are a cornerstone of machine learning (ML) research. A common, but often implicit, assumption is that the results of a study will generalize beyond the study itself, e.g., to new data. That is, there is a high probability that repeating the study under different conditions will yield similar results. Despite the importance of this concept, the problem of measuring generalizability remains open. This is likely due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization and develop a quantifiable notion of generalizability. This notion allows us to assess the generalizability of existing studies and to estimate the number of experiments needed to achieve generalizability in new studies. To demonstrate its usefulness, we apply it to two recently published benchmarks to distinguish generalizable from non-generalizable results. We also publish a Python module that allows our analysis to be repeated for other experimental studies.