Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
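To make the first, sequential design concrete, the following is a minimal sketch of one way to simulate alternating model fitting and data collection. It is an illustration under stated assumptions, not the formal model analyzed in this work: the one-dimensional population with interval labels, the decision-stump model class, and all function names (`make_population`, `fit_stump`, `run_sequential_benchmark`) are hypothetical choices made for the example.

```python
import random


def make_population(n=100, lo=30, hi=70):
    """Toy population (assumption): label +1 inside [lo, hi), -1 outside."""
    return [(x, 1 if lo <= x < hi else -1) for x in range(n)]


def predict(model, x):
    """Decision stump: predict `sign` to the right of the threshold, `-sign` otherwise."""
    threshold, sign = model
    return sign if x > threshold else -sign


def error_rate(model, points):
    """Fraction of points the model labels incorrectly."""
    return sum(predict(model, x) != y for x, y in points) / len(points)


def fit_stump(points):
    """Empirical risk minimization over all thresholds and both orientations."""
    xs = sorted({x for x, _ in points})
    candidates = [xs[0] - 1.0] + xs
    best = None
    for threshold in candidates:
        for sign in (1, -1):
            errors = sum(predict((threshold, sign), x) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    return best[1], best[2]


def run_sequential_benchmark(population, rounds, seed=0):
    """Alternate model fitting and collection of examples the current model gets wrong.

    Returns the population error of the model fitted in each round.
    """
    rng = random.Random(seed)
    train = rng.sample(population, k=max(1, len(population) // 4))
    history = []
    for _ in range(rounds):
        model = fit_stump(train)                      # model fitting
        history.append(error_rate(model, population))
        failures = [(x, y) for x, y in population if predict(model, x) != y]
        if not failures:
            break
        train = train + failures                      # data collection
    return history


if __name__ == "__main__":
    for round_index, err in enumerate(run_sequential_benchmark(make_population(), rounds=5)):
        print(f"round {round_index}: population error = {err:.2f}")
```

Because a single stump cannot represent an interval, the toy setup gives the loop something to get wrong in every round, which makes it a convenient place to inspect how the per-round error evolves as misclassified examples accumulate in the training set.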