Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across executions. Learning-based simulators are widely used to compare algorithms, design choices, and interventions under such dynamics, yet evaluation outcomes often exhibit high variance due to random initialisation and learning stochasticity. We analyse the statistical structure of comparative evaluation in these settings and show that standard independent evaluation designs fail to exploit shared sources of randomness across alternatives. We formalise a paired seed evaluation design in which competing systems are evaluated under identical random seeds, inducing matched realisations of stochastic components and strict variance reduction whenever outcomes are positively correlated at the seed level. This yields tighter confidence intervals, higher statistical power, and effective sample size gains at fixed computational budgets. Empirically, seed-level correlations are typically large and positive, producing order-of-magnitude efficiency gains. Paired seed evaluation is weakly dominant in practice, improving statistical reliability when correlation is present and reducing to independent evaluation without loss of validity when it is not.
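The variance-reduction claim can be illustrated with a minimal simulation sketch (hypothetical numbers, not from the paper): two systems share a seed-level noise component, so evaluating both on the same seeds lets that component cancel in the paired difference, while an independent design pays for it twice.

```python
import numpy as np

# Minimal sketch of paired-seed vs independent evaluation.
# All scores and noise scales below are illustrative assumptions.

rng = np.random.default_rng(0)
n_seeds = 200

# Shared seed-level noise (e.g. initialisation / learning stochasticity).
shared = rng.normal(0.0, 1.0, size=n_seeds)

# Seed-level scores for systems A and B: a true gap of 0.1 for B,
# plus the shared seed noise plus small independent noise.
score_A = 0.50 + shared + rng.normal(0.0, 0.3, size=n_seeds)
score_B = 0.60 + shared + rng.normal(0.0, 0.3, size=n_seeds)

# Paired design: difference per shared seed, then average.
# The shared component cancels, shrinking the standard error.
paired_diff = score_B - score_A
paired_se = paired_diff.std(ddof=1) / np.sqrt(n_seeds)

# Independent design: B is evaluated on fresh seeds,
# so the shared component does not cancel.
indep_B = 0.60 + rng.normal(0.0, 1.0, size=n_seeds) + rng.normal(0.0, 0.3, size=n_seeds)
indep_se = np.sqrt(score_A.var(ddof=1) / n_seeds + indep_B.var(ddof=1) / n_seeds)

print(f"paired SE of mean difference:      {paired_se:.4f}")
print(f"independent SE of mean difference: {indep_se:.4f}")
print(f"effective-sample-size factor:      {(indep_se / paired_se) ** 2:.1f}x")
```

With strong positive seed-level correlation, as assumed here, the ratio of squared standard errors corresponds to the effective sample size gain at a fixed number of evaluation runs; when the correlation is zero, the two designs coincide.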