The accuracy of machine learning interatomic potentials suffers from numerical noise in the reference data. This noise often originates from unconverged or inconsistent electronic-structure calculations and is challenging to identify. Existing mitigation strategies, such as manual filtering or iterative refinement of outliers, require either substantial expert effort or multiple expensive retraining cycles, making them difficult to scale to large datasets. Here, we introduce an on-the-fly outlier detection scheme that automatically down-weights noisy samples without requiring additional reference calculations. By tracking the loss distribution via an exponential moving average, this unsupervised method identifies outliers throughout a single training run. We show that the approach prevents overfitting and matches the performance of iterative refinement baselines at significantly reduced overhead. Its effectiveness is demonstrated by recovering accurate physical observables, including diffusion coefficients, for liquid water from unconverged reference data. Furthermore, we validate its scalability by training a foundation model for organic chemistry on the SPICE dataset, where it reduces energy errors by a factor of three. This framework provides a simple, automated solution for training robust models on imperfect datasets across dataset sizes.
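To make the idea concrete, the following is a minimal sketch of how EMA-based on-the-fly outlier down-weighting could look inside a standard PyTorch training loop. The class name, the hyperparameters (`ema_decay`, `threshold_sigmas`), and the sigmoid weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: track running loss statistics with an exponential
# moving average (EMA) and down-weight anomalously high per-sample losses.
import torch


class EMAOutlierWeighter:
    """Maintain an EMA of the per-sample loss mean and variance; samples
    whose loss exceeds mean + k*std are smoothly down-weighted."""

    def __init__(self, ema_decay: float = 0.99, threshold_sigmas: float = 3.0):
        self.decay = ema_decay       # assumed EMA decay rate
        self.k = threshold_sigmas    # assumed outlier threshold in std units
        self.mean = None             # EMA of the loss
        self.var = None              # EMA of the squared deviation

    @torch.no_grad()
    def __call__(self, per_sample_loss: torch.Tensor) -> torch.Tensor:
        batch_mean = per_sample_loss.mean()
        if self.mean is None:
            # Initialize statistics from the first batch.
            self.mean = batch_mean
            self.var = per_sample_loss.var()
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
            self.var = (self.decay * self.var
                        + (1 - self.decay)
                        * (per_sample_loss - self.mean).pow(2).mean())
        std = self.var.clamp_min(1e-12).sqrt()
        # z-score relative to the running loss distribution; weights stay
        # near 1 for inliers and decay toward 0 far beyond the threshold.
        z = (per_sample_loss - self.mean) / std
        return torch.sigmoid(self.k - z)


# Assumed usage inside a training step (model, batch, criterion are placeholders):
# per_sample_loss = criterion(model(batch.x), batch.y)  # shape: (batch_size,)
# weights = weighter(per_sample_loss)                   # no grad through weights
# loss = (weights * per_sample_loss).mean()
# loss.backward()
```

Computing the weights under `torch.no_grad()` keeps the weighting itself out of the computational graph, so gradients flow only through the (down-weighted) losses, and no extra reference calculations or retraining cycles are needed.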