Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications

Run-to-run variability in parallel programs caused by floating-point non-associativity is known to significantly affect reproducibility in iterative algorithms, because rounding errors accumulate from one iteration to the next. Non-reproducibility can critically reduce the efficiency and effectiveness of correctness testing for stochastic programs. Recently, deep learning training and inference pipelines have been found to be sensitive to floating-point non-associativity, sometimes to an extreme degree. This sensitivity can prevent certification for commercial applications, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing couple deep learning models with high-performance computing, which aggravates debugging and testing challenges. Here we investigate the statistical properties of floating-point non-associativity within modern parallel programming models, and analyze the performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs. We examine the recently added deterministic options in PyTorch in the context of GPU deployment for deep learning, uncovering and quantifying the impacts of input parameters that trigger run-to-run variability, and reporting on the reliability and completeness of the documentation. Finally, we evaluate the strategy of exploiting the automatic determinism that deterministic hardware could provide, using the Groq accelerator for the inference portions of the deep learning pipeline. We demonstrate the benefits that a hardware-based strategy can provide for reproducibility and correctness efforts.
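As a rough illustration of the underlying effect (not the paper's methodology), the sketch below shows how the result of a floating-point sum depends on the order of accumulation. The random shuffling is only an assumed stand-in for the unordered arrival of atomic additions on a GPU, and the helper names `naive_sum` and `distinct_sums` are hypothetical. Python's `math.fsum`, which returns a correctly rounded sum, serves here as one example of an order-independent alternative.

```python
import math
import random


def naive_sum(values):
    """Left-to-right accumulation; the result depends on the order of `values`."""
    total = 0.0
    for v in values:
        total += v
    return total


def distinct_sums(values, reducer, trials=200, seed=0):
    """Apply `reducer` to `trials` random orderings of `values` and
    return the set of distinct results that were observed."""
    rng = random.Random(seed)
    shuffled = list(values)
    results = set()
    for _ in range(trials):
        rng.shuffle(shuffled)  # stand-in for a nondeterministic reduction order
        results.add(reducer(shuffled))
    return results


if __name__ == "__main__":
    # Associativity fails even for three operands.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    # Values of mixed magnitude make the order dependence more visible.
    rng = random.Random(42)
    data = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]
    print("distinct naive sums:", len(distinct_sums(data, naive_sum)))
    print("distinct fsum results:", len(distinct_sums(data, math.fsum)))
```

In this sketch the order-independent reducer trades speed for reproducibility, which loosely mirrors the performance question the abstract raises about replacing atomic operations with deterministic alternatives.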
