With the end of Moore's law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can, in many cases, improve model accuracy and substantially lower energy consumption, with observed reductions of up to 38x.
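To make the MaxEnt sampling idea concrete, below is a minimal illustrative sketch, not SICKLE's actual implementation: one simple way to approximate maximum-entropy subsampling of a scalar field (e.g., vorticity magnitude from a DNS snapshot) is to draw points with probability inversely proportional to the occupancy of their histogram bin, which flattens the sampled distribution and thereby maximizes its entropy over the bins. The function name, binning scheme, and parameters here are assumptions for illustration.

```python
import numpy as np

def maxent_subsample(values, n_keep, n_bins=64, rng=None):
    """Hypothetical maximum-entropy subsampling sketch: sample points
    with probability inversely proportional to their bin occupancy,
    so the kept subset has a near-uniform (max-entropy) histogram."""
    rng = np.random.default_rng() if rng is None else rng
    # Bin the scalar field; digitize against interior edges gives ids 0..n_bins-1.
    edges = np.histogram_bin_edges(values, bins=n_bins)
    bin_ids = np.digitize(values, edges[1:-1])
    counts = np.bincount(bin_ids, minlength=n_bins)
    # Inverse-frequency weights: rare bins (distribution tails) are favored.
    weights = 1.0 / counts[bin_ids]
    weights /= weights.sum()
    return rng.choice(values.size, size=n_keep, replace=False, p=weights)

# Example: keep 1% of 10^6 synthetic samples standing in for DNS data.
samples = np.random.standard_normal(1_000_000)
kept_idx = maxent_subsample(samples, n_keep=10_000)
```

Under this scheme, random subsampling would reproduce the original (peaked) distribution, whereas the inverse-frequency draw oversamples rare, high-information regions; phase-space sampling sits in between by stratifying over chosen physical variables.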