The world has recently witnessed an unprecedented acceleration in demand for Machine Learning and Artificial Intelligence applications. This surge has placed tremendous strain on the underlying technology stack: supply chains, GPU-accelerated hardware, software, datacenter power density, and energy consumption. If the current technological trajectory continues, future demand implies unsustainable spending trends that will further limit the number of market players, stifle innovation, and widen the technology gap. To address these challenges, we propose a fundamental change to AI training infrastructure across the technology ecosystem. This change requires advances in supercomputing and novel AI training approaches, spanning high-level software down to low-level hardware, microprocessor, and chip design, while delivering the energy efficiency that a sustainable infrastructure demands. This paper presents an analytical framework that quantifies these challenges and points to opportunities for lowering the barriers to entry for training large language models.