The rising computational demands of modern natural language processing (NLP) systems have raised the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is difficult to control for because access to it varies widely across institutions. Moreover, improvements in metrics such as FLOPs often fail to translate into progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly controlled hardware platform and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enables evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload required to make fair and reproducible efficiency comparisons. While initially focused on NLP models, Pentathlon is designed to allow flexible extension to other fields. We envision that Pentathlon will stimulate algorithmic innovations in building efficient models and foster an increased awareness of the social and environmental implications of developing future-generation NLP models.