Large Language Model (LLM) systems have become the frontier of AI in many application domains, creating new challenges and opportunities in hyperparameter optimization (HPO) for the AutoML community. However, such systems exhibit an unprecedented compound space of hyperparameter configurations spanning both AI and non-AI components; rich and nonlinear implications of the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real-world LLM systems, dubbed LLMSYS-HPOBench, covering data on the inference objective values of hyperparameter configurations, profiled by running the LLM systems. Currently, LLMSYS-HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12-23, 3-5 dimensions of fidelity factors leading to 932 settings, 3-9 inference objective metrics, and 2-10 cost metrics, together with the logs generated from measuring the LLM systems. We seek not only to advocate a revalidation of existing HPO algorithms on frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite is available at: https://github.com/ideas-labo/llmsys-hpobench