This study introduces SLLMBO, a framework that leverages Large Language Models (LLMs) for hyperparameter optimization (HPO), incorporating dynamic search-space adaptation, enhanced exploitation of the parameter landscape, and a novel hybrid LLM-Tree-structured Parzen Estimator (LLM-TPE) sampler. By addressing limitations of recent fully LLM-based methods and of traditional Bayesian Optimization (BO), SLLMBO achieves more robust optimization. A comprehensive benchmark evaluates multiple LLMs, including GPT-3.5-turbo, GPT-4o, Claude-Sonnet-3.5, and Gemini-1.5-flash, extending prior work beyond GPT-3.5 and GPT-4 and establishing SLLMBO as the first framework to benchmark a diverse set of LLMs for HPO. By combining LLMs' established strengths in parameter initialization and the exploitation abilities demonstrated in this study with TPE's exploration capabilities, the LLM-TPE sampler achieves a balanced exploration-exploitation trade-off, reduces API costs, and mitigates premature early stopping, enabling more effective parameter searches. Across 14 tabular classification and regression tasks, the LLM-TPE sampler outperformed fully LLM-based methods and surpassed BO methods on 9 tasks. Tests of early stopping in budget-constrained scenarios further demonstrated competitive performance, indicating that LLM-based methods generally benefit from extended iterations to reach optimal results. This work lays the foundation for future research on open-source LLMs, the reproducibility of LLM results in HPO, and benchmarking SLLMBO on more complex tasks such as image classification, segmentation, and machine translation.
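The hybrid sampling strategy described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm_propose` is a hypothetical stub standing in for an LLM API call that exploits the trial history, and uniform random sampling stands in for the TPE component; the real system would query an actual LLM and use a proper TPE sampler.

```python
import random

def llm_propose(history):
    # Hypothetical stand-in for an LLM call: exploit the search history
    # by perturbing the best configuration seen so far.
    best = min(history, key=lambda t: t[1])[0]
    return {"lr": best["lr"] * random.uniform(0.5, 2.0),
            "depth": max(1, best["depth"] + random.choice([-1, 0, 1]))}

def tpe_propose(space):
    # Stand-in for TPE's exploratory sampling (here: uniform random).
    return {"lr": random.uniform(*space["lr"]),
            "depth": random.randint(*space["depth"])}

def hybrid_search(objective, space, n_trials=30, p_llm=0.5, seed=0):
    """Alternate between LLM-style exploitation and TPE-style exploration."""
    random.seed(seed)
    history = []
    for _ in range(n_trials):
        # With probability p_llm (and once history exists), exploit via the
        # LLM stub; otherwise explore via the TPE stand-in.
        if history and random.random() < p_llm:
            params = llm_propose(history)
        else:
            params = tpe_propose(space)
        history.append((params, objective(params)))
    return min(history, key=lambda t: t[1])

# Toy objective with its minimum near lr=0.05, depth=4.
space = {"lr": (1e-4, 1.0), "depth": (1, 12)}
best_params, best_score = hybrid_search(
    lambda p: (p["lr"] - 0.05) ** 2 + (p["depth"] - 4) ** 2, space)
print(best_params, best_score)
```

The interleaving ratio `p_llm` is the knob that trades API cost against exploitation: lowering it reduces LLM calls, which is one way the abstract's cost reduction can arise.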