Large language models (LLMs) currently dominate the field of natural language processing (NLP), representing the state of the art across a diverse array of tasks. Developing a model of this nature, from training to inference, requires making numerous decisions, which together define a combinatorial search problem. For example, selecting the optimal pre-trained LLM, prompt, or hyperparameters for a task often requires evaluating multiple candidates on an entire test set. Such exhaustive evaluation can be time-consuming and costly, as both inference and metric computation with LLMs are resource-intensive. In this paper, we address the problem of identifying the best method under a limited budget for evaluating methods on test examples. Leveraging the well-studied multi-armed bandit framework, which sequentially selects the next method-example pair to evaluate, our approach combines multi-armed bandit algorithms with low-rank factorization to significantly reduce the required resources. Experiments show that our algorithms can identify the top-performing method using only 5-15\% of the typically required resources, i.e., an 85-95\% reduction in cost.
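The bandit-based selection idea can be illustrated with a minimal sketch. This is not the paper's algorithm (which additionally exploits low-rank structure across method-example pairs); it uses the classic UCB1 strategy as a stand-in, and the method accuracies, budget, and function name below are hypothetical values chosen for illustration.

```python
import math
import random

def ucb_best_method(accuracies, budget, seed=0):
    """Pick the best of k methods under a fixed evaluation budget via UCB1.

    `accuracies` are hidden per-method success probabilities; each "pull"
    simulates evaluating one method on one randomly drawn test example,
    instead of evaluating every method on the entire test set.
    """
    rng = random.Random(seed)
    k = len(accuracies)
    counts = [0] * k   # evaluations spent on each method
    sums = [0.0] * k   # observed successes per method

    # Initialize by evaluating each method once.
    for m in range(k):
        counts[m] = 1
        sums[m] = float(rng.random() < accuracies[m])

    for t in range(k, budget):
        # UCB1 score: empirical mean plus an exploration bonus that
        # shrinks as a method accumulates evaluations.
        scores = [
            sums[m] / counts[m] + math.sqrt(2 * math.log(t + 1) / counts[m])
            for m in range(k)
        ]
        m = max(range(k), key=scores.__getitem__)
        counts[m] += 1
        sums[m] += float(rng.random() < accuracies[m])

    # Report the method with the highest empirical mean.
    return max(range(k), key=lambda m: sums[m] / counts[m])

# Four candidate methods, 500 evaluations total -- far fewer than the
# 4 * N evaluations an exhaustive sweep over a large test set would need.
best = ucb_best_method([0.30, 0.40, 0.90, 0.35], budget=500)
```

Because the exploration bonus decays with the visit count, the budget concentrates on the strongest candidate, which is how the sequential method-example selection achieves its cost savings over exhaustive evaluation.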