Assessing the effectiveness of large language models (LLMs) presents substantial challenges. Conducting human-annotated battles in an online Chatbot Arena is a highly effective evaluation technique, but it is limited by the cost and time required for human annotation. In this paper, we introduce Arena Learning, an offline strategy that simulates these arena battles using AI-driven annotations of battle outcomes, enabling continuous improvement of the target model through both supervised fine-tuning and reinforcement learning. Arena Learning comprises two key components. First, it ensures precise evaluation and maintains consistency between offline simulations and online competitions via WizardArena, a pipeline that predicts the Elo rankings of various models on a carefully designed offline test set; our results show that WizardArena's predictions closely align with those of the online Arena. Second, it continuously improves the training data based on the battle results and the refined model. We establish a data flywheel that iteratively updates the training data by highlighting the target model's weaknesses revealed in its battle results, enabling it to learn from the strengths of multiple other models. We apply Arena Learning to train our target model, WizardLM-$\beta$, and demonstrate significant performance gains across multiple metrics. This fully automated training and evaluation pipeline sets the stage for continuous advancement of various LLMs via post-training. Notably, Arena Learning plays a pivotal role in the success of WizardLM-2, and this paper serves both as an exploration of its efficacy and as a foundational study for future work on WizardLM-2 and its derivatives.
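To make the ranking step concrete: WizardArena predicts Elo rankings from pairwise battle outcomes judged by an AI annotator. The following is a minimal sketch of standard Elo updating over judged battles; the function names, K-factor, and initial rating are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch: standard Elo updates over AI-judged pairwise battles.
# K-factor and initial rating are illustrative choices, not the paper's values.

def update_elo(r_a, r_b, score_a, k=32):
    """Update two Elo ratings given one battle outcome.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def compute_rankings(battles, initial=1000.0):
    """Aggregate ratings from a sequence of judged battles.

    battles: iterable of (model_a, model_b, score_a) triples, where
    score_a is the AI judge's verdict for model_a.
    """
    ratings = {}
    for a, b, score_a in battles:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a)
    return ratings
```

In this framing, swapping the human annotator for an AI judge changes only where `score_a` comes from; the ranking computation itself is unchanged, which is what allows the offline rankings to be compared directly against the online Arena's.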