FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom

Federated learning (FL) has emerged as a promising solution for collaborative training of large language models (LLMs). However, integrating LLMs into FL introduces new challenges, particularly in evaluating the LLMs themselves. Traditional evaluation methods that rely on labeled test sets and similarity-based metrics cover only a subset of the acceptable answers and therefore fail to accurately reflect the performance of LLMs on generative tasks. Automatic evaluation methods that leverage advanced LLMs show potential, but they carry a critical risk of data leakage, because data must be transmitted to external servers, and they perform suboptimally on downstream tasks for lack of domain knowledge. To address these issues, we propose a federated evaluation framework for large language models, named FedEval-LLM, that provides reliable performance measurements of LLMs on downstream tasks without relying on labeled test sets or external tools, thus ensuring strong privacy-preserving capability. FedEval-LLM leverages a consortium of personalized LLMs from participants as referees to provide domain knowledge and collective evaluation capability, thereby aligning with the respective downstream tasks and mitigating the uncertainties and biases associated with a single referee. Experimental results demonstrate a significant improvement in the evaluation capability of personalized evaluation models on downstream tasks. When applied to FL, these evaluation models exhibit strong agreement with human preference and with ROUGE-L scores on meticulously curated test sets. FedEval-LLM effectively overcomes the limitations of traditional metrics and the reliance on external services, making it a promising framework for the evaluation of LLMs within collaborative training scenarios.
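
The idea of using several referees rather than one can be illustrated with a minimal sketch. The abstract does not specify how the referees' judgments are combined, so the majority vote below is an assumption made purely for illustration, and the names `Referee`, `collective_verdict`, `prefers_longer`, and `prefers_shorter` are hypothetical. In the framework each referee would be a participant's personalized LLM; here trivial stand-in functions take that role so the example runs on its own.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical referee interface: given a question and two candidate answers,
# return "A", "B", or "tie". In FedEval-LLM a referee would be a participant's
# personalized LLM; that call is replaced by toy functions here.
Referee = Callable[[str, str, str], str]


def collective_verdict(referees: List[Referee], question: str,
                       answer_a: str, answer_b: str) -> str:
    """Combine referee verdicts by majority vote (an assumed aggregation rule)."""
    if not referees:
        raise ValueError("at least one referee is required")
    votes = Counter(r(question, answer_a, answer_b) for r in referees)
    ranked = votes.most_common()
    # If the two most frequent verdicts have equal counts, report a tie.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


# Toy stand-in referees with deliberately different biases.
def prefers_longer(question: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


def prefers_shorter(question: str, a: str, b: str) -> str:
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) < len(b) else "B"


if __name__ == "__main__":
    panel = [prefers_longer, prefers_longer, prefers_shorter]
    print(collective_verdict(panel, "What is FL?",
                             "A detailed explanation.", "Short."))
```

With two of the three stand-in referees favoring the longer answer, the single dissenting referee is outvoted, which is the sense in which a panel dampens the bias of any one judge.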
