The rapid advancement of large language models (LLMs) demands increasingly reliable evaluation, yet current centralized evaluation suffers from opacity, overfitting, and hardware-induced variance. Our empirical analysis reveals an alarming inconsistency in existing evaluations: the standard deviation across ten repeated runs of a single model on HumanEval (1.67) exceeds the performance gap among the top-10 models on the official leaderboard (0.91), rendering current rankings statistically precarious. To mitigate these instabilities, we propose a decentralized evaluation framework that enables hardware and parameter diversity through large-scale benchmarking across heterogeneous compute nodes. By leveraging a blockchain-based protocol, the framework incentivizes global contributors to act as independent validators, using a robust reward system to ensure evaluation integrity and discourage dishonest participation. This collective verification transforms evaluation from a "centralized black box" into a "decentralized endorsement," where multi-party consensus and diverse inference environments yield a more stable, representative metric. Experimental results demonstrate that the decentralized evaluation framework reduces the standard deviation across ten runs on the same model to 0.28, a significant improvement over conventional frameworks that ensures higher statistical confidence in model rankings. We have fully implemented the platform and will soon release it to the community.
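The instability claim above rests on comparing the sample standard deviation of repeated benchmark runs against the spread of the leaderboard's top-10 scores. A minimal sketch of that comparison follows; the scores are hypothetical placeholders, not the paper's actual data:

```python
import statistics

# Hypothetical pass@1 scores (%) from ten repeated runs of one model
# on the same benchmark (placeholder values, not the paper's data).
runs = [64.0, 66.5, 63.2, 67.1, 65.4, 62.8, 66.0, 64.7, 65.9, 63.5]

# Run-to-run instability: sample standard deviation across the ten runs.
run_std = statistics.stdev(runs)

# Hypothetical leaderboard: scores of the top-10 models, best to worst.
leaderboard = [67.1, 66.9, 66.8, 66.7, 66.6, 66.5, 66.4, 66.4, 66.3, 66.2]
top10_gap = max(leaderboard) - min(leaderboard)

# If a single model's run-to-run spread exceeds the gap separating the
# top-10 models, the leaderboard ordering is not statistically meaningful.
print(f"run std = {run_std:.2f}, top-10 gap = {top10_gap:.2f}")
print("rankings unreliable:", run_std > top10_gap)
```

With these placeholder numbers the run-to-run standard deviation (~1.49) exceeds the top-10 gap (0.90), mirroring the 1.67-versus-0.91 situation the abstract reports.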