AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
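The cloning concern can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not the paper's calibrated simulation or the YRWR mechanism. It assumes each model is scored by its empirical win rate against a common baseline, and that a producer's clones and a single rival model all share the same true win probability. The names `empirical_win_rate` and `prob_top_rank`, and the parameter values, are hypothetical and chosen only for illustration.

```python
import random


def empirical_win_rate(p_true, n_votes, rng):
    """Noisy score: fraction of n_votes pairwise votes won against a baseline."""
    wins = sum(rng.random() < p_true for _ in range(n_votes))
    return wins / n_votes


def prob_top_rank(n_clones, p_true=0.5, n_votes=200, n_trials=2000, seed=0):
    """Estimate how often the producer's best-scoring clone strictly outranks
    a rival model of identical true quality (toy setting, assumed here)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        rival_score = empirical_win_rate(p_true, n_votes, rng)
        best_clone_score = max(
            empirical_win_rate(p_true, n_votes, rng) for _ in range(n_clones)
        )
        hits += best_clone_score > rival_score
    return hits / n_trials


if __name__ == "__main__":
    # Print the estimated chance of outranking the rival for each clone count.
    for k in (1, 2, 5, 10):
        print(f"clones={k:2d}  P(best clone outranks rival) ~ {prob_top_rank(k):.3f}")
```

In this toy setting the producer's true quality never changes; only the number of noisy draws does. Taking the maximum over more draws makes it more likely that some clone lands above the rival, which is the kind of exploit a clone-robust mechanism is meant to limit.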