We propose a method for evaluating the robustness of widely used LLM ranking systems -- variants of a Bradley--Terry model -- to dropping a worst-case, very small fraction of the preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and its derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing these influential preferences to be inspected directly. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use of expert annotators and carefully constructed prompts. Finally, we find that neither rankings based on crowdsourced human evaluations nor those based on LLM-as-a-judge preferences are systematically more sensitive than the other.
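To make the sensitivity phenomenon concrete, the toy sketch below fits a standard Bradley--Terry model to pairwise preferences by gradient ascent on the log-likelihood, then refits after dropping a handful of preferences and observes the top rank flip. This is an illustrative example only, not the paper's method: the `fit_bradley_terry` function, the synthetic 51-vs-49 matchup data, and the choice to drop three specific wins are all assumptions made for the demonstration.

```python
import numpy as np

def fit_bradley_terry(wins, n_models, n_iters=2000, lr=0.1):
    """Fit Bradley-Terry strengths by gradient ascent on the log-likelihood.

    `wins` is a list of (winner, loser) index pairs; returns one score per
    model (higher = stronger), normalized to mean zero to fix the gauge.
    """
    theta = np.zeros(n_models)
    w = np.array(wins)
    for _ in range(n_iters):
        diff = theta[w[:, 0]] - theta[w[:, 1]]  # winner strength minus loser strength
        p_upset = 1.0 / (1.0 + np.exp(diff))    # model probability the observed loser wins
        grad = np.zeros(n_models)
        np.add.at(grad, w[:, 0], p_upset)       # push observed winners up
        np.add.at(grad, w[:, 1], -p_upset)      # push observed losers down
        theta += lr * grad / len(w)
        theta -= theta.mean()
    return theta

# Toy data: model 0 narrowly beats model 1 (51 wins to 49).
prefs = [(0, 1)] * 51 + [(1, 0)] * 49
theta_full = fit_bradley_terry(prefs, n_models=2)

# Dropping just three of model 0's wins (3% of this toy dataset) flips the top rank.
theta_dropped = fit_bradley_terry(prefs[3:], n_models=2)
```

Because the Bradley--Terry estimate for two models depends only on the win ratio, a near-tied matchup makes the leader's identity hinge on a few individual preferences, which is the fragility the robustness check above is designed to surface at scale.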