Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging, a common LLM-as-a-Judge practice in which a model issues a single global verdict on a debate, suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.
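To make the idea of aggregating local attack and support judgments into a global ranking concrete, the following is a minimal sketch of a generic gradual-semantics propagation. The abstract does not specify GRASP's operator, so everything here is an assumption for illustration: the function names `propagate` and `rank`, the base score of 0.5, the damping factor, and the averaged update rule are hypothetical and should not be read as the paper's method. The update averages signed neighbor scores and scales them by a damping factor below 1, which makes the iteration a contraction and therefore convergent.

```python
def propagate(nodes, edges, base=0.5, damping=0.5, tol=1e-9, max_iter=1000):
    """Iterate a damped attack/support update until scores stop changing.

    edges: iterable of (source, target, kind), kind in {"attack", "support"}.
    Illustrative operator only; NOT the operator defined by GRASP.
    """
    incoming = {n: [] for n in nodes}
    for src, dst, kind in edges:
        sign = 1.0 if kind == "support" else -1.0
        incoming[dst].append((src, sign))

    scores = {n: base for n in nodes}
    for _ in range(max_iter):
        new_scores = {}
        for n in nodes:
            inc = incoming[n]
            # Mean signed influence of neighbors: supporters add, attackers subtract.
            influence = sum(s * scores[src] for src, s in inc) / len(inc) if inc else 0.0
            # Clip to [0, 1]; with damping < 1 the whole update is a contraction.
            new_scores[n] = min(1.0, max(0.0, base + damping * influence))
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


def rank(scores):
    """Order arguments by descending score, breaking ties by name."""
    return sorted(scores, key=lambda n: (-scores[n], n))


# Hypothetical debate: a attacks b, but c attacks a (so c defends b);
# e attacks d, and nothing defends d.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", "attack"), ("c", "a", "attack"), ("e", "d", "attack")]
scores = propagate(nodes, edges)
print(rank(scores))
```

In this toy graph the defended argument `b` ends up above the undefended argument `d`, because the attacker of `b` is itself weakened by an attack. This is one simple way a score can be defense-aware; the same inputs always yield the same ranking, since the procedure has no randomness.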