Pairwise comparison is the gold standard for subjective ranking tasks, but exhaustive annotation requires $O(n^2)$ human comparisons. Sorting-based methods reduce this burden to $O(n\log n)$, yet they still require a costly human judgment for every comparison. To further improve annotation efficiency, we propose using a Vision-Language Model (VLM) not as a replacement for annotators but as a \emph{question prioritizer} that identifies which comparisons genuinely require human judgment. The proposed \textbf{Surprise-Guided MergeSort (SGS)} framework has three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity; (2) a composite Surprise Scorer that combines position-bias-cancelled VLM confidence, Elo gap, and vote entropy to quantify comparison ambiguity; and (3) an adaptive budget allocator that routes high-surprise pairs to humans and resolves low-surprise pairs automatically via transitivity inference. We validate SGS on six benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS identified and skipped up to 535 non-informative comparisons per session and improved Kendall's $\tau{\times}100$ by $+6$ to $+12$ over Active Elo under the same total budget. These results indicate that combining VLM-guided surprise metrics with algorithmic sorting yields a generally consistent accuracy-efficiency trade-off across diverse domains.
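The abstract names the three signals behind the Surprise Scorer but not how they are combined. As a rough illustration only, the following Python sketch assumes an equally weighted linear combination, a fixed routing threshold, and the standard Elo expected-score formula with scale 400. The function names, the weights, and the threshold are all hypothetical and are not taken from the SGS implementation.

```python
import math


def debiased_confidence(p_a_first: float, p_a_second: float) -> float:
    """Average the VLM's P(A preferred) over both presentation orders.

    Assumption: position bias is cancelled by querying the pair twice,
    once with A shown first and once with A shown second.
    """
    return 0.5 * (p_a_first + p_a_second)


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) outcome."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def surprise(p_a_first, p_a_second, r_a, r_b, votes_a, votes_b,
             weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical composite score in [0, 1]; higher means more ambiguous."""
    p_vlm = debiased_confidence(p_a_first, p_a_second)
    # 1.0 when the VLM is at chance (p = 0.5), 0.0 when it is certain.
    vlm_uncertainty = 1.0 - abs(2.0 * p_vlm - 1.0)
    # 1.0 when the ratings are equal, approaching 0.0 for a large Elo gap.
    elo_closeness = 1.0 - abs(2.0 * elo_expected(r_a, r_b) - 1.0)
    total = votes_a + votes_b
    # With no votes yet, treat the pair as maximally uncertain (assumption).
    vote_entropy = binary_entropy(votes_a / total) if total else 1.0
    w_vlm, w_elo, w_vote = weights
    return w_vlm * vlm_uncertainty + w_elo * elo_closeness + w_vote * vote_entropy


def route(score: float, threshold: float = 0.5) -> str:
    """Send high-surprise pairs to a human; resolve the rest automatically."""
    return "human" if score >= threshold else "auto"


# An ambiguous pair and a clear-cut pair (illustrative inputs, not real data).
ambiguous = surprise(0.55, 0.45, 1500, 1500, votes_a=2, votes_b=2)
clear = surprise(0.98, 0.96, 1700, 1300, votes_a=5, votes_b=0)
print(route(ambiguous), route(clear))
```

Under these assumptions, a pair on which the VLM is at chance, the Elo ratings are equal, and the votes are split receives the maximum score and goes to a human, while a pair with a confident VLM, a large Elo gap, and unanimous votes falls below the threshold. An actual allocator would presumably adapt the threshold to the remaining budget rather than fix it in advance.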