Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires $O(n^2)$ human comparisons. Sorting-based methods reduce this burden to $O(n\log n)$, but they still require an expensive human judgment for every comparison. To further improve annotation efficiency, we propose using a Vision-Language Model (VLM) not as an annotator replacement, but as a \emph{question prioritizer} that identifies which comparisons genuinely require human judgment. The proposed \textbf{Surprise-Guided MergeSort (SGS)} framework has three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer -- combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy -- that quantifies comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while resolving low-surprise pairs automatically via transitivity inference. We validate SGS on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS identified and skipped up to 535 non-informative comparisons per session, and it improved Kendall's $\tau{\times}100$ by $+6$ to $+12$ over Active Elo under the same total budget. These results indicate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.
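The three components described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the equal weights, the Elo scale of 400, the threshold rule, and every function and parameter name here (`surprise_score`, `merge_sort_with_routing`, `ask_human`, `auto_prefers`) are assumptions made for illustration. In particular, low-surprise pairs are resolved by a generic automatic preference callback, which stands in for the transitivity inference mentioned in the abstract.

```python
import math
from typing import Callable, List, Sequence, Tuple


def surprise_score(p_a_first: float, p_a_second: float,
                   elo_a: float, elo_b: float,
                   votes_a: int = 0, votes_b: int = 0,
                   weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
                   elo_scale: float = 400.0) -> float:
    """Hypothetical composite surprise in [0, 1]; higher means more ambiguous."""
    # Position-bias cancellation (assumed form): average the VLM's
    # P(a preferred) over both presentation orders.
    p_a = 0.5 * (p_a_first + p_a_second)
    vlm_uncertainty = 1.0 - abs(2.0 * p_a - 1.0)

    # Elo gap: the standard Elo expected score is near 0.5 for close ratings.
    expected = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / elo_scale))
    elo_uncertainty = 1.0 - abs(2.0 * expected - 1.0)

    # Binary vote entropy in bits; with no votes, assume maximal uncertainty.
    total = votes_a + votes_b
    if total == 0:
        entropy = 1.0
    else:
        q = votes_a / total
        entropy = 0.0 if q in (0.0, 1.0) else -(
            q * math.log2(q) + (1 - q) * math.log2(1 - q))

    w_vlm, w_elo, w_vote = weights
    return w_vlm * vlm_uncertainty + w_elo * elo_uncertainty + w_vote * entropy


def merge_sort_with_routing(items: Sequence[str],
                            surprise: Callable[[str, str], float],
                            ask_human: Callable[[str, str], bool],
                            auto_prefers: Callable[[str, str], bool],
                            threshold: float,
                            budget: int) -> Tuple[List[str], int]:
    """Bottom-up merge sort; returns (ranking, best first, human queries used)."""
    used = 0

    def a_before_b(a: str, b: str) -> bool:
        nonlocal used
        # Route ambiguous pairs to a human while budget remains (assumed rule).
        if used < budget and surprise(a, b) >= threshold:
            used += 1
            return ask_human(a, b)
        return auto_prefers(a, b)

    runs = [[x] for x in items]
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            if i + 1 == len(runs):  # odd run out: carry it to the next pass
                next_runs.append(runs[i])
                continue
            left, right = runs[i], runs[i + 1]
            out, li, ri = [], 0, 0
            while li < len(left) and ri < len(right):
                if a_before_b(left[li], right[ri]):
                    out.append(left[li])
                    li += 1
                else:
                    out.append(right[ri])
                    ri += 1
            out.extend(left[li:])
            out.extend(right[ri:])
            next_runs.append(out)
        runs = next_runs
    return (runs[0] if runs else []), used
```

Merging already-sorted runs is what lets the scheduler skip pairs whose order is implied by earlier outcomes, so only the comparisons that the merge actually requests are ever scored and routed.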

