Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited on subjective tasks, where human judgments involve subtle reasoning that annotation labels alone do not capture. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework that infers thinking traces from label-only annotations, using a simple and effective rejection sampling method to reconstruct these traces at scale. The inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods significantly improve LLM-human agreement. The refined annotation guidelines also increase agreement among different LLMs. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
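To make the rejection-sampling idea concrete, the sketch below shows one minimal way to realize it: sample candidate reasoning traces from an LLM conditioned on the annotated item, and accept a trace only when its final judgment matches the human-provided label. This is an illustrative assumption of how such a loop could look, not the paper's actual implementation; the function `infer_thinking_trace`, the `generate` callable, and the prompt format are all hypothetical.

```python
"""Sketch: rejection sampling of thinking traces from label-only annotations.

Assumes `generate` is any LLM completion API (prompt -> text); the names and
prompt wording here are hypothetical stand-ins, not the authors' code.
"""
from typing import Callable, Optional


def infer_thinking_trace(
    item: str,
    human_label: str,
    generate: Callable[[str], str],  # prompt -> model completion
    max_attempts: int = 8,
) -> Optional[str]:
    prompt = (
        "Explain step by step how a careful annotator would judge the "
        f"following item, then state a final label.\n\nItem: {item}\n\n"
        "End your answer with a line 'Label: <label>'."
    )
    for _ in range(max_attempts):
        completion = generate(prompt)
        # Parse the model's final label from the last 'Label:' line.
        label_lines = [
            line for line in completion.splitlines() if line.startswith("Label:")
        ]
        if not label_lines:
            continue
        predicted = label_lines[-1].removeprefix("Label:").strip()
        # Rejection step: keep the trace only if it leads to the human label.
        if predicted.lower() == human_label.lower():
            return completion  # accepted thinking trace
    return None  # all samples rejected; item left without a trace
```

Under this reading, accepted traces form the thinking-trace-augmented corpus used for fine-tuning open raters and for distilling clearer guidelines, while items whose samples are all rejected simply remain label-only.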