LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely beat random guessing on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold that ports the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws $N=10$ independent candidate verdicts at temperature 0.4. In Stage 3 the model acts as its own critic, cross-comparing the candidate set against the original question and emitting one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10), an absolute gain of 14.0 percentage points. RTLC also beats $N=10$ self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to $N=10$ marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every operating point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
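The sample-then-critique flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ModelFn` callable, the prompt wording in `SCAFFOLD` and `CRITIQUE`, the `Verdict: A/B` output convention, and all function names are hypothetical, since the abstract does not give the actual prompts or output format. Only the values N=10, temperature 0.4 for sampling, and temperature 0 for the critique come from the text.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical model interface: (prompt, temperature) -> raw model text.
ModelFn = Callable[[str, float], str]

# Placeholder scaffold; the real Teach-to-Learn prompt is not given in the abstract.
SCAFFOLD = (
    "Study the question, explain the solution as if teaching it, "
    "note any gaps in your explanation, then simplify.\n"
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "End with 'Verdict: A' or 'Verdict: B'."
)

# Placeholder critique prompt (also an assumption).
CRITIQUE = (
    "You are reviewing candidate judgements of the same comparison.\n"
    "Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
    "Candidates:\n{candidates}\n"
    "Cross-check them against the question and end with "
    "'Verdict: A' or 'Verdict: B'."
)

VERDICT_RE = re.compile(r"Verdict:\s*([AB])\b")


def parse_verdict(text: str) -> Optional[str]:
    """Return 'A' or 'B' from the last 'Verdict: X' marker, or None if absent."""
    matches = VERDICT_RE.findall(text)
    return matches[-1] if matches else None


def sample_candidates(model: ModelFn, question: str, a: str, b: str,
                      n: int = 10, temperature: float = 0.4) -> List[str]:
    """Draw n independent scaffolded judgements at the sampling temperature."""
    prompt = SCAFFOLD.format(question=question, a=a, b=b)
    return [model(prompt, temperature) for _ in range(n)]


def majority_vote(candidates: List[str]) -> Optional[str]:
    """Self-consistency baseline: most common parsed verdict.

    Ties resolve to the verdict that appeared first among the candidates.
    """
    votes = Counter(v for v in map(parse_verdict, candidates) if v is not None)
    return votes.most_common(1)[0][0] if votes else None


def critique(model: ModelFn, question: str, a: str, b: str,
             candidates: List[str]) -> Optional[str]:
    """Show all candidates to the same model at temperature 0 for one final verdict."""
    listing = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = CRITIQUE.format(question=question, a=a, b=b, candidates=listing)
    return parse_verdict(model(prompt, 0.0))
```

In this sketch `majority_vote` and `critique` consume the same candidate list, so the two aggregation strategies can be compared on identical samples; the critique step costs one extra model call on top of the N sampling calls.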