RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

Andrea Morandi

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold that ports the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. In Stage 3 the same model acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10), an absolute gain of 14.0 percentage points. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate taken alone (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique, which together account for the full 14.0 pp gain. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every operating point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
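The three stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `llm` callable, the `SCAFFOLD` and `CRITIC` prompt texts, the `Verdict: A/B` output convention, and all function names are hypothetical placeholders, since the abstract does not give the actual prompt wording or parsing rules. Only N=10, the sampling temperature 0.4, and the critique temperature 0 come from the text.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: (prompt, temperature) -> raw model text.
LLM = Callable[[str, float], str]

# Placeholder Stage 1 scaffold (assumed wording, not the paper's prompt).
SCAFFOLD = (
    "Study the question and both responses, explain the correct solution as "
    "if teaching it, look for gaps in that explanation, then simplify.\n"
    "Finish with a line 'Verdict: A' or 'Verdict: B'.\n\n"
    "Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n"
)

# Placeholder Stage 3 critic prompt (assumed wording).
CRITIC = (
    "Below are candidate judgements of the same pairwise item. Compare them "
    "against the original question and responses, then give one final line "
    "'Verdict: A' or 'Verdict: B'.\n\n"
    "Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "{candidates}\n"
)


def parse_verdict(text: str) -> str:
    """Return 'A' or 'B' from the last 'Verdict:' line, else raise ValueError."""
    for line in reversed(text.strip().splitlines()):
        if line.strip().lower().startswith("verdict:"):
            label = line.split(":", 1)[1].strip().upper()[:1]
            if label in ("A", "B"):
                return label
    raise ValueError("no verdict found")


def majority_vote(candidates: List[str]) -> str:
    """Self-consistency baseline: most frequent verdict among the candidates."""
    votes = Counter(parse_verdict(c) for c in candidates)
    return votes.most_common(1)[0][0]


def rtlc_judge(llm: LLM, question: str, a: str, b: str,
               n: int = 10, sample_temp: float = 0.4) -> str:
    # Stage 1: wrap the input in the fixed scaffold.
    prompt = SCAFFOLD.format(question=question, a=a, b=b)
    # Stage 2: draw n independent candidate verdicts at the sampling temperature.
    candidates = [llm(prompt, sample_temp) for _ in range(n)]
    # Stage 3: the same model critiques the candidate set at temperature 0.
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    critique = llm(
        CRITIC.format(question=question, a=a, b=b, candidates=numbered), 0.0
    )
    return parse_verdict(critique)
```

Under these assumptions one judgement costs n + 1 model calls (11 for N=10), against n calls for `majority_vote` over the same candidates. That difference is the extra cost paid for the critique step when comparing the two on a cost-accuracy frontier.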
