Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing of their free-text verdicts. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over ``Yes/No'' answers without generating any text. We introduce *-PLUIE, a family of task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that the personalised *-PLUIE variants achieve stronger correlations with human ratings while maintaining a low computational cost.
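For intuition, the following is a minimal sketch, in the spirit of ParaPLUIE, of how a perplexity-based Yes/No judge can score a candidate without generating text: it reads the judge LLM's next-token distribution and compares the probabilities of ``Yes'' and ``No''. The model name and prompt wording below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative judge model; ParaPLUIE-style scoring works with any
# causal LM, and this checkpoint is an assumption for the sketch.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def yes_no_confidence(source: str, candidate: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for a Yes/No judge prompt,
    using only a single forward pass (no decoding)."""
    # Prompt wording is a hypothetical paraphrase-judging template.
    prompt = (
        f"Sentence A: {source}\n"
        f"Sentence B: {candidate}\n"
        "Is Sentence B a valid paraphrase of Sentence A? "
        "Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits over the vocabulary for the position after "Answer:".
        logits = model(**inputs).logits[0, -1]
    # Token ids for the leading-space variants of "Yes"/"No".
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    # Renormalise over just the two answer tokens.
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()
```

Because the score is read directly from one forward pass, this avoids both autoregressive generation and the parsing of free-text verdicts, which is the source of the low computational cost mentioned above. Task-specific variants in the *-PLUIE spirit would only change the prompt template, not the scoring mechanism.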