This paper proposes a novel approach to evaluating Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To address this, we introduce a model ranking pipeline based on pairwise comparisons of CNs generated by different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a $\rho$ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate differences in performance and in responsiveness to domain-specific data. We conclude that chat-aligned models used zero-shot are the best option for this task, provided they do not refuse to generate an answer due to safety concerns.
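To make the tournament idea concrete, here is a minimal sketch of a round-robin pairwise ranking pipeline with an LLM judge. The `judge_pair` helper is a hypothetical placeholder (the paper's actual judge model and prompt are not specified here), and the rank agreement is computed with Spearman's $\rho$, assuming that is the correlation the reported 0.88 refers to; the example ranks are illustrative, not the paper's data.

```python
# Sketch of a tournament-style ranking of CN-generating models, assuming a
# hypothetical judge_pair() that asks an LLM evaluator to pick the better CN.
from itertools import combinations
from collections import Counter

from scipy.stats import spearmanr


def judge_pair(cn_a: str, cn_b: str) -> str:
    """Hypothetical LLM evaluator call: return 'A' or 'B' for the preferred CN."""
    raise NotImplementedError("plug in an LLM judge here")


def tournament_rank(outputs: dict[str, str]) -> list[str]:
    """Round-robin tournament over models' CNs for the same hate-speech input.

    `outputs` maps model name -> generated CN. Every CN is compared against
    every other CN once; models are ranked by their number of pairwise wins.
    """
    wins = Counter({model: 0 for model in outputs})
    for model_a, model_b in combinations(outputs, 2):
        verdict = judge_pair(outputs[model_a], outputs[model_b])
        wins[model_a if verdict == "A" else model_b] += 1
    return [model for model, _ in wins.most_common()]


# Agreement between the LLM-derived ranking and a human-preference ranking
# can then be measured with Spearman's rho (illustrative ranks only):
llm_ranks = [1, 2, 3, 4, 5]
human_ranks = [1, 3, 2, 4, 5]
rho, _ = spearmanr(llm_ranks, human_ranks)
print(f"Spearman rho: {rho:.2f}")
```

A round-robin schedule keeps the comparison count at $\binom{n}{2}$ per input, which stays tractable for the handful of models typically compared; randomizing the A/B presentation order of each pair is a common guard against positional bias in LLM judges.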