Large language models (LLMs) have made remarkable progress on a wide range of natural language understanding (NLU) and generation tasks. However, their ability to generate counterfactuals has not been examined systematically. To bridge this gap, we present a comprehensive evaluation framework spanning various types of NLU tasks, which covers all key factors that determine LLMs' capability to generate counterfactuals. Based on this framework, we 1) investigate the strengths and weaknesses of LLMs as counterfactual generators, and 2) identify the factors that affect LLMs when generating counterfactuals, including both the intrinsic properties of LLMs and prompt design. The results show that, although LLMs are promising in most cases, they face challenges in complex tasks such as relation extraction (RE), because they are bounded by task-specific performance, entity constraints, and inherent selection bias. We also find that alignment techniques, e.g., instruction tuning and reinforcement learning from human feedback, may enhance the counterfactual generation ability of LLMs. In contrast, simply increasing the parameter size does not yield the desired improvements. In addition, from the perspective of prompt design, task guidelines unsurprisingly play an important role. However, the chain-of-thought approach does not always help, owing to inconsistency issues.