Human-Aligned Faithfulness in Toxicity Explanations of LLMs

The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity, as expressed in the explanations they give to justify a stance, in order to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, existing methods are not straightforward to adopt for evaluating free-form toxicity explanations, owing to their over-reliance on input text perturbations, among other challenges. To address these challenges, we propose a novel, theoretically grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate the HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations in response to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.

