The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity, based on the explanations they generate to justify a stance, in order to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, existing methods are not straightforward to adopt for evaluating free-form toxicity explanations, owing to their over-reliance on input text perturbations, among other challenges. To account for these challenges, we propose a novel, theoretically grounded, multi-dimensional criterion, Human-Aligned Faithfulness (HAF), which measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate the HAF of LLMs' toxicity explanations without human involvement, and to highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (up to 70B parameters) and an 8B Ministral model across five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations in response to simple prompts, their reasoning about toxicity breaks down when they are prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code at https://github.com/uofthcdslab/HAF and release the LLM-generated explanations at https://huggingface.co/collections/uofthcdslab/haf.