The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity, as reflected in the explanations that justify their stances, to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, existing methods are not straightforward to adopt for evaluating free-form toxicity explanations, due in part to their over-reliance on input text perturbations, among other challenges. To address these challenges, we propose a novel, theoretically grounded, multi-dimensional criterion, Argument-based Consistency (ArC), that measures the extent to which LLMs' free-form toxicity explanations reflect an ideal and logical argumentation process. Building on uncertainty quantification, we develop six ArC metrics to comprehensively evaluate the (in)consistencies in LLMs' toxicity explanations. We conduct several experiments on three Llama models (of sizes up to 70B) and an 8B Ministral model across five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations in response to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and irrelevant responses. We open-source our code (https://github.com/uofthcdslab/ArC) and LLM-generated explanations (https://huggingface.co/collections/uofthcdslab/arc) for future work.