As AI becomes more integral to our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive because of subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple scales (i.e., binary, ternary, and 7-point Likert scales). We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores, which serve as our measures of text quality. We further conduct paired comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans on coarser-grained scales. In addition, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of how well large language models can assess the quality of text explanations under different configurations, in support of responsible AI development.
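To make the comparison across scales concrete, the following is a minimal sketch of how 7-point Likert ratings might be collapsed into ternary and binary labels and then compared by exact-match agreement. The cut points, the function names (`to_ternary`, `to_binary`, `agreement`), and the toy ratings are all assumptions made for illustration; they are not taken from the study, and exact-match rate is only one of several possible alignment measures.

```python
from typing import Sequence


def to_ternary(score: int) -> int:
    """Collapse a 1-7 Likert rating into three bins.

    Assumed cut points (hypothetical): 1-2 -> 0, 3-5 -> 1, 6-7 -> 2.
    """
    if not 1 <= score <= 7:
        raise ValueError(f"expected a rating in 1..7, got {score}")
    if score <= 2:
        return 0
    if score <= 5:
        return 1
    return 2


def to_binary(score: int) -> int:
    """Collapse a 1-7 Likert rating into two bins.

    Assumed cut point (hypothetical): 1-3 -> 0, 4-7 -> 1.
    """
    if not 1 <= score <= 7:
        raise ValueError(f"expected a rating in 1..7, got {score}")
    return 1 if score >= 4 else 0


def agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy ratings, invented for illustration only.
human = [1, 4, 6, 7, 3, 2]
model = [2, 5, 7, 5, 3, 4]

likert_agreement = agreement(human, model)
ternary_agreement = agreement(
    [to_ternary(s) for s in human], [to_ternary(s) for s in model]
)
binary_agreement = agreement(
    [to_binary(s) for s in human], [to_binary(s) for s in model]
)
```

On these toy ratings the exact-match rate rises as the scale gets coarser. Part of that rise is mechanical, since fewer categories leave fewer ways to disagree, so a chance-corrected statistic may be a better choice when comparing scales.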