The increasing difficulty of distinguishing language-model-generated from human-written text has led to the development of detectors of machine-generated text (MGT). However, in many contexts, a black-box prediction is not sufficient; it is equally important to know on what grounds a detector made that prediction. Explanation methods that estimate feature importance promise to indicate which parts of an input a classifier uses for its prediction. However, these methods are typically evaluated with simple classifiers and tasks that are intuitive to humans. To assess their suitability beyond these contexts, this study conducts the first systematic evaluation of explanation quality for detectors of MGT. The dimensions of faithfulness and stability are evaluated in five automated experiments, and usefulness is assessed in a user study. We use a dataset of ChatGPT-generated and human-written documents and pair the predictions of three existing language-model-based detectors with the corresponding SHAP, LIME, and Anchor explanations. We find that SHAP performs best in terms of faithfulness, stability, and helping users predict the detector's behavior. In contrast, LIME, though perceived as most useful by users, scores worst in terms of users' actual performance at predicting detector behavior.