Rationales in the form of manually annotated input spans usually serve as ground truth when evaluating explainability methods in NLP. They are, however, time-consuming to collect and often biased by the annotation process. In this paper, we examine whether human gaze, in the form of webcam-based eye-tracking recordings, is a valid alternative when evaluating importance scores. We evaluate the additional information provided by gaze data, such as total reading times, gaze entropy, and decoding accuracy with respect to human rationale annotations. We compare gaze data from WebQAmGaze, a multilingual dataset for information-seeking QA, with attention-based and explainability-based importance scores for four multilingual Transformer-based language models (mBERT, distil-mBERT, XLMR, and XLMR-L) and three languages (English, Spanish, and German). Our pipeline can easily be applied to other tasks and languages. Our findings suggest that gaze data offers valuable linguistic insights that could be leveraged to infer task difficulty, and that it yields a ranking of explainability methods comparable to that obtained from human rationales.