As deep learning models grow larger and more complex, particularly in Natural Language Processing (NLP), desirable model qualities such as explainability and interpretability are becoming increasingly difficult to achieve. For example, state-of-the-art models in text classification are black-box by design. Although standard explanation methods provide some degree of explainability, they are largely correlation-based and offer limited insight into the model's behavior. Causal explainability is a more desirable alternative, but it is extremely challenging to achieve in NLP for a variety of reasons. Inspired by recent endeavors to utilize Large Language Models (LLMs) as experts, in this work we aim to leverage the instruction-following and textual understanding capabilities of recent state-of-the-art LLMs to facilitate causal explainability via counterfactual explanation generation for black-box text classifiers. To this end, we propose a three-step pipeline in which an off-the-shelf LLM is used to: (1) identify the latent or unobserved features in the input text, (2) identify the input features associated with these latent features, and (3) use the identified input features to generate a counterfactual explanation. We evaluate our pipeline on multiple NLP text classification datasets with several recent LLMs and present interesting and promising findings.
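To make the three-step pipeline concrete, the following is a minimal sketch of how the three LLM calls could be chained. The function name, prompt wording, and the `llm` callable are illustrative assumptions rather than the authors' actual prompts or interface; `llm` stands in for any wrapper around an off-the-shelf instruction-tuned LLM that maps a prompt string to a completion string.

```python
# Illustrative sketch of the three-step pipeline (assumed prompts, not the paper's exact ones).
# `llm` is any callable: prompt (str) -> completion (str), e.g. a wrapper around an
# off-the-shelf instruction-tuned LLM.

def generate_counterfactual(text: str, predicted_label: str, target_label: str, llm) -> dict:
    # Step 1: identify latent (unobserved) features that may drive the classifier's decision.
    latent_features = llm(
        f"Text: {text}\n"
        f"A classifier predicted the label '{predicted_label}'.\n"
        "List the latent (unobserved) concepts this decision may depend on."
    )

    # Step 2: map each latent feature to the concrete input features (words/phrases) that express it.
    input_features = llm(
        f"Text: {text}\n"
        f"Latent concepts: {latent_features}\n"
        "For each latent concept, list the words or phrases in the text that express it."
    )

    # Step 3: minimally edit those input features to produce a counterfactual explanation
    # that would flip the prediction to the target label.
    counterfactual = llm(
        f"Text: {text}\n"
        f"Relevant words/phrases: {input_features}\n"
        f"Rewrite the text with minimal edits to these spans so that a classifier would "
        f"predict '{target_label}' instead of '{predicted_label}'."
    )

    return {
        "latent_features": latent_features,
        "input_features": input_features,
        "counterfactual": counterfactual,
    }
```

In this sketch, the intermediate outputs of steps (1) and (2) are passed forward as plain text; in practice one might constrain the LLM to a structured output format so the identified features can be parsed and verified against the input before the counterfactual is generated.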