Commonsense reasoning is a difficult task for a computer but a critical skill for artificial intelligence (AI). It can enhance the explainability of AI models by enabling them to provide intuitive, human-like explanations for their decisions. This is necessary in many areas, especially in question answering (QA), one of the most important tasks in natural language processing (NLP). Over time, a multitude of methods have emerged for solving commonsense reasoning problems, such as knowledge-based approaches using formal logic or linguistic analysis. In this paper, we investigate the effectiveness of large language models (LLMs) on different QA tasks, with a focus on their reasoning and explainability capabilities. We study three LLMs: GPT-3.5, Gemma, and Llama 3. We further evaluate the LLM outputs by means of a questionnaire. We demonstrate the ability of LLMs to reason with commonsense, as the models outperform humans on different datasets. While GPT-3.5's accuracy ranges from 56% to 93% across the QA benchmarks, Llama 3 achieves a mean accuracy of 90% over all eleven datasets. Llama 3 thereby outperforms humans on every dataset, with an average accuracy 21% higher across ten datasets. Furthermore, we find that, in the sense of explainable artificial intelligence (XAI), GPT-3.5 provides good explanations for its decisions: our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent". Taken together, these findings enrich our understanding of current LLMs and pave the way for future investigations of reasoning and explainability.