Artificial intelligence, particularly through recent advances in deep learning, has achieved exceptional performance on many tasks in fields such as natural language processing and computer vision. Beyond strong evaluation metrics, a high level of interpretability is often required for these models to be used reliably. Explanations that offer insight into how a model maps its inputs to its outputs are therefore much sought after. Unfortunately, the black-box nature of machine learning models remains an unresolved issue, and it is precisely this nature that prevents researchers from understanding a model's behavior and providing explanatory descriptions of its final predictions. In this work, we propose a novel framework based on Adversarial Inverse Reinforcement Learning that provides global explanations for decisions made by a Reinforcement Learning model and captures intuitive tendencies the model follows by summarizing its decision-making process.