Recommendation model interpretation aims to reveal the relationships among inputs, internal model representations, and outputs, thereby enhancing the transparency, interpretability, and trustworthiness of recommender systems. However, the inherent complexity and opacity of deep learning models pose challenges for model-level interpretation. Moreover, most existing methods for interpreting recommendation models are tailored to specific architectures or model types, limiting their generalizability across different kinds of recommenders. In this paper, we propose RecSAE, a generalizable probing framework that interprets Recommendation models with Sparse AutoEncoders. The framework extracts interpretable latents from the internal representations of recommendation models and links them to semantic concepts for interpretation. RecSAE leaves the original model unchanged during interpretation while also enabling targeted model tuning. Experiments on three types of recommendation models (general, graph-based, and sequential) across four widely used public datasets demonstrate the effectiveness and generalizability of the RecSAE framework. The interpreted concepts are further validated by human experts, showing strong alignment with human perception. Overall, RecSAE is a novel step toward model-level interpretation of diverse types of recommendation models without affecting their functionality, while also offering potential for targeted model tuning.
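The probing idea described above — training a sparse autoencoder on a frozen model's internal representations so that only a few interpretable latents activate per input — can be sketched as follows. This is a minimal illustration only: the dimensions, the top-k sparsity rule, and all names are assumptions for exposition, not RecSAE's actual implementation.

```python
import numpy as np

class SparseAutoencoder:
    """Minimal top-k sparse autoencoder for probing internal representations.

    Hypothetical sketch: sizes, the top-k rule, and names are illustrative
    assumptions, not the paper's actual architecture or training recipe.
    """

    def __init__(self, d_model, d_latent, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.02, (d_latent, d_model))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(0.0, 0.02, (d_model, d_latent))
        self.b_dec = np.zeros(d_model)
        self.k = k

    def encode(self, x):
        z = self.W_enc @ x + self.b_enc
        # Keep only the k largest pre-activations, zero the rest,
        # then apply ReLU: at most k latents fire per input.
        mask = np.zeros_like(z)
        mask[np.argsort(z)[-self.k:]] = 1.0
        return np.maximum(z, 0.0) * mask

    def decode(self, z):
        return self.W_dec @ z + self.b_dec

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z

# Probe a (stand-in) hidden state from a frozen recommender.
sae = SparseAutoencoder(d_model=64, d_latent=256, k=8)
x = np.random.default_rng(1).normal(size=64)
x_hat, z = sae.forward(x)
print("active latents:", int((z != 0).sum()))  # at most k
```

Because the autoencoder is trained on cached activations, the recommender itself is never modified, which matches the non-intrusive property the abstract emphasizes; the few active latents per input are the candidates later linked to semantic concepts.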