Explainable recommendation systems are important for enhancing transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs and thus generalize poorly: given the wide variety of recommendation models, existing methods can effectively interpret only a narrow subset of them. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with a Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect the original model during interpretation, while also enabling predictable modifications to model behavior based on interpretation results. First, we train an autoencoder with a sparsity constraint to reconstruct the internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Second, we automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Third, RecSAE validates these interpretations by predicting latent activations on new item sequences with the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE's effectiveness on two datasets, identifying hundreds of highly interpretable concepts in pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output, underscoring RecSAE's utility for both understanding and targeted tuning of recommendation models. Code and data are publicly available at https://github.com/Alice1998/RecSAE.
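To make the core mechanism concrete, below is a minimal sketch of a sparse autoencoder forward pass over a recommendation model's hidden activation. All dimensions, the ReLU + top-k sparsity scheme, and the weight initialization here are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64    # width of the recommendation model's hidden state (assumed)
d_latent = 256  # overcomplete SAE latent dimension (assumed)
k = 8           # keep only the k largest latents (one common sparsity scheme)

# Randomly initialized encoder/decoder weights (training loop omitted).
W_enc = rng.normal(0.0, 0.1, (d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0.0, 0.1, (d_latent, d_model))
b_dec = np.zeros(d_model)

def encode(h):
    """Map a hidden activation to sparse latents via ReLU + top-k masking."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    mask = np.zeros_like(z)
    mask[np.argsort(z)[-k:]] = 1.0          # zero out all but the k largest
    return z * mask

def decode(z):
    """Reconstruct the original hidden activation from the sparse latents."""
    return z @ W_dec + b_dec

h = rng.normal(size=d_model)      # stand-in for a captured internal activation
z = encode(h)                     # sparse, hopefully monosemantic latents
h_hat = decode(z)                 # reconstruction fed back for evaluation
loss = np.mean((h - h_hat) ** 2)  # reconstruction objective to minimize
```

Because the encoder/decoder sit outside the recommendation model, reconstruction and latent inspection leave the original model untouched; interpreting a latent then amounts to relating the item sequences that activate it to a concept dictionary entry.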