Explainable recommendation systems are important for enhancing transparency, accuracy, and fairness. Beyond result-level explanations, model-level interpretations can provide valuable insights that allow developers to optimize system designs and implement targeted improvements. However, most current approaches depend on specialized model designs and thus generalize poorly: given the wide variety of recommendation models, existing methods can effectively interpret only a narrow subset of them. To address this issue, we propose RecSAE, an automatic, generalizable probing method for interpreting the internal states of Recommendation models with a Sparse AutoEncoder. RecSAE serves as a plug-in module that does not affect the original model during interpretation, while also enabling predictable modifications to model behavior based on interpretation results. First, we train an autoencoder with a sparsity constraint to reconstruct the internal activations of recommendation models, making the RecSAE latents more interpretable and monosemantic than the original neuron activations. Second, we automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences. Third, RecSAE validates these interpretations by predicting latent activations on new item sequences with the concept dictionary and deriving interpretation confidence scores from precision and recall. We demonstrate RecSAE's effectiveness on two datasets, identifying hundreds of highly interpretable concepts in pure ID-based models. Latent ablation studies further confirm that manipulating latent concepts produces corresponding changes in model output, underscoring RecSAE's utility for both understanding and targeted tuning of recommendation models. Code and data are publicly available at https://github.com/Alice1998/RecSAE.
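To make the core mechanism concrete, below is a minimal sketch of a sparse autoencoder forward pass over a recommendation model's hidden activation. All dimensions, the ReLU + top-k sparsity scheme, and the weight initialization here are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64    # width of the recommendation model's hidden state (assumed)
d_latent = 256  # overcomplete SAE latent dimension (assumed)
k = 8           # keep only the k largest latents (one common sparsity scheme)

# Randomly initialized encoder/decoder weights (training loop omitted).
W_enc = rng.normal(0.0, 0.1, (d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0.0, 0.1, (d_latent, d_model))
b_dec = np.zeros(d_model)

def encode(h):
    """Map a hidden activation to sparse latents via ReLU + top-k masking."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    mask = np.zeros_like(z)
    mask[np.argsort(z)[-k:]] = 1.0          # zero out all but the k largest
    return z * mask

def decode(z):
    """Reconstruct the original hidden activation from the sparse latents."""
    return z @ W_dec + b_dec

h = rng.normal(size=d_model)      # stand-in for a captured internal activation
z = encode(h)                     # sparse, hopefully monosemantic latents
h_hat = decode(z)                 # reconstruction fed back for evaluation
loss = np.mean((h - h_hat) ** 2)  # reconstruction objective to minimize
```

Because the encoder/decoder sit outside the recommendation model, reconstruction and latent inspection leave the original model untouched; interpreting a latent then amounts to relating the item sequences that activate it to a concept dictionary entry.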