Many current state-of-the-art models for sequential recommendation are based on transformer architectures. Interpreting and explaining such black-box models is an important research question, since a better understanding of their internals makes it possible to influence and control their behavior, which matters in a variety of real-world applications. Recently, sparse autoencoders (SAEs) have been shown to be a promising unsupervised approach for extracting interpretable features from neural networks. In this work, we extend SAEs to sequential recommender systems and propose a framework for interpreting and controlling model representations. We show that this approach can be successfully applied to a transformer trained on a sequential recommendation task: directions learned in this unsupervised regime turn out to be more interpretable and monosemantic than the original hidden-state dimensions. Further, we demonstrate a straightforward way to control the model's behavior effectively and flexibly, giving developers and users of recommender systems the ability to adjust recommendations to various custom scenarios and contexts.
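The core mechanics the abstract describes can be sketched in a few lines: a sparse autoencoder encodes a transformer hidden state into an overcomplete, non-negative feature vector, decodes it back, and a single decoder direction can be added to the hidden state to steer the model. This is a minimal illustrative sketch with random (untrained) weights and invented dimensions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper):
# d_model is the transformer hidden size, d_sae the overcomplete SAE width.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Randomly initialized SAE parameters; in practice these are trained
# to reconstruct hidden states under a sparsity penalty.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_features(h):
    """Encode a hidden state into sparse, non-negative feature activations."""
    return np.maximum(h @ W_enc + b_enc, 0.0)  # ReLU induces sparsity

def sae_reconstruct(f):
    """Decode feature activations back into the hidden-state space."""
    return f @ W_dec + b_dec

def steer(h, feature_idx, strength):
    """Shift a hidden state along one learned decoder direction,
    a simple way to control downstream recommendations."""
    return h + strength * W_dec[feature_idx]

h = rng.normal(size=d_model)        # a stand-in transformer hidden state
f = sae_features(h)                 # interpretable feature activations
h_hat = sae_reconstruct(f)          # reconstruction of the hidden state
h_steered = steer(h, feature_idx=3, strength=2.0)
```

In this framing, each column of `W_enc` (and the matching row of `W_dec`) corresponds to one candidate interpretable direction; steering simply adds a scaled decoder row to the hidden state before the model's output head.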