Sparse autoencoders (SAEs) have recently emerged as pivotal tools for introspecting large language models. SAEs can uncover high-quality, interpretable features at multiple levels of granularity and enable targeted steering of the generation process by selectively activating specific neurons in their latent space. Our paper is the first to apply this approach to collaborative filtering, aiming to extract similarly interpretable features from representations learned purely from interaction signals. In particular, we focus on a widely adopted class of collaborative filtering autoencoders (CFAEs) and augment them by inserting an SAE between their encoder and decoder networks. We demonstrate that the resulting representation is largely monosemantic and propose suitable mapping functions between semantic concepts and individual neurons. We also evaluate a simple yet effective method that uses this representation to steer recommendations in a desired direction.
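The described architecture can be sketched as follows: a CFAE encodes a user's interaction vector into a dense latent, an SAE re-encodes that latent into an overcomplete sparse code, and steering amounts to boosting selected SAE neurons before decoding. This is a minimal illustrative sketch with assumed (not the paper's) dimensions, weights, and function names:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Assumed sizes: 100 items, 16-dim CFAE latent, 64-dim (overcomplete) SAE code.
n_items, d_latent, d_sae = 100, 16, 64

# CFAE encoder/decoder weights (randomly initialized here for illustration).
W_enc = rng.normal(0, 0.1, (n_items, d_latent))
W_dec = rng.normal(0, 0.1, (d_latent, n_items))

# SAE inserted between the CFAE's encoder and decoder.
W_sae_enc = rng.normal(0, 0.1, (d_latent, d_sae))
W_sae_dec = rng.normal(0, 0.1, (d_sae, d_latent))

def forward(x, steer=None):
    z = relu(x @ W_enc)          # dense CFAE latent
    h = relu(z @ W_sae_enc)      # sparse SAE code (interpretable neurons)
    if steer is not None:
        h = h + steer            # steering: selectively boost SAE neurons
    z_hat = h @ W_sae_dec        # SAE reconstruction of the CFAE latent
    scores = z_hat @ W_dec       # item scores from the CFAE decoder
    return h, scores

# Binary interaction vector for one user.
x = (rng.random(n_items) < 0.1).astype(float)
h, scores = forward(x)
print(h.shape, scores.shape)  # (64,) (100,)
```

In a trained model, each SAE neuron would be mapped to a semantic concept, and steering toward that concept means adding mass to the corresponding coordinate of `h`; here the weights are random, so the sketch only demonstrates the data flow.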