Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local optima. To overcome these limitations, we introduce ADAPT, a hybrid method that combines beam search initialization with adaptive gradient-guided mutation and is designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.