We show that existing unsupervised methods applied to large language model (LLM) activations do not discover knowledge; instead, they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover it. We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (Burns et al., arXiv:2212.03827). We then present a series of experiments showing settings in which unsupervised methods yield classifiers that do not predict knowledge, but instead predict a different prominent feature. We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks for evaluating future knowledge-elicitation methods. Conceptually, we hypothesise that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.
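As a rough illustration of the consistency structure mentioned above, the following minimal sketch assumes the contrast-consistent search objective takes the form of a consistency term plus a confidence term over probe outputs on a contrast pair, as described by Burns et al. The function name `ccs_loss` and the synthetic arrays `truth` and `arbitrary` are hypothetical and chosen for illustration only. The sketch is not the paper's proof; it only shows that, under this assumed loss, a probe tracking an unrelated binary feature scores as well as one tracking the true label.

```python
import numpy as np


def ccs_loss(p_pos, p_neg):
    """Assumed CCS-style loss on probe outputs for a contrast pair.

    p_pos: probe probability on the "statement is true" completion.
    p_neg: probe probability on the "statement is false" completion.
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: the two probabilities should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


rng = np.random.default_rng(0)
n = 8
truth = rng.integers(0, 2, size=n)      # hypothetical ground-truth labels
arbitrary = rng.integers(0, 2, size=n)  # hypothetical unrelated binary feature

# A probe that outputs the true label on x+ and its negation on x-.
loss_truth = ccs_loss(truth, 1 - truth)
# A probe that outputs the unrelated feature on x+ and its negation on x-.
loss_arbitrary = ccs_loss(arbitrary, 1 - arbitrary)

print(loss_truth, loss_arbitrary)
```

Under the assumed loss, both probes are consistent and confident on every pair, so the objective by itself gives no reason to prefer the knowledge-tracking probe over the other.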