Training modern large language models (LLMs) now draws on a wide array of algorithms and datasets designed to elicit particular behaviors, making it critical to develop techniques for understanding the effects of datasets on a model's properties. This need is exacerbated by recent experiments showing that datasets can transmit signals that are not directly observable from individual datapoints, posing a conceptual challenge for dataset-centric understandings of LLM training and suggesting that a fundamental account of such phenomena is missing. Towards understanding these effects, and inspired by recent work on the linear structure of LLMs, we uncover a general mechanism through which hidden subtexts can arise in generic datasets. We introduce Logit-Linear-Selection (LLS), a method that prescribes how to select subsets of a generic preference dataset to elicit a wide range of hidden effects. We apply LLS to discover subsets of real-world datasets such that models trained on them exhibit behaviors ranging from holding specific preferences, to responding to prompts in a language not present in the dataset, to taking on a different persona. Crucially, the effect of a selected subset persists across models with varying architectures, supporting the generality of the mechanism.
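The abstract does not spell out the selection procedure, but the core idea of logit-linear subset selection can be illustrated with a minimal sketch: score each preference example by how strongly its logit-difference features align with a linear direction encoding the desired hidden behavior, then keep the top-scoring examples. All names here (`select_subset`, the feature representation, the target direction) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of logit-linear subset selection.
# Assumption: each preference example is summarized by a vector of
# logit differences (chosen minus rejected), and the hidden behavior
# corresponds to a linear direction in that same space.
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def select_subset(
    logit_diffs: List[List[float]],   # per-example logit-difference features
    target_direction: List[float],    # direction encoding the target behavior
    k: int,
) -> List[int]:
    """Return indices of the k examples whose logit-difference features
    align most strongly with the target behavior direction."""
    scores = [dot(x, target_direction) for x in logit_diffs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: 4 examples with 2-d features, target direction along axis 0.
feats = [[1.0, 0.0], [0.2, 0.9], [-0.5, 0.1], [0.8, -0.3]]
idx = select_subset(feats, [1.0, 0.0], k=2)
# idx == [0, 3]: the two examples most aligned with the target direction
```

Training on only the selected subset would then, per the paper's claim, bias the model toward the target behavior even though no individual datapoint reveals it.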