Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
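As a minimal sketch of the metric defined above, the following toy example sweeps Gaussian noise over synthetic "hidden states" and reports the first noise level at which a linear probe's accuracy falls below a collapse threshold. Everything specific here is an assumption made for illustration rather than the authors' protocol: the ridge least-squares probe, the isotropic Gaussian noise in raw activation units, the threshold placed halfway between chance and clean accuracy, and all names (`make_layer`, `fit_linear_probe`, `collapse_noise_level`, `demo`).

```python
import numpy as np


def make_layer(margin, n=400, d=16, seed=0):
    """Toy 'hidden states': class signal on one axis plus small isotropic noise."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = 0.1 * rng.standard_normal((n, d))
    X[:, 0] += margin * (2 * y - 1)  # classes sit at -margin and +margin on axis 0
    return X, y


def fit_linear_probe(X, y, ridge=1e-3):
    """Ridge least-squares linear probe (a stand-in for any linear classifier)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0  # map {0, 1} labels to {-1, +1}
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)


def probe_accuracy(w, X, y):
    """Fraction of examples whose sign under the probe matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    preds = (Xb @ w > 0).astype(int)
    return float((preds == y).mean())


def collapse_noise_level(w, X, y, sigmas, collapse_frac=0.5, n_draws=5, seed=0):
    """First sigma in `sigmas` at which mean noisy accuracy drops below the threshold.

    Assumed threshold: chance + collapse_frac * (clean accuracy - chance).
    Returns inf if accuracy never falls below it on the given grid.
    """
    rng = np.random.default_rng(seed)
    chance = max(y.mean(), 1.0 - y.mean())  # majority-class rate
    clean = probe_accuracy(w, X, y)
    threshold = chance + collapse_frac * (clean - chance)
    for sigma in sigmas:
        accs = [
            probe_accuracy(w, X + sigma * rng.standard_normal(X.shape), y)
            for _ in range(n_draws)
        ]
        if np.mean(accs) < threshold:
            return float(sigma)
    return float("inf")


def demo():
    """Compare two toy layers that differ only in class margin."""
    sigmas = np.linspace(0.1, 6.0, 60)
    results = {}
    for name, margin in [("narrow", 0.5), ("wide", 2.0)]:
        X_train, y_train = make_layer(margin, seed=0)
        X_eval, y_eval = make_layer(margin, seed=1)  # held-out draw
        w = fit_linear_probe(X_train, y_train)
        results[name] = (
            probe_accuracy(w, X_eval, y_eval),
            collapse_noise_level(w, X_eval, y_eval, sigmas),
        )
    return results


if __name__ == "__main__":
    for name, (acc, level) in demo().items():
        print(f"{name}: clean accuracy={acc:.3f}, collapse noise level={level:.2f}")
```

In this toy setup both layers are linearly separable with a wide gap relative to their intrinsic noise, so clean probe accuracy is near ceiling for each, while the noise level at which accuracy collapses differs with the margin. This mirrors, in miniature, the abstract's claim that a noise-based measure can keep separating representations after accuracy has saturated; it says nothing about the magnitudes seen in real models.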