We investigate the extent to which an LLM's hidden-state geometry can be recovered from its behavior in psycholinguistic experiments. Across eight instruction-tuned transformer models, we run two experimental paradigms (similarity-based forced choice and free association) over a shared 5,000-word vocabulary, collecting more than 17.5 million trials to build behavior-based similarity matrices. Using representational similarity analysis, we compare these behavioral geometries to layerwise hidden-state similarity and benchmark them against FastText, BERT, and cross-model consensus. We find that forced-choice behavior aligns substantially more closely with hidden-state geometry than free-association behavior does. In a held-out-words regression, behavioral similarity (especially from forced choice) predicts unseen hidden-state similarities beyond what lexical baselines and cross-model consensus explain, indicating that behavior-only measurements retain recoverable information about internal semantic geometry. Finally, we discuss what these results imply about the ability of behavioral tasks to uncover hidden cognitive states.
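As a rough illustration of the representational similarity analysis step, the sketch below compares two word-by-word similarity matrices by rank-correlating their off-diagonal entries. The abstract does not say which correlation measure or matrix construction is used, so the Spearman correlation over the strict upper triangle is an assumption (a common RSA convention), the helper names (`upper_triangle`, `average_ranks`, `rsa_score`) are hypothetical, and the toy numbers are invented for the example rather than taken from the study.

```python
from itertools import combinations
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square similarity matrix."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rsa_score(sim_a, sim_b):
    """Spearman correlation between off-diagonal entries of two similarity matrices.

    Assumed RSA variant: Pearson correlation computed on average ranks.
    """
    return pearson(
        average_ranks(upper_triangle(sim_a)),
        average_ranks(upper_triangle(sim_b)),
    )


# Toy 4-word example with made-up numbers (not data from the paper).
behavioral = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
hidden_state = [
    [1.0, 0.9, 0.4, 0.3],
    [0.9, 1.0, 0.5, 0.35],
    [0.4, 0.5, 1.0, 0.8],
    [0.3, 0.35, 0.8, 1.0],
]
score = rsa_score(behavioral, hidden_state)
```

In a layerwise comparison of the kind described, `rsa_score` would be evaluated once per layer, with `hidden_state` replaced by that layer's similarity matrix. Only the upper triangle is used because similarity matrices are symmetric and their diagonal is trivially 1.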