Circuit discovery and activation steering in transformers have developed as separate research threads, yet both operate on the same representational space. Are they two views of the same underlying structure? We show that they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them. This Circuit Fingerprint hypothesis enables circuit discovery without gradients or causal intervention, using geometric alignment alone. We validate it on standard benchmarks (IOI, SVA, MCQA) across four model families, where it achieves circuit discovery performance comparable to gradient-based methods. The same directions that identify circuit components also enable controlled steering: they reach 69.8\% emotion classification accuracy versus 53.1\% for instruction prompting while preserving factual accuracy. Beyond method development, this read-write duality reveals that transformer circuits are fundamentally geometric structures: interpretability and controllability are two facets of the same object.
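The read-write idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how answer directions are extracted or which components are scored. The sketch assumes that an answer direction is already available as a vector and that each candidate component (for example, an attention head) contributes one output vector in the same space. The function names `cosine_alignment`, `top_k_components`, and `steer` are hypothetical, and all data is synthetic.

```python
import numpy as np

def cosine_alignment(component_outputs, answer_direction):
    """Cosine similarity between each component's output and the answer direction."""
    unit = answer_direction / np.linalg.norm(answer_direction)
    norms = np.linalg.norm(component_outputs, axis=1)
    return component_outputs @ unit / np.maximum(norms, 1e-12)

def top_k_components(scores, k):
    """Indices of the k most aligned components, best first (the 'read' side)."""
    return np.argsort(-scores)[:k]

def steer(hidden, direction, alpha):
    """Add a scaled unit direction to a hidden state (the 'write' side)."""
    return hidden + alpha * direction / np.linalg.norm(direction)

# Synthetic stand-ins for real model activations (assumption: no model is loaded).
rng = np.random.default_rng(0)
d_model, n_components = 64, 12
answer_direction = rng.normal(size=d_model)
component_outputs = rng.normal(size=(n_components, d_model))

# Plant two components that write strongly along the answer direction.
component_outputs[3] += 5.0 * answer_direction
component_outputs[7] += 3.0 * answer_direction

scores = cosine_alignment(component_outputs, answer_direction)
selected = top_k_components(scores, k=2)
steered = steer(np.zeros(d_model), answer_direction, alpha=2.0)
```

In this toy setting the two planted components receive the highest alignment scores, and the same vector used for ranking is reused for steering. No gradients or interventions are computed at any point, which is the property the abstract emphasizes.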