Mechanistic Interpretability Papers - 专知 (Zhuanzhi)

Mechanistic Interpretability

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

Arxiv

0+ reads · June 13

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

Arxiv

0+ reads · May 14

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Arxiv

0+ reads · April 22

Mechanistic Interpretability of Antibody Language Models Using SAEs

Arxiv

0+ reads · April 24

Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy

Arxiv

0+ reads · February 19

Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees

Arxiv

0+ reads · February 18

Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability

Arxiv

0+ reads · February 7

Disentangling meaning from language in LLM-based machine translation

Arxiv

0+ reads · February 4

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Arxiv

0+ reads · February 3

Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models

Arxiv

0+ reads · February 3

On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

Arxiv

0+ reads · January 13

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Arxiv

0+ reads · January 20

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Arxiv

1+ reads · January 26

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Arxiv

0+ reads · January 22

Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability

Arxiv

0+ reads · January 29
