Sparse Autoencoder Papers - 专知

Sparse Autoencoders

Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

arXiv

0+ reads · June 23

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv

0+ reads · June 16

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv

0+ reads · June 16

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

arXiv

0+ reads · June 16

Stable and Steerable Sparse Autoencoders with Weight Regularization

arXiv

0+ reads · June 16

Rational Sparse Autoencoder

arXiv

0+ reads · June 16

Scalable Circuit Learning for Interpreting Large Language Models

arXiv

0+ reads · June 15

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

arXiv

0+ reads · June 15

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

arXiv

0+ reads · June 14

Analyzing Visual Aircraft Representations with Sparse Autoencoders

arXiv

0+ reads · June 13

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

arXiv

0+ reads · May 21

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

arXiv

0+ reads · June 10

From Tokens to Concepts: Leveraging SAE for SPLADE

arXiv

0+ reads · May 31

ICA Lens: Interpreting Language Models Without Training Another Dictionary

arXiv

0+ reads · June 10

Ensembling Sparse Autoencoders

arXiv

0+ reads · June 11
