Sparse Autoencoder Papers - Zhuanzhi

Sparse Autoencoders

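A sparse autoencoder is an unsupervised machine learning algorithm. It is trained by computing the error between the autoencoder's output and the original input, then repeatedly adjusting the autoencoder's parameters to reduce that error, while a sparsity constraint keeps most hidden units inactive for any given input. Autoencoders can be used to compress input information and to extract useful features from the input.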
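The training loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the method of any paper listed here: the single ReLU hidden layer, the L1 penalty on hidden activations, plain full-batch gradient descent, and all names and hyperparameters (`train_sparse_autoencoder`, `n_hidden`, `l1_coeff`, `lr`) are assumptions chosen for brevity.

```python
import numpy as np


def train_sparse_autoencoder(X, n_hidden=8, l1_coeff=1e-3, lr=0.01,
                             epochs=300, seed=0):
    """Fit a one-hidden-layer sparse autoencoder with full-batch gradient descent.

    Hypothetical helper for illustration: loss = mean squared reconstruction
    error per sample + l1_coeff * mean L1 norm of the hidden activations.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, size=(n_hidden, d))
    b_dec = np.zeros(d)
    losses = []

    for _ in range(epochs):
        # Forward pass: encode, then reconstruct the input.
        pre = X @ W_enc + b_enc
        Z = np.maximum(pre, 0.0)          # ReLU hidden activations
        X_hat = Z @ W_dec + b_dec
        err = X_hat - X

        # Reconstruction error plus sparsity penalty.
        recon = np.mean(np.sum(err ** 2, axis=1))
        sparsity = l1_coeff * np.mean(np.sum(np.abs(Z), axis=1))
        losses.append(recon + sparsity)

        # Backward pass (gradients of the loss above).
        g_xhat = 2.0 * err / n
        g_W_dec = Z.T @ g_xhat
        g_b_dec = g_xhat.sum(axis=0)
        g_Z = g_xhat @ W_dec.T + l1_coeff * np.sign(Z) / n
        g_pre = g_Z * (pre > 0)           # ReLU gradient mask
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)

        # Parameter update: adjust weights to reduce the loss.
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    params = {"W_enc": W_enc, "b_enc": b_enc, "W_dec": W_dec, "b_dec": b_dec}
    return params, losses


def encode(params, X):
    """Return the sparse hidden features for inputs X."""
    return np.maximum(X @ params["W_enc"] + params["b_enc"], 0.0)


# Toy usage on random data (synthetic, for illustration only).
X_demo = np.random.default_rng(1).random((200, 6))
params_demo, losses_demo = train_sparse_autoencoder(X_demo)
print(f"loss: {losses_demo[0]:.4f} -> {losses_demo[-1]:.4f}")
```

The hidden activations returned by `encode` are the "extracted features"; raising `l1_coeff` trades reconstruction accuracy for sparser features.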

Sparse Autoencoders for Sequential Recommendation Models: Interpretation and Flexible Control

arXiv

0+ views · February 17

Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

arXiv

0+ views · February 11

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

arXiv

0+ views · February 6

Sparse Autoencoders are Capable LLM Jailbreak Mitigators

arXiv

0+ views · February 12

Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

arXiv

0+ views · February 12

From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders

arXiv

0+ views · February 12

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

arXiv

0+ views · February 15

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

arXiv

0+ views · February 4

DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

arXiv

0+ views · February 5

GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering

arXiv

0+ views · February 4

SAFER: Probing Safety in Reward Models with Sparse Autoencoder

arXiv

0+ views · January 30

Sparse Autoencoder Features for Classifications and Transferability

arXiv

0+ views · February 2

Understanding Internal Representations of Recommendation Models with Sparse Autoencoders

arXiv

0+ views · January 26

Do Sparse Autoencoders Identify Reasoning Features in Language Models?

arXiv

0+ views · January 14

Do Sparse Autoencoders Identify Reasoning Features in Language Models?

arXiv

0+ views · January 16
