Model Safety Papers - 专知

Model Safety

NeST: Neuron Selective Tuning for LLM Safety

arXiv

0+ reads · February 18

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

arXiv

0+ reads · February 18

Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

arXiv

0+ reads · February 18

SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models

arXiv

0+ reads · February 6

DeepSight: An All-in-One LM Safety Toolkit

arXiv

0+ reads · February 12

Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

arXiv

0+ reads · February 10

SecureGate: Learning When to Reveal PII Safely via Token-Gated Dual-Adapters for Federated LLMs

arXiv

0+ reads · February 13

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

arXiv

0+ reads · February 6

The Rogue Scalpel: Activation Steering Compromises LLM Safety

arXiv

0+ reads · February 15

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

arXiv

0+ reads · February 12

Trust The Typical

arXiv

0+ reads · February 4

Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

arXiv

0+ reads · February 4

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

arXiv

0+ reads · February 2

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment

arXiv

0+ reads · January 30

Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models

arXiv

0+ reads · February 3
