Towards Ethical Multi-Agent Systems of Large Language Models: A Mechanistic Interpretability Perspective

Topics: Systems · Agent systems · Mechanistic interpretability · Multi-agent systems · Interpretability

December 4, 2025

Jae Hee Lee, Anne Lauscher, Stefano V. Albrecht

From arXiv. Accepted to LaMAS 2026 @ AAAI'26 (https://sites.google.com/view/lamas2026)

Large language models (LLMs) have been widely deployed in various applications, often functioning as autonomous agents that interact with each other in multi-agent systems. While these systems have shown promise in enhancing capabilities and enabling complex tasks, they also pose significant ethical challenges. This position paper outlines a research agenda aimed at ensuring the ethical behavior of multi-agent systems of LLMs (MALMs) from the perspective of mechanistic interpretability. We identify three key research challenges: (i) developing comprehensive evaluation frameworks to assess ethical behavior at individual, interactional, and systemic levels; (ii) elucidating the internal mechanisms that give rise to emergent behaviors through mechanistic interpretability; and (iii) implementing targeted parameter-efficient alignment techniques to steer MALMs towards ethical behaviors without compromising their performance.
