We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs, which rely on post-hoc interpretation methods that offer limited insight into neuron function, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. This innovation not only advances transparency in language models but also enhances their effectiveness. Our Automatic Concept Correction (ACC) strategy successfully narrows the performance gap with conventional black-box LLMs, positioning CB-LLM as a model that combines the high accuracy of traditional LLMs with the added benefit of clear interpretability -- a feature markedly absent in existing LLMs.