There are two broad approaches to building intelligent machine learning systems. One is to build inherently interpretable models, as pursued by the growing field of causal representation learning. The other is to build highly performant foundation models and then invest effort in understanding how they work. In this work, we relate these two approaches and study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we formally define a notion of concepts and show that they can be provably recovered from diverse data. Experiments on synthetic data and large language models show the utility of our unified approach.