Concept frustration: Aligning human concepts and machine representations

Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry, whereas conventional Euclidean comparisons fail to reveal it. Under a linear-Gaussian generative model, we derive a closed-form expression for the accuracy of the Bayes-optimal concept-based classifier, decomposing the predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and on real language and vision tasks demonstrate that frustration can be detected in foundation model representations, and that incorporating a frustrating concept into an interpretable model reorganises the geometry of the learned concept representations in a way that better aligns human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe, interpretable AI for high-risk applications.
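
To make the known-known, known-unknown and unknown-unknown decomposition concrete, the following is a minimal sketch and not the paper's derivation. It assumes a textbook setting: two equiprobable classes whose concept vectors are Gaussian with a shared covariance, for which the Bayes-optimal accuracy is Φ(Δ/2), where Δ² = δᵀΣ⁻¹δ is the squared Mahalanobis distance between class means. Splitting Σ⁻¹ into blocks over known and unknown concepts splits Δ² into three terms. The function names, the example covariance and the choice of which concepts count as known are all hypothetical.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_accuracy(delta, cov):
    """Bayes-optimal accuracy for two equiprobable Gaussian classes with
    mean difference `delta` and shared covariance `cov` (assumed setting)."""
    d2 = float(delta @ np.linalg.solve(cov, delta))
    return normal_cdf(math.sqrt(d2) / 2.0)


def decompose_signal(delta, cov, known):
    """Split delta^T cov^{-1} delta into known-known, known-unknown and
    unknown-unknown terms using blocks of the precision matrix."""
    known = np.asarray(known)
    unknown = np.setdiff1d(np.arange(len(delta)), known)
    precision = np.linalg.inv(cov)
    dk, du = delta[known], delta[unknown]
    kk = float(dk @ precision[np.ix_(known, known)] @ dk)
    ku = float(2.0 * dk @ precision[np.ix_(known, unknown)] @ du)
    uu = float(du @ precision[np.ix_(unknown, unknown)] @ du)
    return kk, ku, uu


# Hypothetical example: concepts 0 and 1 are known, concept 2 is unobserved
# and correlated with concept 0.
delta = np.array([1.0, 1.0, 1.0])
cov = np.array([[1.0, 0.0, 0.5],
                [0.0, 1.0, 0.0],
                [0.5, 0.0, 1.0]])
known = [0, 1]

kk, ku, uu = decompose_signal(delta, cov, known)
total = float(delta @ np.linalg.solve(cov, delta))

# Accuracy with every concept observed versus known concepts only.
acc_full = bayes_accuracy(delta, cov)
acc_known = bayes_accuracy(delta[known], cov[np.ix_(known, known)])

print(f"known-known={kk:.3f} known-unknown={ku:.3f} unknown-unknown={uu:.3f}")
print(f"accuracy (all concepts)={acc_full:.3f} accuracy (known only)={acc_known:.3f}")
```

In this toy setting the three terms sum to Δ², and the cross term can be negative when a known and an unknown concept are correlated. Whether this corresponds to the paper's definition of frustration is not something the sketch establishes.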
