Humans cannot always intuit which scenarios are most challenging for LLMs. Hoping to capture challenging edge cases, developers either design problems that are difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail to compose. We attribute this compositional failure to interference between salient features. Across tasks that require systematic composition (toy programmatic settings, multihop reasoning, and multilingual factual recall), we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them; when their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across these compositional tasks without evaluating specific inputs. These results lay the groundwork for using representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
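As a rough illustration of the geometric idea, the sketch below scores concept pairs by the absolute cosine similarity of their linear encodings and flags pairs that are far from orthogonal. It is not the procedure used in this paper: the function names, the toy three-dimensional directions, the use of absolute cosine similarity as an interference proxy, and the 0.5 threshold are all assumptions made for illustration. In practice the directions would have to be extracted from a model's hidden states by some separate method.

```python
from itertools import combinations

import numpy as np


def cosine_interference(u, v):
    """Absolute cosine similarity between two concept directions.

    Values near 0 mean near-orthogonal encodings; values near 1 mean
    the two directions overlap heavily (assumed proxy for interference).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("concept directions must be non-zero")
    return float(abs(np.dot(u, v)) / denom)


def rank_risky_pairs(directions, threshold=0.5):
    """Return (concept_a, concept_b, score) for pairs at or above threshold.

    `directions` maps a concept name to its (hypothetical) linear encoding.
    `threshold` is an illustrative cutoff, not a value from the paper.
    Results are sorted from highest to lowest score.
    """
    scored = [
        (a, b, cosine_interference(directions[a], directions[b]))
        for a, b in combinations(sorted(directions), 2)
    ]
    risky = [item for item in scored if item[2] >= threshold]
    return sorted(risky, key=lambda item: item[2], reverse=True)


# Toy directions, invented for illustration only.
directions = {
    "color": np.array([1.0, 0.0, 0.0]),
    "shape": np.array([0.0, 1.0, 0.0]),
    "hue": np.array([0.9, 0.1, 0.0]),
}

risky = rank_risky_pairs(directions)
for a, b, score in risky:
    print(f"{a} + {b}: interference score {score:.3f}")
```

In this toy setup only the "color" and "hue" pair is flagged, since their directions nearly coincide, while "shape" is orthogonal or nearly orthogonal to both. Because the scores depend only on the directions, no task inputs need to be evaluated to produce the ranking.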