Accuracy-based evaluation cannot reliably distinguish genuine generalization from shortcuts such as memorization, leakage, or brittle heuristics, especially in small-data regimes. In this position paper, we argue for mechanism-aware evaluation that combines task-relevant symbolic rules with mechanistic interpretability, yielding algorithmic pass/fail scores that show exactly where a model generalizes and where it merely exploits patterns. We demonstrate this on NL-to-SQL by training two identical architectures under different conditions: one without schema information (forcing memorization) and one with schema information (enabling grounding). Standard evaluation shows that the memorization model achieves 94% field-name accuracy on unseen data, falsely suggesting competence. Our symbolic-mechanistic evaluation reveals that this model violates core schema generalization rules, a failure invisible to accuracy metrics.
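To make the contrast between an accuracy score and a rule-based pass/fail score concrete, here is a minimal Python sketch. It is not the paper's rule set and it does not include any mechanistic-interpretability probe; it only checks model outputs against one hypothetical symbolic rule: every column referenced in the predicted SQL must appear in the schema supplied with the question. The names `Example`, `referenced_columns`, `field_name_accuracy`, and `passes_schema_rule`, the toy queries, and the idea of checking the rule on a renamed schema are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Keywords ignored by the toy column extractor (deliberately incomplete).
SQL_KEYWORDS = {"select", "where", "and", "or", "not", "as", "order", "by",
                "group", "having", "limit", "asc", "desc", "in", "like"}


@dataclass(frozen=True)
class Example:
    schema: frozenset   # column names made available to the model
    predicted_sql: str  # model output
    gold_sql: str       # reference query


def referenced_columns(sql: str) -> set:
    """Rough column extraction for simple single-table SELECT queries."""
    no_literals = re.sub(r"'[^']*'", "", sql)  # drop quoted string values
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*", no_literals)
    columns, skip_next = set(), False
    for token in tokens:
        low = token.lower()
        if skip_next:          # the token right after FROM is the table name
            skip_next = False
        elif low == "from":
            skip_next = True
        elif low not in SQL_KEYWORDS:
            columns.add(low)
    return columns


def field_name_accuracy(examples) -> float:
    """Fraction of examples whose predicted columns match the gold columns."""
    hits = sum(referenced_columns(e.predicted_sql) == referenced_columns(e.gold_sql)
               for e in examples)
    return hits / len(examples)


def passes_schema_rule(example: Example) -> bool:
    """Hypothetical rule: predicted columns must all exist in the given schema."""
    return referenced_columns(example.predicted_sql) <= set(example.schema)


# Toy data (invented): the same prediction, scored under the original schema
# and under a renamed schema that a grounded model would have to follow.
standard = [Example(frozenset({"name", "age"}),
                    "SELECT name FROM users WHERE age > 30",
                    "SELECT name FROM users WHERE age > 30")]
renamed = [Example(frozenset({"full_name", "years"}),
                   "SELECT name FROM users WHERE age > 30",
                   "SELECT full_name FROM users WHERE years > 30")]

if __name__ == "__main__":
    print("field-name accuracy (standard):", field_name_accuracy(standard))
    print("schema rule passes (renamed):",
          [passes_schema_rule(e) for e in renamed])
```

In this toy setup the prediction scores perfectly on field-name accuracy when the test schema happens to reuse familiar column names, yet fails the rule as soon as the schema is renamed. That is the kind of gap a pass/fail rule is meant to expose, although a real evaluation would need a proper SQL parser and rules tied to the model's internal mechanisms.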