Recent zero-shot evaluations have highlighted important limitations in the ability of language models (LMs) to perform meaning extraction. However, it is now well known that LMs can improve dramatically when given experimental contexts such as in-context examples and instructions. How well does this translate to previously studied meaning-sensitive tasks? We present a case study on the extent to which experimental contexts can improve LMs' robustness in performing property inheritance (predicting semantic properties of novel concepts), a task on which they have previously been shown to fail. By carefully controlling the nature of the in-context examples and the instructions, we find that these contexts can indeed elicit non-trivial property inheritance behavior in LMs. However, this ability is inconsistent: under a minimal reformulation of the task, some LMs were found to pick up on shallow, non-semantic heuristics from their inputs, suggesting that LMs have yet to master the computational principles of semantic property inference.