A distinction is often drawn between a model's ability to predict the label of an evaluation sample by directly memorising highly similar training samples and its ability to predict the label through some form of generalisation. In the context of using Language Models for question-answering, debate continues over the extent to which questions are answered through memorisation. We consider this issue for questions that would ideally be answered by reasoning over an associated context. We propose a method for identifying evaluation samples for which it is very unlikely that our model would have memorised the answers. Our method is based on the semantic similarity of input tokens and label tokens between training and evaluation samples. We show that our method offers advantages over some prior approaches in that it is able to surface evaluation-train pairs that overlap in either contiguous or discontiguous sequences of tokens. We use this method to identify unmemorisable subsets of our evaluation datasets. We train two Language Models in a multitask fashion, where the second model differs from the first only in that two additional datasets are added to its training regime. These datasets are designed to impart simple numerical reasoning strategies of a sort known to improve performance on some of our evaluation datasets but not on others. We then show that performance improves between the two models on the unmemorisable subsets of the evaluation datasets that were expected to benefit from the additional training datasets. Specifically, performance on the unmemorisable subsets of two of our evaluation datasets, DROP and ROPES, improves significantly, by 9.0% and 25.7% respectively, while the other evaluation datasets show no significant change in performance.
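The similarity-based filtering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which encoder, score combination, or cut-off is used, so the bag-of-words `bow_embed` stand-in, the equal-weight average of input and label similarity, the `threshold` of 0.6, and the function name `unmemorisable_indices` are all assumptions made for illustration. In practice a learned sentence encoder would presumably replace `bow_embed`.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Dict[str, float]
Sample = Tuple[str, str]  # (input text, label text)


def bow_embed(text: str) -> Vector:
    """Toy stand-in for a sentence encoder (assumption): word-count vector."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unmemorisable_indices(
    train: Sequence[Sample],
    evaluation: Sequence[Sample],
    threshold: float = 0.6,  # hypothetical cut-off, not taken from the paper
    embed: Callable[[str], Vector] = bow_embed,
) -> List[int]:
    """Return indices of evaluation samples whose most similar training
    sample scores below `threshold`.

    The score averages input similarity and label similarity with equal
    weight, which is an assumption of this sketch.
    """
    train_vecs = [(embed(x), embed(y)) for x, y in train]
    keep = []
    for i, (x, y) in enumerate(evaluation):
        ex, ey = embed(x), embed(y)
        best = max(
            (0.5 * (cosine(ex, tx) + cosine(ey, ty)) for tx, ty in train_vecs),
            default=0.0,  # with no training data, nothing can be memorised
        )
        if best < threshold:
            keep.append(i)
    return keep


train = [("how many yards was the longest field goal", "45 yards")]
evaluation = [
    ("how many yards was the longest field goal", "45 yards"),  # exact copy
    ("the longest field goal was how many yards", "45 yards"),  # reordered
    ("which planet has stronger gravity", "jupiter"),           # unrelated
]
print(unmemorisable_indices(train, evaluation))  # prints [2]
```

Because the comparison is made on whole-sample vectors instead of on shared n-grams, the reordered question is flagged as overlapping even though it shares few contiguous token sequences with the training question, which is the kind of discontiguous overlap the abstract refers to.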