Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, as in product-to-product recommendation on e-commerce platforms. Because users' interests and item inventory change over time, a text-matching system must generalize under data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large base language model on paired item relevance data (e.g., user clicks) can be counterproductive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting because, unlike in images, there are no universally spurious features in a text-matching task: the same token may be spurious or causal depending on the text it is being matched to. For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the causal effect of any token on the model's relevance score to be similar to its effect under the base model. Results on Amazon product and three question recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios when the base model is not accurate.
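The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes that a token's causal effect can be approximated by masking that token and measuring the change in the relevance score, and that the penalty is the mean squared gap between the fine-tuned and base models' effects. The toy scorer and the names `make_overlap_scorer`, `token_effects`, and `effect_regularizer` are hypothetical and stand in for real language models.

```python
from typing import Callable, Dict, List, Sequence

# A scorer maps (query tokens, item tokens) to a relevance score.
Scorer = Callable[[Sequence[str], Sequence[str]], float]


def make_overlap_scorer(weights: Dict[str, float]) -> Scorer:
    """Toy stand-in for a text-matching model (an assumption, not the paper's model).

    Relevance is the weighted overlap of the two token sets divided by the
    weighted size of their union; unlisted tokens get weight 1.0.
    """
    def score(query: Sequence[str], item: Sequence[str]) -> float:
        q, it = set(query), set(item)
        num = sum(weights.get(t, 1.0) for t in q & it)
        den = sum(weights.get(t, 1.0) for t in q | it)
        return num / den if den else 0.0
    return score


def token_effects(score: Scorer, query: Sequence[str], item: Sequence[str],
                  mask: str = "[MASK]") -> List[float]:
    """Intervene on each query token (replace it with a mask) and record
    how much the relevance score drops."""
    full = score(query, item)
    effects = []
    for i in range(len(query)):
        masked = list(query)
        masked[i] = mask
        effects.append(full - score(masked, item))
    return effects


def effect_regularizer(finetuned: Scorer, base: Scorer,
                       query: Sequence[str], item: Sequence[str]) -> float:
    """Mean squared gap between per-token effects of the two models
    (one assumed way to keep the fine-tuned effects close to the base)."""
    ft = token_effects(finetuned, query, item)
    ref = token_effects(base, query, item)
    if not ft:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(ft, ref)) / len(ft)


base_model = make_overlap_scorer({})
# Hypothetical fine-tuned model that has learned to over-rely on "cotton".
finetuned_model = make_overlap_scorer({"cotton": 5.0})

query = ["red", "cotton", "shirt"]
item = ["blue", "cotton", "shirt"]
penalty = effect_regularizer(finetuned_model, base_model, query, item)
print(f"regularization penalty: {penalty:.4f}")
```

In a training loop, a penalty of this kind would typically be added to the task loss with a weighting coefficient (a hypothetical hyperparameter here), so that fine-tuning is discouraged from assigning any token a much larger effect than the base model does.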