The NLP community typically relies on a model's performance on a held-out test set to assess generalization. Performance drops observed on datasets outside of official test sets are generally attributed to "out-of-distribution" effects. Here, we explore the foundations of generalizability and study the factors that affect it, drawing on generalizability lessons from clinical studies. In clinical research, generalizability depends on (a) the internal validity of experiments, which ensures controlled measurement of cause and effect, and (b) external validity, or the transportability of the results to the wider population. We argue for the need to ensure internal validity when building machine learning models in natural language processing, especially where results may be impacted by spurious correlations in the data. We demonstrate how spurious factors, such as the distance between entities in relation extraction tasks, can affect a model's internal validity and in turn adversely impact generalization. We also offer guidance on how to analyze generalization failures.