The NLP community typically relies on a model's performance on a held-out test set to assess generalization. Performance drops observed on datasets outside the official test sets are generally attributed to "out-of-distribution" effects. Here, we explore the foundations of generalizability and study the factors that affect it, drawing on lessons about generalizability from clinical studies. In clinical research, generalizability depends on (a) the internal validity of experiments, which ensures controlled measurement of cause and effect, and (b) external validity, or the transportability of the results to the wider population. We argue for the need to ensure internal validity when building machine learning models in natural language processing, especially where results may be affected by spurious correlations in the data. We demonstrate how spurious factors, such as the distance between entities in relation extraction tasks, can undermine a model's internal validity and in turn harm generalization. We also offer guidance on how to analyze generalization failures.
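One simple way to probe the entity-distance effect mentioned above is to stratify evaluation results by the token distance between the two entities and compare accuracy across strata. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function name, the example fields (`e1_pos`, `e2_pos`, `gold`, `pred`), and the bin edges are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def accuracy_by_entity_distance(examples, bin_edges=(5, 10, 20)):
    """Report accuracy per entity-distance bin.

    Assumed (hypothetical) example format: a dict with token positions
    `e1_pos` and `e2_pos`, a gold label `gold`, and a prediction `pred`.
    """

    def bin_label(distance):
        # Map a distance to a half-open bin such as "0-4", "5-9", or "20+".
        lower = 0
        for edge in bin_edges:
            if distance < edge:
                return f"{lower}-{edge - 1}"
            lower = edge
        return f"{lower}+"

    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        distance = abs(ex["e2_pos"] - ex["e1_pos"])
        label = bin_label(distance)
        totals[label] += 1
        correct[label] += int(ex["gold"] == ex["pred"])

    # Only bins that contain at least one example appear in the result.
    return {label: correct[label] / totals[label] for label in totals}
```

A large accuracy gap between bins may suggest that the model leans on entity distance rather than on the relation itself, although a gap alone does not establish this; it is a starting point for the kind of failure analysis the abstract describes.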