The advent of transformers, higher computational budgets, and big data has engendered remarkable progress in Natural Language Processing (NLP). The impressive performance of industry pre-trained models has garnered public attention in recent years and made news headlines. That these are industry models is noteworthy: rarely, if ever, do academic institutes produce exciting new NLP models. Using these models is critical for competing on NLP benchmarks and, correspondingly, for staying relevant in NLP research. We surveyed 100 papers published at EMNLP 2022 to determine whether this phenomenon constitutes a reliance on industry for NLP publications. We find that there is indeed a substantial reliance: citations of industry artifacts and contributions across categories are at least three times greater than industry publication rates per year. Quantifying this reliance does not settle how we ought to interpret the results. We discuss two possible perspectives: 1) Is collaboration with industry still collaboration in the absence of an alternative? Or 2) has free NLP inquiry been captured by the motivations and research direction of private corporations?