Recognizing software entities, such as library names, in free-form text is essential for many software engineering (SE) technologies, including traceability link recovery, automated documentation, and API recommendation. Many approaches have been proposed for this problem, but they suffer from small entity vocabularies or noisy training data, which hinders their ability to recognize software entities mentioned in sophisticated narratives. To address this challenge, we leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types, as well as a large labeled dataset of over 1.7M sentences. We then propose self-regularization, a noise-robust learning approach that trains our software entity recognition (SER) model by accounting for many dropouts. Results show that models trained with self-regularization outperform both their vanilla counterparts and state-of-the-art approaches on our Wikipedia benchmark and two Stack Overflow benchmarks. We release our models, data, and code for future research.
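The abstract does not spell out the self-regularization objective, so the following is only a minimal sketch of one common way to make training robust to noisy labels by using several dropout passes: run the same input through the model under different dropout masks, fit the (possibly noisy) label with cross-entropy, and add a penalty when the passes disagree with their average prediction. The linear classifier, the function names, and the parameters `passes`, `rate`, and `alpha` are all assumptions made for illustration and are not taken from the released code.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(features, rate, rng):
    # Inverted dropout: zero each feature with probability `rate`
    # and rescale the survivors. Requires 0 <= rate < 1.
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def forward(weights, features):
    # Hypothetical stand-in for the SER model: one linear layer,
    # one row of weights (one logit) per entity type.
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_regularized_loss(weights, features, label,
                          passes=2, rate=0.1, alpha=1.0, rng=None):
    # Assumed objective: mean cross-entropy over the dropout passes
    # plus alpha times the mean KL from the average prediction to each pass.
    if rng is None:
        rng = random.Random(0)
    preds = [forward(weights, dropout(features, rate, rng))
             for _ in range(passes)]
    ce = -sum(math.log(p[label] + 1e-12) for p in preds) / passes
    mean = [sum(col) / passes for col in zip(*preds)]
    agreement = sum(kl_divergence(mean, p) for p in preds) / passes
    return ce + alpha * agreement, ce, agreement
```

In this sketch the agreement term vanishes when `rate` is 0, because every pass then produces the same distribution, and `alpha` controls how strongly disagreement between passes is penalized relative to fitting the given label.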