An Empirical Study on Data Leakage and Generalizability of Link Prediction Models for Issues and Commits

To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches have primarily focused on improving prediction accuracy on randomly split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on both textual and metadata information of issues and commits. Next, to gauge the effect of time on model performance, we employ two splitting policies during both the training and testing phases: randomly split and temporally split datasets. Finally, in pursuit of a generic model that can demonstrate high performance across a range of projects, we undertake additional fine-tuning of LinkFormer within two distinct transfer-learning settings. Our findings indicate that, to simulate real-world scenarios effectively, researchers must preserve the temporal order of the data when training models. Furthermore, the results demonstrate that LinkFormer outperforms existing methodologies by a significant margin, achieving a 48% improvement in F1-measure within a project-based setting. Finally, the performance of LinkFormer in the cross-project setting is comparable to its average performance within the project-based scenario.

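The difference between the two splitting policies named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the splits were implemented, so everything here is an assumption made for illustration: the record layout, the hypothetical `created_at` field, and the 80/20 ratio. The point is only that a random split can place records in the training set that are newer than records in the test set, whereas a temporal split cannot.

```python
import random
from datetime import datetime


def random_split(records, test_ratio=0.2, seed=0):
    """Shuffle and split, ignoring timestamps (training may see 'future' data)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records, test_ratio=0.2):
    """Sort by timestamp so that every training record precedes every test record."""
    ordered = sorted(records, key=lambda r: r["created_at"])
    cut = len(ordered) - round(len(ordered) * test_ratio)
    return ordered[:cut], ordered[cut:]


# Hypothetical issue-commit link records, deliberately stored out of order.
records = [
    {"id": i, "created_at": datetime(2020, 1, 10 - i)} for i in range(10)
]

rand_train, rand_test = random_split(records)
temp_train, temp_test = temporal_split(records)

# In the temporal split, the newest training record is never newer
# than the oldest test record.
latest_train = max(r["created_at"] for r in temp_train)
earliest_test = min(r["created_at"] for r in temp_test)
```

Under these assumptions, comparing `latest_train` with `earliest_test` is a simple check for temporal leakage in any train/test partition.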