Evaluating Transfer Learning for Simplifying GitHub READMEs

Software documentation captures detailed knowledge about a software product, e.g., code, technologies, and design. It plays an important role in coordinating development teams and in conveying ideas to various stakeholders. However, software documentation can be hard to comprehend when it is written with jargon and complicated sentence structures. In this study, we explored the potential of text simplification techniques in the software engineering domain to automatically simplify GitHub README files. We collected a dataset of software-related GitHub README file pairs comprising 14,588 entries, aligned difficult sentences with their simplified counterparts, and trained a Transformer-based model to automatically simplify the difficult versions. To mitigate the sparse and noisy nature of this software-related simplification dataset, we applied general text simplification knowledge to the field. Since many general-domain difficult-to-simple Wikipedia document pairs are already publicly available, we explored transfer learning by first training the model on the Wikipedia data and then fine-tuning it on the README data. Using automated BLEU scores and human evaluation, we compared the performance of different transfer learning schemes against baseline models without transfer learning. The transfer learning model initialized from the best checkpoint trained on the general-topic corpus performed best, with a BLEU score of 34.68 and statistically significantly higher human annotation scores than the other schemes and baselines. We conclude that transfer learning is a promising direction for circumventing the lack of data and the style drift problem in simplifying software README files, and that it achieves a better trade-off between simplification and preservation of meaning.
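To make the automated metric concrete, the following is a minimal sketch of corpus-level BLEU. The abstract does not say which BLEU implementation or tokenization was used, so everything here is an assumption for illustration: whitespace tokenization, one reference per output, n-gram orders up to 4, and no smoothing. The names `ngrams` and `corpus_bleu` are hypothetical, and scores from established tools may differ because of tokenization and smoothing choices.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: inputs are lists of already-tokenized sentences
    (lists of strings); no smoothing is applied.
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, a zero-match order yields BLEU 0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes outputs shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical model output and reference simplification.
    output = ["run pip install to get the package".split()]
    reference = ["run pip install to install the package".split()]
    print(f"BLEU: {corpus_bleu(output, reference):.2f}")
```

Because BLEU only rewards n-gram overlap with the reference, it cannot tell whether an output is actually simpler or still faithful to the source, which is consistent with the study pairing it with human evaluation.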
