Software developers often turn to Stack Overflow (SO) to meet their programming needs. Given the abundance of relevant posts, navigating them and comparing different solutions is tedious and time-consuming. Recent work has proposed automatically summarizing SO posts into concise text to make them easier to navigate. However, these techniques rely only on information retrieval methods or heuristics for text summarization, which are insufficient to handle the ambiguity and sophistication of natural language. This paper presents ASSORT, a deep-learning-based framework for SO post summarization. ASSORT includes two complementary learning methods, ASSORT_S and ASSORT_{IS}, to address the lack of labeled training data for SO post summarization. ASSORT_S directly trains a novel ensemble learning model with BERT embeddings and domain-specific features to account for the unique characteristics of SO posts. By contrast, ASSORT_{IS} reuses pre-trained models while addressing the domain shift challenge when no training data is present (i.e., zero-shot learning). ASSORT_S and ASSORT_{IS} outperform six existing techniques by at least 13% and 7%, respectively, in terms of F1 score. Furthermore, a human study shows that participants significantly preferred summaries generated by ASSORT_S and ASSORT_{IS} over those of the best baseline, while the preference difference between ASSORT_S and ASSORT_{IS} was small.