Representation Learning for Stack Overflow Posts: How Far are We?

The tremendous success of Stack Overflow has produced an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the choice of representation model for Stack Overflow posts. As the literature on Stack Overflow continues to grow, the need for a powerful post representation model becomes clearer, driving researchers' interest in specialized representation models that can capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon popular neural architectures such as the convolutional neural network (CNN) and the Transformer (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill this research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. To find more suitable representation models for the posts, we further explore a diverse set of BERT-based models, including (1) general-domain language models (RoBERTa and Longformer) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, and seBERT). Our comparison also illustrates the "No Silver Bullet" concept, as none of the models consistently outperforms all the others. Inspired by the findings, we propose SOBERT, which employs a simple-yet-effective strategy to improve the best-performing model by continuing the pre-training phase with textual artifacts from Stack Overflow.
