AVIATE: Exploiting Translation Variants of Artifacts to Improve IR-based Traceability Recovery in Bilingual Software Projects

Traceability plays a vital role in software development by establishing traces between different types of artifacts (e.g., issues and commits in software repositories). Among approaches to automated traceability recovery, IR (Information Retrieval)-based approaches use textual similarity to estimate the likelihood of a trace between artifacts and show advantages in many scenarios. However, the globalization of software development has introduced new challenges, such as the same concept being expressed in more than one language in the artifact texts (e.g., "ShuXing" (属性) vs. "attribute"), which significantly hampers the performance of IR-based approaches. Existing research has shown that machine translation can help address this term inconsistency in bilingual projects. However, translation can also introduce synonymous terms that are inconsistent with those already used in the project (e.g., translating "ShuXing" as "property" rather than "attribute"). We therefore propose an enhancement strategy called AVIATE that exploits the translation variants produced by different translators, using the word pairs that appear simultaneously across the translation variants of different kinds of artifacts (a.k.a. consensual biterms). We use these biterms first to enrich the artifact texts and then to enhance the calculated IR values, improving IR-based traceability recovery for bilingual software projects. Experiments on 17 bilingual projects (involving English and 4 other languages) demonstrate that AVIATE significantly outperforms the IR-based approach with machine translation (the state of the art in this field), with average increases of 16.67 (31.43%) in Average Precision and 8.38 (11.22%) in Mean Average Precision, indicating its effectiveness in addressing the challenges of multilingual traceability recovery.
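
The idea of consensual biterms can be illustrated with a minimal sketch. It is not the AVIATE implementation; it rests on several assumptions that the abstract does not confirm: a biterm is taken to be an unordered pair of distinct tokens from one text, a biterm counts as consensual when it occurs in at least one translation variant of the issue and at least one translation variant of the commit, and enrichment simply appends the biterm words to the text. The function names `biterms`, `consensual_biterms`, and `enrich` are hypothetical.

```python
from itertools import combinations


def biterms(text):
    """Return the set of unordered word pairs (biterms) in one text."""
    tokens = sorted(set(text.lower().split()))
    return set(combinations(tokens, 2))


def consensual_biterms(issue_variants, commit_variants):
    """Biterms seen in some issue variant AND some commit variant.

    Assumed definition, for illustration only.
    """
    issue_side = set().union(*(biterms(v) for v in issue_variants))
    commit_side = set().union(*(biterms(v) for v in commit_variants))
    return issue_side & commit_side


def enrich(text, shared):
    """Append the words of each consensual biterm to the text."""
    extra = [word for pair in sorted(shared) for word in pair]
    return " ".join([text] + extra)


# Two hypothetical translators render the same issue differently.
issue_variants = ["fix attribute parser", "fix property parser"]
commit_variants = ["update attribute parser tests"]

shared = consensual_biterms(issue_variants, commit_variants)
enriched_issue = enrich(issue_variants[0], shared)
```

In this toy input only the pair ("attribute", "parser") is shared by both artifact kinds, so the variant that uses "property" contributes nothing, and the enriched issue text gives extra weight to the terms the project itself uses.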
