Recent advances in Transformer-based language models have shown significant potential for improving multilingual capabilities. This progress applies not only to natural language tasks but also to programming languages. Although these models can learn from multiple languages, evaluations typically focus on the same particular combinations of languages. In this study, we evaluate the similarity of programming languages by analyzing their representations in a CodeBERT-based model. Our experiments reveal that token representations in languages such as C++, Python, and Java lie close to one another, whereas the same tokens in languages such as Mathematica and R are significantly dissimilar. Our findings suggest that this disparity can lead to performance challenges when dealing with diverse languages. We therefore recommend using our similarity measure to select a diverse set of programming languages when training and evaluating future models.
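To make the idea of comparing languages through token representations concrete, the following is a minimal sketch and not the similarity measure used in this study. It assumes that each language is summarized by one vector per token and that two languages are compared by the mean cosine similarity over the tokens they share. The function names (`cosine`, `language_similarity`, `pairwise_similarity`) and the two-dimensional toy vectors are hypothetical and chosen only for illustration; in practice the vectors would come from a model such as CodeBERT.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity over the tokens present in both languages.

    Assumption: emb_a and emb_b map a token string to one vector each.
    """
    shared = sorted(set(emb_a) & set(emb_b))
    if not shared:
        raise ValueError("the two languages share no tokens")
    return sum(cosine(emb_a[t], emb_b[t]) for t in shared) / len(shared)


def pairwise_similarity(embeddings):
    """Similarity score for every unordered pair of languages."""
    return {
        (a, b): language_similarity(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


# Toy, made-up vectors (not real model outputs).
toy = {
    "cpp": {"if": [1.0, 0.0], "for": [0.0, 1.0]},
    "java": {"if": [0.9, 0.1], "for": [0.1, 0.9]},
    "r": {"if": [0.0, 1.0], "for": [1.0, 0.0]},
}
scores = pairwise_similarity(toy)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Under these assumptions, a low score for a pair marks the two languages as dissimilar, so a diverse evaluation set could be assembled by preferring languages with low pairwise scores.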