Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.
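The evaluation protocol described in the abstract (10-fold cross-validation on a balanced dataset, reporting accuracy and false positive rate) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `stratified_k_fold` and `accuracy_and_fpr` are hypothetical, and the label encoding (1 for independent/parallelizable, 0 for undefined) is an assumption made for illustration. The classifier itself is omitted; any model that returns one predicted label per loop could be plugged in.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions similar per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def accuracy_and_fpr(y_true, y_pred, positive=1):
    """Return (accuracy, false positive rate) for binary labels.

    Assumption: `positive` marks loops predicted as parallelizable, so a false
    positive is an undefined loop wrongly labelled as safe to parallelize.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr
```

In a full experiment, each `(train_idx, test_idx)` pair would drive one fine-tuning run, and the per-fold metrics would be averaged to obtain the mean values quoted in the abstract. The false positive rate is worth tracking separately because wrongly marking a loop as parallelizable can change program behavior, whereas a missed opportunity only costs performance.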