Debugging data races is a major challenge for students learning parallel programming due to the non-deterministic nature of concurrent execution and the complexity of shared-memory semantics. Recent advances in Large Language Models (LLMs) suggest that they could serve as AI teaching assistants, but the capabilities of lower-cost open-weight models for parallel debugging remain unclear. In this paper, we evaluate two Gemma4 open-weight models, Gemma4-E4B and Gemma4-31B, on their ability to identify, explain, and repair data races in OpenMP programs from the DataRaceBench benchmark suite. We also investigate whether contextual hints, including ThreadSanitizer (TSan) reports and model-generated explanations, improve repair quality. Our results show that Gemma4-E4B correctly explained 82 of 104 race-condition programs and successfully repaired 73, while Gemma4-31B achieved 100 correct explanations and 98 successful repairs. Surprisingly, additional context did not consistently improve repair effectiveness and sometimes reduced performance. These findings suggest that open-weight LLMs can provide valuable support for student self-debugging, with larger models offering near-complete coverage of the benchmark suite.