Despite its crucial role in research experiments, code correctness is often presumed solely on the basis of the perceived quality of the results. This assumption carries the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with an emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results; these results, however, can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.