Despite its pivotal role in research experiments, code correctness is often presumed solely on the basis of the perceived quality of the results. This presumption carries the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on result reproducibility should go hand in hand with an emphasis on coding best practices. We bolster our call to the NLP community by presenting a case study in which we identify (and correct) three bugs in widely used open-source implementations of the state-of-the-art Conformer architecture. Through comparative experiments on automatic speech recognition and translation in various language settings, we demonstrate that the existence of bugs does not prevent the achievement of good and reproducible results, yet can lead to incorrect conclusions that potentially misguide future research. In response, this study is a call to action toward the adoption of coding best practices aimed at fostering correctness and improving the quality of the developed software.