Rerunning a metric-based evaluation should be more straightforward than rerunning a human-based evaluation, and its results should be closer to the originals, especially where the original authors have made code and model checkpoints available. However, as this report of our efforts to rerun a metric-based evaluation of a set of single-attribute and multiple-attribute controllable text generation (CTG) techniques shows, such reruns do not always reproduce the original results, and they can reveal errors in the reporting of the original work.