Perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. Similarly, commercially available black-box APIs for detecting toxicity, such as the Perspective API, are not static, but are frequently retrained to address previously unattended weaknesses and biases. We evaluate the implications of these changes for the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. Our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. Rescoring all models from HELM, a widely respected living benchmark, for toxicity with the recent version of the API led to a different ranking of widely used foundation models. We suggest caution in applying apples-to-apples comparisons between studies and lay out recommendations for a more structured approach to evaluating toxicity over time. Code and data are available at https://github.com/for-ai/black-box-api-challenges.
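To make the ranking concern concrete, the following minimal sketch shows one way to check whether a model ranking survives rescoring. It is not the procedure used in this work: the model names and scores are hypothetical placeholders rather than HELM or Perspective API results, and the helpers `rank_models` and `kendall_tau` are illustrative names introduced here.

```python
from itertools import combinations


def rank_models(scores):
    """Return model names ordered from lowest to highest toxicity score."""
    return sorted(scores, key=lambda name: scores[name])


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score sets over the same models.

    Simple tau-a variant: tied pairs count as neither concordant nor discordant.
    """
    models = list(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both score sets order the pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores (NOT real measurements).
inherited = {"model_a": 0.10, "model_b": 0.20, "model_c": 0.30}  # older API version
rescored = {"model_a": 0.12, "model_b": 0.25, "model_c": 0.18}   # newer API version

print(rank_models(inherited))  # ranking under inherited scores
print(rank_models(rescored))   # ranking after rescoring
print(kendall_tau(inherited, rescored))
```

A correlation of 1 means both score sets order the models identically. Values below 1 indicate that at least one pair of models swaps places after rescoring, which is the situation in which comparisons based on inherited scores become unreliable.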