Following the rapid progress of natural language processing (NLP) models, language models are being applied to increasingly complex interactive tasks such as negotiation and conversation moderation. Having human evaluators directly interact with these models is essential for adequately evaluating their performance on such tasks. We develop BotEval, an easily customizable, open-source evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process, as opposed to human evaluators making judgments on static inputs. BotEval balances flexibility for customization and user-friendliness by providing templates for common use cases that span various degrees of complexity, along with built-in compatibility with popular crowdsourcing platforms. We showcase the useful features of BotEval through a study that evaluates the effectiveness of various chatbots for conversational moderation, and we discuss how BotEval differs from other annotation tools.