Dialogue systems for Automatic Differential Diagnosis (ADD) have a wide range of real-life applications. These systems are promising for providing easy access to care and reducing medical costs. Building end-to-end ADD dialogue systems requires dialogue training datasets. However, to the best of our knowledge, there is no publicly available ADD dialogue dataset in English (although non-English datasets exist). Driven by this, we introduce MDDial, the first differential diagnosis dialogue dataset in English, which can help build and evaluate end-to-end ADD dialogue systems. Additionally, earlier studies report the accuracy of diagnosis and of symptoms either individually or as a combined weighted score. This approach overlooks the connection between the symptoms and the diagnosis. We introduce a unified score for ADD systems that takes into account the interplay between symptoms and diagnosis. This score also indicates the system's reliability. To this end, we train two moderate-sized language models on MDDial. Our experiments suggest that while these language models can perform well on many natural language understanding tasks, including dialogue tasks in the general domain, they struggle to relate relevant symptoms to diseases and thus perform poorly on MDDial. MDDial will be released publicly to support ADD dialogue research.