Background: Large language models such as ChatGPT can generate grammatically flawless, human-like text, and a large amount of ChatGPT-generated text has already appeared on the Internet. However, medical texts such as clinical notes and diagnoses require rigorous validation, and erroneous medical content generated by ChatGPT could spread disinformation that seriously harms healthcare and the general public.

Objective: This research is among the first studies on responsible and ethical AIGC (Artificial Intelligence Generated Content) in medicine. We analyze the differences between medical texts written by human experts and those generated by ChatGPT, and design machine learning workflows to detect ChatGPT-generated medical texts and distinguish them from human-written ones.

Methods: We first construct a suite of datasets containing medical texts written by human experts and generated by ChatGPT. We then analyze the linguistic features of the two types of content and uncover differences in vocabulary, part of speech, dependency, sentiment, perplexity, and other features. Finally, we design and implement machine learning methods to detect medical text generated by ChatGPT.

Results: Medical texts written by humans are more concrete, more diverse, and typically contain more useful information, whereas medical texts generated by ChatGPT emphasize fluency and logic and usually use general terminology rather than information specific to the context of the problem. A BERT-based model can effectively detect medical texts generated by ChatGPT, with an F1 score exceeding 95%.
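To make the quantities named in the Methods and Results concrete, the following is a minimal sketch of three of them: vocabulary diversity, perplexity, and the F1 score. It is not the authors' pipeline. The whitespace tokenizer, the add-one-smoothed unigram model (a toy stand-in for the neural language model that a perplexity analysis would normally use), and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def type_token_ratio(tokens):
    # Vocabulary diversity: number of distinct tokens divided by total tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unigram_perplexity(tokens, reference_counts, vocab_size):
    # Perplexity of `tokens` under an add-one-smoothed unigram model whose
    # counts come from `reference_counts` (a toy stand-in, not a neural LM).
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = sum(reference_counts.values())
    log_prob = 0.0
    for tok in tokens:
        p = (reference_counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


def f1_score(y_true, y_pred, positive=1):
    # F1 is the harmonic mean of precision and recall for the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the paper's datasets.
    reference = Counter(tokenize("the patient reports chest pain and fever"))
    sample = tokenize("the patient reports mild fever")
    print(type_token_ratio(sample))
    print(unigram_perplexity(sample, reference, vocab_size=len(reference) + 1))
    print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))
```

In this sketch, lower perplexity means the text is more predictable under the reference model, and a higher type-token ratio means a more varied vocabulary. A detector's F1 is computed by treating "generated by ChatGPT" as the positive class.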