ChatGPT is becoming a new reality. In this paper, we show how to distinguish ChatGPT-generated publications from those written by scientists using xFakeBibs, a newly designed supervised machine learning algorithm. The algorithm was trained on 100 real publication abstracts, followed by a 10-fold calibration step that establishes a lower-upper bound range of acceptance. In the comparison, ChatGPT-generated content contributed merely 23\% of the bigram content, which is less than 50\% of any of the 10 calibrating folds. This analysis highlights a significant disparity in technical terms, where ChatGPT fell short of matching real science. When classifying individual articles, the xFakeBibs algorithm correctly identified 98 out of 100 ChatGPT-generated publications as fake, with 2 articles incorrectly classified as real. Although this work introduces an algorithmic approach that detects ChatGPT-generated fake science with a high degree of accuracy, detecting all fake records remains challenging. This work is a step in the right direction to counter fake science and misinformation.
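The train, calibrate, and classify workflow described above can be illustrated with a minimal sketch. This is not the xFakeBibs implementation: the abstract does not specify how bigram contribution is measured, so the sketch assumes a simple overlap ratio (the share of a text's bigrams that also appear in the training abstracts) and takes the acceptance range to be the minimum and maximum ratio across the calibrating folds. The function names `bigrams`, `overlap_ratio`, `calibrate`, and `looks_real` are hypothetical.

```python
import re


def bigrams(text):
    """Return the set of adjacent lowercase word pairs in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return set(zip(words, words[1:]))


def overlap_ratio(texts, reference):
    """Share of the bigrams in `texts` that also occur in `reference`.

    Assumption: this stands in for the "bigram contribution" named in the
    abstract, which does not define the exact measure.
    """
    found = set().union(*(bigrams(t) for t in texts))
    return len(found & reference) / len(found) if found else 0.0


def calibrate(training, calibration, k=10):
    """Build the reference bigram set and a lower-upper acceptance range.

    The calibration texts are split into k folds; the range is the minimum
    and maximum overlap ratio observed across the non-empty folds.
    """
    reference = set().union(*(bigrams(t) for t in training))
    folds = [calibration[i::k] for i in range(k)]
    ratios = [overlap_ratio(fold, reference) for fold in folds if fold]
    if not ratios:
        raise ValueError("calibration set is empty")
    return reference, min(ratios), max(ratios)


def looks_real(text, reference, lower, upper):
    """Accept a text only if its overlap ratio falls inside the range."""
    return lower <= overlap_ratio([text], reference) <= upper
```

Under these assumptions, a text whose ratio falls below the lower bound, as the ChatGPT-generated content did relative to the calibrating folds, would be flagged as fake.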