With the development of generative models like GPT-3, it is increasingly challenging to differentiate generated texts from human-written ones. Many studies have demonstrated good results in bot identification. However, most of these works depend on supervised learning methods that require labelled data and/or prior knowledge of the bot-model architecture. In this work, we propose a bot identification algorithm that is based on unsupervised learning techniques and does not depend on a large amount of labelled data. By combining findings from semantic analysis by clustering (crisp and fuzzy) with information-theoretic techniques, we construct a robust model that detects generated text for different types of bots. We find that generated texts tend to be more chaotic, while literary works are more complex. We also demonstrate that clustering human texts results in fuzzier clusters, in comparison to the more compact and well-separated clusters of bot-generated texts.
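To make the contrast between compact and fuzzy clusters concrete, the following is a minimal sketch and not the pipeline used in this work. It assumes plain fuzzy c-means on toy 2-D points that stand in for text embeddings, and it uses the partition coefficient (the mean of squared memberships) as one possible fuzziness measure. The function names, the deterministic initialisation, and the toy data are all illustrative assumptions.

```python
import math


def memberships(points, centers, m=2.0):
    """Fuzzy memberships: u_ik proportional to d_ik ** (-2 / (m - 1))."""
    rows = []
    for p in points:
        # Floor the distance so a point sitting on a centre cannot divide by zero.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def fuzzy_c_means(points, c, m=2.0, iters=100):
    """Plain fuzzy c-means; returns (centers, memberships)."""
    n, dim = len(points), len(points[0])
    # Deterministic initialisation from evenly spaced data points
    # (an assumption made for reproducibility, not a recommendation).
    centers = [points[i * (n - 1) // max(c - 1, 1)] for i in range(c)]
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(c):
            # Centre k is the mean of all points weighted by u_ik ** m.
            wk = [row[k] ** m for row in u]
            total = sum(wk)
            new_centers.append(tuple(
                sum(w * p[j] for w, p in zip(wk, points)) / total
                for j in range(dim)
            ))
        centers = new_centers
    return centers, memberships(points, centers, m)


def partition_coefficient(u):
    """Mean of squared memberships: 1/c when maximally fuzzy, 1.0 when crisp."""
    return sum(x * x for row in u for x in row) / len(u)


# Toy data (hypothetical): two tight blobs versus points spread along a line.
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
diffuse = [(float(i), 0.0) for i in range(6)]

_, u_tight = fuzzy_c_means(tight, 2)
_, u_diffuse = fuzzy_c_means(diffuse, 2)
pc_tight = partition_coefficient(u_tight)
pc_diffuse = partition_coefficient(u_diffuse)
print(pc_tight, pc_diffuse)
```

A partition coefficient close to 1 indicates compact, well-separated clusters, while lower values indicate fuzzier ones. In this toy setting the tight blobs score higher than the evenly spread points. Whether the same measure separates human and generated texts depends on the embedding and on the data, which this sketch does not address.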