This submission to the binary AI detection task is based on a modular stylometric pipeline. Public spaCy models are used for text preprocessing (tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features, namely the frequencies of n-grams of these linguistic annotations; a light gradient-boosting machine serves as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for training the classifier, and we explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive yet explainable line of work previously found effective.
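As a rough illustration of the feature type described above (frequencies of n-grams over linguistic annotations), the following minimal sketch computes relative bigram frequencies over part-of-speech tag sequences and arranges them into a fixed-width feature matrix. It is not the submission's implementation: the function names `ngram_frequencies` and `vectorise` are hypothetical, the tag sequences are hard-coded for the example, and only the standard library is used. In a pipeline like the one described, the tags would presumably come from a spaCy model (for instance via the `pos_` attribute of each token), and the resulting matrix would be passed to a gradient-boosting classifier.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_frequencies(tags: Sequence[str], n: int) -> Dict[NGram, float]:
    """Relative frequency of each n-gram in one sequence of annotation labels."""
    total = len(tags) - n + 1  # number of n-gram positions in the sequence
    if total <= 0:
        return {}  # sequence shorter than n: no n-grams to count
    counts = Counter(tuple(tags[i:i + n]) for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def vectorise(docs: Sequence[Sequence[str]], n: int) -> Tuple[List[NGram], List[List[float]]]:
    """Build a shared n-gram vocabulary and one frequency row per document."""
    per_doc = [ngram_frequencies(tags, n) for tags in docs]
    # Sorted vocabulary so that column order is deterministic.
    vocab = sorted({gram for freqs in per_doc for gram in freqs})
    rows = [[freqs.get(gram, 0.0) for gram in vocab] for freqs in per_doc]
    return vocab, rows


# Hard-coded POS tag sequences standing in for annotated texts (assumption).
doc_a = ["DET", "NOUN", "VERB", "DET", "NOUN"]
doc_b = ["PRON", "VERB", "DET", "NOUN"]

vocab, matrix = vectorise([doc_a, doc_b], n=2)
for gram, col in zip(vocab, zip(*matrix)):
    print(gram, [round(v, 3) for v in col])
```

Each column of `matrix` corresponds to one annotation n-gram, which is what keeps this kind of feature set inspectable: a feature importance reported by the classifier maps directly back to a named linguistic pattern.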