The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has sharpened the need for machine-generated content detectors that are accurate and efficient across domains. Current detectors, predominantly zero-shot methods such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often trading one for the other, leaving room for improvement. To address these gaps, we fine-tune encoder-only Small Language Models (SLMs), specifically the pre-trained RoBERTa and CodeBERTa models, on specialized source-code and natural-language datasets, and show that for the task of binary classification, fine-tuned SLMs substantially outperform LLM-based detectors at a fraction of the compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 of $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrasing, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs, together with a reproducibility checklist.
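As a minimal sketch of how the reported metrics are defined, the two headline numbers (AUROC and macro-F1) can be computed from per-example detector scores and binary labels in plain Python. The toy labels and scores below are illustrative assumptions, not the paper's data; label $1$ denotes machine-generated content.

```python
def auroc(labels, scores):
    """Threshold-free AUROC: fraction of (positive, negative) pairs where
    the positive example receives the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_f1(labels, preds):
    """Macro-F1: unweighted mean of per-class F1 over the two classes."""
    def f1(cls):
        tp = sum(y == cls and p == cls for y, p in zip(labels, preds))
        fp = sum(y != cls and p == cls for y, p in zip(labels, preds))
        fn = sum(y == cls and p != cls for y, p in zip(labels, preds))
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return (f1(0) + f1(1)) / 2

# Toy example (1 = machine-generated, 0 = human-written); illustrative only.
y = [0, 0, 0, 1, 1, 1]
s = [0.10, 0.35, 0.48, 0.55, 0.80, 0.95]   # detector scores
print(auroc(y, s))                               # 1.0: scores separate perfectly
print(macro_f1(y, [int(x >= 0.5) for x in s]))   # 1.0 at a 0.5 threshold
```

AUROC is threshold-free, so it is the natural metric for comparing detectors whose score scales differ; macro-F1 additionally fixes a decision threshold and weights both classes equally.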