We investigate untruthful responses from large language models (LLMs) using a set of 220 handcrafted linguistic features. Focusing on GPT-3 models, we find that the linguistic profiles of responses are similar across model sizes: LLMs of different sizes respond to a given prompt with similar linguistic properties. We build on this finding by training support vector machines that rely only on the stylistic components of model responses to classify the truthfulness of statements. Although the dataset size limits our current findings, we present promising evidence that truthfulness detection is possible without evaluating the content itself.
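To make the idea of content-agnostic stylistic features concrete, the following is a minimal sketch of how a response could be mapped to a handful of surface-level measurements. The function name `stylistic_features` and the four features it computes are illustrative assumptions; they are not the 220 features used in this work, and the tokenization is deliberately crude.

```python
import re


def stylistic_features(text: str) -> dict[str, float]:
    """Compute a few surface-level style features of a response.

    Hypothetical stand-ins for handcrafted linguistic features;
    none of them inspects what the text actually claims.
    """
    # Crude word tokenizer: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    # Crude sentence splitter: break on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    return {
        "n_tokens": float(n_tokens),
        "avg_word_len": sum(map(len, tokens)) / n_tokens if n_tokens else 0.0,
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        "avg_sentence_len": n_tokens / len(sentences) if sentences else 0.0,
    }


features = stylistic_features("The sky is blue. The sky is not green.")
```

Feature vectors of this kind, one per response, could then be passed to a support vector machine (for example, scikit-learn's `SVC`) together with truthfulness labels, so that the classifier never sees the content of the statement directly.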