There is a growing tendency for the wider public to rely on LLMs for financial or legal consultation and for medical and mental health support (Chatterji et al., 2025), often accepting the advice provided without seeking logical verification or empirical validation. A user may be fortunate enough to encounter a model with a particularly solid 'ground truth' or with auxiliary logic-symbolic reasoning capabilities, but this cannot be relied upon, and output is typically taken at face value. Yet careless reliance on AI to answer our questions and to judge our output violates Grice's Maxim of Quality as well as Lemoine's legal Maxim of Innocence. A low-sensitivity plagiarism scanner may commit a Type II error by failing to detect a difference that exists (the null hypothesis is wrongly retained). The fallacy of affirming the consequent occurs when this failure to detect a difference is then interpreted as evidence of equivalence, or as a demonstration of AI authorship. If the test is specified so that 'AI-generated' is effectively treated as the default H0, a finding of 'no difference from AI' is taken as support for that null. Such a mis-specified test treats students as guilty (of AI use or plagiarism) unless they can produce a sufficiently detectable difference from AI output, which yields false accusations when judged against a fair null hypothesis (that the student wrote the work). To prevent LLMs from becoming a sorcerer's apprentice, we need to know which inference systems are, or should become, integrated for an LLM to be a trustworthy sparring partner. We end on a wider perspective, in which the formalisation of the observer effect shows that uncertainty, classification, and interpretation are already shaped by the belief system, affective state, and tolerance for ambiguity of the human or artificial agent, rather than only at the stage of LLM output.
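The hypothesis-testing argument can be illustrated with a minimal numerical sketch. It rests on assumptions that are not part of the text above: a hypothetical detector returns a single 'difference from AI' score, that score is normally distributed with unit variance, AI output is centred at 0, and student work is centred at a hypothetical `effect`. The function name `false_accusation_rates` and all parameters are illustrative only.

```python
from statistics import NormalDist


def false_accusation_rates(effect: float, alpha: float = 0.05) -> dict:
    """Compare false accusation rates for genuine student work.

    Assumption (illustrative only): the detector's 'difference from AI'
    score is N(0, 1) for AI output and N(effect, 1) for student work.
    A small `effect` models a low-sensitivity scanner.
    """
    ai = NormalDist(0.0, 1.0)
    human = NormalDist(effect, 1.0)

    # Mis-specified test: H0 = "AI-generated". H0 is rejected only when
    # the score is unusually high for AI output.
    reject_above = ai.inv_cdf(1 - alpha)
    power = 1 - human.cdf(reject_above)  # chance a student's text is cleared
    type_ii = 1 - power                  # difference exists but is not detected

    # If "H0 not rejected" is read as "AI-authored", every Type II error
    # on genuine student work becomes a false accusation.
    misspecified_rate = type_ii

    # Fair test: H0 = "the student wrote the work". An accusation requires
    # a score that is unusually low for student work.
    accuse_below = human.inv_cdf(alpha)
    fair_rate = human.cdf(accuse_below)  # equals alpha by construction

    return {"power": power, "misspecified": misspecified_rate, "fair": fair_rate}


if __name__ == "__main__":
    for effect in (0.5, 3.0):
        print(effect, false_accusation_rates(effect))
```

Under these assumptions, the false accusation rate of the fair test is fixed at `alpha`, whereas under the mis-specified test it equals the Type II error rate and therefore grows as the scanner becomes less sensitive.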