In this paper, we examine challenges inherent to Large Language Models (LLMs) such as GPT-4, particularly their propensity for hallucinations, logical errors, and incorrect conclusions when answering complex questions. Because LLMs can present erroneous answers in a coherent and semantically rigorous manner, factual inaccuracies are harder to detect. This issue is especially pronounced in fields that require specialized expertise. Our work aims to improve the understanding and mitigation of such errors, and thereby the accuracy and reliability of LLMs in scientific and other specialized domains. Our findings reveal a non-linear relationship between the relevance of the context and the measured quality of the answers. In addition, we demonstrate that, with correct calibration, the grading procedure can be automated, a finding suggesting that, at least to some degree, LLMs can be used to assess the quality of their own performance. Finally, we describe an experimental platform that serves as a proof of concept for the techniques described in this work.