Hallucination detection in Large Language Models (LLMs) is crucial for ensuring their reliability. This work presents our participation in the CLEF ELOQUENT HalluciGen shared task, whose goal is to develop evaluators for both generating and detecting hallucinated content. We explored four LLMs for this purpose: Llama 3, Gemma, GPT-3.5 Turbo, and GPT-4. For the detection task, we additionally combined all four models via ensemble majority voting. The results provide valuable insights into the strengths and weaknesses of these LLMs in handling hallucination generation and detection tasks.
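The ensemble detection scheme mentioned above can be sketched as a simple majority vote over the per-model labels. This is a minimal illustration, not the authors' exact pipeline; the model names and the binary label strings are assumptions for the example.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among model predictions.

    With four voters a 2-2 tie is possible; here ties fall back to
    the label seen first, so a real system may want a tie-break rule.
    """
    return Counter(labels).most_common(1)[0][0]

# Hypothetical per-model predictions for a single example
# (label strings are illustrative, not from the shared task):
preds = {
    "llama3": "hallucination",
    "gemma": "hallucination",
    "gpt-3.5-turbo": "not_hallucination",
    "gpt-4": "hallucination",
}
print(majority_vote(list(preds.values())))  # prints "hallucination"
```

A deliberate design point of majority voting is that it needs no confidence scores, only discrete labels, which makes it easy to combine closed-weight APIs (GPT-3.5 Turbo, GPT-4) with open-weight models (Llama 3, Gemma).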