High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. We present an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets covering question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the limitations of current LLMs when they are evaluated in high-risk domains. This underscores the need not only to improve LLM capabilities but also to prioritize the refinement of domain-specific metrics and to adopt a more human-centric approach to enhance safety and factual reliability. Our findings draw attention to the problem of properly evaluating LLMs in high-risk domains, with the aim of steering LLMs toward fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.