Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems on crisis-domain translation, focusing on the preservation of urgency, a property critical to effective crisis communication and triage. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering more than 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely with the language of the prompt and of the input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.