Environmental conservation organizations routinely monitor news content on conservation in protected areas to maintain situational awareness of developments that can have an environmental impact. Existing automated media monitoring systems require large amounts of data labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such tools are most needed in the Global South, where news of interest is published mainly in local, low-resource languages and far fewer experts are available to annotate datasets sustainably. In this paper, we propose NewsSerow, a method to automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at most 10 demonstration news articles in Nepali, NewsSerow significantly outperforms other few-shot methods and achieves performance comparable to that of models fully fine-tuned on thousands of examples. The World Wide Fund for Nature (WWF) has deployed NewsSerow for media monitoring in Nepal, significantly reducing its operational burden and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow has also been deployed in countries with other languages, such as Colombia.
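The three stages named in the abstract (summarization, in-context few-shot classification, and self-reflection) can be pictured with a minimal sketch. This is not the authors' implementation: the prompt wording, the RELEVANT/IRRELEVANT label names, the way each stage feeds the next, and the `llm` callable (a hypothetical stand-in for any function that sends a prompt to a large language model and returns its reply) are all assumptions made for illustration. Only the stage order and the cap of 10 demonstration examples come from the abstract.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: takes a prompt string, returns the model's text reply.
LLM = Callable[[str], str]

MAX_DEMOS = 10  # the abstract reports using at most 10 demonstration articles


def parse_label(reply: str) -> str:
    """Map a free-text reply to one of two labels (label names are assumed)."""
    text = reply.strip().upper()
    # "IRRELEVANT" contains "RELEVANT", so rule out the negative label first.
    if "RELEVANT" in text and "IRRELEVANT" not in text:
        return "RELEVANT"
    return "IRRELEVANT"  # conservative default for unparseable replies


def summarize(article: str, llm: LLM) -> str:
    """Stage 1: condense the article so later prompts stay short."""
    return llm(f"Summarize this news article in a few sentences:\n{article}").strip()


def few_shot_classify(summary: str, demos: List[Tuple[str, str]], llm: LLM) -> str:
    """Stage 2: in-context classification from (summary, label) demonstrations."""
    shots = "\n\n".join(f"Summary: {s}\nLabel: {y}" for s, y in demos[:MAX_DEMOS])
    prompt = (
        "Label each news summary as RELEVANT or IRRELEVANT to environmental "
        f"conservation.\n\n{shots}\n\nSummary: {summary}\nLabel:"
    )
    return parse_label(llm(prompt))


def self_reflect(summary: str, label: str, llm: LLM) -> str:
    """Stage 3: ask the model to re-examine its own label and confirm or revise it."""
    prompt = (
        f"An article summary was labeled {label} with respect to environmental "
        f"conservation.\nSummary: {summary}\n"
        "Reconsider this decision and answer with the final label only: "
        "RELEVANT or IRRELEVANT."
    )
    return parse_label(llm(prompt))


def news_serow_sketch(article: str, demos: List[Tuple[str, str]], llm: LLM) -> str:
    """Chain the three stages; the chaining itself is an assumption of this sketch."""
    summary = summarize(article, llm)
    label = few_shot_classify(summary, demos, llm)
    return self_reflect(summary, label, llm)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider, and it lets the pipeline be exercised with a stub function before a real model is connected.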