In recent years, large language models (e.g., OpenAI's GPT-4, Meta's LLaMA, Google's PaLM) have become the dominant approach for building AI systems to analyze and generate language online. However, the automated systems that increasingly mediate our interactions online -- such as chatbots, content moderation systems, and search engines -- are primarily designed for English and work far more effectively in it than in the world's other 7,000 languages. Recently, researchers and technology companies have attempted to extend the capabilities of large language models into languages other than English by building what are called multilingual language models. In this paper, we explain how these multilingual language models work and explore their capabilities and limits. Part I provides a simple technical explanation of how large language models work, why there is a gap in available data between English and other languages, and how multilingual language models attempt to bridge that gap. Part II describes the challenges of doing content analysis with large language models in general and multilingual language models in particular. Part III offers recommendations for companies, researchers, and policymakers to keep in mind when researching, developing, and deploying large and multilingual language models.