Language models have become essential tools for improving efficiency in many professional tasks, such as writing, coding, and learning. It is therefore imperative to identify their inherent biases. In Natural Language Processing, five sources of bias are well identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting the tendency of these models to misrepresent spatial information, which distorts their representation of geographical distances. We introduce four indicators that assess these distortions by comparing geographical and semantic distances, and we evaluate ten widely used language models with them. The results underscore the need to inspect and rectify spatial biases in language models to ensure accurate and equitable representations.
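To make the idea of comparing geographical and semantic distances concrete, the following minimal sketch computes both for a few places and measures how well they agree. It is an illustration only and is not one of the four indicators introduced in this study: the function names, the use of Pearson correlation, and the toy embedding vectors are all assumptions, and in practice the vectors would come from the language model under inspection.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cosine_distance(u, v):
    """Semantic distance taken as 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def distance_agreement(places):
    """Correlate geographic and semantic distances over all pairs of places.

    `places` maps a name to ((lat, lon), embedding). This helper is a
    hypothetical illustration, not an indicator defined by the study.
    """
    geo, sem = [], []
    for a, b in combinations(places, 2):
        (coord_a, emb_a), (coord_b, emb_b) = places[a], places[b]
        geo.append(haversine_km(coord_a, coord_b))
        sem.append(cosine_distance(emb_a, emb_b))
    return pearson(geo, sem)


# Toy data: real coordinates, made-up 3-dimensional embeddings (assumption).
PLACES = {
    "Paris": ((48.8566, 2.3522), [0.9, 0.1, 0.2]),
    "Berlin": ((52.5200, 13.4050), [0.8, 0.3, 0.1]),
    "Tokyo": ((35.6762, 139.6503), [0.1, 0.9, 0.4]),
}

if __name__ == "__main__":
    print(f"agreement: {distance_agreement(PLACES):.3f}")
```

A value near 1 would mean that places far apart on the globe are also far apart in the embedding space; lower values would hint at the kind of distortion discussed above, although a single correlation hides where the distortion occurs.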