Wikipedia serves as a globally accessible knowledge source, with content in over 300 languages. Although they cover the same topics, the different language versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can affect the neutrality and reliability of the encyclopedia, as well as of AI systems that often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We develop a methodology to collect, align, and analyze tables from multilingual Wikipedia articles, and we define categories of inconsistency. We apply quantitative and qualitative metrics to assess multilingual alignment on a sample dataset. The findings have implications for factual verification, multilingual knowledge interaction, and the design of reliable AI systems that leverage Wikipedia content.
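To make the idea of a cross-lingual table inconsistency concrete, the following is a minimal sketch, not the study's actual method. It assumes two tables have already been extracted from different language editions and that their row keys and column headers have been mapped to a shared vocabulary. The function name `compare_tables`, the three category labels, and the placeholder data are all hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: row key -> {column name -> cell value}.
# Row keys and column names are assumed to be already aligned across languages.
Table = Dict[str, Dict[str, str]]


def compare_tables(a: Table, b: Table) -> List[Tuple[str, str, str]]:
    """Return (category, row_key, column) triples where the two tables differ.

    The categories are illustrative labels only:
      missing_row    - the row appears in just one of the two tables
      missing_cell   - the column appears in just one of the two rows
      value_mismatch - both cells exist but their text differs
    """
    issues: List[Tuple[str, str, str]] = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            issues.append(("missing_row", key, ""))
            continue
        row_a, row_b = a[key], b[key]
        for col in sorted(set(row_a) | set(row_b)):
            if col not in row_a or col not in row_b:
                issues.append(("missing_cell", key, col))
            elif row_a[col].strip() != row_b[col].strip():
                issues.append(("value_mismatch", key, col))
    return issues


# Placeholder data, not drawn from any real Wikipedia article.
table_en: Table = {
    "Team A": {"wins": "10", "losses": "2"},
    "Team B": {"wins": "7", "losses": "5"},
}
table_de: Table = {
    "Team A": {"wins": "10", "losses": "3"},
    "Team C": {"wins": "4", "losses": "8"},
}

for category, row_key, column in compare_tables(table_en, table_de):
    print(category, row_key, column)
```

In this sketch the comparison is an exact string match after trimming whitespace. A realistic pipeline would likely need to normalize numbers, dates, and units before comparing, which is left out here.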