Peer review plays a central role in the NLP publication process, but it is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies rather than on its scientific merit. Although reviewing guidelines explicitly flag such biases, they remain poorly understood. Prior work treats review comments that exhibit them as part of broader categories of weak or unconstructive reviews, without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce LOBSTER (Language-Of-study Bias in ScienTific pEer Review), a human-annotated dataset, together with a detection method that achieves 87.37 macro F1. We analyze 15,645 reviews to estimate how negative and positive bias rates vary with the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias and find that demanding unjustified cross-lingual generalization is the dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.