Topic inference for research proposals aims to select the most suitable disciplinary division from the discipline system defined by a funding agency. The agency then uses this division to find appropriate peer-review experts in its database. Automated topic inference can reduce the human errors introduced by manual topic filling, bridge the knowledge gap between funding agencies and project applicants, and improve system efficiency. Existing methods model the task as a hierarchical multi-label classification problem and use generative models to iteratively infer the most appropriate topic information. However, these methods overlook the gap in scale between interdisciplinary and non-interdisciplinary research proposals. As a result, the automated inference system tends to categorize interdisciplinary proposals as non-interdisciplinary, which causes unfairness during expert assignment. How can we address this data imbalance under a complex discipline system and thereby resolve the unfairness? In this paper, we implement a topic label inference system based on a Transformer encoder-decoder architecture. Furthermore, during training we use interpolation techniques to create a series of pseudo-interdisciplinary proposals from non-interdisciplinary ones, guided by non-parametric indicators such as cross-topic probabilities and topic occurrence probabilities. This approach aims to reduce the bias the system acquires during model training. Finally, we conduct extensive experiments on a real-world dataset to verify the effectiveness of the proposed method. The experimental results demonstrate that our training strategy can significantly mitigate the unfairness generated in the topic inference task.
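The interpolation idea described in the abstract can be illustrated with a minimal mixup-style sketch. The abstract does not specify the actual procedure, so everything here is an assumption made for illustration: the function name `make_pseudo_interdisciplinary`, the use of fixed-size proposal embeddings, the Beta-distributed mixing weight `alpha`, and the way the cross-topic probability matrix is used to pick a partner proposal. Topic occurrence probabilities, which the abstract also mentions, are not used in this sketch.

```python
import numpy as np


def make_pseudo_interdisciplinary(features, topics, cross_topic_prob, rng, alpha=0.4):
    """Mix each single-topic proposal with a partner from a different topic.

    features: (n, d) array of proposal embeddings (hypothetical representation).
    topics: (n,) int array, one topic id per non-interdisciplinary proposal.
    cross_topic_prob: (k, k) array; entry [a, b] is an assumed estimate of how
        likely topics a and b co-occur in a real interdisciplinary proposal.
    Returns mixed features (n, d) and soft multi-topic labels (n, k).
    """
    n, k = len(topics), cross_topic_prob.shape[0]
    one_hot = np.eye(k)[topics]
    mixed_x = np.empty_like(features, dtype=float)
    mixed_y = np.empty((n, k))
    for i in range(n):
        # Weight every candidate partner by the cross-topic probability
        # between this proposal's topic and the candidate's topic.
        w = cross_topic_prob[topics[i], topics].astype(float)
        w[topics == topics[i]] = 0.0  # partner must come from a different topic
        if w.sum() == 0.0:
            # No plausible partner topic: keep the proposal unchanged.
            mixed_x[i], mixed_y[i] = features[i], one_hot[i]
            continue
        j = rng.choice(n, p=w / w.sum())
        lam = rng.beta(alpha, alpha)  # mixing weight, as in standard mixup
        mixed_x[i] = lam * features[i] + (1 - lam) * features[j]
        mixed_y[i] = lam * one_hot[i] + (1 - lam) * one_hot[j]
    return mixed_x, mixed_y
```

Under these assumptions, a call such as `make_pseudo_interdisciplinary(features, topics, cross, np.random.default_rng(0))` returns one pseudo-interdisciplinary sample per input proposal, with a soft label spread over at most two topics. Topic pairs that never co-occur in `cross` are never mixed.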