Topic inference for research proposals aims to select the most suitable disciplinary division from the discipline system defined by a funding agency. The agency then uses this division to find appropriate peer-review experts in its database. Automated topic inference can reduce the human errors introduced by manual topic filling, bridge the knowledge gap between funding agencies and project applicants, and improve system efficiency. Existing methods model the task as a hierarchical multi-label classification problem and use generative models to iteratively infer the most appropriate topic information. However, these methods overlook the gap in scale between interdisciplinary and non-interdisciplinary research proposals. As a result, the automated inference system tends to categorize interdisciplinary proposals as non-interdisciplinary, which leads to unfair expert assignment. How can we address this data imbalance under a complex discipline system and thereby resolve the unfairness? In this paper, we implement a topic label inference system based on a Transformer encoder-decoder architecture. Furthermore, during training we use interpolation techniques to create a series of pseudo-interdisciplinary proposals from non-interdisciplinary ones, guided by non-parametric indicators such as cross-topic probabilities and topic occurrence probabilities. This approach aims to reduce the bias that the system acquires during training. Finally, we conduct extensive experiments on a real-world dataset to verify the effectiveness of the proposed method. The experimental results demonstrate that our training strategy can significantly mitigate the unfairness that arises in the topic inference task.
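The interpolation step can be illustrated with a minimal sketch. It is not the implementation from the paper: it assumes a mixup-style linear interpolation of two proposal embeddings, takes the union of their topic labels as the pseudo-interdisciplinary label set, and picks the partner topic in proportion to a cross-topic probability table. The function names, the table layout, the topic codes, and the mixing weight `lam` are all hypothetical.

```python
import random


def sample_partner_topic(topic, cross_topic_prob, rng):
    """Pick a partner topic in proportion to an assumed cross-topic probability table.

    cross_topic_prob is a hypothetical dict: topic -> {partner topic: probability}.
    """
    partners = cross_topic_prob[topic]
    names = sorted(partners)
    weights = [partners[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def mix_proposals(emb_a, labels_a, emb_b, labels_b, lam):
    """Linearly interpolate two proposal embeddings (mixup-style assumption).

    The pseudo-interdisciplinary label set is the union of both label sets.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [lam * x + (1.0 - lam) * y for x, y in zip(emb_a, emb_b)]
    return mixed, sorted(set(labels_a) | set(labels_b))


# Toy usage with made-up topic codes and two-dimensional embeddings.
rng = random.Random(0)
cross = {"A01": {"F02": 0.8, "C03": 0.2}}
partner = sample_partner_topic("A01", cross, rng)
mixed, labels = mix_proposals([1.0, 0.0], ["A01"], [0.0, 1.0], [partner], lam=0.5)
```

In this sketch the cross-topic table only decides which pairs of non-interdisciplinary proposals get mixed. How the occurrence probabilities enter the sampling, and where in the encoder-decoder the interpolation is applied, are left open here.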