Text data augmentation has emerged as a key remedy for data imbalance in Natural Language Processing. Such imbalance is prevalent among the research proposals submitted during funding application processes. Arising from the varying popularity of disciplines and the emergence of interdisciplinary studies, it significantly impedes the precision of downstream topic models that infer the disciplines to which these proposals belong. At the data level, proposals written by experts and scientists are inherently complex technical texts, replete with intricate terminology, so augmenting such specialized text poses unique challenges. At the system level, the loss of precision in turn compromises the fairness of AI-assisted reviewer assignment systems, which underscores the importance of solving this problem. This study uses a large language model (Llama V1) as a data generator to augment research proposals categorized within intricate disciplinary hierarchies, aiming to rectify data imbalance and improve the fairness of expert assignment. We first sample within the hierarchical structure to find the under-represented classes. We then design a prompt for keyword-based research proposal generation. Our experiments attest to the efficacy of the generated data: research proposals produced with the prompt address the issues above and constitute high-quality scientific text, thus helping the model overcome the imbalance.
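The two steps described in the abstract (finding under-represented classes in the hierarchy, then building a keyword-based prompt) might look roughly like the following Python sketch. It is illustrative only and not the authors' implementation: the toy corpus, the helper names `make_proposals`, `under_represented`, and `build_prompt`, the median-based cutoff rule, and the prompt wording are all assumptions, and no call to a language model is shown.

```python
from collections import Counter
from statistics import median


def make_proposals(path, keywords, n):
    """Create n toy proposal records sharing one label path and keyword list."""
    return [{"path": path, "keywords": list(keywords)} for _ in range(n)]


# Hypothetical toy corpus: (top-level field, sub-discipline) label paths.
proposals = (
    make_proposals(("Information Science", "Artificial Intelligence"),
                   ["graph neural network", "node classification"], 4)
    + make_proposals(("Chemistry", "Catalysis"),
                     ["zeolite", "reaction kinetics"], 4)
    + make_proposals(("Information Science", "Quantum Computing"),
                     ["qubit", "error correction"], 1)
)


def under_represented(records, ratio=0.5):
    """Return leaf label paths with fewer than ratio * median(count) records.

    The cutoff rule is an assumption made for illustration only.
    """
    counts = Counter(r["path"] for r in records)
    cutoff = ratio * median(counts.values())
    return sorted(path for path, n in counts.items() if n < cutoff)


def build_prompt(path, keywords):
    """Fill a hypothetical keyword-based prompt template for one class."""
    return (
        "Write a research proposal abstract in the discipline "
        f"'{' > '.join(path)}'. "
        f"It should cover the following keywords: {', '.join(keywords)}."
    )


rare_paths = under_represented(proposals)
prompts = []
for path in rare_paths:
    # Pool the keywords seen in the existing proposals of this class.
    pooled = sorted(
        {k for r in proposals if r["path"] == path for k in r["keywords"]}
    )
    prompts.append(build_prompt(path, pooled))

for prompt in prompts:
    print(prompt)
```

Each resulting prompt string would then be passed to the generator model to produce synthetic proposals for the corresponding under-represented class; how many samples to generate per class is left open here.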