This study introduces MedGen, a comprehensive natural language processing (NLP) toolkit for medical text processing. MedGen gives biomedical researchers and healthcare professionals an easy-to-use, all-in-one solution that requires minimal programming expertise. It provides (1) four advanced generative functions, included here for the first time: question answering, text summarization, text simplification, and machine translation; (2) 12 essential basic NLP functions, such as word tokenization and sentence segmentation; and (3) user-friendly query and search functions over text corpora. We fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks, and conducted manual reviews with clinicians. We also standardized and integrated functions from third-party libraries alongside the newly introduced query and search functions. The toolkit, its models, and associated data are publicly available at https://github.com/Yale-LILY/MedGen.