Data scarcity and technological limitations for resource-poor languages in developing countries such as India pose a threat to the development of sophisticated NLU systems for healthcare. To assess the current status of state-of-the-art language models in healthcare, this paper first proposes two healthcare datasets, Indian Healthcare Query Intent-WebMD and 1mg (IHQID-WebMD and IHQID-1mg), together with one real-world Indian hospital query dataset, in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati), all annotated with query intents and entities. Our aim is to detect query intents and extract the corresponding entities. We perform extensive experiments on a set of models in various realistic settings and explore two scenarios: access to English data only (less costly) and access to target-language data (more expensive). We analyze context-specific practical relevance through empirical analysis. The results, expressed in terms of overall F1 score, show that our approach is practically useful for identifying intents and entities.