Code-mixing, the integration of lexical and grammatical elements from multiple languages within a single sentence, is a widespread linguistic phenomenon, particularly prevalent in multilingual societies. In India, social media users frequently hold code-mixed conversations in the Roman script, especially members of migrant communities who form online groups to share relevant local information. This paper focuses on the challenges of extracting relevant information from code-mixed conversations, specifically those written in Roman-transliterated Bengali mixed with English. We present a novel approach to these challenges: a mechanism that automatically identifies the most relevant answers in code-mixed conversations. We experiment with a dataset comprising queries and documents from Facebook, together with Query Relevance files (QRels). We prompt GPT-3.5 Turbo and use the sequential nature of relevant documents to frame a mathematical model that helps detect the documents relevant to a query. Our results demonstrate the effectiveness of this approach in extracting pertinent information from complex, code-mixed digital conversations, contributing to the broader field of natural language processing in multilingual and informal text environments.