Question answering (QA) in the legal domain is challenging because legal documents are far more complex than ordinary texts in their terminology, structure, and temporal and logical relationships. Legal QA is even harder for low-resource languages such as Vietnamese, where labeled data are scarce and pre-trained language models remain limited. In this paper, we address these limitations by implementing a Vietnamese article-level retrieval-based legal QA system and by introducing a novel method that improves language model performance by improving data quality through weak labeling. Our hypothesis is that, where labeled data are limited, efficient data enrichment can increase overall performance. Our experiments are designed to test multiple aspects of the proposed technique, and the results demonstrate its effectiveness.