This paper introduces an approach for building a Named Entity Recognition (NER) model on top of the Bidirectional Encoder Representations from Transformers (BERT) architecture, specifically using the SlovakBERT model. The NER model extracts address parts from speech-to-text transcriptions. Because real data is scarce, we generated a synthetic dataset using the GPT API. We emphasize the importance of mimicking the variability of spoken language in this artificial data. The NER model, trained solely on synthetic data, is evaluated on a small real test dataset.
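To illustrate what extracting address parts from a transcription can look like, the following minimal sketch groups token-level tags into labeled spans. It assumes a BIO tagging scheme and the hypothetical labels `STREET`, `HOUSE_NUMBER`, and `CITY`; neither the scheme nor the label set is specified above, and the example sentence and its tags are written by hand for illustration, not produced by the model.

```python
from typing import List, Tuple


def group_address_parts(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, text) spans.

    The BIO scheme and the label names are assumptions made for this sketch.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_label = None
    current_words: List[str] = []

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_words)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A "B-" tag, or an "I-" tag that does not continue the open span,
            # starts a new span.
            flush()
            current_label, current_words = label, [token]
        elif prefix == "I":
            # Continue the open span with the same label.
            current_words.append(token)
        else:
            # "O" (outside any entity) closes the open span.
            flush()
            current_label, current_words = None, []

    flush()
    return spans


# Hand-written example imitating a spoken Slovak utterance
# (the house number is spelled out as a word, as in a transcription).
tokens = ["bývam", "na", "Hlavnej", "ulici", "pätnásť", "v", "Košiciach"]
tags = ["O", "O", "B-STREET", "I-STREET", "B-HOUSE_NUMBER", "O", "B-CITY"]

for label, text in group_address_parts(tokens, tags):
    print(f"{label}: {text}")
```

In a real pipeline, the tags would come from the token-classification head of the fine-tuned model, and subword pieces would need to be merged back into words before this grouping step.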