One of the prominent problems in processing text data is its lack of uniformity. Because dialects and languages vary, translation quality is often low. This creates a distinctive problem when applying NLP to text data: spelling variation arising from inconsistent translation and transliteration. The problem is aggravated by human error, since a proper noun from an Indian language can be written in English in several ways. Translating such proper nouns is further complicated because some of them are also used as common nouns and may be translated literally. NLP applications that rely on addresses, names, and other proper nouns encounter this problem frequently. We propose a method that clusters the spelling variations of proper nouns using machine learning techniques and mathematical similarity measures. We use Affinity Propagation to determine the relative similarity between tokens, and refine the results by filtering each token-variation pair against a similarity threshold. With this method, we were able to reduce the spelling variations by a considerable amount. The application can significantly reduce the human annotation effort needed for data cleansing and formatting.
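The pipeline described in the abstract (pairwise similarity, Affinity Propagation, then a threshold filter on token-variation pairs) can be illustrated with a minimal sketch. This is not the authors' implementation. The similarity measure (`difflib.SequenceMatcher`), the median preference, the damping factor, the iteration count, the 0.7 threshold, and all function names are assumptions chosen for illustration, since the abstract does not specify them. The input tokens are assumed to be deduplicated.

```python
from difflib import SequenceMatcher
from statistics import median


def similarity(a, b):
    """Stand-in string similarity in [0, 1] (assumption: the source names no measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def affinity_propagation(S, damping=0.8, iterations=200):
    """Plain-Python Affinity Propagation; returns the indices of the exemplars."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k) for i' not in {i, k})
        # a(k, k) = sum of positive r(i', k) for i' != k
        for k in range(n):
            pos = [max(R[i][k], 0.0) for i in range(n)]
            pos[k] = R[k][k]
            total = sum(pos)
            for i in range(n):
                new = total - pos[i]
                if i != k:
                    new = min(0.0, new)
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    return [k for k in range(n) if A[k][k] + R[k][k] > 0]


def cluster_variants(tokens, threshold=0.7):
    """Return (exemplar, variant, score) pairs that pass the similarity threshold."""
    n = len(tokens)
    if n < 2:
        return []
    S = [[similarity(a, b) for b in tokens] for a in tokens]
    # Assumed preference: the median off-diagonal similarity, a common default.
    preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        S[i][i] = preference
    # If no exemplar emerges, treat every token as its own cluster.
    exemplars = affinity_propagation(S) or list(range(n))
    pairs = []
    for i, token in enumerate(tokens):
        if i in exemplars:
            continue
        k = max(exemplars, key=lambda e: S[i][e])  # most similar exemplar
        if S[i][k] >= threshold:  # filter weak token-variation pairs
            pairs.append((tokens[k], token, S[i][k]))
    return pairs


if __name__ == "__main__":
    names = ["Bangalore", "Bengaluru", "Banglore", "Chennai", "Chenai", "Mumbai", "Mumbay"]
    for exemplar, variant, score in cluster_variants(names):
        print(f"{variant} -> {exemplar} ({score:.2f})")
```

In this sketch the threshold filter acts after clustering, so a token that Affinity Propagation assigns to a dissimilar exemplar is simply left unmapped rather than merged. Which exemplars emerge depends on the assumed preference and damping values.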