Emojis are widely used across social media platforms, but they are often lost in noisy or garbled text, which poses challenges for data analysis and machine learning. Conventional preprocessing approaches recommend removing such text, at the risk of losing the emojis and their contextual meaning. This paper proposes a three-step reverse-engineering methodology to retrieve emojis from garbled text in social media posts. The methodology also identifies the reasons such text is generated during social media data mining. To evaluate its effectiveness, the approach was applied to a dataset of 509,248 Tweets about the Mpox outbreak; about 30 prior works have referenced this dataset, and they failed to retrieve the emojis from its garbled text. The proposed method retrieved 157,748 emojis from 76,914 Tweets. Improvements in text readability and coherence were demonstrated using metrics such as Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index, Automated Readability Index, Dale-Chall Readability Score, Text Standard, and Reading Time. Additionally, the frequency of individual emojis and their patterns of usage in these Tweets were analyzed, and the results are presented.
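The abstract does not spell out the three steps, so the following is only a minimal sketch of the general idea of recovering emojis from garbled text, not the paper's method. It assumes one common cause of garbling: UTF-8 bytes that were mistakenly decoded as Windows-1252 (cp1252). The names `recover_emojis`, `_repair`, and `MOJIBAKE_EMOJI`, as well as the sample text, are hypothetical and chosen for illustration.

```python
import re

# Assumption: the garbling comes from UTF-8 bytes decoded as Windows-1252.
# Most emojis are four-byte UTF-8 sequences starting with bytes F0 9F, which
# cp1252 renders as U+00F0 U+0178, followed by two more mis-decoded characters.
MOJIBAKE_EMOJI = re.compile("\u00f0\u0178..", re.DOTALL)


def _repair(match):
    """Try to reverse the wrong decoding for one candidate sequence."""
    chunk = match.group(0)
    try:
        return chunk.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a reversible sequence; leave the text untouched.
        return chunk


def recover_emojis(text):
    """Replace every recoverable garbled sequence with the original emoji."""
    return MOJIBAKE_EMOJI.sub(_repair, text)


# Simulate the garbling on a sample post, then reverse it.
original = "Stay safe 😂😷 #mpox"
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)
print(recover_emojis(garbled))
```

Repairing only the matched sequences, rather than re-encoding the whole string, keeps the rest of the post intact even when it contains characters that cp1252 cannot represent. Emojis whose UTF-8 bytes include values undefined in cp1252 would need a different fallback, which this sketch does not attempt.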