Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013--2022) for topic discovery. The keyword-driven online data collection yields a significant amount of irrelevant content. We explore four approaches to filtering relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each target language, (2) translating unlabelled data from all languages into English to build a single model based on English annotations, (3) applying multilingual transformers fine-tuned on English directly to each target language's data, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modelling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
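As a rough illustration of how approaches (1) and (2) differ, the following minimal Python sketch contrasts translating the annotations ("translate-train") with translating the unlabelled tweets ("translate-test"). Everything in it is an assumption made for illustration: `to_target` and `to_english` are hypothetical stand-ins for a machine translation system, the bag-of-words scorer is a toy replacement for the classifiers used in the study, and the example texts are invented. Approaches (3) and (4) rely on multilingual transformers and are not sketched here.

```python
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label): 1 = hydrogen-energy relevant, 0 = irrelevant


def to_target(text: str) -> str:
    """Hypothetical stand-in for English -> target-language machine translation."""
    return " ".join("xx_" + tok for tok in text.split())


def to_english(text: str) -> str:
    """Hypothetical stand-in for target-language -> English machine translation."""
    return " ".join(tok[3:] if tok.startswith("xx_") else tok for tok in text.split())


def train(examples: List[Example]) -> Counter:
    """Toy bag-of-words model: a token gains +1 per relevant and -1 per irrelevant occurrence."""
    weights: Counter = Counter()
    for text, label in examples:
        for tok in text.lower().split():
            weights[tok] += 1 if label == 1 else -1
    return weights


def predict(weights: Counter, text: str) -> int:
    """Label a text as relevant (1) when its summed token weights are positive."""
    score = sum(weights[tok] for tok in text.lower().split())
    return 1 if score > 0 else 0


def translate_train(annotations: List[Example], tweets: List[str]) -> List[int]:
    """Approach (1): translate the English annotations, train a target-language model."""
    translated = [(to_target(text), label) for text, label in annotations]
    model = train(translated)
    return [predict(model, tweet) for tweet in tweets]


def translate_test(annotations: List[Example], tweets: List[str]) -> List[int]:
    """Approach (2): train one English model, translate each tweet into English first."""
    model = train(annotations)
    return [predict(model, to_english(tweet)) for tweet in tweets]


# Invented English annotations; keyword collection on "hydrogen" also catches hair bleach.
english_annotations: List[Example] = [
    ("hydrogen fuel cell cars", 1),
    ("green hydrogen plant opens", 1),
    ("hydrogen peroxide hair bleach", 0),
    ("hair bleach tips", 0),
]

# Invented unlabelled tweets in the placeholder target language.
target_tweets = [to_target("hydrogen fuel plant"), to_target("peroxide hair bleach")]

labels_translate_train = translate_train(english_annotations, target_tweets)
labels_translate_test = translate_test(english_annotations, target_tweets)
print(labels_translate_train, labels_translate_test)
```

In this toy setting both routes agree because the placeholder translation is lossless. With a real translation system the two routes can diverge, since translation errors affect the training data in one case and the inference data in the other.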