This paper presents an embedding-based approach to detecting linguistic variation that does not rely on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms into families using a combination of cosine similarity and n-gram similarity. Spelling and morphological diversity can thus be analysed as linguistic structure rather than treated as noise. Applied to a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, yet it produces transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variation in multilingual and small-language contexts.
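The grouping step described above can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not say how the two similarities are combined, so everything here is an assumption made for illustration: the n-gram measure is taken to be Jaccard overlap of character trigrams, the two scores are mixed linearly with a hypothetical weight `alpha`, and pairs above a hypothetical `threshold` are merged into families by single-link grouping. The hand-written three-dimensional vectors stand in for trained subword embeddings, and the English word forms are placeholders, not corpus data.

```python
from itertools import combinations
from math import sqrt


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with boundary marks."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-gram sets (an assumed n-gram measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combined_score(a, b, vectors, alpha=0.5):
    """Linear mix of distributional and surface similarity (alpha is hypothetical)."""
    return alpha * cosine(vectors[a], vectors[b]) + (1 - alpha) * ngram_similarity(a, b)


def induce_families(vectors, threshold=0.6, alpha=0.5):
    """Merge every pair scoring at or above the threshold (single-link, union-find)."""
    parent = {word: word for word in vectors}

    def find(word):
        # Follow parent links to the representative, compressing the path.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in combinations(vectors, 2):
        if combined_score(a, b, vectors, alpha) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for word in vectors:
        families.setdefault(find(word), set()).add(word)
    return list(families.values())


# Toy stand-ins for trained subword embeddings (placeholder values, not real data).
toy_vectors = {
    "colour": [0.90, 0.10, 0.00],
    "color":  [0.85, 0.15, 0.05],
    "colur":  [0.80, 0.20, 0.00],
    "collar": [0.10, 0.90, 0.20],  # similar spelling, different distribution
    "house":  [0.00, 0.10, 0.95],
}

for family in induce_families(toy_vectors):
    print(sorted(family))
```

In this toy setup, "collar" shares its opening trigrams with the "colour" variants but has a dissimilar vector, so its combined score stays below the threshold. This is the motivation for combining the two signals: surface overlap alone would risk merging unrelated look-alikes, while embedding similarity alone would risk merging distributionally similar but formally unrelated words.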