Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.