When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

This study asks (i) when LLM-generated synthetic data helps multi-label patent classification and (ii) why. Answering the first question requires controlling for the gain that comes from a larger training set alone. The experiments cover six open-source LLMs (3.8B to 12B parameters) and four real-data regimes on a classification task with 64 WIPO labels for assistive technologies. Two generation methods, full synthesis conditioned on the label set and paraphrasing, are each combined with three categories of classifier. The headline improvement in micro F1 for BERT-for-Patents, from 0.120 to 0.702, mainly reflects a volume effect: a control that replicates 165 examples with replacement reaches 0.678. The improvement over this control is therefore +0.024, while the improvement over the best baseline (focal loss reweighting) is +0.219. The second finding is that the relationship between fidelity scores and classification performance changes with the real-data regime. In low real-data regimes the volume effect dominates, and the correlation between maximum mean discrepancy (MMD) and classification performance is r = +0.95. As more real data is used the correlation inverts, reaching r = -0.73 in the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [+0.96, +1.00]). Under a fixed data budget, combining about 20-30% real data with 70-80% synthetic data outperforms both purely synthetic and purely real allocations. Moreover, a corpus that raises raw micro F1 by as much as +0.58 can still degrade a Jaccard-overlap retrieval proxy. Genre variation across prompt families may explain part of this effect, but nDCG@10 still drops by 26% when the standard-patent filter is applied.
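
The volume adjustment and the correlation comparison reported above can be illustrated with a minimal sketch. The function names are hypothetical, the comparison assumes the two correlations come from independent samples, and the per-regime sample size `n = 14` is an assumed placeholder, since the abstract does not state how many generator configurations enter each correlation.

```python
import math


def volume_adjusted_gain(synthetic_f1: float, duplication_f1: float) -> float:
    """Gain of synthetic augmentation over a size-matched duplication control."""
    return synthetic_f1 - duplication_f1


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent Pearson correlations.

    Each correlation is mapped through the Fisher r-to-z transform (atanh);
    the difference is then divided by its approximate standard error.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# Micro F1 values quoted in the abstract.
raw_gain = 0.702 - 0.120                              # gain over the low-resource starting point
adjusted_gain = volume_adjusted_gain(0.702, 0.678)    # gain over the duplication control
print(f"raw gain: {raw_gain:+.3f}, volume-adjusted gain: {adjusted_gain:+.3f}")

# Correlations quoted in the abstract; n = 14 per regime is an ASSUMPTION.
z = fisher_z_difference(0.95, 14, -0.73, 14)
print(f"Fisher z for the change in correlation: {z:+.2f}")
```

Separating the raw gain from the gain over the duplication control is what distinguishes a volume effect from a benefit specific to synthetic text; the z statistic depends strongly on the assumed sample sizes and should be recomputed with the actual counts.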
