PEARL: Prototype-Enhanced Alignment for Label-Efficient Representation Learning with Deployment-Driven Insights from Digital Governance Communication Systems

Ruiyu Zhang, Lin Nie, Wai-Fung Lam, Qihao Wang, Xin Zhao

From arXiv, 15 pages, 1 figure

In many deployed systems, new text inputs are handled by retrieving similar past cases, for example when routing and responding to citizen messages in digital governance platforms. When these systems fail, the problem is often not the language model itself, but that the nearest neighbors in the embedding space correspond to the wrong cases. Modern machine learning systems increasingly rely on fixed, high-dimensional embeddings produced by large pretrained models and sentence encoders. In real-world deployments, labels are scarce, domains shift over time, and retraining the base encoder is expensive or infeasible. As a result, downstream performance depends heavily on embedding geometry. Yet raw embeddings are often poorly aligned with the local neighborhood structure required by nearest-neighbor retrieval, similarity search, and lightweight classifiers that operate directly on embeddings. We propose PEARL (Prototype-Enhanced Aligned Representation Learning), a label-efficient approach that uses limited supervision to softly align embeddings toward class prototypes. The method reshapes local neighborhood geometry while preserving dimensionality and avoiding aggressive projection or collapse. Its aim is to bridge the gap between purely unsupervised post-processing, which offers limited and inconsistent gains, and fully supervised projections that require substantial labeled data. We evaluate PEARL under controlled label regimes ranging from extreme label scarcity to higher-label settings. In the label-scarce condition, PEARL substantially improves local neighborhood quality, yielding 25.7% gains over raw embeddings and more than 21.1% gains relative to strong unsupervised post-processing, precisely in the regime where similarity-based systems are most brittle.

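The abstract describes softly aligning fixed embeddings toward class prototypes while keeping their dimensionality, but it does not give the update rule. The sketch below is one plausible reading, not the authors' implementation. It assumes prototypes are the means of a few L2-normalized labeled embeddings, and that each embedding is pulled toward a softmax-weighted mixture of prototypes. The function names and the `alpha` (pull strength) and `temperature` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def class_prototypes(labeled_emb, labels):
    """Return (classes, prototypes): one unit-norm mean vector per class."""
    classes = np.unique(labels)
    z = l2_normalize(labeled_emb)
    protos = np.stack([z[labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(protos)


def soft_align(emb, protos, alpha=0.3, temperature=0.1):
    """Pull each embedding part of the way toward a soft prototype mixture.

    Assumed update (not taken from the paper):
        z' = normalize((1 - alpha) * z + alpha * sum_k w_k * c_k)
    where w is a softmax over cosine similarities to the prototypes c_k.
    The output has the same shape as the input, so dimensionality is kept.
    """
    z = l2_normalize(emb)
    logits = (z @ protos.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    target = weights @ protos  # soft prototype mixture per embedding
    return l2_normalize((1.0 - alpha) * z + alpha * target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(100, 16))   # stand-in for encoder outputs
    few_labels = np.array([0, 1, 2] * 2)      # six labeled examples
    _, prototypes = class_prototypes(embeddings[:6], few_labels)
    aligned = soft_align(embeddings, prototypes)
    print(aligned.shape)  # same shape as the input embeddings
```

In this sketch, `alpha=0` leaves the normalized embeddings unchanged and `alpha=1` replaces each one with its prototype mixture, so intermediate values trade off the original geometry against prototype alignment. A partial pull of this kind is one way to avoid the collapse that the abstract warns against.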