UniSage: A Unified and Post-Analysis-Aware Sampling for Microservices

Traces and logs serve as the backbone of observability in microservice architectures, yet their sheer volume imposes prohibitive storage and computational burdens. To reduce overhead, operators rely on sampling; however, current frameworks generally employ a sample-before-analysis strategy. This approach creates a fundamental trade-off: to save space, systems must discard data before knowing its diagnostic value, often losing critical context required for troubleshooting anomalies and latency spikes. In this paper, we propose UniSage, a unified sampling framework that addresses this trade-off by adopting a post-analysis-aware paradigm. Unlike prior works that focus solely on tracing, UniSage integrates both traces and logs, leveraging a lightweight anomaly detection and root cause analysis module to scan the full data stream before sampling decisions are made. This pre-computation enables a dual-pillar strategy: an analysis-guided sampler that retains high-value data associated with detected anomalies, and an edge-case sampler that preserves rare but critical behaviors to ensure diversity. Evaluation on three datasets confirms that UniSage achieves superior data retention. At a 2.5% sampling rate, UniSage captures 71% of critical traces and 96.25% of relevant logs, substantially exceeding the best existing methods (which achieve 42.9% and 1.95%, respectively). Moreover, evaluations on a real-world dataset demonstrate UniSage's efficiency: it processes a 20-minute multi-modal data block in an average of 10 seconds, making it practical for production environments.
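
The dual-pillar strategy described above can be illustrated with a minimal sketch. This is not UniSage's implementation; it only shows the general shape of sampling after analysis under stated assumptions. The `Trace` type, the `anomaly_score` and `pattern` fields, the `sample_block` function, and the `guided_share` budget split are all hypothetical names introduced for illustration. The sketch assumes an upstream analysis step has already scored every trace in a block, then spends part of the sampling budget on the highest-scoring traces and the rest on traces whose pattern is rarest.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    anomaly_score: float  # hypothetical output of an upstream analysis step
    pattern: str          # hypothetical signature, e.g. the service call path


def sample_block(traces, rate, guided_share=0.5):
    """Keep about rate * len(traces) traces from one fully analysed block."""
    budget = max(1, round(rate * len(traces)))
    guided_budget = int(budget * guided_share)  # assumed split between pillars

    # Pillar 1 (analysis-guided): keep the traces the analysis scored highest.
    by_score = sorted(traces, key=lambda t: t.anomaly_score, reverse=True)
    kept = {t.trace_id: t for t in by_score[:guided_budget]}

    # Pillar 2 (edge-case): fill the remaining budget with the rarest patterns.
    freq = Counter(t.pattern for t in traces)
    for t in sorted(traces, key=lambda t: freq[t.pattern]):
        if len(kept) >= budget:
            break
        kept.setdefault(t.trace_id, t)
    return list(kept.values())
```

Because scoring happens before selection, a trace with a high score or an unusual pattern cannot be discarded by chance, which is the property a sample-before-analysis scheme lacks. How a real system would split the budget or define rarity is left open here.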
