With increasingly volatile market conditions and rapid product innovation, operational decision-making for large-scale systems entails solving thousands of problems, each with limited data. Data aggregation combines data across problems to improve on the decisions obtained by solving each problem individually. We propose a novel cluster-based Shrunken-SAA approach that exploits the cluster structure among problems when aggregating data. We prove that, as the number of problems grows, leveraging a known cluster structure yields additional benefits over data aggregation approaches that neglect such structure. When the cluster structure is unknown, we show that uncovering it, even at the cost of a few data points, can be beneficial, especially when the distance between clusters of problems is substantial. The proposed approach extends to general cost functions under mild conditions. When the number of problems is large, the optimality gap of the proposed approach decreases exponentially in the distance between the clusters. We evaluate the approach through numerical experiments on managing newsvendor systems. Using synthetic data, we investigate how the choice of distance metric between problem instances affects the performance of the cluster-based Shrunken-SAA approach. We further validate the approach with real data and highlight the advantages of cluster-based data aggregation over existing approaches, especially in the small-data, large-scale regime.
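To make the idea concrete, the following is a minimal sketch for the newsvendor setting, not the authors' implementation. It assumes that shrinkage mixes each problem's empirical demand distribution with an anchor distribution, giving the problem's own data total weight N_k and the anchor total weight alpha, and that the anchor is the pooled empirical distribution of the problem's cluster. The function names (`weighted_quantile`, `cluster_shrunken_saa`) and the parameters `alpha` and `critical_ratio` are hypothetical. The sketch also assumes the cluster labels are known and the shrinkage amount is given rather than tuned from data.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Return the smallest value whose normalized cumulative weight reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    sorted_values = values[order]
    cdf = np.cumsum(weights[order]) / weights.sum()
    # Small tolerance guards against round-off when the CDF hits q exactly.
    idx = np.searchsorted(cdf, q - 1e-12, side="left")
    return sorted_values[min(idx, len(sorted_values) - 1)]


def cluster_shrunken_saa(demands, labels, alpha, critical_ratio):
    """Newsvendor order quantities with cluster-level shrinkage (illustrative).

    demands:        list of 1-D arrays, one demand sample per problem
    labels:         cluster label of each problem (assumed known)
    alpha:          shrinkage amount, alpha >= 0 (assumed given, not tuned here)
    critical_ratio: underage / (underage + overage) cost ratio, in (0, 1)
    """
    labels = np.asarray(labels)
    orders = np.empty(len(demands))
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Assumed anchor: pooled empirical distribution of the cluster.
        pooled = np.concatenate(
            [np.asarray(demands[k], dtype=float) for k in members]
        )
        for k in members:
            own = np.asarray(demands[k], dtype=float)
            values = np.concatenate([own, pooled])
            # Own points carry total weight N_k; the anchor carries alpha.
            weights = np.concatenate(
                [np.ones(len(own)), np.full(len(pooled), alpha / len(pooled))]
            )
            # The newsvendor optimum is the critical-ratio quantile
            # of the (here: shrunken) demand distribution.
            orders[k] = weighted_quantile(values, weights, critical_ratio)
    return orders
```

Under these assumptions, `alpha = 0` recovers the decoupled SAA decision for each problem, while a large `alpha` pulls every problem toward the decision for its cluster's pooled data. Problems in different clusters never share data, which is the structural difference from pooling across all problems.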