Many real-world data sets can be represented as a matrix whose entries record the interaction between two entities of different natures (the number of times a web user visits a web page, a student's grade in a subject, a patient's rating of a doctor, etc.). We assume in this paper that this interaction is determined by unobservable latent variables describing each entity. Our objective is to estimate the conditional expectation of the data matrix given the unobservable variables. We cast this as the problem of estimating a bivariate function referred to as a graphon. We study the cases of piecewise constant and H\"older-continuous graphons. We establish finite-sample risk bounds for the least-squares estimator and the exponentially weighted aggregate. These bounds highlight the dependence of the estimation error on the size of the data set, the maximum intensity of the interactions, and the level of noise. As the analyzed least-squares estimator is intractable, we propose an adaptation of Lloyd's alternating minimization algorithm to compute an approximation of it. Finally, we present numerical experiments to illustrate the empirical performance of the graphon estimator on synthetic data sets.
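To make the alternating minimization idea concrete, the following is a minimal sketch of a Lloyd-type scheme for the piecewise constant case. It is an illustration under stated assumptions, not necessarily the adaptation proposed in the paper: the function names `block_means` and `lloyd_biclustering`, the cluster counts `K` and `L`, the random initialization, and the stopping rule are all hypothetical choices. The sketch alternates between updating block means and reassigning row and column labels so that the squared error between the data matrix and its block-constant fit does not increase.

```python
import numpy as np

def block_means(A, z, w, K, L):
    """Least-squares block values for fixed row labels z and column labels w."""
    Z = np.eye(K)[z]                      # (n, K) one-hot row memberships
    W = np.eye(L)[w]                      # (m, L) one-hot column memberships
    sums = Z.T @ A @ W                    # sum of entries in each block
    counts = np.outer(Z.sum(axis=0), W.sum(axis=0))
    Q = np.full((K, L), A.mean())         # empty blocks fall back to the global mean
    np.divide(sums, counts, out=Q, where=counts > 0)
    return Q

def lloyd_biclustering(A, K, L, n_iter=50, seed=0):
    """Alternating minimization of sum_ij (A_ij - Q[z_i, w_j])^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    z = rng.integers(K, size=n)           # assumed: random initial row labels
    w = rng.integers(L, size=m)           # assumed: random initial column labels
    for _ in range(n_iter):
        # Step 1: update block means, then reassign each row to its best cluster.
        Q = block_means(A, z, w, K, L)
        row_cost = ((A[:, None, :] - Q[:, w][None, :, :]) ** 2).sum(axis=2)   # (n, K)
        z_new = row_cost.argmin(axis=1)
        # Step 2: update block means again, then reassign each column.
        Q = block_means(A, z_new, w, K, L)
        col_cost = ((A.T[:, None, :] - Q[z_new, :].T[None, :, :]) ** 2).sum(axis=2)  # (m, L)
        w_new = col_cost.argmin(axis=1)
        if np.array_equal(z_new, z) and np.array_equal(w_new, w):
            break                         # labels are stable: a local minimum was reached
        z, w = z_new, w_new
    Q = block_means(A, z, w, K, L)
    return z, w, Q[z][:, w]               # labels and the fitted block-constant matrix

# Synthetic example: a 2 x 2 block-constant mean matrix observed with Gaussian noise.
rng = np.random.default_rng(1)
theta = np.array([[0.9, 0.1], [0.2, 0.7]])
rows = rng.integers(2, size=60)
cols = rng.integers(2, size=40)
A = theta[rows][:, cols] + 0.1 * rng.standard_normal((60, 40))
z, w, A_hat = lloyd_biclustering(A, K=2, L=2)
print(np.mean((A_hat - theta[rows][:, cols]) ** 2))   # empirical estimation error
```

Like any Lloyd-type procedure, this sketch only guarantees a local minimum of the least-squares criterion, so in practice one might run it from several random initializations and keep the fit with the smallest residual.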