Core-Elements for Classical Linear Regression

Estimation/Estimator · Subsampling · Predictor/Decision function · Linear regression · Linear

March 17, 2023


Mengyu Li, Jun Yu, Tao Li, Cheng Meng

The coresets approach, also called subsampling or subset selection, aims to select a subsample as a surrogate for the observed sample. Such an approach has been used pervasively in large-scale data analysis. Existing coresets methods construct the subsample using a subset of rows from the predictor matrix. Such methods can be significantly inefficient when the predictor matrix is sparse or numerically sparse. To overcome the limitation, we develop a novel element-wise subset selection approach, called core-elements, for large-scale least squares estimation in classical linear regression. We provide a deterministic algorithm to construct the core-elements estimator, only requiring an $O(\mbox{nnz}(\mathbf{X})+rp^2)$ computational cost, where $\mathbf{X}$ is an $n\times p$ predictor matrix, $r$ is the number of elements selected from each column of $\mathbf{X}$, and $\mbox{nnz}(\cdot)$ denotes the number of non-zero elements. Theoretically, we show that the proposed estimator is unbiased and approximately minimizes an upper bound of the estimation variance. We also provide an approximation guarantee by deriving a coresets-like finite sample bound for the proposed estimator. To handle potential outliers in the data, we further combine core-elements with the median-of-means procedure, resulting in an efficient and robust estimator with theoretical consistency guarantees. Numerical studies on various synthetic and open-source datasets demonstrate the proposed method's superior performance compared to mainstream competitors.
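The element-wise selection idea can be illustrated with a minimal Python/NumPy sketch. The abstract does not spell out the selection rule or the estimator formula, so two assumptions are made here: the $r$ largest-magnitude entries of each column of $\mathbf{X}$ are kept and all others are set to zero, giving a sparse matrix $\mathbf{X}^*$, and the estimate is $(\mathbf{X}^{*\top}\mathbf{X})^{-1}\mathbf{X}^{*\top}\mathbf{y}$. The function name `core_elements_estimator` is hypothetical. This dense version is for illustration only and does not attain the stated $O(\mbox{nnz}(\mathbf{X})+rp^2)$ cost, which would require sparse storage.

```python
import numpy as np


def core_elements_estimator(X, y, r):
    """Illustrative element-wise subset-selection estimator.

    ASSUMPTIONS (not stated in the abstract):
      * keep the r largest-|value| entries in each column of X,
        zero out the rest to obtain X_star;
      * estimate beta by solving (X_star^T X) beta = X_star^T y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")

    # Row indices of the r largest-magnitude entries in every column.
    idx = np.argpartition(np.abs(X), n - r, axis=0)[n - r:, :]

    # Build X_star: selected entries kept, everything else zero.
    X_star = np.zeros_like(X)
    cols = np.arange(p)
    X_star[idx, cols] = X[idx, cols]

    # Solve the p x p linear system instead of forming an explicit inverse.
    return np.linalg.solve(X_star.T @ X, X_star.T @ y)


# Small demonstration on a synthetic sparse predictor matrix.
rng = np.random.default_rng(0)
n, p, r = 500, 4, 50
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.2)
beta = np.arange(1.0, p + 1)

# Noise-free responses: the sketch recovers beta up to rounding error.
beta_clean = core_elements_estimator(X, X @ beta, r)

# Noisy responses: the estimate varies around beta.
y_noisy = X @ beta + 0.1 * rng.standard_normal(n)
beta_noisy = core_elements_estimator(X, y_noisy, r)
```

Under these assumptions the estimate is linear in $\mathbf{y}$, so whenever $E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$ its expectation is $(\mathbf{X}^{*\top}\mathbf{X})^{-1}\mathbf{X}^{*\top}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$, which is consistent with the unbiasedness claim in the abstract. The median-of-means extension is not sketched here because the abstract gives no detail on how the blocks are combined.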


