We study repeated bilateral trade where the valuations of the sellers and the buyers are contextual. More precisely, the agents' valuations are given by the inner product of a context vector with two unknown $d$-dimensional vectors -- one for the buyers and one for the sellers. At each time step $t$, the learner receives a context and posts two prices, one for the seller and one for the buyer; the trade happens if and only if both agents accept their price. We study two objectives for this problem, gain from trade and profit, proving no-regret guarantees with respect to a surprisingly strong benchmark: the best omniscient dynamic strategy. In the natural scenario where the learner observes \emph{separately} whether each agent accepts their price -- the so-called \emph{two-bit} feedback -- we design algorithms that achieve $O(d\log d)$ regret for gain from trade and $O(d \log\log T + d\log d)$ regret for profit maximization. Both results are tight up to the $\log d$ factor, and both algorithms satisfy per-step budget balance, meaning that the learner never incurs negative profit. In the less informative \emph{one-bit} feedback model, the learner only observes whether or not a trade happens. For this scenario, we show that the tight two-bit regret rates are still attainable, at the cost of allowing the learner to possibly incur a small negative profit of order $O(d\log d)$, which is notably independent of the time horizon. As a final set of results, we investigate the combination of one-bit feedback and per-step budget balance. There, we design an algorithm for gain from trade whose regret is independent of the time horizon but \emph{exponential} in the dimension $d$. For profit maximization, the regret retains this exponential dependence on the dimension, multiplied by a $\log T$ factor.