Skip to main content
SearchLoginLogin or Signup

On Learnability Under General Stochastic Processes

Published onOct 27, 2022
On Learnability Under General Stochastic Processes
·

Abstract

Statistical learning theory under independent and identically distributed (iid) sampling and online learning theory for worst case individual sequences are two of the best developed branches of learning theory. Statistical learning under general non-iid stochastic processes is less mature. We provide two natural notions of learnability of a function class under a general stochastic process. We show that both notions are in fact equivalent to online learnability. Our results hold for both binary classification and regression.

Keywords: learnability, statistical learning, online learning, prequential principle, uniform martingale law of large numbers


Media Summary

This article is a contribution to ‘learning theory,’ which has largely developed within the computer science community, but has much potential for cross-fertilization with theoretical statistics. The basic framework is as follows. At successive time points we observe a signal, and wish to predict an associated outcome, using one from a given class of possible functions. Performance is measured by a specified loss, measuring the discrepancy between prediction and outcome. Learning theory aims to characterize conditions that allow us to make provably good predictions by analyzing past data to determine a suitable prediction function. Two approaches have emerged, largely in isolation from each other. In the first, (iid) statistical learning theory, we assume all past and future signal-outcome pairs have the same probabilistic structure, independently. The second, online learning theory, makes no probabilistic assumptions, and aims to track the best predictor from those available, no matter what data are observed. In this work we introduce two natural ways of merging these approaches, assuming arbitrary probabilistic dependence across time. We show that both approaches are equivalent to online learnability. Our results hold for binary classification as well as regression.

1. Introduction

One of the most beautiful and best developed branches of machine learning theory is classical statistical learning theory. (For a nontechnical overview and for more extensive references, see von Luxburg & Schölkopf, 2011.) However, it deals primarily with independent and identically distributed (iid) sampling of examples. Several attempts have been made to deal with both dependence and non-stationarity; we discuss some of these extensions in Section 1.1. However in general the non-iid case is not as well developed as the classical iid case.

Another well-developed branch of learning theory that has its own share of elegant mathematical ideas is online learning theory. (For an excellent introduction, see Cesa-Bianchi & Lugosi, 2006.) With roots in game theory and the area of information theory known as universal prediction of individual sequences, online learning theory, unlike statistical learning theory, does not use probabilistic foundations. It is therefore quite surprising that there are uncanny parallels between iid learning theory and online learning theory. The reader is invited to compare the statements of the fundamental theorems in these two areas (restated in this article as Theorem 1 and Theorem 2).

Our main goal in this article is to study learnability of a function class in the statistical setting under extremely general assumptions that do not require independence or stationarity. We first summarize the key theorems of iid and online learning in Section 3 and Section 4. Although this material is not new, we feel that the broader data science community might not be very familiar with results in online learning since it is a younger field compared to statistical learning theory. Also, presenting both iid learning and online learning results in a unified way allows us to draw parallels between the two theories and to motivate the need for novel theories that connect these two.

We propose a definition of learnability under general stochastic processes in Section 5. We show that learning under this general definition is equivalent to online learnability (Theorem 6). We give a prequential version of our main definition in Section 6. In the prequential version, as in online learning, the function output by the learning algorithm at any given time cannot peek into the future. We show that learnability under the prequential version of our general learning setting is also equivalent to online learnability (Theorem 9). We focus on the problem of binary classification for simplicity. But we also provide extensions of our equivalence results to the problem of real valued prediction (i.e., regression) in Section 7 (see Theorem 13 and Theorem 14).

1.1. Related Work

The iid assumption of statistical learning theory has been relaxed and replaced with various types of mixing assumptions, especially β\beta-mixing (Mohri & Rostamizadeh, 2008; Vidyasagar, 2002). However, in this line of investigation, the stationary assumption is kept and the theory resembles the iid theory to a large extent since mixing implies approximate independence of random variables that are sufficiently separated in time. Mixing assumptions can be shown to hold for some interesting classes of processes, including some Markov and hidden Markov processes. Markov sampling has also been considered on its own as a generalization of iid sampling (Aldous & Vazirani, 1995; Gamarnik, 2003; Smale & Zhou, 2009).

There has been work on performance guarantees of specific algorithms like boosting (Lozano et al., 2005) and support vector machines (Steinwart & Anghel, 2009; Steinwart et al., 2009) under non-iid assumptions. However, our focus here is not on any specific learning methodology. We would like to point out that, while we focus on learnability of functions in a fixed class, the question of universal consistency has also been studied in the context of general stochastic processes (Hanneke, 2021; Nobel, 1999).

There are a handful of papers that focus, as we do, on conditional risk given a sequence of observation drawn from a general non-iid stochastic processes (Pestov, 2010; Shalizi & Kontorovich, 2013; Zimin & Lampert, 2017). These papers focus on process decompositions: expressing a complex stochastic process as a mixture of simpler stochastic processes. For example, de Finetti’s theorem shows that exchangeable distributions are mixtures of iid distributions. The basic idea is to output a function with small expected loss one step beyond the observed sample where the expectation is also conditioned on the observed sample. While closely related, our performance measures are cumulative in nature and are inspired more by regret analysis in online learning than PAC bounds in computational learning theory.

The use of tools from online learning theory (e.g., sequential Rademacher complexity) for developing learning theory for dependent, nonstationary processes was pioneered by Kuznetsov and Mohri (2015, 2017). However, their focus is on time series forecasting applications and therefore their performance measures always involve the expected loss of the function chosen by the learning algorithm some steps into the future (i.e., the part not seen by the learning algorithm) of the process. In contrast, our definition uses conditional distributions of the stochastic process on the realized path to define our performance measure. We also point out that there are earlier papers that apply learning theory tools to understand time series prediction (Alquier & Wintenberger, 2012; Meir, 2000; Modha & Masry, 1998). Some very recent work has also begun to extend some of the work on time series to processes with spatial structure and dependence such as those occurring on a network (Dagan et al., 2019).

A direct inspiration for this article is the work of Skouras and Dawid (2000) on estimation in semiparametric statistical models under misspecification. They highlighted that, under misspecification, M-estimators, including maximum likelihood estimators, may not converge to a deterministic limit even asymptotically. Instead, the limit can be stochastic. This is because, under misspecification, the ‘best’ model can depend on the observed sequence of data. They gave examples showing that this can happen for non-ergodic processes or processes with long-range dependencies that do not decay fast enough. Our work can be seen as a direct extension of their ideas to the learning theory setting where the focus is not on parameter estimation but on loss minimization over potentially massive function spaces.

2. Preliminaries

We consider a supervised learning setting where we want to learn a mapping from an input space X\mathcal{X} to an output space Y\mathcal{Y}. Two output spaces of interest to us in this article are Y={1,+1}\mathcal{Y}= \{-1,+1\} (binary classification) and Y=[1,+1]\mathcal{Y}= [-1,+1] (regression). Instead of talking about the difficulty of learning individual functions, we will define learnability for a class of functions that we will denote by FYX\mathcal{F}\subseteq \mathcal{Y}^{\mathcal{X}}. Let Z=X×Y\mathcal{Z}= \mathcal{X}\times \mathcal{Y} and let :Z×FR+\ell: \mathcal{Z}\times \mathcal{F}\to \mathbb{R}_+ be a loss function mapping an input-output pair (x,y)(x,y) and a function ff to a nonnegative loss. The set {1,,n}\{1,\ldots,n\} will be denoted by [n][n] and we use 1l[C]1\kern-0.25em\text{l}\left[C\right] to denote an indicator function that is 11 if the condition CC is true and 00 otherwise. Two important loss functions are the 00-11 loss ((x,y),f)=1l[yf(x)]\ell((x,y),f) = 1\kern-0.25em\text{l}\left[y \neq f(x)\right] (in binary classification) and the absolute loss ((x,y),f)=yf(x)\ell((x,y),f) = |y - f(x)| (in regression).

We often denote an input-output pair (x,y)(x,y) by zz. When the input-output pair is random, we will denote it by Z=(X,Y)Z = (X,Y), perhaps with additional time indices such as Zt=(Xt,Yt)Z_t = (X_t,Y_t). We will use the abbreviation Z1:tZ_{1:t} to denote the sequence Z1,,ZtZ_1,\ldots,Z_t. A learning rule f^n\widehat{f}_n is a map from Zn\mathcal{Z}^n to F\mathcal{F}. We will abuse notation a bit and refer to the learning rule and the function output by the learning rule both by f^n\widehat{f}_n. An important learning rule is empirical risk minimization (ERM): given a sequence z1:tz_{1:t} of input-output pairs, it outputs the function,

f^nERM=argminfF1nt=1n(zt,f).        (2.1)\widehat{f}^{\text{ERM}}_n = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(z_t,f) . \ \ \ \ \ \ \ \ \text{(2.1)}

Note that, for infinite function classes, the minimum may not be achieved. In that case, one can work with functions achieving empirical risks that are arbitrarily close to the infimum of the empirical risk over the class F\mathcal{F}.

Given a distribution PP on Z\mathcal{Z}, the loss function can be extended as follows:

(P,f)=EZP[(Z,f)].\ell(P, f) = \mathbb{E}_{Z\sim P}\left[\ell(Z,f)\right] .

The function minimizing the expectation above is

fP=argminfF(P,f).f^\star_P = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \ell(P, f).

The PP-regret of a function fFf \in \mathcal{F} is defined as

ρ(P,f)=(P,f)inffF(P,f)=(P,f)(P,fP).\begin{aligned} \rho(P, f) &= \ell(P, f) - \inf_{f' \in \mathcal{F}} \ell(P, f') \\ &= \ell(P, f) - \ell(P, f^\star_P).\end{aligned}

Note that the PP-regret depends on the class F\mathcal{F} but we hide this dependence when the function class is clear from the context.

3. Learnability in the IID Setting

In this section we review some basic results of statistical learning theory under iid sampling. For more details the reader can consult standard texts in this area (Anthony & Bartlett, 1999; Shalev-Shwartz & Ben-David, 2014; Vidyasagar, 2002). In the standard formulation of statistical learning theory, we draw a sequence Z1:nZ_{1:n} of iid examples from a distribution PP. That is, the joint distribution of Z1:nZ_{1:n} is a product distribution P=PPP\mathbf{P}= P \otimes P \otimes \ldots \otimes P. We adopt the minimax framework to define learnability of a class F\mathcal{F} of functions with respect to a loss function \ell. Define the worst case performance of a learning rule f^n\widehat{f}_n by

Vniid(f^n,F)=supPE[ρ(P,f^n)]V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) = \sup_{P} \mathbb{E}\left[ \rho(P, \widehat{f}_n) \right]

and the minimax value by

Vniid(F)=inff^nVniid(f^n,F).V^{\text{iid}}_n(\mathcal{F}) = \inf_{\widehat{f}_n} V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) .

For the sake of conciseness, the notation above hides the fact that f^n\widehat{f}_n depends on the sequence Z1:nZ_{1:n}. The expectation above is taken over the randomness in these samples.

Definition 1. We say that F\mathcal{F} is learnable in the iid learning setting if

lim supnVniid(F)=0.\limsup_{n \to \infty} V^{\text{iid}}_n(\mathcal{F}) = 0 .

Furthermore, we say that F\mathcal{F} is learnable via a sequence f^n\widehat{f}_n of learning rules if

lim supnVniid(f^n,F)=0.\limsup_{n \to \infty} V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) = 0 .

One of the major achievements of statistical learning theory was the determination of necessary and sufficient conditions for learnability of a class F\mathcal{F}. Learnability in both binary classification with 00-11 loss and regression with absolute loss is known to be equivalent to a probabilistic condition, namely the uniform law of large numbers (ULLN) for the class F\mathcal{F}:

lim supnsupPE[supfF1nt=1nf(Xt)Pf]=0.        (3.1)\limsup_{n \to \infty} \sup_{P} \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{t=1}^n f(X_t) - Pf \right|\right] = 0 . \ \ \ \ \ \ \ \ \text{(3.1)}


Here X1:nX_{1:n} are drawn iid from PP and Pf=EXP[f(X)]Pf = \mathbb{E}_{X \sim P}\left[f(X)\right]. Whether or not ULLN holds for a class F\mathcal{F} depends on the finiteness of different combinatorial parameters, depending on whether we are in the binary classification or regression setting. We will discuss the binary classification case here, leaving the regression case to Section 7.

The Vapnik-Chervonenkis (VC) dimension of F\mathcal{F}, denoted by VCdim(F)\text{VCdim}(\mathcal{F}), is the length nn of the longest sequence x1:nx_{1:n} shattered by F\mathcal{F}. We say that a sequence x1:nx_{1:n} is shattered by F\mathcal{F} if

ϵ1:n{±1}n,fF, s.t. t[n],f(xt)=ϵt.\forall \epsilon_{1:n} \in \{\pm1\}^n, \exists f \in \mathcal{F}, \text{ s.t. } \forall t \in [n], f(x_t) = \epsilon_t .


Finally, we recall the definition of the (expected) Rademacher complexity of a function class with respect to a distribution PP:

Rn(P,F)=E[supfF1nt=1nϵtf(Xt)]\mathfrak{R}_n(P, \mathcal{F}) = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(X_t) \right]


Note that the expectation above is with respect to both X1:nX_{1:n} and ϵ1:t\epsilon_{1:t}. The former are drawn iid from PP, whereas the latter are iid {±1}\{\pm 1\}-valued Rademacher (also called symmetric Bernoulli) random variables. The worst case, over PP, Rademacher complexity is denoted by

Rn(F)=supPRn(P,F).\mathfrak{R}_n(\mathcal{F}) = \sup_{P} \mathfrak{R}_n(P, \mathcal{F}) .

Theorem 1. Consider binary classification with 00-11 loss in the iid setting. Then, the following are equivalent:

(1) F\mathcal{F} is learnable.
(2) F\mathcal{F} is learnable via ERM.
(3) The ULLN condition (3.1) holds for F\mathcal{F}.
(4) VCdim(F)<\text{VCdim}(\mathcal{F}) < \infty.
(5) lim supnRn(F)=0\limsup_{n \to \infty} \mathfrak{R}_n(\mathcal{F}) = 0.

A similar result holds for regression with absolute loss with the VC dimension condition (i.e., condition number 4 above) replaced with a similar one involving its scale-sensitive counterpart, called the fat-shattering dimension (see Section 7.1 for details).

4. Learnability in the Online Setting

A second learning setting with a well-developed theory is the online learning setting, where no probabilistic assumptions are placed on the data-generating process. Compared to statistical learning theory under iid sampling, online learning theory is a younger field. The main combinatorial parameter in this area, the Littlestone dimension, was defined by Littlestone (1988). It was given the name “Littlestone dimension” by Ben-David et al. (2009), where it was also shown that it fully characterizes learnability in the binary classification setting. Scale-sensitive analogues of the Littlestone dimension for regression problems and the sequential version of Rademacher complexity were studied in Rakhlin et al. (2015a, 2015b).

The online learning setting takes an individual sequence approach, where results are sought that hold for every possible sequence z1:nZnz_{1:n} \in \mathcal{Z}^n that might be encountered by the learning rule.

We consider a sequence f^0:n1\widehat{f}_{0:n-1} of learning rules, where f^t\widehat{f}_t takes in as input the sequence z1:tz_{1:t} and outputs a (possibly random) function in F\mathcal{F}. Define the expected (normalized) regret of f^0:n1\widehat{f}_{0:n-1} on sequence z1:nz_{1:n}:

ρonline(f^0:n1,z1:n)=E[1n(t=1n(zt,f^t1)inffFt=1n(zt,f))].\rho^{\text{online}}(\widehat{f}_{0:n-1},z_{1:n}) = \mathbb{E}\left[ \frac{1}{n} \left( \sum_{t=1}^n \ell(z_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell(z_t, f) \right) \right].


This is similar in flavor to, but distinct from, the regret function ρ\rho used in the iid setting. It obeys the prequential principle (Dawid, 1984): performance of f^t1\widehat{f}_{t-1}, which is learned using z1:t1z_{1:t-1}, is judged using loss evaluated on ztz_t with no overlap between data used for learning and for performance evaluation. The expectation is needed because the learning rules f^0:n1\widehat{f}_{0:n-1} may use internal randomization to achieve robustness to adversarial data. The regret nomenclature comes from the fact that f^0:n1\widehat{f}_{0:n-1} cannot peek into the future to lower its loss but its cumulative performance is compared with lowest possible loss, in hindsight, over the entire sequence z1:nz_{1:n}. However, the comparator term has its own restriction: it uses the best fixed function ff in hindsight, as opposed to the best sequence of functions.

The object of interest is now the following minimax value:

Vnonline(F)=inff^0:n1Vonline(f^0:n1,F),V^{\text{online}}_n(\mathcal{F}) = \inf_ {\widehat{f}_{0:n-1}} V^{\text{online}}(\widehat{f}_{0:n-1}, \mathcal{F}),

where

Vonline(f^0:n1,F)=supz1:nZnρonline(f^0:n1,z1:n)V^{\text{online}}(\widehat{f}_{0:n-1}, \mathcal{F}) = \sup_{z_{1:n} \in \mathcal{Z}^n} \rho^{\text{online}}(\widehat{f}_{0:n-1},z_{1:n})


is the worst-case performance of the sequence f^0:n1\widehat{f}_{0:n-1} of learning rules, with f^t\widehat{f}_t taking in as input the sequence z1:tz_{1:t} and outputting a function in F\mathcal{F}. The infimum is then taken over all such learning rule sequences.

Definition 2. We say that F\mathcal{F} is learnable in the online learning setting if

lim supnVnonline(F)=0.\limsup_{n \to \infty} V^{\text{online}}_n(\mathcal{F}) = 0 .


As in statistical learning, we have necessary and sufficient conditions for learnability that almost mirror those in Theorem 1. The ULLN condition gets replaced by the Uniform Martingale Law of Large Numbers (UMLLN). We say that UMLLN holds for F\mathcal{F} if

lim supnsupP,AE[supfF1nt=1n(f(Xt)E[f(Xt)At1])]=0.       (4.1)\limsup_{n \to \infty} \sup_{\mathbf{P}, \mathcal{A}} \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{t=1}^n \left( f(X_t) - \mathbb{E}\left[f(X_t)|\mathcal{A}_{t-1}\right] \right) \right|\right] = 0 . \ \ \ \ \ \ \ \text{(4.1)}


The crucial difference between the UMLLN condition and the ULLN condition is that here the supremum is taken over all joint distributions P\mathbf{P} of X1:nX_{1:n}. In particular X1:nX_{1:n} need not be iid. Also, to obtain a martingale structure, we use an arbitrary filtration A=(At)t=0n1\mathcal{A}= (\mathcal{A}_t)_{t=0}^{n-1} such that XtX_t is At\mathcal{A}_t-measurable. It is easy to see that UMLLN is a stronger condition than ULLN: simply restrict P\mathbf{P} to be a product distribution and let A\mathcal{A} be the natural filtration of XtX_t. Then the UMLLN condition reduces to the ULLN condition.

The VC dimension of F\mathcal{F} is replaced by another combinatorial parameter, called the Littlestone dimension of F\mathcal{F}, denoted by Ldim(F)\text{Ldim}(\mathcal{F}). Before we present the definition of Littlestone dimension, we need some notation to handle complete binary trees labeled with examples drawn from the input space X\mathcal{X}. We think of a complete binary tree x\mathbf{x} of depth nn as defining a sequence xt,1tn\mathbf{x}_t, 1 \le t \le n, of maps. The map xt\mathbf{x}_t gives us the examples sitting at level tt of the tree. For example, x1\mathbf{x}_1 is the root, x2(1)\mathbf{x}_2(-1) is the left child of the root, x2(+1)\mathbf{x}_2(+1) is the right child of the root, and so on. In general xt(ϵ1:t1)\mathbf{x}_t(\epsilon_{1:t-1}) is the node at level tt that we reach by following the path given by the sign sequence ϵ1:t1{±1}t1\epsilon_{1:t-1} \in \{\pm1\}^{t-1}, where 1-1 means ‘go left’ and +1+1 means ‘go right’. The Littlestone dimension of F\mathcal{F} is the depth nn of the largest complete binary tree x\mathbf{x} shattered by F\mathcal{F}. We say that a complete binary tree x\mathbf{x} is shattered by F\mathcal{F} if

ϵ1:n{±1}n,fF, s.t. t[n],f(xt(ϵ1:t1))=ϵt.\forall \epsilon_{1:n} \in \{\pm1\}^n, \exists f \in \mathcal{F}, \text{ s.t. } \forall t \in [n], f(\mathbf{x}_t(\epsilon_{1:t-1})) = \epsilon_t .


Finally, Rademacher complexity gets replaced with its sequential analogue, called the sequential Rademacher complexity. We first define the sequential Rademacher complexity of F\mathcal{F} given a tree x\mathbf{x} of depth nn as:

Rnseq(x,F)=E[supfF1nt=1nϵtf(xt(ϵ1:t1))].\mathfrak{R}^\text{seq}_n(\mathbf{x}, \mathcal{F}) = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1})) \right] .


Note that the expectation above is only with respect to the Rademacher random variables ϵ1:t\epsilon_{1:t} as x\mathbf{x} is a fixed tree. Taking the worst case over all complete binary trees x\mathbf{x} of depth nn gives us the sequential Rademacher complexity of F\mathcal{F}:

Rnseq(F)=supxRnseq(x,F).\mathfrak{R}^\text{seq}_n(\mathcal{F}) = \sup_{\mathbf{x}} \mathfrak{R}^\text{seq}_n(\mathbf{x}, \mathcal{F}) .

Theorem 2. Consider binary classification with 00-11 loss in the online (individual sequence) setting. Then, the following are equivalent:

(1) F\mathcal{F} is learnable.
(2) The UMLLN condition (4.1) holds for F\mathcal{F}.
(3) Ldim(F)<\text{Ldim}(\mathcal{F}) < \infty.
(4) lim supnRnseq(F)=0\limsup_{n \to \infty} \mathfrak{R}^\text{seq}_n(\mathcal{F}) = 0.


As in the iid setting, a similar result holds for online regression with absolute loss, with the Littlestone dimension condition (i.e., condition number 3 above) replaced by a similar one involving its scale-sensitive counterpart, called the sequential fat shattering dimension (see Section 7.2 for details).

It is well known that online learnability is harder than iid learnability. That is, VCdim(F)Ldim(F)\text{VCdim}(\mathcal{F}) \le \text{Ldim}(\mathcal{F}) for any F\mathcal{F}, and the gap in this inequality can be arbitrarily large. For example, the set of threshold functions on R\mathbb{R}:

Fthreshold={x1l[x>θ]:θR}        (4.2)\mathcal{F}_{\text{threshold}} = \{ x \mapsto 1\kern-0.25em\text{l}\left[x > \theta\right] \::\: \theta \in \mathbb{R} \} \ \ \ \ \ \ \ \ \text{(4.2)}


has VCdim(Fthreshold)=1\text{VCdim}(\mathcal{F}_{\text{threshold}}) = 1 but Ldim(Fthreshold)=\text{Ldim}(\mathcal{F}_{\text{threshold}}) = \infty.

A conspicuous difference between Theorem 1 and Theorem 2 is the absence of the condition involving ERM. Indeed, ERM is not necessarily a good learning rule in the online setting: there exist classes learnable in the online setting that are not learnable via ERM. Unfortunately, the learning rules that learn a class F\mathcal{F} in the online setting are quite complex (Ben-David et al., 2009). It is not known if there exists a rule as simple as ERM that will learn a class F\mathcal{F} whenever F\mathcal{F} is online learnable. In any case, ERM does not play as central a role in online learning as it does in learning in the iid setting.

5. Learnability Under General Stochastic Processes

In this section we move beyond the iid setting to cover all distributions, not just product distributions. For a general stochastic process P\mathbf{P}, we still have an analogue of PP at time tt, namely

Pt(;z1:t1)=P(Z1:t1=z1:t1).P_t(\cdot; z_{1:t-1}) = \mathbf{P}(\cdot|Z_{1:t-1}=z_{1:t-1}) .


This is the conditional distribution of ZtZ_t given Z1:t1Z_{1:t-1}. Just like PP, this is unknown to the learning rule. However, unlike PP in the iid case, PtP_t is data-dependent. Therefore the PtP_t-regret of a function ρ(Pt,f)\rho(P_t, f) is data-dependent. We will often hide the dependence of PtP_t on past data Z1:t1Z_{1:t-1}. We can use the average of the PtP_t-regrets,

Rn(Z1:n,f)=1nt=1nρ(Pt,f)R_n(Z_{1:n}, f) = \frac{1}{n} \sum_{t=1}^n \rho(P_t, f)


as a performance measure. Note that the minimizer of this performance measure is data-dependent, unlike in the iid case. As in the iid setting a learning rule f^n\widehat{f}_n is a map from Zn\mathcal{Z}^n to F\mathcal{F}. To reduce clutter in our notation, we will continue to hide the dependence of f^n\widehat{f}_n on the realized sample Z1:nZ_{1:n}. The value of a learning rule f^n\widehat{f}_n is now defined as

Vngen(f^n,F)=supPE[Rn(Z1:n,f^n)inffFRn(Z1:n,f)]=supPE[1nt=1n(Pt,f^n)inffF1nt=1n(Pt,f)],\begin{aligned} V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F}) &= \sup_{\mathbf{P}} \mathbb{E}\left[ R_n(Z_{1:n}, \widehat{f}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f) \right] \\ &= \sup_{\mathbf{P}} \mathbb{E}\left[ \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_n) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \right] ,\end{aligned}


where the supremum is now taken over all joint distributions P\mathbf{P} over Z1:nZ_{1:n}. This leads to consideration of the following minimax value to define learnability:

Vngen(F)=inff^nVngen(f^n,F).V^{\text{gen}}_n(\mathcal{F}) = \inf_{\widehat{f}_n} V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F}) .

Definition 3. We say that F\mathcal{F} is process learnable if

lim supnVngen(F)=0.\limsup_{n \to \infty} V^{\text{gen}}_n(\mathcal{F}) = 0 .

Furthermore, we say that F\mathcal{F} is process learnable via a sequence f^n\widehat{f}_n of learning rules if

lim supnVngen(f^n,F)=0.\limsup_{n \to \infty} V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F}) = 0 .


Note that in the iid case, when P\mathbf{P} is a product distribution with marginal PP, we have Pt=PP_t = P for all tt and therefore, for any ff,

1nt=1n(Pt,f)=(P,f).\frac{1}{n} \sum_{t=1}^n \ell(P_t, f) = \ell(P, f) .


We have the following result as an immediate consequence.

Lemma 3. Fix any loss function \ell and function class F\mathcal{F}. For any learning rule f^n\widehat{f}_n, Vgen(f^n,F)Viid(f^n,F)V^{\text{gen}}(\widehat{f}_n, \mathcal{F}) \ge V^{\text{iid}}(\widehat{f}_n, \mathcal{F}). This also means that Vngen(F)Vniid(F)V^{\text{gen}}_n(\mathcal{F}) \ge V^{\text{iid}}_n(\mathcal{F}).

The result above is not surprising: process learnability has to be harder than iid learnability. However, somewhat surprisingly, we can show that process learnability is at least as hard as online learnability.

Theorem 4. Consider binary classification with 0-1 loss in the general stochastic process setting. Suppose the class F\mathcal{F} is not online learnable, i.e., Ldim(F)=\text{Ldim}(\mathcal{F}) = \infty. Then for any n1n \ge 1, Vngen(F)1/8V^{\text{gen}}_n(\mathcal{F}) \ge 1/8. Therefore, the class F\mathcal{F} is not process learnable.

To complement the lower bound above, we will now give a performance guarantee for ERM in the general stochastic process setting. Given a loss \ell and function class F\mathcal{F}, define the loss class F\ell \circ \mathcal{F} as

F={z(z,f):fF}.\ell \circ \mathcal{F}= \{ z \mapsto \ell(z,f) \::\: f \in \mathcal{F} \} .


We define the sequential Rademacher complexity of a loss class F\ell \circ \mathcal{F} as

Rnseq(z,F)=E[supfF1nt=1nϵt(zt(ϵ1:t1),f)],Rnseq(F)=supzRnseq(z,F).\begin{aligned} \mathfrak{R}^\text{seq}_n(\mathbf{z}, \ell \circ \mathcal{F}) &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t \ell(\mathbf{z}_t(\epsilon_{1:t-1}), f) \right] ,\\ \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) &= \sup_{\mathbf{z}} \mathfrak{R}^\text{seq}_n(\mathbf{z}, \ell \circ \mathcal{F}) .\end{aligned}

Note that the supremum here is over Z\mathcal{Z}-valued trees that are labeled with input-output pairs. It is easy for us to connect the complexity to the loss class to the complexity of the underlying function class for a simple loss function like the 00-11 loss (see Appendix A for details.)

Theorem 5. Fix any loss function \ell and function class F\mathcal{F}. Let f^ERM\widehat{f}^{\text{ERM}} denote the ERM learning rule defined in (2.1). Then we have

Vngen(F)Vngen(f^nERM,F)4Rnseq(F).V^{\text{gen}}_n(\mathcal{F}) \le V^{\text{gen}}_n(\widehat{f}^{\text{ERM}}_n, \mathcal{F}) \le 4 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}).


Proof. The first inequality is true by definition of Vngen(F)V^{\text{gen}}_n(\mathcal{F}). So we just have to prove the second one.

Note, by definition of f^ERM\widehat{f}^{\text{ERM}},

1nt=1n(Zt,f^nERM)=inffF1nt=1n(Zt,f).\frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}^{\text{ERM}}_n) = \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) .


Therefore, we have

Rn(Z1:n,f^nERM)inffFRn(Z1:n,f)=1nt=1n(Pt,f^nERM)inffF1nt=1n(Pt,f)=1nt=1n(Pt,f^nERM)1nt=1n(Zt,f^nERM)+inffF1nt=1n(Zt,f)inffF1nt=1n(Pt,f)supfF1n(t=1n(Pt,f)(Zt,f))+supfF1n(t=1n(Zt,f)(Pt,f)).        (5.1)\begin{aligned} &\quad R_n(Z_{1:n}, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f) \\ &= \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\ &= \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}^{\text{ERM}}_n) - \frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}^{\text{ERM}}_n) + \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\ &\le \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(P_t, f) - \ell(Z_t, f) \right) + \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(Z_t, f) - \ell(P_t, f) \right) . \end{aligned} \ \ \ \ \ \ \ \ \text{(5.1)}


The justification for the last inequality is as follows. First, we know that f^nERMF\widehat{f}^{\text{ERM}}_n \in \mathcal{F}. Second, when inffF1nt=1n(Pt,f)\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) is achieved, at ff^\star say, we have,

inffF1nt=1n(Zt,f)inffF1nt=1n(Pt,f)1nt=1n(Zt,f)1nt=1n(Pt,f)supf1n(t=1n(Zt,f)1nt=1n(Zt,f)).\begin{aligned} &\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)\\ &\leq \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f^\star) - \frac{1}{n} \sum_{t=1}^n \ell(P_t, f^\star) \\ &\leq \sup_f \frac{1}{n} \left(\sum_{t=1}^n \ell(Z_t, f) - \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f)\right).\end{aligned}


Taking expectations on both sides of (5.1) gives us

E[Rn(Z1:n,f^nERM)inffFRn(Z1:n,f)]E[supfF1n(t=1n(Pt,f)(Zt,f))]+E[supfF1n(t=1n(Zt,f)(Pt,f))]4Rnseq(F).\begin{aligned} & \quad \mathbb{E}\left[ R_n(Z_{1:n}, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f) \right] \\ &\le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(P_t, f) - \ell(Z_t, f) \right) \right] \\ &\quad + \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(Z_t, f) - \ell(P_t, f) \right) \right] \\ &\le 4 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .\end{aligned}


Note that the last inequality follows from Theorem 2 of Rakhlin et al. (2015b). Since the last quantity above does not depend on P\mathbf{P}, we can take supremum over P\mathbf{P} on both sides to finish the proof.

We now have everything in place to be able to show the equivalence of process learnability and online learnability. A similar result can also be shown in the regression case (see Section 7.3).

Theorem 6. Consider binary classification with 00-11 loss. Then all of the equivalent conditions in Theorem 2 are also equivalent to:

F\mathcal{F} is process learnable.


Proof. Theorem 4 established that learnability in the general stochastic process setting implies online learnability. For the other direction, note that according to Theorem 5 we have

Vngen(F)4Rnseq(F)2Rnseq(F) ,V^{\text{gen}}_n(\mathcal{F}) \le 4 \, \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) \le 2 \, \mathfrak{R}^\text{seq}_n(\mathcal{F}) \ ,


where the second inequality follows from Theorem 16 in Appendix A. Taking limsup\lim \sup of both sides as nn tends to infinity shows that online learnability implies process learnability.

Although online learnability turns out to be equivalent to process learnability, there is an important difference between the two settings that has to do with the importance of ERM. In the former, ERM is not a good learning rule, whereas in the latter a learnable class is learnable via ERM. Therefore ERM continues to play a special role in the general stochastic process setting just like the iid setting.

Also note that Theorem 6 is stated in terms of learnability, which is an asymptotic concept. However, the proof clearly shows that the rate of convergence is determined by the sequential Rademacher complexity of F\mathcal{F}, which scales as O(Ldim(F)/n)O\left(\sqrt{{\text{Ldim}(\mathcal{F})}/{n}}\right) (Alon et al., 2021).

5.1. Examples

We end this section with some examples showing that our definition of process learnability is natural, interesting, and worth studying.

IID Sampling. Let us note once again that if P=PPP\mathbf{P}= P \otimes P \otimes \ldots \otimes P is a product measure then (Pt,f)\ell(P_t, f) is just (P,f)\ell(P, f) and therefore not random. In this special but important case, our definition of learnability reduces to the standard definition of learnability under iid sampling.

Asymptotically Stationary Process. Suppose that P\mathbf{P} is not a product measure but the process is asymptotically stationary in the sense that the random probability measure Pˉn=1nt=1nPt\bar{P}_n= \frac{1}{n} \sum_{t=1}^n P_t converges to some fixed deterministic PP^\star in total variation TV\|\cdot\|_{TV} as nn \to \infty. For a class F\mathcal{F} that is learnable in the general stochastic process setting and for loss function bounded by 11, we have

E[(P,f^nERM)]inffF(P,f)=E[(P,f^nERM)(P,fP)]2E[supfF(P,f)(Pˉn,f)]+E[(Pˉn,f^nERM)(Pˉn,fP)]2E[PPˉnTV]+E[(Pˉn,f^nERM)inffF(Pˉn,f)].\begin{aligned} &\quad \mathbb{E}\left[ \ell(P^\star, \widehat{f}^{\text{ERM}}_n)\right] - \inf_{f \in \mathcal{F}} \ell(P^\star, f) \\ &= \mathbb{E}\left[ \ell(P^\star, \widehat{f}^{\text{ERM}}_n) - \ell(P^\star, f_{P^\star})\right] \\ &\le 2 \, \mathbb{E}\left[ \sup_{f \in \mathcal{F}} | \ell(P^\star, f) - \ell(\bar{P}_n, f) | \right] \\ &\quad + \mathbb{E}\left[ \ell(\bar{P}_n, \widehat{f}^{\text{ERM}}_n) - \ell(\bar{P}_n, f_{P^\star}) \right] \\ &\le 2 \, \mathbb{E}\left[ \|P^\star - \bar{P}_n\|_{TV} \right] \\ &\quad + \mathbb{E}\left[ \ell(\bar{P}_n, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} \ell(\bar{P}_n, f) \right] .\end{aligned}

By the stationarity assumption, the first term on the right in the last inequality goes to zero. Moreover, the rate of convergence can often be characterized in terms of the mixing coefficients of the stochastic process (Vidyasagar, 2002). By learnability of F\mathcal{F} via ERM in the general stochastic process setting, the last term goes to zero. Note that (P,f)\ell(P, f) is linear in PP and therefore (Pˉn,f)=1nt=1n(Pt,f)\ell(\bar{P}_n, f) = \frac{1}{n} \sum_{t=1}^n \ell(P_t, f). So, under stationarity, our learnability condition implies that ERM does well when its performance is measured under the (asymptotic) stationary distribution PP^\star.

Mixture of IID. Consider a simple mixture of product distributions

P=λPPP+(1λ)QQQ\mathbf{P}= \lambda P \otimes P \otimes \ldots \otimes P + (1-\lambda) Q \otimes Q \otimes \ldots \otimes Q

where, for simplicity, assume that PP and QQ have disjoint supports. Then with probability λ\lambda we have t>1\forall t>1 Pt=PP_t = P, and with probability 1λ1-\lambda we have t>1\forall t>1 Pt=QP_t = Q. Therefore, the minimizer of

1nt=1n(Pt,f)         (5.2)\begin{aligned} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)\end{aligned} \ \ \ \ \ \ \ \ \ \text{(5.2)}

is fPf^\star_{P} with probability λ\lambda and fQf^\star_{Q} with probability 1λ1-\lambda (assuming, again for simplicity, that the minimizers fP,fQf^\star_{P}, f^\star_{Q} are unique). Here, unlike the iid and stationary examples, the ‘best’ function, even with infinite data, is not deterministic but is random depending on which mixture component was selected. Still, if learnability in our general sense holds, then ERM will do well according to the performance measure (5.2). Note that this example can be easily generalized to a mixture of more than two iid processes. It can also be generalized, with additional technical conditions, to the general case where PQP \neq Q. The main difference from the disjoint support case that we consider here will be that with probability λ\lambda, PtP_t will converge to PP (in a suitable sense) and to QQ otherwise. Similarly, the minimizer of (5.2) would not equal fPf^\star_P or fQf^\star_Q but it would converge (again, in some appropriate sense determined by technical conditions) to one of them with probability λ\lambda and 1λ1-\lambda, respectively.

Random Level. Fix the squared loss (z,f)=(yf(x))2\ell(z,f) = (y-f(x))^2 and consider a class F\mathcal{F} that is iid learnable and closed under translations by a constant, that is, if fFf \in \mathcal{F} then f+cFf + c \in \mathcal{F} for any constant cRc \in \mathbb{R}. Let X1:nX_{1:n} be iid drawn from some distribution PXP_X on XRd\mathcal{X}\subseteq \mathbb{R}^d that has a density with respect to Lebesgue measure on Rd\mathbb{R}^d. Let Yt=f(Xt)+ξt+ξ0Y_t = f^\star(X_t) + \xi_t + \xi_0 for some fFf^\star \in \mathcal{F} and 1tn1 \le t \le n where (ξt)t=0n(\xi_t)_{t=0}^n are iid standard normal. Note that the process Zt=(Xt,Yt),1tnZ_t = (X_t,Y_t), 1 \le t \le n is not iid. It is not even mixing in any sense due to long-range dependence in YtY_t caused by ξ0\xi_0. Now ERM over F\mathcal{F} is given by:

f^nERM(Z1:n)=argminfF1nt=1n(f(Xt)+ξ0+ξtf(Xt))2=argminfF1nt=1n(f(Xt)+ξt(f(Xt)ξ0))2=ξ0+argmingFξ01nt=1n(f(Xt)+ξtg(Xt))2=ξ0+argmingF1nt=1n(f(Xt)+ξtg(Xt))2,\begin{aligned} \widehat{f}^{\text{ERM}}_n(Z_{1:n}) &= \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n (f^\star(X_t) + \xi_0 + \xi_t - f(X_t))^2 \\ &= \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n (f^\star(X_t) + \xi_t - (f(X_t)-\xi_0))^2 \\ &= \xi_0 + \mathop{\mathrm{argmin}}_{g \in \mathcal{F}-\xi_0} \frac{1}{n} \sum_{t=1}^n (f^\star(X_t) + \xi_t - g(X_t))^2 \\ &= \xi_0 + \mathop{\mathrm{argmin}}_{g \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n (f^\star(X_t) + \xi_t - g(X_t))^2 ,\end{aligned}


where the last equality holds because Fξ0=F\mathcal{F}- \xi_0 = \mathcal{F} and we have assumed that all empirical minimizers are unique with probability 11. Thus, we have shown that

f^nERM(Z1:n)=f^nERM((Xt,f(Xt)+ξt)t=1n)+ξ0.\widehat{f}^{\text{ERM}}_n(Z_{1:n}) = \widehat{f}^{\text{ERM}}_n((X_t,f^\star(X_t) + \xi_t)_{t=1}^n) + \xi_0 .


Since F\mathcal{F} is iid learnable, f^nERM((Xt,f(Xt)+ξt)t=1n)\widehat{f}^{\text{ERM}}_n((X_t,f^\star(X_t) + \xi_t)_{t=1}^n) converges (in L2(PX)L_2(P_X) sense) to the function ff^\star, which means the ERM on Z1:nZ_{1:n} converges to the random function f+ξ0f^\star + \xi_0.

Next we compute (Pt,f)\ell(P_t,f), as follows. Let PtP'_t be the conditional distribution of ZtZ_t given X1:t1X_{1:t-1} and ξ0:t1\xi_{0:t-1}. Then we have

(Pt,f)=E[(Ytf(Xt))2X1:t1,ξ0:t1]=E[(f(Xt)+ξt+ξ0f(Xt))2X1:t1,ξ0:t1]=E[(f(Xt)+ξt+ξ0f(Xt))2ξ0]=1+ff+ξ0L2(PX)2\begin{aligned} \ell(P'_t, f) &= \mathbb{E}\left[ (Y_t - f(X_t))^2 | X_{1:t-1}, \xi_{0:t-1} \right] \\ &= \mathbb{E}\left[ (f^\star(X_t) + \xi_t + \xi_0 - f(X_t))^2 | X_{1:t-1}, \xi_{0:t-1} \right] \\ & = \mathbb{E}\left[ (f^\star(X_t) + \xi_t + \xi_0 - f(X_t))^2 | \xi_{0} \right] \\ &= 1 + \| f^\star - f + \xi_0 \|^2_{L_2(P_X)}\end{aligned}


(with ξ0\xi_0 regarded as fixed). Then (Pt,f)=E[(Pt)Z1:t1]=1+E[ff+ξ0L2(PX)2Z1:t1]\ell(P_t, f) = \mathbb{E}\left[\ell(P'_t) | Z_{1:t-1}\right] = 1 + \mathbb{E}\left[\| f^\star - f + \xi_0 \|^2_{L_2(P_X)}| Z_{1:t-1}\right], where now ξ0\xi_0 (only) is regarded as random. It is easy to show that the distribution of ξ0\xi_0, given Z1:t1Z_{1:t-1}, is normal with variance 1/t1/t and mean

Ut1=i=1t1(Yif(Xi))t.U_{t-1} = \frac{\sum_{i=1}^{t-1}(Y_i-f^*(X_i))}{t}.

Consequently

(Pt,f)=1+1t+ff+Ut1L2(PX)2.\ell(P_t,f)= 1 + \frac 1 t + \left\| f^\star - f + U_{t-1}\right\|^2_{L_2(P_X)}.

In particular,

inffF1nt=1n(Pt,f)1.\inf_{f\in \mathcal{F}}\frac 1 n \sum_{t=1}^n \ell(P_t,f) \geq 1.


Now, with f^nERM=f^ERM(Z1:n)\widehat{f}^{\text{ERM}}_n = \widehat{f}^{\text{ERM}}(Z_{1:n}), consider (Pt,f^nERM)\ell(P_t,\widehat{f}^{\text{ERM}}_n). We have shown ff^nERMξ0f^*-\widehat{f}^{\text{ERM}}_n \rightarrow -\xi_0 in mean square. Also,

Ut1=ξ0+1t(i=1t1ξiξ0)ξ0U_{t-1} = \xi_0 + \frac 1 t \left(\sum_{i=1}^{t-1}{\xi_i}-\xi_0\right)\rightarrow \xi_0


in mean square. So 1nt=1n(Pt,f^nERM)1\frac 1 n \sum_{t=1}^n \ell(P_t,\widehat{f}^{\text{ERM}}_n) \rightarrow 1, the smallest possible value. That is, asymptotically the minimizer of 1Tt=1T(Pt,f)\frac{1}{T} \sum_{t=1}^T \ell(P_t, f) over F\mathcal{F} is f^TERM\widehat{f}^{\text{ERM}}_T (which converges, not to ff^\star, but to the random function f+ξ0f^\star+ \xi_0).

6. A Prequential Definition of Learnability

The previous section generalized the statistical setting to include nonproduct distributions and extended the definition of learnability to a more general setting. In this section we will generalize the online learnability definition to obtain a prequential version of learnability, while still keeping the level of generality of the previous section. As in the online setting, consider a sequence of learning rules f^0:n1\widehat{f}_{0:n-1}, where f^t\widehat{f}_t is a function only of Z1:t1Z_{1:t-1}, that is, it cannot peek ahead to access Zt:nZ_{t:n}. Unlike the online learning setting, Z1:tZ_{1:t} is a random sequence drawn from some general distribution P\mathbf{P} over Zn\mathcal{Z}^n. Now, define the minimax value

Vnpreq(F)=inff^0:n1Vnpreq(f^0:n1,F),V^{\text{preq}}_n(\mathcal{F}) = \inf_{\widehat{f}_{0:n-1}} V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) ,

where

Vnpreq(f^0:n1,F)=supPE[1nt=1n(Pt,f^t1)inffF1nt=1n(Pt,f)].V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) = \sup_{\mathbf{P}} \mathbb{E}\left[ \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \right].


Note that the expectation above is with respect to both P\mathbf{P} and any internal randomness used by the rules f^0:n1\widehat{f}_{0:n-1}. As before, the definition of the minimax value leads to the definition of learnability.

Definition 4. We say that F\mathcal{F} is prequentially learnable if

lim supnVnpreq(F)=0.\limsup_{n \to \infty} V^{\text{preq}}_n(\mathcal{F}) = 0 .


The definition of Vnpreq(F)V^{\text{preq}}_n(\mathcal{F}) can be obtained from the definition of Vngen(F)V^{\text{gen}}_n(\mathcal{F}) by replacing f^n\widehat{f}_n, which depends on the entire sequence Z1:nZ_{1:n}, by f^t1\widehat{f}_{t-1}, which depends only on Z1:t1Z_{1:t-1}, in the loss term that involves PtP_t. It can also be thought of as a generalization of VonlineV^{\text{online}} because of the following. When the distribution P\mathbf{P} degenerates to a point mass at a specific sequence z1:nz_{1:n} then PtP_t becomes a point mass at ztz_t and the difference of cumulative losses above reduces to the individual sequence regret of f^0:n1\widehat{f}_{0:n-1} on z1:nz_{1:n}. This observation immediately gives us the following result.

Lemma 7. Fix any loss function \ell and function class F\mathcal{F}. Then we have Vnpreq(F)Vnonline(F)V^{\text{preq}}_n(\mathcal{F}) \ge V^{\text{online}}_n(\mathcal{F}).

The lemma above says that prequential learnability is at least as hard as online learnability. Our next lemma provides a converse result.

Lemma 8. Fix any loss function and function class F\mathcal{F}. Then for any sequence f^0:n1\widehat{f}_{0:n-1} of learning rules we have

Vnpreq(f^0:n1,F)Vnonline(f^0:n1,F)+2Rnseq(F).V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) \le V^{\text{online}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) + 2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .

This also means that

Vnpreq(F)Vnonline(F)+2Rnseq(F).V^{\text{preq}}_n(\mathcal{F}) \le V^{\text{online}}_n(\mathcal{F}) + 2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .


Proof. Let P\mathbf{P} be an arbitrary distribution. We have the following three term decomposition:

1nt=1n(Pt,f^t1)inffF1nt=1n(Pt,f)=1nt=1n(Pt,f^t1)1nt=1n(Zt,f^t1)(I)+1nt=1n(Zt,f^t1)inffF1nt=1n(Zt,f)(II)+inffF1nt=1n(Zt,f)inffF1nt=1n(Pt,f)(III).\begin{aligned} & \quad \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\ &= \underset{(I)}{\underbrace{ \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1}) - \frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}_{t-1}) }} \\ &\quad + \underset{(II)}{\underbrace{ \frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) }} \\ &\quad + \underset{(III)}{\underbrace{ \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) }}.\end{aligned}


The term (I)(I) involves a martingale difference sequence (Pt,f^t1)(Zt,f^t1)\ell(P_t, \widehat{f}_{t-1}) - \ell(Z_t, \widehat{f}_{t-1}) and hence has expectation zero under P\mathbf{P}. Term (II)(II) is the individual sequence regret of f^0:n1\widehat{f}_{0:n-1} on the sequence Z1:nZ_{1:n} and hence is bounded, in expectation, by Vnonline(f^0:n1,F)V^{\text{online}}_n(\widehat{f}_{0:n-1}, \mathcal{F}). Term (III)(III), in expectation, is at most,

E[supfF1nt=1n((Zt,f)(Pt,f))]2Rnseq(F),\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \left( \ell(Z_t, f) - \ell(P_t, f) \right) \right] \le 2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) ,


where the inequality again follows from Theorem 2 of Rakhlin et al. (2015b).

The lemma now follows by taking expectations on both sides of the three-term decomposition above and plugging in the upper bounds for each term’s expected value.

We now have all the ingredients to characterize prequential learnability for binary classification.

Theorem 9. Consider binary classification with 00-11 loss. Then all of the conditions in Theorem 2 are also equivalent to:

F\mathcal{F} is prequentially learnable.


Proof. From Lemma 7, we know that if a class is prequentially learnable then it is online learnable. In the other direction, using Lemma 8, we have

Vnpreq(F)Vnonline(F)+2Rnseq(F)Vnonline(F)+Rnseq(F),\begin{aligned} V^{\text{preq}}_n(\mathcal{F}) &\le V^{\text{online}}_n(\mathcal{F}) + 2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) \\ &\le V^{\text{online}}_n(\mathcal{F}) + \mathfrak{R}^\text{seq}_n(\mathcal{F}) ,\end{aligned}