Skip to main content

# Abstract

# Media Summary

# 1. Introduction

## 1.1. Related Work

# 2. Preliminaries

# 3. Learnability in the IID Setting

**Definition 1**. We say that $\mathcal{F}$ is learnable in the iid learning setting if

#### Furthermore, we say that $\mathcal{F}$ is learnable via a sequence $\widehat{f}_n$ of learning rules if

**Theorem 1**. *Consider binary classification with *$0$ *-*$1$ * loss in the iid setting. Then, the following are equivalent:*

*(1) *$\mathcal{F}$ * is learnable.*

*(2) *$\mathcal{F}$ * is learnable via ERM.*

*(3) The ULLN condition *(3.1)* holds for *$\mathcal{F}$ *.*

*(4) *$\text{VCdim}(\mathcal{F}) < \infty$ *. *

*(5) *$\limsup_{n \to \infty} \mathfrak{R}_n(\mathcal{F}) = 0$ *.*

# 4. Learnability in the Online Setting

**Definition 2**. We say that $\mathcal{F}$ is learnable in the online learning setting if

**Theorem 2**. *Consider binary classification with *$0$ *-*$1$ * loss in the online (individual sequence) setting. Then, the following are equivalent:*

*(1) *$\mathcal{F}$ * is learnable.*

*(2) The UMLLN condition *(4.1)* holds for *$\mathcal{F}$ *.*

*(3) *$\text{Ldim}(\mathcal{F}) < \infty$ *. *

*(4) *$\limsup_{n \to \infty} \mathfrak{R}^\text{seq}_n(\mathcal{F}) = 0$ *.*

# 5. Learnability Under General Stochastic Processes

**Definition 3**. We say that $\mathcal{F}$ is process learnable if

#### Furthermore, we say that $\mathcal{F}$ is process learnable via a sequence $\widehat{f}_n$ of learning rules if

**Lemma 3**. *Fix any loss function *$\ell$ * and function class *$\mathcal{F}$ *. For any learning rule *$\widehat{f}_n$ *, *$V^{\text{gen}}(\widehat{f}_n, \mathcal{F}) \ge V^{\text{iid}}(\widehat{f}_n, \mathcal{F})$ *. This also means that *$V^{\text{gen}}_n(\mathcal{F}) \ge V^{\text{iid}}_n(\mathcal{F})$ *.*

**Theorem 4**. *Consider binary classification with 0-1 loss in the general stochastic process setting. Suppose the class *$\mathcal{F}$ * is not online learnable, i.e., *$\text{Ldim}(\mathcal{F}) = \infty$ *. Then for any *$n \ge 1$ *, *$V^{\text{gen}}_n(\mathcal{F}) \ge 1/8$ *. Therefore, the class *$\mathcal{F}$ * is not process learnable.*

**Theorem 5**. *Fix any loss function *$\ell$ * and function class *$\mathcal{F}$ *. Let *$\widehat{f}^{\text{ERM}}$ * denote the ERM learning rule defined in *(2.1). *Then we have *

**Theorem 6**. *Consider binary classification with *$0$ *-*$1$ * loss. Then all of the equivalent conditions in **Theorem 2** are also equivalent to:*

•* *$\mathcal{F}$ * is process learnable.*

## 5.1. Examples

# 6. A Prequential Definition of Learnability

**Definition 4**. We say that $\mathcal{F}$ is prequentially learnable if

**Lemma 7**. *Fix any loss function *$\ell$ * and function class *$\mathcal{F}$ *. Then we have *$V^{\text{preq}}_n(\mathcal{F}) \ge V^{\text{online}}_n(\mathcal{F})$ *.*

**Lemma 8**. *Fix any loss function and function class *$\mathcal{F}$ *. Then for any sequence *$\widehat{f}_{0:n-1}$ * of learning rules we have *

*This also means that *

**Theorem 9**. *Consider binary classification with *$0$ *-*$1$ * loss. Then all of the conditions in **Theorem 2** are also equivalent to:*

•$\mathcal{F}$ * is prequentially learnable.*

On Learnability Under General Stochastic Processes

Published onOct 27, 2022

On Learnability Under General Stochastic Processes

Statistical learning theory under independent and identically distributed (iid) sampling and online learning theory for worst case individual sequences are two of the best developed branches of learning theory. Statistical learning under general non-iid stochastic processes is less mature. We provide two natural notions of learnability of a function class under a general stochastic process. We show that both notions are in fact equivalent to online learnability. Our results hold for both binary classification and regression.

**Keywords:** learnability, statistical learning, online learning, prequential principle, uniform martingale law of large numbers

This article is a contribution to ‘learning theory,’ which has largely developed within the computer science community, but has much potential for cross-fertilization with theoretical statistics. The basic framework is as follows. At successive time points we observe a signal, and wish to predict an associated outcome, using one from a given class of possible functions. Performance is measured by a specified loss, measuring the discrepancy between prediction and outcome. Learning theory aims to characterize conditions that allow us to make provably good predictions by analyzing past data to determine a suitable prediction function. Two approaches have emerged, largely in isolation from each other. In the first, (iid) statistical learning theory, we assume all past and future signal-outcome pairs have the same probabilistic structure, independently. The second, online learning theory, makes no probabilistic assumptions, and aims to track the best predictor from those available, no matter what data are observed. In this work we introduce two natural ways of merging these approaches, assuming arbitrary probabilistic dependence across time. We show that both approaches are equivalent to online learnability. Our results hold for binary classification as well as regression.

One of the most beautiful and best developed branches of machine learning theory is classical statistical learning theory. (For a nontechnical overview and for more extensive references, see von Luxburg & Schölkopf, 2011.) However, it deals primarily with independent and identically distributed (iid) sampling of examples. Several attempts have been made to deal with both dependence and non-stationarity; we discuss some of these extensions in Section 1.1. However in general the non-iid case is not as well developed as the classical iid case.

Another well-developed branch of learning theory that has its own share of elegant mathematical ideas is online learning theory. (For an excellent introduction, see Cesa-Bianchi & Lugosi, 2006.) With roots in game theory and the area of information theory known as universal prediction of individual sequences, online learning theory, unlike statistical learning theory, does not use probabilistic foundations. It is therefore quite surprising that there are uncanny parallels between iid learning theory and online learning theory. The reader is invited to compare the statements of the fundamental theorems in these two areas (restated in this article as Theorem 1 and Theorem 2).

Our main goal in this article is to study learnability of a function class in the statistical setting under extremely general assumptions that do not require independence or stationarity. We first summarize the key theorems of iid and online learning in Section 3 and Section 4. Although this material is not new, we feel that the broader data science community might not be very familiar with results in online learning since it is a younger field compared to statistical learning theory. Also, presenting both iid learning and online learning results in a unified way allows us to draw parallels between the two theories and to motivate the need for novel theories that connect these two.

We propose a definition of learnability under general stochastic processes in Section 5. We show that learning under this general definition is equivalent to online learnability (Theorem 6). We give a prequential version of our main definition in Section 6. In the prequential version, as in online learning, the function output by the learning algorithm at any given time cannot peek into the future. We show that learnability under the prequential version of our general learning setting is also equivalent to online learnability (Theorem 9). We focus on the problem of binary classification for simplicity. But we also provide extensions of our equivalence results to the problem of real valued prediction (i.e., regression) in Section 7 (see Theorem 13 and Theorem 14).

The iid assumption of statistical learning theory has been relaxed and replaced with various types of *mixing* assumptions, especially

There has been work on performance guarantees of specific algorithms like boosting (Lozano et al., 2005) and support vector machines (Steinwart & Anghel, 2009; Steinwart et al., 2009) under non-iid assumptions. However, our focus here is not on any specific learning methodology. We would like to point out that, while we focus on *learnability* of functions in a fixed class, the question of *universal consistency* has also been studied in the context of general stochastic processes (Hanneke, 2021; Nobel, 1999).

There are a handful of papers that focus, as we do, on conditional risk given a sequence of observation drawn from a general non-iid stochastic processes (Pestov, 2010; Shalizi & Kontorovich, 2013; Zimin & Lampert, 2017). These papers focus on process decompositions: expressing a complex stochastic process as a mixture of simpler stochastic processes. For example, de Finetti’s theorem shows that exchangeable distributions are mixtures of iid distributions. The basic idea is to output a function with small expected loss one step beyond the observed sample where the expectation is also conditioned on the observed sample. While closely related, our performance measures are cumulative in nature and are inspired more by regret analysis in online learning than PAC bounds in computational learning theory.

The use of tools from online learning theory (e.g., sequential Rademacher complexity) for developing learning theory for dependent, nonstationary processes was pioneered by Kuznetsov and Mohri (2015, 2017). However, their focus is on time series forecasting applications and therefore their performance measures always involve the expected loss of the function chosen by the learning algorithm some steps into the future (i.e., the part not seen by the learning algorithm) of the process. In contrast, our definition uses conditional distributions of the stochastic process *on the realized path* to define our performance measure. We also point out that there are earlier papers that apply learning theory tools to understand time series prediction (Alquier & Wintenberger, 2012; Meir, 2000; Modha & Masry, 1998). Some very recent work has also begun to extend some of the work on time series to processes with spatial structure and dependence such as those occurring on a network (Dagan et al., 2019).

A direct inspiration for this article is the work of Skouras and Dawid (2000) on estimation in semiparametric statistical models under misspecification. They highlighted that, under misspecification, M-estimators, including maximum likelihood estimators, may not converge to a deterministic limit even asymptotically. Instead, the limit can be stochastic. This is because, under misspecification, the ‘best’ model can depend on the observed sequence of data. They gave examples showing that this can happen for non-ergodic processes or processes with long-range dependencies that do not decay fast enough. Our work can be seen as a direct extension of their ideas to the learning theory setting where the focus is not on parameter estimation but on loss minimization over potentially massive function spaces.

We consider a supervised learning setting where we want to learn a mapping from an input space *class* of functions that we will denote by

We often denote an input-output pair

$\widehat{f}^{\text{ERM}}_n = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(z_t,f) . \ \ \ \ \ \ \ \ \text{(2.1)}$

Note that, for infinite function classes, the minimum may not be achieved. In that case, one can work with functions achieving empirical risks that are arbitrarily close to the infimum of the empirical risk over the class

Given a distribution

$\ell(P, f) = \mathbb{E}_{Z\sim P}\left[\ell(Z,f)\right] .$

The function minimizing the expectation above is

$f^\star_P = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \ell(P, f).$

The

$\begin{aligned}
\rho(P, f) &= \ell(P, f) - \inf_{f' \in \mathcal{F}} \ell(P, f') \\
&= \ell(P, f) - \ell(P, f^\star_P).\end{aligned}$

Note that the

In this section we review some basic results of statistical learning theory under iid sampling. For more details the reader can consult standard texts in this area (Anthony & Bartlett, 1999; Shalev-Shwartz & Ben-David, 2014; Vidyasagar, 2002). In the standard formulation of statistical learning theory, we draw a sequence

$V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) = \sup_{P} \mathbb{E}\left[ \rho(P, \widehat{f}_n) \right]$

and the minimax value by

$V^{\text{iid}}_n(\mathcal{F}) = \inf_{\widehat{f}_n} V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) .$

For the sake of conciseness, the notation above hides the fact that

$\limsup_{n \to \infty} V^{\text{iid}}_n(\mathcal{F}) = 0 .$

$\limsup_{n \to \infty} V^{\text{iid}}_n(\widehat{f}_n, \mathcal{F}) = 0 .$

One of the major achievements of statistical learning theory was the determination of necessary and sufficient conditions for learnability of a class

$\limsup_{n \to \infty} \sup_{P} \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{t=1}^n f(X_t) - Pf \right|\right] = 0 . \ \ \ \ \ \ \ \ \text{(3.1)}$

Here

The Vapnik-Chervonenkis (VC) dimension of

$\forall \epsilon_{1:n} \in \{\pm1\}^n, \exists f \in \mathcal{F}, \text{ s.t. } \forall t \in [n], f(x_t) = \epsilon_t .$

Finally, we recall the definition of the (expected) Rademacher complexity of a function class with respect to a distribution

$\mathfrak{R}_n(P, \mathcal{F}) = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(X_t) \right]$

Note that the expectation above is with respect to both

$\mathfrak{R}_n(\mathcal{F}) = \sup_{P} \mathfrak{R}_n(P, \mathcal{F}) .$

A similar result holds for regression with absolute loss with the VC dimension condition (i.e., condition number 4 above) replaced with a similar one involving its scale-sensitive counterpart, called the fat-shattering dimension (see Section 7.1 for details).

A second learning setting with a well-developed theory is the *online* learning setting, where no probabilistic assumptions are placed on the data-generating process. Compared to statistical learning theory under iid sampling, online learning theory is a younger field. The main combinatorial parameter in this area, the Littlestone dimension, was defined by Littlestone (1988). It was given the name “Littlestone dimension” by Ben-David et al. (2009), where it was also shown that it fully characterizes learnability in the binary classification setting. Scale-sensitive analogues of the Littlestone dimension for regression problems and the sequential version of Rademacher complexity were studied in Rakhlin et al. (2015a, 2015b).

The online learning setting takes an individual sequence approach, where results are sought that hold for every possible sequence

We consider a sequence

$\rho^{\text{online}}(\widehat{f}_{0:n-1},z_{1:n}) =
\mathbb{E}\left[
\frac{1}{n}
\left(
\sum_{t=1}^n \ell(z_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell(z_t, f)
\right)
\right].$

This is similar in flavor to, but distinct from, the regret function *prequential principle* (Dawid, 1984): performance of

The object of interest is now the following minimax value:

$V^{\text{online}}_n(\mathcal{F}) = \inf_
{\widehat{f}_{0:n-1}} V^{\text{online}}(\widehat{f}_{0:n-1}, \mathcal{F}),$

where

$V^{\text{online}}(\widehat{f}_{0:n-1}, \mathcal{F}) = \sup_{z_{1:n} \in \mathcal{Z}^n} \rho^{\text{online}}(\widehat{f}_{0:n-1},z_{1:n})$

is the worst-case performance of the sequence

$\limsup_{n \to \infty} V^{\text{online}}_n(\mathcal{F}) = 0 .$

As in statistical learning, we have necessary and sufficient conditions for learnability that almost mirror those in Theorem 1. The ULLN condition gets replaced by the Uniform Martingale Law of Large Numbers (UMLLN). We say that UMLLN holds for

$\limsup_{n \to \infty} \sup_{\mathbf{P}, \mathcal{A}} \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{t=1}^n \left( f(X_t) - \mathbb{E}\left[f(X_t)|\mathcal{A}_{t-1}\right] \right) \right|\right] = 0 . \ \ \ \ \ \ \ \text{(4.1)}$

The crucial difference between the UMLLN condition and the ULLN condition is that here the supremum is taken over *all* joint distributions

The VC dimension of

$\forall \epsilon_{1:n} \in \{\pm1\}^n, \exists f \in \mathcal{F}, \text{ s.t. } \forall t \in [n], f(\mathbf{x}_t(\epsilon_{1:t-1})) = \epsilon_t .$

Finally, Rademacher complexity gets replaced with its sequential analogue, called the sequential Rademacher complexity. We first define the sequential Rademacher complexity of

$\mathfrak{R}^\text{seq}_n(\mathbf{x}, \mathcal{F}) = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1})) \right] .$

Note that the expectation above is only with respect to the Rademacher random variables

$\mathfrak{R}^\text{seq}_n(\mathcal{F}) = \sup_{\mathbf{x}} \mathfrak{R}^\text{seq}_n(\mathbf{x}, \mathcal{F}) .$

As in the iid setting, a similar result holds for online regression with absolute loss, with the Littlestone dimension condition (i.e., condition number 3 above) replaced by a similar one involving its scale-sensitive counterpart, called the sequential fat shattering dimension (see Section 7.2 for details).

It is well known that online learnability is harder than iid learnability. That is,

$\mathcal{F}_{\text{threshold}}
= \{
x \mapsto 1\kern-0.25em\text{l}\left[x > \theta\right]
\::\:
\theta \in \mathbb{R}
\} \ \ \ \ \ \ \ \ \text{(4.2)}$

has

A conspicuous difference between Theorem 1 and Theorem 2 is the absence of the condition involving ERM. Indeed, ERM is not necessarily a good learning rule in the online setting: there exist classes learnable in the online setting that are not learnable via ERM. Unfortunately, the learning rules that learn a class

In this section we move beyond the iid setting to cover *all* distributions, not just product distributions. For a general stochastic process

$P_t(\cdot; z_{1:t-1}) = \mathbf{P}(\cdot|Z_{1:t-1}=z_{1:t-1}) .$

This is the conditional distribution of *data-dependent*. Therefore the

$R_n(Z_{1:n}, f) = \frac{1}{n} \sum_{t=1}^n \rho(P_t, f)$

as a performance measure. Note that the minimizer of this performance measure is data-dependent, unlike in the iid case. As in the iid setting a learning rule

$\begin{aligned}
V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F})
&= \sup_{\mathbf{P}} \mathbb{E}\left[
R_n(Z_{1:n}, \widehat{f}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f)
\right] \\
&= \sup_{\mathbf{P}} \mathbb{E}\left[
\frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_n)
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)
\right] ,\end{aligned}$

where the supremum is now taken over all joint distributions

$V^{\text{gen}}_n(\mathcal{F}) = \inf_{\widehat{f}_n} V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F}) .$

$\limsup_{n \to \infty} V^{\text{gen}}_n(\mathcal{F}) = 0 .$

$\limsup_{n \to \infty} V^{\text{gen}}_n(\widehat{f}_n, \mathcal{F}) = 0 .$

Note that in the iid case, when

$\frac{1}{n} \sum_{t=1}^n \ell(P_t, f) = \ell(P, f) .$

We have the following result as an immediate consequence.

The result above is not surprising: process learnability has to be harder than iid learnability. However, somewhat surprisingly, we can show that process learnability is at least as hard as online learnability.

To complement the lower bound above, we will now give a performance guarantee for ERM in the general stochastic process setting. Given a loss

$\ell \circ \mathcal{F}=
\{
z \mapsto \ell(z,f) \::\:
f \in \mathcal{F}
\} .$

We define the sequential Rademacher complexity of a loss class

$\begin{aligned}
\mathfrak{R}^\text{seq}_n(\mathbf{z}, \ell \circ \mathcal{F}) &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t \ell(\mathbf{z}_t(\epsilon_{1:t-1}), f) \right] ,\\
\mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) &= \sup_{\mathbf{z}} \mathfrak{R}^\text{seq}_n(\mathbf{z}, \ell \circ \mathcal{F}) .\end{aligned}$

Note that the supremum here is over

$V^{\text{gen}}_n(\mathcal{F}) \le V^{\text{gen}}_n(\widehat{f}^{\text{ERM}}_n, \mathcal{F}) \le 4 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}).$

** Proof.** The first inequality is true by definition of

Note, by definition of

$\frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}^{\text{ERM}}_n) =
\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) .$

Therefore, we have

$\begin{aligned}
&\quad R_n(Z_{1:n}, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f) \\
&= \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}^{\text{ERM}}_n)
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\
&= \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}^{\text{ERM}}_n)
- \frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}^{\text{ERM}}_n) + \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f)
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\
&\le \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(P_t, f)
- \ell(Z_t, f) \right) + \sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(Z_t, f)
- \ell(P_t, f) \right) . \end{aligned} \ \ \ \ \ \ \ \ \text{(5.1)}$

The justification for the last inequality is as follows. First, we know that

$\begin{aligned}
&\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)\\
&\leq \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f^\star) - \frac{1}{n} \sum_{t=1}^n \ell(P_t, f^\star) \\
&\leq \sup_f \frac{1}{n} \left(\sum_{t=1}^n \ell(Z_t, f) - \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f)\right).\end{aligned}$

Taking expectations on both sides of (5.1) gives us

$\begin{aligned}
& \quad \mathbb{E}\left[ R_n(Z_{1:n}, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} R_n(Z_{1:n}, f) \right] \\
&\le \mathbb{E}\left[
\sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(P_t, f) - \ell(Z_t, f) \right)
\right] \\
&\quad + \mathbb{E}\left[
\sup_{f \in \mathcal{F}} \frac{1}{n} \left( \sum_{t=1}^n \ell(Z_t, f)
- \ell(P_t, f) \right)
\right] \\
&\le 4 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .\end{aligned}$

Note that the last inequality follows from Theorem 2 of Rakhlin et al. (2015b). Since the last quantity above does not depend on

We now have everything in place to be able to show the equivalence of process learnability and online learnability. A similar result can also be shown in the regression case (see Section 7.3).

•

** Proof.** Theorem 4 established that learnability in the general stochastic process setting implies online learnability. For the other direction, note that according to Theorem 5 we have

$V^{\text{gen}}_n(\mathcal{F}) \le 4 \, \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F})
\le
2 \, \mathfrak{R}^\text{seq}_n(\mathcal{F}) \ ,$

where the second inequality follows from Theorem 16 in Appendix A. Taking

Although online learnability turns out to be equivalent to process learnability, there is an important difference between the two settings that has to do with the importance of ERM. In the former, ERM is not a good learning rule, whereas in the latter a learnable class is learnable via ERM. Therefore ERM continues to play a special role in the general stochastic process setting just like the iid setting.

Also note that Theorem 6 is stated in terms of learnability, which is an asymptotic concept. However, the proof clearly shows that the rate of convergence is determined by the sequential Rademacher complexity of

We end this section with some examples showing that our definition of process learnability is natural, interesting, and worth studying.

**IID Sampling.** Let us note once again that if

**Asymptotically Stationary Process.** Suppose that *asymptotically stationary* in the sense that the random probability measure

$\begin{aligned}
&\quad
\mathbb{E}\left[ \ell(P^\star, \widehat{f}^{\text{ERM}}_n)\right] - \inf_{f \in \mathcal{F}} \ell(P^\star, f) \\
&=
\mathbb{E}\left[ \ell(P^\star, \widehat{f}^{\text{ERM}}_n) - \ell(P^\star, f_{P^\star})\right] \\
&\le
2 \, \mathbb{E}\left[ \sup_{f \in \mathcal{F}} |
\ell(P^\star, f) - \ell(\bar{P}_n, f)
| \right] \\
&\quad + \mathbb{E}\left[
\ell(\bar{P}_n, \widehat{f}^{\text{ERM}}_n) - \ell(\bar{P}_n, f_{P^\star}) \right] \\
&\le
2 \, \mathbb{E}\left[ \|P^\star - \bar{P}_n\|_{TV} \right] \\
&\quad + \mathbb{E}\left[
\ell(\bar{P}_n, \widehat{f}^{\text{ERM}}_n) - \inf_{f \in \mathcal{F}} \ell(\bar{P}_n, f)
\right] .\end{aligned}$

By the stationarity assumption, the first term on the right in the last inequality goes to zero. Moreover, the rate of convergence can often be characterized in terms of the mixing coefficients of the stochastic process (Vidyasagar, 2002). By learnability of

**Mixture of IID.** Consider a simple mixture of product distributions

$\mathbf{P}= \lambda P \otimes P \otimes \ldots \otimes P +
(1-\lambda) Q \otimes Q \otimes \ldots \otimes Q$

where, for simplicity, assume that

$\begin{aligned}
\frac{1}{n} \sum_{t=1}^n \ell(P_t, f)\end{aligned} \ \ \ \ \ \ \ \ \ \text{(5.2)}$

is

**Random Level.** Fix the squared loss *not* iid. It is not even mixing in any sense due to long-range dependence in

$\begin{aligned}
\widehat{f}^{\text{ERM}}_n(Z_{1:n})
&= \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n}
\sum_{t=1}^n (f^\star(X_t) + \xi_0 + \xi_t - f(X_t))^2 \\
&= \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \frac{1}{n}
\sum_{t=1}^n (f^\star(X_t) + \xi_t - (f(X_t)-\xi_0))^2 \\
&= \xi_0 + \mathop{\mathrm{argmin}}_{g \in \mathcal{F}-\xi_0} \frac{1}{n}
\sum_{t=1}^n (f^\star(X_t) + \xi_t - g(X_t))^2 \\
&= \xi_0 + \mathop{\mathrm{argmin}}_{g \in \mathcal{F}} \frac{1}{n}
\sum_{t=1}^n (f^\star(X_t) + \xi_t - g(X_t))^2 ,\end{aligned}$

where the last equality holds because

$\widehat{f}^{\text{ERM}}_n(Z_{1:n}) = \widehat{f}^{\text{ERM}}_n((X_t,f^\star(X_t) + \xi_t)_{t=1}^n) + \xi_0 .$

Since *random* function

Next we compute

$\begin{aligned}
\ell(P'_t, f) &=
\mathbb{E}\left[ (Y_t - f(X_t))^2 | X_{1:t-1}, \xi_{0:t-1} \right] \\
&= \mathbb{E}\left[
(f^\star(X_t) + \xi_t + \xi_0 - f(X_t))^2 |
X_{1:t-1}, \xi_{0:t-1}
\right] \\
& = \mathbb{E}\left[
(f^\star(X_t) + \xi_t + \xi_0 - f(X_t))^2 |
\xi_{0}
\right] \\
&= 1 + \| f^\star - f + \xi_0 \|^2_{L_2(P_X)}\end{aligned}$

(with

$U_{t-1} = \frac{\sum_{i=1}^{t-1}(Y_i-f^*(X_i))}{t}.$

Consequently

$\ell(P_t,f)=
1 + \frac 1 t + \left\| f^\star - f + U_{t-1}\right\|^2_{L_2(P_X)}.$

In particular,

$\inf_{f\in \mathcal{F}}\frac 1 n \sum_{t=1}^n \ell(P_t,f) \geq 1.$

Now, with

$U_{t-1} = \xi_0 + \frac 1 t \left(\sum_{i=1}^{t-1}{\xi_i}-\xi_0\right)\rightarrow \xi_0$

in mean square. So

The previous section generalized the statistical setting to include nonproduct distributions and extended the definition of learnability to a more general setting. In this section we will generalize the online learnability definition to obtain a prequential version of learnability, while still keeping the level of generality of the previous section. As in the online setting, consider a sequence of learning rules

$V^{\text{preq}}_n(\mathcal{F}) = \inf_{\widehat{f}_{0:n-1}} V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) ,$

where

$V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F}) = \sup_{\mathbf{P}}
\mathbb{E}\left[
\frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)
\right].$

Note that the expectation above is with respect to both

$\limsup_{n \to \infty} V^{\text{preq}}_n(\mathcal{F}) = 0 .$

The definition of

The lemma above says that prequential learnability is at least as hard as online learnability. Our next lemma provides a converse result.

$V^{\text{preq}}_n(\widehat{f}_{0:n-1}, \mathcal{F})
\le
V^{\text{online}}_n(\widehat{f}_{0:n-1}, \mathcal{F})
+
2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .$

$V^{\text{preq}}_n(\mathcal{F})
\le
V^{\text{online}}_n(\mathcal{F})
+
2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) .$

*Proof**.* Let

$\begin{aligned}
& \quad \frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1})
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f) \\
&= \underset{(I)}{\underbrace{
\frac{1}{n} \sum_{t=1}^n \ell(P_t, \widehat{f}_{t-1})
- \frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}_{t-1})
}} \\
&\quad + \underset{(II)}{\underbrace{
\frac{1}{n} \sum_{t=1}^n \ell(Z_t, \widehat{f}_{t-1})
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f)
}} \\
&\quad + \underset{(III)}{\underbrace{
\inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(Z_t, f)
- \inf_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(P_t, f)
}}.\end{aligned}$

The term

$\mathbb{E}\left[
\sup_{f \in \mathcal{F}}
\frac{1}{n} \sum_{t=1}^n \left(
\ell(Z_t, f) - \ell(P_t, f)
\right)
\right]
\le 2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) ,$

where the inequality again follows from Theorem 2 of Rakhlin et al. (2015b).

The lemma now follows by taking expectations on both sides of the three-term decomposition above and plugging in the upper bounds for each term’s expected value.

We now have all the ingredients to characterize prequential learnability for binary classification.

•

*Proof**.* From Lemma 7, we know that if a class is prequentially learnable then it is online learnable. In the other direction, using Lemma 8, we have

$\begin{aligned}
V^{\text{preq}}_n(\mathcal{F})
&\le
V^{\text{online}}_n(\mathcal{F})
+
2 \mathfrak{R}^\text{seq}_n(\ell \circ \mathcal{F}) \\
&\le
V^{\text{online}}_n(\mathcal{F})
+
\mathfrak{R}^\text{seq}_n(\mathcal{F}) ,\end{aligned}$