Statistical learning theory under independent and identically distributed (iid) sampling and online learning theory for worst case individual sequences are two of the best developed branches of learning theory. Statistical learning under general non-iid stochastic processes is less mature. We provide two natural notions of learnability of a function class under a general stochastic process. We show that both notions are in fact equivalent to online learnability. Our results hold for both binary classification and regression.
Keywords: learnability, statistical learning, online learning, prequential principle, uniform martingale law of large numbers
This article is a contribution to ‘learning theory,’ which has largely developed within the computer science community, but has much potential for cross-fertilization with theoretical statistics. The basic framework is as follows. At successive time points we observe a signal, and wish to predict an associated outcome, using one from a given class of possible functions. Performance is measured by a specified loss that quantifies the discrepancy between prediction and outcome. Learning theory aims to characterize conditions that allow us to make provably good predictions by analyzing past data to determine a suitable prediction function. Two approaches have emerged, largely in isolation from each other. In the first, (iid) statistical learning theory, we assume all past and future signal-outcome pairs have the same probabilistic structure, independently. The second, online learning theory, makes no probabilistic assumptions, and aims to track the best predictor from those available, no matter what data are observed. In this work we introduce two natural ways of merging these approaches, assuming arbitrary probabilistic dependence across time. We show that both resulting notions of learnability are equivalent to online learnability. Our results hold for binary classification as well as regression.
One of the most beautiful and best developed branches of machine learning theory is classical statistical learning theory. (For a nontechnical overview and for more extensive references, see von Luxburg & Schölkopf, 2011.) However, it deals primarily with independent and identically distributed (iid) sampling of examples. Several attempts have been made to deal with both dependence and non-stationarity; we discuss some of these extensions in Section 1.1. However in general the non-iid case is not as well developed as the classical iid case.
Another well-developed branch of learning theory that has its own share of elegant mathematical ideas is online learning theory. (For an excellent introduction, see Cesa-Bianchi & Lugosi, 2006.) With roots in game theory and the area of information theory known as universal prediction of individual sequences, online learning theory, unlike statistical learning theory, does not use probabilistic foundations. It is therefore quite surprising that there are uncanny parallels between iid learning theory and online learning theory. The reader is invited to compare the statements of the fundamental theorems in these two areas (restated in this article as Theorem 1 and Theorem 2).
Our main goal in this article is to study learnability of a function class in the statistical setting under extremely general assumptions that do not require independence or stationarity. We first summarize the key theorems of iid and online learning in Section 3 and Section 4. Although this material is not new, we feel that the broader data science community might not be very familiar with results in online learning since it is a younger field compared to statistical learning theory. Also, presenting both iid learning and online learning results in a unified way allows us to draw parallels between the two theories and to motivate the need for novel theories that connect these two.
We propose a definition of learnability under general stochastic processes in Section 5. We show that learning under this general definition is equivalent to online learnability (Theorem 6). We give a prequential version of our main definition in Section 6. In the prequential version, as in online learning, the function output by the learning algorithm at any given time cannot peek into the future. We show that learnability under the prequential version of our general learning setting is also equivalent to online learnability (Theorem 9). We focus on the problem of binary classification for simplicity. But we also provide extensions of our equivalence results to the problem of real valued prediction (i.e., regression) in Section 7 (see Theorem 13 and Theorem 14).
The iid assumption of statistical learning theory has been relaxed and replaced with various types of mixing assumptions, especially $\beta$-mixing (Mohri & Rostamizadeh, 2008; Vidyasagar, 2002). However, in this line of investigation, the stationarity assumption is kept and the theory resembles the iid theory to a large extent since mixing implies approximate independence of random variables that are sufficiently separated in time. Mixing assumptions can be shown to hold for some interesting classes of processes, including some Markov and hidden Markov processes. Markov sampling has also been considered on its own as a generalization of iid sampling (Aldous & Vazirani, 1995; Gamarnik, 2003; Smale & Zhou, 2009).
There has been work on performance guarantees of specific algorithms like boosting (Lozano et al., 2005) and support vector machines (Steinwart & Anghel, 2009; Steinwart et al., 2009) under non-iid assumptions. However, our focus here is not on any specific learning methodology. We would like to point out that, while we focus on learnability of functions in a fixed class, the question of universal consistency has also been studied in the context of general stochastic processes (Hanneke, 2021; Nobel, 1999).
There are a handful of papers that focus, as we do, on conditional risk given a sequence of observations drawn from a general non-iid stochastic process (Pestov, 2010; Shalizi & Kontorovich, 2013; Zimin & Lampert, 2017). These papers focus on process decompositions: expressing a complex stochastic process as a mixture of simpler stochastic processes. For example, de Finetti’s theorem shows that exchangeable distributions are mixtures of iid distributions. The basic idea is to output a function with small expected loss one step beyond the observed sample, where the expectation is also conditioned on the observed sample. While closely related, our performance measures are cumulative in nature and are inspired more by regret analysis in online learning than by PAC bounds in computational learning theory.
The use of tools from online learning theory (e.g., sequential Rademacher complexity) for developing learning theory for dependent, nonstationary processes was pioneered by Kuznetsov and Mohri (2015, 2017). However, their focus is on time series forecasting applications and therefore their performance measures always involve the expected loss of the function chosen by the learning algorithm some steps into the future (i.e., the part not seen by the learning algorithm) of the process. In contrast, our definition uses conditional distributions of the stochastic process on the realized path to define our performance measure. We also point out that there are earlier papers that apply learning theory tools to understand time series prediction (Alquier & Wintenberger, 2012; Meir, 2000; Modha & Masry, 1998). Some very recent work has also begun to extend some of the work on time series to processes with spatial structure and dependence such as those occurring on a network (Dagan et al., 2019).
A direct inspiration for this article is the work of Skouras and Dawid (2000) on estimation in semiparametric statistical models under misspecification. They highlighted that, under misspecification, M-estimators, including maximum likelihood estimators, may not converge to a deterministic limit even asymptotically. Instead, the limit can be stochastic. This is because, under misspecification, the ‘best’ model can depend on the observed sequence of data. They gave examples showing that this can happen for non-ergodic processes or processes with long-range dependencies that do not decay fast enough. Our work can be seen as a direct extension of their ideas to the learning theory setting where the focus is not on parameter estimation but on loss minimization over potentially massive function spaces.
We consider a supervised learning setting where we want to learn a mapping from an input space $\mathcal{X}$ to an output space $\mathcal{Y}$. Two output spaces of interest to us in this article are $\mathcal{Y} = \{-1, +1\}$ (binary classification) and $\mathcal{Y} = [-1, 1]$ (regression). Instead of talking about the difficulty of learning individual functions, we will define learnability for a class of functions that we will denote by $\mathcal{F}$. Let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and let $\ell : \mathcal{Z} \times \mathcal{F} \to [0, \infty)$ be a loss function mapping an input-output pair and a function to a nonnegative loss. The set $\{1, \ldots, n\}$ will be denoted by $[n]$ and we use $\mathbb{1}[C]$ to denote an indicator function that is $1$ if the condition $C$ is true and $0$ otherwise. Two important loss functions are the $0$-$1$ loss $\ell((x, y), f) = \mathbb{1}[f(x) \ne y]$ (in binary classification) and the absolute loss $\ell((x, y), f) = |f(x) - y|$ (in regression).
We often denote an input-output pair $(x, y)$ by $z$. When the input-output pair is random, we will denote it by $Z = (X, Y)$, perhaps with additional time indices such as $Z_t = (X_t, Y_t)$. We will use the abbreviation $Z_{1:n}$ to denote the sequence $Z_1, \ldots, Z_n$. A learning rule is a map from $\cup_{n \ge 1} \mathcal{Z}^n$ to $\mathcal{F}$. We will abuse notation a bit and refer to the learning rule and the function output by the learning rule both by $\hat{f}_n$. An important learning rule is empirical risk minimization (ERM): given a sequence $z_{1:n}$ of input-output pairs, it outputs the function
$$\hat{f}_n = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \ell(z_t, f). \qquad (2.1)$$
Note that, for infinite function classes, the minimum may not be achieved. In that case, one can work with functions achieving empirical risks that are arbitrarily close to the infimum of the empirical risk over the class .
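To make the ERM rule concrete, here is a minimal sketch for a finite function class under the 0-1 loss. The threshold class, the sample, and all names are illustrative and not from the article.

```python
# Minimal ERM sketch: pick the function with smallest average loss on the sample.

def zero_one_loss(z, f):
    """0-1 loss of function f on the input-output pair z = (x, y)."""
    x, y = z
    return 0 if f(x) == y else 1

def erm(sample, function_class, loss):
    """Return a function in the class minimizing the empirical risk."""
    def empirical_risk(f):
        return sum(loss(z, f) for z in sample) / len(sample)
    return min(function_class, key=empirical_risk)

# Illustrative finite class: thresholds f_c(x) = +1 iff x >= c.
def make_threshold(c):
    return lambda x: 1 if x >= c else -1

thresholds = [make_threshold(k / 10) for k in range(11)]
sample = [(0.1, -1), (0.2, -1), (0.6, 1), (0.9, 1)]
f_hat = erm(sample, thresholds, zero_one_loss)
```

For an infinite class one would instead settle for a near-minimizer of the empirical risk, as noted above.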
Given a distribution $P$ on $\mathcal{Z}$, the loss function can be extended as follows:
$$\ell(P, f) = \mathbb{E}_{Z \sim P}[\ell(Z, f)].$$
The function minimizing the expectation above is
$$f^\star_P = \operatorname*{argmin}_{f \in \mathcal{F}} \ell(P, f).$$
The $P$-regret of a function $f$ is defined as
$$r_P(f) = \ell(P, f) - \ell(P, f^\star_P).$$
Note that the $P$-regret depends on the class $\mathcal{F}$ but we hide this dependence when the function class is clear from the context.
3. Learnability in the IID Setting
In this section we review some basic results of statistical learning theory under iid sampling. For more details the reader can consult standard texts in this area (Anthony & Bartlett, 1999; Shalev-Shwartz & Ben-David, 2014; Vidyasagar, 2002). In the standard formulation of statistical learning theory, we draw a sequence $Z_1, \ldots, Z_n$ of iid examples from a distribution $P$ on $\mathcal{Z}$. That is, the joint distribution of $Z_{1:n}$ is the product distribution $P^{\otimes n}$. We adopt the minimax framework to define learnability of a class $\mathcal{F}$ of functions with respect to a loss function $\ell$. Define the worst case performance of a learning rule $\hat{f}_n$ by
$$V^{\mathrm{iid}}_n(\hat{f}_n) = \sup_P \mathbb{E}\left[r_P(\hat{f}_n)\right]$$
and the minimax value by
$$V^{\mathrm{iid}}_n = \inf_{\hat{f}_n} V^{\mathrm{iid}}_n(\hat{f}_n).$$
For the sake of conciseness, the notation above hides the fact that $\hat{f}_n$ depends on the sample $Z_{1:n}$. The expectation above is taken over the randomness in these samples.
Definition 1. We say that $\mathcal{F}$ is learnable in the iid learning setting if
$$\lim_{n \to \infty} V^{\mathrm{iid}}_n = 0.$$
Furthermore, we say that $\mathcal{F}$ is learnable via a sequence of learning rules $(\hat{f}_n)_{n \ge 1}$ if
$$\lim_{n \to \infty} V^{\mathrm{iid}}_n(\hat{f}_n) = 0.$$
One of the major achievements of statistical learning theory was the determination of necessary and sufficient conditions for learnability of a class $\mathcal{F}$. Learnability in both binary classification with 0-1 loss and regression with absolute loss is known to be equivalent to a probabilistic condition, namely the uniform law of large numbers (ULLN) for the class $\mathcal{F}$:
$$\lim_{n \to \infty} \sup_P \mathbb{E}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{t=1}^n \ell(Z_t, f) - \ell(P, f)\right|\right] = 0. \qquad (3.1)$$
Here $Z_1, \ldots, Z_n$ are drawn iid from $P$ and $\ell(P, f) = \mathbb{E}_{Z \sim P}[\ell(Z, f)]$ as before. Whether or not ULLN holds for a class depends on the finiteness of different combinatorial parameters, depending on whether we are in the binary classification or regression setting. We will discuss the binary classification case here, leaving the regression case to Section 7.
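The ULLN can be checked numerically in simple cases. In the sketch below, everything is illustrative: $X$ is uniform on $[0, 1]$, labels come from the threshold at $1/2$, and the true 0-1 risk of the threshold at $c$ is $|c - 1/2|$ (the mass between $c$ and $1/2$ is misclassified). The uniform deviation over a grid of thresholds is small for moderate $n$.

```python
import random

def uniform_deviation(n, grid_size=100, seed=0):
    """Largest gap between empirical and true 0-1 risk over a threshold grid,
    for X ~ Uniform[0,1] and labels y = +1 iff x >= 1/2."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [1 if x >= 0.5 else -1 for x in xs]
    worst = 0.0
    for k in range(grid_size + 1):
        c = k / grid_size
        emp = sum((1 if x >= c else -1) != y for x, y in zip(xs, ys)) / n
        true = abs(c - 0.5)  # mass of points between c and 1/2 is misclassified
        worst = max(worst, abs(emp - true))
    return worst
```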
The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$, denoted by $\mathrm{VCdim}(\mathcal{F})$, is the length of the longest sequence shattered by $\mathcal{F}$. We say that a sequence $x_1, \ldots, x_n$ is shattered by $\mathcal{F}$ if
$$\forall \epsilon \in \{-1, +1\}^n, \ \exists f \in \mathcal{F}, \ \forall t \in [n], \ f(x_t) = \epsilon_t.$$
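For a finite class, shattering of a given sequence can be checked by brute force. The threshold class below is illustrative: it shatters any single point but no pair of points, since the pattern (+1 on the smaller point, -1 on the larger) is never realized.

```python
def shatters(function_class, points):
    """Brute-force check: does the class realize all 2^n sign patterns
    on the given points?"""
    patterns = {tuple(f(x) for x in points) for f in function_class}
    return len(patterns) == 2 ** len(points)

def make_threshold(c):
    return lambda x: 1 if x >= c else -1

thresholds = [make_threshold(k / 4) for k in range(5)]
```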
Finally, we recall the definition of the (expected) Rademacher complexity of a function class $\mathcal{F}$ with respect to a distribution $P$:
$$\mathfrak{R}_n(\mathcal{F}, P) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(X_t)\right].$$
Note that the expectation above is with respect to both $X_{1:n}$ and $\epsilon_{1:n}$. The former are drawn iid from $P$, whereas the latter are iid $\{-1, +1\}$-valued Rademacher (also called symmetric Bernoulli) random variables. The worst case, over $P$, Rademacher complexity is denoted by
$$\mathfrak{R}_n(\mathcal{F}) = \sup_P \mathfrak{R}_n(\mathcal{F}, P).$$
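For a small finite class and fixed points, the inner expectation can be estimated by Monte Carlo; all names and the example class are illustrative. If the class realizes every sign pattern on the points, the supremum equals 1 for every draw of signs, so the estimate is exactly 1; a smaller class gives a smaller value.

```python
import random

def empirical_rademacher(function_class, xs, num_rounds=2000, seed=0):
    """Monte Carlo estimate of E[ sup_f (1/n) sum_t eps_t f(x_t) ]
    with the points x_1, ..., x_n held fixed."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(num_rounds):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) / n
                     for f in function_class)
    return total / num_rounds
```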
Theorem 1. Consider binary classification with 0-1 loss in the iid setting. Then, the following are equivalent:
(1) $\mathcal{F}$ is learnable.
(2) $\mathcal{F}$ is learnable via ERM.
(3) The ULLN condition (3.1) holds for $\mathcal{F}$.
(4) $\mathrm{VCdim}(\mathcal{F}) < \infty$.
(5) $\mathfrak{R}_n(\mathcal{F}) \to 0$ as $n \to \infty$.
A similar result holds for regression with absolute loss with the VC dimension condition (i.e., condition number 4 above) replaced with a similar one involving its scale-sensitive counterpart, called the fat-shattering dimension (see Section 7.1 for details).
4. Learnability in the Online Setting
A second learning setting with a well-developed theory is the online learning setting, where no probabilistic assumptions are placed on the data-generating process. Compared to statistical learning theory under iid sampling, online learning theory is a younger field. The main combinatorial parameter in this area, the Littlestone dimension, was defined by Littlestone (1988). It was given the name “Littlestone dimension” by Ben-David et al. (2009), where it was also shown that it fully characterizes learnability in the binary classification setting. Scale-sensitive analogues of the Littlestone dimension for regression problems and the sequential version of Rademacher complexity were studied in Rakhlin et al. (2015a, 2015b).
The online learning setting takes an individual sequence approach, where results are sought that hold for every possible sequence that might be encountered by the learning rule.
We consider a sequence of learning rules $\hat{f}_0, \hat{f}_1, \ldots$, where $\hat{f}_{t-1}$ takes in as input the sequence $z_{1:t-1}$ and outputs a (possibly random) function in $\mathcal{F}$. Define the expected (normalized) regret of $(\hat{f}_t)$ on sequence $z_{1:n}$:
$$\mathrm{reg}_n(z_{1:n}) = \frac{1}{n}\,\mathbb{E}\left[\sum_{t=1}^n \ell(z_t, \hat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell(z_t, f)\right].$$
This is similar in flavor to, but distinct from, the regret function used in the iid setting. It obeys the prequential principle (Dawid, 1984): performance of $\hat{f}_{t-1}$, which is learned using $z_{1:t-1}$, is judged using loss evaluated on $z_t$, with no overlap between data used for learning and for performance evaluation. The expectation is needed because the learning rules may use internal randomization to achieve robustness to adversarial data. The regret nomenclature comes from the fact that $\hat{f}_{t-1}$ cannot peek into the future to lower its loss but its cumulative performance is compared with the lowest possible loss, in hindsight, over the entire sequence $z_{1:n}$. However, the comparator term has its own restriction: it uses the best fixed function in hindsight, as opposed to the best sequence of functions.
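For a finite class, the classical exponentially weighted average forecaster (see Cesa-Bianchi & Lugosi, 2006) is a randomized rule of exactly this prequential type. The sketch below computes its expected cumulative loss and the resulting normalized regret on a fixed sequence; the learning-rate choice and all names are illustrative, not the article's construction.

```python
import math

def hedge_expected_regret(sequence, function_class, loss, eta=0.5):
    """Run the exponentially weighted forecaster over the sequence and
    return its normalized expected regret against the best fixed function."""
    k = len(function_class)
    weights = [1.0] * k
    expected_loss = 0.0
    cumulative = [0.0] * k          # cumulative loss of each fixed function
    for z in sequence:
        total = sum(weights)
        losses = [loss(z, f) for f in function_class]
        # Expected loss when sampling f with probability proportional to weight.
        expected_loss += sum(w * l for w, l in zip(weights, losses)) / total
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return (expected_loss - min(cumulative)) / len(sequence)
```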
The object of interest is now the following minimax value:
$$V^{\mathrm{online}}_n = \inf_{(\hat{f}_t)} \sup_{z_{1:n}} \mathrm{reg}_n(z_{1:n}).$$
The supremum computes the worst-case performance of a given sequence of learning rules, with $\hat{f}_{t-1}$ taking in as input the sequence $z_{1:t-1}$ and outputting a function in $\mathcal{F}$. The infimum is then taken over all such learning rule sequences.
Definition 2. We say that $\mathcal{F}$ is learnable in the online learning setting if
$$\lim_{n \to \infty} V^{\mathrm{online}}_n = 0.$$
As in statistical learning, we have necessary and sufficient conditions for learnability that almost mirror those in Theorem 1. The ULLN condition gets replaced by the Uniform Martingale Law of Large Numbers (UMLLN). We say that UMLLN holds for $\mathcal{F}$ if
$$\lim_{n \to \infty} \sup_{P, (\mathcal{G}_t)} \mathbb{E}\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{t=1}^n \big(\ell(Z_t, f) - \mathbb{E}[\ell(Z_t, f) \mid \mathcal{G}_{t-1}]\big)\right|\right] = 0. \qquad (4.1)$$
The crucial difference between the UMLLN condition and the ULLN condition is that here the supremum is taken over all joint distributions $P$ of $Z_{1:n}$. In particular $Z_{1:n}$ need not be iid. Also, to obtain a martingale structure, we use an arbitrary filtration $(\mathcal{G}_t)_{t \ge 0}$ such that $Z_t$ is $\mathcal{G}_t$-measurable. It is easy to see that UMLLN is a stronger condition than ULLN: simply restrict $P$ to be a product distribution and let $(\mathcal{G}_t)$ be the natural filtration of $(Z_t)$. Then the UMLLN condition reduces to the ULLN condition.
The VC dimension of $\mathcal{F}$ is replaced by another combinatorial parameter, called the Littlestone dimension of $\mathcal{F}$, denoted by $\mathrm{Ldim}(\mathcal{F})$. Before we present the definition of Littlestone dimension, we need some notation to handle complete binary trees labeled with examples drawn from the input space $\mathcal{X}$. We think of a complete binary tree $\mathbf{x}$ of depth $n$ as defining a sequence $\mathbf{x}_1, \ldots, \mathbf{x}_n$ of maps, where $\mathbf{x}_t : \{-1, +1\}^{t-1} \to \mathcal{X}$. The map $\mathbf{x}_t$ gives us the examples sitting at level $t$ of the tree. For example, $\mathbf{x}_1$ is the root, $\mathbf{x}_2(-1)$ is the left child of the root, $\mathbf{x}_2(+1)$ is the right child of the root, and so on. In general $\mathbf{x}_t(\epsilon_{1:t-1})$ is the node at level $t$ that we reach by following the path given by the sign sequence $\epsilon_{1:t-1}$, where $-1$ means ‘go left’ and $+1$ means ‘go right’. The Littlestone dimension of $\mathcal{F}$ is the depth of the largest complete binary tree shattered by $\mathcal{F}$. We say that a complete binary tree $\mathbf{x}$ of depth $n$ is shattered by $\mathcal{F}$ if
$$\forall \epsilon \in \{-1, +1\}^n, \ \exists f \in \mathcal{F}, \ \forall t \in [n], \ f(\mathbf{x}_t(\epsilon_{1:t-1})) = \epsilon_t.$$
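For small depth and a finite class, tree shattering can be checked by brute force. In the illustrative sketch below, a tree is represented as a map from sign prefixes (tuples of -1/+1) to points; the depth-2 trees and the threshold class are examples of our own making, not from the article.

```python
from itertools import product

def tree_shatters(function_class, tree, depth):
    """Check tree shattering: for every sign sequence there must be a
    function tracing it along the corresponding root-to-leaf path."""
    for path in product((-1, 1), repeat=depth):
        if not any(all(f(tree[path[:t]]) == path[t] for t in range(depth))
                   for f in function_class):
            return False
    return True

def make_threshold(c):
    return lambda x: 1 if x >= c else -1

thresholds = [make_threshold(k / 5) for k in range(6)]
# Thresholds shatter a depth-2 tree whose left child is larger than the root,
# but not one whose left child is smaller.
good_tree = {(): 0.5, (-1,): 0.75, (1,): 0.25}
bad_tree = {(): 0.5, (-1,): 0.25, (1,): 0.75}
```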
Finally, Rademacher complexity gets replaced with its sequential analogue, called the sequential Rademacher complexity. We first define the sequential Rademacher complexity of $\mathcal{F}$ given a tree $\mathbf{x}$ of depth $n$ as:
$$\mathfrak{R}^{\mathrm{seq}}_n(\mathcal{F}, \mathbf{x}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^n \epsilon_t f(\mathbf{x}_t(\epsilon_{1:t-1}))\right].$$
Note that the expectation above is only with respect to the Rademacher random variables $\epsilon_{1:n}$ as $\mathbf{x}$ is a fixed tree. Taking the worst case over all complete binary trees of depth $n$ gives us the sequential Rademacher complexity of $\mathcal{F}$:
$$\mathfrak{R}^{\mathrm{seq}}_n(\mathcal{F}) = \sup_{\mathbf{x}} \mathfrak{R}^{\mathrm{seq}}_n(\mathcal{F}, \mathbf{x}).$$
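Since the expectation is over finitely many sign sequences, it can be computed exactly for small depth and a finite class. This illustrative sketch enumerates all sign sequences on a fixed tree (represented, as above, by a map from sign prefixes to points); names are our own.

```python
from itertools import product

def seq_rademacher_on_tree(function_class, tree, depth):
    """Exact average over all sign sequences of
    sup_f (1/n) sum_t eps_t * f(x_t(eps_{1:t-1}))."""
    total = 0.0
    for eps in product((-1, 1), repeat=depth):
        total += max(sum(e * f(tree[eps[:t]]) for t, e in enumerate(eps)) / depth
                     for f in function_class)
    return total / 2 ** depth
```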
Theorem 2. Consider binary classification with 0-1 loss in the online (individual sequence) setting. Then, the following are equivalent:
(1) $\mathcal{F}$ is learnable.
(2) The UMLLN condition (4.1) holds for $\mathcal{F}$.
(3) $\mathrm{Ldim}(\mathcal{F}) < \infty$.
(4) $\mathfrak{R}^{\mathrm{seq}}_n(\mathcal{F}) \to 0$ as $n \to \infty$.
As in the iid setting, a similar result holds for online regression with absolute loss, with the Littlestone dimension condition (i.e., condition number 3 above) replaced by a similar one involving its scale-sensitive counterpart, called the sequential fat shattering dimension (see Section 7.2 for details).
It is well known that online learnability is harder than iid learnability. That is, $V^{\mathrm{iid}}_n \le V^{\mathrm{online}}_n$ for any $n$, and the gap in this inequality can be arbitrarily large. For example, the class of threshold functions on $[0, 1]$:
$$\mathcal{F}_{\mathrm{thresh}} = \{x \mapsto 2\,\mathbb{1}[x \ge c] - 1 : c \in [0, 1]\}$$
has $\mathrm{VCdim}(\mathcal{F}_{\mathrm{thresh}}) = 1$ but $\mathrm{Ldim}(\mathcal{F}_{\mathrm{thresh}}) = \infty$.
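This gap is visible computationally: thresholds cannot shatter even two fixed points, yet they shatter a bisection tree of any depth, because the tree can adapt each point to the signs seen so far. The sketch below (the grid of thresholds, the small depth, and all names are illustrative choices to keep the brute-force check feasible) verifies both claims.

```python
from itertools import product

def bisection_node(prefix):
    """Point placed at a given sign prefix: binary search over (0, 1)."""
    lo, hi = 0.0, 1.0
    for s in prefix:
        mid = (lo + hi) / 2
        hi, lo = (mid, lo) if s == 1 else (hi, mid)
    return (lo + hi) / 2

def make_threshold(c):
    return lambda x: 1 if x >= c else -1

def bisection_tree_shattered(depth, grid):
    """Brute-force check that the bisection tree of the given depth is
    shattered by a fine enough grid of thresholds."""
    fs = [make_threshold(k / grid) for k in range(grid + 1)]
    for path in product((-1, 1), repeat=depth):
        if not any(all(f(bisection_node(path[:t])) == path[t]
                       for t in range(depth)) for f in fs):
            return False
    return True
```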
A conspicuous difference between Theorem 1 and Theorem 2 is the absence of the condition involving ERM. Indeed, ERM is not necessarily a good learning rule in the online setting: there exist classes learnable in the online setting that are not learnable via ERM. Unfortunately, the learning rules that learn a class in the online setting are quite complex (Ben-David et al., 2009). It is not known if there exists a rule as simple as ERM that will learn a class whenever is online learnable. In any case, ERM does not play as central a role in online learning as it does in learning in the iid setting.
5. Learnability Under General Stochastic Processes
In this section we move beyond the iid setting to cover all distributions, not just product distributions. For a general stochastic process $Z_1, Z_2, \ldots$, we still have an analogue of $P$ at time $t$, namely
$$P_t = P(Z_t \in \cdot \mid Z_{1:t-1}).$$
This is the conditional distribution of $Z_t$ given $Z_{1:t-1}$. Just like $P$ in the iid case, this is unknown to the learning rule. However, unlike in the iid case, $P_t$ is data-dependent. Therefore the $P_t$-regret of a function is data-dependent. We will often hide the dependence of $P_t$ on past data $Z_{1:t-1}$. We can use the average of the $P_t$-regrets,
$$\frac{1}{n}\sum_{t=1}^n r_{P_t}(f),$$
as a performance measure. Note that the minimizer of this performance measure is data-dependent, unlike in the iid case. As in the iid setting a learning rule $\hat{f}_n$ is a map from $\cup_{n \ge 1} \mathcal{Z}^n$ to $\mathcal{F}$. To reduce clutter in our notation, we will continue to hide the dependence of $\hat{f}_n$ on the realized sample $Z_{1:n}$. The value of a learning rule is now defined as
$$V^{\mathrm{proc}}_n(\hat{f}_n) = \sup_P \mathbb{E}\left[\frac{1}{n}\sum_{t=1}^n r_{P_t}(\hat{f}_n)\right],$$
where the supremum is now taken over all joint distributions $P$ over $Z_{1:n}$. This leads to consideration of the following minimax value to define learnability:
$$V^{\mathrm{proc}}_n = \inf_{\hat{f}_n} V^{\mathrm{proc}}_n(\hat{f}_n).$$
Definition 3. We say that $\mathcal{F}$ is process learnable if
$$\lim_{n \to \infty} V^{\mathrm{proc}}_n = 0.$$
Furthermore, we say that $\mathcal{F}$ is process learnable via a sequence of learning rules $(\hat{f}_n)_{n \ge 1}$ if
$$\lim_{n \to \infty} V^{\mathrm{proc}}_n(\hat{f}_n) = 0.$$
Note that in the iid case, when the joint distribution is a product distribution with marginal $P$, we have $P_t = P$ for all $t$ and therefore, for any $f$,
$$\frac{1}{n}\sum_{t=1}^n r_{P_t}(f) = r_P(f).$$
We have the following result as an immediate consequence.
Lemma 3. Fix any loss function $\ell$ and function class $\mathcal{F}$. For any learning rule $\hat{f}_n$, $V^{\mathrm{iid}}_n(\hat{f}_n) \le V^{\mathrm{proc}}_n(\hat{f}_n)$. This also means that $V^{\mathrm{iid}}_n \le V^{\mathrm{proc}}_n$.
The result above is not surprising: process learnability has to be harder than iid learnability. However, somewhat surprisingly, we can show that process learnability is at least as hard as online learnability.
Theorem 4. Consider binary classification with 0-1 loss in the general stochastic process setting. Suppose the class $\mathcal{F}$ is not online learnable, i.e., $V^{\mathrm{online}}_n \not\to 0$ as $n \to \infty$. Then $V^{\mathrm{proc}}_n \not\to 0$ either. Therefore, the class $\mathcal{F}$ is not process learnable.
To complement the lower bound above, we will now give a performance guarantee for ERM in the general stochastic process setting. Given a loss $\ell$ and function class $\mathcal{F}$, define the loss class as
$$\ell \circ \mathcal{F} = \{(x, y) \mapsto \ell((x, y), f) : f \in \mathcal{F}\}.$$
We define the sequential Rademacher complexity of a loss class as
$$\mathfrak{R}^{\mathrm{seq}}_n(\ell \circ \mathcal{F}) = \sup_{\mathbf{z}} \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^n \epsilon_t\,\ell(\mathbf{z}_t(\epsilon_{1:t-1}), f)\right].$$
Note that the supremum here is over $(\mathcal{X} \times \mathcal{Y})$-valued trees $\mathbf{z}$, that is, trees labeled with input-output pairs. It is easy to connect the complexity of the loss class to the complexity of the underlying function class for a simple loss function like the 0-1 loss (see Appendix A for details).
Theorem 5. Fix any loss function $\ell$ and function class $\mathcal{F}$. Let $\hat{f}_n$ denote the ERM learning rule defined in (2.1). Then we have
$$V^{\mathrm{proc}}_n \le V^{\mathrm{proc}}_n(\hat{f}_n) \le 2\,\mathfrak{R}^{\mathrm{seq}}_n(\ell \circ \mathcal{F}).$$
Proof. The first inequality is true by definition of $V^{\mathrm{proc}}_n$ as an infimum over all learning rules. So we just have to prove the second one.
Note, by definition of ,
Therefore, we have
The justification for the last inequality is as follows. First, we know that . Second, when is achieved, at say, we have,
Taking expectations on both sides of (5.1) gives us
Note that the last inequality follows from Theorem 2 of Rakhlin et al. (2015b). Since the last quantity above does not depend on $P$, we can take the supremum over $P$ on both sides to finish the proof.
We now have everything in place to be able to show the equivalence of process learnability and online learnability. A similar result can also be shown in the regression case (see Section 7.3).
Theorem 6. Consider binary classification with 0-1 loss. Then all of the equivalent conditions in Theorem 2 are also equivalent to:
• $\mathcal{F}$ is process learnable.
Proof. Theorem 4 established that learnability in the general stochastic process setting implies online learnability. For the other direction, note that according to Theorem 5 we have
where the second inequality follows from Theorem 16 in Appendix A. Taking limits on both sides as $n$ tends to infinity shows that online learnability implies process learnability.
Although online learnability turns out to be equivalent to process learnability, there is an important difference between the two settings that has to do with the importance of ERM. In the former, ERM is not a good learning rule, whereas in the latter a learnable class is learnable via ERM. Therefore ERM continues to play a special role in the general stochastic process setting just as it does in the iid setting.
Also note that Theorem 6 is stated in terms of learnability, which is an asymptotic concept. However, the proof clearly shows that the rate of convergence is determined by the sequential Rademacher complexity of , which scales as (Alon et al., 2021).
We end this section with some examples showing that our definition of process learnability is natural, interesting, and worth studying.
IID Sampling. Let us note once again that if the joint distribution is a product measure with marginal $P$ then $P_t$ is just $P$ and therefore not random. In this special but important case, our definition of learnability reduces to the standard definition of learnability under iid sampling.
Asymptotically Stationary Process. Suppose that the joint distribution is not a product measure but the process is asymptotically stationary in the sense that the random probability measure $P_t$ converges to some fixed deterministic $P_\infty$ in total variation as $t \to \infty$. For a class $\mathcal{F}$ that is learnable in the general stochastic process setting and for a loss function bounded by $B$, we have
By the asymptotic stationarity assumption, the first term on the right in the last inequality goes to zero. Moreover, the rate of convergence can often be characterized in terms of the mixing coefficients of the stochastic process (Vidyasagar, 2002). By learnability of $\mathcal{F}$ via ERM in the general stochastic process setting, the last term goes to zero. Note that $\ell(P, f)$ is linear in $P$, so differences of expected losses are controlled by the total variation distance between $P_t$ and $P_\infty$. So, under asymptotic stationarity, our learnability condition implies that ERM does well when its performance is measured under the (asymptotic) stationary distribution $P_\infty$.
Mixture of IID. Consider a simple mixture of product distributions
$$P = \alpha\,P_1^{\otimes \mathbb{N}} + (1 - \alpha)\,P_2^{\otimes \mathbb{N}}, \qquad \alpha \in (0, 1),$$
where, for simplicity, assume that $P_1$ and $P_2$ have disjoint supports. Then with probability $\alpha$ we have $P_t = P_1$ for all $t \ge 2$, and with probability $1 - \alpha$ we have $P_t = P_2$ for all $t \ge 2$. Therefore, the minimizer of
$$\frac{1}{n}\sum_{t=1}^n \ell(P_t, f) \qquad (5.2)$$
is $f^\star_{P_1}$ with probability $\alpha$ and $f^\star_{P_2}$ with probability $1 - \alpha$ (assuming, again for simplicity, that the minimizers are unique). Here, unlike the iid and stationary examples, the ‘best’ function, even with infinite data, is not deterministic but is random depending on which mixture component was selected. Still, if learnability in our general sense holds, then ERM will do well according to the performance measure (5.2). Note that this example can be easily generalized to a mixture of more than two iid processes. It can also be generalized, with additional technical conditions, to the general case where the supports of $P_1$ and $P_2$ overlap. The main difference from the disjoint support case that we consider here will be that with probability $\alpha$, $P_t$ will converge to $P_1$ (in a suitable sense) and to $P_2$ otherwise. Similarly, the minimizer of (5.2) would not equal $f^\star_{P_1}$ or $f^\star_{P_2}$ but it would converge (again, in some appropriate sense determined by technical conditions) to one of them with probability $\alpha$ and $1 - \alpha$, respectively.
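A short simulation makes the data-dependent limit concrete. In this illustrative sketch (the two components, the constant predictors, and all names are our own), each path first selects a component and then draws iid pairs from it; which function ERM picks is determined by the realized component, matching the data-dependent minimizer of (5.2).

```python
import random

f_plus = lambda x: 1     # best under component 1
f_minus = lambda x: -1   # best under component 2

def sample_mixture_path(n, alpha=0.5, rng=None):
    """One path from the mixture: pick a component once, then draw iid pairs.
    Component 1: x in [0,1), y = +1.  Component 2: x in [1,2), y = -1.
    The supports are disjoint, as in the example above."""
    rng = rng or random.Random(0)
    if rng.random() < alpha:
        return 1, [(rng.random(), 1) for _ in range(n)]
    return 2, [(1 + rng.random(), -1) for _ in range(n)]

def erm_01(sample):
    """ERM over the two constant predictors under the 0-1 loss."""
    risk = lambda f: sum(f(x) != y for x, y in sample) / len(sample)
    return min([f_plus, f_minus], key=risk)
```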
Random Level. Fix the squared loss and consider a class $\mathcal{F}$ that is iid learnable and closed under translations by a constant, that is, if $f \in \mathcal{F}$ then $f + c \in \mathcal{F}$ for any constant $c$. Let the $X_t$ be iid draws from some distribution on $\mathcal{X}$, and let a random level $A$ be drawn, independently of everything else, from some distribution that has a density with respect to Lebesgue measure on $\mathbb{R}$. Let $Y_t = f_0(X_t) + A + \epsilon_t$ for some $f_0 \in \mathcal{F}$, where the $\epsilon_t$ are iid standard normal. Note that the process $(Z_t)$ is not iid. It is not even mixing in any sense due to long-range dependence in $(Y_t)$ caused by the shared random level $A$. Now ERM over $\mathcal{F}$ is given by:
where the last equality holds because and we have assumed that all empirical minimizers are unique with probability . Thus, we have shown that
Since $\mathcal{F}$ is iid learnable, this ERM converges (in a suitable sense) to the function $f_0$, which means the ERM on the observed data converges to the random function $f_0 + A$.
Next we compute , as follows. Let be the conditional distribution of given and . Then we have
(with regarded as fixed). Then , where now (only) is regarded as random. It is easy to show that the distribution of , given , is normal with variance and mean
Now, with , consider . We have shown in mean square. Also,
in mean square. So , the smallest possible value. That is, asymptotically the minimizer of over is (which converges, not to , but to the random function ).
6. A Prequential Definition of Learnability
The previous section generalized the statistical setting to include nonproduct distributions and extended the definition of learnability to a more general setting. In this section we will generalize the online learnability definition to obtain a prequential version of learnability, while still keeping the level of generality of the previous section. As in the online setting, consider a sequence of learning rules $\hat{f}_0, \hat{f}_1, \ldots$, where $\hat{f}_{t-1}$ is a function only of $Z_{1:t-1}$, that is, it cannot peek ahead to access $Z_{t:n}$. Unlike the online learning setting, $Z_{1:n}$ is a random sequence drawn from some general distribution $P$ over $\mathcal{Z}^n$. Now, define the minimax value
$$V^{\mathrm{preq}}_n = \inf_{(\hat{f}_t)} \sup_P \mathbb{E}\left[\frac{1}{n}\left(\sum_{t=1}^n \ell(P_t, \hat{f}_{t-1}) - \inf_{f \in \mathcal{F}} \sum_{t=1}^n \ell(P_t, f)\right)\right].$$
Note that the expectation above is with respect to both $Z_{1:n}$ and any internal randomness used by the rules $\hat{f}_t$. As before, the definition of the minimax value leads to the definition of learnability.
Definition 4. We say that $\mathcal{F}$ is prequentially learnable if
$$\lim_{n \to \infty} V^{\mathrm{preq}}_n = 0.$$
The definition of $V^{\mathrm{preq}}_n$ can be obtained from the definition of $V^{\mathrm{proc}}_n$ by replacing $\hat{f}_n$, which depends on the entire sequence $Z_{1:n}$, by $\hat{f}_{t-1}$, which depends only on $Z_{1:t-1}$, in the loss term that involves $P_t$. It can also be thought of as a generalization of $V^{\mathrm{online}}_n$ because of the following. When the distribution $P$ degenerates to a point mass at a specific sequence $z_{1:n}$ then $P_t$ becomes a point mass at $z_t$ and the difference of cumulative losses above reduces to the individual sequence regret of $(\hat{f}_t)$ on $z_{1:n}$. This observation immediately gives us the following result.
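The prequential constraint, learn on $z_{1:t-1}$ and pay on $z_t$, can be written as a short evaluation loop. The follow-the-leader learner below (ERM on the data seen so far) is an illustrative rule of our own choosing, not the article's construction; as the second test shows, it can suffer constant regret on an adversarial alternating sequence, which is why online learning requires more careful rules.

```python
def prequential_regret(sequence, function_class, loss, learner):
    """Cumulative prequentially evaluated loss of the learner minus the
    cumulative loss of the best fixed function, normalized by n."""
    cum = 0.0
    for t, z in enumerate(sequence):
        f_t = learner(sequence[:t])   # fitted on the strict past only
        cum += loss(z, f_t)
    best = min(sum(loss(z, f) for z in sequence) for f in function_class)
    return (cum - best) / len(sequence)

def follow_the_leader(function_class, loss):
    """ERM on the data seen so far (an illustrative, not optimal, rule)."""
    def learner(prefix):
        if not prefix:
            return function_class[0]
        return min(function_class,
                   key=lambda f: sum(loss(z, f) for z in prefix))
    return learner
```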
Lemma 7. Fix any loss function $\ell$ and function class $\mathcal{F}$. Then we have $V^{\mathrm{online}}_n \le V^{\mathrm{preq}}_n$.
The lemma above says that prequential learnability is at least as hard as online learnability. Our next lemma provides a converse result.
Lemma 8. Fix any loss function $\ell$ and function class $\mathcal{F}$. Then for any sequence of learning rules $(\hat{f}_t)$ we have
This also means that
Proof. Let $P$ be an arbitrary distribution. We have the following three-term decomposition: