Machine learning (ML) is about computational methods that enable machines to learn concepts from experience. In handling a wide variety of experience ranging from data instances, knowledge, and constraints to rewards, adversaries, and lifelong interaction in an ever-growing spectrum of tasks, contemporary ML/AI (artificial intelligence) research has resulted in a multitude of learning paradigms and methodologies. Despite the continual progresses on all different fronts, the disparate, narrowly focused methods also make standardized, composable, and reusable development of ML approaches difficult, and preclude the opportunity to build AI agents that panoramically learn from all types of experience. This article presents a standardized ML formalism, in particular a 'standard equation' of the learning objective, that offers a unifying understanding of many important ML algorithms in the supervised, unsupervised, knowledge-constrained, reinforcement, adversarial, and online learning paradigms, respectively—those diverse algorithms are encompassed as special cases due to different choices of modeling components. The framework also provides guidance for mechanical design of new ML approaches and serves as a promising vehicle toward panoramic machine learning with all experience.
Keywords: standard model, panoramic learning, experience, composable machine learning, unification
Media Summary
Humans learn from a range of experience, what about a computer? The past decades of AI and Machine Learning (ML) research has resulted in a multitude of paradigms and algorithms, each specialized to train ML models with a certain type of information and experience in a certain type of problem. While pushing the field forward rapidly, the bewildering and ever-growing variety of paradigms and algorithms also makes it extremely difficult to master existing ML techniques and to develop universal, repeatable, and reusable computer programs that can simultaneously learn from diverse experience in the real world. It is a constant desire and aspiration to search for a standardized ML formalism that unifies the distinct learning principles, much like the Standard Model in physics, to gain a more holistic understanding of the diverse paradigms and algorithms, lay out a blueprint permitting fuller and more systematic exploration in the design and analysis of new algorithms, and eventually serves as a vehicle toward panoramic machine learning capable of integrating all available information (data, knowledge, constraints, reward, adversary, etc.) in learning and thus applicable to all problems. This work presents an attempt toward this end. In particular, we establish a standard equation of the learning objective, which subsumes many of the known algorithms as special cases and offers guiding principles of designing new more powerful learning algorithms in a mechanical and composable way.
1. Introduction
Human learning has the hallmark of learning concepts from diverse sources of information. Take the example of learning a language. Humans can benefit from various experience—by observing examples through reading and hearing, studying abstract definitions and grammar, making mistakes and getting correction from teachers, interacting with others and observing implicit feedback, and so on. Knowledge of a prior language can also accelerate the acquisition of a new one. How can we build artificial intelligence (AI) agents that are similarly capable of learning from all types of experience? We refer to the capability of flexibly integrating all available experience in learning as panoramic learning.
In handling different experience ranging from data instances, knowledge, constraints, to rewards, adversaries, and lifelong interplay in an ever-growing spectrum of tasks, contemporary ML and AI research has resulted in a large multitude of learning paradigms (e.g., supervised, unsupervised, active, reinforcement, adversarial learning), models, optimization techniques, not mentioning countless approximation heuristics and tuning tricks, plus combinations of all the above. While pushing the field forward rapidly, these results also make mastering existing ML techniques very difficult, and fall short of reusable, repeatable, and composable development of ML approaches to diverse problems with distinct available experience.
Those fundamental challenges call for a standardized ML formalism that offers a principled framework for understanding, unifying, and generalizing current major paradigms of learning algorithms, and for mechanical design of new approaches for integrating any useful experience in learning. The power of standardized theory is perhaps best demonstrated in physics, which has a long history of pursuing symmetry and simplicity of its principles: exemplified by the famed Maxwell’s equations in the 1800s that reduced various principles of electricity and magnetism into a single electromagnetic theory, followed by General Relativity in the 1910s and the Standard Model in the 1970s, physicists describe the world best by unifying and reducing different theories to a standardized one. Likewise, it is a constant quest in the field of ML to establish a ‘Standard Model’ (Domingos, 2015; Langley, 1989), that gives a holistic view of the broad learning principles, lays out a blueprint permitting fuller and more systematic exploration in the design and analysis of new algorithms, and eventually serves as a vehicle toward panoramic learning that integrates all available sources of experience.
This paper presents an attempt toward this end. In particular, our principal focus is on the learning objective that drives the model training given the experience and thus often lies at the core for designing new algorithms, understanding learning properties, and validating outcomes. We investigate the underlying connections between a range of seemingly distinct ML paradigms. Each of these paradigms has made particular assumptions on the form of experience available. For example, the present most popular supervised learning relies on collections of data instances, often applying a maximum likelihood objective solved with simple gradient descent. Maximum likelihood based unsupervised learning instead can invoke different solvers, such as expectation-maximization (EM), variational inference, and wake-sleep in training for varied degree of approximation to the problem. Active learning(Settles, 2012) manages data instances which, instead of being given all at once, are adaptively selected. Reinforcement learning(Sutton & Barto, 2018) makes use of feedback obtained via interaction with the environment. Knowledge-constrained learning like posterior regularization (Ganchev et al., 2010; Zhu et al., 2014) incorporates structures, knowledge, and rules expressed as constraints. Generative adversarial learning(Goodfellow et al., 2014) leverages a companion model called discriminator to guide training of the model of interest.
In light of these results, we present a standard equation (SE) of the objective function. The SE formulates a rather broad design space of learning algorithms. We show that many of the well-known algorithms of the above paradigms are all instantiations of the general formulation. More concretely, the SE, based on the maximum entropy and variational principles, consists of three principled terms, including the experience term that offers a unified language to express arbitrary relevant information to supervise the learning, the divergence term that measures the fitness of the target model to be learned, and the uncertainty term that regularizes the complexity of the system. The single succinct formula re-derives the objective functions of a large diversity of learning algorithms, reducing them to different choices of the components. The formulation thus sheds new light on the fundamental relationships between the diverse algorithms that were each originally designed to deal with a specific type of experience.
The modularity and generality of the framework is particularly appealing not only from the theoretical point of view, but also because it offers guiding principles for designing algorithmic approaches to new problems in a mechanical way. Specifically, the SE by its nature allows combining together all different experience to learn a model of interest. Designing a problem solution boils down to choosing what experience to use depending on the problem structure and available resources, without worrying too much about how to use the experience in the training. Besides, the standardized ML perspective also highlights that many learning problems in different research areas are essentially the same and just correspond to different specifications of the SE components. This enables us to systematically repurpose successful techniques in one area to solve problems in another.
The remainder of the article is organized as follows. Section 2 gives an overview of relevant learning and inference techniques as a prelude of the standardized framework. Section 3 presents the standard equation as a general formulation of the objective function in learning algorithms. The subsequent two sections discuss different choices of two of the key components in the standard equation, respectively, illustrating that many existing methods are special cases of the formulation: Section 4 is devoted to discussion of the experience function and Section 5 focuses on the divergence function. Section 6 discusses an extended view of the standard equation in dynamic environments. Section 7 focuses on the optimization algorithms for solving the standard equation objective. Section 8 discusses the diverse types of target models. Section 9 discusses the utility of the standardized formalism for mechanical design of panoramic learning approaches. Section 10 reviews related work. Section 11 concludes the article with discussion of future directions—in particular, we discuss the broader aspects of ML not covered in the present work (e.g., more advanced learning such as continual learning in complex evolving environments, theoretical analysis of learnability, generalization and complexity, and automated algorithm composition) and how their unified characterization based on or inspired by the current framework could potentially lead toward a full ‘Standard Model’ of ML and a turnkey approach to panoramic learning with all types of experience.
2. Preliminaries: The Maximum Entropy View of Learning and Inference
Depending on the nature of the task (e.g., classification or regression), data (e.g., labeled or unlabeled), information scope (e.g., with or without latent variables), and form of domain knowledge (e.g., prior distributions or parameter constraints), and so on, different learning paradigms with often complementary (but not necessarily easy to combine) advantages have been developed for different needs. For example, the paradigms built on the maximum likelihood principles, Bayesian theories, variational calculus, and Monte Carlo simulation have led to much of the foundation underlying a wide spectrum of probabilistic graphical models, exact/approximate inference algorithms, and even probabilistic logic programs suitable for probabilistic inference and parameter estimations in multivariate, structured, and fully or partially observed domains, while the paradigms built on convex optimization, duality theory, regularization, and risk minimization have led to much of the foundation underlying algorithms such as support vector machine (SVM), boosting, sparse learning, structure learning, and so on. Historically, there have been numerous efforts in establishing a unified machine learning framework that can bridge these complementary paradigms so that advantages in model design, solver efficiency, side-information incorporation, and theoretical guarantees can be translated across paradigms. As a prelude of our presentation of the ‘standard equation’ framework toward this goal, here we begin with a recapitulation of the maximum entropy view of statistical learning. By naturally marrying the probabilistic frameworks with the optimization-theoretic frameworks, the maximum entropy viewpoint had played an important historical role in offering the same lens to understanding several popular methodologies such as maximum likelihood learning, Bayesian inference, and large margin learning.
2.1. Maximum Likelihood Estimation (MLE)
We start with the maximum entropy perspective of the maximum likelihood learning.
2.1.1. Supervised MLE
We consider an arbitrary probabilistic model (e.g., a neural network or probabilistic graphical model for, say, language generation) with parameters $\bm{\theta}\in\bm{\Theta}$ to be learned. Let $p_\theta(\bm{x})\in\mathcal{P}(\mathcal{X})$ denote the distribution defined by the model, where $\mathcal{X}$ is the data space (e.g., all language text) and $\mathcal{P}(\mathcal{X})$ denotes the set of all probability distributions on $\mathcal{X}$. Given a set of independent and identically distributed (i.i.d.) data examples $\mathcal{D}=\{ \bm{x}^* \in \mathcal{X}\}$, the most common method for estimating the parameters $\bm{\theta}$ is perhaps maximum likelihood estimation (MLE). MLE learns the model by minimizing the negative log-likelihood:
MLE is known to be intimately related to the maximum entropy principle (Jaynes, 1957). In particular, when the model $p_\theta(\bm{x})$ is in the exponential family of the form:
where $T(\bm{x})$ is the sufficient statistics of data $\bm{x}$ and $Z(\bm{\theta})=\sum_{\bm{x}\in\mathcal{X}} \exp\{\bm{\theta}\cdot T(\bm{x})\}$ is the normalization factor, it is shown that MLE is the convex dual of maximum entropy estimation.
In a maximum entropy formulation, rather than assuming a specific parametric from of the target model distribution, denoted as $p(\bm{x})$, we instead impose constraints on the model distribution. Specifically, in the supervised setting, the constraints require the expectation of the features $T(\bm{x})$ to be equal to the empirical expectation:
In general, there exist many distributions $p\in\mathcal{P}(\mathcal{X})$ that satisfy the constraint. The principle of maximum entropy resolves the ambiguity by choosing the distribution such that its Shannon entropy, $\text{H}(p):=-\mathbb{E}_p[\log p(\bm{x})]$, is maximized. Following this principle, in the supervised setting, we thus have the specific constrained optimization problem:
where $\bm{\theta}$ and $\mu$ are Lagrangian multipliers. Setting the derivative w.r.t. $p$ and $\mu$ to equal zero implies that $p$ must have the same form as in Equation 2.2:
where we see the parameters $\bm{\theta}$ in the exponential family parameterization are the Lagrangian multipliers that enforce the constraints. Plugging the solution back into the Lagrangian, we obtain:
which is simply the negative of the MLE objective in Equation 2.1.
Thus maximum entropy is dual to maximum likelihood. It provides an alternative view of the problem of fitting a model into data, where the data instances in the training set are treated as constraints, and the learning problem is treated as a constrained optimization problem. This optimization-theoretic view of learning will be revisited repeatedly in the sequel to allow extending machine learning under all experience of which data instances is just a special case.
2.1.2. Unsupervised MLE
Similar to the MLE framework for supervised learning, unsupervised learning via MLE can also be reformulated as a constraint optimization problem with entropy maximization. Consider learning a multivariate model with latent variables, where each data instance is partitioned into observed variables $\bm{x}\in\mathcal{X}$ and latent variables $\bm{y}\in\mathcal{Y}$. For example, in the problem of image clustering, $\bm{x}\in\mathbb{R}^{d}$ is the observed image of $d$ pixels and $\bm{y}\in\{1,\dots,K\}$ is the unobserved cluster indicator (where $K$ is the number of clusters). The goal is to learn a model $p_\theta(\bm{x},\bm{y})$ that captures the joint distribution of $\bm{x}$ and $\bm{y}$. Since $\bm{y}$ is unobserved, we minimize the negative log-likelihood with $\bm{y}$ marginalized out:
Direct optimization of the marginal log-likelihood is typically intractable due to the summation over $\bm{y}$. Earlier work thus developed different solvers with varying levels of approximations.
It can be shown that the intractable negative log-likelihood above can be upper bounded by a more tractable term known as the variational free energy(Neal & Hinton, 1998). Let $q(\bm{y}|\bm{x})$ represent an arbitrary auxiliary distribution acting as a surrogate of the true posterior $p(\bm{y}|\bm{x})$, which is known as a variational distribution. Then, for each instance $\bm{x}^*\in\mathcal{D}$, we have:
where the inequality holds because KL divergence is always nonnegative. The free energy upper bound contains two terms: the first one is the entropy of the variational distribution, which captures the intrinsic randomness (i.e., amount of information carried by an auxiliary distribution); the second term, now written as $-\mathbb{E}_{q(\bm{y}|\bm{x}^*)\tilde{p}_d(\bm{x}^*)}\left[ \log p_\theta(\bm{x}^*, \bm{y}) \right]$, by taking into account the empirical distribution $\tilde{p}_d$ from which the instance $\bm{x}^*$ is drawn, is the cross entropy between the distributions $q(\bm{y}|\bm{x}^*)\tilde{p}_d(\bm{x}^*)$ and $p_\theta(\bm{x}^*, \bm{y})$, driving the two to be close and thereby allowing $q$ to approximate $p$.
The popular expectation maximization (EM) algorithm for unsupervised learning via MLE can be interpreted as minimizing the variational free energy (Neal & Hinton, 1998). In fact, as we discuss subsequently, popular heuristics such as the variational EM and the wake-sleep algorithms, are approximations to the EM algorithm by introducing approximating realizations to either the free energy objective function $\mathcal{L}$ or to the solution space of the variational distribution $q$.
Expectation Maximization (EM). The most common approach to learning with unlabeled data or partially observed multivariate models is perhaps the EM algorithm (Dempster et al., 1977). With the use of the variational free energy as a surrogate objective to the original marginal likelihood as in Equation 2.9, EM can be also understood as an alternating minimization algorithm, where $\mathcal{L}(q,\bm{\theta})$ is minimized with regard to $q$ and $\bm{\theta}$ in two stages, respectively. At each iteration $n$, the expectation (E) step maximizes $\mathcal{L}(q,\bm{\theta}^{(n)})$ w.r.t. $q$. From Equation 2.9, this is achieved by setting $q$ to the current true posterior:
so that the KL divergence vanishes and the upper bound is tight. In the subsequent maximization (M) step, $\mathcal{L}(q^{(n+1)},\bm{\theta})$ is minimized w.r.t. $\bm{\theta}$:
which is to maximize the expected complete data log-likelihood. The EM algorithm has an appealing property that it monotonically decreases the negative marginal log-likelihood over iterations. To see this, notice that after the E-step the upper bound $\mathcal{L}(q^{(n+1)},\bm{\theta}^{(n)})$ is equal to the negative marginal log-likelihood, and the M-step further decreases the upper bound (and thus the negative marginal log-likelihood).
Variational EM. When the model $p_\theta(\bm{x}, \bm{y})$ is complex (e.g., a neural network or a multilayer graphical model), directly working with the true posterior in the E-step becomes intractable. Variational EM overcomes the difficulty with approximations. It considers a restricted family $\mathcal{Q}'$ of the variational distribution $q(\bm{y})$ such that optimization w.r.t. $q$ within the family is tractable:
A common way to restrict the $q$ family is the mean-field methods, which partition the components of $\bm{y}$ into sub-groups $\bm{y}=(\bm{y}_1, \dots, \bm{y}_M)$ and assume that $q$ factorizes w.r.t. the groups: $q(\bm{y})=\prod_{i=1}^{M}q_i(\bm{y}_i)$. The variational principle summarized in (Wainwright & Jordan, 2008) gives a more principled interpretation of the mean-field and other approximation methods. In particular, in the case where $p_\theta(\bm{x},\bm{y})$ is an exponential family distribution with sufficient statistics $T(\bm{x},\bm{y})$, the exact E-step (Equation 2.10) can be interpreted as seeking the optimal valid mean parameters (i.e., expected sufficient statistics) for which the free energy is minimized. For discrete latent variables $\bm{y}$, the set of all valid mean parameters constitutes a marginal polytope $\mathcal{M}$. In this perspective, the mean-field methods (Equation 2.12) correspond to replacing $\mathcal{M}$ with an inner approximation $\mathcal{M}'\subseteq\mathcal{M}$. With the restricted set $\mathcal{M}'$ of mean parameters, the E-step generally no longer tightens the bound of the negative marginal log-likelihood, and the algorithm does not necessarily decrease the negative marginal log-likelihood monotonically. However, the algorithm preserves the property that it minimizes the upper bound of the negative marginal log-likelihood. Besides the mean-field methods, there are other approaches for approximation such as belief propagation. These methods correspond to using an outer approximation $\mathcal{M}''\supseteq\mathcal{M}$ of the marginal polytope, and do not guarantee upper bounds on the negative marginal log-likelihood.
Another approach to restrict the family of $q$ is to assume a parametric distribution $q_{\omega}(\bm{y}|\bm{x})$ and optimize the parameters $\bm{\omega}$ in the E-step. The approach has been used in black-box variational inference (Ranganath et al., 2014), and variational auto-encoders (VAEs) (Kingma & Welling, 2014) where $q$ is parameterized as a neural network (a.k.a ‘inference network,’ or ‘encoder’).
It is worth mentioning that the variational approach has also been used for approximate Gaussian processes (GPs, as a nonparametric methods; Titsias, 2009; Wilson et al., 2016b), where $\bm{y}$ is the inducing points and the variational distribution $q(\bm{y})$ is parameterized as a Gaussian distribution with a nondiagonal covariance matrix that preserves the structures within the true covariance (and hence is different from the above mean-field approximation, which assumes a diagonal variational covariance matrix). We refer interested readers to Wilson et al. (2016b) for more details.
Wake-Sleep. In some cases when the auxiliary $q$ is assumed to have a certain form (e.g., a deep network), the approximate E-step in Equation 2.12 may still be too complex to be tractable, or the gradient estimator (w.r.t. the parameters of $q$) can suffer from high variance (Mnih & Gregor, 2014; Paisley et al., 2012). To tackle the challenge, more approximations are introduced. The wake-sleep algorithm (Hinton et al., 1995) is one of such methods. In the E-step w.r.t. $q$, rather than minimizing $\text{KL}(q(\bm{y}) \| p_\theta(\bm{y}|\bm{x}^*))$ (Equation 2.9) as in EM and variational EM, the wake-sleep algorithm makes an approximation by minimizing the Kullback–Leibler (KL) divergence in opposite direction:
which can be optimized efficiently with gradient descent when $q$ is parameterized. Besides wake-sleep, one can also use other methods for low-variance gradient estimation in Equation 2.12, such as reparameterization gradient (Kingma & Welling, 2014) and score gradient (Glynn, 1990; Mnih & Gregor, 2014; Ranganath et al., 2014).
In sum, the entropy maximization perspective has formulated unsupervised MLE as an optimization-theoretic framework that permits simple alternating minimization solvers. Starting from the upper bound of negative marginal log-likelihood (Equation 2.9) with maximum entropy and minimum cross entropy, the originally intractable MLE problem gets simplified, and a series of optimization algorithms, ranging from (variational) EM to wake-sleep, arise naturally as an approximation to the original solution.
2.2. Bayesian Inference
Now we revisit another classical learning framework, Bayesian inference, and examine its intriguing connections with the maximum entropy principle. Interestingly, the the maximum entropy principle can also help to reformulate Bayesian inference as a constraint optimization problem, as for MLE.
Different from MLE, Bayesian approach for statistical inference treats the hypotheses (parameters $\bm{\theta}$) to be inferred as random variables. Assuming a prior distribution $\pi(\bm{\theta})$ over the parameters, and considering a probabilistic model that defines a conditional distribution $p(\bm{x}|\bm{\theta})$, the inference is based on Bayes’ theorem:
where $p(\bm{\theta}|\mathcal{D})$ is the posterior distribution after observing the data $\mathcal{D}$ (which we assume are i.i.d.); and $p(\mathcal{D}) = \int_{\bm{\theta}}\pi(\bm{\theta})\prod_{\bm{x}^*} p(\bm{x}^*|\bm{\theta}) d\bm{\theta}$ is the marginal likelihood.
Interestingly, the early work by Zellner (1988) showed the relations between Bayesian inference and maximum entropy, by reformulating the statistical inference problem from the perspective of information processing, and rediscovering Bayes’ theorem as the optimal information processing rule. More specifically, statistical inference can be seen as a procedure of information processing, where the system receives input information in the form of prior knowledge and data, and emits output information in the form of parameter estimates and others. An efficient inference procedure should generate an output distribution such that the system retains all input information and not inject any extraneous information. The learning objective is thus to minimize the difference between the input and output information w.r.t. the output distribution:
where the first two terms measure the output information in the output distribution $q(\bm{\theta})$ and marginal $p(\mathcal{D})$, and the third term measures the input information in the prior $\pi(\bm{\theta})$ and data likelihood $p(\bm{x}^*|\bm{\theta})$. Here $\mathcal{P}(\bm{\Theta})$ is the space of all probability distributions over $\bm{\theta}$.
The optimal solution of $q(\bm{\theta})$ is precisely the the posterior distribution $p(\bm{\theta}|\mathcal{D})$ due to Bayes’ theorem (Equation 2.14). The proof is straightforward by noticing that the objective can be rewritten as $\min_q \text{KL}(q(\bm{\theta}) \| p(\bm{\theta}|\mathcal{D}))$.
Similar to the case of duality between MLE and maximum entropy (Equation 2.4), the same entropy maximization principle can cast Bayesian inference as a constrained optimization problem. As Jaynes (1988) commented, this fresh interpretation of Bayes’ theorem “could make the use of Bayesian methods more attractive and widespread, and stimulate new developments in the general theory of inference” (Jaynes, 1988, p. 280). The next subsection reviews how entropy maximization as a “useful tool in generating probability distributions” (Jaynes, 1988, p.280) has related to and resulted in more general learning and inference frameworks, such as posterior regularization.
2.3. Posterior Regularization
The optimization-based formulation of Bayesian inference in Equation 2.15 offers important additional flexibility in learning by allowing rich constraints on machine learning models to be imposed to regularize the outcome. For example, in Equation 2.15 we have seen the standard normality constraint of a probability distribution being imposed on the posterior $q$. It is natural to consider other types of constraints that encode richer problem structures and domain knowledge, which can regularize the model to learn desired behaviors.
The idea has led to posterior regularization (Ganchev et al., 2010) or regularized Bayes (Reg-Bayes; Zhu et al., 2014), which augments the Bayesian inference objective with additional constraints:
where we have rearranged the terms and dropped any constant factors in Equation 2.15, and added constraints with $\bm{\xi}$ being a vector of slack variables, $U(\bm{\xi})$ a penalty function (e.g., $\ell_1$ norm of $\bm{\xi}$), and $\mathcal{Q}(\bm{\xi})$ a subset of valid distributions over $\bm{\theta}$ that satisfy the constraints determined by $\bm{\xi}$. The optimization problem is generally easy to solve when the penalty/constraints are convex and defined w.r.t. a linear operator (e.g., expectation) of the posterior $q$. For example, let $T(\bm{x}^*;\bm{\theta})$ be a feature vector of data instance $\bm{x}^*\in\mathcal{D}$, the constraint posterior set $Q$ can be defined as:
which bounds the feature expectations with $\bm{\xi}$.
Max-margin constraint is another expectation constraint that has shown to be widely effective in classification and regression (Vapnik, 1998). The maximum entropy discrimination (MED) by (Jaakkola et al., 2000) regularizes linear regression models with the max-margin constraints, which is latter generalized to more complex models $p(\bm{x}|\bm{\theta})$, such as Markov networks (Taskar et al., 2004) and latent variable models (Zhu et al., 2014). Formally, let $\bm{y}^*\in\mathbb{R}$ be the observed label associated with $\bm{x}^*$. The margin-based constraint says that a classification/regression function $h(\bm{x};\bm{\theta})$ should make at most $\epsilon$ deviation from the true label $\bm{y}^*$. Specifically, consider the common choice of the function $h$ as a linear function: $h(\bm{x};\bm{\theta}) = \bm{\theta}^\top T(\bm{x})$, where $T(\bm{x})$ is, with a slight abuse of notation, the feature of instance $\bm{x}$. The constraint is written as:
for all instances $(\bm{x}^*, \bm{y}^*)\in\mathcal{D}$.
Alternating optimization for posterior regularization. Having seen EM-style alternating minimization algorithms being applied as a general solver for a number of optimization-theoretic frameworks described above, it is not surprising that the posterior regularization framework can also be solved with an alternating minimization procedure. For example, consider the simple case of linear constraint in Equation 2.17, penalty function $U(\bm{\xi})=\|\bm{\xi}\|_1$, and $q$ factorizing across $\bm{\theta}=\{\bm{\theta}_c\}$. At each iteration $n$, the solution of $q(\bm{\theta}_c)$ is given as (Ganchev et al., 2010):
where $\bm{\theta}_{\backslash c}$ denotes all components of $\bm{\theta}$ except $\bm{\theta}_c$, and $Z$ is the normalization factor. Intuitively, a configuration of $\bm{\theta}_{c}$ with a higher expected constraint value $\mathbb{E}_{\backslash c} T(\bm{x}^*; \bm{\theta})$ will receive a higher probability under $q^{(n+1)}(\bm{\theta}_{c})$. The optimization procedure iterates over all components $c$ of $\bm{\theta}$.
2.4. Summary
In this section, we have seen that the maximum entropy formalism provides an alternative insight into the classical learning frameworks of MLE, Bayesian inference, and posterior regularization. It provides a general expression of these three paradigms as a constrained optimization problem, with a paradigm-specific loss on the model parameters $\bm{\theta}$ and an auxiliary distribution $q$, over a properly designed constraint space $\mathcal{Q}$ where $q$ must reside:
In particular, the use of the auxiliary distribution $q$ converts the originally highly complex problem of directly optimizing $\bm{\theta}$ against data, to an alternating optimization problem over $q$ and $\bm{\theta}$, which is algorithmically easier to solve since $q$ often acts as an easy-to-optimize proxy to the target model. The auxiliary $q$ can also be more flexibly updated to absorb influence from data or constraints, offering a teacher-student–style iterative mechanism to incrementally update $\bm{\theta}$ as we will see in the sequel.
By reformulating learning as a constrained optimization problem, the maximum entropy point of view also offers a great source of flexibility for applying many powerful tools for efficient approximation and enhanced learning, such as variational approximation (e.g., by relaxing $\mathcal{Q}$ to be easy-to-inference family of $q$ such as the mean field family, Jordan et al., 1999, and Xing et al., 2002), convex duality (e.g., facilitating dual sparsity of support vectors via the complementary slackness in the KKT conditions), and kernel methods as used in (Taskar et al., 2004; Zhu & Xing, 2009).
It is intriguing that, in the dual point of view on the problem of (supervised) MLE, data instances are encoded as constraints (Equation 2.4), much like the structured constraints in posterior regularization. In the following sections, we present the standardized formalism of machine learning algorithms and show that indeed a myriad types of experience besides data instances and constraints can all be encoded in the same generic form and be used in learning.
3. A Standard Model for Objective Function
Generalizing from Equation 2.16, we present the following general formulation for learning a target model via a constrained loss minimization program. We would refer to the formulation as the ‘Standard Equation’ because it presents a general space of learning objectives that encompasses many specific formalisms used in different machine learning paradigms.
Without loss of generality, let $\bm{t}\in\mathcal{T}$ be the variable of interest, for example, the input-output pair $\bm{t}=(\bm{x},\bm{y})$ in a prediction task, or the target variable $\bm{t}=\bm{x}$ in generative modeling. Let $p_\theta(\bm{t})$ be the target model with parameters $\bm{\theta}$ to be learned. Generally, the SE is agnostic to the specific forms of the target model, meaning that the target model can take an arbitrary form as desired by the problem at hand (e.g., classification, regression, generation, control) and can be of arbitrary types ranging from deep neural networks of arbitrary architectures, prompts for pretrained models, symbolic systems (e.g., knowledge graph), probabilistic graphical models of arbitrary dependence structures, and so on. We discuss more details of the different choices of the target model in Section 8.
Let $q(\bm{t})$ be an auxiliary distribution. The SE is written as:
The SE contains three major terms that constitute a learning formalism: the uncertainty function$\mathbb{H}\left( \cdot \right)$ that controls the compactness of the output model (e.g., as measured by the amount of allowed randomness while trying to fit experience); the divergence function$\mathbb{D}\left( \cdot, \cdot \right)$ that measures the distance between the target model to be trained and the auxiliary model that facilitates a teacher–student mechanism as shown below; and the experience function, which is introduced by a penalty term $U(\bm{\xi})$ that draws in the set of ‘experience functions’ $f_k^{(\theta)}$ that represent external experience of various kinds for training the target model. The hyperparameters $\alpha,\beta\geq 0$ enable trade-offs between these components.
Experience function. Perhaps the most powerful in terms of impacting the learning outcome and utility is the experience functions $f_k^{(\theta)}$. An experience function $f^{(\theta)}(\bm{t})\in \mathbb{R}$ measures the goodness of a configuration $\bm{t}$ in light of any given experience. The superscript ${(\theta)}$ highlights that the experience in some settings (e.g., reward experience as in Section 4.3) could depend on or be coupled with the target model parameters $\bm{\theta}$. In the following, we omit the superscript when there is no ambiguity. As discussed in Section 4, all diverse forms of experience that can be utilized for model training, such as data examples, constraints, logical rules, rewards, and adversarial discriminators, can be encoded as an experience function. The experience function hence provides a unified language to express all exogenous information about the target model, which we consider as an essential ingredient for panoramic learning to flexibly incorporate diverse experience in learning. Based on the uniform treatment of experience, a standardized optimization program as above can be formulated to identify the desired model. Specifically, the experience functions contribute to the optimization objective via the penalty term $U(\bm{\xi})$ over slack variables $\bm{\xi}\in\mathbb{R}^K$ applied to the expectation $\mathbb{E}_{q}\left[ f_k \right]$. The effect of maximizing the expectation is such that the auxiliary model $q$ is encouraged to produce samples of high quality in light of the experience (i.e., samples receiving high scores as evaluated by the experience function).
Divergence function. The divergence function $\mathbb{D}\left( q, p_\theta \right)$ measures the ‘quality’ of the target model $p_\theta$ in terms of its distance (divergence) with the auxiliary model $q$. Intuitively, we want to minimize the distance from $p_\theta$ to $q$, which is optimized to fit the experience as above. Section 5 gives a concrete example of how the divergence term would directly impact the model training: with a certain specification of the different components (e.g., the experience function, $\alpha/\beta$), the SE in Equation 3.1 would reduce to $\min_{\bm{\theta}} \mathbb{D}(p_{d}, p_\theta)$. That is, the learning objective is to minimize the divergence between the target model distribution $p_\theta$ and the data distribution $p_{d}$, and the divergence function $\mathbb{D}(\cdot, \cdot)$ determines the specific optimization problem. The divergence function can have a variety of choices, ranging from the family of $f$-divergence (e.g., KL divergence), or Bregman divergence, to optimal transport distance (e.g., Wasserstein distance), and so on. We discuss the divergence term in Section 5 in more detail.
Uncertainty function. The uncertainty function $\mathbb{H}(q)$ describes the uncertainty of the auxiliary distribution $q$ and thus controls the complexity of the learning system. It conforms with the maximum entropy principle discussed in Section 2 that one should pick the most uncertain solution among those that fit all experience. Like other components in SE, the uncertainty measure $\mathbb{H}(\cdot)$ can take different forms, such as the popular Shannon entropy, as well as other generalized ones such as Tsallis entropy. In this article, we assume Shannon entropy by default.
For the discussion in the following sections, it is often convenient to consider a special case of the SE in Equation 3.1. Specifically, we assume a common choice of the penalty $U(\bm{\xi})=\sum_k\xi_k$, and, with a slight abuse of notations, $f=\sum_k f_k$. In this case, the SE in Equation 3.1 can equivalently be written in an unconstrained form:
which can be easily seen by optimizing Equation 3.1 over $\bm{\xi}$. In the special unconstrained form, the interplay between the exogenous experience, divergence, and the endogenous uncertainty become more explicit.
Optimization: Teacher-student mechanism. The introduction of the auxiliary distribution $q$ relaxes the learning problem of $p_\theta$, originally only over $\bm{\theta}$, to be now alternating between $q$ and $\bm{\theta}$. Here $q$ acts as a conduit between the exogenous experience and the target model: it on the one hand subsumes the experience (by maximizing the expected $f$ value), and on the other hand passes it incrementally to the target model (by minimizing the divergence $\mathbb{D}$). The following fixed point iteration between $q$ and $\bm{\theta}$ illustrates this optimization strategy under the SE. Let us plug into Equation 3.2 the popular cross entropy (CE) as the divergence function, that is, $\mathbb{D}(q,p_\theta)=-\mathbb{E}_q[\log p_\theta]$, and Shannon entropy as the uncertainty measure, that is, $\mathbb{H}(q)=-\mathbb{E}_q[\log q]$. We further assume the experience $f$ is independent of the model parameters $\bm{\theta}$ (the assumption is indeed not necessary for the teacher step). We have, at iteration $n$:
where $Z$ is the normalization factor. The first step embodies a ‘teacher’s update’ where the teacher $q$ ingests experience $f$ and builds on current states of the student $p_{\theta^{(n)}}$; the second step is reminiscent of a ‘student’s update’ where the student $p_\theta$ updates its states by maximizing its alignment (here measured by CE) with the teacher.
Besides, the auxiliary $q$ is an easy-to-manipulate intermediate form in the training that permits rich approximate inference tools for tractable optimization. We have the flexibility of choosing its surrogate functions, ranging from the principled variational approximations for the target distribution in a properly relaxed space (e.g., mean fields) where gaps and bounds can be characterized, to the arbitrary neural network-based ‘inference networks’ that are highly expressive and easy to compute. As can be easily shown (e.g., see Section 4.1.3), popular training heuristics, such as EM, variational EM, wake-sleep, forward and backward propagation, and so on, are all direct instantiations or variants of the above teacher-student mechanism with different choices of the form of $q$.
More generally, a broad set of sophisticated algorithms, such as the policy gradient for reinforcement learning and the generative adversarial learning, can also be easily derived by plugging in specific designs of the experience function $f$ and divergence $\mathbb{D}$. Table 1 summarizes various specifications of the SE components that recover a range of existing well-known algorithms from different paradigms. As shown in more detail in the subsequent sections, the standard equation (Equations 3.1 and 3.2) offers a unified and universal paradigm for model training under many scenarios based on many types of experience, potentiating a turnkey implementation and a more generalizable theoretical characterization.
Table 1. Example configurations of the components in the standard equation (Equations 3.1 and 3.2), which recover different existing algorithms. Here, ‘CE’ means Cross Entropy; ‘JSD’ is the Jensen-Shannon divergence; ‘W_{1} dist.’ is the first-order Wasserstein distance; and ‘KL’ is the KL divergence. Refer to Sections 4, 5, and 6 for more details. (Scroll or use slider at bottom of table to see additional columns.)
The experience function $f(\bm{t})$ in the standard equation can be instantiated to encode vastly distinct types of experience. Different choices of $f(\bm{t})$ result in learning algorithms applied to different problems. With particular choices, the standard equation rediscovers a wide array of well-known algorithms. The resulting common treatment of the previously disparate algorithms is appealing as it offers new holistic insights into the commonalities and differences of those algorithms. Table 1 shows examples of extant algorithms that are recovered by the standard equation.
4.1. Data Instance Experience
We first consider the most common type of experience, namely, data instances, which are assumed to be independent and identically distributed (i.i.d.). Such data instance experience can appear in a wide range of contexts, including supervised, self-supervised, unsupervised, actively supervised, and other scenarios with data augmentation and manipulation. Figure 2 illustrates the experience functions based on the data instances.
4.1.1. Supervised Data Instances
Without loss of generality and for consistency of notations with the rest of the section, we consider data instances to consist of a pair of input-output variables, namely $\bm{t}=(\bm{x}, \bm{y})$. For example, in image classification, $\bm{x}$ represents the input image and $\bm{y}$ is the object label. In the supervised setting, we observe the full data drawn i.i.d. from the data distribution $(\bm{x}^*,\bm{y}^*)\sim p_{d}(\bm{x},\bm{y})$. For an arbitrary configuration $(\bm{x}_0,\bm{y}_0)$, its probability $p_{d}(\bm{x}_0,\bm{y}_0)$ under the data distribution can be seen as measuring the expected similarity between $(\bm{x}_0,\bm{y}_0)$ and true data samples $(\bm{x}^*,\bm{y}^*)$, and be written as $p_{d}(\bm{x}_0,\bm{y}_0)=\mathbb{E}_{p_{d}(\bm{x}^*,\bm{y}^*)}\left[ \mathbb{I}_{(\bm{x}^*,\bm{y}^*)}(\bm{x}_0,\bm{y}_0) \right]$. Here the similarity measure is $\mathbb{I}_{(\bm{x}^*,\bm{y}^*)}(\bm{x},\bm{y})$, an indicator function that takes the value $1$ if $(\bm{x},\bm{y})$ equals $(\bm{x}^*,\bm{y}^*)$ and $0$ otherwise (we will see other similarity measures shortly). In practice, we are given an empirical distribution $\tilde{p}_{d}(\bm{x},\bm{y})$ by observing a collection of instances $\mathcal{D}$ on which the expected similarity is evaluated:
where $N$ is the size of the data set $\mathcal{D}$, and $m(\bm{x},\bm{y})$ is the number of occurrences of the configuration $(\bm{x},\bm{y})$ in $\mathcal{D}$.
The experience function $f$ accommodates the data instance experience straightforwardly as below:
Figure 2 (a)–(b) shows an illustration. In particular, the logarithm of the expected similarity is used as the experience function score, that is, the more ‘similar’ a configuration $(\bm{x},\bm{y})$ is to the observed data instances, the higher its quality. The logarithm serves to make the subsequent derivations more convenient as can be seen below.
With this from of $f$, we show that the SE derives the conventional supervised MLE algorithm.
Supervised MLE. In the SE Equation 3.2 (with cross entropy and Shannon entropy), we set $\alpha=1$, and $\beta$ to a very small positive value $\epsilon$. As a result, the auxiliary distribution $q(\bm{x},\bm{y})$ is determined directly by the full data instances (not the model $p_\theta$). That is, the solution of $q$ in the teacher-step (Equation 3.3) is:
which reduces to the empirical distribution. The subsequent student-step that maximizes the log-likelihood of samples from $q$ then leads to the supervised MLE updates w.r.t. $\bm{\theta}$.
4.1.2. Self-Supervised Data Instances
Given an observed data instance $\bm{t}^* \in \mathcal{D}$ in general, one could potentially derive various supervision signals based on the structures of the data and the target model. In particular, one could apply a “split” function that artificially partitions $\bm{t}^*$ into two parts $(\bm{x}^*,\bm{y}^*)=\textit{split}(\bm{t}^*)$ in different, sometimes stochastic ways. Then the two parts are treated as the input and output for the properly designed target model $p_\theta(\bm{x}, \bm{y})$ for supervised MLE as above, by plugging in the slightly altered experience function:
A key difference from the above standard supervised learning setting is that now the target variable $\bm{y}$ is not costly obtained labels or annotations, but rather part of the massively available data instances. The paradigm of treating part of observed instance as the prediction target is called ‘self-supervised’ learning (Lecun & Misra, 2021) and has achieved great success in language and vision modeling. For example, in language modeling (Brown et al., 2020;Devlin et al., 2019), the instance $\bm{t}$ is a piece of text, and the ‘split’ function usually selects from $\bm{t}$ one or few words to be the target $\bm{y}$ and the remaining words to be $\bm{x}$.
4.1.3. Unsupervised Data Instances
In the unsupervised setting, for each instance $\bm{t}=(\bm{x},\bm{y})$, such as (image, cluster index), we only observe the $\bm{x}$ part. That is, we are given a data set $\mathcal{D}=\{\bm{x}^*\}$ without the associated $\bm{y}^*$. The data set defines the empirical distribution $\tilde{p}_{d}(\bm{x})$. The experience can be encoded in the same form as the supervised data (Equation 4.2) but now with only the information of $\bm{x}^*$:
Applying the SE to this setting with proper specifications derives the unsupervised MLE algorithm.
Unsupervised MLE. The form of Equation 3.2 is reminiscent of the variational free energy objective in the standard EM for unsupervised MLE (Equation 2.9). We can indeed get exact correspondence by setting $\alpha=\beta=1$, and setting the auxiliary distribution $q(\bm{x},\bm{y})=\tilde{p}_{d}(\bm{x})q(\bm{y}|\bm{x})$. The reason for $\beta=1$, which differs from the specification $\beta=\epsilon$ in the supervised setting, is that the auxiliary distribution $q$ cannot be determined fully by the unsupervised ‘incomplete’ data experience alone. Instead, it additionally relies on $p_\theta$ through the divergence term. Here $q$ is assumed a specialized decomposition $q(\bm{x},\bm{y})=\tilde{p}_{d}(\bm{x})q(\bm{y}|\bm{x})$ where $\tilde{p}_{d}(\bm{x})$ is fixed and thus not influenced by $p_\theta$. In contrast, if no structure of $q$ is assumed, we could potentially obtain an extended, instance-weighted version of EM where each instance $\bm{x}^*$ is weighted by the marginal likelihood $p_\theta(\bm{x}^*)$, in line with the previous weighted EM methods for robust clustering (Gebru et al., 2016; Yu et al., 2011).
4.1.4. Manipulated Data Instances
Data manipulation, such as reweighting data instances or augmenting an existing data set with new instances, is often a crucial step for efficient learning, such as in a low data regime or in presence of low-quality data sets (e.g., imbalanced labels). We show that the rich data manipulation schemes can be treated as experience and be naturally encoded in the experience function (Hu, Tan et al., 2019). This is done by extending the data-instance experience function (Equation 4.2), in particular by enriching the similarity metric in different ways. The discussion here generally applies to data instance $\bm{t}$ of any structures, for example, $\bm{t}=(\bm{x},\bm{y})$ or $\bm{t}=\bm{x}$.
Data reweighting. Rather than assuming the same importance of all data instances, we can associate each instance $\bm{t}^*$ with an importance weight $w(\bm{t}^*)\in\mathbb{R}$, so that the learning pays more attention to those high-quality instances, while low-quality ones (e.g., with noisy labels) are downplayed. This can be done by scaling the above 0/1 indicator function (e.g., Equation 4.2) with the weight (Figure 2[c]):
Plugging $f_{\text{data-w}}$ into the SE (Equation 3.2) with the same other specification of supervised MLE ($\alpha=1, \beta=\epsilon$), we get the update rule of model parameters $\bm{\theta}$ in the student-step (Equation 3.3):
which is the familiar weighted supervised MLE. The weights $w$ can be specified a priori based on heuristics, for example, using inverse class frequency. In many cases it is desirable to automatically induce and adapt the weights during the course of model training. In Section 9.2, we discuss how the SE framework can easily enable automated data reweighting by reusing existing algorithms that were designed to solve other seemingly unrelated problems.
Data augmentation. Data augmentation expands existing data by adding synthetically modified copies of existing data instances (e.g., by rotating an existing image at random angles), and is widely used for increasing data size or encouraging invariance in learned representations (e.g., object label is invariant to image rotation). The indicator function $\mathbb{I}$ as the similarity metric in Equation 4.2 restrictively requires exact match between the true $\bm{t}^*$ and the configuration $\bm{t}$. Data augmentation arises as a ‘relaxation’ to the similarity metric. Let $a_{\bm{t}^*}(\bm{t})\geq 0$ be a distribution that assigns non-zero probability to not only the exact $\bm{t}^*$ but also other configurations $\bm{t}$ related to $\bm{t}^*$ in certain ways (e.g., all rotated images $\bm{t}$ of the observed image $\bm{t}^*$). Replacing the indicator function metric in Equation 4.2 with the new $a_{\bm{t}^*}(\bm{t})\geq 0$ yields the experience function for data augmentation (Figure 2[d]):
The metric $a_{\bm{t}^*}(\bm{t})$ can be defined in various ways, leading to different augmentation strategies. For example, setting $a_{\bm{t}^*}(\bm{t})\propto \exp\{ R(\bm{t},\bm{t}^*) \}$, where $R(\bm{t},\bm{t}^*)$ is a task-specific evaluation metric such as BLEU for machine translation, results in the reward-augmented maximum likelihood (RAML) algorithm (Norouzi et al., 2016). Besides the manually designed strategies, we can also specify $a_{\bm{t}^*}(\bm{t})$ as a parameterized transformation process and learn any free parameters thereof automatically (Section 6). Notice the same form of the augmentation experience $f_{\text{data-aug}}$ and the reweighting experience $f_{\text{data-w}}$, where the similarity metrics both include learnable components (i.e., $a_{\bm{t}^*}(\bm{t})$ and $w(\bm{t}^*)$, respectively). Thus the same approach to automated data reweighting can also be applied for automated data augmentation, as discussed more in Section 9.2.
4.1.5. Actively Supervised Data Instances
Instead of access to data instances $\bm{x}^*$ with readily available labels $\bm{y}^*$, in the active supervision setting, we are presented with a large pool of unlabeled instances $\mathcal{D}=\{\bm{x}^*\}$ as well as a certain budget for querying an oracle (e.g., human annotators) for labeling a limited set of instances. To minimize the need for labeled instances, we need to strategically select queries from the pool according to an informativeness measure $u(\bm{x})\in\mathbb{R}$. For example, $u(\bm{x})$ can be the predictive uncertainty on the instance $\bm{x}$, quantified by the Shannon entropy of the predictive distribution or the vote entropy based on a committee of predictors (Dagan & Engelson, 1995).
Mapping the standard equation to this setting, we show the informativeness measure $u(\bm{x})$ is subsumed as part of the experience. Intuitively, $u(\bm{x})$ encodes our heuristic belief about sample ‘informativeness’. This heuristic is a form of information we inject into the learning system. Denote the oracle as $o$ from which we can draw a label $\bm{y}^*\sim o(\bm{x}^*)$. The active supervision experience function is then defined as:
where the first term is essentially the same as the supervised data experience function (Equation 4.2) with the only difference that now the label $\bm{y}^*$ is from the oracle rather than pre-given in $\mathcal{D}$; $\lambda>0$ is a trade-off parameter. The formulation of the active supervision is interesting as it is simply a combination of the common supervision experience and the informativeness measure in an additive manner.
We plug $f_{\text{active}}$ into the SE and obtain the algorithm to carry out learning. The result turns out to recover classical active learning algorithms.
Active learning. Specifically, in Equation 3.2, setting $f=f_{\text{active}}$, and $(\alpha=1,\beta=\epsilon)$ as in supervised MLE, the resulting student-step in Equation 3.3 for updating $\bm{\theta}$ is written as
If the pool $\mathcal{D}$ is large, the update can be carried out by the following procedure: we first pick a random subset $\mathcal{D}_{\text{sub}}$ from $\mathcal{D}$, and select a sample from $\mathcal{D}_{\text{sub}}$ according to the informativeness distribution proportional to $\exp\{\lambda u(\bm{x})\}$ over $\mathcal{D}_{\text{sub}}$. The sample is then labeled by the oracle, which is finally used to update the target model. By setting $\lambda$ to a very large value (i.e., a near-zero ‘temperature’ $1/\lambda$), we tend to select the most informative sample from $\mathcal{D}_{\text{sub}}$. The procedure rediscovers the algorithm proposed in (Ertekin et al., 2007) and more generally the pooling-based active learning algorithms (Settles, 2012).
4.2. Knowledge-Based Experience
Many aspects of problem structures and human knowledge are difficult if not impossible to be expressed through individual data instances. Examples include the knowledge of expected feature values, maximum margin structures (Section 2.3), logical rules, and so on. The knowledge generally imposes constraints that we want the target model to satisfy. The experience function in the standard equation is a natural vehicle for incorporating such knowledge constraints in learning. Given a configuration $\bm{t}$, the experience function $f(\bm{t})$ measures the degree to which the configuration satisfies the constraints.
As an example, we consider first-order logic (FOL) rules, which provide an expressive declarative language to encode complex symbolic knowledge (Hu et al., 2016). More concretely, let $f_{\text{rule}}(\bm{t})$ be an FOL rule w.r.t. the variables $\bm{t}$. For flexibility, we use soft logic (Bach et al., 2017) to formulate the rule. Soft logic allows continuous truth values from the interval $[0,1]$ instead of $\{0, 1\}$, and the Boolean logical operators are redefined as:
$\begin{split}
A \& B = \max\{A+B-1, 0\}&,\quad A\vee B =\min\{A+B,1\} \\
A_1 \wedge \dots \wedge A_N = \sum\nolimits_{i}A_i / N&,\quad \neg A = 1 - A.
\end{split}
\ \ \ \ \ \text{(4.12)}$
Here $\&$ and $\wedge$ are two different approximations to logical conjunction: $\&$ is useful as a selection operator (e.g., $A\&B=B$ when $A=1$, and $A\&B=0$ when $A=0$), while $\wedge$ is an averaging operator. To give a concrete example, consider the problem of sentiment classification, where given a sentence $\bm{x}$, we want to predict its sentiment $\bm{y}\in \{\text{negative}\ 0,\text{positive}\ 1\}$. A challenge for a sentiment classifier is to understand the contrastive sense within a sentence and capture the dominant sentiment precisely. For example, if a sentence is of structure ‘A-but-B’ with the connective ‘but’, the sentiment of the half sentence after ‘but’ dominates. Let $\bm{x}_B$ be the half sentence after ‘but’ and $\tilde{\bm{y}}_B\in[0,1]$ the (soft) sentiment prediction over $\bm{x}_B$ by the current model, a possible way to express the knowledge as a logical rule $f_{\text{rule}}(\bm{x}, \bm{y})$ is:
where $\mathbb{I}(\cdot)$ is an indicator function that takes $1$ when its argument is true, and 0 otherwise. Given an instantiation (a.k.a. grounding) of $(\bm{x}, \bm{y}, \tilde{\bm{y}}_B)$, the truth value of $f_{\text{rule}}(\bm{x}, \bm{y})$ can be evaluated by definitions in Equation 4.12. Intuitively, the $f_{\text{rule}}(\bm{x}, \bm{y})$ truth value gets closer to $1$ when $\bm{y}$ and $\tilde{\bm{y}}_B$ are more consistent.
We then make use of the knowledge-based experience such as $f_{\text{rule}}(\bm{t})$ to drive learning. The standard equation rediscovers classical algorithms for learning with symbolic knowledge.
Posterior regularization and extensions. By setting $\alpha=\beta=1$ and $f$ to a constraint function such as $f_{\text{rule}}$, the SE with cross entropy naturally leads to a generalized posterior regularization framework (Hu et al., 2016):
which extends the conventional Bayesian inference formulation (Section 2.3) by permitting regularization on arbitrary random variables of arbitrary models (e.g., deep neural networks) with complex rule constraints.
The trade-off hyperparameters can also take other values. For example, by allowing arbitrary $\alpha\in\mathbb{R}$, the objective corresponds to the unified expectation maximization (UEM) algorithm (Samdani et al., 2012) that extends the posterior regularization for added flexibility.
4.3. Reward Experience
We now consider a very different learning setting commonly seen in robotic control and other sequential decision making problems. In this setting, experience is gained by the agent interacting with external environment and collecting feedback in the form of rewards. Formally, we consider a Markov decision process (MDP) as illustrated in Figure 3, where $\bm{t}=(\bm{x}, \bm{y})$ is the state-action pair. For example, in playing a video game, the state $\bm{x}$ is the game screen by the environment (the game engine) and $\bm{y}$ can be any game actions.
At time $t$, environment is in state $\bm{x}_t$. The agent draws an action $\bm{y}_t$ according to the policy $p_\theta(\bm{y}|\bm{x})$. The state subsequently transitions to $\bm{x}_{t+1}$ following certain transition dynamics of the environment, and yields a reward $r_t=r(\bm{x}_t, \bm{y}_t)\in \mathbb{R}$. The general goal of the agent is to learn the policy $p_\theta(\bm{y}|\bm{x})$ to maximize the reward in the long run. There could be different specifications of the goal. In this section we focus on the one where we want to maximize the expected discounted reward starting from a state drawn from an arbitrary state distribution $p_0(\bm{x})$, with a discount factor $\gamma\in[0,1]$ applied to future rewards.
A base concept that plays a central role in characterizing the learning in this setting is the action value function, also known as the $Q$ function, which is the expected discounted future reward of taking action $\bm{y}$ in state $\bm{x}$ and continuing with the policy $p_{\theta}$:
where the expectation is taken by following the state dynamics induced by the policy (thus the dependence of $Q^{\theta}$ on policy parameters $\bm{\theta}$). We next discuss how $Q^{\theta}(\bm{x},\bm{y})$ can be used to specify the experience function in different ways, which in turn derives various known algorithms in reinforcement learning (RL) (Sutton & Barto, 2018). Note that here we are primarily interested in learning the conditional model (policy) $p_\theta(\bm{y}|\bm{x})$. Yet we can still define the joint distribution as $p_\theta(\bm{x},\bm{y})=p_\theta(\bm{y}|\bm{x})p_0(\bm{x})$.
4.3.1. Expected Future Reward
The first simple way to use the reward signals as the experience is by defining the experience function as the logarithm of the expected future reward:
which leads to the classical policy gradient algorithm (Sutton et al., 2000).
Policy gradient. With $\alpha=\beta=1$, we arrive at policy gradient. To see this, consider the teacher-student optimization procedure in Equation 3.3, where the teacher-step yields the $q$ solution:
Here the first equation is due to the log-derivative trick $g \nabla \log g = \nabla g$; and the second equation is due to the policy gradient theorem (Sutton et al., 2000), where $\mu^{\theta}(\bm{x})=\sum_{t=0}^{\infty} \gamma^t p(\bm{x}_t=\bm{x})$ is the unnormalized discounted state visitation measure. The final form is exactly the policy gradient up to a multiplication factor $1/Z$.
We can also consider a slightly different use of the reward, by directly setting the experience function to the $Q$ function:
This turns out to connect to the known RL-as-inference approach that has a long history of research (Abdolmaleki et al., 2018; Dayan & Hinton, 1997; Deisenroth et al., 2013; Levine, 2018; Rawlik et al., 2012).
RL as inference. We set $\alpha=\beta:=\rho>0$. The configuration corresponds to the approach that casts RL as a probabilistic inference problem. To see this, we introduce an additional binary random variable $o$, with $p(o=1|\bm{x},\bm{y})\propto\exp\{Q(\bm{x},\bm{y}) / \rho\}$. Here $o=1$ is interpreted as the event that maximum reward is obtained, $p(o=1|\bm{x},\bm{y})$ is seen as the ‘conditional likelihood’, and $\rho$ is the temperature. The goal of learning is to maximize the marginal likelihood of optimality: $\log p(o=1)$, which, however, is intractable to solve. Much like how the standard equation applied to unsupervised MLE provides a surrogate variational objective for the marginal data likelihood (Section 4.1.3), here the standard equation also derives a variational bound for $\log p(o=1)$ (up to a constant factor) with the above specification of $(f,\alpha,\beta)$:
The subsequent student-step involves approximation by fixing $\bm{\theta}=\bm{\theta}^{(n)}$ in $Q^{\theta}(\bm{x},\bm{y})$ in the above variational objective, and minimizes only $\mathbb{E}_{q^{(n)}(\bm{x},\bm{y})}\left[ \log p_\theta(\bm{x},\bm{y}) \right]$ w.r.t. $\bm{\theta}$.
4.3.2. Intrinsic Reward
Rewards provided by the extrinsic environment can be sparse in many real-world sequential decision problems. Learning in such problems is thus difficult due to the lack of supervision signals. A method to alleviate the difficulty is to supplement the extrinsic reward with dense intrinsic reward that is generated by the agent itself (i.e., the agent is intrinsically motivated). The intrinsic reward can be induced in various ways, such as the ‘curiosity’-based reward that encourages the agent to explore novel or ‘surprising’ states (Houthooft et al., 2016; Pathak et al., 2017; Schmidhuber, 2010), or the ‘optimal reward’, which is designed with the goal of encouraging maximum extrinsic reward at the end (Singh et al., 2010; Zheng et al., 2018). Formally, let $r_t^{in}=r^{in}(\bm{x}_t, \bm{y}_t)\in\mathbb{R}$ be the intrinsic reward at time $t$ with state $\bm{x}_t$ and action $\bm{y}_t$. For example, in (Pathak et al., 2017), $r^{in}_t$ is the prediction error (i.e., the ‘surprise’) of the next state $\bm{x}_{t+1}$. Let $Q^{in,\theta}(\bm{x},\bm{y})$ denote the action-value function for the intrinsic reward, defined in a similar way as the extrinsic $Q^{\theta}(\bm{x},\bm{y})$:
It is straightforward to derive the intrinsically motivated variant of the policy gradient algorithm (and other RL algorithms discussed below), by replacing the standard extrinsic-only $Q^{\theta}(\bm{x},\bm{y})$ in the experience function Equation 4.16 with the combined $Q^{\theta}(\bm{x},\bm{y}) + Q^{in,\theta}(\bm{x},\bm{y})$. Let $f^{\theta}_\text{reward,ex+in}(\bm{x},\bm{y})$ denote the resulting experience function that incorporates both the extrinsic and the additive intrinsic rewards.
We can notice some sort of symmetry between $f^{\theta}_\text{reward,ex+in}(\bm{x},\bm{y})$