# Discriminative Learning under Covariate Shift with a Single Optimization Problem

# Abstract and Keywords

This chapter derives a discriminative model for learning under differing training and test distributions, and is organized as follows. Section 9.2 formalizes the problem setting. Section 9.3 reviews models for different training and test distributions. Section 9.4 introduces the discriminative model, and Section 9.5 describes the joint optimization problem. Primal and kernelized classifiers are derived for various training and test distributions in Sections 9.6 and 9.7. Section 9.8 analyzes the convexity of the integrated optimization problem. Section 9.9 provides empirical results, and Section 9.10 concludes.

*Keywords:*
covariate shift adaptation, training, test distributions, discriminative model, joint optimization, kernelized learning

We address classification problems for which the training instances are governed by a distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution is modeled explicitly. We formulate the general problem of learning under covariate shift as an integrated optimization problem and instantiate a kernel logistic regression and an exponential model classifier for differing training and test distributions. We show under which condition the optimization problem is convex. We study the method empirically on problems of spam filtering, text classification, and land mine detection.

# 9.1 Introduction

Most machine learning algorithms are constructed under the assumption that the training data is governed by the exact same distribution which the model will later be exposed to. In practice, control over the data generation process is often less than perfect. Training data may be obtained under laboratory conditions that cannot be expected after deployment of a system; spam filters may be used by individuals whose distribution of inbound emails diverges from the distribution reflected in public training corpora; image processing systems may be deployed to foreign geographic regions where vegetation and lighting conditions result in a distinct distribution of input patterns.

The case of distinct training and test distributions in a learning problem has been referred to as *covariate shift* and *sample selection bias*, although the term sample selection bias strictly refers to a case in which each training instance is originally drawn from the test distribution, but is then selected into the training sample with some probability, or discarded otherwise.

The covariate shift model and the *missing at random* case in the sample selection bias model allow for differences between the training and test distribution of instances; the conditional distribution of the class variable given the instance is constant over training and test set.

In discriminative learning tasks such as classification, the classifier’s goal is to produce the correct output given the input. It is widely accepted that this is best performed by discriminative learners that directly maximize a quality measure of the produced output. Model-based optimization criteria such as the joint likelihood of input and output, by contrast, additionally assess how well the classifier models the distribution of input values. This amounts to adding a term to the criterion that is irrelevant for the task at hand.

We contribute a discriminative model for learning under arbitrarily different training and test distributions. The model directly characterizes the divergence between training and test distribution, without the intermediate – intrinsically model-based – step of estimating training and test distribution. We formulate the search for all model parameters as an integrated optimization problem. This complements the predominant procedure of first estimating the bias of the training sample, and then learning the classifier on a weighted version of the training sample. We show that the integrated optimization can be convex, depending on the model type; it is convex for the exponential model. We derive a Newton gradient descent procedure, leading to a kernel logistic regression and an exponential model classifier for covariate shift.

After formalizing the problem setting in section 9.2, we review models for differing training and test distributions in section 9.3. Section 9.4 introduces our discriminative model [Bickel et al., 2007] and section 9.5 describes the joint optimization problem. We derive primal and kernelized classifiers for differing training and test distributions in sections 9.6 and 9.7. In section 9.8, we analyze the convexity of the integrated optimization problem. Section 9.9 provides empirical results and section 9.10 concludes.

# 9.2 Problem Setting

In the *covariate shift* problem setting, a labeled training sample $X^{tr} = (\mathbf{x}_1, \ldots, \mathbf{x}_{n_{tr}})$ with labels $y^{tr} = (y_1, \ldots, y_{n_{tr}})$ is available. This training sample is governed by an unknown distribution *p*(**x**|λ); labels are drawn according to an unknown target concept *p*(*y*|**x**). In addition, an unlabeled test set $X^{te} = (\mathbf{x}_{n_{tr}+1}, \ldots, \mathbf{x}_{n_{tr}+n_{te}})$ becomes available. The test set is governed by a different unknown distribution, *p*(**x**|*θ*). Training and test distribution may differ arbitrarily, but there is only one unknown target conditional class distribution *p*(*y*|**x**).

The goal is to find a classifier *f* and to predict the missing labels $y_{n_{tr}+1}, \ldots, y_{n_{tr}+n_{te}}$ for the test instances. From a purely transductive perspective, the classifier can even be seen as an auxiliary step and may be discarded after the labels $y_{n_{tr}+1}, \ldots, y_{n_{tr}+n_{te}}$ have been predicted. The classifier should in any case perform well on the test data; that is, it should minimize the expected loss $E_{(\mathbf{x}, y)\sim\theta}[\ell(f(\mathbf{x}), y)]$, which is defined with respect to the unknown test distribution *p*(**x**|*θ*).

Note that directly training *f* on the training data *X*^{tr} would minimize the loss with respect to *p*(**x**|*λ*). The minimum of this optimization problem will not generally coincide with the minimal loss on *p*(**x**|*θ*).

# 9.3 Prior Work

If training and test distributions were known, then the loss on the test distribution could be minimized by weighting the loss on the training distribution with an instance-specific factor. Proposition 9.1 [Shimodaira, 2000] illustrates that the scaling factor has to be $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$.

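**Proposition 9.1** *The expected loss with respect to θ equals the expected loss with respect to λ with weights $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$ for the loss incurred by each* **x**, *provided that the support of p*(**x**|*θ*) *is contained in the support of p*(**x**|*λ*):

$$E_{(\mathbf{x},y)\sim\theta}\left[\ell(f(\mathbf{x}),y)\right] = E_{(\mathbf{x},y)\sim\lambda}\left[\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}\,\ell(f(\mathbf{x}),y)\right]. \qquad (9.1)$$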

After expanding the expected value into its integral $\int \ell(f(\mathbf{x}), y)\, p(\mathbf{x}, y|\theta)\, d(\mathbf{x}, y)$, the joint distribution *p*(**x**, *y*|*θ*) is decomposed into *p*(**x**|*θ*)*p*(*y*|**x**, *θ*). Since *p*(*y*|**x**, λ) = *p*(*y*|**x**) = *p*(*y*|**x**, *θ*) is the global conditional distribution of the class variable given the instance, proposition 9.1 follows after multiplying and dividing the integrand by *p*(**x**|λ). All instances **x** with positive *p*(**x**|*θ*) are integrated over. Hence, (9.1) holds as long as each **x** with positive *p*(**x**|*θ*) also has a positive *p*(**x**|λ); otherwise, the denominator vanishes. This shows that covariate shift can only be compensated for as long as the training distribution covers the entire support of the test distribution. If a test instance had zero density under the training distribution, the test-to-training density ratio by which its loss would have to be scaled would have a zero denominator.
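The reweighting identity can be checked exactly on a small discrete instance space. The sketch below is only an illustration: the three-point distributions, the conditional `p_y1`, and the fixed classifier `f` are made-up assumptions, not taken from the chapter.

```python
# Exact check of the reweighting identity on a three-point instance space.
# All numbers are illustrative assumptions.

p_train = {0: 0.5, 1: 0.3, 2: 0.2}   # p(x | lambda)
p_test = {0: 0.2, 1: 0.3, 2: 0.5}    # p(x | theta), same support as p_train
p_y1 = {0: 0.9, 1: 0.6, 2: 0.3}      # shared conditional p(y = +1 | x)


def f(x):
    """A fixed hypothetical classifier: predict +1 for x < 2, else -1."""
    return 1 if x < 2 else -1


def zero_one(pred, y):
    return 0.0 if pred == y else 1.0


def expected_loss(p_x, weight=lambda x: 1.0):
    """Exact expectation of weight(x) * loss under p_x(x) * p(y | x)."""
    total = 0.0
    for x, px in p_x.items():
        for y, py in ((1, p_y1[x]), (-1, 1.0 - p_y1[x])):
            total += px * py * weight(x) * zero_one(f(x), y)
    return total


test_loss = expected_loss(p_test)
train_loss = expected_loss(p_train)
# Training expectation with each loss scaled by p(x | theta) / p(x | lambda).
reweighted = expected_loss(p_train, weight=lambda x: p_test[x] / p_train[x])
```

Here `reweighted` agrees with `test_loss` up to rounding, while the unweighted `train_loss` does not. If some `x` had `p_train[x] == 0` but `p_test[x] > 0`, the weight would be undefined, which is the support condition of the proposition.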

Both *p*(**x**|*θ*) and *p*(**x**|λ) are unknown, but *p*(**x**|*θ*) is reflected in *X*^{te}, as is *p*(**x**|λ) in *X*^{tr}. A straightforward approach to compensating for covariate shift is to first obtain estimates $\widehat{p}(\mathbf{x}|\theta)$ and $\widehat{p}(\mathbf{x}|\lambda)$ from the test and training data, respectively, using kernel density estimation [Shimodaira, 2000; Sugiyama and Müller, 2005b] (see also chapter 7). In a second step, the estimated density ratio is used to resample the training instances, or to train with weighted examples.

This method decouples the problem. First, it estimates training and test distributions. This step is intrinsically model-based and only loosely related to the ultimate goal of accurate classification. In a subsequent step, the classifier is derived given fixed weights. Since the parameters of the final classifier and the parameters that control the weights are not independent, this decomposition into two optimization steps cannot generally find the optimal setting of the *joint* parameter vector.

A line of work on learning under sample selection bias has meandered from the statistics and econometrics community into machine learning [Heckman, 1979; Zadrozny, 2004]. Sample selection bias relies on a model of the data generation process. Test instances are drawn under *p*(**x**|*θ*). Training instances are drawn by first sampling **x** from the test distribution *p*(**x**|*θ*). A selector variable *s* then decides whether **x** is moved into the training set (*s* = 1) or moved into the rejected set (*s* = 0). For instances in the training set (*s* = 1) a label is drawn from *p*(*y*|**x**); for the instances in the rejected set the labels are unknown. A typical scenario for sample selection bias is credit scoring. The labeled training sample consists of customers who were given a loan in the past and the rejected sample are customers that asked for but were not given a loan. New customers asking for a loan reflect the test distribution.

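The distribution of the selector variable maps the test onto the training distribution:

$$p(\mathbf{x}|\lambda) \propto p(s = 1|\mathbf{x}, \theta, \lambda)\, p(\mathbf{x}|\theta). \qquad (9.2)$$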

Proposition 9.2 [Zadrozny, 2004; Bickel and Scheffer, 2007] says that minimizing the loss on instances weighted by *p*(*s* = 1|**x**, *θ*, λ)^{−1} in fact minimizes the expected loss with respect to *θ*.

**Proposition 9.2** *The expected loss with respect to θ is proportional to the expected loss with respect to λ with weights p*(*s* = 1|**x**, *θ*, λ)^{−1} *for the loss incurred by each* **x**, *provided that the support of p*(**x**|*θ*) *is contained in the support of p*(**x**|λ):

$$E_{(\mathbf{x},y)\sim\theta}\left[\ell(f(\mathbf{x}),y)\right] \propto E_{(\mathbf{x},y)\sim\lambda}\left[\frac{1}{p(s = 1|\mathbf{x}, \theta, \lambda)}\,\ell(f(\mathbf{x}),y)\right]. \qquad (9.3)$$

When the model is implemented, *p*(*s* = 1|**x**, *θ*, λ) is learned by discriminating the training against the rejected examples; in a second step the target model is learned by following proposition 9.2 and weighting training examples by *p*(*s* = 1|**x**, *θ*, λ)^{−1}. No test examples drawn directly from *p*(**x**|*θ*) are needed to train the model; only labeled selected and unlabeled rejected examples are required. This is in contrast to the covariate shift model, which requires samples drawn from the test distribution but assumes no selection process and needs no rejected examples.
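The proportionality in proposition 9.2 can also be made concrete on a toy example. In this sketch the instance space, the selection probabilities, and the per-instance losses are illustrative assumptions; the constant of proportionality comes out as the overall selection probability *p*(*s* = 1).

```python
# Toy check of inverse-selection-probability weighting (illustrative numbers).
p_test = {"a": 0.2, "b": 0.3, "c": 0.5}      # p(x | theta)
p_select = {"a": 0.9, "b": 0.5, "c": 0.1}    # p(s = 1 | x), positive everywhere
loss = {"a": 0.1, "b": 0.4, "c": 0.3}        # expected loss of a fixed classifier at x

# Selection maps the test distribution onto the training distribution.
p_selected = sum(p_test[x] * p_select[x] for x in p_test)             # p(s = 1)
p_train = {x: p_test[x] * p_select[x] / p_selected for x in p_test}   # p(x | lambda)

test_loss = sum(p_test[x] * loss[x] for x in p_test)
weighted_train_loss = sum(p_train[x] * loss[x] / p_select[x] for x in p_train)
```

In this example `weighted_train_loss * p_selected` equals `test_loss` up to rounding, so both quantities share the same minimizer.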

Propensity scores [Rosenbaum and Rubin, 1983; Lunceford and Davidian, 2004] are applied in settings related to sample selection bias; the training data is again assumed to be drawn from the test distribution *p*(**x**|*θ*) followed by a selection process. The difference from the sample selection bias setting is that the selected *and* the rejected examples are labeled. Weighting the selected examples by the inverse of the propensity score *p*(*s* = 1|**x**, λ, *θ*)^{−1} and weighting the rejected examples by *p*(*s* = 0|**x**, λ, *θ*)^{−1} results in two unbiased samples with respect to the test distribution.

Propensity scoring can precede a variety of analysis steps. This can be the training of a target model on reweighted data or just a statistical analysis of the two reweighted samples. A typical application for propensity scores is the analysis of the success of a medical treatment. Patients are selected to be given the treatment and some other patients are selected into the control group. If the selection is not randomized, the outcome (e.g., ratio of cured patients) of the two groups cannot be compared directly, and propensity scores can be applied.

Maximum entropy density estimation under sample selection bias has been studied by Dudík et al. [2006]. Bickel and Scheffer [2007] impose a Dirichlet process prior on several learning problems with related sample selection bias. Elkan [2001] and Japkowicz and Stephen [2002] investigate the case of training data that is only biased with respect to the class ratio; this can be seen as sample selection bias where the selection only depends on *y*.

Kernel mean matching (Gretton et al. in chapter 8) is a two-step method that first finds weights for the training instances such that the first moments of training and test sets (i.e., their mean values) match in feature space. The subsequent training step uses these weights. The procedure requires a universal kernel. Matching the means in feature space is equivalent to matching all moments of the distributions if a universal kernel is used.

Φ(⋅) is a mapping into a feature space and *B* is a regularization parameter. Gretton et al. in chapter 8 derive a quadratic program from (9.4) that can be solved with standard optimization tools:
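To make the mean-matching criterion tangible, the following sketch evaluates the squared feature-space distance between a weighted training mean and the test mean through kernel evaluations only. It is not the quadratic program of chapter 8: no optimization is performed, the constraints involving *B* are not enforced, and the RBF kernel, its `gamma`, and the toy data are assumptions made for illustration.

```python
import math


def rbf(a, b, gamma=0.5):
    """RBF kernel on equal-length tuples; gamma is an illustrative choice."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def mean_discrepancy(weights, x_train, x_test, kernel=rbf):
    """Squared distance between the weighted training mean and the test mean
    in feature space, expanded with the kernel trick."""
    n_tr, n_te = len(x_train), len(x_test)
    tr_tr = sum(weights[i] * weights[j] * kernel(x_train[i], x_train[j])
                for i in range(n_tr) for j in range(n_tr)) / n_tr ** 2
    tr_te = sum(weights[i] * kernel(x_train[i], x_test[j])
                for i in range(n_tr) for j in range(n_te)) / (n_tr * n_te)
    te_te = sum(kernel(a, b) for a in x_test for b in x_test) / n_te ** 2
    return tr_tr - 2.0 * tr_te + te_te


x_train = [(0.0,), (1.0,), (2.0,)]
x_test = [(1.0,), (2.0,), (2.0,)]
uniform = mean_discrepancy([1.0, 1.0, 1.0], x_train, x_test)
matched = mean_discrepancy([0.0, 1.0, 2.0], x_train, x_test)
```

With the weights `[0, 1, 2]` the weighted training sample reproduces the test sample exactly, so `matched` is zero up to rounding, whereas uniform weights leave a positive discrepancy.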

# 9.4 Discriminative Weighting Factors

In this section, we derive a purely discriminative model that directly estimates weights for the training instances. No distributions over instances are modeled explicitly. We first introduce a selector variable *σ*: For each element **x** of the training set, selector variable *σ* = 1 indicates that it has been drawn into *X*^{tr}. For each **x** in the test data, *σ* = −1 indicates that it has been drawn into the test set. The probability *p*(*σ* = 1|**x**, *θ*, λ) has the following intuitive meaning: Given that an instance **x** has been drawn at random from the bag *X*^{tr} ∪ *X*^{te} of training and test set, the probability that **x** originates from *X*^{tr} is *p*(*σ* = 1|**x**, *θ*, λ). Hence, the value of *σ* is observable for all training (*σ* = 1) and test (*σ* = −1) instances. The dependency between the instances and *σ* is undirected; neither training nor test set is assumed to be generated from the other sample.

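In the following equations we derive a discriminative expression for $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$ that no longer includes any density on instances. When *p*(*σ* = −1) ≠ 0, which is implied by the test set not being empty, the definition of *σ* allows us to rewrite the test distribution as *p*(**x**|*θ*) = *p*(**x**|*σ* = −1, *θ*). Since test instances depend only on parameter *θ* but not on parameter λ, the equation *p*(**x**|*σ* = −1, *θ*) = *p*(**x**|*σ* = −1, *θ*, λ) follows. By an analogous argument, *p*(**x**|λ) = *p*(**x**|*σ* = 1, *θ*, λ) when *p*(*σ* = 1) ≠ 0. This implies (9.5).

$$\begin{aligned}
\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)} &= \frac{p(\mathbf{x}|\sigma = -1, \theta, \lambda)}{p(\mathbf{x}|\sigma = 1, \theta, \lambda)} && (9.5)\\
&= \frac{p(\sigma = -1|\mathbf{x}, \theta, \lambda)\, p(\mathbf{x}|\theta, \lambda)}{p(\sigma = -1|\theta, \lambda)} \cdot \frac{p(\sigma = 1|\theta, \lambda)}{p(\sigma = 1|\mathbf{x}, \theta, \lambda)\, p(\mathbf{x}|\theta, \lambda)} && (9.6)\\
&= \frac{p(\sigma = 1|\theta, \lambda)}{p(\sigma = -1|\theta, \lambda)} \cdot \frac{p(\sigma = -1|\mathbf{x}, \theta, \lambda)}{p(\sigma = 1|\mathbf{x}, \theta, \lambda)} && (9.7)\\
&= \frac{p(\sigma = 1|\theta, \lambda)}{p(\sigma = -1|\theta, \lambda)} \left(\frac{1}{p(\sigma = 1|\mathbf{x}, \theta, \lambda)} - 1\right) && (9.8)
\end{aligned}$$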

In (9.6) the Bayes rule is applied twice; the two terms of *p*(**x**|*θ*, λ) cancel each other out in (9.7). Since *p*(*σ* = −1|**x**, *θ*, λ) = 1 − *p*(*σ* = 1|**x**, *θ*, λ), (9.8) follows. The conditional *p*(*σ* = 1|**x**, *θ*, λ) discriminates training (*σ* = 1) against test instances (*σ* = −1).

The significance of (9.8) is that it shows how the optimal example weights, the test-to-training ratio $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$, can be determined without knowledge of either training or test density. The right-hand side of (9.8) can be evaluated based on a model that discriminates training from test examples and outputs how much more likely an instance is to occur in the test data than it is to occur in the training data. Instead of potentially high-dimensional densities *p*(**x**|*θ*) and *p*(**x**|λ), a conditional distribution of the single binary variable *σ* needs to be modeled.
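A small numerical illustration of this point: if the selector posterior is known, the density ratio follows from it without evaluating either density. The sketch assumes the relation ratio = prior odds × (1/*p*(*σ* = 1|**x**) − 1); the two Gaussians and the sample sizes are hypothetical and serve only to obtain a posterior whose ratio can be verified directly.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def ratio_from_selector(p_sigma_train, prior_train):
    """Test-to-training density ratio computed from p(sigma = 1 | x).

    prior_train is p(sigma = 1), the fraction of training instances
    in the pooled sample.
    """
    prior_odds = prior_train / (1.0 - prior_train)
    return prior_odds * (1.0 / p_sigma_train - 1.0)


# Hypothetical setup: training x ~ N(0, 1), test x ~ N(1, 1),
# pooled with n_tr = 300 training and n_te = 100 test instances.
n_tr, n_te = 300, 100
prior_train = n_tr / (n_tr + n_te)


def true_selector_posterior(x):
    """Exact p(sigma = 1 | x) for the pooled sample, by the Bayes rule."""
    a = prior_train * gaussian_pdf(x, 0.0, 1.0)
    b = (1.0 - prior_train) * gaussian_pdf(x, 1.0, 1.0)
    return a / (a + b)


x = 0.7
via_selector = ratio_from_selector(true_selector_posterior(x), prior_train)
direct = gaussian_pdf(x, 1.0, 1.0) / gaussian_pdf(x, 0.0, 1.0)
```

Both `via_selector` and `direct` give the same value. In practice the exact posterior is unavailable and would be replaced by a learned model, which is the subject of the remainder of the section.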

Equation (9.8) leaves us with the problem of estimating a parametric model *p*(*σ* = 1|**x**, **v**) of *p*(*σ* = 1|**x**, *θ*, λ). Such a model would predict test-to-training density ratios for the training data in *X*^{tr} according to (9.8). In the following, we will derive the optimization problem that simultaneously determines parameters **v** of the test-to-training ratios and parameters **w** of the target classifier.

# 9.5 Integrated Model

Our goal is to find a classifier *f* which minimizes the expected loss under the test distribution. To this end, the best conceivable approximation is given by the Bayes decision based on all data available (9.9). For each test instance **x**, the Bayes rule decides on the class which minimizes the expected loss given **x** and all available data (9.10),

Let **w** be the parameters of a classification function *p*(*y*|**x**, **w**) and let **v** parameterize a model *p*(*σ* = 1|**x**, **v**) that characterizes the training-test difference. The Bayes decision is obtained by *Bayesian model averaging*—i.e., by integrating over all model parameters in (9.11),

Equation (9.11) exploits that the class-label posterior *p*(*y*|**x**, **w**) is conditionally independent of the parameters **v** of the test-to-training ratio given **w**, and also conditionally independent of the data given its parameters **w**. Bayesian model averaging (9.11) is usually computationally infeasible. The integral is therefore approximated by the single assignment of values to the parameters that maximizes it, the MAP estimator. In our case, the MAP estimator naturally assigns values to all parameters, **w** and **v** (9.13):

Equation (9.14) factorizes the joint posterior; (9.15) exploits that **w** is conditionally independent of the test data when the training-test difference **v** is given. Equation 9.16 applies the Bayes rule and shows that the posterior can be factorized into a likelihood function of the training data given the model parameters *P*(*X*^{tr}|**w**, **v**), a likelihood function of the observed selection variables *σ*—written *P*(*X*^{tr}, *X*^{te}|**v**)—and the priors on the model parameters.

The class-label posterior *p*(*y*|**x**, **w**^{MAP}) is conditionally independent of **v**^{MAP} given **w**^{MAP}. However, **w**^{MAP} and **v**^{MAP} are dependent. Assigning a single MAP value to [**w**, **v**] instead of integrating over all values corresponds to the common approximation of the Bayes decision rule by a MAP hypothesis. However, sequential maximization of *p*(**v**|X^{tr},X^{te}) over parameters **v** followed by maximization of *p*(**w**|**v**, X^{tr}) with fixed **v** over parameters **w** would amount to an additional degree of approximation and will not generally coincide with the maximum of the product in (9.14).

We will now discuss the likelihood functions *P*(*X*^{tr}|**w**, **v**) and *P*(*X*^{tr}, *X*^{te}|**v**). Since our goal is discriminative training, the likelihood function *P*(*X*^{tr}|**w**) (not taking training-test difference **v** into account) would be $\prod_{i} p(y_i|\mathbf{x}_i, \mathbf{w})$. Intuitively, $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$ dictates how many times, on average, **x** should occur in *X*^{tr} if *X*^{tr} were governed by the test distribution *θ*. When the individual conditional likelihood of **x** is *p*(*y*|**x**, **w**), then the likelihood of $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$ occurrences of **x** is $p(y|\mathbf{x}, \mathbf{w})^{\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}}$. Using a parametric model *p*(*σ*|**x**, **v**), according to (9.8) the test-to-training ratio $\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\lambda)}$ can be expressed as

$$\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\left(\frac{1}{p(\sigma = 1|\mathbf{x}; \mathbf{v})} - 1\right).$$

Therefore, we define the likelihood function as

$$P(X^{tr}|\mathbf{w}, \mathbf{v}) = \prod_{i=1}^{n_{tr}} p(y_i|\mathbf{x}_i; \mathbf{w})^{\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\left(\frac{1}{p(\sigma_i = 1|\mathbf{x}_i; \mathbf{v})} - 1\right)}. \qquad (9.17)$$

As an immediate corollary of Manski and Lerman [1977], the likelihood function of (9.17) has the property that when the true value **v*** is given, its maximizer over **w** is a consistent estimator of the true parameter **w*** that has produced labels for the test data under the test distribution *θ*. That is, as the sample grows, the maximizer of (9.17) converges in probability to the true value **w*** of parameter **w**.

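The likelihood function *P*(*X*^{tr}, *X*^{te}|**v**) resolves to *P*(*σ*_{i} = 1|**x**_{i}; **v**) for all training instances and *P*(*σ*_{i} = −1|**x**_{i}; **v**) for all test instances:

$$P(X^{tr}, X^{te}|\mathbf{v}) = \prod_{i=1}^{n_{tr}} P(\sigma_i = 1|\mathbf{x}_i; \mathbf{v}) \prod_{i=n_{tr}+1}^{n_{tr}+n_{te}} P(\sigma_i = -1|\mathbf{x}_i; \mathbf{v}). \qquad (9.18)$$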

Equation (9.19) summarizes (9.13) through (9.18). Equation (9.20) inserts the likelihood models (9.17) and (9.18) and draws constants *p*(*σ* = 1|**v**) and *p*(*σ* = −1|**v**) out of the product.

$$\begin{aligned}
p(\mathbf{w}, \mathbf{v}|X^{tr}, X^{te}) &\propto P(X^{tr}|\mathbf{w}, \mathbf{v})\, P(X^{tr}, X^{te}|\mathbf{v})\, p(\mathbf{w})\, p(\mathbf{v}) && (9.19)\\
&= \left(\prod_{i=1}^{n_{tr}} p(y_i|\mathbf{x}_i; \mathbf{w})^{\frac{1}{p(\sigma_i = 1|\mathbf{x}_i; \mathbf{v})} - 1}\right)^{\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}} \\
&\quad \left(\prod_{i=1}^{n_{tr}} p(\sigma_i = 1|\mathbf{x}_i; \mathbf{v}) \prod_{i=n_{tr}+1}^{n_{tr}+n_{te}} p(\sigma_i = -1|\mathbf{x}_i; \mathbf{v})\right) p(\mathbf{w})\, p(\mathbf{v}). && (9.20)
\end{aligned}$$

Out of curiosity, let us briefly consider the extreme case of disjoint training and test distributions, i.e., *p*(**x**|*θ*)*p*(**x**|λ) = 0 for all **x**. In this case, the second factor is maximized by a **v** that assigns *p*(*σ* = 1|**x**; **v**) = 1 for all elements of *X*^{tr} (subject to a possible regularization imposed by *p*(**v**)). Hence, every exponent $\frac{1}{p(\sigma_i = 1|\mathbf{x}_i; \mathbf{v})} - 1$ vanishes, and the likelihood of the training data, $\prod_{i=1}^{n_{tr}} p(y_i|\mathbf{x}_i; \mathbf{w})^{\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\left(\frac{1}{p(\sigma_i = 1|\mathbf{x}_i; \mathbf{v})} - 1\right)}$, equals 1 for all possible classifiers **w**. The choice of the classifier **w** is thus determined solely by the inductive bias *p*(**w**). This result makes perfect sense because the training sample contains no information about the test distribution.

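Using a logistic model $p(\sigma = 1|\mathbf{x}, \mathbf{v}) = \frac{1}{1 + \exp(-\mathbf{v}^{\top}\mathbf{x})}$, we notice that (9.8) can be simplified as in (9.21):

$$\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\left(\frac{1}{p(\sigma = 1|\mathbf{x}; \mathbf{v})} - 1\right) = \frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\exp(-\mathbf{v}^{\top}\mathbf{x}). \qquad (9.21)$$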

Optimization problem 9.3 is derived from (9.20) in logarithmic form, using linear models $\mathbf{v}^{\top}\mathbf{x}_i$ and $\mathbf{w}^{\top}\mathbf{x}_i$ and a logistic model for *p*(*σ* = 1|**x**, **v**). Negative log-likelihoods are abbreviated *ℓ*_{w}(*y*_{i}**w**^{⊤}**x**_{i}) = −log *p*(*y*_{i}|**x**_{i}; **w**) and *ℓ*_{v}(*σ*_{i}**v**^{⊤}**x**_{i}) = −log *p*(*σ*_{i}|**x**_{i}; **v**), respectively; this notation emphasizes the duality between likelihoods and empirical loss functions. The regularization terms correspond to Gaussian priors on **v** and **w** with variances $s_v^2$ and $s_w^2$.

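**Optimization Problem 9.3** *Over all* **w** and **v**, *minimize*

$$\frac{p(\sigma = 1|\mathbf{v})}{p(\sigma = -1|\mathbf{v})}\sum_{i=1}^{n_{tr}} \exp(-\mathbf{v}^{\top}\mathbf{x}_i)\, \ell_w(y_i\mathbf{w}^{\top}\mathbf{x}_i) + \sum_{i=1}^{n_{tr}+n_{te}} \ell_v(\sigma_i\mathbf{v}^{\top}\mathbf{x}_i) + \frac{1}{2s_w^2}\mathbf{w}^{\top}\mathbf{w} + \frac{1}{2s_v^2}\mathbf{v}^{\top}\mathbf{v}.$$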
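As a rough sketch of how such an objective could be evaluated, the function below takes the negative logarithm of (9.20) with the logistic selector model, linear models, and Gaussian priors. The exact form is an assumption read off from those statements; `prior_odds` is a hypothetical name for the constant ratio *p*(*σ* = 1|**v**)/*p*(*σ* = −1|**v**), and all function and argument names are illustrative.

```python
import math


def logistic_loss(z):
    """log(1 + exp(-z)), written to stay stable for large |z|."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def exp_loss(z):
    return math.exp(-z)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def integrated_objective(w, v, x_train, y_train, x_test,
                         loss_w=exp_loss, prior_odds=1.0, s_w=1.0, s_v=1.0):
    """Assumed form of the joint objective over classifier w and selector v."""
    # Classification loss on the training sample, weighted by
    # prior_odds * exp(-v^T x) as a stand-in for the density ratio.
    fit = sum(prior_odds * math.exp(-dot(v, x)) * loss_w(y * dot(w, x))
              for x, y in zip(x_train, y_train))
    # Train-versus-test discrimination: sigma = +1 for training, -1 for test.
    select = sum(logistic_loss(dot(v, x)) for x in x_train)
    select += sum(logistic_loss(-dot(v, x)) for x in x_test)
    # Gaussian priors on w and v.
    reg = dot(w, w) / (2.0 * s_w ** 2) + dot(v, v) / (2.0 * s_v ** 2)
    return fit + select + reg


# Toy usage with made-up data: two training instances, one test instance.
x_train = [(1.0, 2.0), (0.5, -1.0)]
y_train = [1, -1]
x_test = [(2.0, 2.0)]
at_zero = integrated_objective([0.0, 0.0], [0.0, 0.0], x_train, y_train, x_test)
```

At **w** = **v** = 0 every training weight is 1 and every selector loss is log 2, so `at_zero` equals 2 + 3 log 2 for this toy data. Passing `loss_w=logistic_loss` switches the classifier loss to the logistic model.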

# 9.6 Primal Learning Algorithm

We derive a Newton gradient method that directly minimizes optimization problem 9.3 in the attribute space. To this end, we need to derive the gradient and the Hessian of the objective function. The update rule assumes the form of a set of linear equations that have to be solved for the update vector [**Δ**_{v}, **Δ**_{w}]^{⊤}. It depends on the current parameters [**v**, **w**]^{⊤}, all combinations of training and test data, and resulting coefficients. In order to express the update rule as a single equation in matrix form, we define

where X^{tr} and X^{te} are the matrices of training vectors, and test vectors respectively. We abbreviate

and denote the objective function of optimization problem 9.3 by

We compute the gradient with respect to **v** and **w**.

The Hessian is the matrix of second derivatives.

We can rewrite the gradient as *X***g** + *S*[**v**, **w**]^{⊤} and the Hessian as *XΛX*^{⊤} + *S* using the following definitions: **g** = [**g**^{(1)}, **g**^{(2)}, **g**^{(3)}]^{⊤}, $S=\left[\begin{array}{cc}{s}^{v}& 0\\ 0& {s}^{w}\end{array}\right]$ with

The update step for the Newton gradient descent minimization of optimization problem 9.3 is [**v**′, **w**′]^{⊤} ← [**v**, **w**]^{⊤} + [**Δ**_{v}, **Δ**_{w}]^{⊤} with

Given the parameter **w**, a test instance **x** is classified as $f(\mathbf{x}; \mathbf{w}) = \text{sign}(\mathbf{w}^{\top}\mathbf{x})$.
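For readers who want to experiment before implementing the Newton update, the following sketch minimizes the same kind of objective with plain gradient descent and step halving. This is a deliberately simpler technique swapped in for the Newton method of this section. It assumes an exponential model for the classifier loss, a logistic selector model, and an objective of the form weighted loss + selector loss + Gaussian regularizers; the constant `c`, the toy data, and all names are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def objective_and_grad(w, v, x_tr, y_tr, x_te, c=1.0, s_w=1.0, s_v=1.0):
    """Objective (exponential l_w, logistic l_v) and its gradient in w and v."""
    d = len(w)
    obj = dot(w, w) / (2 * s_w ** 2) + dot(v, v) / (2 * s_v ** 2)
    g_w = [wi / s_w ** 2 for wi in w]
    g_v = [vi / s_v ** 2 for vi in v]
    for x, y in zip(x_tr, y_tr):
        weighted = c * math.exp(-dot(v, x)) * math.exp(-y * dot(w, x))
        obj += weighted
        for k in range(d):
            g_w[k] -= weighted * y * x[k]   # from d/dw exp(-y w^T x)
            g_v[k] -= weighted * x[k]       # from d/dv exp(-v^T x)
    pooled = [(x, 1.0) for x in x_tr] + [(x, -1.0) for x in x_te]
    for x, sigma in pooled:
        z = sigma * dot(v, x)
        obj += math.log1p(math.exp(-z))     # fine for the small z used here
        slope = -1.0 / (1.0 + math.exp(z))  # derivative of log(1 + exp(-z))
        for k in range(d):
            g_v[k] += slope * sigma * x[k]
    return obj, g_w, g_v


def descend(w, v, data, steps=50):
    """Gradient descent with step halving; accepted steps always decrease the objective."""
    obj, g_w, g_v = objective_and_grad(w, v, *data)
    for _ in range(steps):
        step = 1.0
        while step > 1e-12:
            w_new = [a - step * g for a, g in zip(w, g_w)]
            v_new = [a - step * g for a, g in zip(v, g_v)]
            new_obj, new_g_w, new_g_v = objective_and_grad(w_new, v_new, *data)
            if new_obj < obj:
                w, v, obj, g_w, g_v = w_new, v_new, new_obj, new_g_w, new_g_v
                break
            step *= 0.5
        else:
            break  # no improving step found: stop early
    return w, v, obj


# Toy data: first coordinate is a constant bias feature.
x_tr = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y_tr = [-1, -1, 1, 1]
x_te = [(1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
data = (x_tr, y_tr, x_te)

initial_obj, _, _ = objective_and_grad([0.0, 0.0], [0.0, 0.0], *data)
w_fit, v_fit, final_obj = descend([0.0, 0.0], [0.0, 0.0], data)
prediction = 1 if dot(w_fit, (1.0, 3.5)) >= 0 else -1   # f(x; w) = sign(w^T x)
```

The step-halving rule only guarantees monotone decrease, not the fast local convergence of a Newton method; it is meant as a baseline for checking a Newton implementation against.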

# 9.7 Kernelized Learning Algorithm

We derive a kernelized version of the integrated classifier for differing training and test distributions. A transformation *Φ* maps instances into a target space in which a kernel function *k*(**x**_{i}, **x**_{j}) calculates the inner product $\Phi(\mathbf{x}_i)^{\top}\Phi(\mathbf{x}_j)$.

The update rule (9.38) thus becomes

Φ(*X*) is defined by

According to the representer theorem, the optimal separator is a linear combination of examples. Parameter vectors **α** and **β** in the dual space weight the influence of all examples:

Equation 9.39 can therefore be rewritten as (9.42). We now multiply Φ(*X*)^{⊤} from the left to both sides and obtain (9.43). We replace all resulting occurrences of Φ(*X*)^{⊤}Φ(*X*) by the kernel matrix *K* and arrive at (9.44); *S* is replaced by *S*′ such that $\Phi(X)^{\top} S\, \Phi(X) = S'K$, i.e., $S'_{i,i} = s_v^{-2}$ for *i* = 1..*n*_{tr} + *n*_{te} and $S'_{n_{tr}+n_{te}+i,\, n_{tr}+n_{te}+i} = s_w^{-2}$ for *i* = 1..*n*_{tr}. Equation 9.44 is satisfied when (9.45) is satisfied. Equation 9.45 is the update rule for the dual Newton gradient descent.

Given the parameters, test instance **x** is classified by $f(\mathbf{x}; \boldsymbol{\beta}) = \text{sign}\left(\sum_{i=1}^{n_{tr}} \beta_i k(\mathbf{x}, \mathbf{x}_i)\right)$.
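The kernelized decision rule is straightforward to evaluate once dual coefficients are available. The sketch below assumes an RBF kernel and uses made-up coefficients `beta`; in an actual run these would come from the dual update rule.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel; gamma is an illustrative choice."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def classify(x, beta, x_train, kernel=rbf_kernel):
    """sign(sum_i beta_i k(x, x_i)); a score of exactly zero is mapped to +1 here."""
    score = sum(b * kernel(x, xi) for b, xi in zip(beta, x_train))
    return 1 if score >= 0 else -1


x_train = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0)]
beta = [-1.0, -0.5, 0.8, 1.2]   # hypothetical dual coefficients

near_negatives = classify((0.2, 0.3), beta, x_train)
near_positives = classify((3.5, 3.0), beta, x_train)
```

Only the training instances enter the sum, so the test instances that were needed to estimate the weights can be discarded at prediction time.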

# 9.8 Convexity Analysis and Solving the Optimization Problems

The following theorem specifies the conditions for convexity of optimization problem 9.3. With this theorem we can easily check whether the integrated classifier for covariate shift is convex for specific models of the negative log-likelihood functions. The negative log-likelihood function *ℓ*_{w} itself and its first and second derivatives are needed.

**Theorem 9.4** *Optimization problem 9.3 is in general convex if* *ℓ*_{v} *is convex and*

$$\ell_w\, \ell''_w - {\ell'_w}^{2} \ge 0. \qquad (9.46)$$

**Proof:** Looking at optimization problem 9.3 we immediately see that the regularizers are convex and if *ℓ*_{v} is convex the second term is convex as well; we only need to analyze the convexity of the last term

A function is convex if the Hessian is positive semidefinite and this is the case if and only if

for all vectors **a** and Hessian *H*.

With the notation of section 9.6 the Hessian of (9.47) is

Using the condition of (9.48) the Hessian is positive semidefinite if the following matrix is positive semidefinite:

Applying (9.48) and splitting **a** into two equally sized subvectors **a** = [**a**_{1} **a**_{2}]^{⊤}, the condition for convexity is

Multiplication of (9.51) with $\ell''_{w,i}$ and adding and subtracting ${\ell'_{w,i}}^{2}\Vert\mathbf{a}_2\Vert^{2}$ leads to (9.52). Equation 9.53 holds by the binomial theorem. For $\ell''_{w,i}\mathbf{a}_1 = \ell'_{w,i} y_i \mathbf{a}_2$ the term $\Vert \ell''_{w,i}\mathbf{a}_1 - \ell'_{w,i} y_i \mathbf{a}_2 \Vert^{2}$ takes its minimum value zero; this means (9.53) is nonnegative for arbitrary **a**_{1} and **a**_{2} if (9.54) is nonnegative.

In order to check the optimization criterion (optimization problem 9.3) for convexity we need to choose models of the negative log-likelihood *ℓ*_{v} and *ℓ*_{w} and derive their first and second derivatives. These derivations are also needed to actually minimize the optimization criterion with the Newton update steps derived in the last section.

For the model of the covariate shift we use a logistic model $\ell_v(\sigma_i\mathbf{v}^{\top}\mathbf{x}) = \log\left(1 + \exp(-\sigma_i\mathbf{v}^{\top}\mathbf{x})\right)$; the abbreviations of section 9.6 can now be expanded:

For the model of the target classifier we detail the derivations for logistic and for exponential models of ℓ_{w}. For the logistic model the derivatives of ℓ_{w} are the same as for ℓ_{v}, only **v** needs to be replaced by **w** and *σ*_{i} by *y*_{i}. For an exponential model with ${\ell}_{w}\left({y}_{i}{w}^{\top}x\right)=\mathrm{exp}\left(-{y}_{i}{w}^{\top}x\right)$ the abbreviations are expanded as follows:

Using theorem 9.4 we can now easily check the convexity of the integrated classifier with logistic model and with exponential model of *ℓ*_{w}.

**Corollary 9.5** *Optimization problem 9.3 with logistic model for* *ℓ*_{w} *is nonconvex*.

**Proof:** Inserting the logistic function into (9.46) we get the following solution.

The first term of (9.57) is always positive, the difference term is always negative, thus optimization problem 9.3 with logistic model for *ℓ*_{w} is nonconvex.▪

Empirically, we find that it is a good choice to select the parameters of a regular i.i.d. logistic regression classifier as starting point for the Newton gradient search. Since i.i.d. logistic regression has a convex optimization criterion, this starting point is easily found.

One can easily show that optimization problem 9.3 is nonconvex when *ℓ*_{w} is chosen as hinge loss or quadratic loss.

**Corollary 9.6** *Optimization problem 9.3 with exponential model for* *ℓ*_{w} *is convex*.

**Proof:** *Inserting the exponential model into the above criterion results in the nonnegative expression*

This means the global optimum of optimization problem 9.3 with exponential model for *ℓ*_{w} can easily be found by Newton gradient descent.
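The two corollaries can be spot-checked numerically. The sketch assumes that the criterion of theorem 9.4 is the quantity ℓ_w ℓ_w″ − ℓ_w′², evaluated as a function of the margin *z* = *y***w**^{⊤}**x**; this is the determinant condition one obtains from the 2×2 Hessian of exp(−*a*)ℓ(*b*). The grid of margins is arbitrary.

```python
import math


def logistic_derivs(z):
    """Value, first and second derivative of l(z) = log(1 + exp(-z))."""
    s = 1.0 / (1.0 + math.exp(-z))          # sigmoid(z)
    return math.log1p(math.exp(-z)), s - 1.0, s * (1.0 - s)


def exponential_derivs(z):
    """Value, first and second derivative of l(z) = exp(-z)."""
    e = math.exp(-z)
    return e, -e, e


def criterion(derivs, z):
    """l(z) * l''(z) - l'(z)^2, the assumed convexity criterion."""
    value, first, second = derivs(z)
    return second * value - first * first


grid = [k / 4.0 for k in range(-20, 21)]    # margins from -5 to 5
logistic_values = [criterion(logistic_derivs, z) for z in grid]
exponential_values = [criterion(exponential_derivs, z) for z in grid]
```

On this grid every entry of `logistic_values` is negative, while every entry of `exponential_values` is zero, in line with corollaries 9.5 and 9.6. A numerical check on a finite grid is of course no substitute for the proofs.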

# 9.9 Empirical Results

We study the benefit of two versions of the integrated classifier for covariate shift and other reference methods on spam filtering, text classification, and land mine detection problems. The first integrated classifier uses a logistic model for *ℓ*_{w} (“integrated log model”), the second an exponential model for *ℓ*_{w} (“integrated exp model”).

The first baseline is a classifier trained under i.i.d. assumption with logistic *ℓ*_{w}. All other reference methods consist of a two-stage procedure: first, the difference between training and test distribution is estimated; the classifier is trained on weighted data in a second step. The baselines differ in the first stage; the second stage is based on a logistic regression classifier with weighted examples in any case.

The first reference method is two-stage logistic regression (“two-stage LR”). The example weights are computed according to (9.8); $p(\sigma = 1|\mathbf{x}, \mathbf{v})$ is estimated by training a logistic regression that discriminates training from test examples. The second method is kernel mean matching (chapter 8); we set $\epsilon = \sqrt{n_{tr} - 1}/\sqrt{n_{tr}}$ as proposed by the authors. In the third method, separate density estimates for *p*(**x**|λ) and *p*(**x**|*θ*) are obtained using kernel density estimation [Shimodaira, 2000]; the bandwidth of the kernel is chosen according to the rule of thumb of Silverman [1986]. We tune the regularization parameters of all the methods and the variance parameter of the RBF kernels on a separate tuning set. We use a maximum likelihood estimate of $\frac{n_{tr}}{n_{te}}$ for $\frac{p(\sigma = 1|\theta, \lambda)}{p(\sigma = -1|\theta, \lambda)}$.

We use the spam filtering data of Bickel et al. [2007]; the collection contains nine different inboxes with test emails (5270-10964 emails, depending on inbox) and one set of training emails from different sources. We use a fixed set of 1000 emails as training data. We randomly select 32-2048 emails from one of the original inboxes. We repeat this process ten times for 2048 test emails and 20-640 times for 1024-32 test emails. As tuning data we use the labeled emails from an additional inbox different from the test inboxes. The performance measure is the rate by which the 1-AUC risk is reduced over the i.i.d. baseline [Bickel and Scheffer, 2007]; it is computed as $1\text{}-\text{}\frac{1-AUC}{1-AU{C}_{iid}}$. We use linear kernels for all methods. We analyze the rank of the kernel matrix and find that it fulfills the universal kernel requirement of kernel mean matching; this is due to the high dimensionality of the data.
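The performance measure is easy to compute from classifier scores. The sketch below uses a simple pairwise AUC (ties counted as one half) and the reduction formula quoted above; the scores and labels are made up for illustration and do not reproduce any result of the chapter.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_reduction(auc_model, auc_iid):
    """Relative reduction of the 1-AUC risk over the i.i.d. baseline."""
    return 1.0 - (1.0 - auc_model) / (1.0 - auc_iid)


labels = [1, 1, 1, -1, -1]
iid_scores = [0.9, 0.4, 0.35, 0.5, 0.3]     # hypothetical baseline scores
model_scores = [0.9, 0.6, 0.35, 0.5, 0.3]   # hypothetical shift-corrected scores

auc_iid = auc(iid_scores, labels)           # 4 of 6 pairs ranked correctly
auc_model = auc(model_scores, labels)       # 5 of 6 pairs ranked correctly
reduction = risk_reduction(auc_model, auc_iid)
```

In this toy case the 1-AUC risk drops from 2/6 to 1/6, a reduction of one half. A value of zero means no change over the i.i.d. baseline, and negative values mean the method is worse than the baseline.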

Figure 9.1 (top left) shows the result for various numbers of unlabeled examples. The results for a specific number of unlabeled examples are averaged over 10-640 random test samples and averaged over all nine inboxes. Averaged over all users and inbox sizes the absolute AUC of the i.i.d. classifier is 0.994. Error bars indicate standard errors of the 1-AUC risk.

The three discriminative density estimators and kernel mean matching perform similarly well. The differences from the i.i.d. baseline are highly significant. For 1048 examples the 1-AUC risk is even reduced by an average of 30% with the integrated exponential model classifier. The kernel density estimation procedure is not able to beat the i.i.d. baseline.

We now study text classification using computer science papers from the Cora dataset. The task is to discriminate machine learning from networking papers. We select 812 papers written before 1996 from both classes as training examples and 1285 papers written after 1996 as test examples. For parameter tuning we apply an additional time split on the training data; we train on the papers written before 1995 and tune on papers written in 1995. Title and abstract are transformed into TFIDF vectors; the number of distinct words is 40,000. We again use linear kernels (rank analysis verifies the universal kernel property) and average the results over 20-640 random test samples for different sizes (1024-32) of test sets. The resulting 1-AUC risk is shown in figure 9.1 (top right). The average absolute AUC of the i.i.d. classifier is 0.998. The methods based on discriminative density estimates significantly outperform all other methods. Kernel mean matching is not displayed because its average performance lies far below the i.i.d. baseline. The integrated models reduce the 1-AUC risk by 15% for 1024 test examples; for a larger number of test examples (128-1024) they perform slightly better than the two-step decomposition.

In a third set of experiments we study the problem of detecting land mines using the data set of Xue et al. [2007]. The collection contains data from 29 minefields in different regions. Binary labels (land mine or safe ground) and nine-dimensional feature vectors extracted from radar images are provided. There are about 500 examples for each minefield. Each of the fields has a distinct distribution of input patterns, varying from highly foliated to desert areas.

We enumerate all 29 × 28 pairs of minefields, using one field as training, and the other as test data. For tuning we hold out 4 of the 812 pairs. Results are increases over the i.i.d. baseline, averaged over all 29 × 28 − 4 combinations. We use RBF kernels with variance *σ*^{2} = 0.3 for all methods. The results are displayed in figure 9.1 (bottom left). The average absolute AUC of the i.i.d. baseline is 0.64 with a standard deviation of 0.07; note that the error bars are much smaller than the absolute standard deviation because they indicate the standard error of the *differences* from the i.i.d. baseline.
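The evaluation protocol and the remark about error bars can be illustrated with a small sketch. It enumerates the ordered (training, test) pairs and computes the standard error of paired differences from a baseline. The source does not state which four pairs were held out for tuning, so the random choice below is an assumption, and the helper names are hypothetical.

```python
import itertools
import math
import random

N_FIELDS = 29

# All ordered (training field, test field) pairs with distinct fields.
pairs = list(itertools.permutations(range(N_FIELDS), 2))

# Assumption: the four tuning pairs are drawn at random with a fixed seed.
rng = random.Random(0)
tuning_pairs = set(rng.sample(pairs, 4))
eval_pairs = [p for p in pairs if p not in tuning_pairs]


def standard_error(values):
    """Standard error of the mean, using the sample variance (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(var / n)


def paired_summary(auc_method, auc_iid):
    """Mean and standard error of the per-pair differences from the baseline."""
    diffs = [m - b for m, b in zip(auc_method, auc_iid)]
    return sum(diffs) / len(diffs), standard_error(diffs)
```

Because the differences are taken pair by pair, variation in difficulty across minefield pairs cancels out. This is why the error bars can be much smaller than the standard deviation of the absolute AUC.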

For this problem, the integrated exponential model classifier and kernel mean matching significantly outperform all other methods on average. Integrated logistic regression and two-stage logistic regression are still significantly better than the i.i.d. baseline except for 32 test examples. We assume that the nonconvex integrated logistic regression is inferior to the convex integrated exponential model method because it runs into unfavorable local optima.

# 9.10 Conclusion

We derived a discriminative model for learning under differing training and test distributions. The contribution of each training instance to the optimization problem ideally needs to be weighted with its test-to-training density ratio. We showed that this ratio can be expressed, without modeling either the training or the test density, by a discriminative model that characterizes how much more likely an instance is to occur in the test sample than in the training sample.

When Bayesian model averaging is infeasible and the Bayes decision is unattainable, one can choose the joint MAP hypothesis of both the parameters of the test-to-training model and the final classifier. Optimizing these dependent parameters sequentially incurs an additional approximation compared to solving the joint optimization problem.

We derived a primal and a kernelized Newton gradient descent procedure for the joint optimization problem. Theorem 9.4 specifies the condition for the convexity of optimization problem 9.3. Checking the condition using popular loss functions as models of the negative log-likelihoods reveals that optimization problem 9.3 is only convex with exponential loss.

Empirically, we found that the models with discriminative density estimates outperform the i.i.d. baseline and the kernel density estimated model in almost all cases. For spam filtering the integrated and the two-step models perform similarly well. For land mine detection the convex integrated exponential model classifier and kernel mean matching significantly outperform all other methods.