Open Access (CC BY 4.0). Published by De Gruyter, February 8, 2023.

Adaptive normalization for IPW estimation

  • Samir Khan and Johan Ugander

Abstract

Inverse probability weighting (IPW) is a general tool in survey sampling and causal inference, used in both Horvitz–Thompson estimators, which normalize by the sample size, and Hájek/self-normalized estimators, which normalize by the sum of the inverse probability weights. In this work, we study a family of IPW estimators, first proposed by Trotter and Tukey in the context of Monte Carlo problems, that are normalized by an affine combination of the sample size and a sum of inverse weights. We show how selecting an estimator from this family in a data-dependent way to minimize asymptotic variance leads to an iterative procedure that converges to an estimator with connections to regression control methods. We refer to such estimators as adaptively normalized estimators. For mean estimation in survey sampling, the adaptively normalized estimator has asymptotic variance that is never worse than the Horvitz–Thompson and Hájek estimators. Going further, we show that adaptive normalization can be used to propose improvements of the augmented IPW (AIPW) estimator, average treatment effect (ATE) estimators, and policy learning objectives. Appealingly, these proposals preserve both the asymptotic efficiency of AIPW and the regret bounds for policy learning with IPW objectives, and deliver consistent finite sample improvements in simulations for all three of mean estimation, ATE estimation, and policy learning.

MSC 2010: 62F12

1 Introduction

Consider the problem of estimating a mean from samples that are observed with nonuniform probabilities. Formally, we want to estimate the mean μ of a set of responses Y_1, …, Y_n, but only observe Y_1 I_1, …, Y_n I_n, where I_k is an indicator of whether unit k was observed. This problem is a fundamental primitive of many problems in survey sampling, causal inference, and beyond. We focus on the case where the I_k are independent and distributed as I_k ∼ Ber(p_k), which can correspond to a randomized experiment under a Bernoulli design with known p_k or to certain observational contexts.

The standard estimators for μ are the Horvitz–Thompson and Hájek estimators [1,2], both of which are based on the idea of inverse probability weighting (IPW). To introduce these estimators, we first define

$$\hat{S} = \sum_{k=1}^{n} \frac{Y_k I_k}{p_k} \qquad \text{and} \qquad \hat{n} = \sum_{k=1}^{n} \frac{I_k}{p_k}$$

as estimates of the population total and sample size. Then, the Horvitz–Thompson and Hájek estimators are

$$\hat{\mu}_{\mathrm{HT}} = \hat{S}/n \qquad \text{and} \qquad \hat{\mu}_{\text{Hájek}} = \hat{S}/\hat{n}.$$
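In code, both estimators are a few lines. The following sketch (simulated data; the data-generating choices and variable names are ours, not the paper's) computes μ̂_HT and μ̂_Hájek under a Bernoulli design with known p_k:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
p = rng.uniform(0.2, 0.8, size=n)   # known observation probabilities p_k
Y = rng.normal(1.0, 1.0, size=n)    # responses with true mean mu = 1
I = rng.binomial(1, p)              # observation indicators I_k ~ Ber(p_k)

S_hat = np.sum(Y * I / p)           # estimate of the population total
n_hat = np.sum(I / p)               # estimate of the sample size

mu_ht = S_hat / n                   # Horvitz-Thompson estimator
mu_hajek = S_hat / n_hat            # Hajek (self-normalized) estimator
```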

These estimators have several desirable properties: μ ˆ HT is unbiased and admissible in the class of all unbiased estimators [3], while μ ˆ Hájek is approximately unbiased and often has lower variance than μ ˆ HT [4].

The idea of IPW also figures prominently in the Monte Carlo literature on importance sampling (IS), where the Horvitz–Thompson and Hájek estimators are known as the IS and self-normalized importance sampling estimators, respectively. In 1954, working in this context, Trotter and Tukey [5] briefly entertained the idea of a family of estimators that generalizes μ ˆ HT and μ ˆ Hájek , namely,

$$\hat{\mu}_{\lambda} = \frac{\hat{S}}{(1-\lambda)n + \lambda\hat{n}}, \tag{1}$$

for λ ∈ ℝ. This family contains the Horvitz–Thompson estimator as the special case λ = 0 and the Hájek estimator as the special case λ = 1. Curiously, Trotter and Tukey emphasized that λ need not be constrained to [0, 1] and that values outside that range might sometimes be useful [5].
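The family (1) is simple to implement; here is a minimal sketch (the function name and interface are ours):

```python
import numpy as np

def mu_lambda(Y, I, p, lam):
    """Trotter-Tukey family (1): normalize the estimated total by an
    affine combination of n and n_hat. lam = 0 recovers Horvitz-Thompson,
    lam = 1 recovers Hajek, and lam outside [0, 1] is also allowed."""
    n = len(Y)
    S_hat = np.sum(Y * I / p)
    n_hat = np.sum(I / p)
    return S_hat / ((1 - lam) * n + lam * n_hat)
```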

1.1 Main contributions

Although this proposal first appeared nearly 70 years ago, it has not received any significant attention in either the Monte Carlo literature or the causal inference literature. In this article, we consider this proposal in detail. In particular:

  • We propose a method for iteratively selecting data-dependent values of λ in (1) to minimize asymptotic variance. We refer to this as adaptive normalization and show in Section 3 how our proposal leads to an estimator that improves on both the Horvitz–Thompson and Hájek estimators in terms of asymptotic variance. We also study our estimator when propensities are estimated rather than known exactly and demonstrate connections between our estimator and other ideas from the causal inference literature.

  • As applications, we consider a series of problems where the Horvitz–Thompson and Hájek estimators are traditionally used as a primitive, and evaluate the use of adaptively normalized estimators instead. Specifically, we develop a novel extension of the augmented IPW (AIPW) estimator of ref. [6], new estimators for average treatment effect (ATE) estimation, and a new objective for policy learning. We show that our proposals consistently preserve desirable theoretical guarantees while improving performance in simulations.

1.2 Motivation for adaptive normalization

Why would we want to use values of λ learned from the data in (1) rather than λ = 0 or λ = 1 , especially values of λ < 0 or λ > 1 ? As a starting point, consider λ = 0 , the Horvitz–Thompson estimator, and λ = 1 , the Hájek estimator. As mentioned earlier, the Hájek estimator often has lower variance than the Horvitz–Thompson estimator. One reason for this variance reduction is that the numerator and denominator of the Hájek estimator are positively correlated, while the numerator and denominator of the Horvitz–Thompson estimator are uncorrelated.

To see the role that this correlation plays, consider a toy example with n = 10 units with responses and response probabilities

$$Y_1 = Y_2 = \cdots = Y_{10} = 1, \qquad p_1 = 10^{-5}, \qquad p_2 = \cdots = p_{10} = 0.5.$$

Then, we claim that the Horvitz–Thompson and Hájek estimators have very different behaviors on the event that Y 1 is observed. For our illustration, assume units 2–5 are observed but 6–10 are not.

For the Horvitz–Thompson estimator, if Y_1 is observed, then the numerator of the estimator increases from 8 to 8 + 1/10⁻⁵ = 8 + 10⁵ ≈ 10⁵, while the denominator, n, remains fixed at 10, so our estimate increases from 8/10 to (8 + 10⁵)/10 ≈ 10⁴. For the Hájek estimator, if Y_1 is observed, the numerator similarly increases by 10⁵, but now the denominator also increases by 10⁵, and so our estimate is less affected. The positive correlation between the numerator and denominator of the Hájek estimator provides a kind of shrinkage that serves to reduce variance.
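This toy calculation is easy to reproduce numerically (a sketch; the observation patterns are exactly the ones assumed above):

```python
import numpy as np

Y = np.ones(10)
p = np.array([1e-5] + [0.5] * 9)

# Units 2-5 observed, units 6-10 not; toggle whether unit 1 is observed.
I_without = np.array([0, 1, 1, 1, 1, 0, 0, 0, 0, 0])
I_with = I_without.copy()
I_with[0] = 1

def ht_and_hajek(I):
    S_hat = np.sum(Y * I / p)
    n_hat = np.sum(I / p)
    return S_hat / len(Y), S_hat / n_hat

ht0, hajek0 = ht_and_hajek(I_without)   # (0.8, 1.0)
ht1, hajek1 = ht_and_hajek(I_with)      # (10000.8, 1.0)
```

The Horvitz–Thompson estimate jumps by four orders of magnitude when unit 1 is observed, while the Hájek estimate does not move.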

Since the covariance between numerator and denominator in the Hájek estimator is one way that it reduces variance, we would expect that we can further reduce variance by introducing more covariance, which motivates taking λ > 1 in (1). By increasing the weight on the random factor of n ˆ in the denominator of (1), we increase the covariance between the numerator and the denominator, and thus reduce variance. Put crudely, if going from λ = 0 to λ = 1 reduces variance, couldn’t going from λ = 1 to λ = 2 reduce it even more?

Of course, there is a bias-variance trade-off. Increasing λ decreases the variance of μ ˆ λ , but increasing λ to be arbitrarily large also shrinks our estimates toward 0 and increases bias. Thus, the problem becomes one of choosing a value of λ that provides the correct amount of shrinkage for a given problem. This trade-off is visualized in Figure 1, which shows the mean squared error (MSE) of (1) for a simulated problem at a range of values of λ . Ideally, we would choose λ to minimize the curve shown in the left panel of this figure. Unfortunately, we cannot compute the exact MSE of μ ˆ λ , and even if we could, choosing λ to minimize the MSE would lead to an asymptotically biased estimator, complicating issues of inference. (This is analogous to the problem of constructing confidence intervals in nonparametric regression after choosing an MSE-optimal bandwidth [7,8].) In this work, we instead take the approach of showing that μ ˆ λ is asymptotically normal and then choose λ to iteratively minimize estimates of the asymptotic variance.

Figure 1

The MSE, variance, and bias of μ̂_λ as a function of λ in an instance of the normal model described in (18) with n = 250, μ = 1, and ρ = 0.2. Note that neither the Horvitz–Thompson (λ = 0) nor the Hájek (λ = 1) estimator minimizes the MSE, and that the MSE-optimal choice of λ is greater than 1. The variance shows essentially the same behavior as the MSE because the squared bias is much smaller than the variance, and so a variance-minimizing choice of λ is nearly the same as an MSE-minimizing choice of λ in this example.

This approach will also nearly minimize the MSE of μ̂_λ as long as the bias of μ̂_λ is not too large, in which case the variance and MSE will be almost equal; this is indeed the case in the setting of Figure 1. In general, the bias of μ̂_λ is of order 1/n and should not be a concern at reasonably large sample sizes. However, if the sample size is small, there is a possibility that choosing λ according to our scheme will be unfavorable in a root-mean-squared-error (RMSE) sense because it does not account for bias. A second-order Taylor expansion of μ̂_λ shows that the bias is controlled by the parameters T and π defined in (2), and these parameters are large when p_k is likely to be close to zero. Thus, in a setting with small sample sizes and many p_k close to zero (i.e., with poor overlap), practitioners should be aware that our proposed estimators may offer improvements only in variance, and not necessarily in MSE. We present a further simulation study of the bias of μ̂_λ in Appendix B.

Finally, we note that other procedures for selecting λ based on different criteria (e.g., minimizing upper bounds of the MSE) would lead to other estimators, suggesting a general class of adaptively normalized estimators that may be worthy of further study.

1.3 Related work

The present work is most closely related to the literature on variance reduction in IS. Since the proposal of IS in ref. [9], there have been a variety of proposals for variance reduction methods, many of which are surveyed in ref. [10]. The one most relevant to our work is the method of control variates, also known as difference estimation in the survey sampling community [4]. We show in Section 3 that the estimator we derive is algebraically equivalent to a particular kind of control variate estimator, and although there are multiple known interpretations of control variates [11], this adaptive-normalization interpretation appears to be new. A slightly modified form of this estimator has also been briefly discussed by ref. [12] and studied in Monte Carlo simulations by ref. [13]; our work differs from theirs in focusing on the missing data model and in the development of new causal inference applications. In addition, we study this estimator in the case where the propensities p_k are estimated rather than known, an issue that is unique to the causal inference context relative to earlier Monte Carlo studies.

Our proposed modifications of the AIPW estimator stand alongside many other proposals: for example, refs [14,15] use an empirical likelihood approach, while ref. [16] uses a weighting approach, among others. We do not, however, know of any alternative derivation that arrives at the estimator we derive in this context. Similarly, our policy learning proposals build directly on the work of ref. [17], which is itself closely related to work on counterfactual risk minimization in the bandit setting [18], and was extended by ref. [19] using ideas from AIPW estimation.

Finally, our work is also related to the literature on limited overlap in two ways. First, one strand of the limited overlap literature aims to reduce variance in the presence of small propensities, often by trimming units with extreme propensities [20,21]. Our proposed methods can offer an alternative form of variance reduction by exploiting correlations between these extreme propensities and the responses, and this approach also has the advantage of not changing the estimand as trimming-based approaches do. Second, another line of work in the limited overlap literature studies IPW estimators in a regime where the propensities are allowed to converge to 0 or 1 asymptotically [22,23]. We do not consider such a regime in this work, but extending our results to this setting is an interesting future direction.

2 Problem formulation and notation

In this section, we formally specify our model and introduce notation. We assume that pairs (Y_1, p_1), …, (Y_n, p_n) are drawn i.i.d. from a super-population distribution D on ℝ × [0, 1] and that we observe both p_1, …, p_n and Y_1 I_1, …, Y_n I_n, where the I_k are independent with I_k | p_k ∼ Ber(p_k).

This model differs from previous design-based work on asymptotics of the Horvitz–Thompson and Hájek estimators in assuming a super-population model for the Y k rather than modeling Y k as a sequence of fixed finite populations [24,25]. However, we believe that our results are still true in a finite population model, with slightly modified proofs, and we choose to work in a super-population model mainly to avoid making many cumbersome assumptions about the existence of limiting moments of the Y k . In particular, although we are working in a super-population model, we make no parametric assumptions on D , such as assuming a regression model.

Our assumption that the p k are random as well is best understood in the context of a more general model that we consider in Sections 4.1 and 4.3. There, we consider pairs ( Y 1 , X 1 ) , , ( Y n , X n ) drawn i.i.d. from a super-population distribution D on R × X , where the Y k are responses and the X k are known covariates. Then, we can think of the treatment probability p k as a function p k = p ( X k ) of the observed covariates.

We assume for several of our theoretical results in Section 3 that the p k (or equivalently the map p ( ) ) are known exactly or estimated from a well-specified logistic regression model. This is mainly to facilitate asymptotic comparisons, since the asymptotics of IPW estimators with estimated propensities are highly dependent on the method of estimating the propensity. A very specific choice of propensity score estimator has been shown to lead to efficient estimators [26], but we choose to study more general methods of propensity estimation and efficiency under appropriate rate and consistency conditions in the context of AIPW estimation in Section 4.

We now specify our assumptions on D .

Assumption 1

(Boundedness and overlap) There exist constants M, δ > 0 such that |Y_k| ≤ M and δ ≤ p_k ≤ 1 − δ almost surely.

Under Assumption 1, the moments

$$\mu = E[Y_1], \qquad \pi = E\left[\frac{1-p_1}{p_1}\right], \qquad T = E\left[Y_1 \frac{1-p_1}{p_1}\right] \tag{2}$$

of D are all finite.

Throughout the next section, our goal is to estimate μ = E [ Y k ] in the model presented here. In Section 4, we consider more diverse models and goals.

3 Adaptive normalization

In this section, we propose and analyze a novel procedure for selecting a data-dependent value of λ and describe properties of the adaptively normalized IPW estimator that follows. We establish that the resulting estimator has smaller asymptotic variance than either the Horvitz–Thompson estimator or Hájek estimator.

3.1 The optimal choice of λ

Before we can understand how to choose λ from the data, we must first understand the behavior of μ̂_λ for a fixed λ, as a function of λ and the problem parameters. To this end, we contribute the following central limit theorem, which is a straightforward generalization of known results on the bias and variance of the Hájek estimator. We defer the proof of this result, as well as those of all other results in this section, to Appendix A.

Theorem 1

Suppose Assumption 1 holds. Then, for any fixed λ ∈ ℝ, we have the CLT

$$\sqrt{n}\,(\hat{\mu}_{\lambda} - \mu) \xrightarrow{d} N(0, \sigma^2(\lambda)), \qquad \sigma^2(\lambda) = \mathrm{var}(Y_1) + E\left[\frac{1-p_1}{p_1}(Y_1 - \lambda\mu)^2\right]. \tag{3}$$

With this result in hand, we now choose λ to minimize σ²(λ). Moving forward, in Sections 3.1 and 3.2, we assume that μ ≠ 0, since if μ = 0, then σ²(λ) does not actually depend on λ, and minimizing over λ is no longer meaningful. However, from Section 3.3 onward, when we study the properties of our estimator, we do not require μ ≠ 0.

Under the assumption that μ ≠ 0, there is a unique value λ* of λ that minimizes σ²(λ) in (3), given by

$$\lambda^* = \frac{E\left[\frac{1-p_1}{p_1} Y_1\right]}{\mu\, E\left[\frac{1-p_1}{p_1}\right]} = \frac{T}{\pi\mu}, \tag{4}$$

where π and T are the moments defined in (2). Assumption 1 precludes the possibility that π = 0.

We can interpret (4) to shed light on the role λ* plays. If the Y_k and p_k are positively correlated, then the Y_k and (1 − p_k)/p_k are negatively correlated, so T < μπ and λ* < 1. Similarly, if the Y_k and p_k are negatively correlated, we obtain λ* > 1. This interpretation of (4) extends the conventional wisdom that the Hájek estimator is preferable to the Horvitz–Thompson estimator when the Y_k and p_k are negatively correlated [4].
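As a sanity check on (4), one can verify numerically that λ* = T/(πμ) minimizes σ²(λ). The sketch below uses a hypothetical three-point distribution for (Y, p); the values are ours, chosen only for illustration:

```python
import numpy as np

# Hypothetical discrete distribution D: (Y, p) takes each column with prob 1/3.
Y_vals = np.array([0.5, 1.0, 1.5])
p_vals = np.array([0.3, 0.5, 0.7])
w = np.full(3, 1 / 3)

mu = np.sum(w * Y_vals)
pi = np.sum(w * (1 - p_vals) / p_vals)
T = np.sum(w * (1 - p_vals) / p_vals * Y_vals)

def sigma2(lam):
    """Asymptotic variance sigma^2(lambda) from (3)."""
    return (np.sum(w * (Y_vals - mu) ** 2)
            + np.sum(w * (1 - p_vals) / p_vals * (Y_vals - lam * mu) ** 2))

lam_star = T / (pi * mu)                 # closed-form minimizer (4)
grid = np.linspace(-2, 4, 2001)
lam_grid = grid[np.argmin([sigma2(l) for l in grid])]
```

The grid minimizer matches the closed form, and σ²(λ*) is below both σ²(0) (Horvitz–Thompson) and σ²(1) (Hájek).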

3.2 Estimating λ from the data

On the basis of the aforementioned asymptotic analysis, we would seem to prefer μ̂_{λ*} over μ̂_HT and μ̂_Hájek. However, we cannot use μ̂_{λ*} directly as an estimator of μ because the prescribed choice of λ depends on unknown moments of D, including the very mean μ we are trying to estimate. What happens if we estimate λ* from the data?

Our expression (4) for λ* depends on three moments of D, namely, T, π, and μ. The first two can readily be estimated from the data by the IPW-style estimators

$$\hat{T} = \frac{1}{n}\sum_{k=1}^{n} \frac{1-p_k}{p_k} \cdot \frac{Y_k I_k}{p_k}, \qquad \hat{\pi} = \frac{1}{n}\sum_{k=1}^{n} \frac{1-p_k}{p_k} \cdot \frac{I_k}{p_k}, \tag{5}$$

and we already have the estimator μ ˆ HT of μ .
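A direct implementation of (5) is straightforward (a sketch; the function name is ours):

```python
import numpy as np

def moment_estimates(Y, I, p):
    """IPW-style estimates (5) of T = E[Y(1-p)/p] and pi = E[(1-p)/p]."""
    T_hat = np.mean((1 - p) / p * Y * I / p)
    pi_hat = np.mean((1 - p) / p * I / p)
    return T_hat, pi_hat
```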

Before moving forward, we address two possible questions that might be raised by (5): Can we also use an adaptively normalized estimator when estimating T and π, rather than a standard IPW estimator? And why do we inverse-propensity weight our estimate of π, rather than using the sample mean of the (1 − p_k)/p_k? For the first question, any consistent estimators of T and π will suffice for our theoretical results, and we focus on T̂ and π̂ as defined in (5) for simplicity. In practice, for some estimators of T and π with lower asymptotic variance, the asymptotic variance is not a good approximation of the finite sample behavior unless n is extremely large, which worsens performance at moderate sample sizes. Similarly, we could use other estimators of μ that concentrate around μ, and we focus on μ̂_HT for simplicity. For the second question, the reason we estimate π with an IPW estimator is that, to estimate λ, we will use expressions of the form T̂/π̂, and estimating π by IPW introduces correlation between the numerator and denominator of this expression, thus reducing the variance of our estimate of λ.

Now, the estimators in (5) lead us to estimate λ and μ by

$$\hat{\lambda} = \frac{\hat{T}}{\hat{\pi}\,\hat{\mu}_{\mathrm{HT}}}, \qquad \hat{\mu}_{\hat{\lambda}} = \frac{\hat{S}}{(1-\hat{\lambda})n + \hat{\lambda}\hat{n}}. \tag{6}$$

At this point, however, we can make a useful observation: we expect that μ̂_{λ̂} is a better estimator of μ than μ̂_HT, so should we not use μ̂_{λ̂} rather than μ̂_HT when estimating λ? In fact, every time we obtain a better estimate of μ, we can use it to obtain a better estimate of λ. Conversely, a better estimate of λ will, because λ* is the optimal amount of normalization, lead to a better estimate of μ. Combining these ideas leads to the following alternating scheme.

Formally, we construct a sequence of estimators ( λ ˆ ( t ) , μ ˆ ( t ) ) initialized at ( λ ˆ ( 0 ) , μ ˆ ( 0 ) ) = ( 0 , μ ˆ HT ) and defined for t > 0 by the recursions

$$\hat{\lambda}^{(t)} = \frac{\hat{T}}{\hat{\pi}\,\hat{\mu}^{(t-1)}}, \qquad \hat{\mu}^{(t)} = \frac{\hat{S}}{(1-\hat{\lambda}^{(t)})n + \hat{\lambda}^{(t)}\hat{n}}. \tag{7}$$

The first equation in (7) corresponds to estimating λ using μ̂^(t−1) as an estimate of μ, while the second corresponds to estimating μ using λ̂^(t) as an estimate of λ. There are two possible stable limiting behaviors for this sequence: the first is to trivially have μ̂^(t) → 0 and λ̂^(t) → ∞, while the second is to converge to a fixed point at a pair (μ̂_AN, λ̂_AN) satisfying

$$\hat{\lambda}_{\mathrm{AN}} = \frac{\hat{T}}{\hat{\pi}\,\hat{\mu}_{\mathrm{AN}}}, \qquad \hat{\mu}_{\mathrm{AN}} = \frac{\hat{S}}{(1-\hat{\lambda}_{\mathrm{AN}})n + \hat{\lambda}_{\mathrm{AN}}\hat{n}}.$$

This system of equations has the unique solution

$$\hat{\mu}_{\mathrm{AN}} = \frac{\hat{S}}{n} + \frac{\hat{T}}{\hat{\pi}}\left(1 - \frac{\hat{n}}{n}\right). \tag{8}$$
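As a quick numerical illustration (a sketch on simulated data, not code from the paper), one can run the recursion (7) and check that it agrees with the closed form (8):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
p = rng.uniform(0.2, 0.8, size=n)
Y = rng.normal(1.0, 1.0, size=n)
I = rng.binomial(1, p)

S_hat = np.sum(Y * I / p)
n_hat = np.sum(I / p)
T_hat = np.mean((1 - p) / p * Y * I / p)
pi_hat = np.mean((1 - p) / p * I / p)

# Alternating scheme (7), initialized at the Horvitz-Thompson estimate.
mu = S_hat / n
for _ in range(50):
    lam = T_hat / (pi_hat * mu)                   # update lambda given mu
    mu = S_hat / ((1 - lam) * n + lam * n_hat)    # update mu given lambda

# Closed-form fixed point (8).
mu_an = S_hat / n + T_hat / pi_hat * (1 - n_hat / n)
```

On well-behaved data like this, the iterates converge rapidly to the closed-form solution.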

The following theorem formally establishes that the iterations (7) converge to the nontrivial solution, the fixed point at (8).

Theorem 2

Suppose Assumption 1 holds and μ ≠ 0, and consider the sequence of estimators (λ̂^(t), μ̂^(t)) initialized at λ̂^(0) = 0, μ̂^(0) = μ̂_HT and defined for t > 0 by the recursion (7). Then

  1. the sequence μ̂^(t) converges as t → ∞ to an estimator μ̂_lim;

  2. the estimator μ̂_lim satisfies

    $$\lim_{n\to\infty} P(\hat{\mu}_{\mathrm{lim}} = \hat{\mu}_{\mathrm{AN}}) = 1,$$

    so that μ̂_lim − μ̂_AN converges in probability to 0.

Thus, our attempts to develop an estimator of μ by estimating the optimal normalization parameter λ culminate in the estimator μ ˆ AN , which we refer to as the adaptively normalized IPW estimator. The role of Theorem 2 and the iterative scheme we have presented is to make explicit the connection between adaptive normalization and the final estimator μ ˆ AN . The fixed-point equations alone do not characterize this connection, since, based on those equations alone, we may be led to take μ ˆ = 0 and think of this as using an infinite value of λ . Thus, this theorem fully develops the proposal of ref. [5] into a usable estimator, providing the first complete analysis of IPW estimators with normalizations other than n or n ˆ .

Before proceeding, we note a very attractive property of μ ˆ AN in (8): its simplicity. It is, for practical purposes, the Horvitz–Thompson estimator with a correction term and can replace the Horvitz–Thompson and Hájek estimators in essentially any application where they are used, with minimal additional computation. Beyond this simplicity, we will see in Section 3.3 that μ ˆ AN typically has lower asymptotic variance than either the Horvitz–Thompson estimator or the Hájek estimator, and in Section 3.4 that it is closely related to several existing ideas in the literature.

3.2.1 An optimization perspective

It may seem roundabout to first minimize the asymptotic variance and then run an iterative scheme. We can also derive μ̂_AN more directly by framing the problem of selecting λ as a joint minimization, over μ and λ, of the asymptotic variance. Minimizing the asymptotic variance σ²(λ) in (3) is equivalent to minimizing

$$-2\lambda\mu\, E\left[\frac{1-p_k}{p_k} Y_k\right] + \lambda^2\mu^2\, E\left[\frac{1-p_k}{p_k}\right] = -2\lambda\mu T + \lambda^2\mu^2\pi.$$

As mentioned earlier, we use T ˆ and π ˆ defined in (5) to estimate T and π , but now we estimate μ by μ ˆ λ directly. This leads to the nonconvex optimization problem:

$$\min_{\lambda}\; -2\lambda\hat{\mu}_{\lambda}\hat{T} + \lambda^2\hat{\mu}_{\lambda}^2\hat{\pi}.$$

Rather than directly solving this optimization problem, we consider the equivalent problem

$$\min_{\lambda,\hat{\mu}}\; -2\lambda\hat{\mu}\hat{T} + \lambda^2\hat{\mu}^2\hat{\pi}, \qquad \text{s.t.}\quad \hat{\mu} = \frac{\hat{S}}{(1-\lambda)n + \lambda\hat{n}}. \tag{9}$$

This problem is also nonconvex, but it is more amenable to analysis. In particular, we claim that, despite its nonconvexity, (9) can be easily solved analytically to find a unique optimum.

To solve (9), note that the objective is a quadratic function of λ μ ˆ , and so checking first-order stationarity conditions shows that the unconstrained minimum is achieved for any pair ( λ , μ ˆ ) satisfying λ μ ˆ = T ˆ / π ˆ . Then, direct algebra yields that there is a unique pair ( λ ˆ opt , μ ˆ opt ) satisfying both λ ˆ opt μ ˆ opt = T ˆ / π ˆ as well as the constraint in (9), and the resulting μ ˆ opt is precisely μ ˆ AN .

Finally, we note that it is possible to relax the constraint in (9) to merely require that μ̂ is unbiased, rather than that it has a particular functional form. The resulting problem is still nonconvex, but solving a relaxation or upper bound of it may lead to other interesting estimators.

3.3 Properties of adaptive normalization

We now study the asymptotic behavior of μ ˆ AN , and show that its asymptotic variance generically improves on the asymptotic variance of the Hájek and Horvitz–Thompson estimators.

3.3.1 Asymptotic variance

To understand the asymptotics of

$$\hat{\mu}_{\mathrm{AN}} = \frac{\hat{S}}{n} + \frac{\hat{T}}{\hat{\pi}}\left(1 - \frac{\hat{n}}{n}\right),$$

we use the fact that T ˆ / π ˆ is consistent for T / π and then apply a CLT to the remaining terms. This argument produces the following result.

Theorem 3

Under Assumption 1, and regardless of whether μ = 0 , the estimator μ ˆ AN satisfies the CLT

$$\sqrt{n}\,(\hat{\mu}_{\mathrm{AN}} - \mu) \xrightarrow{d} N(0, \sigma^2(\lambda^*)), \tag{10}$$

where σ²(·) is the asymptotic variance of μ̂_λ from (3). Furthermore, the asymptotic variance in (10) is never larger than the asymptotic variances of μ̂_HT and μ̂_Hájek and is strictly smaller unless T = μπ or T = 0.

We defer the proof of the CLT to Appendix A, but note here that the variance comparison follows from the fact that λ* is the minimizer of σ²(λ). In particular, unless λ* = 1, which corresponds to T = μπ, or λ* = 0, which corresponds to T = 0, σ²(λ*) is smaller than both σ²(1) and σ²(0). Thus, μ̂_AN improves on μ̂_HT and μ̂_Hájek unless they were already using the optimal value of λ.

We also note that, since Theorem 3 does not require μ ≠ 0, there should be no concerns about using the adaptively normalized estimator in problems where the true mean may be zero. This assumption was required only for connecting the iterations (7) to μ̂_AN; the favorable variance properties of μ̂_AN do not rely on it. Thus, even in problems where the true mean may be zero, we recommend using μ̂_AN, since it is asymptotically equivalent to the other IPW estimators if the true mean is zero, and has lower asymptotic variance if it is not.
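A small Monte Carlo experiment illustrates Theorem 3. The sketch below (our simulation design, not one from the paper) draws responses positively correlated with the propensities, so that λ* lies strictly between 0 and 1, and compares the empirical variances of the three estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 2000
est = {"HT": [], "Hajek": [], "AN": []}

for _ in range(reps):
    p = rng.uniform(0.2, 0.8, size=n)
    Y = p + rng.normal(0.0, 0.1, size=n)   # Y positively correlated with p
    I = rng.binomial(1, p)
    S_hat, n_hat = np.sum(Y * I / p), np.sum(I / p)
    T_hat = np.mean((1 - p) / p * Y * I / p)
    pi_hat = np.mean((1 - p) / p * I / p)
    est["HT"].append(S_hat / n)
    est["Hajek"].append(S_hat / n_hat)
    est["AN"].append(S_hat / n + T_hat / pi_hat * (1 - n_hat / n))

var = {k: np.var(v) for k, v in est.items()}
```

In this design the adaptively normalized estimator has markedly smaller empirical variance than Horvitz–Thompson and an improvement over Hájek as well, consistent with the asymptotic comparison.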

3.3.2 Finite sample variance

Examining the form of μ̂_AN directly, we can see that if it has lower variance than μ̂_HT in finite samples, it must be because the correction term (T̂/π̂)(1 − n̂/n) is negatively correlated with the first term, Ŝ/n = μ̂_HT. However, by using the iterative scheme introduced in (7), we can give an alternative explanation for why the finite sample variance of μ̂_AN is smaller than that of μ̂_HT, one that builds on the result of Theorem 2.

To make this argument, we first combine the two update equations in (7) into the single equation

$$\hat{\mu}^{(t)} = \frac{\hat{S}}{\left(1 - \frac{\hat{T}}{\hat{\pi}\hat{\mu}^{(t-1)}}\right)n + \frac{\hat{T}}{\hat{\pi}\hat{\mu}^{(t-1)}}\,\hat{n}} = \frac{\hat{S}/n}{1 - \frac{\hat{T}}{\hat{\pi}\hat{\mu}^{(t-1)}}\left(1 - \frac{\hat{n}}{n}\right)}.$$

We can write this more succinctly as μ ˆ ( t ) = f ( μ ˆ ( t 1 ) ) , where

$$f(x) = \frac{ax}{x-b}, \qquad a = \frac{\hat{S}}{n}, \qquad b = \frac{\hat{T}}{\hat{\pi}}\left(1 - \frac{\hat{n}}{n}\right).$$

In this notation, the fixed point of f is at x = a + b = μ̂_AN. Rather than working with f directly, we will work with the two-step map

$$g(x) = f(f(x)) = \frac{a^2 x}{x(a-b) + b^2}$$

and the subsequence μ̂^(0), μ̂^(2), ….
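For intuition, the fixed-point behavior of f and g is easy to see numerically. The constants a and b below are hypothetical values chosen for illustration, not quantities from the paper:

```python
# Iterating the two-step map g = f o f, with f(x) = a*x / (x - b),
# converges to the fixed point x = a + b (a, b are illustrative).
def f(x, a, b):
    return a * x / (x - b)

def g(x, a, b):
    return f(f(x, a, b), a, b)

a, b = 1.0, 0.05   # a plays the role of S_hat/n, b of (T_hat/pi_hat)(1 - n_hat/n)
x = a              # start at mu_hat^(0) = a, the Horvitz-Thompson estimate
for _ in range(20):
    x = g(x, a, b)
# x is now (numerically) at the fixed point a + b = 1.05
```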

We will now argue informally that the iterations of g should be variance reducing. If it were the case that g′(x) ≤ 1 for all x, then applying g would certainly reduce variance, and we would conclude that μ̂_AN has smaller variance than μ̂_HT. But this is not the case: g′(x) diverges as the denominator of g approaches 0, and so g′(x) can be arbitrarily large. However, the actual sequence of iterates μ̂^(2t) lies, with high probability, in an interval on which g′(x) is bounded by 1.

In what follows, we assume again that μ ≠ 0 and that a > 2b, which we show holds with high probability by the same arguments as those in the proof of Lemma A3 of Appendix A. We also assume, without loss of generality, that both a and b are positive, but this is only for clarity. Now, in the proof of Lemma A2, we show that under these conditions, the entire sequence of iterates μ̂^(0), μ̂^(2), … lies in the interval [a, a + 2b]. By standard concentration arguments, a concentrates around μ and b concentrates around 0, so the interval [a, a + 2b] lies with high probability in a small interval centered at μ. On the other hand, we can check by direct computation that g′(x) ≤ 1 for x outside the interval (−3b, b). Again by concentration arguments, for large n, this interval will be concentrated around zero and thus be disjoint from the interval centered at μ.

To summarize, there exists an interval centered around μ such that, with high probability, μ̂^(2t) lies in this interval for all t > 0 and g′(x) ≤ 1 on this interval. If the function g were fixed, this would be enough to conclude that each application of g reduces variance, and thus that μ̂_AN has smaller variance than μ̂_HT. Unfortunately, because the function g is random, this is not a rigorous argument, only an intuitive one. However, it is clear from simulations, the results of which are shown in Figure 2, that each successive μ̂^(2t) does in fact have lower variance.

Figure 2

The variance of μ ˆ ( t ) as a function of t . Although a single iteration may increase the variance, we observe that, in this simulation, every two iterations reduce variance. The data here are generated from the normal model (18) in Section 5 with n = 100 , ρ = 0.1 , and μ = 1 .

We note that an argument similar to ours appears in Section 5 of ref. [27]. In that context, when studying a generalized method of moments estimator, the authors iterate a random data-dependent estimate of an underlying true function. The underlying true function is a contraction mapping and thus reduces variance when applied to a random variable, so they argue heuristically that the iterations of the data-dependent estimated function should reduce variance as well.

In addition, the following theorem verifies the finite-sample variance reduction properties of μ ˆ AN in the special case where p k is constant.

Theorem 4

Under Assumption 1, if p_k = p is constant, then var(μ̂_AN) ≤ var(μ̂_HT).

We make two important comments on this result. First, in this special case, μ̂_AN is equivalent to both the sample mean of Y₁I₁, …, YₙIₙ and μ̂_Hájek. As such, this case is a simple model that does not fully showcase all of the subtleties of the problem, but still presents evidence for the desirable finite-sample variance properties of μ̂_AN. Second, although μ̂_AN is equivalent to the observed mean in this case, Theorem 4 differs from classical results on difference-of-means estimators for completely randomized experiments in assuming a Poisson sampling design, which, to the best of our knowledge, has not been previously considered [28].
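The equivalence noted in the first comment is easy to check numerically. The sketch below (illustrative data; the IPW forms of T̂ and π̂ are our assumption about the Section 3 estimates) verifies that with constant p the adaptively normalized estimator coincides with the Hájek estimator, i.e., the sample mean of the observed responses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 0.4
Y = rng.normal(1.0, 1.0, size=n)
I = rng.binomial(1, p, size=n)          # Bernoulli sampling with constant p

w = I / p                               # inverse probability weights
mu_ht = np.mean(Y * w)                  # Horvitz-Thompson estimator
n_hat = np.sum(w)
T_hat = np.mean(Y * ((1 - p) / p) * w)  # assumed IPW estimate of T
pi_hat = np.mean(((1 - p) / p) * w)     # assumed IPW estimate of pi
mu_an = mu_ht + (T_hat / pi_hat) * (1 - n_hat / n)

mu_hajek = np.sum(Y * I) / np.sum(I)    # Hajek = mean of the observed Y_k
print(np.isclose(mu_an, mu_hajek))      # True
```

With constant p, T̂/π̂ reduces to the observed mean, and the remaining terms cancel algebraically, so the match is exact rather than approximate.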

3.3.3 Estimated propensities

We have been assuming until now that p_k is known, or equivalently that the propensity map p(X_k) for covariates X_k is known. In practice, and especially in the context of observational data, p(·) must often be estimated. In such a setting, the results of ref. [29] imply that the semiparametric efficiency bound for estimating μ is

(11) E[ var(Y₁ ∣ X₁)/p₁ + (E[Y₁ ∣ X₁] − μ)² ] = var(Y₁) + E[ (Y₁ − E[Y₁ ∣ X₁])² (1 − p₁)/p₁ ].

Comparing (11) with the result of Theorem 1, we see that the family μ̂_λ cannot achieve the semiparametric lower bound, because the asymptotic variances of μ̂_λ involve a Y₁ − λμ term instead of a Y₁ − E[Y₁ ∣ X₁] term.

However, ref. [26] has shown that estimation of the propensities actually reduces the asymptotic variance of IPW estimators. In fact, ref. [26] also shows that when using a linear or logistic sieve estimator for the propensity score, the Horvitz–Thompson and Hájek estimators achieve the semiparametric lower bound in (11). In this work, we do not consider such nonparametric propensity estimators, and instead focus on the case of propensities estimated from a well-specified logistic model. We consider the analysis of μ̂_AN with propensities estimated from a linear or logistic sieve or other nonparametric estimators (to which the results of ref. [26] may generalize), together with a study of efficiency properties like those in ref. [26], to be an interesting future direction. For users interested in achieving semiparametric efficiency with our methods, we apply our methods to double machine learning in Section 4.1 and give a semiparametrically efficient estimator there.

Now, in the case of propensities estimated from a well-specified logistic model, it is natural to ask how this variance reduction from propensity estimation interacts with the variance reduction of μ ˆ AN , a question we answer via the following theorem comparing the asymptotic variances of μ ˆ AN and μ ˆ HT in this case.

Theorem 5

Suppose Assumption 1 holds and that the p_k are estimated from a well-specified logistic model, i.e., p_k = (1 + exp(−X_kᵀθ))⁻¹, and we estimate θ by the maximum-likelihood estimate θ̂ based on the data (X₁, I₁), …, (Xₙ, Iₙ). Let μ̂_HT,logistic and μ̂_AN,logistic be estimates of μ using θ̂. Then, these estimators satisfy the CLTs:

√n (μ̂_HT,logistic − μ) →d N(0, σ²_HT,logistic), √n (μ̂_AN,logistic − μ) →d N(0, σ²_AN,logistic)

and

(12) σ²_AN,logistic − σ²_HT,logistic = −T²/π + (2T/π) u₂ᵀ I(θ)⁻¹ u₁ − (T²/π²) u₂ᵀ I(θ)⁻¹ u₂,

where u₁ = E[Y_k (1 − p(X_kᵀθ)) X_k], u₂ = E[(1 − p(X_kᵀθ)) X_k], and I(θ) = E[(1 − p(X_kᵀθ)) p(X_kᵀθ) X_k X_kᵀ] is the Fisher information.

The value of Theorem 5 is that it highlights that some caution is necessary when choosing between μ ˆ AN and μ ˆ HT in a problem with estimated propensities, and suggests an approach to making this choice by using the right-hand side of (12) as a diagnostic quantity. That is, we can select between μ ˆ HT and μ ˆ AN by estimating the right-hand side of (12), and using μ ˆ HT if it is positive, and μ ˆ AN if it is negative. However, practically speaking, when estimating propensities and choosing between μ ˆ HT and μ ˆ AN , we should prefer neither – rather, we should use modern doubly robust methods like the AIPW estimator. We discuss these methods, and the role of adaptive normalization in them, in Section 4.1.

3.4 Connections to regression/control variate methods

In this section, we show how μ̂_AN can be understood as a control variate/regression control method. We note briefly that arguments similar to the ones presented here can be used to draw connections between μ̂_AN and the AIPW estimator of ref. [6], as well as the estimator of ref. [30].

A popular technique for variance reduction in survey sampling is the use of regression controls, and this technique is also known in the Monte Carlo literature, where regression controls are referred to as control variates. We will now show that μ ˆ AN is equivalent to a particular choice of regression control/control variate that is known in the Monte Carlo community, but does not appear to have been widely adopted in the survey sampling and causal inference communities.

We follow the discussion in ref. [10], where the author strongly recommends using the importance weights as control variates whenever they are known, based on the results of ref. [13]. In our setting, the random variables w k = I k / p k play the role of the importance weights, since they have mean 1 and re-weight the observed Y k I k to have mean μ . Then, following ref. [10], we define the family of estimators:

(13) μ̂_β = (1/n) Σ_{k=1}^n Y_k w_k − β ( (1/n) Σ_{k=1}^n w_k − 1 ).

Each estimator in this family is unbiased, so we choose β to minimize variance. The variance of μ ˆ β is

(1/n) var(Y_k w_k) − (2β/n) cov(Y_k w_k, w_k) + (β²/n) var(w_k),

which is minimized for β* = cov(Y_k w_k, w_k) / var(w_k).

Direct computation gives

cov(Y_k w_k, w_k) = E[ Y_k (1 − p_k)/p_k ] = T and var(w_k) = E[ (1 − p_k)/p_k ] = π,

connecting the importance sampling problem to the notation of our survey sampling problem given in (2). Thus, estimating the numerator and denominator of β* separately by T̂ and π̂, respectively, gives the estimator β̂ = T̂/π̂ of β*. The resulting estimator of μ becomes

μ̂_β̂ = (1/n) Σ_{k=1}^n Y_k w_k − (T̂/π̂) ( (1/n) Σ_{k=1}^n w_k − 1 ) = μ̂_HT + (T̂/π̂) (1 − n̂/n),

which is algebraically equivalent to the estimator μ ˆ AN in (8) derived via adaptive normalization. We note that the version of this estimator in refs [10,13] estimates cov ( Y k , w k ) and var ( w k ) directly rather than first simplifying, a minor difference with our version.
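The algebraic equivalence above is easy to verify numerically. The following sketch (illustrative data; the IPW forms of T̂ and π̂ are our assumption) computes both the control-variate form (13) with β = β̂ and the adaptive-normalization form, and checks that they agree.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
p = rng.uniform(0.2, 0.8, size=n)       # nonuniform observation probabilities
Y = 1.0 + p + rng.normal(size=n)        # responses correlated with p_k
I = rng.binomial(1, p)

w = I / p                               # importance weights, E[w_k] = 1
T_hat = np.mean(Y * ((1 - p) / p) * w)  # assumed IPW estimate of T
pi_hat = np.mean(((1 - p) / p) * w)     # assumed IPW estimate of pi
beta_hat = T_hat / pi_hat

mu_cv = np.mean(Y * w) - beta_hat * (np.mean(w) - 1)     # control-variate form (13)
mu_an = np.mean(Y * w) + beta_hat * (1 - np.sum(w) / n)  # adaptive-normalization form
print(np.isclose(mu_cv, mu_an))         # True
```

The two forms differ only in how the factor (1/n) Σ w_k − 1 = n̂/n − 1 is written, so the agreement is an exact identity, not a numerical coincidence.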

3.4.1 Value of adaptive normalization as a perspective

While the adaptively normalized estimator we have derived is algebraically equivalent to a known estimator in the Monte Carlo literature, we feel that our derivation, based on the simple idea of combining the denominators of μ̂_HT and μ̂_Hájek, offers a valuable and unique motivation. Furthermore, our iterative analysis of μ̂_AN provides instructive intuition for finite-sample variance reduction. Finally, we have focused on choosing λ to minimize asymptotic variance; choosing λ by minimizing a different criterion, such as mean-squared error, or choosing λ using an alternative approach such as Lepski's method, would lead to other estimators that may be worthy of further study.

Finally, despite these other guises in which the adaptively normalized estimator has appeared, it has not received significant attention in the survey sampling or causal inference community. This inattention is especially surprising when we consider that, in the Monte Carlo setting, the importance weights are often not computable in closed form, and so they cannot be used as control variates. In contrast, in the causal inference setting, the treatment probabilities are often known from the experimental design, and so μ ˆ AN is usually available as an immediate improvement over μ ˆ HT or μ ˆ Hájek . Thus, it seems that μ ˆ AN is more widely known in the community where it is of less practical value; we hope our work will remedy this gap by presenting μ ˆ AN in a causal inference context and addressing issues specific to this context, such as the estimation of propensity scores. To this end, we proceed in the next section to identify a variety of template settings in causal inference where μ ˆ HT and μ ˆ Hájek are used by default, and it is desirable to replace them with μ ˆ AN .

4 Applications

In this section, we consider various causal inference settings in which the Horvitz–Thompson and Hájek estimators are used “by default” and show how adaptively normalized estimators act as a free upgrade.

4.1 AIPW estimation

Our first application of adaptive normalization focuses on AIPW estimation. A basic version of AIPW was first introduced for survey sampling in the 1980s [31,32] and then re-discovered and significantly expanded on for ATE estimation by the causal inference community [6,33], where it is also known as the doubly robust estimator. Our approach follows the one developed in ref. [33], although we concentrate on a single group for now and defer ATE estimation to Section 4.2.

We discuss AIPW estimation in the more general model where we have access to covariate information. Specifically, we assume the pairs ( Y 1 , X 1 ) , , ( Y n , X n ) are i.i.d. from a distribution D on R × X , and we observe all of X 1 , , X n in addition to Y 1 I 1 , , Y n I n , where I k are independently Ber ( p k ) and p k = p ( X k ) is a function of the covariates. Then, the AIPW estimate of μ is expressed as follows:

(14) μ̂_AIPW = (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} μ̂^(j)(X_k) + (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} (Y_k − μ̂^(j)(X_k)) I_k / p̂^(j)(X_k),

where ℐ₁, …, ℐ_K is a partition of the data into K folds, μ̂^(j)(X_k) is an estimate of μ(X_k) = E[Y_k ∣ X_k] based on all the data except ℐ_j, and p̂^(j)(X_k) is an estimate of p(X_k) based on all the data except ℐ_j. We denote by ℐ_{−j} the data on which μ̂^(j) and p̂^(j) are trained. Thus, our estimators are cross-fit as in ref. [33]. We will require that μ̂^(j)(·) and p̂^(j)(·) satisfy the following standard conditions for each 1 ≤ j ≤ K [33]:

Assumption 2

(Consistency) As n → ∞, μ̂^(j)(·) and p̂^(j)(·) satisfy

sup_{x∈𝒳} | μ(x) − μ̂^(j)(x) | →P 0, sup_{x∈𝒳} | p̂^(j)(x) − p(x) | →P 0.

Assumption 3

(Risk decay) We have that μ̂^(j)(·) and p̂^(j)(·) satisfy

E[ (μ̂^(j)(X_k) − μ(X_k))² ∣ ℐ_{−j} ] × E[ (p̂^(j)(X_k) − p(X_k))² ∣ ℐ_{−j} ] = o_P(n⁻¹).

Under Assumptions 2 and 3, the estimator μ ˆ AIPW is semi-parametrically efficient [29], making it a natural starting point for potential improvements. Of course, this efficiency result means that any potential usage of adaptive normalization will not reduce the asymptotic variance of μ ˆ AIPW , but we will see that we can still use the ideas of adaptive normalization to develop an estimator that has the same asymptotic efficiency as μ ˆ AIPW , and also has better finite-sample MSE in simulations.

Consider the estimator in (14). The first term is an estimate of μ based on imputing all of the Y_k by μ̂^(j)(X_k), while the second term is an inverse probability weighted estimate of the bias of the μ̂^(j)(X_k). In light of our work in Section 3, we naturally propose replacing the second term with an adaptively normalized estimator, yielding the new estimator

(15) μ̂_AIPW,AN = (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} μ̂^(j)(X_k) + (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} (Y_k − μ̂^(j)(X_k)) I_k / p̂^(j)(X_k) + (1/π̂) [ (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} (Y_k − μ̂^(j)(X_k)) ((1 − p̂^(j)(X_k))/p̂^(j)(X_k)) (I_k / p̂^(j)(X_k)) ] (1 − n̂/n),

where now

π̂ = (1/n) Σ_{j=1}^K Σ_{k∈ℐ_j} ((1 − p̂^(j)(X_k))/p̂^(j)(X_k)) (I_k / p̂^(j)(X_k)), n̂ = Σ_{j=1}^K Σ_{k∈ℐ_j} I_k / p̂^(j)(X_k)

are functions of the estimated propensities, rather than of the true treatment probabilities as in Section 3.
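A minimal sketch of (14) and (15) with K = 2 folds follows; the data-generating model and the nuisance estimators (a fold-wise linear fit for μ̂^(j) and a deliberately crude constant propensity estimate for p̂^(j)) are illustrative assumptions, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-X))          # true propensities
Y = 1.0 + X + rng.normal(size=n)       # E[Y | X] = 1 + X, so mu = 1
I = rng.binomial(1, p_true)

# Two-fold cross-fitting: nuisances for fold j are trained on the other fold.
folds = np.array_split(rng.permutation(n), 2)
mu_hat = np.empty(n)
p_hat = np.empty(n)
for j, fold in enumerate(folds):
    train = np.concatenate([f for i, f in enumerate(folds) if i != j])
    obs = train[I[train] == 1]
    slope, intercept = np.polyfit(X[obs], Y[obs], 1)   # linear fit for mu(x)
    mu_hat[fold] = intercept + slope * X[fold]
    # crude constant propensity estimate, clipped away from 0 and 1
    p_hat[fold] = np.clip(np.mean(I[train]), 0.05, 0.95)

# (14): imputation term plus IPW bias-correction term
mu_aipw = np.mean(mu_hat) + np.mean((Y - mu_hat) * I / p_hat)

# (15): add the adaptive-normalization correction, built from the
# estimated propensities rather than the true treatment probabilities
pi_hat = np.mean(((1 - p_hat) / p_hat) * (I / p_hat))
T_hat = np.mean((Y - mu_hat) * ((1 - p_hat) / p_hat) * (I / p_hat))
n_hat = np.sum(I / p_hat)
mu_aipw_an = mu_aipw + (T_hat / pi_hat) * (1 - n_hat / n)
print(mu_aipw, mu_aipw_an)
```

Both estimates land near μ = 1 here, and the correction term is small, consistent with Theorem 6's claim that the two estimators coincide asymptotically.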

The following theorem shows that this correction is asymptotically negligible; the proof is presented in Appendix A.4.

Theorem 6

Suppose Assumptions 1–3 hold. Then √n (μ̂_AIPW − μ̂_AIPW,AN) →P 0.

Crucially, Theorem 6 implies that μ ˆ AIPW,AN has the same asymptotic variance as μ ˆ AIPW under the given conditions. These are exactly the conditions required for the efficiency of μ ˆ AIPW , so we conclude that μ ˆ AIPW,AN is efficient whenever μ ˆ AIPW is. On the other hand, in finite samples, the additional correction term in μ ˆ AIPW,AN is negatively correlated with the other terms, and reduces variance, a fact we demonstrate empirically in Section 5. Taken together, these observations suggest that μ ˆ AIPW,AN should typically be preferred to μ ˆ AIPW .

Our proposal is by no means the only attempt to improve on μ̂_AIPW. However, it differs from existing work in two important ways. First, many other proposals are meant to improve on AIPW when μ̂ is misspecified, which is not our motivation here, although we do explore this setting in simulation and find that μ̂_AIPW,AN handles misspecification better than μ̂_AIPW. Second, and more importantly, other proposed estimators are significantly more complex, and sometimes require solving certain estimating equations numerically. In contrast, μ̂_AIPW,AN has an explicit, simple, closed form, and computing it requires nothing more than what is required to compute μ̂_AIPW.

4.2 ATE estimation

Another setting in which IPW estimators play a key role is in the problem of ATE estimation. Our model for ATE estimation is that there are triplets

(Y₁(1), Y₁(0), p₁), …, (Yₙ(1), Yₙ(0), pₙ)

drawn i.i.d. from a distribution D on ℝ × ℝ × [0, 1] satisfying Assumption 1 for both Y_k(1) and Y_k(0). We assume that we observe Y₁(I₁), …, Yₙ(Iₙ), where I_k ∣ p_k ∼ Ber(p_k), and want to estimate the ATE τ = E[Y_k(1) − Y_k(0)] from these observations. There are a variety of general approaches to estimating τ [34]; the one relevant to our work is the approach of using IPW estimators to separately estimate μ₁ = E[Y_k(1)] and μ₀ = E[Y_k(0)], and then subtracting the two estimates.

Since the problems of estimating μ 1 and μ 0 are survey sampling problems, they are often estimated with Horvitz–Thompson and Hájek estimators. Continuing our general theme, why not use μ ˆ AN instead? Thus, we propose

(16) τ̂_AN = μ̂_{1,AN} − μ̂_{0,AN},

where μ̂_{1,AN} is the adaptively normalized estimator based on Y₁(1)I₁, …, Yₙ(1)Iₙ and μ̂_{0,AN} is the analogous estimator based on Y₁(0)(1 − I₁), …, Yₙ(0)(1 − Iₙ), with probabilities 1 − p_k.
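A sketch of τ̂_AN on simulated data follows; the data-generating design and the truncation of p_k are our illustrative choices, loosely following the normal model described in Section 5.

```python
import numpy as np

def mu_an(Y, w, q):
    """Adaptively normalized mean from responses Y, 0/1 inclusion
    indicators w, and inclusion probabilities q."""
    n = len(Y)
    mu_ht = np.mean(Y * w / q)
    n_hat = np.sum(w / q)
    T_hat = np.mean(Y * ((1 - q) / q) * (w / q))
    pi_hat = np.mean(((1 - q) / q) * (w / q))
    return mu_ht + (T_hat / pi_hat) * (1 - n_hat / n)

rng = np.random.default_rng(4)
n, tau = 1000, 0.5
X = rng.normal(size=n)
p = np.clip(1 / (1 + np.exp(-2 * X)), 0.1, 0.9)   # truncated propensities
Y1 = 1.0 + 0.3 * X + rng.normal(size=n)
Y0 = Y1 - tau
I = rng.binomial(1, p)

# (16): adaptively normalized mean in each arm, then the difference;
# the control arm uses indicators 1 - I_k and probabilities 1 - p_k
tau_an = mu_an(Y1, I, p) - mu_an(Y0, 1 - I, 1 - p)
print(tau_an)   # close to the true tau = 0.5
```

Each arm's estimator minimizes variance within that arm; the alternative of minimizing the variance of the difference directly is the approach discussed in Appendix C.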

This is not the only possible use of adaptive normalization in this setting. Note that μ ˆ 1 , AN and μ ˆ 0 , AN here are designed to minimize variance within each group, even though our target estimand is the difference between groups. It is natural to ask what happens if we instead attempt to directly minimize the variance of the estimated difference between the two groups; we discuss this approach in detail in Appendix C and compare it to τ ˆ AN from (16). We conclude that estimating the two groups separately and taking the difference is reasonable and generally advised, except in the presence of strong correlation between the responses and the propensities, in which case the second approach may be preferable.

Similarly, rather than using AIPW estimation in each group for ATE estimation, we can use adaptively normalized AIPW estimation in each group. Indeed, we will show in Section 5 that these estimators improve on both of the usual estimators in simulation.

Finally, we note that, since the adaptively normalized estimator is also a regression control estimator, τ ˆ AN corresponds to a particular form of the interaction estimator of ref. [35]. Appealingly, we derive the same estimator without any reference to an ordinary least squares model, side-stepping issues like those discussed in refs [35,36] of whether the model is correct or can only be used “agnostically.”

4.3 Policy evaluation

The final setting in which we propose replacing an IPW estimator with an adaptively normalized estimator is policy evaluation. In this context, we wish to learn a statistical rule for assigning treatments to a population that maximizes the total welfare. We follow the regret minimization framework introduced by ref. [37] and build directly on the work of ref. [17], although much of our notation is drawn from [19].

Formally, suppose that individual k has potential outcomes Y_k(1) and Y_k(0) depending on whether they receive a treatment, and we wish to learn a policy π that maps known covariates X_k ∈ 𝒳 to a treatment assignment in {0, 1}. The value of a policy π (note that we are no longer using π to represent moments of the unknown distribution as in Section 3) is V(π) = E[Y_i(π(X_i))], the average outcome of an individual treated using the policy. Assuming we are restricted to a class of policies Π, the best possible policy is π* = arg max_{π∈Π} V(π), and we evaluate a policy π based on its regret V(π*) − V(π).

Ideally we would learn a policy by maximizing V , but we cannot compute V . Instead we estimate V from historical data ( Y 1 ( I 1 ) , X 1 ) , , ( Y n ( I n ) , X n ) , where I k Ber ( p ( X k ) ) is an indicator of whether Y k received the treatment, and we assume that Assumption 1 holds. Under the assumption that the propensity map p ( ) is known, ref. [17] proposed

V̂_IPW(π) = (1/n) Σ_{k=1}^n 1{I_k = π(X_k)} Y_k / P(I_k = π(X_k) ∣ X_k)

as an unbiased estimate of V ( π ) .

Of course, at this point, we can sense that a better estimate of V would be the adaptively normalized

(17) V̂_AN(π) = V̂_IPW(π) + [ Σ_{k=1}^n Y_k ((1 − P(I_k = π(X_k) ∣ X_k))/P(I_k = π(X_k) ∣ X_k)) · (1{I_k = π(X_k)}/P(I_k = π(X_k) ∣ X_k)) / Σ_{k=1}^n ((1 − P(I_k = π(X_k) ∣ X_k))/P(I_k = π(X_k) ∣ X_k)) · (1{I_k = π(X_k)}/P(I_k = π(X_k) ∣ X_k)) ] × ( 1 − (1/n) Σ_{k=1}^n 1{I_k = π(X_k)}/P(I_k = π(X_k) ∣ X_k) ).
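The estimator (17) can be computed directly for any fixed policy. The sketch below (the data-generating design and the threshold policy are illustrative assumptions) evaluates both V̂_IPW and V̂_AN, writing e_k for P(I_k = π(X_k) ∣ X_k).

```python
import numpy as np

def v_ipw(Y, I, X, p, policy):
    """IPW estimate of the value of a deterministic policy."""
    match = (I == policy(X)).astype(float)
    e = np.where(policy(X) == 1, p, 1 - p)   # e_k = P(I_k = pi(X_k) | X_k)
    return np.mean(match * Y / e)

def v_an(Y, I, X, p, policy):
    """Adaptively normalized value estimate, in the spirit of (17)."""
    match = (I == policy(X)).astype(float)
    e = np.where(policy(X) == 1, p, 1 - p)
    num = np.sum(Y * ((1 - e) / e) * (match / e))
    den = np.sum(((1 - e) / e) * (match / e))
    return v_ipw(Y, I, X, p, policy) + (num / den) * (1 - np.mean(match / e))

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=n)
p = np.clip(1 / (1 + np.exp(-X)), 0.1, 0.9)   # treatment propensities
I = rng.binomial(1, p)
Y1 = 1.0 + X + rng.normal(size=n)             # potential outcome under treatment
Y0 = X + rng.normal(size=n)                   # potential outcome under control
Y = np.where(I == 1, Y1, Y0)                  # observed outcome Y_k(I_k)

policy = lambda x: (x > 0).astype(int)        # illustrative threshold policy
vi, va = v_ipw(Y, I, X, p, policy), v_an(Y, I, X, p, policy)
print(vi, va)
```

Learning a policy then amounts to maximizing one of these objectives over the class Π, e.g., over thresholds in the toy example above.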

The main result of ref. [17] showed that maximizing V̂_IPW(π) yields a policy whose regret decays at a rate of 1/√n. We now give an analogous result for V̂_AN.

Theorem 7

Let π̂_AN = arg max_{π∈Π} V̂_AN(π) and suppose that Π has finite VC-dimension. Then

E[ V(π*) − V(π̂_AN) ] ≤ O( (M/δ) √( VC(Π)/n ) ).

Thus, V ˆ AN preserves the theoretical guarantees associated with V ˆ IPW , and we verify in the next section that policies learned with V ˆ AN are closer to the optimal policy. One drawback of V ˆ AN , however, is that V ˆ IPW can be interpreted as a weighted classification objective [17,19], facilitating optimization, while V ˆ AN unfortunately does not have such an interpretation. Finally, we note that [19] introduced the idea of policy learning based on AIPW estimation instead of IPW estimation; applying these ideas in our setting is an exciting opportunity for future work.

5 Experiments

In this section, we use a series of experiments to evaluate the empirical performance of adaptive normalization. The goal of our experiments is to validate that applying adaptive normalization for mean estimation as well as in the various settings of Section 4 actually pays the dividends we expect.

Our first set of experiments focuses on survey sampling as discussed in Section 3 and makes use of semi-synthetic data to compare μ̂_AN to μ̂_HT and μ̂_Hájek. Our second set of experiments consists of simulations comparing ATE estimation using all of the aforementioned estimators, and also using μ̂_AIPW,AN and μ̂_AIPW. Our third set of experiments studies the coverage of confidence intervals for these estimators constructed using asymptotic normal approximations. Finally, we use a simulation experiment to compare V̂_AN to V̂_IPW as a policy learning objective. Additional experiments appear in Appendix B.

5.1 Survey experiments

We work with a dataset of Swiss municipalities, provided by the R package sampling [38] under the GPL-2 license. This data set contains demographic and financial data for 2,896 different municipalities in Switzerland, and we consider two responses: Y₁, the wooded area of a municipality, and Y₂, the industrial area of a municipality. We assume the standard sampling scheme for studies of this dataset in which the probability p_k of observing Y_k is proportional to the total area of municipality k, and set Σ_k p_k (the expected number of observations) to be either 50 or 250. We resample the set of observed municipalities according to these probabilities and compare the three estimators to the true mean of the Y_k. The RMSE of the Horvitz–Thompson estimator, Hájek estimator, and adaptively normalized estimator for the four specifications are presented in Table 1.

Table 1

RMSE of estimators on Swiss municipality data; Y 1 is wood area and Y 2 is industrial area; probabilities are chosen proportional to total municipality area, which is strongly positively correlated with Y 1 and weakly positively correlated with Y 2 . RMSEs are averaged over 100,000 trials and standard errors are over 10 replications of the 100,000 trials

Problem specification   Σ p_k = 50, Y₁   Σ p_k = 250, Y₁   Σ p_k = 50, Y₂   Σ p_k = 250, Y₂
μ̂_HT                    68.4 ± 0.1030    27.8 ± 0.0710     2.51 ± 0.0051    1.07 ± 0.0026
μ̂_Hájek                 95.3 ± 0.3587    39.3 ± 0.1510     2.52 ± 0.0076    1.06 ± 0.00244
μ̂_AN                    61.5 ± 0.1035    23.1 ± 0.0538     2.45 ± 0.0086    1.01 ± 0.0028

To better understand Table 1, we remark that Y 1 is strongly positively correlated with p k and Y 2 is weakly positively correlated with p k . Thus, in the first two columns of Table 1, the Horvitz–Thompson estimator is much better than the Hájek estimator, but not as good as the adaptively normalized estimator, in accordance with our observations about λ in Section 3. In the latter two columns, the correlation is weaker, so the Horvitz–Thompson and Hájek estimators are more comparable, but the adaptively normalized estimator still improves on both. These results demonstrate that although it may be unclear whether the Horvitz–Thompson estimator or Hájek estimator should be used in a particular problem, the adaptively normalized estimator circumvents this question by always improving on both of the other two estimators. For reference, plots of the MSE in each problem as a function of different values of λ are shown in Figure 3.

Figure 3

Top: the estimated absolute value of bias, variance, and MSE for estimators applied to data from model (18) with n = 500 , μ = 1 , and varying θ . Bottom: the same, but with data from model (19) with n = 500 and varying α . Results are averaged across 10,000 trials. We omit τ ˆ HT due to extremely large variance. Across the board, adaptively normalized estimators effectively trade off bias and variance to achieve lower MSE than their traditional counterparts.

5.2 ATE experiments

For ATE estimation, we consider two models, one of which represents a benign setting in which Y_k and p_k have a roughly linear relationship, and one of which represents a more difficult setting in which Y_k and p_k have an extremely strong negative relationship.

The first model, which we refer to as our normal model, is a model in which we generate

(18) (Y_k(1), X_k) ∼ N( (μ, 0), [ 1 θ ; θ 1 ] ),

and Y_k(0) = Y_k(1) − τ, p_k = (1 + exp(−2X_k))⁻¹. In this model, θ controls the correlation between Y_k and X_k, and indirectly the correlation between Y_k and p_k. To satisfy Assumption 1, we truncate p_k and Y_k with M = 50, δ = 0.01.

The second model we consider is our power law model, where we generate

(19) p_k ∼ Uni(ε, 1 − ε), Y_k(1) = p_k^{−α} + N(0, σ²),

and Y_k(0) = Y_k(1) − τ, X_k = log((1 − p_k)/p_k). In this model, α controls the strength of the nonlinear negative relationship between Y_k and p_k, while σ² is a noise parameter we set to 1. To satisfy Assumption 1, we take ε = 10⁻² and truncate Y_k at M = 10⁶.

In both models, we estimate p_k by logistic regression on the covariate X_k and estimate E[Y_k ∣ X_k] with a generalized additive model. For AIPW estimators, we use two-fold cross-fitting. We simulate data from both models, fixing n = 500 and τ = 0.5 while varying θ and α, and compare the biases, variances, and MSEs of various ATE estimators. The results are shown in Figure 4, omitting τ̂_HT for clarity.
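The two simulation designs can be sketched as follows (a minimal illustration; the power-law exponent −α in the response and the exact placement of the truncation steps are our reading of the model descriptions above).

```python
import numpy as np

rng = np.random.default_rng(6)
n, mu, tau, theta = 500, 1.0, 0.5, 0.5

# Normal model (18): (Y_k(1), X_k) jointly normal with correlation theta
Z = rng.multivariate_normal([mu, 0.0], [[1.0, theta], [theta, 1.0]], size=n)
Y1 = np.clip(Z[:, 0], -50, 50)                      # truncation with M = 50
X = Z[:, 1]
Y0 = Y1 - tau
p = np.clip(1 / (1 + np.exp(-2 * X)), 0.01, 0.99)   # truncation with delta = 0.01

# Power-law model (19): responses decay as a power of the propensities
alpha, sigma, eps = 1.0, 1.0, 1e-2
p_pl = rng.uniform(eps, 1 - eps, size=n)
Y1_pl = np.clip(p_pl ** (-alpha) + rng.normal(0.0, sigma, size=n), -1e6, 1e6)
X_pl = np.log((1 - p_pl) / p_pl)                    # logit-scale covariate
```

From here, any of the estimators above can be applied to (Y1, Y0, p) or (Y1_pl, p_pl) to reproduce comparisons of the kind reported in this section.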

Figure 4

Coverage of 95% confidence intervals for different estimators of the mean μ of Y k ( 1 ) in the normal model (18) when μ = 1 , θ = 0.1 , and n varies, along with empirical variances and estimated variances based on asymptotics. We see in general that μ ˆ AN and μ ˆ AIPW,AN achieve nearly the target coverage, but that μ ˆ AIPW and μ ˆ AN lag behind. Finite sample coverage issues for μ ˆ AIPW are known in the literature, and the weaker coverage of μ ˆ AN is driven by its larger bias in finite samples (as shown in Figure 4). Thus, some caution is needed when constructing confidence intervals for μ ˆ AN .

We discuss the classical and semi-parametric estimators separately. Comparing τ̂_Hájek and τ̂_AN, we see in the normal model that τ̂_AN has larger bias but smaller variance than τ̂_Hájek across all values of θ. Because the MSE is dominated by the variance, τ̂_AN also has consistently lower MSE than τ̂_Hájek, and the same behavior is found in the power law model. Comparing τ̂_AIPW and τ̂_AIPW,AN, we see in the normal model that both have nearly no bias, but that τ̂_AIPW,AN has slightly lower variance and thus lower MSE across all values of θ. We again see the same behavior in the power law model, except that the difference between τ̂_AIPW and τ̂_AIPW,AN is more substantial there. The reason for this difference is that estimating E[Y_k ∣ X_k] is more difficult in the power law model, so the more careful estimate of the bias used by τ̂_AIPW,AN pays larger dividends.

5.3 Coverage experiments

We now turn to the issue of inference, and consider the problem of constructing confidence intervals for our estimators. For μ ˆ HT , μ ˆ Hájek , and μ ˆ AN , this can be done using the asymptotic variances in Theorems 1 and 3 and an asymptotic normal approximation; for μ ˆ AIPW and μ ˆ AIPW,AN , we use the known asymptotic variance of the AIPW estimator (see, e.g., ref. [29]) and an asymptotic normal approximation. We use the same asymptotic variance for both μ ˆ AIPW and μ ˆ AIPW,AN because, as shown in Theorem 6, these two estimators have the same limiting distributions. We generate data from the model (18), and attempt to construct a 95% confidence interval for the mean μ of Y k ( 1 ) when μ = 1 , θ = 0.1 , and n varies. The results are shown in Figure A1.
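As a sketch of this Wald construction, the Python snippet below computes μ̂_AN via the decomposition μ̂_AN = Ŝ/n + (T̂/π̂)(1 − n̂/n) used in the proof of Theorem 3, and a plug-in version of the asymptotic variance var(Y_k I_k/p_k) − 2(T/π)cov(Y_k I_k/p_k, I_k/p_k) + (T/π)²var(I_k/p_k) that appears there. The data-generating distribution is a placeholder for illustration, not the paper's model (18).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
Y = rng.normal(1.0, 1.0, size=n)      # stand-in responses with mean mu = 1
p = rng.uniform(0.3, 0.9, size=n)     # known observation probabilities
I = rng.binomial(1, p)

w = I / p
S_over_n = np.mean(Y * w)             # \hat S / n
n_over_n = np.mean(w)                 # \hat n / n
ratio = np.sum(Y * (1 - p) / p * w) / np.sum((1 - p) / p * w)  # \hat T / \hat pi
mu_an = S_over_n + ratio * (1 - n_over_n)

# Plug-in asymptotic variance of sqrt(n)(mu_an - mu):
# var(Y I/p) - 2 (T/pi) cov(Y I/p, I/p) + (T/pi)^2 var(I/p)
z1, z2 = Y * w, w
sigma2 = np.var(z1) - 2 * ratio * np.cov(z1, z2, bias=True)[0, 1] + ratio**2 * np.var(z2)
half_width = 1.96 * np.sqrt(sigma2 / n)
ci = (mu_an - half_width, mu_an + half_width)   # nominal 95% interval
```

As the coverage experiments below indicate, intervals built this way can undercover in small samples because of the finite-sample bias of μ̂_AN.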

We see in the top panel of Figure A1 that the coverage of confidence intervals for μ̂_Hájek and μ̂_AIPW,AN is fairly close to the target coverage level of 95%. However, the coverage of confidence intervals for μ̂_AIPW and μ̂_AN is weaker, especially in smaller samples. Issues with the coverage of confidence intervals for the AIPW estimator are known in the literature [39,40], and we see from the bottom two panels of Figure A1 that they are caused by the empirical variance of the AIPW estimator being larger than our asymptotic approximation of it at the sample sizes considered. Since μ̂_AIPW,AN has smaller empirical variance than μ̂_AIPW, it also enjoys better coverage in finite samples. On the other hand, the asymptotic approximation of the variance of μ̂_AN appears relatively accurate, but the confidence intervals fall slightly below the target coverage level because of the larger bias of μ̂_AN, as shown in Figure 4. For this reason, practitioners should be aware that confidence intervals for μ̂_AN may slightly undercover unless the sample size is quite large. This issue could be remedied by a bias correction, which is an important direction for future work.

5.4 Policy evaluation experiments

Our final simulation compares the two policy learning objectives, V̂_IPW and V̂_AN, of the policy learning application in Section 4. Our data-generating model is inspired by the simulations of ref. [19] and sets (letting X_{k,i} be the i-th entry of X_k)

(20) $X_k \sim N(0, I_{3\times 3}), \qquad p(X_k) = \frac{1}{1 + \exp(X_{k,1})},$

with potential outcomes $Y_k(1) = X_{k,1}$ and $Y_k(0) = Y_k(1) - \operatorname{sgn}(X_{k,2} + X_{k,3})$. In general, minimizing V̂_IPW or V̂_AN is nonconvex. To obtain a tractable problem, we restrict the class Π to be the set of policies of the form $\pi(X_k) = 1\{X_{k,2} > T\}$ for $T \in [-1, 1]$. Note that this is a misspecified setting, in the sense that the optimal policy $\pi(X_k) = 1\{X_{k,2} + X_{k,3} > 0\}$ is not in the class we are optimizing over. For each of V̂_IPW and V̂_AN, we generate a sample of size n from (20) and learn a cut-off T that optimizes the corresponding objective by grid search on $T \in [-1, 1]$.
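A hedged Python sketch of this grid-search experiment follows. The form of V̂_AN below is the adaptive-normalization correction of V̂_IPW implied by the difference displayed in (A45); the sample size, grid resolution, and the sign convention in the potential outcomes are illustrative assumptions, and we select the threshold with the largest estimated value, which matches optimizing the objective up to sign conventions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 3))
p = 1 / (1 + np.exp(X[:, 0]))          # p(X_k) from (20)
I = rng.binomial(1, p)
Y1 = X[:, 0]
Y0 = Y1 - np.sign(X[:, 1] + X[:, 2])   # treatment effect sgn(X_2 + X_3)
Y = np.where(I == 1, Y1, Y0)

def values(T):
    """IPW and adaptively normalized value estimates for pi(X) = 1{X_2 > T}."""
    pi = (X[:, 1] > T).astype(int)
    p_match = np.where(pi == 1, p, 1 - p)      # P(I_k = pi(X_k) | X_k)
    w = (I == pi) / p_match                    # IPW weights
    v_ipw = np.mean(Y * w)
    r = (1 - p_match) / p_match
    denom = np.sum(r * w)
    ratio = np.sum(Y * r * w) / denom if denom > 0 else 0.0
    v_an = v_ipw + ratio * (1 - np.mean(w))    # AN correction, cf. (A45)
    return v_ipw, v_an

grid = np.linspace(-1, 1, 81)
vals = np.array([values(T) for T in grid])
T_ipw = grid[np.argmax(vals[:, 0])]            # learned threshold per objective
T_an = grid[np.argmax(vals[:, 1])]
```

Averaging the learned thresholds over many replications, as in Table 2, is what separates the two objectives; a single draw is noisy.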

The average threshold learned, for a range of values of n , is presented in Table 2. The optimal policy is to threshold at T = 0 , and so we take this as a point of comparison. We see that the thresholds learned from minimizing V ˆ AN are consistently closer to the optimal threshold of zero than those learned by minimizing V ˆ IPW , and that the gap between the performance of the two objectives is consistent across the range of values of n we consider.

Table 2

Thresholds learned by V ˆ IPW and V ˆ AN for varying n for data generated according to (20). Each entry is an average over 100,000 replications and standard errors are over 10 replications of the 100,000 trials. The optimal policy is to threshold at 0; minimizing V ˆ AN consistently learns better thresholds

Objective   n = 250          n = 500          n = 750          n = 1,000
V̂_IPW      0.057 ± 0.0013   0.035 ± 0.0011   0.026 ± 0.0011   0.020 ± 0.0014
V̂_AN       0.039 ± 0.0018   0.015 ± 0.0014   0.010 ± 0.0009   0.004 ± 0.0009

6 Discussion

In this article, we study adaptive normalization for IPW estimators: rather than normalizing by the sample size or by the sum of the weights, we propose normalizing by a data-dependent affine combination of the two. For mean estimation in survey sampling, our proposed estimator is algebraically equivalent to a control variate method from the Monte Carlo literature that has not previously been studied in the causal inference literature. We further develop the adaptive normalization idea in causal inference settings by using it to improve on the AIPW estimator, to propose new estimators for the ATE in randomized experiments, and to propose new objectives for policy learning.

There are several possible future directions for this work. One is to relax the assumption that the treatment indicators I_k are independent, an assumption that is unrealistic in many observational datasets and directly violated in certain experimental designs. However, if the correlation structure of the I_k is known or can be estimated, analogues of our limit theorems and estimators could be developed. If the correlation between the I_k is sufficiently weak, Theorem 1 may hold without modification, and we expect the remainder of our results would go through as well. However, if the correlation between the I_k is strong enough to affect the asymptotic variance in Theorem 1, then the optimal choice of λ would change, and these changes would propagate through the analysis. In such a case, we expect that the general form of μ̂_AN would remain similar, but the T̂/π̂ term would be replaced by a more complex expression.

In addition, estimands other than the mean, such as quantiles and distributions, are also of interest in causal inference and program evaluation and can be estimated using IPW-style estimators [41,42,43]. Extending our results to these other estimands is another possible line of further work and will likely give very different results from the ones presented here. IPW estimators for the mean have a simple closed form involving the weights, whereas IPW estimators for other estimands are typically defined as weighted M-estimators and do not have such a closed form [42,43]. Thus, the weights enter into the estimator in a different fashion in such cases, and we would expect the analogue of the adaptively normalized estimator to look different as well. In fact, we are not even aware of any work studying Hájek/self-normalized estimators in these other contexts, and the comparison between the Hájek and Horvitz–Thompson estimators may be different as well.

In a different direction, there are many places within and beyond causal inference where inverse probability weighted estimators are used in context-specific ways, such as off-policy evaluation on networks [44], recommender system evaluation [45], in learning to rank problems [46], and inference from bandits [47,48]. Developing and applying similar ideas in these contexts suggests many promising lines of future work.

Acknowledgements

We thank Guillaume Basse, Alex Chin, Dean Eckles, Sharad Goel, Kevin Guo, Ramesh Johari, Brian Karrer, and Fredrik Sävje for helpful discussions and feedback on early versions of this work. This work was partially supported by ARO award 73348-NS-YIP.

Conflict of interest: Authors state no conflict of interest.

Appendix A Technical proofs

A.1 Proofs of Theorems 1, 3, and 5

In this section, we detail the calculations underlying the CLTs in Section 3. The building block of our results is a joint CLT for the vector β ˆ = ( S ˆ / n , n ˆ / n ) , which has mean β = ( μ , 1 ) .

Lemma A1

Under Assumption 1, we have

$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, \Sigma),$

where the entries of Σ are given by

$n\operatorname{var}(\hat S/n) = \operatorname{var}(Y_1) + E\left[Y_k^2\,\frac{1-p_k}{p_k}\right], \qquad n\operatorname{var}(\hat n/n) = E\left[\frac{1-p_k}{p_k}\right], \qquad n\operatorname{cov}(\hat S/n, \hat n/n) = E\left[Y_k\,\frac{1-p_k}{p_k}\right].$

Proof

For any vector $v \in \mathbf{R}^2$, the quantity $\sqrt{n}(v^T\hat\beta - v^T\beta)$ is a normalized sum of i.i.d. random variables. These random variables have finite variance by Assumption 1, so the classical Lindeberg CLT applies with a limiting distribution that is $N(0, v^T\Sigma v)$. The lemma then follows from the Cramér–Wold device.□
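The entries of Σ in Lemma A1 are easy to verify by simulation. The sketch below draws from a toy distribution, with responses independent of the propensities purely for illustration, and compares Monte Carlo moments of (Ŝ/n, n̂/n) against the stated formulas.

```python
import numpy as np

rng = np.random.default_rng(3)
reps, n = 4000, 400
levels = np.array([0.4, 0.6, 0.8])     # toy propensity levels (illustrative)

S_vals = np.empty(reps)
N_vals = np.empty(reps)
for r in range(reps):
    pk = rng.choice(levels, size=n)
    Yk = rng.normal(1.0, 1.0, size=n)  # responses, independent of pk here
    Ik = rng.binomial(1, pk)
    S_vals[r] = np.mean(Yk * Ik / pk)  # \hat S / n
    N_vals[r] = np.mean(Ik / pk)       # \hat n / n

q = np.mean((1 - levels) / levels)     # E[(1 - p_k)/p_k]
var_S_theory = 1.0 + 2.0 * q           # var(Y_1) + E[Y_k^2 (1-p_k)/p_k]; E[Y^2] = 2
var_N_theory = q                       # E[(1 - p_k)/p_k]
cov_theory = 1.0 * q                   # E[Y_k (1-p_k)/p_k]; E[Y] = 1

var_S_mc = n * np.var(S_vals)
var_N_mc = n * np.var(N_vals)
cov_mc = n * np.cov(S_vals, N_vals, bias=True)[0, 1]
```

The Monte Carlo values should match the formulas up to simulation noise of a few percent.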

The CLT for μ ˆ λ now follows from an application of the delta method.

Proof of Theorem 1

Note that $\hat\mu_\lambda = f(\hat\beta)$ for $f(x, y; \lambda) = \frac{x}{\lambda + (1-\lambda)y}$. The function f is differentiable at the point β with gradient $\nabla_\beta f = (1, -(1-\lambda)\mu)$, and so by the delta method, we conclude that $\hat\mu_\lambda$ satisfies a CLT with the asymptotic variance

$\nabla_\beta f^T\, \Sigma\, \nabla_\beta f = \operatorname{var}(Y_1) + E\left[Y_k^2\,\frac{1-p_k}{p_k}\right] - 2(1-\lambda)\mu\, E\left[Y_k\,\frac{1-p_k}{p_k}\right] + (1-\lambda)^2\mu^2\, E\left[\frac{1-p_k}{p_k}\right],$

which rearranges to the variance in (3), as desired.□

Proof of Theorem 3

The key step is to replace the factor of T ˆ / π ˆ in μ ˆ AN with T / π , so that we can then apply a standard CLT. Thus, we write

(A1) $\hat\mu_{\mathrm{AN}} = \frac{\hat S}{n} + \frac{T}{\pi}\left(1 - \frac{\hat n}{n}\right) + \left(\frac{\hat T}{\hat\pi} - \frac{T}{\pi}\right)\left(1 - \frac{\hat n}{n}\right)$

(A2) $= \frac{\hat S}{n} + \frac{T}{\pi}\left(1 - \frac{\hat n}{n}\right) + o_P(1)\, O_P(n^{-1/2})$

(A3) $= \frac{\hat S}{n} + \frac{T}{\pi}\left(1 - \frac{\hat n}{n}\right) + o_P(n^{-1/2}),$

where the second equality holds because $\frac{\hat T}{\hat\pi} - \frac{T}{\pi} = o_P(1)$ by the consistency of $\hat T$ and $\hat\pi$, and $1 - \frac{\hat n}{n} = O_P(n^{-1/2})$ by the CLT for i.i.d. sums.

Thus, the limiting distribution of $\sqrt{n}(\hat\mu_{\mathrm{AN}} - \mu)$ is the same as the limiting distribution of

$\sqrt{n}\left(\frac{\hat S}{n} + \frac{T}{\pi}\left(1 - \frac{\hat n}{n}\right) - \mu\right) = \frac{1}{\sqrt{n}}\sum_{k=1}^n\left(\frac{Y_k I_k}{p_k} - \mu\right) - \frac{T}{\pi}\cdot\frac{1}{\sqrt{n}}\sum_{k=1}^n\left(\frac{I_k}{p_k} - 1\right).$

This last expression is the difference of two mean-zero i.i.d. sums, and so its limiting distribution is normal with variance

$\operatorname{var}(Y_kI_k/p_k) - 2(T/\pi)\operatorname{cov}(Y_kI_k/p_k,\, I_k/p_k) + (T/\pi)^2\operatorname{var}(I_k/p_k),$

which simplifies to σ 2 ( λ ) .□

Next we turn to the CLT for the case of estimated propensities. Recall that our set-up here is that $p_k = (1 + \exp(-X_k^T\theta))^{-1}$, and we estimate θ by the maximum-likelihood estimate $\hat\theta$. In what follows, we abuse notation and write $p(X_k^T\theta)$ for $(1 + \exp(-X_k^T\theta))^{-1}$.

Proof of Theorem 5

Consider the family of estimators

$\hat\mu_\beta(\theta) = \frac{1}{n}\sum_{k=1}^n \frac{Y_kI_k}{p(X_k^T\theta)} + \beta\left(1 - \frac{1}{n}\sum_{k=1}^n \frac{I_k}{p(X_k^T\theta)}\right).$

Then μ ˆ HT corresponds to β = 0 , and it follows from the consistency and asymptotic normality of θ ˆ that we can repeat the arguments of (A3) and show that μ ˆ AN is first-order equivalent to β = T / π . Thus, to establish the theorem, it suffices to characterize the asymptotic variance of μ ˆ β ( θ ˆ ) and compare the cases β = 0 and β = T / π .

To understand the asymptotics of $\hat\mu_\beta(\hat\theta)$, we first Taylor expand around θ and obtain

(A4) $\hat\mu_\beta(\hat\theta) = \hat\mu_\beta(\theta) + E\left[\frac{\partial\hat\mu_\beta(\theta)}{\partial\theta}\right]^T(\hat\theta - \theta) + o_P(n^{-1/2}).$

Next, it follows from standard results on M-estimators that $\hat\theta - \theta = I(\theta)^{-1}S(\theta) + o_P(n^{-1/2})$, where

(A5) $I(\theta) = E[p(X_k^T\theta)(1 - p(X_k^T\theta))X_kX_k^T] \quad\text{and}\quad S(\theta) = \frac{1}{n}\sum_{k=1}^n X_k(I_k - p(X_k^T\theta))$

are the Fisher information and score. Also, by direct computation,

(A6) $\frac{\partial\hat\mu_\beta(\theta)}{\partial\theta} = -\frac{1}{n}\sum_{k=1}^n \frac{Y_kI_k}{p(X_k^T\theta)^2}\, p(X_k^T\theta)(1 - p(X_k^T\theta))\, X_k + \frac{1}{n}\sum_{k=1}^n \frac{\beta I_k}{p(X_k^T\theta)^2}\, p(X_k^T\theta)(1 - p(X_k^T\theta))\, X_k,$

which has expectation $-E[(Y_k - \beta)(1 - p(X_k^T\theta))X_k]$ at θ.

Combining (A4), (A5), and (A6) gives

(A7) $\sqrt{n}(\hat\mu_\beta(\hat\theta) - \mu) = \frac{1}{\sqrt{n}}\sum_{k=1}^n\left(\frac{Y_kI_k}{p(X_k^T\theta)} - \mu\right) + \frac{1}{\sqrt{n}}\sum_{k=1}^n\left(\beta - \frac{\beta I_k}{p(X_k^T\theta)}\right) - E[Y_k(1 - p(X_k^T\theta))X_k]^T I(\theta)^{-1}\,\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k(I_k - p(X_k^T\theta)) + \beta\, E[(1 - p(X_k^T\theta))X_k]^T I(\theta)^{-1}\,\frac{1}{\sqrt{n}}\sum_{k=1}^n X_k(I_k - p(X_k^T\theta)) + o_P(1).$

Since each term of (A7) is a sum of mean-zero i.i.d. terms, we can compute the asymptotic variance directly. In particular, the terms that depend on β are

(A8) $\beta^2\operatorname{var}(I_k/p(X_k^T\theta)) - 2\beta\operatorname{cov}\left(\frac{I_k}{p(X_k^T\theta)},\, \frac{Y_kI_k}{p(X_k^T\theta)}\right) + 2\beta\operatorname{cov}\left(\frac{I_k}{p(X_k^T\theta)},\, u_1^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))\right) + \beta^2\operatorname{var}(u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))) + 2\beta\operatorname{cov}\left(u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta)),\, \frac{Y_kI_k}{p(X_k^T\theta)}\right) - 2\beta\operatorname{cov}(u_1^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta)),\, u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))) - 2\beta^2\operatorname{cov}\left(\frac{I_k}{p(X_k^T\theta)},\, u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))\right),$

where $u_1 = E[Y_k(1 - p(X_k^T\theta))X_k]$ and $u_2 = E[(1 - p(X_k^T\theta))X_k]$. The first two terms of (A8) are $\beta^2\pi - 2T\beta$. To simplify the remaining terms, we compute

(A9) $\operatorname{cov}\left(\frac{I_k}{p(X_k^T\theta)},\, u_1^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))\right) = u_1^T I(\theta)^{-1}u_2,$

(A10) $\operatorname{var}(u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))) = u_2^T I(\theta)^{-1}u_2,$

(A11) $\operatorname{cov}\left(u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta)),\, \frac{Y_kI_k}{p(X_k^T\theta)}\right) = u_2^T I(\theta)^{-1}u_1,$

(A12) $\operatorname{cov}(u_1^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta)),\, u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))) = u_1^T I(\theta)^{-1}u_2,$

(A13) $\operatorname{cov}\left(\frac{I_k}{p(X_k^T\theta)},\, u_2^T I(\theta)^{-1}X_k(I_k - p(X_k^T\theta))\right) = u_2^T I(\theta)^{-1}u_2.$

Together, these imply that (A8) can be written as follows:

(A14) $= \beta^2\pi - 2\beta T + 2\beta\, u_1^T I(\theta)^{-1}u_2 + \beta^2\, u_2^T I(\theta)^{-1}u_2 + 2\beta\, u_2^T I(\theta)^{-1}u_1 - 2\beta\, u_1^T I(\theta)^{-1}u_2 - 2\beta^2\, u_2^T I(\theta)^{-1}u_2,$

(A15) $= \beta^2\pi - 2\beta T + 2\beta\, u_2^T I(\theta)^{-1}u_1 - \beta^2\, u_2^T I(\theta)^{-1}u_2.$

Finally, recall that $\hat\mu_{\mathrm{HT}}$ corresponds to β = 0 and $\hat\mu_{\mathrm{AN}}$ corresponds to β = T/π. Since (A15) evaluated at β = 0 is 0, the difference in asymptotic variance between $\hat\mu_{\mathrm{HT}}$ and $\hat\mu_{\mathrm{AN}}$ equals (A15) evaluated at β = T/π, which is $-\frac{T^2}{\pi} + \frac{2T}{\pi}u_2^T A u_1 - \frac{T^2}{\pi^2}u_2^T A u_2$ with $A = I(\theta)^{-1}$, completing the argument.□
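The information and score identities in (A5) lend themselves to a quick numerical sanity check: for the logistic model, the covariance of $\sqrt{n}\,S(\theta)$ should match $I(\theta) = E[p(X^T\theta)(1 - p(X^T\theta))XX^T]$. Below is a sketch with arbitrary illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.array([0.5, -0.25])
reps, n = 2000, 300

def sigmoid(u):
    # p(u) = (1 + exp(-u))^{-1}
    return 1 / (1 + np.exp(-u))

scores = np.empty((reps, 2))
info_mc = np.zeros((2, 2))
for r in range(reps):
    X = rng.normal(size=(n, 2))
    pk = sigmoid(X @ theta)
    Ik = rng.binomial(1, pk)
    # sqrt(n) * S(theta), with S(theta) = (1/n) sum_k X_k (I_k - p_k)
    scores[r] = np.sqrt(n) * np.mean(X * (Ik - pk)[:, None], axis=0)
    # running Monte Carlo estimate of I(theta) = E[p(1-p) X X^T]
    info_mc += (X * (pk * (1 - pk))[:, None]).T @ X / n
info_mc /= reps

score_cov = np.cov(scores.T)  # should be close to info_mc
```

Agreement of the two matrices is exactly the information equality that the M-estimation expansion $\hat\theta - \theta = I(\theta)^{-1}S(\theta) + o_P(n^{-1/2})$ relies on.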

A.2 Proof of Theorem 2

This subsection contains the proof of Theorem 2 and details for the heuristic argument for variance reduction given in the main text.

As mentioned earlier, we combine the two equations in (7) to obtain

$\hat\mu^{(t)} = \frac{\hat S}{\left(1 - \frac{\hat T}{\hat\pi\hat\mu^{(t-1)}}\right)n + \frac{\hat T}{\hat\pi\hat\mu^{(t-1)}}\hat n} = \frac{\hat S/n}{1 - \frac{\hat T}{\hat\pi\hat\mu^{(t-1)}}\left(1 - \frac{\hat n}{n}\right)}$

and write this as $\hat\mu^{(t)} = f(\hat\mu^{(t-1)})$, where

(A16) $f(x) = \frac{ax}{x - b}, \qquad a = \frac{\hat S}{n}, \qquad b = \frac{\hat T}{\hat\pi}\left(1 - \frac{\hat n}{n}\right).$

In this notation, the nonzero fixed point of f is $x = a + b = \hat\mu_{\mathrm{AN}}$.

The proof now proceeds in two steps: first, we generalize the problem slightly and consider the discrete dynamical system $x^{(t)}$ initialized at $x^{(0)} = a$ and with iterative map $x^{(t)} = f(x^{(t-1)})$ for arbitrary fixed a and b. For this dynamical system, we show that if $|a| > |b|$, then $x^{(t)}$ converges to $a + b$. Second, we show that, with high probability, the particular a and b defined in (A16) satisfy these conditions.

A.2.1 Analyzing the dynamical system

The function f has two fixed points at x 1 = 0 and at x 2 = a + b . To understand the stability of these fixed points, we compute the derivative

$f'(x) = -\frac{ab}{(x - b)^2}, \qquad f'(x_1) = -\frac{a}{b}, \qquad f'(x_2) = -\frac{b}{a}.$

A fixed point x of the map f is stable if and only if $|f'(x)| < 1$, so we see that if $|a| > |b|$, then $x_1$ is unstable and $x_2$ is stable, and if $|a| < |b|$, then $x_1$ is stable and $x_2$ is unstable. The case $|a| = |b|$ occurs with probability zero, and so we do not consider it.

In light of these stabilities, we should expect $x^{(t)}$ to converge to $a + b$ whenever $|a| > |b|$. The following lemma confirms this.

Lemma A2

If $|a| > |b|$, then the dynamical system with initial point $x_0 = a$ and iterative map $x_{t+1} = f(x_t)$ converges to $x^* = a + b$.

Proof

Our analysis relies on the two-step map

$f(f(x)) = \frac{a^2 x}{x(a - b) + b^2}.$

In particular, we will show that the subsequences x 0 , x 2 , , and x 1 , x 3 , , both converge to x , and this will prove the lemma. In both cases, the key observation is that

(A17) $f(f(x)) - x^* = \frac{a^2 x}{x(a - b) + b^2} - (a + b) = \frac{b^2}{x(a - b) + b^2}\,(x - (a + b)).$

We first consider the subsequence $x_0, x_2, \ldots$, which for convenience we denote by $y_t = x_{2t}$. We claim that for any $t \ge 0$, we have

(A18) $|y_{t+1} - x^*| \le \left|\frac{b}{a}\right| |y_t - x^*|.$

The proof of the claim is by induction. For the base case, which is t = 0 and y 0 = a , we have that

$\left|\frac{b^2}{a(a - b) + b^2}\right| \le \left|\frac{b^2}{ab}\right| = \left|\frac{b}{a}\right|,$

and substituting this into (A17) gives (A18).

For the inductive step, suppose the result holds for y 1 , , y t 1 . We will show that y t also satisfies

$\left|\frac{b^2}{y_t(a - b) + b^2}\right| \le \left|\frac{b}{a}\right|,$

and this together with (A17) will prove the claim. The previous display is equivalent to

(A19) $|y_t(a - b) + b^2| \ge |ab|.$

This inequality can be established by casework on the signs of a and b . We discuss the case a > 0 , b > 0 in detail; the other three cases are analogous.

If $a > 0$ and $b > 0$, then $a + b > a$, and since, by the inductive hypothesis, $y_t$ is closer to $a + b$ than a is, we must have $a \le y_t \le a + 2b$. Thus,

$|y_t(a - b) + b^2| \ge \min_{a \le t \le a + 2b} |t(a - b) + b^2|.$

Since the function we are minimizing is piecewise linear in t, the minimum must be attained at an endpoint ($t = a$ or $t = a + 2b$) or at a point where $t(a - b) + b^2 = 0$.

At $t = a$, the objective is $a^2 - ab + b^2 \ge ab$; at $t = a + 2b$, the objective is $a^2 + ab - b^2$. Since $a > b$ and a, b are both positive, $a^2 + ab - b^2 \ge ab$ as well. Finally, $t(a - b) + b^2 = 0$ is not possible, because it requires $t = \frac{b^2}{b - a}$, and since $a > b$, we have $\frac{b^2}{b - a} < 0 < a$. Thus, we conclude that (A19) holds, and this completes the induction for the case $a > 0$ and $b > 0$. The other cases are analogous, and combining them establishes (A18).

Now, by using our claim in (A18) repeatedly, we have that for any t > 0 ,

$|y_t - x^*| \le \left|\frac{b}{a}\right||y_{t-1} - x^*| \le \cdots \le \left|\frac{b}{a}\right|^t |y_0 - x^*|.$

Since $|b| < |a|$, we thus have $|y_t - x^*| \to 0$ as $t \to \infty$, proving the convergence.

Recalling that y t = x 2 t , we have shown that the subsequence x 0 , x 2 , , converges to x . An analogous argument gives that x 1 , x 3 , converges to x as well, and these two together imply that x t x .□
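The convergence just proved, and its geometric rate of roughly $|b/a|$ per step near the fixed point, can be observed directly by iterating the map from (A16); the values of a and b below are arbitrary illustrative choices with $|a| > |b|$.

```python
def f(x, a, b):
    # the iterative map from (A16); fixed points at x = 0 and x = a + b
    return a * x / (x - b)

a, b = 1.0, 0.4      # |a| > |b|, so x* = a + b = 1.4 is the stable fixed point
x = a                # initial point x0 = a
trajectory = [x]
for _ in range(60):
    x = f(x, a, b)
    trajectory.append(x)
# the iterates oscillate around x* and contract toward it geometrically
```

Starting instead from a point near 0 with $|a| < |b|$ would show the iterates collapsing to the other fixed point, matching the stability analysis above.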

A.2.2 High-probability guarantees

Our next lemma carries out the second part of the proof of Theorem 2, which is showing that a and b as defined in (A16) satisfy $|a| > |b|$ with high probability.

Lemma A3

Suppose Assumption 1 holds and that $\mu \ne 0$. Then,

$P\left(\left|\frac{\hat S}{n}\right| > \left|\frac{\hat T}{\hat\pi}\right|\left|1 - \frac{\hat n}{n}\right|\right) \ge 1 - 4\exp\left(-\frac{2\mu^2 n}{O(M/\delta)^2}\right).$

Proof

For any $0 \le \varepsilon \le |\mu|$, define the events

$E_1 = \left\{\left|\frac{\hat S}{n}\right| > \left|\frac{\hat T}{\hat\pi}\right|\left|1 - \frac{\hat n}{n}\right|\right\}, \qquad E_2 = \left\{\left|\frac{\hat S}{n} - \mu\right| \le \varepsilon\right\}, \qquad E_3 = \left\{\left|\frac{\hat T}{\hat\pi}\right|\left|1 - \frac{\hat n}{n}\right| \le |\mu| - \varepsilon\right\}.$

We need to lower bound the probability of $E_1$, which we do by upper bounding the probability of $E_1^c$. Since $E_2 \cap E_3 \subseteq E_1$, the union bound gives

$P(E_1^c) \le P(E_2^c) + P(E_3^c) = P\left(\left|\frac{1}{n}\sum_{k=1}^n \frac{Y_kI_k}{p_k} - \mu\right| > \varepsilon\right) + P\left(\left|\frac{\hat T}{\hat\pi}\right|\left|1 - \frac{\hat n}{n}\right| > |\mu| - \varepsilon\right) \le P\left(\left|\frac{1}{n}\sum_{k=1}^n \frac{Y_kI_k}{p_k} - \mu\right| > \varepsilon\right) + P\left(\left|1 - \frac{\hat n}{n}\right| > \frac{|\mu| - \varepsilon}{M}\right) \le 2\exp\left(-\frac{2\varepsilon^2 n}{(M/\delta)^2}\right) + 2\exp\left(-\frac{2(|\mu| - \varepsilon)^2 n}{(1/\delta)^2}\right).$

The second inequality follows from the bound

(A20) $|\hat T| = \left|\frac{1}{\hat n}\sum_{k=1}^n Y_k\,\frac{1 - p_k}{p_k}\,\frac{I_k}{p_k}\right| \le \frac{1}{\hat n}\sum_{k=1}^n |Y_k|\,\frac{1 - p_k}{p_k}\,\frac{I_k}{p_k} \le M\hat\pi,$

which implies that $|\hat T/\hat\pi| \le M$. The third inequality follows from applying Hoeffding's inequality to each term, with the bounds $|Y_kI_k/p_k| \le M/\delta$ and $0 \le I_k/p_k \le 1/\delta$.

Finally, we choose ε to balance these two terms. The optimal choice is $\varepsilon = \frac{M}{M+1}|\mu|$, and with this value of ε, we conclude that

$P(E_1) \ge 1 - 4\exp\left(-\frac{2\mu^2 n}{(M+1)^2/\delta^2}\right),$

finishing the proof.□

A.2.3 Combining the lemmas

With these two lemmas, the proof of Theorem 2 is straightforward.

Proof of Theorem 2

Let a , b be as defined in (A16).

For (i), if $|a| > |b|$, then Lemma A2 implies the result with $\hat\mu_{\lim} = \hat\mu_{\mathrm{AN}}$. If $|a| < |b|$, then an argument similar to the one in the proof of Lemma A2 establishes that $x^{(t)} \to 0$ as $t \to \infty$, and so the statement holds with $\hat\mu_{\lim} = 0$.

For (ii), we have

$P(\hat\mu_{\lim} \ne \hat\mu_{\mathrm{AN}}) \le P\left(\left|\frac{\hat S}{n}\right| \le \left|\frac{\hat T}{\hat\pi}\right|\left|1 - \frac{\hat n}{n}\right|\right) \le 4\exp\left(-\frac{2\mu^2 n}{O(M/\delta)^2}\right),$

where the first inequality is from the contrapositive of Lemma A2 and the second is from Lemma A3. This bound goes to zero as $n \to \infty$, implying (ii).□

A.3 Proof of Theorem 4

Before presenting the proof, we remark that for this result, we must deal explicitly with the possibility that $\sum_k I_k = 0$, i.e., that no units receive treatment. We do not discuss this case elsewhere because our other results are asymptotic and this event occurs with probability exponentially small in n, so it can be ignored; but since Theorem 4 is a finite-sample result, we must account for this possibility.

Proof

Recall that

(A21) $\operatorname{var}(\hat\mu_{\mathrm{HT}}) = \operatorname{var}(Y_1) + E[Y_1^2]\,\frac{1 - p}{p},$

so we must upper bound var ( μ ˆ AN ) by the right-hand side of (A21). Now, under the assumption that p k = p is constant, μ ˆ AN simplifies to

(A22) $\hat\mu_{\mathrm{AN}} = \begin{cases} \dfrac{\sum_k Y_kI_k}{\sum_k I_k} & \text{if } \sum_k I_k \ne 0, \\ 0 & \text{if } \sum_k I_k = 0. \end{cases}$

For convenience, let $N = \sum_k I_k$. Then we decompose

(A23) var ( μ ˆ AN ) = E [ var ( μ ˆ AN N ) ] + var ( E [ μ ˆ AN N ] )

and compute

(A24) $\operatorname{var}(\hat\mu_{\mathrm{AN}} \mid N) = \begin{cases} \operatorname{var}(Y_1)/N & \text{if } N \ne 0, \\ 0 & \text{if } N = 0, \end{cases} \qquad E[\hat\mu_{\mathrm{AN}} \mid N] = \begin{cases} E[Y_1] & \text{if } N \ne 0, \\ 0 & \text{if } N = 0. \end{cases}$

Substituting (A24) into (A23) yields

(A25) $\operatorname{var}(\hat\mu_{\mathrm{AN}}) = \operatorname{var}(Y_1)\,E[1/N \mid N \ne 0]\,P(N \ne 0) + P(N \ne 0)\,(E[Y_1] - E[Y_1]P(N \ne 0))^2 + P(N = 0)\,(E[Y_1]P(N \ne 0))^2$

(A26) $= \operatorname{var}(Y_1)\,E[1/N \mid N \ne 0]\,P(N \ne 0) + E[Y_1]^2\,P(N = 0)P(N \ne 0)$

(A27) $\le \operatorname{var}(Y_1) + E[Y_1^2]\,P(N = 0)$

(A28) $\le \operatorname{var}(Y_1) + E[Y_1^2]\,\frac{1 - p}{p},$

giving the result. Here, the first equality uses $P(N = 0) + P(N \ne 0) = 1$ and algebra; the second inequality uses the facts that $1/N \le 1$ conditional on $N \ne 0$ and that $E[Y_1]^2 P(N \ne 0) \le E[Y_1^2]$; and the third inequality uses $P(N = 0) = (1 - p)^n$ together with the elementary inequality $(1 - p)^n \le \frac{1 - p}{p}$ for $p \in (0, 1)$.□
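A quick Monte Carlo probe of this comparison is easy to run. The parameters below are placeholders; following the convention of (A21), which states the Horvitz–Thompson variance without a 1/n factor, we scale the simulated variance of μ̂_AN by n and check that it falls below $\operatorname{var}(Y_1) + E[Y_1^2]\frac{1-p}{p}$ in this example.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n, p, mu = 20000, 60, 0.5, 1.0

Y = rng.normal(mu, 1.0, size=(reps, n))
I = rng.binomial(1, p, size=(reps, n))
N = I.sum(axis=1)
sums = (Y * I).sum(axis=1)
# (A22): the mean over observed units, or 0 if no unit is observed
an_vals = np.where(N > 0, sums / np.maximum(N, 1), 0.0)

n_var_an = n * np.var(an_vals)                 # scaled variance of mu_AN
bound = 1.0 + (mu**2 + 1.0) * (1 - p) / p      # var(Y_1) + E[Y_1^2] (1-p)/p = 3
```

With these values, the scaled variance of μ̂_AN lands near 2, comfortably below the bound of 3, and the gap widens as |μ| grows.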

A.4 Proof of Theorem 6

Proof

For ease of notation, we work within a single fold of the cross-fitted estimator and let $T_n$ be an auxiliary data set of size n on which $\hat\mu(\cdot)$ and $\hat p(\cdot)$ are trained; this is equivalent to twofold cross-fitting, and the case of more folds is analogous. It then suffices to show that the correction term we have introduced is $o_P(n^{-1/2})$. That error term is

(A29) $= \left|\frac{1}{\hat\pi}\cdot\frac{1}{n}\sum_{k=1}^n (Y_k - \hat\mu(X_k))\,\frac{1 - \hat p(X_k)}{\hat p(X_k)}\,\frac{I_k}{\hat p(X_k)}\right|\,\left|1 - \frac{1}{n}\sum_{k=1}^n \frac{I_k}{\hat p(X_k)}\right|$

(A30) $\lesssim \left|\frac{1}{n}\sum_{k=1}^n (Y_k - \hat\mu(X_k))I_k\right|\,\left|1 - \frac{1}{n}\sum_{k=1}^n \frac{I_k}{\hat p(X_k)}\right|$

(A31) $= \left|\frac{1}{n}\sum_{k=1}^n (Y_k - \mu(X_k))I_k + \frac{1}{n}\sum_{k=1}^n (\mu(X_k) - \hat\mu(X_k))I_k\right|\,\left|1 - \frac{1}{n}\sum_{k=1}^n \frac{I_k}{p(X_k)} + \frac{1}{n}\sum_{k=1}^n \left(\frac{I_k}{p(X_k)} - \frac{I_k}{\hat p(X_k)}\right)\right|$

(A32) $\le \left(O_P(n^{-1/2}) + \left|\frac{1}{n}\sum_{k=1}^n (\mu(X_k) - \hat\mu(X_k))I_k\right|\right)\left(O_P(n^{-1/2}) + \left|\frac{1}{n}\sum_{k=1}^n \left(\frac{I_k}{p(X_k)} - \frac{I_k}{\hat p(X_k)}\right)\right|\right)$

(A33) $\le O_P(n^{-1}) + \frac{O_P(n^{-1/2})}{n}\sum_{k=1}^n |\mu(X_k) - \hat\mu(X_k)| + \frac{O_P(n^{-1/2})}{n}\sum_{k=1}^n \left|\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right| + \left(\frac{1}{n}\sum_{k=1}^n |\mu(X_k) - \hat\mu(X_k)|\right)\left(\frac{1}{n}\sum_{k=1}^n \left|\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right|\right)$

(A34) $\le O_P(n^{-1}) + O_P(n^{-1/2})\left(\sup_{x\in\mathcal X} |\mu(x) - \hat\mu(x)| + \sup_{x\in\mathcal X}\left|\frac{1}{p(x)} - \frac{1}{\hat p(x)}\right|\right) + \left(\frac{1}{n}\sum_{k=1}^n |\mu(X_k) - \hat\mu(X_k)|\right)\left(\frac{1}{n}\sum_{k=1}^n \left|\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right|\right)$

(A35) $= O_P(n^{-1}) + O_P(n^{-1/2})\,o_P(1) + \left(\frac{1}{n}\sum_{k=1}^n |\mu(X_k) - \hat\mu(X_k)|\right)\left(\frac{1}{n}\sum_{k=1}^n \left|\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right|\right),$

where the second inequality follows from the fact that, by the consistency of $\hat p(\cdot)$, we have $\delta/2 \le \hat p(x) \le 1 - \delta/2$ for all $x \in \mathcal X$ for sufficiently large n; the fourth inequality applies the triangle inequality and the CLT for i.i.d. sums; and the final equality uses Assumption 2. (Note that since $p(\cdot)$ is bounded away from zero, the consistency of $\hat p$ implies the consistency of $1/\hat p$ as well.)

Examining (A35), the first two terms are o P ( n 1 / 2 ) as needed, so it remains to analyze the final term. We thus compute

$E\left[\left(\frac{1}{n}\sum_{k=1}^n |\mu(X_k) - \hat\mu(X_k)|\right)^2\left(\frac{1}{n}\sum_{k=1}^n \left|\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right|\right)^2\right]$

as

(A36) $\le E\left[\left(\frac{1}{n}\sum_{k=1}^n (\mu(X_k) - \hat\mu(X_k))^2\right)\left(\frac{1}{n}\sum_{k=1}^n \left(\frac{1}{p(X_k)} - \frac{1}{\hat p(X_k)}\right)^2\right)\right]$

(A37) $= E\left[\frac{1}{n^2}\sum_{k \ne \ell} (\mu(X_k) - \hat\mu(X_k))^2\left(\frac{1}{p(X_\ell)} - \frac{1}{\hat p(X_\ell)}\right)^2\right] + o(n^{-1})$

(A38) $= E\left[\frac{1}{n^2}\sum_{k \ne \ell} E\left[(\mu(X_k) - \hat\mu(X_k))^2\left(\frac{1}{p(X_\ell)} - \frac{1}{\hat p(X_\ell)}\right)^2 \,\Big|\, T_n\right]\right] + o(n^{-1})$

(A39) $= E\left[\frac{n(n-1)}{n^2}\, E\left[(\mu(X_k) - \hat\mu(X_k))^2\left(\frac{1}{p(X_\ell)} - \frac{1}{\hat p(X_\ell)}\right)^2 \,\Big|\, T_n\right]\right] + o(n^{-1})$

(A40) $\le E\left[E[(\mu(X_k) - \hat\mu(X_k))^2 \mid T_n]\, E\left[\left(\frac{1}{p(X_\ell)} - \frac{1}{\hat p(X_\ell)}\right)^2 \,\Big|\, T_n\right]\right] + o(n^{-1}),$

where the first inequality is Cauchy–Schwarz, the second equality expands the product and drops the $k = \ell$ terms, the fourth equality uses the fact that the terms of the sum are identically distributed after conditioning on $T_n$, and the final inequality uses the fact that the two errors are independent after conditioning on $T_n$, together with $n(n-1)/n^2 \le 1$.

By Assumption 3, the product of conditional expectations in (A40) is $o_P(n^{-1})$. Since $\mu(\cdot)$ and $p(\cdot)$ are bounded by Assumption 1, and $\hat{\mu}(\cdot)$ and $\hat{p}(\cdot)$ are sup-norm consistent by Assumption 2, that product of conditional expectations is eventually dominated by a constant, and so by dominated convergence the expectation in (A40) is $o(n^{-1})$ as well, completing the proof.□

A.5 Proof of Theorem 7

Our proof closely follows ideas and tools developed in ref. [17].

Proof

Our goal is to control $V(\pi) - V(\hat{\pi}_{\text{AN}})$ for any $\pi \in \Pi$. We do this by first writing $V(\pi) - V(\hat{\pi}_{\text{AN}})$ as follows:

(A41) $= V(\pi) - \hat{V}_{\text{AN}}(\hat{\pi}_{\text{AN}}) + \hat{V}_{\text{AN}}(\hat{\pi}_{\text{AN}}) - V(\hat{\pi}_{\text{AN}}),$

(A42) $\leq V(\pi) - \hat{V}_{\text{AN}}(\pi) + \sup_{\pi\in\Pi}|\hat{V}_{\text{AN}}(\pi) - V(\pi)|,$

(A43) $\leq 2\sup_{\pi\in\Pi}|\hat{V}_{\text{AN}}(\pi) - V(\pi)|,$

(A44) $\leq 2\sup_{\pi\in\Pi}|\hat{V}_{\text{AN}}(\pi) - \hat{V}_{\text{IPW}}(\pi)| + 2\sup_{\pi\in\Pi}|\hat{V}_{\text{IPW}}(\pi) - V(\pi)|,$

where (A42) uses the fact that $\hat{\pi}_{\text{AN}}$ maximizes $\hat{V}_{\text{AN}}$ over $\Pi$.

The second term of (A44) is bounded in Theorem 2.1 of ref. [17], and so we are interested in bounding the first term. By the definitions of V ˆ IPW and V ˆ AN , that first term of (A44) is

(A45) $= \sup_{\pi\in\Pi}\left|\frac{\displaystyle\sum_{k=1}^{n} Y_k\,\frac{1-P(I_k=\pi(X_k)\mid X_k)}{P(I_k=\pi(X_k)\mid X_k)}\cdot\frac{\mathbf{1}\{I_k=\pi(X_k)\}}{P(I_k=\pi(X_k)\mid X_k)}}{\displaystyle\sum_{k=1}^{n}\frac{1-P(I_k=\pi(X_k)\mid X_k)}{P(I_k=\pi(X_k)\mid X_k)}\cdot\frac{\mathbf{1}\{I_k=\pi(X_k)\}}{P(I_k=\pi(X_k)\mid X_k)}}\right|\cdot\left|1-\frac{1}{n}\sum_{k=1}^{n}\frac{\mathbf{1}\{I_k=\pi(X_k)\}}{P(I_k=\pi(X_k)\mid X_k)}\right|,$

(A46) $\leq M\sup_{\pi\in\Pi}\left|1-\frac{1}{n}\sum_{k=1}^{n}\frac{\mathbf{1}\{I_k=\pi(X_k)\}}{P(I_k=\pi(X_k)\mid X_k)}\right|,$

by a calculation analogous to the one in (A20). Combining (A44) and (A46) gives

(A47) $\mathbb{E}[V(\pi) - V(\hat{\pi}_{\text{AN}})] \leq M\,\mathbb{E}\left[\sup_{\pi\in\Pi}\left|1-\frac{1}{n}\sum_{k=1}^{n}\frac{\mathbf{1}\{I_k=\pi(X_k)\}}{P(I_k=\pi(X_k)\mid X_k)}\right|\right] + O\left(\frac{M}{\delta}\sqrt{\frac{\mathrm{VC}(\Pi)}{n}}\right).$

Finally, the expectation in the first term of (A47) can be bounded by using standard empirical process tools; in particular, Lemmas A.1 and A.4 of ref. [17] imply that it is $O\left(\frac{1}{\delta}\sqrt{\frac{\mathrm{VC}(\Pi)}{n}}\right)$, finishing the proof.□

Figure A1

MSE of μ ˆ λ as a function of λ for the four problem specifications of the Swiss municipality data.

B Further experiments

B.1 RMSE plots for Swiss data

The plots in Figure 3 show the RMSE of the various problem specifications of our survey sampling simulations in Section 5 as a function of λ . Interestingly, the optimal choices of λ are all between 0 and 1, suggesting that the Hájek estimator is actually “over-normalizing” in this case.
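For concreteness, the affinely normalized family can be evaluated on a grid of $\lambda$ values as follows. This is our own minimal sketch, not code from the paper: the array names and the synthetic design are assumptions, with $\hat{\mu}_\lambda = \hat{S}/((1-\lambda)n + \lambda\hat{n})$ for Horvitz–Thompson numerator $\hat{S}$ and inverse-weight sum $\hat{n}$.

```python
import numpy as np

def mu_lambda(y, I, p, lam):
    """Affinely normalized IPW estimate: S_hat / ((1 - lam) * n + lam * n_hat)."""
    n = len(y)
    S_hat = np.sum(y * I / p)   # Horvitz-Thompson numerator
    n_hat = np.sum(I / p)       # sum of inverse probability weights
    return S_hat / ((1 - lam) * n + lam * n_hat)

# Synthetic design (illustrative only): responses correlated with probabilities.
rng = np.random.default_rng(0)
n = 1000
p = rng.uniform(0.2, 0.8, size=n)
y = 1.0 + p + rng.normal(scale=0.5, size=n)
I = (rng.uniform(size=n) < p).astype(float)

# Sweep lambda; lam = 0 recovers Horvitz-Thompson, lam = 1 recovers Hajek.
grid = np.linspace(0.0, 1.5, 7)
estimates = [mu_lambda(y, I, p, lam) for lam in grid]
```

Plotting `estimates` (or, over Monte Carlo replications, their MSE) against `grid` reproduces the shape of curves like those in Figure A1.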

B.2 Overlap and bias

In this section, we further study the bias of μ ˆ λ and μ ˆ AN in small samples. To do this, we generate data from the normal model (18) with μ = 0.1 and θ = 0.9 , and attempt to estimate the mean μ of Y k ( 1 ) for varying values of n , the sample size, and also for varying values of δ , the cut-off to which the p k are truncated (i.e., when δ is smaller, we allow smaller p k , and thus worse overlap). Estimates for the bias of μ ˆ λ with λ = 0.25 , 0.5 , and 0.75, as well as for μ ˆ AN under these conditions are shown in Figure A2. These experiments confirm the expected behavior of the bias based on theory: the bias of μ ˆ λ decreases at approximately a 1 / n rate, and is larger in the presence of worse overlap. Thus, our proposal to select λ by minimizing the asymptotic variance can also be expected to reduce MSE as long as the data have good overlap. The bias of μ ˆ AN shows similar behavior and also appears to decrease more quickly in the presence of better overlap.
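The bias estimates just described can be computed with a short Monte Carlo loop. Since the normal model (18) is defined in the main text and not reproduced here, the sketch below uses a generic stand-in design (our own assumption) with propensities truncated to $[\delta, 1-\delta]$, purely to illustrate the computation.

```python
import numpy as np

def mu_lambda(y, I, p, lam):
    """Affinely normalized IPW estimate of the finite-population mean."""
    n = len(y)
    return np.sum(y * I / p) / ((1 - lam) * n + lam * np.sum(I / p))

def bias_estimate(n, delta, lam, reps=500, mu=0.1, seed=0):
    """Monte Carlo bias of mu_hat_lambda under a stand-in design where
    probabilities are truncated to [delta, 1 - delta] (worse overlap for
    smaller delta). Returns the average estimation error over replications."""
    rng = np.random.default_rng(seed)
    errs = np.empty(reps)
    for r in range(reps):
        p = np.clip(rng.beta(2, 2, size=n), delta, 1 - delta)
        y = mu + 0.9 * (p - p.mean()) + rng.normal(scale=0.5, size=n)
        I = (rng.uniform(size=n) < p).astype(float)
        errs[r] = mu_lambda(y, I, p, lam) - y.mean()  # estimand: mean of y
    return errs.mean()
```

Varying `n` and `delta` on a grid and plotting `abs(bias_estimate(...))` produces panels analogous to Figure A2.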

Figure A2

The absolute value of the biases of μ ˆ λ for λ = 0.25 , 0.5, and 0.75 and μ ˆ AN as n , the sample size, and δ , the lower bound on the overlap, vary in the normal model (18). We see that bias decreases with n at roughly a 1 / n rate, as the theory suggests, and that overlap plays an important role in the bias: worse overlap consistently leads to larger bias.

C Joint adaptive normalization in ATE estimation

This appendix explores some subtleties of how the adaptive normalizations should be chosen for ATE estimation. In particular, the estimator in (16) is equivalent to selecting λ in (1) to separately minimize the asymptotic variance of the two mean estimates. However, this “plug-in” use of adaptive normalization ignores the fact that the asymptotic variance of an adaptively normalized ATE estimator depends not only on the variances of the two mean estimators but also their covariance.

In what follows, we jointly choose the normalizations of each mean estimator to minimize the asymptotic variance of estimating the combined estimand τ . Specifically, we define

$$\hat{\mu}_{1,\lambda_1} = \frac{\hat{S}_1}{(1-\lambda_1)n + \lambda_1\hat{n}_1}, \qquad \hat{\mu}_{0,\lambda_0} = \frac{\hat{S}_0}{(1-\lambda_0)n + \lambda_0\hat{n}_0},$$

where

$$\hat{S}_1 = \sum_{k=1}^{n}\frac{Y_k(1)I_k}{p_k}, \quad \hat{S}_0 = \sum_{k=1}^{n}\frac{Y_k(0)(1-I_k)}{1-p_k}, \quad \hat{n}_1 = \sum_{k=1}^{n}\frac{I_k}{p_k}, \quad \hat{n}_0 = \sum_{k=1}^{n}\frac{1-I_k}{1-p_k}.$$

Combining these estimators leads to the estimator τ ˆ λ 1 , λ 0 = μ ˆ 1 , λ 1 μ ˆ 0 , λ 0 of τ . To follow the program of Section 3, we seek to choose λ 1 and λ 0 to minimize the following asymptotic variance.

Theorem A1

Assuming that both Y k ( 1 ) and Y k ( 0 ) satisfy Assumption 1, we have

$$\sqrt{n}(\hat{\tau}_{\lambda_1,\lambda_0} - \tau) \stackrel{d}{\to} \mathcal{N}(0, \sigma^2),$$

where

$$\sigma^2 = \lambda_1^2\mu_1^2\pi_1 + \lambda_0^2\mu_0^2\pi_0 - 2\lambda_1\mu_1(T_1+\mu_0) - 2\lambda_0\mu_0(\mu_1+T_0) + 2\lambda_1\lambda_0\mu_1\mu_0 + C,$$

for

$$\pi_1 = \mathbb{E}\left[\frac{1-p_k}{p_k}\right], \quad \pi_0 = \mathbb{E}\left[\frac{p_k}{1-p_k}\right], \quad T_1 = \mathbb{E}\left[Y_k(1)\frac{1-p_k}{p_k}\right], \quad T_0 = \mathbb{E}\left[Y_k(0)\frac{p_k}{1-p_k}\right],$$

and C denotes terms that do not depend on either λ 1 or λ 0 .

As with our other CLTs, this is a routine delta method calculation.

Proof

Define the vector

$$\hat{\beta} = \left(\frac{1}{n}\sum_{k=1}^{n}\frac{Y_k(1)I_k}{p_k},\; \frac{1}{n}\sum_{k=1}^{n}\frac{Y_k(0)(1-I_k)}{1-p_k},\; \frac{1}{n}\sum_{k=1}^{n}\frac{I_k}{p_k},\; \frac{1}{n}\sum_{k=1}^{n}\frac{1-I_k}{1-p_k}\right)$$

with mean $\beta = (\mu_1, \mu_0, 1, 1)$. By the same arguments as in Appendix A.1, $\hat{\beta}$ satisfies the usual CLT for the mean of i.i.d. random variables, and so we can apply the delta method with the function $f(x,y,z,w) = \frac{x}{1-\lambda_1+\lambda_1 z} - \frac{y}{1-\lambda_0+\lambda_0 w}$. The relevant gradient at $\beta$ is $(1, -1, -\lambda_1\mu_1, \lambda_0\mu_0)$, and so the asymptotic variance of $\hat{\tau}_{\lambda_1,\lambda_0}$ is

$$\lambda_1^2\mu_1^2\,\mathrm{var}\left(\frac{I_k}{p_k}\right) + \lambda_0^2\mu_0^2\,\mathrm{var}\left(\frac{1-I_k}{1-p_k}\right) - 2\lambda_1\mu_1\left[\mathrm{cov}\left(\frac{I_k}{p_k}, \frac{Y_k(1)I_k}{p_k}\right) - \mathrm{cov}\left(\frac{I_k}{p_k}, \frac{Y_k(0)(1-I_k)}{1-p_k}\right)\right] + 2\lambda_0\mu_0\left[\mathrm{cov}\left(\frac{1-I_k}{1-p_k}, \frac{Y_k(1)I_k}{p_k}\right) - \mathrm{cov}\left(\frac{1-I_k}{1-p_k}, \frac{Y_k(0)(1-I_k)}{1-p_k}\right)\right] - 2\lambda_1\lambda_0\mu_1\mu_0\,\mathrm{cov}\left(\frac{I_k}{p_k}, \frac{1-I_k}{1-p_k}\right) + C,$$

where $C$ denotes terms that do not depend on either $\lambda_1$ or $\lambda_0$.

Finally, we can compute

$$\mathrm{var}\left(\frac{I_k}{p_k}\right) = \pi_1, \qquad \mathrm{var}\left(\frac{1-I_k}{1-p_k}\right) = \pi_0$$

and

$$\mathrm{cov}\left(\frac{I_k}{p_k}, \frac{Y_k(1)I_k}{p_k}\right) = T_1, \qquad \mathrm{cov}\left(\frac{I_k}{p_k}, \frac{Y_k(0)(1-I_k)}{1-p_k}\right) = -\mu_0,$$

and

$$\mathrm{cov}\left(\frac{1-I_k}{1-p_k}, \frac{Y_k(1)I_k}{p_k}\right) = -\mu_1, \qquad \mathrm{cov}\left(\frac{1-I_k}{1-p_k}, \frac{Y_k(0)(1-I_k)}{1-p_k}\right) = T_0,$$

and finally, $\mathrm{cov}\left(\frac{I_k}{p_k}, \frac{1-I_k}{1-p_k}\right) = -1$. Substituting these in gives the result.□
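These moment identities are easy to sanity-check by simulation. The sketch below is our own (not from the paper) and numerically verifies $\mathrm{var}(I_k/p_k) = \pi_1$, $\mathrm{cov}(I_k/p_k,\, Y_k(0)(1-I_k)/(1-p_k)) = -\mu_0$, and $\mathrm{cov}(I_k/p_k,\, (1-I_k)/(1-p_k)) = -1$ on synthetic data with known propensities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
mu0 = 2.0
p = rng.uniform(0.2, 0.8, size=n)            # known propensities
I = (rng.uniform(size=n) < p).astype(float)  # Bernoulli(p_k) indicators
y0 = mu0 + rng.normal(size=n)                # control potential outcomes

w1 = I / p                # treated inverse weights, mean 1
w0 = (1 - I) / (1 - p)    # control inverse weights, mean 1

pi1 = np.mean((1 - p) / p)                # pi_1 = E[(1 - p_k)/p_k]
var_w1 = np.var(w1)                       # should match pi1
cov_w1_y0w0 = np.cov(w1, y0 * w0)[0, 1]   # should match -mu0 (products vanish: I(1-I)=0)
cov_w1_w0 = np.cov(w1, w0)[0, 1]          # should match -1
```

The key structural fact is that $I_k(1-I_k) = 0$, so every cross product has expectation zero and each covariance reduces to minus the product of the means.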

We now minimize the asymptotic variance of Theorem A1 in λ 1 and λ 0 . The first-order stationary conditions tell us that the optimal pair ( λ 1 , λ 0 ) will satisfy

$$2\lambda_1\mu_1^2\pi_1 - 2\mu_1(T_1+\mu_0) + 2\lambda_0\mu_1\mu_0 = 0 \;\Longrightarrow\; \lambda_1 = \frac{T_1+\mu_0-\lambda_0\mu_0}{\mu_1\pi_1}$$

and

$$2\lambda_0\mu_0^2\pi_0 - 2\mu_0(\mu_1+T_0) + 2\lambda_1\mu_1\mu_0 = 0 \;\Longrightarrow\; \lambda_0 = \frac{T_0+\mu_1-\lambda_1\mu_1}{\mu_0\pi_0}.$$

As mentioned earlier, we replace T 0 , T 1 , π 0 , and π 1 with IPW estimates T ˆ 0 , T ˆ 1 , π ˆ 0 , and π ˆ 1 , and then jointly estimate λ 0 , λ 1 , μ 0 , and μ 1 by solving the system of fixed point equations:

(A48) $\hat{\lambda}_{1,\text{AN}} = \frac{\hat{T}_1+\hat{\mu}_{0,\text{AN}}-\hat{\lambda}_{0,\text{AN}}\hat{\mu}_{0,\text{AN}}}{\hat{\mu}_{1,\text{AN}}\hat{\pi}_1}, \qquad \hat{\lambda}_{0,\text{AN}} = \frac{\hat{T}_0+\hat{\mu}_{1,\text{AN}}-\hat{\lambda}_{1,\text{AN}}\hat{\mu}_{1,\text{AN}}}{\hat{\mu}_{0,\text{AN}}\hat{\pi}_0}$

and

(A49) $\hat{\mu}_{1,\text{AN}} = \frac{\hat{S}_1}{(1-\hat{\lambda}_{1,\text{AN}})n + \hat{\lambda}_{1,\text{AN}}\hat{n}_1}, \qquad \hat{\mu}_{0,\text{AN}} = \frac{\hat{S}_0}{(1-\hat{\lambda}_{0,\text{AN}})n + \hat{\lambda}_{0,\text{AN}}\hat{n}_0}.$

(Note that we have slightly overloaded the notation μ ˆ 1 , AN , which is used differently in Section 4.2. In the entirety of this appendix, the definition mentioned earlier is the one used.)

To actually solve this system, we focus first on (A48), noting that it is linear in λ ˆ 1 , AN μ ˆ 1 , AN and λ ˆ 0 , AN μ ˆ 0 , AN and has solution

(A50) $\hat{\lambda}_{1,\text{AN}}\hat{\mu}_{1,\text{AN}} = \frac{\hat{T}_1\hat{\pi}_0+\hat{\mu}_{0,\text{AN}}\hat{\pi}_0-\hat{T}_0-\hat{\mu}_{1,\text{AN}}}{\hat{\pi}_0\hat{\pi}_1-1}, \qquad \hat{\lambda}_{0,\text{AN}}\hat{\mu}_{0,\text{AN}} = \frac{\hat{T}_0\hat{\pi}_1+\hat{\mu}_{1,\text{AN}}\hat{\pi}_1-\hat{T}_1-\hat{\mu}_{0,\text{AN}}}{\hat{\pi}_0\hat{\pi}_1-1}.$

Then, we rewrite (A49) as follows:

(A51) $\hat{\mu}_{1,\text{AN}} = \hat{\lambda}_{1,\text{AN}}\hat{\mu}_{1,\text{AN}}\left(1-\frac{\hat{n}_1}{n}\right)+\frac{\hat{S}_1}{n}, \qquad \hat{\mu}_{0,\text{AN}} = \hat{\lambda}_{0,\text{AN}}\hat{\mu}_{0,\text{AN}}\left(1-\frac{\hat{n}_0}{n}\right)+\frac{\hat{S}_0}{n}.$

By using (A50), we can conclude that μ ˆ 1 , AN and μ ˆ 0 , AN must satisfy the system of linear equations:

$$\begin{pmatrix} 1+\dfrac{1-\hat{n}_1/n}{\hat{\pi}_0\hat{\pi}_1-1} & -\dfrac{\hat{\pi}_0(1-\hat{n}_1/n)}{\hat{\pi}_0\hat{\pi}_1-1} \\[2ex] -\dfrac{\hat{\pi}_1(1-\hat{n}_0/n)}{\hat{\pi}_0\hat{\pi}_1-1} & 1+\dfrac{1-\hat{n}_0/n}{\hat{\pi}_0\hat{\pi}_1-1} \end{pmatrix}\begin{pmatrix}\hat{\mu}_{1,\text{AN}}\\ \hat{\mu}_{0,\text{AN}}\end{pmatrix} = \begin{pmatrix} \dfrac{\hat{S}_1}{n}+\dfrac{(\hat{T}_1\hat{\pi}_0-\hat{T}_0)(1-\hat{n}_1/n)}{\hat{\pi}_0\hat{\pi}_1-1} \\[2ex] \dfrac{\hat{S}_0}{n}+\dfrac{(\hat{T}_0\hat{\pi}_1-\hat{T}_1)(1-\hat{n}_0/n)}{\hat{\pi}_0\hat{\pi}_1-1} \end{pmatrix}.$$

Finally, solving these equations recovers $\hat{\mu}_{1,\text{AN}}$ and $\hat{\mu}_{0,\text{AN}}$, and hence $\hat{\tau}_{\text{AN}^2} = \hat{\mu}_{1,\text{AN}} - \hat{\mu}_{0,\text{AN}}$, using any standard linear solver. We refer to the resulting estimator as $\hat{\tau}_{\text{AN}^2}$ to distinguish it from $\hat{\tau}_{\text{AN}}$.
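The solve just described takes only a few lines. The sketch below is our own illustrative implementation, not code from the paper: the IPW plug-ins for $\hat{T}_1, \hat{T}_0$ and the direct sample averages for $\hat{\pi}_1, \hat{\pi}_0$ (possible here because the $p_k$ are known) are assumptions about how the plug-in estimates are formed.

```python
import numpy as np

def tau_an2(y1, y0, I, p):
    """Jointly adaptively normalized ATE estimate via the 2x2 linear system.
    y1, y0: outcomes under treatment/control (zero where unobserved);
    I: treatment indicators; p: known treatment probabilities."""
    n = len(I)
    S1, S0 = np.sum(y1 * I / p), np.sum(y0 * (1 - I) / (1 - p))
    n1, n0 = np.sum(I / p), np.sum((1 - I) / (1 - p))
    # IPW plug-ins for T_1 = E[Y(1)(1-p)/p] and T_0 = E[Y(0)p/(1-p)]
    T1 = np.mean(y1 * I * (1 - p) / p**2)
    T0 = np.mean(y0 * (1 - I) * p / (1 - p)**2)
    pi1, pi0 = np.mean((1 - p) / p), np.mean(p / (1 - p))
    D = pi0 * pi1 - 1
    a1, a0 = 1 - n1 / n, 1 - n0 / n
    # Assemble and solve the 2x2 system for (mu1_AN, mu0_AN)
    A = np.array([[1 + a1 / D, -pi0 * a1 / D],
                  [-pi1 * a0 / D, 1 + a0 / D]])
    b = np.array([S1 / n + (T1 * pi0 - T0) * a1 / D,
                  S0 / n + (T0 * pi1 - T1) * a0 / D])
    mu1, mu0 = np.linalg.solve(A, b)
    return mu1 - mu0
```

Note that $\hat{\pi}_0\hat{\pi}_1 - 1 > 0$ by Cauchy–Schwarz whenever the $p_k$ are non-degenerate, so the system is well posed.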

Somewhat surprisingly, τ ˆ AN 2 does not always improve on the MSE of τ ˆ AN . The results of a simulation are shown in Figure A3. Both the normal model and power law model simulations suggest that τ ˆ AN 2 and τ ˆ AN are preferable in different regimes. This suggests that, although τ ˆ AN 2 is estimating the optimal values of λ 1 and λ 0 , the complicated functional form of the data-dependent estimates of the optimal values sometimes inflates the variance. Although τ ˆ AN is estimating a sub-optimal pair of values λ 1 and λ 0 , its estimates of those values are lower variance and thus sometimes lead to a lower variance estimator. Examining the results in Figure A3, it seems that τ ˆ AN 2 is preferable to τ ˆ AN in the presence of strong correlation between Y k and p k , but not otherwise.

Figure A3

Comparison of τ ˆ AN 2 and τ ˆ AN in the normal model with n = 500 , μ = 1 (left) and power law model with n = 500 (right). In both models, we see that τ ˆ AN 2 is preferable to τ ˆ AN when there is a strong correlation between responses and treatment probabilities.

References

[1] Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47(260):663–85. doi:10.1080/01621459.1952.10483446.

[2] Basu D. An essay on the logical foundations of survey sampling, Part I. In: Godambe V, Sprott D, editors. Foundations of statistical inference. Toronto, Canada: Holt, Rinehart and Winston; 1971.

[3] Godambe VP, Joshi VM. Admissibility and Bayes estimation in sampling finite populations. I. Ann Math Stat. 1965;36(6):1707–22. doi:10.1214/aoms/1177699799.

[4] Särndal CE, Swensson B, Wretman J. Model assisted survey sampling. Springer Science and Business Media; 2003.

[5] Trotter HF, Tukey JW. Conditional Monte Carlo for normal samples. In: Symposium on Monte Carlo Methods. New York: Wiley; 1954.

[6] Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89(427):846–66. doi:10.1080/01621459.1994.10476818.

[7] Calonico S, Cattaneo MD, Farrell MH. On the effect of bias estimation on coverage accuracy in nonparametric inference. J Am Stat Assoc. 2018;113(522):767–79. doi:10.1080/01621459.2017.1285776.

[8] Hall P. Effect of bias estimation on coverage accuracy of bootstrap confidence intervals for a probability density. Ann Stat. 1992;20:675–94. doi:10.1214/aos/1176348651.

[9] Hammersley JM, Handscomb DC. Monte Carlo methods. New York: Springer; 1964. doi:10.1007/978-94-009-5819-7.

[10] Owen AB. Monte Carlo theory, methods and examples; 2013. https://artowen.su.domains/mc/.

[11] Glynn PW, Szechtman R. Some new perspectives on the method of control variates. In: Monte Carlo and Quasi-Monte Carlo Methods 2000. Springer; 2002. p. 27–49. doi:10.1007/978-3-642-56046-0_3.

[12] Firth D. On improved estimation for importance sampling. Brazilian J Probab Stat. 2011;25(3):437–43. doi:10.1214/11-BJPS155.

[13] Hesterberg T. Weighted average importance sampling and defensive mixture distributions. Technometrics. 1995;37(2):185–94. doi:10.1080/00401706.1995.10484303.

[14] Chen SX, Leung DHY, Qin J. Improving semiparametric estimation by using surrogate data. J R Stat Soc Ser B (Stat Methodol). 2008;70(4):803–23. doi:10.1111/j.1467-9868.2008.00662.x.

[15] Qin J, Zhang B, Leung DHY. Empirical likelihood in missing data problems. J Am Stat Assoc. 2009;104(488):1492–503. doi:10.1198/jasa.2009.tm08163.

[16] Rotnitzky A, Lei Q, Sued M, Robins JM. Improved double-robust estimation in missing data and causal inference models. Biometrika. 2012;99(2):439–56. doi:10.1093/biomet/ass013.

[17] Kitagawa T, Tetenov A. Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica. 2018;86(2):591–616. doi:10.1920/wp.cem.2017.2417.

[18] Swaminathan A, Joachims T. The self-normalized estimator for counterfactual learning. In: Advances in Neural Information Processing Systems; 2015. p. 3231–9.

[19] Athey S, Wager S. Policy learning with observational data. Econometrica. 2021;89(1):133–61. doi:10.3982/ECTA15732.

[20] Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika. 2009;96(1):187–99. doi:10.1093/biomet/asn055.

[21] Yang S, Ding P. Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika. 2018;105(2):487–93. doi:10.1093/biomet/asy008.

[22] Ma X, Wang J. Robust inference using inverse probability weighting. J Am Stat Assoc. 2020;115(532):1851–60. doi:10.1080/01621459.2019.1660173.

[23] Hong H, Leung MP, Li J. Inference on finite-population treatment effects under limited overlap. Econometr J. 2020;23(1):32–47. doi:10.1093/ectj/utz017.

[24] Delevoye A, Sävje F. Consistency of the Horvitz–Thompson estimator under general sampling and experimental designs. J Stat Plan Inference. 2020;207:190–7. doi:10.1016/j.jspi.2019.12.002.

[25] Robinson PM. On the convergence of the Horvitz–Thompson estimator. Australian J Stat. 1982;24(2):234–8. doi:10.1111/j.1467-842X.1982.tb00829.x.

[26] Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica. 2003;71(4):1161–89. doi:10.3386/t0251.

[27] Hansen BE, Lee S. Inference for iterated GMM under misspecification. Econometrica. 2021;89(3):1419–47. doi:10.3982/ECTA16274.

[28] Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge, England: Cambridge University Press; 2015. doi:10.1017/CBO9781139025751.

[29] Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998;66(2):315–31. doi:10.2307/2998560.

[30] Tan Z. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika. 2010;97(3):661–82. doi:10.1093/biomet/asq035.

[31] Cassel CM, Särndal CE, Wretman JH. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika. 1976;63(3):615–20. doi:10.1093/biomet/63.3.615.

[32] Little RJA. Estimating a finite population mean from unequal probability samples. J Am Stat Assoc. 1983;78(383):596–604. doi:10.1080/01621459.1983.10478016.

[33] Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, et al. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–C68. doi:10.1111/ectj.12097.

[34] Imbens GW. Nonparametric estimation of average treatment effects under exogeneity: a review. Rev Econ Stat. 2004;86(1):4–29. doi:10.3386/t0294.

[35] Lin W. Agnostic notes on regression adjustments to experimental data: reexamining Freedman's critique. Ann Appl Stat. 2013;7(1):295–318. doi:10.1214/12-AOAS583.

[36] Freedman DA. On regression adjustments to experimental data. Adv Appl Math. 2008;40(2):180–93. doi:10.1016/j.aam.2006.12.003.

[37] Manski CF. Statistical treatment rules for heterogeneous populations. Econometrica. 2004;72(4):1221–46. doi:10.1920/wp.cem.2003.0303.

[38] Tillé Y, Matei A. sampling: survey sampling; 2021. R package version 2.9. https://CRAN.R-project.org/package=sampling.

[39] Hankin M, Chan D, Perry M. A comparison of causal inference methods for estimating sales lift; 2020. https://research.google/pubs/pub49507.

[40] Liu L, Mukherjee R, Robins JM. On nearly assumption-free tests of nominal confidence interval coverage for causal parameters estimated by machine learning. Stat Sci. 2020;35(3):518–39. doi:10.1214/20-STS786.

[41] Bitler MP, Gelbach JB, Hoynes HW. What mean impacts miss: distributional effects of welfare reform experiments. Am Econ Rev. 2006;96(4):988–1012. doi:10.3386/w10121.

[42] Firpo S. Efficient semiparametric estimation of quantile treatment effects. Econometrica. 2007;75(1):259–76. doi:10.1111/j.1468-0262.2007.00738.x.

[43] Cattaneo MD. Efficient semiparametric estimation of multi-valued treatment effects under ignorability. J Econom. 2010;155(2):138–54. doi:10.1016/j.jeconom.2009.09.023.

[44] Chin A, Eckles D, Ugander J. Evaluating stochastic seeding strategies in networks. Manag Sci. 2021;68:1591–2376. doi:10.1287/mnsc.2021.3963.

[45] Schnabel T, Swaminathan A, Singh A, Chandak N, Joachims T. Recommendations as treatments: debiasing learning and evaluation. In: International Conference on Machine Learning. PMLR; 2016. p. 1670–9.

[46] Oosterhuis H, de Rijke M. Unifying online and counterfactual learning to rank: a novel counterfactual estimator that effectively utilizes online interventions. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining; 2021. p. 463–71. doi:10.1145/3437963.3441794.

[47] Hadad V, Hirshberg DA, Zhan R, Wager S, Athey S. Confidence intervals for policy evaluation in adaptive experiments; 2019. arXiv:1911.02768.

[48] Bibaut A, Chambaz A, Dimakopoulou M, Kallus N, van der Laan M. Post-contextual-bandit inference; 2021. arXiv:2106.00418.

Received: 2022-03-17
Revised: 2022-10-24
Accepted: 2022-10-28
Published Online: 2023-02-08

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
