Skip to content
BY 4.0 license Open Access Published by De Gruyter December 9, 2022

Doubly robust estimators for generalizing treatment effects on survival outcomes from randomized controlled trials to a target population

  • Dasom Lee , Shu Yang EMAIL logo and Xiaofei Wang

Abstract

In the presence of heterogeneity between the randomized controlled trial (RCT) participants and the target population, evaluating the treatment effect solely based on the RCT often leads to biased quantification of the real-world treatment effect. To address the problem of lack of generalizability for the treatment effect estimated by the RCT sample, we leverage observational studies with large samples that are representative of the target population. This article concerns evaluating treatment effects on survival outcomes for a target population and considers a broad class of estimands that are functionals of treatment-specific survival functions, including differences in survival probability and restricted mean survival times. Motivated by two intuitive but distinct approaches, i.e., imputation based on survival outcome regression and weighting based on inverse probability of sampling, censoring, and treatment assignment, we propose a semiparametric estimator through the guidance of the efficient influence function. The proposed estimator is doubly robust in the sense that it is consistent for the target population estimands if either the survival model or the weighting model is correctly specified and is locally efficient when both are correct. In addition, as an alternative to parametric estimation, we employ the nonparametric method of sieves for flexible and robust estimation of the nuisance functions and show that the resulting estimator retains the root- n consistency and efficiency, the so-called rate-double robustness. Simulation studies confirm the theoretical properties of the proposed estimator and show that it outperforms competitors. We apply the proposed method to estimate the effect of adjuvant chemotherapy on survival in patients with early-stage resected non-small cell lung cancer.

MSC 2010: 62D20; 62N02

1 Introduction

In clinical trials or biomedical studies, time-to-event or survival endpoints, such as the time from treatment initiation to death, have been commonly used to evaluate the treatment effect. When estimating the treatment effect, randomized controlled trials (RCTs) are regarded as the gold standard since randomization reduces the effect of confounding variables. However, RCTs often suffer from a lack of generalizability or external validity. Specifically, due to restrictive inclusion and exclusion criteria for enrollment or additional concerns from patients and physicians, RCTs often do not recruit enough participants who represent the real-world patient population, resulting in the covariate distribution of the RCT sample being different from that of the target real-world population. In the presence of such heterogeneity, evaluating the treatment effect based solely on the RCT sample leads to biased quantification of the real-world treatment effect. As a complement of the RCT sample, observational studies have been widely used in comparative effectiveness research, as large samples that are representative of the target population can be studied at a relatively low cost.

Several recent works have proposed integrative methods to generalize findings from the RCT to the target population by leveraging observational studies [15]. Most existing methods focus on directly modeling the probability of participating in the trial, i.e., the sampling score that is analogous to the treatment propensity score. A widely used approach involves inverse probability of sampling weighting [IPSW; 1,6], which can be used to estimate weight-adjusted survival curves [7,8]. However, these IPSW-based estimators are unstable under extreme sampling scores. Alternatively, Lee et al. [5] proposed calibration weighting that enforces covariate balance between the RCT and observational study (OS) without explicitly modeling the sampling score. Recently, Colnet et al. [9] provided a comprehensive review of various novel methods combining complementary features of RCTs and observational studies. However, most of these methods focus on continuous and binary outcomes, and generalization of the findings from RCTs for survival outcomes to the target population is less actively studied.

In this article, we consider a broad class of estimands defined as a functional of treatment-specific survival functions, including differences in survival probability at landmark times and restricted mean survival time (RMST). Various estimators can be constructed to adjust for the unrepresentativeness or selection bias of the RCT sample. One approach relies on fitting conditional survival outcome models and then averaging over the covariate distribution of the observational sample, similar to Chen and Tsiatis [10]. Another common approach is to use weighting [7,11] to adjust for the imbalance between the RCT sample and the observational sample. Instead of direct modeling of sampling score as in IPSW, one can consider the more stable approach that calibrates covariate distributions between the RCT and the observational sample [5]. Motivated by these two intuitive but distinct approaches, we propose improved estimators for survival outcomes under the guidance of the efficient influence function (EIF), which involve survival outcome regression (OR) and weighting based on inverse probability of treatment, censoring, and sampling, simultaneously. The proposed estimator is doubly robust in the sense that it is consistent for the target population estimand if either the survival model or the weighting model is correctly specified, and is locally efficient when both are correct. In addition, to cope with possible misspecification of nuisance functions, we consider the method of sieves [12], which adds great flexibility and robustness to the proposed estimators, meanwhile retaining the root- n consistency.

The remainder of this article is organized as follows. In Section 2, we formalize the basic causal inference framework for survival outcomes. In Section 3, we introduce two direct estimators based on identification formulas, and in Section 4, we propose improved estimators and describe the corresponding asymptotic properties. The finite sample performance of the proposed estimators is assessed via simulation studies in Section 5. Applying the proposed estimators, we analyze the effect of adjuvant chemotherapy on the survival of lung cancer patients with data from an RCT and an OS in Section 6. Section 7 presents the discussion and concluding remarks. All proofs are provided in the Appendix.

2 Estimands, observed data, and assumptions

Suppose we are interested in comparing the effectiveness of two treatments. Let A be the binary treatment assignment, A { 0 , 1 } . Following the potential outcomes framework [13,14], let T a be the potential survival time if a subject received the treatment A = a , and S a ( t ) and λ a ( t ) be the corresponding survival and hazard functions, i.e., S a ( t ) = P ( T a t ) and λ a ( t ) = lim h 0 h 1 P ( t T a t + h ) / P ( T a t ) . Under the proportional hazards assumption, a widely used measure to characterize the treatment effect is hazard ratio (HR), i.e., λ 1 ( t ) / λ 0 ( t ) being a constant. However, the interpretation of HRs is challenging especially when the proportionality assumption is violated [15,16].

Alternatively, we define the average treatment effect (ATE) measure θ τ as a function of treatment-specific survival curves, θ τ = Ψ τ ( S 1 ( t ) , S 0 ( t ) ) , where τ is a prespecified constant. This formulation of the ATE includes a broad class of estimands that are favored in survival analysis [17]. For example, θ τ = S 1 ( τ ) S 0 ( τ ) is a simple survival difference at a fixed time τ , and θ τ = 0 τ { S 1 ( t ) S 0 ( t ) } d t is the RMST difference up to τ . The ratio of restricted mean time loss and the difference of the median survival can also be represented with the appropriate choice of Ψ τ ( ) .

Under the stable unit treatment value assumption, the survival time T is the realization of the potential outcomes, i.e., T = T 1 A + T 0 ( 1 A ) . Let C be the censoring time. In the presence of right censoring, the survival time T is not observed for all subjects; instead, we observe U = T C and Δ = I ( T C ) , where represents the minimum of two values, and I ( ) is an indicator function. Let X be a p -dimensional vector of pre-treatment covariates. Also, let δ denotes the binary indicator of RCT participation, and let δ ˜ denotes the binary indicator of OS participation. We consider a super-population framework assuming that an RCT sample of size n and an observational sample of size m are sampled from the target population. From the RCT sample, we observe { U i , Δ i , A i , X i , δ i = 1 , δ ˜ i = 0 } from i = 1 , , n independent and identically distributed subjects. For the observational sample, it is common that only the covariates information is available, i.e., { X i , δ i = 0 , δ ˜ i = 1 } from i = n + 1 , , n + m independent and identically distributed subjects. The sampling mechanism and data structure are illustrated in Figure 1. We assume independence between the RCT and the observational sample, which holds if two separate studies are conducted by independent researchers, the target patient population is sufficiently large, or the patients are enrolled in two separate time periods.

Figure 1 
               Illustration of the data structure of the RCT sample and the OS sample within the target super-population framework.
Figure 1

Illustration of the data structure of the RCT sample and the OS sample within the target super-population framework.

Let S a ( t , X ) = S ( t X , A = a , δ = 1 ) be the treatment-specific conditional survival curves for a , δ { 0 , 1 } . Also, define the treatment propensity score π A ( X ) = P ( A = 1 X , δ = 1 ) and the sampling score π δ ( X ) = P ( δ = 1 X ) . To identify the ATE from the observed data, we make the following assumptions:

Assumption 1

(Ignorability and positivity of trial treatment assignment).

(i) { T 0 , T 1 } A ( X , δ = 1 ) ; and (ii) 0 < π A ( X ) < 1 with probability 1.

Assumption 2

(Conditional survival exchangeablity and positivity of trial participation).

(i) S a ( t , X ) = P ( T a > t X ) , a { 0 , 1 } ; and (ii) π δ ( X ) > 0 with probability 1.

Assumption 3

(Noninformative censoring conditional on covariates and treatment)

{ T 1 , T 0 } C ( X , A , δ = 1 ) , which also implies T C ( X , A , δ = 1 ) .

Assumptions 13 are not testable in general, and their plausibility should be justified based on subject matter knowledge in practice. Assumption 1 holds in the RCT by default. Assumption 2 (i) is plausible if all information related to the trial participation and the outcome is captured in the data at hand. This assumption is weaker than the ignorablility of the trial participation assumption, i.e., { T 0 , T 1 } δ X . The relationship between Assumption 2 (i) and its stronger version is analogous to that described in the study by Dahabreh et al. [4] in the context of continuous and binary outcomes. Assumption 2 (ii) implies that the absence of patient characteristics prevent from participating in the trial [5]. Assumption 3 is commonly made in survival analysis [10,18,19]. This assumption is weaker than the conditional independence assumption of the censoring and survival time given only on the treatment [20].

The covariate distribution of the RCT sample f ( X δ = 1 ) may not be representative of that of the target population f ( X ) due to restrictive trial enrollment criteria, but the covariate distribution of the observational sample f ( X δ ˜ = 1 ) is often representative of f ( X ) due to the real-world data collection mechanism. In particular, if the observational sample is a simple random sample of the target population, then f ( X δ ˜ = 1 ) = f ( X ) . More generally, the observational sample can be selected under complex sampling designs. To accommodate such scenarios, we can define d as the known design weight for the observational sample.

Assumption 4

(The known design weight for the observational sample). The observational sample design weight d = 1 / P ( δ ˜ = 1 X ) is known.

Assumption 4 is commonly assumed in the survey sampling literature. On the basis of Assumption 4, the design-weighted observational sample is representative of the target population. In an OS with simple random sampling, d = N / m , where N is the target population size.

Under the aforementioned assumptions, the ATE θ τ , or S a ( t ) , a { 0 , 1 } sufficiently, is identified based on the observed data. We consider two identification formulas. Let Y ( t ) = I ( U t ) and define the conditional censoring model S a C ( t , X ) = P ( C > t X , A = a , δ = 1 ) . One identification formula is based on the conditional survival curves, i.e.,

(1) S a ( t ) = E { δ ˜ d S a ( t , X ) } , S a ( t , X ) = E { I ( T t ) X , A = a , δ = 1 , C t } ,

which can be called the OS-design-weighted G-computation formula. Note that if the population covariate distribution is available, E { S a ( t , X ) } is the G-computation formula for S a ( t ) . The other identification formula is based on the inverse probability weighting (IPW) approach for the marginal survival curves, i.e.,

(2) S a ( t ) = E δ π δ ( X ) I ( A = a ) { π A { X } } a { 1 π A ( X ) } 1 a Y ( t ) S a C ( t , X ) ,

for a { 0 , 1 } . The two identification formulas in (1) and (2) motivate the estimators in the following section, depending on different components of the observed data likelihood.

3 Two direct estimators based on identification formulas

3.1 Outcome regression

On the basis of the identification formula (1), the treatment-specific conditional survival curve can be modeled and fitted based on the observed data, e.g., using the widely used Cox regression model for the survival outcome. A treatment-specific conditional hazard function at time t given covariate X i is defined as follows:

(3) λ a i ( t ) λ a ( t X i ) = λ a 0 ( t ) exp ( β a T X i ) ,

where λ a 0 ( t ) is a treatment-specific baseline hazard function, for a { 0 , 1 } , i = 1 , , n . Following standard survival analysis techniques, β a can be estimated as a solution to the partial likelihood score equation, and the baseline cumulative hazard Λ a 0 ( t ) 0 t λ a 0 ( u ) can be estimated by the Breslow [21] estimator:

Λ ˆ a 0 ( t ) = 0 t i = 1 N δ i A a i d N i ( u ) i = 1 N δ i A a i exp ( β ^ a T X i ) Y i ( u ) ,

where A a i = I ( A i = a ) , N i ( u ) = I ( U i u , Δ i = 1 ) , and Y i ( u ) = I ( U i u ) . The survival model in (3) does not imply that the marginal HR of the potential survival outcomes under a = 1 and a = 0 , i.e., λ 10 ( t ) / λ 00 ( t ) , is a constant and thus is not restrictive. Other survival models can be considered, including the additive hazards model [22,23].

Chen and Tsiatis [10] proposed a method that accounts for imbalances between treatment groups by first estimating the conditional treatment effect given X and then averaging the effect over X across both treatment groups. A similar approach can be applied to balance the covariate distribution between the RCT sample and the observational sample by first estimating the treatment-specific survival curve conditional on f ( X δ = 1 ) under model (3), i.e., S ^ a ( t , X i ) = exp { Λ ^ a i ( t ) } = exp { Λ ^ a 0 ( t ) exp ( β ^ a T X i ) } , and then applying the design-weighted averaging over f ( X δ ˜ = 1 ) . The resulting OR estimator of the marginal treatment-effect survival curve is

(4) S ^ a OR ( t ) = i = 1 N δ ˜ i d i 1 i = 1 N δ ˜ i d i e Λ ^ a i ( t ) ,

for a { 0 , 1 } , and the corresponding ATE estimator is defined as θ ^ τ O R = Ψ τ ( S ^ 1 OR ( t ) , S ^ 0 OR ( t ) ) . The OR estimator is consistent when the survival model (3) is correctly specified.

3.2 Inverse probability and calibration weighting

We can construct the estimator of the marginal treatment-specific survival curve based on the identification formula in (2) that involves sampling score, treatment propensity score, and censoring probability. This approach can be viewed as a combination of IPSW, inverse probability of treatment weighting (IPTW), and inverse probability of censoring weighting (IPCW). The weighting estimator requires positing models for the three probabilities and estimating them.

First, we consider the estimation of the sampling score. As the covariate distribution of the RCT sample is different from that of the target population in general, the estimated ATE based only on the RCT sample can be biased. A widely used approach to account for this selection bias is IPSW, i.e., weighting the RCT sample by the inverse probability of trial participation over that of OS participation, to adjust for differences in covariate distribution between the trial sample and the population. Specifically, π δ ( X ) can be modeled as π δ ( X ) = { ω IPSW ( X ) } 1 , where ω IPSW ( X ) = P ( δ ˜ = 1 δ + δ ˜ = 1 , X ) / P ( δ = 1 δ + δ ˜ = 1 , X ) . One can plug in { ω ^ IPSW ( X ) } 1 for π δ ( X ) in (2) using the common logistic regression model. However, the IPSW method requires the sampling score model to be correctly specified; it also could be highly unstable if P ( δ = 1 δ + δ ˜ = 1 , X ) is close to zero for some X .

Instead of direct estimating the sampling scores, Lee et al. [5] proposed the calibration weighting approach to reduce the selection bias in the trial-based estimator, which is analogous to the entropy balancing method by Hainmueller [24] and more stable than the IPSW method. The basic idea is that subjects in the RCT sample are calibrated to the observational sample, so that after calibration, the covariate distribution of the RCT sample empirically matches that of the target population. The calibration weighting approach is based on the idea that for any g ( X ) , the following identity holds,

(5) E δ π δ ( X ) g ( X ) = E { δ ˜ d g ( X ) } = E { g ( X ) } ,

where g ( X ) is a function of X to be calibrated, e.g., the moments or any nonlinear transformations.

The calibration weights ω i are obtained by solving the optimization problem:

(6) min W i = 1 n ω i log ω i , subject to ω i 0 , i , i = 1 n ω i = 1 , and i = 1 N δ i ω i g ( X i ) = g ˜ ,

where W = { w i : δ i = 1 } . The last constraint is the empirical representation of (5), where g ˜ = i = 1 N δ ˜ i d i g ( X i ) / i = 1 N δ ˜ i d i is a consistent estimator of E { g ( X ) } from the observational sample. Minimizing the negative entropy function of the calibration weights in (6) enforces the weights to be as close to one another as possible, which reduces the variability due to heterogeneous weights. By using the Lagrange multiplier λ , the objective function of this convex optimization problem becomes L ( λ , W ) = i = 1 n ω i log ω i λ i = 1 n ω i g ( X i ) g ˜ . The estimated calibration weights are ω ^ i = ω ( X i ; λ ^ ) = exp { λ ^ g ( X i ) } / i = 1 n exp { λ ^ g ( X i ) } , where λ ^ solves

(7) U ( λ ) = i = 1 n exp { λ g ( X i ) } { g ( X i ) g ˜ } = 0 .

Under the loglinear sampling score model, the calibration weights from the objective function (6) have the same functional form as inverse probability of sampling score weights asymptotically, resulting in the direct correspondence between the calibration weight and the sampling score in that ω ^ i { N π ^ δ ( X i ) } 1 p 0 , as n [5]. Following that, we posit the loglinear sampling score model,

(8) π δ ( X ) = exp { η 0 T g ( X ) } , for some η 0 .

Lee et al. [5] showed that λ ^ is equivalent to η ^ , where π ^ δ ( X ) = π δ ( X ; η ^ ) = exp { η ^ T g ( X ) } . Accordingly, in the rest of the article, we represent the calibration weights using η ^ , i.e., ω ^ i = ω ( X i ; η ^ ) = exp { η ^ g ( X i ) } / i = 1 n exp { η ^ g ( X i ) } . If one considers a logistic sampling score model instead, then other objective functions can be used, such as i = 1 n ( ω i 1 ) log ( ω i 1 ) , that corresponds to the weights with the same functional form as the inverse to a logistic probability of sampling [25,26]. However, the loglinear regression model in (8) is close to the logistic regression model when the proportion of the RCT sample in the target population is small.

With respect to treatment assignment, π A ( X ) is generally known for RCTs. However, several authors suggested estimating the treatment propensity score for the RCTs to increase the efficiency and account for the chance of imbalance of prognostic variables [e.g., 27,28]. We choose a logistic regression model for the treatment propensity score,

(9) π A ( X ) = [ 1 + exp { ρ 0 T g ( X ) } ] 1 , for some ρ 0 ,

and define π ^ a i = A i π A ( X i ; ρ ^ ) + ( 1 A i ) { 1 π A ( X i ; ρ ^ ) } . Estimating the propensity scores also broadens the scope of the current article to allow the generalization of the OS. Even though this article focuses on generalizing the trial findings where we only require a two-way balancing between the RCT and OS, a three-way balancing approach among the treated, the controlled, and the observational sample (e.g., [29]) can be useful to generalize the findings from the OS to its larger population with the estimation of π A .

Moreover, to account for right censoring C , we posit Cox proportional hazards model with conditional hazard

(10) λ C ( t X , A = a ) = λ a 0 C ( t ) exp ( γ a T X ) , for a { 0 , 1 } ,

where the standard techniques as for the survival model in (3) can be used to estimate γ a and Λ a 0 C ( t ) 0 t λ a 0 C ( u ) d u .

Combining ω ^ i , π ^ a i , and Λ ^ a i C ( t ) = Λ ^ a 0 C ( t ) exp ( γ ^ a T X i ) estimated under the working models in (8), (9), and (10), respectively, we define the CW estimator of the marginal treatment-specific survival curve as follows:

(11) S ^ a C W ( t ) = i = 1 N δ i ω ^ i A a i π ^ a i e Λ ^ a i C ( t ) Y i ( t ) ,

where A a i = I ( A i = a ) . The corresponding ATE estimator is θ ^ τ C W = Ψ τ ( S ^ 1 C W ( t ) , S ^ 0 C W ( t ) ) .

4 Improved estimators

4.1 Efficient influence function

Let O = ( X , A , U , Δ , δ , δ ˜ ) be one copy of the vector of observed variables. The OR estimator specified in (4) and the CW estimator specified in (11) use different components of the likelihood function f ( O ) . Specifically, the OR estimator is based on modeling S a ( t , X ) for a { 0 , 1 } , and the CW estimator is based on modeling π δ ( X ) , π a ( X ) , and S a C ( t , X ) for a { 0 , 1 } . These estimators are singly robust in that they are consistent only under the correct survival OR model or the correct weighting models. A vast number of estimators can be constructed by combining these two estimators. In general, the question becomes how to obtain the most efficient estimator. Our approach is to derive the EIF [30] of θ τ to construct semiparametrically efficient estimators. Such estimators also gain robustness to model misspecification as we show later.

We consider a class of influence functions of regular asymptotically linear (RAL) estimators of the treatment-specific survival curve S a ( t ) . Define A a = I ( A = a ) and π a ( X ) = a π A ( X ) + ( 1 a ) { 1 π A ( X ) } . Following Tsiatis [30], the class of observed data influence functions includes

(12) φ a ( t ; O ) = δ π δ ( X ) A a π a ( X ) Y ( t ) S a C ( t , X ) S a ( t ) + δ π δ ( X ) { A π A ( X ) } g a 1 ( X ) an arbitrary score of A ( X , δ = 1 ) + δ π δ ( X ) δ ˜ d g a 2 ( X ) an arbitrary score of δ X + δ π δ ( X ) A a π a ( X ) 0 t d M a C ( u ) S a C ( u , X ) g a 3 ( u A = a , X ) an arbitrary score of U , Δ = 0 ( X , A , δ = 1 )

for arbitrary functions g a 1 ( ) , g a 2 ( ) , and g a 3 ( ) , where M a C ( u ) = N a C ( u ) 0 u Λ a C ( s ) d s is a Martingale with N a C = A a I ( U u , Δ = 0 ) , and Λ a C ( s ) is a cumulative hazard for censoring for a { 0 , 1 } . The last three terms in (12) are mean-zero functions. According to the semiparametric theory [30], the EIF is the influence function in (12) with the smallest variance. The EIF can be derived by projecting the first term of (12) onto the orthogonal complement of the tangent space spanned by the scores of the nuisance functions, i.e., the last three terms. The RAL estimator with the EIF is a semiparametrically efficient estimator. The following theorem gives the EIF in the class of influence functions (12), with the proof given in the Appendix B.1.

Theorem 1

Under Assumptions 14, the EIF for the treatment-specific survival curve is

(13) φ a eff ( t ; O ) = δ π δ ( X ) A a π a ( X ) Y ( t ) S a C ( t , X ) S a ( t ) δ π δ ( X ) A π A ( X ) π A ( X ) E { I ( T t ) X , A = a , δ = 1 } δ π δ ( X ) δ ˜ d E { I ( T t ) X , A = a , δ = 1 } + δ π δ ( X ) A a π a ( X ) 0 t d M a C ( u ) S a C ( u , X ) E { I ( T u ) X , A = a , δ = 1 , U u } .

Many common treatment effect estimands are functionals of the treatment-specific survival curves, including the survival difference at a fixed time τ , the difference of RMSTs, the ratio of RMTLs, and the difference of τ th quantile of survivals. Their EIFs can be expressed in the form of a combination of weighted integrals of the EIF for treatment-specific survival curves [17]. To be specific, the EIF for θ τ is

(14) φ θ τ eff ( O ) = 0 τ ϕ 1 ( t ) φ 1 eff ( t ; O ) d t + 0 τ ϕ 0 ( t ) φ 0 eff ( t ; O ) d t

for some functions ϕ a ( ) that E { ϕ a ( ) 2 } < (see Appendix A for details). We limit our interest to the estimands with such form of the EIF, which covers the broad class of estimators that are favored in survival analysis.

4.2 Augmented calibration weighting estimator

Motivated by Theorem 1, under the survival model in (3) and the weighting models specified in (8)–(10), we propose the following augmented CW (ACW) estimator of the treatment-specific survival curve,

(15) S ^ a ACW1 ( t ) = i = 1 N δ i ω ^ i A a i π ^ a i e Λ ^ a i C ( t ) Y i ( t ) i = 1 N δ i ω ^ i A a i π ^ a i π ^ a i e Λ ^ a i ( t ) i = 1 N δ i ω ^ i i = 1 N δ ˜ i d i 1 δ ˜ i d i e Λ ^ a i ( t ) + i = 1 N δ i ω ^ i A a i π ^ a i 0 t d M ^ a i C ( u ) e Λ ^ a i C ( u ) e Λ ^ a i ( t ) e Λ ^ a i ( u ) = i = 1 N δ i ω ^ i A a i π ^ a i e Λ ^ a i C ( t ) Y i ( t ) + i = 1 N e Λ ^ a i ( t ) i = 1 N δ ˜ i d i 1 δ ˜ i d i δ i ω ^ i A a i π ^ a i 1 0 t { e Λ ^ a i C ( u ) + Λ ^ a i ( u ) } d M ^ a i C ( u ) .

In addition, following the technique by Zhang and Schaubel [20] that represents the marginal cumulative hazard function Λ a ( t ) = 0 t { S a ( u ) } 1 d S a ( u ) by estimating the denominator and the numerator separately, we propose another ACW estimator,

(16) S ^ a ACW2 ( t ) = exp 0 t d S ^ a ACW1 ( u ) S ^ a ACW1 ( u ) ,

where

d S ^ a ACW1 ( u ) = i = 1 N δ i ω ^ i A a i π ^ a i e Λ ^ a i C ( u ) d N i ( u ) + i = 1 N e Λ ^ a i ( u ) d Λ ^ a i ( u ) i = 1 N δ ˜ i d i 1 δ ˜ i d i δ i ω ^ i A a i π ^ a i 1 0 u { e Λ ^ a i C ( s ) + Λ ^ a i ( s ) } d M ^ a i C ( s )

estimates d S a ( u ) . Although S ^ a ACW1 ( t ) and S ^ a ACW2 ( t ) are asymptotically equivalent, simulation studies show that S ^ a ACW2 ( t ) has better finite-sample performance. The proposed ACW estimators are similar to the estimators developed by Zhang and Schaubel [18] and Zhang et al. [19]. The difference between the proposed and their method is discussed in Section 7.

Similar to Yang et al. [17], we consider an asymptotic linear characterization of the ATE estimator θ ^ τ ACW = Ψ τ ( S ^ 1 ACW ( t ) , S ^ 0 ACW ( t ) ) for both ACW1 and ACW2 estimators. That is, under mild regularity conditions,

(17) θ ^ τ ACW θ τ = 0 τ ϕ 1 ( t ) { S ^ 1 ACW ( t ) S 1 ( t ) } d t + 0 τ ϕ 0 ( t ) { S ^ 0 ACW ( t ) S 0 ( t ) } d t + o p ( N 1 / 2 ) ,

for bounded variation functions ϕ a ( ) in (14). Under the asymptotic linear characterization, θ ^ τ ACW has the influence function φ θ τ eff ( O ) (see Appendix B.2 for the proof).

Toward this end, the following theorem shows the local efficiency and asymptotic properties of the proposed ACW estimators of the ATE, i.e., θ ^ τ ACW1 = Ψ τ ( S ^ 1 ACW1 ( t ) , S ^ 0 ACW1 ( t ) ) and θ ^ τ ACW2 = Ψ τ ( S ^ 1 ACW2 ( t ) , S ^ 0 ACW2 ( t ) ) .

Theorem 2

Under Assumptions 14, if either the survival model in (3) is correctly specified or the weighting models, i.e., the sampling score model (8), the treatment propensity score model (9), and the censoring model (10), are correctly specified, under regularity conditions, θ ^ τ ACW1 and θ ^ τ ACW2 are consistent for θ τ , and N 1 / 2 ( θ ^ τ ACW1 θ τ ) and N 1 / 2 ( θ ^ τ ACW2 θ τ ) are asymptotically normal with mean zero and variance E ( ς 1 2 ) and E ( ς 2 2 ) . Moreover, if all working models in (3) and (8)–(10) are correctly specified, θ ^ τ ACW1 and θ ^ τ ACW2 are locally efficient, i.e., ς 1 = ς 2 = φ θ τ eff ( O ) as n .

The proof of Theorem 2 and details of ς 1 2 and ς 2 2 are provided in Appendix B.3. For a straightforward procedure of variance estimation, a nonparametric bootstrap method can be used. Specifically, we draw B bootstrap samples from both the RCT and the observational sample, respectively, and then for each resampled pair, we obtain a replicate of the ACW estimator; the sample variance of the B bootstrap replicates is the variance of the ACW estimator.

The proposed ACW estimators depend on the parametric estimation of nuisance functions. Alternatively, we can consider a flexible nonparametric approach without the parametric assumption, which is often unrealistic in complex problems in practice. The asymptotic behavior of the ACW estimators with the nonparametric estimation of nuisance functions can be characterized by the empirical process perspective. Suppose that the posited nuisance models are consistent, i.e., π δ ( X ; η ^ ) π δ ( X ) = o p ( 1 ) , π A ( X ; ρ ^ ) π A ( X ) = o p ( 1 ) , and S a ( t , X ; β ^ a ) S a ( t , X ) = o p ( 1 ) , S a C ( t , X ; γ ^ a ) S a C ( t , X ) = o p ( 1 ) , for a { 0 , 1 } . Also, suppose that the weighting functions π δ ( X ; η ^ ) , π A ( X ; ρ ^ ) , and S a C ( u , X ; γ ^ a ) are bounded away from zero. Then, we have the effect of the estimated nuisance functions in θ ^ τ ACW θ τ bounded above by a = 0 1 0 τ ϕ a ( t ) { π δ ( X ; η ^ ) π δ ( X ) S a ( t , X ; β ^ a ) S a ( t , X ) + π A ( X ; ρ ^ ) π A ( X ) S a ( t , X ; β ^ a ) S a ( t , X ) + P 0 t d M a C ( u , X ; γ ^ a ) S a ( u , X ) 1 S a ( t , X ) S a ( u , X ; β ^ a ) 1 S a ( t , X ; β ^ a ) d t up to a multiplicative constant, where P is a true measure such that P f ( O ) = f ( O ) d P and is L 2 norm. If each term of the bound is of rate o p ( n 1 / 2 ) , then the effect of nuisance function estimations are asymptotically negligible. This statement is formalized in the following theorem.

Theorem 3

Suppose Assumptions 14 hold. Let S a ( t , X ; β ^ a ) be general semiparametric and nonparametric models for S a ( t , X ) , and let π δ ( X ; η ^ ) , π A ( X ; ρ ^ ) , and S a C ( t , X ; γ ^ a ) be general semiparametric models for π δ ( X ) , π A ( X ) , and S a C ( t , X ) , respectively, for a { 0 , 1 } . Suppose the following conditions hold:

  1. S a ( t , X ; β ^ a ) S a ( t , X ) = o p ( 1 ) , π δ ( X ; η ^ ) π δ ( X ) = o p ( 1 ) , π A ( X ; ρ ^ ) π A ( X ) = o p ( 1 ) , S a C ( t , X ; γ ^ a ) S a C ( t , X ) = o p ( 1 ) ;

  2. 0 < c 1 π δ ( X ; η ^ ) , π A ( X ; ρ ^ ) , S a C ( u , X ; γ ^ a ) c 2 1 for some c 1 , c 2 ;

  3. π δ ( X ; η ^ ) π δ ( X ) S a ( t , X ; β ^ a ) S a ( t , X ) = o p ( n 1 / 2 ) , π A ( X ; ρ ^ ) π A ( X ) S a ( t , X ; β ^ a ) S a ( t , X ) = o p ( n 1 / 2 ) , P 0 t d M a C ( u , X ; γ ^ a ) S a ( u , X ) 1 S a ( t , X ) S a ( u , X ; β ^ a ) 1 S a ( t , X ; β ^ a ) = o p ( n 1 / 2 ) .

Then, θ ^ τ ACW1 and θ ^ τ ACW2 are consistent estimators for θ τ and achieve semiparametric efficiency.

The proof of Theorem 3 is provided in Appendix B.4. To ensure the consistency of the nuisance function estimation with the convergence rate of the product as in (C3), we use the method of sieves in Section 4.3.

4.3 Robust estimation using the method of sieves with penalization

For a robust estimation of the ATE under possibly misspecified working models, we adopt the method of sieves [31,32], which enables flexible data-adaptive estimation of the survival curves and probability weights with root-n consistency [12]. We construct the sieves using the linear spans of power series [33], but other basis functions such as Fourier series or splines are applicable. Specifically, for a p -dimensional vector of non-negative integers κ k = ( κ k 1 , , κ k p ) , we consider a K -vector sieve basis functions g ( X ) = { g 1 ( X ) , , g K ( X ) } T = { X κ 1 , , X κ K } T , where X κ k = l = 1 p X l κ k l with κ k = l = 1 p κ k l non-decreasing in k , i.e., κ k κ k + 1 . Under standard regularity conditions, the sieves approximation results in a consistent estimation of the survival curves and weighting probabilities with large K (see Lee et al. [5], supporting information). To facilitate the stable estimation and to control the variability of the estimators with large K , we consider the penalized estimation of the nuisance functions.

For the sampling score model π δ ( X ) , the penalized sieves estimation is based on the dual problem of calibration that solves the estimating equation U ( λ ) in (7). Following the penalized estimating equation approach [3436], we solve

U ε ( λ ) = U ( λ ) q ε ( λ ) sign ( λ )

for λ = ( λ 1 , , λ K ) T , where q ε ( λ ) sign ( λ ) is the element-wise product of q ε ( λ ) = { q ε ( λ 1 ) , , q ε ( λ K ) } T and sign ( λ ) . We define q ε ( x ) = d p ε ( x ) / d x and specify p ε ( x ) to be the popular smoothly clipped absolute deviation (SCAD) penalty function [37], but different penalty functions such as adaptive lasso [38] and the minimax concave penalty [39] are also applicable. For the SCAD penalty, we have q ε ( λ k ) = ε [ I ( λ k < ε ) + { ( b 1 ) ε } 1 { ( b ε λ k ) + } I ( λ k ε ) ] , for k = 1 , , K , with b = 3.7 following the literature and the tuning parameter ε selected by cross validation.

For penalized sieves estimation of the survival outcome model S a ( t , X ) , the standard penalization technique for the Cox PH model was used based on the partial likelihood. Specifically, we estimate β a by solving

arg max β a R K r D δ r A a r β a T g ( X r ) log l R r exp { β a T g ( X l ) } j = 1 K p ε ( β a j ) ,

where D is the set of indices of the events, R r is the set of indices of the patients at risk at time t r , with t 1 < < t d denoting the distinct event time, and p ε ( ) is the SCAD penalty, for a { 0 , 1 } . The penalized sieves estimation of the censoring model S a C ( t , X ) can be adopted by analogy. Finally, for the penalized sieves estimation of the treatment propensity score model π A ( X ) , we use the standard penalization approach for the binary data using the logit link with the SCAD penalty.

Under regularity conditions specified in Fan and Li [37] and Lee et al. [5], the penalized sieve estimators of the nuisance functions possess oracle properties and satisfy two conditions in Theorem 3. Thus, the resulting ACW estimators using the method of sieves achieve the root-n consistency and semiparametric efficiency.

5 Simulation studies

In this section, we conduct simulation studies to evaluate the finite-sample performance of the proposed estimators. We consider the target population of size N = 200,000 and covariates X = ( X 1 , X 2 , X 3 ) T , where each X i , i = 1 , 2 , 3 is generated from N ( 0 , 1 ) and truncated at 4 and 4 to satisfy regularity conditions. An RCT sample of size n 1,300 is selected from a hypothetical RCT eligible population with size N 1 = 50,000 . From the remaining N 2 = 150,000 OS population, we randomly select a sample of size m = 5,000 .

We consider the difference in RMST as the ATE, i.e., θ τ = Ψ τ ( S 1 ( t ) , S 0 ( t ) ) = 0 τ { S 1 ( t ) S 0 ( t ) } d t = 0 τ S 1 ( t ) d t 0 τ S 0 ( t ) d t = μ 1 , τ μ 0 , τ , where we choose τ = 20 . We compare the proposed CW and the ACW estimators with four other methods, the Naive estimator based only on the RCT sample, the ZS estimator, i.e., the estimator proposed by Zhang and Schaubel [18] based only on the RCT sample, the IPSW estimator using the normalized IPSW weight in (11), and the OR estimator specified in (4). For the ACW estimators, in addition to the original covariate vector g 1 ( X ) = ( X 1 , X 2 , X 3 ) T , we consider the penalized sieve estimation with calibration variables g 2 ( X ) = ( X 1 , X 2 , X 3 , X 1 X 2 , X 1 X 3 , X 2 X 3 , X 1 2 , X 2 2 , X 3 2 ) T , i.e., the extension of the basis functions in g 1 ( X ) to its second-order power series including all two-way interactions and quadratic terms. We use (S) to indicate the method of sieves. To assess the performance of these estimators under model misspecification, we consider four scenarios where (i) all four models in (3) and (8)–(10) are correct, (ii) only the survival outcome model (3) is correct, (iii) the outcome model is incorrect, but the other three weighting models (8)–(10) are correct, and (iv) all four models are incorrect. Details of estimators and specification of the four working models when they are correctly/incorrectly specified are listed in Table 1.

Table 1

Simulation settings: model specification and estimators; expit ( x ) = { 1 + exp ( x ) } 1 . O: survival outcome, S: sampling score, A: treatment propensity score, C: censoring

Models Correctly specified Incorrectly specified
O λ 1 ( t X ) t exp ( 3.7 ) exp ( X 1 X 2 1.5 X 3 ) t exp ( 0.8 ) exp ( e X 1 e X 2 1.5 X 3 )
λ 0 ( t X ) t exp ( 3 ) exp ( 1.8 X 1 1.5 X 2 X 3 ) t exp ( 1.5 ) exp ( 1.8 e X 1 1.5 e X 2 X 3 )
S π δ ( X ) expit { 3.9 0.5 X 1 0.5 X 2 0.3 X 3 } expit { 2.5 0.5 e X 1 0.5 e X 2 0.3 X 3 }
A π A ( X ) 0.5 expit { 1 + 0.5 e X 1 + 0.5 e X 2 0.5 e X 3 }
λ 1 C ( t X ) t exp ( 4.5 ) exp ( 0.5 X 1 X 2 X 3 ) t exp ( 2.5 ) exp ( 0.5 e X 1 e X 2 X 3 )
C λ 0 C ( t X ) t exp ( 3.5 ) exp ( 0.5 X 1 X 2 X 3 ) t exp ( 1.5 ) exp ( 0.5 e X 1 e X 2 X 3 )
Estimators Details
Naive θ ^ τ Naive = 0 τ { S ^ 1 Naive ( t ) S ^ 0 Naive ( t ) } d t ,
where S ^ a Naive ( t ) = n 1 i = 1 N δ i A a i { π ^ a i } 1 e Λ ^ a i C ( t ) Y i ( t )
ZS The denominator of the estimator proposed by Zhang and Schaubel [18]
OR θ ^ τ OR = 0 τ { S ^ 1 OR ( t ) S ^ 0 OR ( t ) } d t , where S ^ a OR ( t ) is defined by (4)
IPSW θ ^ τ IPSW = 0 τ { S ^ 1 IPSW ( t ) S ^ 0 IPSW ( t ) } d t ,
where S ^ a IPSW ( t ) = i = 1 N δ i ω ^ i IPSW A a i { π ^ a i } 1 e Λ ^ a i C ( t ) Y i ( t )
CW θ ^ τ CW = 0 τ { S ^ 1 CW ( t ) S ^ 0 CW ( t ) } d t where S ^ a CW ( t ) is defined by (11)
ACW1 θ ^ τ ACW1 = 0 τ { S ^ 1 ACW ( t ) S ^ 0 ACW1 ( t ) } d t where S ^ a ACW1 ( t ) is defined by (15)
ACW2 θ ^ τ ACW2 = 0 τ { S ^ 1 ACW2 ( t ) S ^ 0 ACW2 ( t ) } d t where S ^ a ACW2 ( t ) is defined by (16)
ACW1(S) The penalized ACW1 estimator using the method of sieves with g ( X ) = g 2 ( X )
ACW2(S) The penalized ACW2 estimator using the method of sieves with g ( X ) = g 2 ( X )

Table 2 and Figure 2 summarize the results based on 1,000 Monte Carlo replications. The bootstrap variance estimation was used for all estimators with B = 100 . When all models are correctly specified, the Naive and ZS estimators that are based only on the RCT sample show biased estimations of the ATE due to the selection bias in the RCT sample. The OR, IPSW, CW, and ACW estimators correct that bias by leveraging the observational sample covariates. The ACW estimators were found to be more efficient than other unbiased IPW estimators. Even though the variance of the OR estimator is smaller, it is biased when the outcome model is incorrectly specified. The CW and IPSW estimators are biased when the sampling score model is not correctly specified. The proposed ACW estimators are double robust when either the outcome model or the other three models are correctly specified. In Scenario 4 where both the outcome model and the weighting models are misspecified, the ACW estimator using the penalized sieve estimation, i.e., ACW1(S) and ACW2(S), is still unbiased. In Scenario 1 and 2, the efficiencies of the ACW1(S) and ACW2(S) estimators are comparable to that of the ACW1 and ACW2 estimators. In Scenario 3 and 4 where the outcome model is incorrect, ACW1(S) and ACW2(S) are more efficient than their counterparts, confirming that using the method of sieves can improve efficiency. In addition, the ACW2 and ACW2(S) estimators were found to be more consistent than the ACW1 and ACW1(S) estimators in a finite sample.

Table 2

Simulation results under four scenarios; T: true (correct) model, W: wrong (incorrect) model. Bias is the empirical bias of point estimates; ESE is the empirical standard error of estimates; RSE is the relative bias (%) of bootstrap standard error estimates; CP is the empirical coverage probability of the 95% confidence intervals

BIAS ESE RSE (%) CP (%)
Estimator μ 1 μ 0 θ μ 1 μ 0 θ μ 1 μ 0 θ μ 1 μ 0 θ
Scenario 1: O:T / S:T, A:T, C:T
Naive 4.12 4.83 0.71 0.30 0.28 0.33 7.44 3.91 4.97 0.0 0.0 39.0
ZS 4.14 4.83 0.69 0.29 0.26 0.30 7.12 2.13 4.65 0.0 0.0 34.2
IPSW 0.04 0.04 0.08 0.32 0.38 0.49 2.33 0.14 2.01 93.9 94.7 93.3
CW 0.11 0.06 0.05 0.32 0.37 0.50 2.57 1.57 2.29 92.4 93.9 93.7
OR 0.02 0.01 0.01 0.22 0.23 0.30 1.45 2.59 1.48 94.7 95.3 95.0
ACW1 0.05 0.04 0.00 0.25 0.26 0.35 2.14 1.60 1.54 93.4 93.9 94.4
ACW2 0.02 0.02 0.00 0.25 0.26 0.35 2.27 0.31 2.56 94.0 94.5 94.4
ACW1(S) 0.05 0.04 0.00 0.25 0.26 0.34 1.17 2.46 0.19 93.4 93.7 94.4
ACW2(S) 0.02 0.02 0.00 0.25 0.26 0.35 1.76 0.98 1.42 93.9 94.8 94.1
Scenario 2: O:T / S:W, A:W, C:W
Naive 3.90 4.98 1.08 0.31 0.32 0.39 0.48 2.62 1.88 0.0 0.0 19.5
ZS 3.88 4.73 0.85 0.28 0.29 0.33 0.35 1.41 0.96 0.0 0.0 26.9
IPSW 0.32 1.31 0.98 0.28 0.42 0.49 4.73 0.34 2.40 82.0 13.2 48.6
CW 0.48 0.17 0.65 0.30 0.50 0.57 0.41 0.92 2.28 63.6 92.2 79.7
OR 0.01 0.01 0.01 0.22 0.23 0.28 0.22 3.62 3.81 95.2 95.4 96.6
ACW1 0.03 0.03 0.00 0.24 0.31 0.38 0.30 7.07 5.96 94.3 95.4 95.4
ACW2 0.01 0.03 0.02 0.24 0.32 0.38 0.23 2.23 2.49 94.4 95.0 95.5
ACW1(S) 0.04 0.03 0.01 0.26 0.35 0.42 0.93 7.55 7.20 93.0 94.3 94.6
ACW2(S) 0.01 0.03 0.02 0.26 0.36 0.42 1.27 1.51 2.70 93.1 93.6 93.9
Scenario 3: O:W / S:T, A:T, C:T
Naive 4.02 4.68 0.67 0.30 0.29 0.36 4.14 4.17 4.66 0.0 0.0 49.7
ZS 4.03 4.66 0.63 0.29 0.28 0.34 3.23 3.77 4.88 0.0 0.0 49.6
IPSW 0.01 0.09 0.10 0.33 0.41 0.54 1.42 2.55 3.43 94.4 93.6 93.0
CW 0.08 0.00 0.08 0.33 0.40 0.54 1.50 3.73 3.36 93.7 93.1 93.6
OR 0.40 0.86 0.47 0.25 0.29 0.37 2.11 3.94 4.18 64.7 15.2 74.1
ACW1 0.03 0.00 0.03 0.26 0.29 0.37 0.58 2.68 1.27 94.3 93.8 94.7
ACW2 0.01 0.01 0.02 0.26 0.29 0.37 0.46 1.54 1.67 94.5 93.8 94.5
ACW1(S) 0.04 0.02 0.02 0.24 0.25 0.32 4.10 11.19 8.99 94.7 94.8 95.0
ACW2(S) 0.02 0.01 0.01 0.24 0.25 0.32 2.19 1.54 1.50 95.1 93.9 94.2
Scenario 4: O:W / S:W, A:W, C:W
Naive 4.06 5.53 1.48 0.34 0.33 0.43 2.18 5.94 3.48 0.0 0.0 7.4
ZS 4.12 5.40 1.28 0.31 0.30 0.38 2.67 6.78 4.90 0.0 0.0 8.0
IPSW 0.68 2.27 1.60 0.32 0.48 0.55 2.43 2.19 2.03 44.1 1.2 20.2
CW 0.14 1.18 1.33 0.32 0.56 0.65 0.59 1.51 1.80 92.0 43.0 46.3
OR 0.76 2.15 1.39 0.25 0.31 0.38 2.04 0.54 2.45 11.7 0.0 4.9
ACW1 0.29 0.82 0.53 0.26 0.39 0.45 2.91 1.69 2.36 79.4 43.4 77.1
ACW2 0.32 0.82 0.50 0.26 0.40 0.46 2.98 1.53 0.00 75.9 44.2 77.9
ACW1(S) 0.02 0.02 0.01 0.24 0.32 0.38 0.42 13.16 11.20 95.1 96.1 95.4
ACW2(S) 0.04 0.00 0.04 0.24 0.32 0.38 0.10 3.96 4.13 94.9 96.0 95.3
Figure 2 
               Boxplot of estimators under four model specification scenarios; T: true (correct) model, W: wrong (incorrect) model.
Figure 2

Boxplot of estimators under four model specification scenarios; T: true (correct) model, W: wrong (incorrect) model.

6 Real data application

We apply the proposed method to estimate the effect of adjuvant chemotherapy on survival in patients with early-stage resected non-small cell lung cancer (NSCLC). Cancer and Leukemia Group B (CALGB) 9633 is the only randomized phase III trial designed to evaluate the effectiveness of adjuvant chemotherapy over observation for stage IB NSCLC [40]. An additional observational sample for stage IB NSCLC patients was extracted from National Cancer Database (NCDB), including more than 15,000 patients with the same eligibility criteria as CALGB 9633. More details of the two data sources are given in the study by Lee et al. [5], where they conducted an integrative analysis of the CALGB trial sample and the NCDB sample to improve generalizability for the CALGB 9633 trial-based estimator of average risk of cancer recurrence.

Table 3 summarizes the distribution of the four baseline covariates by the data sources, which have been considered important prognostic factors. The baseline covariates of the patients in the CALGB trial are different from those of the patients in NCDB. Specifically, the CALGB trial patients consist of more males, are younger, and have smaller tumor sizes. Consequently, an important clinical question is whether adjuvant chemotherapy benefits the general stage IB NSCLC patients population, represented by the NCDB, which is a population-based registry capturing approximately 79% of newly diagnosed lung cancers in the United States, and contains more females, are order, and have larger tumor sizes than the CALGB patients. Given that the RCT sample is relatively healthier than the NCDB sample, the estimators based only on CALGB 9633 sample would result in biased estimation of the true effect of adjuvant chemotherapy on the real-world population of early-stage NSCLC patients.

Table 3

Baseline characteristics of the CALGB 9633 trial sample and the NCDB sample; mean (standard deviation) for continuous and number (proportion) for binary covariate

Male ( X 1 ) Age ( X 2 ) Squamous histology ( X 3 ) Tumor size ( X 4 )
RCT: CALGB 9633 ( n = 319 ) 204 (64%) 60.83 (9.62) 128 (40%) 4.6 (2.08)
OS: NCDB ( n = 15,379 ) 8,458 (55%) 67.87 (10.18) 5,998 (39%) 4.94 (3.04)

We estimate a 12-year difference of the restricted mean lifetime between adjuvant chemotherapy and observation (i.e., no chemotherapy). The nonparametric bootstrap method is used to estimate the standard errors. The results are given in Table 4, using the proposed estimators and other existing methods in the simulation studies. The Naive and ZS estimators indicate that in the RCT sample, there is no 12-year RMST difference between the adjuvant chemotherapy and observation, i.e., 0.02 and 0.04 years, respectively. All other estimators that utilize the covariate information of the OS show a much larger difference in the RMST. The IPSW and CW estimators show about 0.4–0.5 year increase in RMST for adjuvant chemotherapy over observation, and the OR estimator shows about 0.84 year increase; however, these estimates are not significant. The proposed ACW estimators give an estimate of about a 1-year RMST increase for patients who received adjuvant chemotherapy, which is significant at 0.05 level.

Table 4

Estimates and 95% confidence intervals of 12-year difference of the restricted mean lifetime between adjuvant chemotherapy and observation

Estimator μ 1 ^ μ 0 ^ θ ^
Naive 3.33 (2.86, 3.94) 3.31 (2.81, 3.92) 0.022 ( 0.815 , 0.895)
ZS 3.34 (2.89, 3.94) 3.30 (2.79, 3.90) 0.044 ( 0.794 , 0.881)
IPSW 3.71 (3.09, 4.55) 3.23 (2.62, 3.83) 0.478 ( 0.339 , 1.550)
CW 3.73 (3.07, 4.69) 3.17 (2.49, 3.87) 0.561 ( 0.363 , 1.830)
OR 3.88 (3.17, 4.69) 3.04 (2.37, 3.61) 0.835 ( 0.160 , 1.890)
ACW1 4.14 (3.42, 5.17) 3.13 (2.45, 3.64) 1.01 (0.048, 2.200)
ACW2 4.07 (3.35, 5.00) 3.10 (2.41, 3.64) 0.967 (0.043, 2.100)
ACW1(S) 4.36 (3.56, 5.17) 3.19 (2.66, 3.69) 1.160 (0.290, 2.087)
ACW2(S) 4.31 (3.50, 5.15) 3.19 (2.66, 3.69) 1.114 (0.239, 2.056)

Figure 3 represents the estimated restricted mean lifetime for adjuvant chemotherapy and observation and their difference as a function of restricted times τ . All selection-bias-adjusted estimators, i.e., the IPSW, CW, OR, ACW1, ACW2, ACW1(S), and ACW2(S) estimators, show a trend of increasing RMST difference over τ for all τ from 1 to 13 years. Especially, all of the estimators show significant non-zero differences when τ is large. Compared to the ACW1 and ACW2 estimators, the ACW1(S) and ACW2(S) estimators gain efficiency by using the method of sieves. On the other hand, the Naive and ZS estimators that are based only on the RCT sample show nearly flat trends near zero difference over τ . All these results indicate that the proposed estimators are able to detect the effect of adjuvant chemotherapy on survival in the real world population than that in the RCT sample with higher efficiency and more protection against model misspecification, by leveraging the information from the OS.

Figure 3 
               Estimated RMST plots of adjuvant chemotherapy and observation and their difference as a function of restricted times.
Figure 3

Estimated RMST plots of adjuvant chemotherapy and observation and their difference as a function of restricted times.

This substantial heterogeneity in the treatment effect is mainly due to age and tumor size. While many baseline covariates are prognostic factors for survival risk, age, and tumor size are the few variables associated with the outcome and significantly interact with the treatments. It is the reason that these two variables are of interest for evaluating treatment effect heterogeneity. Subgroup analysis based only on the RCT data supports the same extent of treatment effect heterogeneity varied by age and tumor size. The patients with older age and larger tumor size have significantly less risk for death after adjuvant chemotherapy than those who were younger and had smaller tumors [40]. The trend is consistent with the treatment effect heterogeneity found in the integrated analysis for the target population. The remaining difference in treatment heterogeneity could be caused by the imbalance of the covariates distribution between the RCT sample and the target population, and again the difference is expected and reasonable, and we believe that it indeed reflects the value of such integrated analysis.

7 Discussion

This article considers a framework to estimate the treatment effect defined as a function of the treatment-specific survival probability function. The proposed ACW estimators are motivated by two identification formulas based on the survival outcome model and the weighting models and achieve local efficiency and double robustness based on semiparametric theory.

The proposed ACW estimators are similar to the estimator proposed by Zhang and Schaubel [18] that combines the treatment-specific Cox model by Chen and Tsiatis [10] and the IPTW Nelson–Aalen method proposed by Wei [41]. However, unlike our estimator that is derived from the EIF, their approach is based on augmenting the IPW estimating equation to achieve double robustness, and the efficiency of their estimator has not been studied yet. Moreover, their estimator is based only on the RCT sample, whereas we leverage the observational sample to account for selection bias, resulting in an additional augmentation in terms of the sampling score model. In addition, we focus on the class of estimands defined as functionals of the survival functions, including the framework proposed by Zhang and Schaubel [18] as a special case. The proposed approach is also similar to Zhang et al. [19], focusing on the broad class of Mann–Whitney-type causal effect based on the semiparametric theory. To derive the EIF, they construct a RAL estimator that excludes censored data. In contrast, by restricting our interest to the functional of the treatment-specific survival curves, we consider a different RAL estimator utilizing both observed and censored data. The proposed class of estimators could be more robust under highly censored data and covers a broad class of estimands that are favored in time-to-event data.

Instead of a penalized nonparametric sieves method, other machine learning methods can be used, such as survival trees [42] or random forests [43] as alternative to the Cox proportional hazard model. A cross-fitting technique can be employed to remove the Donsker’s condition, which is questionable to hold for these methods [44].

The proposed methods assume that the trial participation is ignorable, i.e., all covariates related to the trial participation and survival time are captured. However, some important covariates might not be available in the observational samples as there were not originally collected for the research purpose. Future work could involve a sensitivity analysis assessing the robustness of the proposed framework in the presence of unmeasured covariates in observational studies [e.g., 4548].

In addition to the estimation of the ATE, a key challenge is to identify subgroups of patients for whom the treatment is more effective. Estimating individualized treatment effects is a key toward precision medicine so that doctors can tailor the treatment for individual patients given their characteristics. Interesting future research would be generalizing individualized treatment effects for survival outcomes combining RCT and observational studies, following Yang et al. [49,50] and Wu and Yang [51].

  1. Funding information: Yang is partially supported by the NSF grant DMS 1811245, NCI grant P01 CA142538, NIA grant 1R01AG066883, and NIEHS grant 1R01ES031651. Wang was partially supported by NCI grant P01 CA142538 and NIA grant 1R01AG066883.

  2. Conflict of interest: Authors state no conflict of interest.

  3. Data availability statement: The data that support the findings of the paper were assembled through agreements with third parties for licensed use. Without the permission of these third parties and to avoid unintended leakage of patient privacy, we elect to not share the data.

Appendix A EIFs for survival estimands

All survival estimands mentioned in the main paper have the EIF in the form of a combination of weighted integrals of the EIF for treatment-specific survival curves [17]. Specifically, the EIF for θ τ is

φ θ τ eff ( O ) = 0 τ ϕ 1 ( t ) φ 1 eff ( t ; O ) d t + 0 τ ϕ 0 ( t ) φ 0 eff ( t ; O ) d t ,

where φ a eff ( t ; O ) is the EIF for treatment-specific survival curve S a ( t ) , and ϕ a ( ) is a function that E { ϕ a ( ) 2 } < , for a { 0 , 1 } .

  1. Difference in the survival at a time point τ : θ ^ τ = S ^ 1 ( τ ) S ^ 0 ( τ ) ϕ 1 ( t ) = δ ( t τ ) and ϕ 0 ( t ) = δ ( t + τ ) , where δ ( ) is the Dirac delta function.

  2. Difference in RMSTs up to τ : θ ^ τ = 0 τ S ^ 1 ( τ ) d t 0 τ S ^ 0 ( τ ) d t ϕ 1 ( t ) = 1 and ϕ 0 ( t ) = 1 .

  3. Ratio of RMTLs up to τ : θ ^ τ = τ 0 τ S ^ 1 ( τ ) d t / τ 0 τ S ^ 0 ( τ ) d t . ϕ 1 ( t ) = τ 0 τ S ^ 0 ( τ ) d t 1 and ϕ 0 ( t ) = θ ^ τ τ 0 τ S ^ 0 ( τ ) d t 1 by the Taylor expansion.

  4. Difference in τ th quantile of survivals: θ = q 1 , τ q 0 , τ and θ ^ = q ^ 1 , τ q ^ 0 , τ , where q a , τ = inf q { S a ( q ) τ } and q ^ a , τ = inf q { S ^ a ( q ) τ } .

  5. Following the Bahadur-type representation, under regularity conditions [52], q ^ a , τ can be expressed as follows:

    q ^ a , τ q a , τ = S ^ a ( q a , τ ) S a ( q a , τ ) S ˙ a ( q a , τ ) + o p ( N 1 / 2 ) ,

    where S ˙ a ( q a , τ ) = d S a ( q ) / d q .

  6. ϕ 1 ( t ) = { S ˙ 1 ( q 1 , τ ) } 1 I ( t = q 1 , τ ) and ϕ 0 ( t ) = { S ˙ 0 ( q 0 , τ ) } 1 I ( t = q 0 , τ ) .

B Proofs

B.1 Proof of Theorem 1

We first derive the EIF for treatment-specific survival curves without censoring, i.e., under full data set V = ( X , A , T , δ , δ ˜ ) , and then derive it in the presence of censoring, i.e., under the observed data set O = ( X , A , U , Δ , δ , δ ˜ ) . The proof based on V is similar to the one in given in the study by Lee et al. [5] using the method of parametric submodel [53]. We use ξ in the subscript to denote the submodel. For example, f ξ ( V ) is a one-dimensional parametric submodel with the true f ( V ) at ξ = 0 , i.e., f ξ ( V ) ξ = 0 = f ( V ) .

Since the likelihood of a single V is

f ( V ) = { f ( X ) f ( δ = 1 X ) f ( A X , δ = 1 ) f ( T X , A , δ = 1 ) } δ { f ( X ) } l t a ˜

and δ l t a ˜ = 0 , the score function can be decomposed as follows:

S ˙ ( V ) = δ S ˙ ( X ) + δ S ˙ ( δ X ) + δ S ˙ ( A X , δ = 1 ) + δ S ˙ ( T X , A , δ = 1 ) + δ ˜ S ˙ ( X ) .

We use S ˙ to represent the score function.

The nuisance tangent space from the semiparametric theory is

H = H 1 H 2 H 3 H 4 ,

where

H 1 = { h ( X ) : E { h ( X ) } = 0 } H 2 = { h ( X , δ ) : E { h ( δ , X ) X } = 0 } H 3 = { h ( A , δ , X ) : E { h ( A , δ , X ) δ , X } = 0 } H 4 = { h ( T , A , δ , X ) : E { h ( T , A , δ , X ) A , δ , X } = 0 } .

Here, we only derive the EIF for S 1 ( u ) . The EIF for S 0 ( u ) can be easily derived using the similar technique. We denote the EIF for S 1 ( u ) under V as φ 1 F ( u V ) , which must satisfy φ 1 F ( u V ) H and ξ S 1 , ξ ( u ) ξ = 0 = E { φ 1 F ( u V ) S ˙ ( V ) } , where S 1 , ξ ( u ) = E { δ ˜ d S 1 , ξ ( u , X ) } . Toward this end, we express

(B1) ξ S 1 , ξ ( u ) ξ = 0 = δ ˜ d S 1 , ξ ( u , X ) f ξ ( V ) d V ξ = 0 = δ ˜ d S 1 , ξ ( u , X ) S ˙ ξ ( V ) f ξ ( V ) d V ξ = 0 + δ ˜ d ξ S 1 , ξ ( u , X ) f ξ ( V ) d V ξ = 0 = E { δ ˜ d S 1 ( u , X ) S ˙ ξ ( X ) } + E ξ S 1 , ξ ( u , X ) ξ = 0 .

For the first term in (B1), we have

(B2) E { δ ˜ d S 1 ( u , X ) S ˙ ( X ) } = E [ l t a ˜ d { S 1 ( u , X ) S 1 ( u ) } ] = E [ l t a ˜ d { S 1 ( u , X ) S 1 ( u ) } { S ˙ ( X , A , T , δ ) + l t a ˜ S ˙ ( X ) } ] = E [ l t a ˜ d { S 1 ( u , X ) S 1 ( u ) } S ˙ ( V ) ] ,

and for the second term in (B1), we have

(B3) ξ S 1 , ξ ( u X ) ξ = 0 = I ( T u ) ξ f ξ ( u X , δ = 1 , A = 1 ) d u ξ = 0 = I ( T u ) S ˙ ξ ( T X , δ = 1 , A = 1 ) f ξ ( u X , δ = 1 , A = 1 ) d u ξ = 0 = I ( T u ) S ˙ ( T X , δ , A ) δ π δ ( X ) A π A ( X ) f ( t X ) d u = E δ π δ ( X ) A π A ( X ) I ( T u ) S ˙ ( T X , δ , A ) X .

By combining (B2) and (B3),

E ξ S 1 , ξ ( u , X ) ξ = 0 = E δ π δ ( X ) A π A ( X ) I ( T u ) S ˙ ( T X , δ , A ) = E δ π δ ( X ) A π A ( X ) { I ( T u ) S 1 ( u , X ) } S ˙ ( T X , δ , A ) = E δ π δ ( X ) A π A ( X ) { I ( T u ) S 1 ( u , X ) } S ˙ ( T , X , δ , A ) = E δ π δ ( X ) A π A ( X ) { I ( T u ) S 1 ( u , X ) } S ˙ ( V ) .

Therefore,

ξ S 1 , ξ ( u ) ξ = 0 = E δ π δ ( X ) A π A ( X ) { I ( T u ) S 1 ( u , X ) } + δ ˜ d { S 1 ( u , X ) S 1 ( u ) } S ˙ ( V ) .

Let

φ 1 F ( u V ) = δ π δ ( X ) A π A ( X ) { I ( T u ) S 1 ( u , X ) } + δ ˜ d S 1 ( u , X ) S 1 ( u ) .

Since

δ ˜ d δ π δ ( X ) A π A ( X ) S 1 ( u , X ) H 3

and

δ π δ ( X ) A π A ( X ) I ( T u ) S 1 ( u ) H 4 ,

we have φ 1 F ( u V ) H ; thus, φ 1 F ( u V ) is the EIF for S 1 ( u ) .

Now, in the presence of censoring, we observe data set O = ( X , A , U , Δ , δ , δ ˜ ) instead of V . Following Robins et al. [54] and Tsiatis [30], the EIF for the monotone coarsened data is given as follows:

δ π δ ( X ) A π A ( X ) Δ S 1 C ( U , X ) I ( U u ) S 1 ( u ) δ π δ ( X ) A π A ( X ) { S 1 ( u , X ) δ ˜ d } S 1 ( u , X ) + δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) g ( u , r , X ) ,

where S 1 C ( r , X ) = p ( C r X , A = 1 , δ = 1 ) and d M 1 C ( r , X ) = d N C ( r ) λ 1 C ( r , X ) Y ( r ) , N C ( r ) = I ( U r , Δ = 1 ) and Y ( r ) = I ( U r ) . According to Theorem 10.4 from Tsiatis [30], the optimal g ( u , r , X ) is

g ( u , r , X ) = E { I ( T u X , A = 1 , δ = 1 , T r ) } .

Thus,

(B4) δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) g ( u , r , X ) = 0 δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) I ( u > r ) S 1 ( u , X ) S 1 ( r , X ) + I ( r u ) 1 = 0 u δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) S 1 ( u , X ) S 1 ( r , X ) + u δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) .

For the second term in (B4), we can express as follows:

(B5) ( B 4 ) = δ π δ ( X ) A π A ( X ) u d N C ( r ) S 1 C ( r , X ) u λ 1 ( r , X ) Y ( r ) S 1 C ( r , X ) = δ π δ ( X ) A π A ( X ) Y ( u ) 1 Δ S 1 C ( U , X ) u U λ 1 ( r , X ) S 1 C ( r , X ) = δ π δ ( X ) A π A ( X ) Y ( u ) 1 Δ S 1 C ( U , X ) 1 S 1 C ( U , X ) 1 S 1 C ( u , X ) = δ π δ ( X ) A π A ( X ) Y ( u ) 1 S 1 C ( u , X ) Δ S 1 C ( U , X ) .

Plugging (B5) to (B4), the EIF for S 1 ( u ) under O is

δ π δ ( X ) A π A ( X ) Δ S 1 C ( U , X ) I ( U u ) S 1 ( u ) δ π δ ( X ) A π A ( X ) { S 1 ( u , X ) δ ˜ d } S 1 ( u , X ) + 0 u δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) S 1 ( u , X ) S 1 ( r , X ) δ π δ ( X ) A π A ( X ) Y ( u ) 1 S 1 C ( u , X ) Δ S 1 C ( U , X ) = δ π δ ( X ) A π A ( X ) I ( U u ) S 1 C ( u , X ) S 1 ( u ) δ π δ ( X ) A π A ( X ) { S 1 ( u , X ) δ ˜ d } S 1 ( u , X ) + 0 u δ π δ ( X ) A π A ( X ) d M 1 C ( r , X ) S 1 C ( r , X ) S 1 ( u , X ) S 1 ( r , X ) .

The EIF for S 0 ( t ) can be obtained by analogy.

B.2 The influence function of the ACW estimators

For both ACW1 and ACW2 estimators, by the definition of the EIF,

N { S ^ a ACW ( t ) S a ( t ) } = 1 N i = 1 N φ a eff ( t ; O i ) + o p ( 1 ) , for a { 0 , 1 } .

Thus, under asymptotic linear characterization of the ATE estimator θ ^ τ ACW = Ψ τ ( S ^ 1 ACW ( t ) , S ^ 0 ACW ( t ) ) ,

N ( θ ^ τ ACW θ τ ) = 0 τ ϕ 1 ( t ) N { S ^ 1 ACW ( t ) S 1 ( t ) } d t + 0 τ ϕ 0 ( t ) N { S ^ 0 ACW ( t ) S 0 ( t ) } d t + o p ( 1 ) = 0 τ ϕ 1 ( t ) 1 N i = 1 N φ 1 eff ( t ; O i ) d t + 0 τ ϕ 0 ( t ) 1 N i = 1 N φ 0 eff ( t ; O i ) d t + o p ( 1 ) = 1 N i = 1 N 0 τ ϕ 1 ( t ) φ 1 eff ( t ; O i ) d t + 0 τ ϕ 0 ( t ) φ 0 eff ( t ; O i ) d t + o p ( 1 ) .

Thus, θ ^ τ ACW has the influence function:

φ θ τ eff ( O ) = 0 τ ϕ 1 ( t ) φ 1 eff ( t ; O ) d t + 0 τ ϕ 0 ( t ) φ 0 eff ( t ; O ) d t .

B.3 Proof of Theorem 2

Standard regularity conditions for the consistency of survival OR parameters and treatment propensity score model [e.g., 18,55] and suitable regularity conditions for the M-estimation theory [e.g., 56] are assumed throughout this proof. The former includes almost sure boundedness of X and finite cumulative baseline hazard function for survival and censoring, and the latter includes smoothness and identifiability of working models and applicability of dominated convergence theorem.

Under Assumptions 13, we consider three cases: (i) survival OR model (O) in (3) is correctly specified, (ii) the weighting models, i.e., the sampling score model (S) in (8), the treatment propensity score model (A) in (9), and the censoring model (C) in (10), are correctly specified; (iii) all working models in (3) and (8)–(10) are correctly specified. Assume that ζ ^ = ( η ^ , ρ ^ , β ^ 1 , β ^ 0 , γ ^ 1 , γ ^ 0 ) converges to some ζ = ( η , ρ , β 1 , β 0 , γ 1 , γ 0 ) , not necessary true. Then, π ^ δ ( X ) = π δ ( X ; η ^ ) = exp { η ^ T g ( X ) } converges to π δ ( X ) , π ^ A ( X ) = π A ( X ; ρ ^ ) = [ 1 + exp { ρ ^ T g ( X ) } ] 1 converges to π A ( X ) , Λ ^ a ( t ) = Λ ^ a ( t ; β ^ a ) converges to Λ a ( t ) , and Λ ^ a C ( t ) = Λ ^ a C ( t ; γ ^ a ) converges to Λ a C ( t ) , for a { 0 , 1 } . The following proof is similar to the one given in the study by Zhang and Schaubel [18] and Zhang et al. [19].

B.3.1 Double robustness

We first demonstrate the double robustness property of S ^ 1 ACW1 ( t ) . It is straightforward to show double robustness of S ^ 0 ACW1 ( t ) and θ ^ ACW1 by analogy. For the simplicity, we only consider the simple random sampling in the OS, but it can be easily extended to the general setting with known design weights. Define S 1 ( t ; ζ ) = { π δ ( X ) π A ( X ) e Λ a C ( t ) } 1 δ A Y ( t ) { π δ ( X ) π A ( X ) } 1 δ { A π A ( X ) } e Λ 1 ( t ) { π δ ( X ) 1 δ δ ˜ d } e Λ 1 ( t ) + { π δ ( X ) π A ( X ) } 1 δ A 0 t { e Λ 1 C ( t ) e Λ 1 ( u ) } 1 e Λ 1 ( t ) d M 1 C ( u ) . Under mild regularity conditions, S ^ 1 ACW1 ( t ) p E { S 1 ( t ; ζ ) } , where

(B6) E { S 1 ( t ; ζ ) } = E δ π δ ( X ) A π A ( X ) Y ( t ) e Λ a C ( t ) ,

(B7) E { S 1 ( t ; ζ ) } E δ π δ ( X ) A π A ( X ) π A ( X ) e Λ 1 ( t ) ,

(B8) E { S 1 ( t ; ζ ) } E δ π δ ( X ) δ ˜ d e Λ 1 ( t ) ,

(B9) E { S 1 ( t ; ζ ) } + E δ π δ ( X ) A π A ( X ) 0 t d M 1 C ( u ) e Λ 1 C ( t ) e Λ 1 ( t ) e Λ 1 ( u ) .

First, consider condition (i) that the model for O is correct. Using the notation κ ( t ) from the study by Zhang and Schaubel [18], we can express Y ( t ) = I ( T t ) κ ( t ) , where κ ( t ) = I ( C T or C t ) . Following Tsiatis [30, Chapter 9.3 and Lemma 10.4], we can write

κ ( t ) e Λ 1 C ( t ) = 1 0 t d M 1 C ( u ) e Λ 1 C ( t ) .

Then, we have

(B10) ( B 6 ) = E δ π δ ( X ) A π A ( X ) I ( T t ) κ ( t ) e Λ 1 C ( t ) = E δ π δ ( X ) A π A ( X ) I ( T t ) δ π δ ( X ) A π A ( X ) 0 t d M 1 C ( u ) e Λ 1 C ( t ) I ( T t ) ,

(B11) ( B 7 ) + ( B 7 ) = E δ π δ ( X ) A π A ( X ) e Λ 1 ( t )

(B12) + E { δ ˜ d e Λ 1 ( t ) } .

Combining (B9) with (B10)–(B12), we have

(B13) E { δ ˜ d e Λ 1 ( t ) }

(B14) + E δ π δ ( X ) A π A ( X ) { I ( T t ) e Λ 1 ( t ) }

(B15) + E δ π δ ( X ) A π A ( X ) 0 t d N 1 C ( u ) e Λ 1 C ( t ) I ( T t ) e Λ 1 ( t ) e Λ 1 ( u )

(B16) E δ π δ ( X ) A π A ( X ) 0 t Y ( u ) d Λ 1 C ( u ) e Λ 1 C ( t ) I ( T t ) e Λ 1 ( t ) e Λ 1 ( u ) .

Since e Λ 1 ( t ) = p ( T t X , A = 1 , δ = 1 ) , (B13) equals to S 1 ( t ) and (B14) is 0 by iterated expectation conditioning on ( X , A = 1 , δ = 1 ) . Also (B15) and (B16) are 0 by iterated expectations conditioning on ( X , A = 1 , δ = 1 , C = u , T u ) and ( X , A = 1 , δ = 1 , C u , T u ) , respectively.

Now, consider the condition (ii) that the models for S, A, and C are correct. As π δ ( X ) = p ( δ = 1 X ) , π A ( X ) = p ( A = 1 X , δ = 1 ) , and e Λ 1 C ( t ) = p ( C t X , A = 1 , δ = 1 ) ,

( B 6 ) = E δ π δ ( X ) E A π A ( X ) I ( T t ) I ( C t ) e Λ a C ( t ) X , δ = 1 = E δ π δ ( X ) π A ( X ) π A ( X ) P ( T t X , A = 1 , δ = 1 ) P ( C t X , A = 1 , δ = 1 ) e Λ a C ( t ) = E { δ ˜ d P ( T t X , A = 1 , δ = 1 ) } = S 1 ( t ) ( B 7 ) + ( B 8 ) = E e Λ 1 ( t ) δ ˜ d δ π δ ( X ) π A ( X ) π A ( X ) = 0 ( B 9 ) = E δ π δ ( X ) π A ( X ) π A ( X ) E 0 t d M 1 C ( u ) e Λ 1 C ( t ) e Λ 1 ( t ) e Λ 1 ( u ) X , A = 1 , δ = 1 = 0 .

Thus, S ^ 0 ACW1 ( t ) has the double robustness property. By analogy, it is straightforward to show that S ^ 0 ACW1 ( t ) converges to E { S 0 ( t ; ζ ) } which equals to S 0 ( t ) if either working model for O or weighting models S, A, and C are correct. Consequently, θ ^ τ ACW1 converges to E { ϑ τ ( ζ ) } , where

ϑ τ ( ζ ) = 0 τ ϕ 1 ( t ) S 1 ( t ; ζ ) d t + 0 τ ϕ 0 ( t ) S 0 ( t ; ζ ) d t .

Then, E { ϑ τ ( ζ ) } equals to θ τ with the same double robustness properties.

For the ACW2 estimator, we can show that the numerator d S ^ 1 ACW1 ( u ) converges in probability to d S 1 ( u ) , thus S ^ 1 ACW2 ( t ) p S 1 ( t ) . Similarly, one can easily obtain S ^ 0 ACW2 ( t ) p S 0 ( t ) and θ ^ τ ACW2 p θ τ .

B.3.2 Asymptotic normality

Following empirical process literature, define P as the true measure, P N as the empirical measure, and define G N = N ( P N P ) for the empirical processes. Following the technique used in the study by Zhang et al. [19], we assume that ζ and ζ take values in a suitable Banach space with

(B17) N ( η ^ η ) = G N φ η ( O ) + o p ( 1 ) N ( ρ ^ ρ ) = G N φ ρ ( O ) + o p ( 1 ) N ( β ^ a β a ) = G N φ β a ( O ) + o p ( 1 ) N ( γ ^ a γ a ) = G N φ γ a ( O ) + o p ( 1 ) ,

for a { 0 , 1 } . Assuming that ϑ τ ( ζ ) belongs to Donsker classes [57,58], and under assumed regularity conditions, P N ϑ τ ( ζ ^ ) = θ ^ τ ACW1 and P ϑ τ ( ζ ) = θ τ . Then,

(B18) N { θ ^ τ ACW1 θ τ } = N { P N ϑ τ ( ζ ^ ) P ϑ τ ( ζ ) } = G N ϑ τ ( ζ ^ ) + N P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } = G N ϑ τ ( ζ ) + N P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } + o p ( 1 ) .

As ϑ is differentiable at ζ , we have the derivative

D ACW = ( D η ACW , D ρ ACW , D β 1 ACW , D β 0 ACW , D γ 1 ACW , D γ 0 ACW ) = 0 τ { ϕ 1 ( t ) D S 1 + ϕ 0 ( t ) D S 0 } d t .

This implies that

N P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } = D ACW N ( η ^ η , ρ ^ ρ , β ^ 1 β 1 , β ^ 0 β 0 , γ ^ 1 γ 1 , γ ^ 0 γ 0 ) + o p ( 1 ) = G N ( D η ACW φ η + D ρ ACW φ ρ + D β 1 ACW φ β 1 + D β 0 ACW φ β 0 + D γ 1 ACW φ γ 1 + D γ 0 ACW φ γ 0 ) + o p ( 1 ) .

Therefore,

N { θ ^ τ ACW1 θ τ } = G N { ϑ τ ( ζ ) + D η ACW φ η + D ρ ACW φ ρ + D β 1 ACW φ β 1 + D β 0 ACW φ β 0 + D γ 1 ACW φ γ 1 + D γ 0 ACW φ γ 0 } + o p ( 1 ) .

The similar technique can be used to prove the asymptotic normality of the ACW2 estimator using the Delta method and the M-estimation theory.

B.3.3 Local efficiency

Define the asymptotic variance of the ACW1 estimator as E ( ς 1 2 ) , where

ς 1 = ϑ τ ( ζ ) + D η ACW φ η + D ρ ACW φ ρ + D β 1 ACW φ β 1 + D β 0 ACW φ β 0 + D γ 1 ACW φ γ 1 + D γ 0 ACW φ γ 0 .

When the model for O is correct, then D β 1 ACW = 0 and D β 0 ACW = 0 . When the weighting models S, A, and C are correct, then D η ACW , D ρ ACW , D γ 1 ACW , and D γ 0 ACW are all zero. Thus, when all four working models are correct, then

N { θ ^ τ ACW1 θ τ } = G N ϑ τ ( ζ ) + o p ( 1 ) = G N φ θ τ eff ( O ) + o p ( 1 ) ,

i.e., ς 1 = φ θ τ eff ( O ) , implying that the ACW1 estimator is the locally efficient estimator. The efficiency of the ACW2 estimator can be proved using the similar technique and the small order difference between the ACW1 and the ACW2 estimator.

B.4 Proof of Theorem 3

From (B18) in B.3, we want to show that P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } is negligible under conditions in Theorem 3. Toward this end, we write

P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } = a = 0 1 0 τ ϕ a ( t ) P { S a ( t ; ζ ^ ) S a ( t ; ζ ) } d t ,

where S a ( t ; ζ ^ ) is defined in B.3. As Y ( t ) = I ( T t ) κ ( t ) with κ ( t ) = I ( C T or C t ) and e Λ a C ( t ) κ ( t ) = 1 0 t e Λ a C ( u ) d M a C ( u ) , by iterated expectation, we have

(B19) P { S a ( t ; ζ ^ ) S a ( t ; ζ ) } = P δ π δ ( X ; η ^ ) A π A ( X ; ρ ^ ) { S a ( t , X ) S a ( t , X ; β a ^ ) } + l t a ˜ d S a ( t , X ; β a ^ ) S a ( t , X ) δ π δ ( X ; η ^ ) A π A ( X ; ρ ^ ) 0 t d M ^ a C ( u ; γ a ^ ) S a C ( t , X ; γ a ^ ) S a ( t , X ) S a ( t , X ; β a ^ ) S a ( u , X ; β a ^ ) , = P δ π δ ( X ; η ^ ) A π A ( X ; ρ ^ ) 1 { S a ( t , X ) S a ( t , X ; β a ^ ) }

(B20) + P [ ( l t a ˜ d 1 ) S a ( t , X ; β a ^ ) ]

(B21) P δ π δ ( X ; η ^ ) A π A ( X ; ρ ^ ) 0 t d M ^ a C ( u ; γ a ^ ) S a C ( t , X ; γ a ^ ) S a ( t , X ) S a ( u , X ) S a ( t , X ; β a ^ ) S a ( u , X ; β a ^ ) ,

for a { 0 , 1 } . By Cauchy-Schwarz inequality and the positivity of π δ ( X ) and π A ( X ) , we have

( B 19 ) = P π δ ( X ) π A ( X ) π δ ( X ; η ^ ) π A ( X ; ρ ^ ) π δ ( X ; η ^ ) π A ( X ; ρ ^ ) { S a ( t , X ) S a ( t , X ; β a ^ ) } π δ ( X ) π A ( X ) π δ ( X ; η ^ ) π A ( X ; ρ ^ ) S a ( t , X ) S a ( t , X ; β a ^ ) { π δ ( X ) π δ ( X ; η ^ ) + π A ( X ) π A ( X ; ρ ^ ) } S a ( t , X ) S a ( t , X ; β a ^ ) ,

where indicates that the inequality holds up to a multiplicative constant. By iterated expectation, ( B 20 ) = 0 . Lastly, by Cauchy-Schwarz inequality and iterated expectation, we have

( B 21 ) P π δ ( X ) π δ ( X ; η ^ ) π A ( X ) π A ( X ; ρ ^ ) 0 t d M ^ a C ( u ; γ a ^ ) S a C ( t , X ; γ a ^ ) S a ( t , X ) S a ( u , X ) S a ( t , X ; β a ^ ) S a ( u , X ; β a ^ ) P 0 t d M ^ a C ( u ; γ a ^ ) S a ( t , X ) S a ( u , X ) S a ( t , X ; β a ^ ) S a ( u , X ; β a ^ ) ,

where the last inequality holds by the condition (C2) in Theorem 3. Therefore, P { S a ( t ; ζ ^ ) S a ( t ; ζ ) } is bounded above by π δ ( X ) π δ ( X ; η ^ ) S a ( t , X ) S a ( t , X ; β a ^ ) + π A ( X ) π A ( X ; ρ ^ ) S a ( t , X ) S a ( t , X ; β a ^ ) + P 0 t d M ^ a C ( u ; γ a ^ ) S a ( u , X ) 1 S a ( t , X ) S a ( u , X ; β a ^ ) 1 S a ( t , X ; β a ^ ) . It is negligible under the conditions (C1) and (C3) in Theorem 3; thus, P { ϑ τ ( ζ ^ ) ϑ τ ( ζ ) } is negligible as well and θ ^ τ ACW1 achieves semiparametric efficiency. The proof for θ ^ τ ACW2 is analogous by the small order difference between θ ^ τ ACW1 and θ ^ τ ACW2 .

References

[1] Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: the actg 320 trial. Am J Epidemiol. 2010;172(1):107–15. 10.1093/aje/kwq084Search in Google Scholar PubMed PubMed Central

[2] Tipton E. Improving generalizations from experiments using propensity score subclassification: assumptions, properties, and contexts. J Educ Behav Stat. 2013;38(3):239–66. 10.3102/1076998612441947Search in Google Scholar

[3] Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Stat Soc Ser A (Stat Soc). 2015;178(3):757–78. 10.1111/rssa.12094Search in Google Scholar

[4] Dahabreh IJ, Robertson SE, Tchetgen EJ, Stuart EA, Hernán MA. Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics. 2019;75:685–94. 10.1111/biom.13009Search in Google Scholar PubMed

[5] Lee D, Yang S, Dong L, Wang X, Zeng D, Cai J. Improving trial generalizability using observational studies. Biometrics. 2021. https://doi.org/10.1111/biom.13609.10.1111/biom.13609Search in Google Scholar PubMed PubMed Central

[6] Stuart EA, Cole SR, Bradshaw CP, Leaf PJ. The use of propensity scores to assess the generalizability of results from randomized trials. J R Stat Soc Ser A (Stat Soc). 2011;174(2):369–386. 10.1111/j.1467-985X.2010.00673.xSearch in Google Scholar PubMed PubMed Central

[7] Cole SR, Hernán MA. Adjusted survival curves with inverse probability weights. Comput Meth Program Biomed. 2004;75(1):45–9. 10.1016/j.cmpb.2003.10.004Search in Google Scholar PubMed

[8] Pan Q, Schaubel DE. Proportional hazards models based on biased samples and estimated selection probabilities. Canadian J Stat. 2008;36(1):111–27. 10.1002/cjs.5550360111Search in Google Scholar

[9] Colnet B, Mayer I, Chen G, Dieng A, Li R, Varoquaux G, et al. Causal inference methods for combining randomized trials and observational studies: a review. 2020. arXiv: http://arXiv.org/abs/arXiv:2011.08047. Search in Google Scholar

[10] Chen P-Y, Tsiatis AA. Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics. 2001;57(4):1030–8. 10.1111/j.0006-341X.2001.01030.xSearch in Google Scholar

[11] Wei G, Schaubel DE. Estimating cumulative treatment effects in the presence of nonproportional hazards. Biometrics. 2008;64(3):724–32. 10.1111/j.1541-0420.2007.00947.xSearch in Google Scholar PubMed

[12] Chen X. Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics. Vol. 6; 2007. p. 5549–632. 10.1016/S1573-4412(07)06076-XSearch in Google Scholar

[13] Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974;66(5):688. 10.1037/h0037350Search in Google Scholar

[14] Rubin DB. Comment: which ifs have causal answers. J Am Stat Assoc. 1986;81(396):961–2. 10.1080/01621459.1986.10478355Search in Google Scholar

[15] Hernán MA. The hazards of hazard ratios. Epidemiology (Cambridge, Mass.). 2010;21(1):13. 10.1097/EDE.0b013e3181c1ea43Search in Google Scholar PubMed PubMed Central

[16] Trinquart L, Jacot J, Conner SC, and Porcher R. Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol. 2016;34(15):1813–9. 10.1200/JCO.2015.64.2488Search in Google Scholar PubMed

[17] Yang S, Zhang Y, Liu GF, Guan Q. Smim: a unified framework of survival sensitivity analysis using multiple imputation and martingale. 2020. arXiv: http://arXiv.org/abs/arXiv:2007.02339. 10.1111/biom.13555Search in Google Scholar PubMed PubMed Central

[18] Zhang M, Schaubel DE. Double-robust semiparametric estimator for differences in restricted mean lifetimes in observational studies. Biometrics. 2012;68(4):999–1009. 10.1111/j.1541-0420.2012.01759.xSearch in Google Scholar PubMed PubMed Central

[19] Zhang Z, Liu C, Ma S, Zhang M. Estimating mann-whitney-type causal effects for right-censored survival outcomes. J Causal Inference. 2019;7(1):20180010.10.1515/jci-2018-0010Search in Google Scholar

[20] Zhang M, Schaubel DE. Contrasting treatment-specific survival using double-robust estimators. Stat Med. 2012;31(30):4255–68. 10.1002/sim.5511Search in Google Scholar PubMed PubMed Central

[21] Breslow N. Covariance analysis of censored survival data. Biometrics. 1974;30(1):89–99. 10.2307/2529620Search in Google Scholar

[22] Lin DY, Ying Z. Semiparametric analysis of general additive-multiplicative hazard models for counting processes. Ann Stat. 1995;23(5):1712–34. 10.1214/aos/1176324320Search in Google Scholar

[23] Aalen OO. A linear regression model for the analysis of life times. Stat Med. 1989;8(8):907–92. 10.1002/sim.4780080803Search in Google Scholar PubMed

[24] Hainmueller J. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Anal. 2012;20(1):25–46. 10.1093/pan/mpr025Search in Google Scholar

[25] Zhao Q. Covariate balancing propensity score by tailored loss functions. Ann Stat. 2019;47(2):965–93. 10.1214/18-AOS1698Search in Google Scholar

[26] Josey KP, Juarez-Colunga E, Yang F, Ghosh D. A framework for covariate balance using bregman distances. Scand J Stat. 2020;48(3):790–816. 10.1111/sjos.12457Search in Google Scholar

[27] Williamson EJ, Forbes A, White IR. Variance reduction in randomised trials by inverse probability weighting using the propensity score. Stat Med. 2014;33(5):721–37. 10.1002/sim.5991Search in Google Scholar PubMed PubMed Central

[28] Colantuoni E, Rosenblum M. Leveraging prognostic baseline variables to gain precision in randomized trials. Stat Med. 2015;34(18):2602–17. 10.1002/sim.6507Search in Google Scholar PubMed PubMed Central

[29] Chan KCG, PhillipYam SC, Zhang Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. J R Stat Soc Ser B Stat Methodol. 2016;78(3):673–700. 10.1111/rssb.12129Search in Google Scholar PubMed PubMed Central

[30] Tsiatis AA. Semiparametric theory and missing data. New York City: Springer; 2006. Search in Google Scholar

[31] Grenander U. Abstract inference. Hoboken, New Jersey: Wiley; 1981. Search in Google Scholar

[32] Geman S, Hwang C-R. Nonparametric maximum likelihood estimation by the method of sieves. Ann Stat. 1982;401–14. 10.1214/aos/1176345782Search in Google Scholar

[33] Newey WK. Convergence rates and asymptotic normality for series estimators. J Econom. 1997;79(1):147–68. 10.1016/S0304-4076(97)00011-0Search in Google Scholar

[34] Johnson BA, Lin DY, Zeng D. Penalized estimating functions and variable selection in semiparametric regression models. J Am Stat Assoc. 2008;103:672–80. 10.1198/016214508000000184Search in Google Scholar PubMed PubMed Central

[35] Wang L, Zhou J, Qu A. Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics. 2012;68(2):353–60. 10.1111/j.1541-0420.2011.01678.xSearch in Google Scholar PubMed

[36] Yang S, Kim JK, Song R. Doubly robust inference when combining probability and non-probability samples with high dimensional data. J R Stat Soc Ser B (Stat Methodol). 2020;82(2):445–65. 10.1111/rssb.12354Search in Google Scholar PubMed PubMed Central

[37] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–60. 10.1198/016214501753382273Search in Google Scholar

[38] Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–29. 10.1198/016214506000000735Search in Google Scholar

[39] Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942. 10.1214/09-AOS729Search in Google Scholar

[40] Strauss GM, Herndon JE, Maddaus II MA, Johnstone DW, Johnson EA, Harpole DH, et al. Adjuvant paclitaxel plus carboplatin compared with observation in stage IB non-small-cell lung cancer: CALGB 9633 with the cancer and leukemia group B, radiation therapy oncology group, and north central cancer treatment group study groups. J Clin Oncol. 2008;26(31):5043–51. 10.1200/JCO.2008.16.4855Search in Google Scholar PubMed PubMed Central

[41] Wei G. Semiparametric methods for estimating cumulative treatment effects in the presence of non-proportional hazards and dependent censoring. Doctoral dissertation, University of Michigan; 2008. Search in Google Scholar

[42] Bou-Hamad I, Larocque D, Ben-Ameur H. A review of survival trees. Statistics Surveys. 2011;5:44–71. 10.1214/09-SS047Search in Google Scholar

[43] Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841–60. 10.1002/9781118445112.stat08188Search in Google Scholar

[44] Zhang Z, Li W, Zhang H. Efficient estimation of mann-whitney-type effect measures for right-censored survival outcomes in randomized clinical trials. Stat Biosci. 2020;12(2):246–62. 10.1007/s12561-019-09246-2Search in Google Scholar

[45] VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the e-value. Ann Internal Med. 2017;167(4):268–74. 10.7326/M16-2607Search in Google Scholar PubMed

[46] Yang S, Lok JJ. Sensitivity analysis for unmeasured confounding in coarse structural nested mean models. Statistica Sinica. 2017;28:1703–23. 10.5705/ss.202016.0133Search in Google Scholar PubMed PubMed Central

[47] Nguyen TQ, Ebnesajjad C, Cole SR, Stuart EA. Sensitivity analysis for an unobserved moderator in rct-to-target-population generalization of treatment effects. Ann Appl Stat. 2017;11(1):225–47. 10.1214/16-AOAS1001Search in Google Scholar

[48] Huang M. Sensitivity analysis in the generalization of experimental results. 2022. arXiv: http://arXiv.org/abs/arXiv:2202.03408. Search in Google Scholar

[49] Yang S, Zeng D, Wang X. Elastic integrative analysis of randomized trial and real-world data for treatment heterogeneity estimation. 2020. arXiv: http://arXiv.org/abs/arXiv:2005.10579. Search in Google Scholar

[50] Yang S, Zeng D, Wang X. Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. 2020. arXiv: http://arXiv.org/abs/arXiv:2007.12922. Search in Google Scholar

[51] Wu L, Yang S. Transfer learning of individualized treatment rules from experimental to real-world data. 2021. arXiv: http://arXiv.org/abs/arXiv:2108.08415. Search in Google Scholar

[52] Francisco CA, Fuller WA. Quantile estimation with a complex survey design. Ann Stat. 1991;19(1):454–69. 10.1214/aos/1176347993Search in Google Scholar

[53] Bickel PJ, Klaassen CAJ, Bickel PJ, Ritov Y, Klaassen J, Wellner JA, et al. Efficient and adaptive estimation for semiparametric models. vol. 4. Baltimore: Johns Hopkins University Press; 1993. Search in Google Scholar

[54] Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 1995;90(429):106–21. 10.1080/01621459.1995.10476493Search in Google Scholar

[55] Lin DY, Wei L-J. The robust inference for the cox proportional hazards model. J Am Stat Assoc. 1989;84(408):1074–8. 10.1080/01621459.1989.10478874Search in Google Scholar

[56] Van der Vaart AW. Asymptotic statistics. Vol. 3. Cambridge, England: Cambridge university press; 2000. Search in Google Scholar

[57] Van Der Vaart AW, Wellner J. Weak convergence and empirical processes: with applications to statistics. Berlin, Germany: Springer Science & Business Media; 1996. 10.1007/978-1-4757-2545-2Search in Google Scholar

[58] Kennedy EH. Semiparametric theory and empirical processes in causal inference. In: Statistical causal inferences and their applications in public health research. New York City: Springer; 2016. p. 141–67. 10.1007/978-3-319-41259-7_8Search in Google Scholar

Received: 2022-01-17
Revised: 2022-09-15
Accepted: 2022-11-11
Published Online: 2022-12-09

© 2022 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 7.5.2024 from https://www.degruyter.com/document/doi/10.1515/jci-2022-0004/html
Scroll to top button