Abstract

In this paper, we propose a model averaging estimation method for the multiplicative error model and construct the corresponding weight choice criterion based on the Kullback–Leibler divergence with a hyperparameter to avoid overfitting. The resulting model averaging estimator is proved to be asymptotically optimal. It is shown that the Kullback–Leibler model averaging (KLMA) estimator asymptotically minimizes the in-sample Kullback–Leibler divergence and improves out-of-sample forecast accuracy even under different loss functions. In simulations, we show that the KLMA estimator compares favorably with the smoothed-AIC (SAIC), smoothed-BIC (SBIC), and Mallows model averaging (MMA) estimators, especially when nonlinear noise is added to the data generation process. Empirical applications to the daily range of the S&P500 and the price durations of IBM show that the out-of-sample forecasting capacity of the KLMA estimator is better than that of the other methods.

1. Introduction

The dynamic characteristics of financial markets can be described by nonnegative time process indicators such as absolute returns, trading volume, realized volatility, the high-low range, and so on. Engle [1] proposed the multiplicative error model (MEM) to describe such nonnegative time processes; the model is specified as the product of a conditionally deterministic factor and an independently and identically distributed innovation term with unit mean. The MEM can therefore be seen as a generalization of the generalized autoregressive conditional heteroskedasticity (GARCH; Bollerslev [2]) model and the autoregressive conditional duration (ACD; Engle and Russell [3]) model. Besides, the MEM was extended by Engle and Gallo [4] to a multivariate context called the vector MEM (vMEM). More recent applications can be found in Taylor and Xu [5], who show that unexpected duration and volume dominate observed duration and volume in terms of information content, and that volatility and volatility shocks affect duration in different directions; these results are interpreted with reference to extant microstructure theory. Although a wide variety of MEMs are available to describe such processes, researchers are constantly faced with the question of how to select the model that best fits the actual situation and dataset.

Model selection has a long history in statistics and econometrics; its goal is to choose the appropriate model that is regarded as the best among all candidate models. Different methods have been advocated based on distinct estimation criteria, including the Akaike information criterion (AIC; Akaike [6]), the Mallows criterion (Mallows [7]), the Bayesian information criterion (BIC; Schwarz [8]), and the focused information criterion (Hjort and Claeskens [9]). However, describing the distribution of the data with a single model has its disadvantages. The key to avoiding reliance on a single model is to smoothly interpolate between the different models, rather than to switch discontinuously among them, using weights related to the relative fits of the models to compute a weighted average estimate of the unknown parameters.

Model averaging is an alternative to model selection. Instead of relying on the estimation result of a single model, a weighted average of all the candidate estimates is formed to obtain the model averaging estimate. Furthermore, according to Leung and Barron [10], model averaging provides a kind of insurance against selecting a very poor model and thus holds promise for improving the risk in regression estimation. In applications, Hoeting et al. [11] use exponential AIC and BIC scores, Hansen [12, 13] and Wan et al. [14] apply model averaging to the forecasting of stationary time series based on the Mallows criterion, and Ullah and Wang [15] outline the steps of averaging general nonparametric models based on the Mallows criterion. Hansen and Racine [16] propose the jackknife (delete-one cross-validation) method, and Zhang et al. [17] propose a model averaging method for generalized linear models. A common drawback of most of the abovementioned weighting methods is that they are asymptotically optimal only under restrictive conditions on the error terms. Recently, Liu et al. [18] extended the model averaging approach by minimizing the Kullback–Leibler (KL) divergence to estimate the weights for GARCH models, and Zhao et al. [19] combine maximum likelihood estimators of the unknown parameters in both the mean and variance functions of the multiplicative heteroscedastic model. In general, the best-known model averaging estimator is MMA, whose optimality criterion consists of two terms representing, respectively, the goodness of fit and a penalty for the average number of variables in the candidate models. Although MMA is asymptotically optimal for the linear model in large samples, according to Liu et al. [18] the model averaging estimator based on the KL divergence is more efficient than MMA when the candidate models are conditional volatility models such as the GARCH family. Furthermore, the KL divergence is also widely used in fields such as image classification and recognition; see Fekri-Ershad and Tajeripour [20] and Fekri-Ershad [21].

Hence, the main contributions of this paper are described as follows. Firstly, we propose a model averaging estimator based on the KL divergence when the candidate models are MEMs. We construct a feasible weight selection criterion which can be proved to achieve the best out-of-sample forecasting accuracy in the class of model averaging estimators. Besides, we prove asymptotic optimality, which means that the KL divergence evaluated at the weights determined by KLMA is asymptotically equivalent to its lower bound under some regularity conditions, and consistency, which means that the convergence rate is determined by the sample size and the KL divergence risk under some regularity conditions. Moreover, we show that the KLMA does not require the condition imposed by Hansen [12] that all candidate models are misspecified, which strengthens the theoretical basis for applications of the KLMA. Finally, we demonstrate the out-of-sample forecasting accuracy of the KLMA in simulations with different DGPs and apply the KLMA to some empirical examples. The results are supportive in most cases.

The remainder of this paper is organized as follows. In Section 2, we review the MEM briefly and construct the optimal model averaging estimator with the weight selection criterion by minimizing KL divergence. In Section 3, we provide the asymptotic optimality of the KLMA. In Section 4, we perform a Monte Carlo simulation where the KLMA is compared with the other model averaging methods including SAIC, SBIC, and MMA. In Section 5, we use the real dataset to conduct the empirical examples and compare the forecasting results for these methods. In Section 6, some conclusions are made.

2. Model Averaging Framework

2.1. The MEM

Consider a nonnegative time process $\{x_t\}$ and the past information set $\mathcal{F}_{t-1}$, which contains all of the information about the process from time $1$ to $t-1$. Then, the MEM(p, q) for $x_t$ is
$$x_t = \mu_t \varepsilon_t, \qquad \mu_t = \omega + \sum_{i=1}^{p} \alpha_i x_{t-i} + \sum_{j=1}^{q} \beta_j \mu_{t-j},$$
where $\mu_t$ denotes the mean process of $x_t$, depending on an unknown parameter vector $\theta = (\omega, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q)'$, and $\mu_t$ is the conditionally deterministic process.

For the necessary and sufficient conditions to ensure the positivity and stationarity of the series, refer to Nelson and Cao [22] and Tsai and Chan [23]; that is, the parameters $\omega$, $\alpha_i$, and $\beta_j$ are positive and $\sum_{i=1}^{p}\alpha_i + \sum_{j=1}^{q}\beta_j < 1$. The term $\varepsilon_t$ is the innovation of the model, which is conditionally stochastic given $\mathcal{F}_{t-1}$, with density $g(\cdot)$ having a mean of 1 and an unknown variance $\sigma^2$:
$$\varepsilon_t \mid \mathcal{F}_{t-1} \sim g(\varepsilon_t), \qquad E(\varepsilon_t \mid \mathcal{F}_{t-1}) = 1, \qquad \mathrm{Var}(\varepsilon_t \mid \mathcal{F}_{t-1}) = \sigma^2.$$

Thereby, the conditional mean and conditional variance of $x_t$ are
$$E(x_t \mid \mathcal{F}_{t-1}) = \mu_t, \qquad \mathrm{Var}(x_t \mid \mathcal{F}_{t-1}) = \sigma^2 \mu_t^2.$$

Sometimes an observed signed variable determines different dynamics in $\mu_t$; for instance, the rate of return in financial markets produces a leverage effect in response to changes in volatility. Following Glosten et al. [24], with a GJR-like asymmetry effect, the asy-MEM(p, d, q) is
$$\mu_t = \omega + \sum_{i=1}^{p} \alpha_i x_{t-i} + \sum_{k=1}^{d} \gamma_k x_{t-k}\,\mathbb{1}(r_{t-k} < 0) + \sum_{j=1}^{q} \beta_j \mu_{t-j},$$
where $\gamma_k$ captures the asymmetric influence of the signed variable $r_{t-k}$ and the unknown parameter vector is $\theta = (\omega, \alpha_1, \ldots, \alpha_p, \gamma_1, \ldots, \gamma_d, \beta_1, \ldots, \beta_q)'$. Such an asymmetric MEM incorporates more information from the financial market than the MEM(p, q), but the parameters still need to be restricted to preserve the nonnegativity and stationarity of $\mu_t$. Thereby, there are other models that do not require such parameter restrictions, such as the log-MEM(p, q) of Bauwens and Giot [25]:
$$\log \mu_t = \omega + \sum_{i=1}^{p} \alpha_i \log x_{t-i} + \sum_{j=1}^{q} \beta_j \log \mu_{t-j}.$$

Besides, Bauwens and Rombouts [26] and Taylor and Xu [5] develop further extensions of these specifications.
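To make the basic specification concrete, the following minimal Python sketch simulates an MEM(1,1) with unit-mean exponential innovations; the parameter values and the burn-in length are illustrative choices, not values used in the paper.

```python
import numpy as np

def simulate_mem11(n, omega=0.1, alpha=0.2, beta=0.7, burn=500, seed=0):
    """Simulate x_t = mu_t * eps_t with mu_t = omega + alpha*x_{t-1} + beta*mu_{t-1}
    and unit-mean exponential innovations (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    total = n + burn
    x = np.empty(total)
    mu = np.empty(total)
    mu[0] = omega / (1.0 - alpha - beta)      # start from the unconditional mean
    x[0] = mu[0] * rng.exponential(1.0)
    for t in range(1, total):
        mu[t] = omega + alpha * x[t - 1] + beta * mu[t - 1]
        x[t] = mu[t] * rng.exponential(1.0)   # E[eps_t] = 1 implies E[x_t | F_{t-1}] = mu_t
    return x[burn:], mu[burn:]                # discard the burn-in period

x, mu = simulate_mem11(2000)
```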

Let $g(\cdot)$ be the density function of the innovation term $\varepsilon_t$; then we can obtain the conditional density function of $x_t$ as
$$f(x_t \mid \mathcal{F}_{t-1}) = \frac{1}{\mu_t}\, g\!\left(\frac{x_t}{\mu_t}\right).$$
Then, we describe QMLE inference by assuming that $\varepsilon_t$ follows the exponential distribution, as in Lu et al. [27], and the log-likelihood function of $\theta$ given the observations $x_1, \ldots, x_n$ is
$$\ell_n(\theta) = -\sum_{t=1}^{n} \left( \log \mu_t(\theta) + \frac{x_t}{\mu_t(\theta)} \right).$$

So, we obtain the maximum likelihood estimator
$$\hat{\theta} = \arg\max_{\theta \in \Theta} \ell_n(\theta),$$
where $\Theta$ denotes the parameter space. In general, the maximum likelihood estimator exists and is consistent and asymptotically efficient under some regularity conditions. It is also asymptotically normal, and we have $\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N\bigl(0, I(\theta_0)^{-1}\bigr)$, where $I(\theta_0)$ is the usual Fisher information matrix and $\theta_0$ is the true value. Besides QMLE, there is a series of estimation methods such as the GMM estimator of Cipollini et al. [28], the robust moment-based estimator of Lu and Ke [29], the GLS estimator of Lu and Ke [30], and the M-estimator of Lu et al. [27] designed to make the estimation robust.
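As a rough illustration of the exponential QMLE described above, the sketch below maximizes the exponential quasi-log-likelihood of an MEM(1,1) numerically; the initialization of the recursion, the starting values, and the parameter bounds are ad hoc choices introduced here for illustration rather than the estimation settings of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def mem11_filter(params, x):
    """Recursion mu_t = omega + alpha*x_{t-1} + beta*mu_{t-1}, with mu_1 set to the sample mean."""
    omega, alpha, beta = params
    mu = np.empty_like(x, dtype=float)
    mu[0] = x.mean()
    for t in range(1, len(x)):
        mu[t] = omega + alpha * x[t - 1] + beta * mu[t - 1]
    return mu

def neg_exp_loglik(params, x):
    """Negative exponential quasi-log-likelihood: sum_t [log mu_t + x_t / mu_t]."""
    mu = mem11_filter(params, x)
    if np.any(mu <= 0):
        return np.inf
    return np.sum(np.log(mu) + x / mu)

def fit_mem11(x):
    """Exponential QMLE of the MEM(1,1) parameters (omega, alpha, beta)."""
    res = minimize(neg_exp_loglik, x0=np.array([0.05, 0.1, 0.8]), args=(x,),
                   method="L-BFGS-B",
                   bounds=[(1e-8, None), (0.0, 0.999), (0.0, 0.999)])
    return res.x, -res.fun
```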

However, the QMLE is easily affected by outliers. So, we construct an averaging estimator over different models, which avoids the impact of these problems in some cases. In addition, we will prove consistency when $\varepsilon_t$ follows the exponential distribution and further prove asymptotic optimality when $\varepsilon_t$ follows the Gamma distribution, as in Engle [1].

2.2. Model Averaging Estimator

To approximate the true time process (1), we consider $M$ candidate models given by MEMs with different numbers of parameters. The unknown parameter vector $\theta_m$ of the $m$th model is estimated by maximizing the log-likelihood in equation (9), yielding the fitted conditional mean $\hat{\mu}_t^{(m)}$. Let the weight vector be $w = (w_1, \ldots, w_M)'$ with $w_m \ge 0$ and $\sum_{m=1}^{M} w_m = 1$; then, the model averaging estimator of $\mu_t$ can be written as
$$\hat{\mu}_t(w) = \sum_{m=1}^{M} w_m \hat{\mu}_t^{(m)}.$$

Note that if $w$ is a unit vector, the model averaging estimator coincides with that of a single MEM.

In addition, the KL divergence loss associated with $w$ is
$$\mathrm{KL}(w) = \int h(x)\log\frac{h(x)}{\hat{f}(x \mid w)}\,dx,$$
where $h(\cdot)$ is the unknown true conditional joint density generating the data and $\hat{f}(\cdot \mid w)$ is the estimated density function based on equation (7) with $\mu_t$ replaced by $\hat{\mu}_t(w)$. However, since $h(\cdot)$ is unknown, the minimization of equation (11) cannot be carried out empirically. In order to obtain a computable objective function, the weight choice criterion derived from the KL divergence takes the form
$$C_n(w) = -\sum_{t=1}^{n} \log \hat{f}\bigl(x_t \mid \hat{\mu}_t(w)\bigr) + \lambda_n \sum_{m=1}^{M} w_m k_m,$$
where $k_m$ is the number of parameters of the $m$th candidate model and $\lambda_n$ is the hyperparameter introduced to avoid overfitting. So, the optimal model averaging weight is obtained by minimizing $C_n(w)$:
$$\hat{w} = \arg\min_{w \in \mathcal{W}} C_n(w).$$
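A numerical sketch of this weight choice step is given below, assuming the exponential quasi-likelihood form of the criterion outlined above (negative log-likelihood at the averaged conditional mean plus the penalty $\lambda_n \sum_m w_m k_m$); the function name and the use of a simplex-constrained SLSQP optimizer are implementation choices for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def klma_weights(x, mu_hat, k, lam):
    """Minimize the (exponential quasi-likelihood) KL weight criterion over the simplex.

    x      : (n,) observed nonnegative series
    mu_hat : (n, M) fitted conditional means of the M candidate models
    k      : (M,) number of parameters in each candidate model
    lam    : penalty hyperparameter lambda_n
    """
    n, M = mu_hat.shape
    k = np.asarray(k, dtype=float)

    def criterion(w):
        mu_w = mu_hat @ w                              # averaged conditional mean
        return np.sum(np.log(mu_w) + x / mu_w) + lam * np.dot(w, k)

    res = minimize(criterion,
                   x0=np.full(M, 1.0 / M),             # start from equal weights
                   method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
    return res.x
```

Given the fitted conditional means of the candidate models, the averaged in-sample fit or forecast is then simply the weighted combination mu_hat @ klma_weights(x, mu_hat, k, lam).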

3. Asymptotic Properties of the Estimator

In this section, we investigate the large-sample properties of the KLMA. It is known from Liu et al. [18] that KLMA is asymptotically optimal, in the sense of achieving the lowest possible KL divergence, for conditional volatility models. Similarly, we can establish the same asymptotic optimality of KLMA for the MEM.

We provide some technical conditions for Theorem 1.

Assumption 1. There exists some constant $\kappa > 0$ such that $\mu_t^{(m)} \ge \kappa$ for all $t$ and $m = 1, \ldots, M$.

Assumption 2. There exists such that

Assumption 3. Let ; we have .

Assumption 4. According to equation (8), ; then, we denote , which holds .

Assumption 1 requires that all the conditional mean functions are positively lower bounded. Such a restriction is reasonable since otherwise the estimated conditional mean could degenerate to zero; moreover, it is easily satisfied. For example, for the MEM(1, 1), it means and . For Assumption 2, the parameter is the optimum to which converges, so . In this situation, Assumption 2 holds when . Assumption 3 places restrictions on the growth rate of the minimum approximation error . It requires that increases more slowly than both and the maximum deviation between and its conditional mean with respect to . Assumption 4 is generally stronger than the corresponding conditions in existing research such as Hansen [12]; compared with the other assumptions, it appears to be the most restrictive.

Besides, we provide a necessary lemma before stating Theorem 1.

Lemma 1. A random variable series is given, where is defined in the weight spaces . Considering the series and , where , if exists, and holds, then we have

For the proof of Lemma 1, see Liu et al. [18]. Lemma 1 shows that the ratio of the infimum of and converges to 1 in probability when the stated conditions hold. Then, we have the following Theorem 1 based on Lemma 1 and Assumptions 1–4.

Theorem 1 (Asymptotic Optimality). If Assumptions 1–4 are satisfied, then there exist local minimizers of such that

Proof. See Appendix A.

Theorem 1 states that the KL divergence between the model averaging estimator with weights determined by KLMA and is asymptotically equivalent to the lower bound. Therefore, we can optimize the weight of each model by KLMA.

Next, we establish consistency, that is, the rate at which the weights chosen by KLMA tend to the infeasible optimal weight vector. Denote the optimal weight . Let and be the minimum and maximum singular values of a general real matrix . Denote , , and . Then, we need the following regularity conditions for Theorem 2.

Assumption 5. There are two positive constants and such that , in probability tending to 1.

Assumption 6. .

Assumption 7. There exists some positive constant such that and .

Assumption 5 is common and requires that both the minimum and maximum singular values of are bounded away from zero and infinity. Assumption 6 is similar to Condition (C.1) of Zhang et al. [31] and is a high-level condition that can be derived from more primitive conditions. The first part of Assumption 7 is implied by the conditions for Theorem 2 in Wan et al. [14] for sufficiently small . The second part of Assumption 7 implies that the number of candidate models can increase with , but at a restricted rate. For instance, when the candidate models are nested, if is of order for some , then can be .

Besides, we provide a necessary lemma before stating Theorem 2.

Lemma 2. If the innovation term follows the exponential distribution, then the first-order condition of
is equal to

For the proof of Lemma 2, see Appendix B. Then, we have the following Theorem 2 based on Lemma 2 and Assumptions 5–7.

Theorem 2 (Consistency). If is an interior point of and Assumptions 1–3 are satisfied, then there exist local minimizers of such that
where is a positive constant given in Assumption 2.

Proof. See Appendix B.

It is seen from Theorem 2 that the convergence rate is determined by the sample size and the KL divergence risk . Given the speed of , the slower grows, the faster the convergence is. In addition, if and the convergence of is valid, we have . Moreover, establishing asymptotic optimality as in Theorem 1 would require assuming that , which is also a necessary condition used by Hansen [12] and Hansen and Racine [16]; this condition is reasonable if all candidate models are misspecified. In Theorem 2, we do not require this condition because the true model is allowed to be one of the candidate models. This strengthens the theoretical basis for applications of the KLMA.

4. Monte Carlo Simulation

In this section, we conduct a simulation study to demonstrate the large-sample performance of the SAIC, SBIC, MMA, and our KLMA estimators for different cases. Here, the SAIC and SBIC, as in Hoeting et al. [11], assign the weights and to the th model, respectively. MMA is the model averaging criterion proposed by Hansen [12], which takes the form . Besides, following Zhang et al. [17], we choose the hyperparameters and in our estimator, and the resulting versions are named OPT1 and OPT2.
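For reference, the smoothed information criterion weights used by SAIC and SBIC follow the standard exponential-IC construction, weight proportional to exp(-IC/2); the short sketch below illustrates this, and the AIC values in the example are hypothetical.

```python
import numpy as np

def smoothed_ic_weights(ic):
    """Smoothed-IC weights: w_m proportional to exp(-IC_m / 2)
    (SAIC when ic contains AIC values, SBIC when it contains BIC values)."""
    ic = np.asarray(ic, dtype=float)
    z = np.exp(-(ic - ic.min()) / 2.0)   # subtract the minimum for numerical stability
    return z / z.sum()

# hypothetical AIC values for three candidate models
print(smoothed_ic_weights([1520.3, 1518.9, 1525.1]))
```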

4.1. Simulation 1: The Case with Specified Candidate Model

This simulation is designed to study the case in which the true model can be one of the candidate models. The candidate model set includes the MEM, asy-MEM, and log-MEM with maximum lag 2. The details are given below. The data generation process (DGP) in this design is
where and the innovation term ; the sample sizes considered are 1000, 2000, and 5000. Define and vary so that varies on a grid in . To evaluate each estimator, we compute the root mean square error (RMSE) and mean absolute error (MAE) as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\bigl(\hat{\mu}_t(\hat{w}) - \mu_t\bigr)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\bigl|\hat{\mu}_t(\hat{w}) - \mu_t\bigr|,$$
where $\hat{\mu}_t(\hat{w})$ is the model averaging estimate and all quantities are averaged across 1000 replications. The simulation procedure is as follows (a schematic sketch of these steps in code is given after the list):
(i) Step 1: Generate sample data from the DGP with the different sample sizes, respectively
(ii) Step 2: Estimate the parameters of each candidate model
(iii) Step 3: Choose the weight of each model based on SAIC, SBIC, MMA, and KLMA, respectively
(iv) Step 4: Compare the in-sample fitting performance with different of each estimator
(v) Step 5: Compare the 5-step-ahead, 10-step-ahead, and 20-step-ahead out-of-sample forecasting performance of each estimator
(vi) Step 6: Repeat Steps 1 to 5 1000 times and calculate the average MKL, RMSE, and MAE of each method
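The sketch below outlines one Monte Carlo replication (Steps 1 to 4) together with the RMSE and MAE computations, reusing the simulate_mem11, fit_mem11, and klma_weights functions from the earlier sketches; the candidate-model interface (each fitter returning fitted means and a parameter count) is a simplification introduced here for illustration.

```python
import numpy as np

def rmse(a, b):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def mae(a, b):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(a) - np.asarray(b)))

def one_replication(n, candidate_fitters, lam, seed):
    """One replication: simulate (Step 1), fit candidates (Step 2),
    choose weights (Step 3), and evaluate the in-sample fit (Step 4)."""
    x, mu_true = simulate_mem11(n, seed=seed)
    fits = [fit(x) for fit in candidate_fitters]      # each fitter returns (mu_hat, n_params)
    mu_hat = np.column_stack([m for m, _ in fits])
    k = np.array([p for _, p in fits])
    w = klma_weights(x, mu_hat, k, lam)
    mu_avg = mu_hat @ w                               # model averaging estimate of mu_t
    return rmse(mu_avg, mu_true), mae(mu_avg, mu_true)
```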

The in-sample fitting results are summarized in Figure 1. It can be seen that KLMA is superior to SAIC, SBIC, and MMA in terms of RMSE in most cases of and , while MMA becomes better in terms of MAE. Recall that, relative to MMA, KLMA imposes a heavier penalty on the number of variables in the candidate models, and when the dimension of the candidate model increases, this penalty becomes larger. Therefore, KLMA performs better under the less robust evaluation function RMSE and MMA performs better under the more robust evaluation function MAE. This also suggests that the out-of-sample prediction accuracy of KLMA is better than that of MMA because overfitting is reduced. From the out-of-sample forecasting results summarized in Table 1, OPT2 yields the lowest RMSE and MAE in most cases.

Overall, the KLMA performs better than the other methods as the sample size increases, and this is most visible in the case in which the error term follows .

4.2. Simulation 2: The Case with Misspecified Candidate Model

This simulation follows Hansen and Lunde [32]; all candidate models, which are the same as in Simulation 1, are misspecified because the data generation process below contains a nonlinear term:

The parameter settings are the same as in Simulation 1. The in-sample fitting results are summarized in Figure 2. As in Simulation 1, KLMA is superior to SAIC, SBIC, and MMA in terms of RMSE in most cases of and , while MMA becomes better in terms of MAE; the explanation is the same as before, since KLMA imposes a heavier penalty on the number of variables in the candidate models. This also suggests that the out-of-sample prediction accuracy of KLMA is better than that of MMA because overfitting is reduced. From the out-of-sample forecasting results summarized in Table 2, OPT2 yields the lowest RMSE and MAE in most cases. Moreover, these results illustrate the robustness of KLMA to outliers, since its prediction accuracy is very close to that in Simulation 1 even though the DGP of Simulation 2 contains a nonlinear term.

Overall, the KLMA performs better than the other methods as the sample size increases, and this is most visible in the case in which the error term follows .

5. Empirical Application

5.1. Example 1: Daily Range

Firstly, we use these model averaging estimation methods to analyse the daily range of the Standard and Poor's 500 (S&P500) index studied by Chou [33]. The sample period is from April 26, 1982, to June 20, 1994. The total number of observations is 5,425. We obtain the data from the finance subdirectory of Yahoo.com. The daily range is , where is the highest price and is the lowest price on the same day . Figure 3 shows the series and Table 3 reports the descriptive statistics of the daily range. From the skewness, kurtosis, and Jarque–Bera tests, we know that the daily range is nonnormal. The daily range is stationary since the augmented Dickey–Fuller (ADF) test statistic is −17.9141, which is below the critical value at the 1% significance level. The total number of observations is 3,074. Observations from April 26, 1982, to December 24, 1992, are used for estimation and observations from December 28, 1992, to June 20, 1994, are used for out-of-sample forecasting. The same candidate models as in Simulation 1 are considered.

It is seen from Table 4 that SAIC performs best among the five model averaging methods for in-sample fitting based on RMSE, and MMA performs best for in-sample fitting based on MAE. Although KLMA performs worse in sample, both OPT1 and OPT2 are better than SAIC, SBIC, and MMA in all cases of out-of-sample forecasting. We also observe that OPT1 and OPT2 have very similar performance, which suggests that KLMA is less affected by outliers than the other estimators. Overall, the KLMA leads to better results than the other methods.

5.2. Example 2: Price Duration

Secondly, we use these model averaging estimation methods to analyse the price durations of the International Business Machines (IBM) tick-by-tick transaction data studied by Lu and Ke [29]. The sample period runs from November 1, 1990, to January 31, 1991. All trades before 9:30 AM and after 4:00 PM are discarded and observations with zero duration are removed, leaving 53,307 unique transaction times. Lu and Ke [29] calculated the price duration process by using the middle price defined as and specified a price threshold to calculate the price durations; hence, they only considered those points at which the changes in prices were equal to or greater than . After this thinning of the original data, the total number of points in the new marked point process is 11,418. Figure 4 shows the series and Table 4 reports the descriptive statistics of the price durations of IBM.
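A minimal sketch of the thinning step is given below: starting from transaction times and prices, it keeps only the points at which the price change since the last retained point is at least a threshold c in absolute value, and records the elapsed times as price durations. The threshold value and variable names are hypothetical illustrations; the actual mid-price construction and threshold used by Lu and Ke [29] should be consulted.

```python
import numpy as np

def price_durations(times, prices, c=0.125):
    """Thin a marked point process: retain points where the absolute price change
    since the last retained point is >= c, and return the durations between them.
    The threshold c = 0.125 is a hypothetical value used only for illustration."""
    times = np.asarray(times, dtype=float)
    prices = np.asarray(prices, dtype=float)
    kept_time, kept_price = times[0], prices[0]
    durations = []
    for t, p in zip(times[1:], prices[1:]):
        if abs(p - kept_price) >= c:
            durations.append(t - kept_time)   # elapsed time since the last retained point
            kept_time, kept_price = t, p
    return np.array(durations)
```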

From the skewness, kurtosis, and Jarque–Bera tests, we know that the price durations are nonnormal. The price duration series is stationary since the augmented Dickey–Fuller (ADF) test statistic is −78.7965, which is below the critical value at the 1% significance level. In this paper, observations from November 1, 1990, to January 30, 1991, are used for estimation and observations from January 30, 1991, to January 31, 1991, are used for out-of-sample forecasting. The same candidate models as in Simulation 1 are considered.

It is seen from Table 4 that SAIC performs best among the five model averaging methods for in-sample fitting based on RMSE, and MMA performs best for in-sample fitting based on MAE. Although KLMA performs worse in sample, both OPT1 and OPT2 are better than SAIC, SBIC, and MMA in all cases of out-of-sample forecasting. We also observe that OPT1 and OPT2 have very similar performance. Overall, the KLMA leads to better results than the other methods.

6. Conclusion

In recent years, the dynamic characteristics of financial markets described by nonnegative time process indicators have become a hotspot of financial econometrics, and Engle [1] proposed the MEM to describe such nonnegative time processes. Considering the different structures of MEM specifications, this paper constructs the optimal model averaging estimator, named KLMA, with a weight selection criterion obtained by minimizing the KL divergence, so that a weighted average of all the candidate estimates is used to obtain better results. Besides, the consistency and asymptotic optimality of the KLMA are proved theoretically and the corresponding convergence rate is derived. The simulation results and empirical examples indicate that the KLMA performs better than the other methods as the sample size increases, and this is most visible in the case in which the error term follows .

The KLMA method proposed in this paper provides a new perspective for estimating nonnegative time processes based on the MEM; the model averaging weights estimated by minimizing the KL divergence are used to improve the out-of-sample forecasting accuracy. Although many model averaging methods have been developed, little attention has been given to the asymptotic distribution of the resulting model averaging estimator; such asymptotic distributions are nonnormal and have complicated forms, which makes inference difficult. This is left for future research.

Appendix

A. Proof of Theorem 1

The proof of Theorem 1 consists of the following two parts.

First, we prove that
where is the true density of the distribution generating the data . From equations (11), (12), and (A.1), we have

Besides, note that
Then, under Assumption 3, equation (A.1) holds as long as we can prove that

From Assumption 4, since , we have . Moreover, equation (A.10) holds based on Assumption 3. Then, we prove equations (A.6)–(A.9). From Assumption 1, for all , we have and ; then,
where . So, according to Assumption 2, we have

According to equations (A.12) and (A.13), equations (A.8) and (A.9) hold.

Besides, note that

So, according to Assumption 2, we have

According to equations (A.15) and (A.16), equations (A.6) and (A.7) hold. Hence, we have completed the proof of equation (A.1). Therefore, according to Lemma 1, we have

Note that

So, we have

Finally, according to equations (A.1) and (A.19), we have

Thus, Theorem 1 has been proved.

B. Proof of Lemma 2

Firstly, the first-order condition of the weight choice criterion is

Besides, the first-order condition of the weight choice criterion is

The solution of equation (B.1) is the same as that of equation (B.2). Hence, letting and , the true time process (1) can be rewritten as , where and . So, the form of can be rewritten as

C. Proof of Theorem 2

Denote . To verify equation (18) of Theorem 2, following Liao and Zou [34], it suffices to show that there exists a constant such that, for the vector and ,
which means that there exists a minimizer in the bounded closed domain such that . We then note that

In the following, we show that is the leading term of equation (C.2).

Since in probability tending to 1 under Assumption 5, we have

Noting that , we obtain

From Assumption 5, it is clear that . Thus,

Recalling , we see that is asymptotically dominated by .

Now we are in a position to derive the stochastic orders of the remaining terms of equation (C.2). From Assumption 6, it is seen that

In addition, from Assumption 7, we have

So, from equations (C.5)–(C.7), it is seen that , , and are all asymptotically dominated by . Hence, equation (C.1) is true, and Theorem 2 is proved.

Data Availability

In this study, we use simulated data to show the performance of our model, and the simulation processes are explained in the paper. For the real data analysis, we use the same data as Chou [33] and Lu and Ke [29]. These data can be freely collected from https://www.Yahoo.com. The data are available from the corresponding author upon request ([email protected]).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Science Foundation of China under Grants 71771187 and 72011530149, the Program for New Century Excellent Talents in University under Grant NCET-13-0961, and the Fundamental Research Funds for the Central Universities in China under Grant JBK190602.