Big Data Optimization in Machine Learning by Xiaocheng Tang A Dissertation Presented to the Graduate Committee of Lehigh University in Candidacy for the Degree of Doctor of Philosophy in Industrial Engineering Randomized Derivative-Free Optimization of Noisy Functions by Ruobing Chen P sented to the Graduate and Research Co mi te of Lehigh niversity in Candidacy for the Degree of Doctor of Philosophy in Industrial and Systems Engineering Lehigh University January 20, 2015 Lehigh University January, 2016 © Copyright by Xiaocheng Tang 2015 All Rights Reserved ii Approval of the Doctoral Committee Big Data Optimization in Machine Learning Xiaocheng Tang Approved and recommended for acceptance as a dissertation in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Date Dissertation Advisor Committee Members: Dr. Katya Scheinberg, Committee Chair Dr. Frank E. Curtis, Dissertation Advisor Dr. Nader Motee, Dissertation Advisor Dr. Jorge Nocedal, External Member iii Dedication To my family. iv Acknowledgments First and foremost, I would like to express my greatest appreciation to my advisor and dissertation committee chair Dr. Katya Scheinberg who has provided me guidance and support ever since the beginning of my Ph.D. study. She is an amazing advisor and has always inspired me to reach for a higher level. I would like to thank my committee members, Dr. Frank E. Curtis, Dr. Nader Motee and Dr. Jorge Nocedal. Their insightful comments and constructive advices have been invaluable. Without their guidance and persistent help, this dissertation would not have been possible. I also wish to thank Peiyan who has been hugely supportive throughout the several years it has taken to finish this work. And finally I would like to thank all my friends, professors and faculties at Lehigh, and everyone else who has helped me complete this dissertation. Without their continued efforts and support, I would have not been able to bring my work to a successful completion. v Contents Dedication iv Acknowledgments v List of Tables viii List of Figures ix Abstract 1 1 Introduction 3 1.1 Hybrid approach and derivative-free optimization in structured learning 4 1.2 Adaptive active-set and communication-efficient coordinate descent . 9 1.3 Complexity of inexact proximal Newton methods and randomization . 12 1.4 A brief outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . 15 2 Machine Learning Applications of Convex Optimization 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Sparse Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.1 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . 20 2.3 HiClass An efficient Hierarchical Classification system . . . . . . . . 25 2.3.1 The Hierarchical Classification Framework . . . . . . . . . . . 27 2.3.2 A Novel Hybrid Approach . . . . . . . . . . . . . . . . . . . . 29 vi 2.3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Distributed Learning for Short-term Forecasting . . . . . . . . . . . . 39 2.4.1 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.4.2 Description of the Algorithm . . . . . . . . . . . . . . . . . . . 44 3 Hierarchical Multi-label Learning 56 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.2 Hierarchical sparsity-inducing norms . . . . . . . . . . . . . . . . . . 60 3.3 Hierarchical Multi-label Model . . . . . . . . . . . . . . . . . . . . . . 62 3.4 Learning φ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.4.1 Derivative-Free Optimization . . . . . . . . . . . . . . . . . . 64 3.4.2 Complexity of hierarchical operator . . . . . . . . . . . . . . . 65 3.5 Other hierarchical methods . . . . . . . . . . . . . . . . . . . . . . . . 67 3.6 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4 Complexity of Inexact Proximal Newton Methods 75 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 Basic algorithmic framework and theoretical analysis . . . . . . . . . 76 4.3 Basic results, assumptions and preliminary analysis . . . . . . . . . . 78 4.3.1 Analysis via proximal gradient approach . . . . . . . . . . . . 81 4.4 Analysis of sub linear convergence . . . . . . . . . . . . . . . . . . . . 88 4.4.1 The exact case . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.4.2 The inexact case . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.4.3 Complexity in terms of subproblem solver iterations . . . . . . 98 4.5 Analysis of the inexact case under random subproblem accuracy . . . 100 4.5.1 Analysis of Subproblem Optimization via Randomized Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 vii 5 Adaptive Active-set in Proximal Quasi-Newton Method for Large Scale Sparse Optimization 106 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2 The proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3 Mutable matrix type in low-rank hessian approximation . . . . . . . . 110 5.3.1 Efficient matrix mutating operations . . . . . . . . . . . . . . 111 5.4 Self-adaptive greedy active-set selection . . . . . . . . . . . . . . . . . 115 5.5 Fast coordinate descent on low-rank structure . . . . . . . . . . . . . 118 5.6 Inner loop termination criteria and global convergence . . . . . . . . 119 5.7 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.7.1 Active-set methods . . . . . . . . . . . . . . . . . . . . . . . . 123 5.7.2 Inner loop termination and convergence . . . . . . . . . . . . . 128 5.7.3 Performance comparison with other algorithms . . . . . . . . . 131 6 Conclusions 140 Bibliography 151 Biography 152 viii List of Tables 2.1 Per Iteration Complexity . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1 Datasets statistics used in experiments. Dataset itp and ads have the same exact dimension, but are constructed differently – itp targets users' general interest and includes users' activities from other category-related events besides ads view or ads click. . . . . . . . . . 67 3.2 The comparison of FLAT, HML, Leaf Reduce (LR) and Root Map (RM) on all three datasets, with squared and logistic loss function, and evaluated by precision, recall and f1 score. . . . . . . . . . . . . 67 5.1 Row and column operations on LMatrix type M ∈ Rn×m. Note that while one memcpy or memmove on O(n) data has a complexity of O(n), the constant in front of n can be much smaller than 1 (hence much faster than normal n flops), due to the use of cache optimizations and specialized processor instructions such as SIMD in those built-in library functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.2 Data sets used in the experiments. . . . . . . . . . . . . . . . . . . . 123 5.3 Compare the performance of different active-set strategies. . . . . . . 127 5.4 Data statistics in sparse logistic regression experiments. . . . . . . . 136 ix List of Figures 1.1 Visualization of taxonomy among labels in dataset ads. . . . . . . . 6 2.1 The hierarchy tree of the labels. . . . . . . . . . . . . . . . . . . . . . 29 2.2 HiClass system flow chart . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 CCAT subtree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4 Performance comparison at each node between above mentioned algorithms. For HYBRID, we use GLMNET in the upper level and IPM at bottom. The number in the leftmost block of each table denotes the corresponding performance measure on the whole tree. For example, as shown by the rightmost upper table, 70% aggregate accuracy is achieved using the novel HYBRID method. . . . . . . . . . . . . . . 36 2.5 MCAT subtree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6 Performance comparison at each node between above mentioned algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.7 Illustration of Load Series Periodicity. . . . . . . . . . . . . . . . . . 41 3.1 The tree structure is illustrated with blue rectangles as nodes and straight lines as edges. Groups g are green translucent rectangles covering nodes that are contained in the group. . . . . . . . . . . . . . . 61 x 3.2 Testing and training performance under different metrics as learning objective on rcv dataset. Take (b) for example. HML equipped with jaccard similarity as the learning objective is applied on a training set of 3000 instances. Solutions after each iteration (the x-axis) of HML are recorded and later evaluated under all three metrics on the full rcv dataset (test set). The red curve denotes the objective values evaluated on the test set. In the same subplot along with the red curve are the objective values evaluated at different iterations on the training set. Other two subplots illustrate the same solutions evaluated by other two metrics on the same test set. . . . . . . . . . . . . . . . . . . . . 68 5.1 Illustrate the dynamics of the four index sets (5.4.1) as x→ x∗. Note how |Z1(x)| and |Z2(x)| start from the top and the bottom, respectively, and come together in the end, each containing (roughly) half the size of non-zero elements in x∗. Also note that |Z3(x)| vanishes at optimality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.2 Evolution of free sets in STD and GRDY-AD. Left: dataset mnist (p = 784, N = 1, 000, 000). Right: dataset gisette (p = 5000, N = 6000). The line plots are the size of free set against iterations and the bars denote the total sum of free set sizes over all iterations . . . . . 126 5.3 Study different inner loop termination criteria and its effect on algorithm performance. Time (top) and iterations (bottom) comparison for different l(k): log, linear, constant 5 and constant 200. . . . . . . 128 5.4 Consider linear function l(k) = ak+ b and study the effect of a on running time and iterations for different inner algorithm, i.e., coordinate descent (a) and proximal gradient (b). . . . . . . . . . . . . . . . . . 129 5.5 Convergence plots on SICS (the y-axes on log scale). . . . . . . . . . 135 5.6 RCD step count of LHAC on different SICS data sets. . . . . . . . . 136 5.7 Convergence plots on SLR (the y-axes on log scale). . . . . . . . . . 137 5.8 RCD step count of LHAC on different SLR data sets. . . . . . . . . 138 xi Abstract Modern machine learning practices at the interface of big data, distributed environment and complex learning objectives post great challenges to designing scalable optimization algorithms with theoretical guarantees. This thesis, built on the recent advances in randomized algorithms, concerns development of such methods in practice and the analysis of their theoretical implications in the context of large-scale structured learning problems, such as regularized regression/classification, matrix completion, hierarchical multi-label learning, etc. The first contribution of this work is thus a hybrid hierarchical learning system that achieve efficiency in a data-intensive environment. The intelligent decoding scheme inside the system further enhances the learning capacity by enabling a rich taxonomy representation to be induced in the label space. Important factors affecting the system scalability are studied and further generalized. This leads to the next contribution of the work – a globally convergent inexact proximal quasi-Newton framework and the novel global convergence rate analysis. This work constitutes the first global convergence rate result for an algorithm that uses randomized coordinate descent to inexactly optimize subproblems at each iteration. The analysis quantifies precisely the complexity structure of proximal Newton-type algorithms, which makes it possible to optimize based on that structure to reduce complexity. The final contribution of the work is a practical algorithm which enjoys global convergence guarantee from the framework. The algorithm is memoryand communication-efficient and directly addresses the big data learning cases when both N (samples) and n (features) are large. We demonstrated that this general algorithm is very effective in practice and is competitive with state-of-the-art 1 specialized methods. 2 Chapter 1 Introduction The basic problem of interest in this work is the unconstrained minimization given by: min x∈Rn F (x) ≡ f(x) + g(x) (1.0.1) where f, g : Rn → R are both convex functions such that f is twice differentiable, with a bounded Hessian, and g is Lipschitz continuous, often non-smooth and in many applications has the following separable structure g(x) = J∑ j=1 λjgj(xj) (1.0.2) where each gj is a convex function; λ > 0 is a parameter; and xj ∈ Rnj denotes a partition of the component of x, e.g., the widely-used `1 regularization is obtained by choosing gj = | * |, nj = 1,∀j. Also known as composite optimization, problems of the form (1.0.1) have been the focus of much research lately at the interface of optimization and machine learning. This form encompasses a variety of machine learning models, in which feature selection or structured learning is desired, e.g., • regularized logistic regression, where f computes the logistic loss on the given training set and the regularizer g can be group-`1/`2 depending on the problem. 3 • inverse covariance estimation, where f(X) = − log detX + tr(SX), resulting from maximum conditional log-likelihood estimation like logistic regression, with S as the sample covariance matrix; g(X) = ∑ i,j λij|Xij| is essential here in order to successfully recover the zero entries in the final inverse covariance matrix. • LASSO or group LASSO, where f is a least-square function and g is either || * ||1 or group `1 regularization depending on whether group sparsity is required. • support vector machine, where f(w) = ||w||22 and g(w) is the Vapnik's εinsensitive loss such that in the form (1.0.2) J is the sample size, each partition nj = n includes the full vector w and each gj is the loss max(0, |yi − wxi| − ε) for the i-th example (xi, yi) in the training set. • matrix completion, where in one formulation that estimates a low-rank matrix X = LRT , f computes the least-square loss between the linear mapping A(LRT ) and a given vector b, while g = || * ||2F regularizes the low-rank factor L and R. 1.1 Hybrid approach and derivative-free optimization in structured learning The above examples often present common difficulties to optimization algorithms due to their large scale. During the past decade most optimization effort aimed at these problems focused on development of efficient first-order methods, such as accelerated proximal gradient methods [1–3], block coordinate descent methods [4– 7] and alternating directions methods [8]. These methods enjoy low per-iteration complexity, but typically have slow local convergence rates. Their performance is often hampered by small step sizes. This, of course, has been known about first-oder methods for a long time, however, due to the very large size of these problems, secondorder methods are often not a practical alternative. In particular, constructing and 4 storing a Hessian matrix, let alone inverting it, is prohibitively expensive for values of n larger than 10000, which often makes the use of the Hessian in large-scale problems prohibitive, regardless of the benefits of fast local convergence rate. Hybrid approach and top-down scheme A fairly straightforward way to overcome the shortcomings of the two classes of methods is to adopt a hybrid approach. This is demonstrated in Chapter 2 where we propose a flexible classification system called HiClass for inducing hierarchical structure in the prediction results. It strikes a balance between the efficiency of first order methods and the accuracy enjoyed by second order methods. Hierarchical structure appears in many real-world recommendation and targeting systems. In display-ads targeting, for example, each user is associated with a set of labels such as sports, football, vehicles, luxury cars, etc., and each label contains ads campaign that this user is likely to interact with. It is often the case that the labels are not independent of each other, and that their interdependence can be characterized by a connected (tree) structure (see Figure 1.1), e.g., football is the child of sports. The task then is to assign each user with a set of labels that both appeal to the user and satisfy the underlying taxonomy. Overlooking such structure not only undermines the predictive performance (loss of information), but also makes the results hard to interpret, e.g., predicting that a user is interested in football but not in sports. HiClass exploits the additional knowledge in the structure of the labels, and applies a recursive top-down approach: first discriminate the subsets of labels at the top level of the hierarchy before going down the next level to discern the labels (or sets of labels) in those subsets. This process is repeated until reaching the bottom level. The overall task is decoupled into a series of smaller ones, with each one defined for predicting membership to each class and trained independently with no attention to the hierarchy. HiClass employs a scalable first-order optimization algorithm to train the highlevel classifiers, and a fast second-order algorithm for the low-level classification problems. The main aim of this hybrid approach is to alleviate the shortcomings of the 5 Figure 1.1: Visualization of taxonomy among labels in dataset ads. two methods, i.e. lower accuracy of the first-order method and slower speed of the second-order method, by adjusting the algorithms to the training data sizes that vary from node to node. It exploits a typical scenario in such top-down approach that size of the data becomes smaller at the lower levels of the hierarchy, as the classes become more specific. Hence, using the same optimization method to solve the classification problems through the hierarchy without accounting for different data distribution can be either inefficient or inaccurate – with IPM we need to solve a sequence of large-scale Quadratic Programming (QP) problems that can quickly become intractable simply 6 because of the sheer size of data, and with first-order methods it can be painfully slow to reach the solution at nodes in the lower level where higher accuracy is often desirable. Experiments on the text categorization data set RCV1-v2 show that HiClass achieves better predictive performance (30% more) than first-order method SCD [9] given the same timing budget, and scales to much larger data sets than second-order method IPM [10]. More details about this work are described in [11, 12], and [13] that appears in Learning and Intelligent Optimization Conference. Structured optimization and parameter learning The problem with the topdown scheme in HiClass is that it predicts the part of the label vector that correspond to the upper levels of the taxonomy first, without considering the lower levels. This clearly is a suboptimal approach because the optimization is not performed over the entire set of variables. In this work we will introduce a two-stage hierarchical multi-label learning model that improves upon the top-down scheme to achieve a richer and more powerful hierarchical representation. The proposed model combines base learners of each label with an optimal decoding scheme which induces the hierarchical structure among the outputs of base learners by solving a convex optimization problem based on structured sparse learning. Given a predictor W , computed during training, and a input vector x, the product Wx is typically an unstructured dense vector. The true label vector, however, is not only sparse (with the non zeros in y denoting positive labels) but has to satisfy the hierarchical structure. In prior approaches a unstructured sparse vector has been recovered given Wx, for example, by using a sparsity inducting `1 norm. In this work we make use of a particular variant of group-lasso penalty function [14], called hierarchical norm. This norm is constructed with a recursive group structure based on the given taxonomy. Minimizing objective (1.0.1) equipped with the hierarchical norm, for instance, yields in variable x a particular sparse pattern that respects the underlying taxonomy. 7 Compared to other hierarchical inducing methods, the use of hierarchical norm turns out to be more powerful and enables a richer hierarchical representation on any input vector to be induced in a very efficient way. But to make it work well in practice is not easy and often requires carefully choosing the parameter/weight for each group. For `1-norm this is easily done using a grid search since it has only one parameter determining the sparsity. But for hierarchical norm the number of parameters can be several hundreds, depending on the number of subtrees in a given taxonomy. Hence the two-stage hierarchical model we propose chains the learning and inducing in a single pipeline where each can be efficiently approached. In particular, we base the first-stage training on the multi-label Empirical Risk Minimization (ERM) framework [15] where various convex loss functions can be employed, i.e., least squares, logistic, or hinge loss. The output of the first problem is then used as the input of the second stage. The hierarchical weights are learned in the second stage by directly optimizing popular metrics such as f1, jaccard similarity score or hamming loss, while taking into account of both the classifiers and the encoded hierarchical information present in the training data. While the second problem still is a bi-level optimization problem, its dimension is now significantly reduced to the order of the number of labels that is well under 1000 in this work. In this work we approach the bi-level problem as a black-box optimization problem, which we solve using an efficient derivative-free optimization methods [16]. The details can be found in Section 3.4. Finally, we demonstrate the model on large-scale real-world data sets from document classification, as well as ad targeting domains that involves two million users' daily activities. The results demonstrate the potential of the proposed model. 8 1.2 Adaptive active-set and communication-efficient coordinate descent HiClass improves the efficiency and scalability of hierarchical classification by launching different algorithms to optimize sparse logistic regression for different nodes of the hierarchy, thus overcoming the shortcomings of both first-order and second-order methods. The next question we ask is whether we can borrow from these ideas to improve the existing algorithms for sparse logistic regression, or any sparsity-inducing models with objective F (1.0.1) in general. The objective F in these settings are often constructed from a N by n data matrix that can present common difficulties to optimization algorithms due to the large scale of instances N or features n. Here we propose a fast, memoryand communication-efficient algorithm that employs a two-loop scheme designed to directly address such cases when both N and n are large and that combines firstand second-order information within one single optimization algorithm. This two-loop scheme implements the global-convergent framework that is discussed in Chapter 4 and is designed to separate the optimization procedures into two parts, the outer loop whose complexity depend on both n and N but can be easily distributed, and the inner loop that only relies on n but is sequential in nature. We further propose a self-adaptive active-set selection mechanism, which enhances the inner loop procedures such that the dependence on n can be replaced by dependence on the support/non-zero subspace of the optimal solution x∗. The improvement is significant for (5.1.1) when feature selection is desirable, i.e., the support of x∗ is small although n is large. Finally, we show that establishing convergence rate guarantee of the active-set method is simple by applying the theory presented in Chapter 4 for a particular quadratic function Q constructed according to the active-set selection. The algorithm proposed in this work is backed by convergence analysis presented in Chapter 4 yet outperforms popular methods recently proposed for sparse optimization [4, 17–19]. These methods explore the following special properties of the 9 sparse problems: at optimality many of the elements of x are expected to equal 0, hence methods which explore active set-like approaches can benefit from small sizes of subproblems. Whenever the subproblems are not small, these new methods exploit the idea that the subproblems do not need to be solved accurately. In particular several successful methods employ coordinate descent to approximately solve the subproblems. Other approaches to solve Lasso subproblem were considered in [19], but none generally outperform coordinate descent. [4] proposes a specialized GLMNET [5] implementation for sparse logistic regression, where coordinate descent method is applied to the unconstrained Lasso subproblem constructed using the Hessian of f(x) – the smooth component of the objective F (x). Two main improvements increase efficiency of GLMNET for larger problems – exploitation of the special structure of the Hessian to reduce the complexity of each coordinate step so that it is linear in the number of training instances, and a two-level shrinking scheme proposed to focus the minimization on smaller subproblems. Similar ideas are used in [17] in a specialized algorithm called QUIC for sparse inverse covariance selection. The algorithm we propose is communication-efficient such that, unlike QUIC and GLMNET, the inner coordinate descent does not endure extra communications even when the data is distributed across a network, is general purpose such that its efficiency does not rely on the Hessian structure, and is backed by theoretical analysis and convergence rates [20] yet outperforms the state-of-the-art specialized methods such as QUIC and GLMNET. As these two and other methods, mentioned above, we consider the following general framework: • At each iteration F (x) is approximated, near the current iterate xk, by a convex quadratic function Q(x). • Then an algorithm is applied to optimize (approximately) the function Q(x), to compute a trial point. • The trial point is accepted as the new iterate if it satisfies some sufficient decrease condition. 10 • Otherwise, a different model Q(x) may be computed, or a line search applied to compute a new trial point. We make use of similar ideas as in [4] and [17], but further improve upon them to obtain efficient general schemes, which apply beyond the special cases of sparse logistic regression and covariance selection and for which convergence rates can be established. In particular, the active-set selection mechanism we propose consists of two phases, the greedy phase and the convergent phase, and the transition between phases is self-adaptive. In the greedy phase the mechanism adjusts the size of the active set adaptively, while in the convergent phase it includes all entries with dual constraint violations. Convergence can be established for the convergent phase by applying the theory in [20]. Compared to QUIC and GLMNET, this mechanism generates smaller inner problems through the process, as we demonstrate empirically. We employ a compact representation of LBFGS methods [21] to construct the function Q in our algorithm. While the use of LBFGS for (5.1.1) is not new, e.g., [22], to use it in the two-loop framework along with the active-set selection and coordinate descent is novel and has several implications both in theory and in practical algorithm design: 1) it separates the data from the inner solver such that the inner solver can be programmed in a generic way for a vector, the gradient, and two matrices G and Ĝ as we will see later, with no dependence on the actual form of function L(x); 2) data is solely handled in outer iterations which can also be programmed in a generic way while specifically optimized for how data is stored, e.g., in memory, on disk or across the network; 3) by carefully exploiting the low rank structure present in the compact representation of LBFGS the complexity of each coordinate descent step can be reduced to a constant! 4) the convergence of LBFGS with active-set selection and inexact subproblem minimization has not been analyzed before, but can be established, as we will see, by applying the theory in [20]. One of the key for the algorithm to be fast relies on efficiently computing compact LBFGS representation, which requires constantly performing mutating operations on large matrices. A naive implementation that allocates and frees new chunk of memory 11 every iteration can be several order slower due to increasing memory fragmentation, allocation overhead, and most importantly memory and cache inefficiency which can further slow down the coordinate descent process. In this work we discuss a mutable matrix type that is designed to exploit the specific mutating patterns present in LBFGS such that the mutating operations are done repeatedly on a single chunk of continuous memory, without consuming extra memory and with ultra-low latency. 1.3 Complexity of inexact proximal Newton methods and randomization The algorithm proposed in Chapter 5 belongs to a class of wide range of algorithms called inexact proximal Newton-type methods. The convergence behaviors and overall complexity of inexact proximal Newton-type methods are further analyzed in Chapter 4. The contributions of this work are mainly twofolds: 1. We discuss theoretical properties of the above framework in terms of global convergence rates. In particular, we show that if we replace the line search by a prox-parameter update mechanism, we can derive sublinear global complexity result for the above methods under mild assumptions on Hessian approximation matrices, which can include diagonal, quasi-Newton and limited memory quasiNewton approximations. We also provide the convergence rate for the case of inexact subproblem optimization. It turns out that standard global convergence analysis of proximal gradient methods (see [2, 23]) does not extend in a natural way to proximal quasi-Newton frameworks, hence we use a different technique derived for smooth optimization in [1, 24, 25], in a novel way, to obtain the global complexity result. 2. The heuristic of applying l(k) passes of coordinate descent to the subproblem is very useful in practice, but has not yet been theoretically justified, due to the lack of known complexity estimates. Here we use probabilistic complexity 12 bounds of randomized coordinate descent to show that this heuristic is indeed well justified theoretically. In particular, it guarantees the sufficiently rapid decrease of the error in the subproblems (with some probability) and hence allows for sublinear global convergence rate to hold for the entire algorithm (in expectation). This gives us the first complete global convergence rate result for the algorithmic schemes for practical (inexact) proximal Newton-type methods. Moreover, using the new analysis from [24–26] we are able to provide lower overall complexity bound than the one that follows from [23]. Let us elaborate a bit further on the new approaches and results developed in this chapter and discuss related prior work. In [27] Byrd et al. propose that the methods in the framework described above should be referred to as sequential quadratic approximation (SQA) instead of proximal Newton methods. They reason that there is no proximal operator or proximal term involved in this framework. This is indeed the case, if a line search is used to ensure sufficient decrease. Here we propose to consider a prox term as a part of the quadratic approximation. Instead of a line search procedure, we update the prox term of our quadratic model, which allows us to extend global convergence bounds of proximal gradient methods to the case of proximal (quasi-)Newton methods. The criteria for accepting a new iteration is based on sufficient decrease condition (much like in trust region methods, and unlike that in proximal gradient methods). We show that our mechanism of updating the prox parameter, based on sufficient decrease condition, leads to an improvement in performance and robustness of the algorithm compared to the line search approach as well as enabling us to develop global convergence rates. Convergence results for the proximal Newton method have been shown in [28] and [27] (with the same sufficient decrease condition as ours, but applied within a line search). These papers also demonstrate super linear local convergence rate of the proximal Newton and a proximal quasi-Newton method. Thus it is confirmed in [27, 28] that using second order information is as beneficial for problems of the form (4.1.1) as it is for the smooth optimization problems. These results apply to our 13 framework when exact Hessian (or a quasi-Newton approximation) of f(x) is used to construct q(x) (and if the matrices have bounded eigenvalues). However, theory in [27, 28] does not provide global convergence rates for these methods, and just as in the case on smooth optimization, the super linear local convergence rates, generally, do not apply in the case of LBFGS Hessian approximations. The convergence rate that we show is sub linear, which is generally the best that can be expected from a proximal (quasi-)Newton method with no assumptions on the accuracy of the Hessian approximations. Practical benefits of using LBFGS Hessian approximations is well known for smooth optimization [21] and have been exploited in many large scale applications. In this paper we demonstrate this benefit in the composite optimization setting (4.1.1). Some prior work showing benefit of limited memory quasi-Newton method in proximal setting include [29, 30]. We also emphasize in our theoretical analysis the potential gain over proximal gradient methods in terms of constants occurring in the convergence rate. To prove the sub linear rate we borrow a technique from [24–26]. The technique used in [2] and [23] for the proof of convergence rates of the (inexact) proximal gradient method do not seem to extend to general positive definite Hessian approximation matrix. In a related work [30] the authors analyze global convergence rates of an accelerated proximal quasi-Newton method, as an extension of FISTA method [2]. The convergence rate they obtain match that of accelerated proximal gradient methods, hence it is a faster rate than that of our method presented here. However, they have to impose much stricter conditions on the Hessian approximation matrix, in particular they require that the difference between any two consecutive Hessian approximations (i.e., Hk−Hk+1) is positive semidefinite. This is actually in contradiction to FISTA's requirement that the prox parameter is never increased. Such a condition is very restrictive as is impractical. In this paper we briefly show how results in [2] and [23] can be extended under some (also possibly strong) assumptions of the Hessian approximations, to give a simple and natural convergence rate analysis. We then present an alternative analysis, which only requires the Hessian approximations to 14 have bounded eigenvalues. Moreover, the bounds on the subproblem optimization error in our analysis are looser than those in [23]. Finally, we use the complexity analysis of randomized coordinate descent in [31] to provide a simple and efficient stopping criterion for the subproblems and thus derive the total complexity of proximal (quasi-)Newton methods based on randomized coordinate descent to solve Lasso subproblems. 1.4 A brief outline of the thesis The remainder of this thesis is organized as follows. In Chapter 2 (see [11, 12], and [13]; joint work with Siemens Research) various applications of machine learning models are discussed, particularly for sparse logistic regression. Three specialized algorithm for optimizing large sparse logistic regression problem are analyzed and compared, and a hybrid classification system that incorporates those methods is proposed for inducing hierarchical structure in classification results. In the last section the application of support vector machine and non-negative matrix factorization in time series forecasting is also described. In Chapter 3 (joint work with Yahoo Labs, submitted to AAAI) we discuss a particular application of structured learning in large-scale online-ad targeting. We start by introducing the optimal decoding scheme and the hierarchical operator. The proposed learning algorithm and its complexity are discussed in Section 3.4. Other relevant hierarchical methods and their connections to the hierarchical operator are discussed in Section 3.5 and examples are given at the end of the section to illustrate the difference between those methods. Finally we present numerical experiments. In Chapter 4 (submitted to Mathematical Programming Series A) we discuss the theoretical properties of the above introduced algorithmic framework in terms of global convergence rates. We start with the algorithmic framework. Then, we present some of the assumptions and discussions and convergence rate analysis based on [2] and [23]. We show the new convergence rate analysis using [24–26] for exact and 15 inexact version of our framework. We then extend the analysis for the cases where inexact solution to a subproblem is random and in particular to the randomized coordinate descent. In Chapter 5 (a preliminary version of which appearing in NIPS workshop, submitted to Mathematical Programming Series C) we present a fast, communicationand memory-efficient algorithm that implements the global-convergent framework proposed in Chapter 4. We begin with outlining the algorithm and defining basic terminologies. We then discuss: 1) compact LBFGS and its low-rank structure, paying particular attention to the mutable matrix type; 2) active-set strategies and the convergence guarantee; and 3) different termination conditions for the inner loop and their implications on the global convergence performance. Finally, we present numerical studies of different active-set methods and termination conditions on algorithm running time. 16 Chapter 2 Machine Learning Applications of Convex Optimization 2.1 Introduction In the optimization literature the second order methods, such as the Interior Point Methods (IPM), are well known for their ability to solve QP problems to high accuracy. However, in applications like classifying web content and text documents in a hierarchical structure where large amounts of data exist at higher levels of the hierarchy, the use of IPMs may not be the most appropriate. This is because IPMs require the solution of a system of linear equations (i.e., the Newton system) at every iteration and, as a result, their performance can quickly deteriorate as the size of that system becomes very large. The recent burst of research at the interface of optimization and machine learning has produced a large family of algorithms that require only the first-order (gradient) information and possibly work on a small subset of the training data in each iteration. Hence, the per-iteration work of these algorithms is small, making them suitable for training on large-scale data. In this chapter we demonstrate the benefit of combining the first order method 17 and the second order method in a hybrid approach for large scale hierarchical classification. The sparse logistic regression model is first of all introduced and adapted to the optimization context. We then discuss and compare the complexity difference between two Coordinate Descent approaches [9, 32] and a variant of Interior-Point Method (IPM) [10], all of which are designed specifically for sparse logistic regression. The practical performance of the three algorithms are later compared in a hybrid hierarchical classification problem, which combines them in a top-down approach in order to achieve a good balance between efficiency and accuracy. We finally discuss the use of support vector machine and non-negative matrix factorization in time series forecasting problem. The efficient first order methods for solving the above problems are briefly described in the end. 2.2 Sparse Logistic Regression The objective function of sparse logistic regression is given by min w F (w) = λ||w||1 + 1 N N∑ n=1 log(1 + exp(−yn * wTxn)) (2.2.1) where the first component is a regularization term to select relevant features and enforce sparsity in the solution or classifier w, and the second term 1 N ∑N n=1 log(1 + exp(−yn *wTxn)) is the average logistic loss function. For each training pair {xn, yn}, the logistic loss function is expected to return a positive value (note that the return of the exponent equation inside the log function is always larger than 1), as a measure of whether the predicting label given by wTxn corresponds with the "true" label yn assigned manually before training. Particularly, a different sign between wTxn and yn will force the value given by exp(−yn * wTxn) to quickly go much larger than 1, resulting in a large loss as a whole; on the other hand, a similar sign, indicating a good prediction, will lead to a small loss by driving exp(−yn * wTxn) below 1 all the way close to zero. By summing the logistic loss for each sample across the whole training set and dividing it by the cardinality of that set, we obtain the average 18 logistic loss defined on that training set, and by minimizing the average loss, we are able to obtain the classifier that guarantees to return the smallest error on the training set. A regularization term λ||w||1 is also added to the objective function such that it controls the complexity of the classifier by selecting the most relevant features based on the current training set, which contains important information on how the features relate to each other in predicting the right labels. Gradient and Hessian To compute the gradient and the Hessian of logistic loss, without loss of generality, we reformulate the problem (5.7.1) as the following by squeezing λ and 1/N into one parameter C: min w F (w) = ||w||1 + C N∑ n=1 log(1 + exp(−yn * wTxn)) (2.2.2) Recall our loss function L is given by L(w) = C ∑N k=1 l(w; xk, yk) = C ∑N k=1 log(1+ exp(−yk *wTxk)) where N is the size of our sample points. The partial derivative of L(w) with respect to wj can be thus calculated by ∂L ∂wj = C N∑ k=1 −1 1 + eykwTxk yk(xk)j = CG TXj (2.2.3) where B =  G1 ... GN  ∈ RN , Gk = −yk1 + eykwTxk (2.2.4) and Xj is the jth column of X as in X = [ X1 . . . Xp ] =  xT1 ... xTN  ∈ RN×p (2.2.5) 19 Hence, the gradient of L(w) can be written as ∇TL(w) = CGTX. Similarly, from (2.2.3) we can obtain the second partial derivatives of L(w) ∂L ∂wj∂wi =C N∑ k=1 e−ykw Txk (1 + e−ykwTxk)2 y2k(xk)j(xk)i (2.2.6) =C N∑ k=1 e−ykw Txk (1 + e−ykwTxk)2 (xk)j(xk)i (2.2.7) We can then obtain Hessian by B = CXTDX where D ∈ RN×N is a diagonal matrix with Dkk = e−ykw T xk (1+e−ykw T xk )2 . So to compute the full Hessian matrix B, even if we assume the matrix D is available to use after computing the gradient because the component e−ykw Txk we need for D is also required for the gradient, it still takes O(N2p)+O(Np2) flops to finish the computations. 2.2.1 Optimization Algorithms There are many algorithms tailored for solving sparse logistic regression. Here we are going to discuss three of them, two Coordinate Descent approaches [9, 32] and an Interior-Point Method (IPM) [10]. One of the main difference between those methods lies in the way they handle the second order information. Particularly, the first Coordinate Descent approach (CD1) we are proposing can be considered as a first order method using only gradient information; IPM is the pure second order method solving a Newton system at each step and the second Coordinate Descent approach (CD2) acts more like a quasi-Newton method where the per iteration complexity is cheap because the second order matrix, the actual Hessian in this case rather than the Hessian approximation, is handled efficiently and only matrix-vector multiplications are involved. Next we give an overview of each of the above mentioned algorithms. An Interior-Point Method To apply interior-point method to sparse logistic regression, we begin with transforming the standard unconstrained non-smooth formulation to one with linear smooth 20 constraints and smooth objective, as following min w,u L(w) + λ p∑ i=1 ui s.t. − ui ≤ wi ≤ ui, i = 1, ..., p (2.2.8) where L(w) = 1 N ∑N k=1 l(w;xk, yk) and λ is the average logistic loss function the regularization parameter. Note that at the optimality we must have ui = |wi|, in which case the objective in (2.2.8) becomes the same as standard sparse logistic regression objective (5.1.1). Hence, we say problem (2.2.8) is equivalent to the standard formulation. After adding slack variables si, si, we reformulate problem (2.2.8) into one with only equality constraints. min w,u L(w) + λ p∑ i=1 ui − μ p∑ i=1 log si − μ p∑ i=1 log si s.t. wi − ui + si = 0, − wi − ui + si = 0, i = 1, ..., p (2.2.9) where the logarithmic barrier is added in the objective to force slack variables to stay positive without explicitly adding bound constraints, and μ > 0 is a parameter defining the central path as μ→ 0. To solve problem (2.2.9) by interior-point method, we can iteratively apply Newton method to its optimality conditions for a sequence of barrier parameters such that the limit of the sequence goes to zero. Noticing that in the constraints we have si = ui − wi and si = wi + ui, we substitute these equalities into the objective, thus eliminating both the constraints and slack variables and obtaining the following unconstrained smooth convex problem min w,u φμ(w, u) = L(w) + λ p∑ i=1 ui − μ p∑ i=1 log(u2i − w2i ) whose optimality condition is simply its gradient equal to zero ∇φμ(w∗, u∗) = 0 (2.2.10) 21 In primal interior-point method, Newton method is used to solve (2.2.10) for a search direction and line search, often with backtracking, is used to determine the step length before we decrease μ and solve (2.2.10) again at the new iterate. That sequence of iterates we compute forms the central path, which eventually leads us to the optimality. The main cost of IPM lies in forming and solving the Newton system, which takes O(N2p) flops [10]. Two Coordinate Descent Methods Basic Idea Following the standard approach, we divide the objective in (5.1.1) into two parts min w F (w) = g(w) + L(w), (2.2.11) where g(w) = ‖w‖1 is convex but non-differentiable and the second component L(w) = C ∑N k=1 l(w;xk, yk) is both smooth and convex. With wk as the current iteration, the problem is to calculate the direction dk to progress to our next iteration wk+1 min dk F (wk + dk) = g(wk + dk) + L(wk + dk) (2.2.12) Note that F is non-smooth because g is non-differentiable at zeros, so we cannot directly use gradient information of F to calculate the direction. Instead, we consider the second-order approximation of L by Taylor expansion since L is twice differentiable: min dk q(dk) =L(wk) +∇TL(wk)dk+ (2.2.13) 1 2 dTkHkdk + g(wk + dk) To facilitate the discussion, we will ignore the constant for now and use Lk, d, qk(d), gk(d) instead of L(wk), dk, q(dk), g(wk + dk): min d qk(d) = ∇TLkd+ 1 2 dTHkd+ gk(d) (2.2.14) 22 Let us now consider the coordinate step on (2.2.14). We start by randomly choosing a coordinate, say j, to update. Let d = zej where z is a scalar, and ej is a vector with the same dimension of d. All coordinates of ej are zeros except coordinate j being 1. Substituting d = zej into (2.2.14), we obtain min zej qk(zej) = ∇TLkzej + 1 2 zeTj Hkzej + gk(zej) (2.2.15) since eTj Hkej = (Hk)jj,∇TLkej = (∇TLk)j, and gk(zej) = |(wk)j + z|, (2.2.15) is in fact a one-dimension quadratic problem: min z qk(zej) = (∇TLk)jz + 1 2 (Hk)jjz 2 + |(wk)j + z| (2.2.16) It is well known that (2.2.16) has a simple closed-form solution z =  (∇TLk)j+1 −(Hk)jj if (∇Lk)j + 1 ≤ (Hk)jj(wk)j, (∇TLk)j−1 −(Hk)jj if (∇Lk)j − 1 ≥ (Hk)jj(wk)j, −(wk)j otherwise. (2.2.17) Finally, we calculate step length α based on the Armijo rule [33] in the line search procedure: F (wk + αdk)− F (wk) (2.2.18) ≤σα((∇TLk)jz + |(wk)j + z| − |(wk)j|) The First Order Variant (CD1) To avoid calculating (Hk)jj in (2.2.16), we replace it with an upper bound β on the second derivative of the loss such that β ≥ (Hk)jj, ∀k and j (2.2.19) and (2.2.16) now becomes min z qk(zej) = (∇TLk)jz + 1 2 βz2 + |(wk)j + z|, (2.2.20) 23 which is an upper-bound function of qk(zej) because any z that leads to a decrease of q will inevitably follow by a decrease of q since 0 ≤ qk(0)− qk(zej) ≤ qk(0)− qk(zej). (2.2.21) Replacing the (Hk)jj in (2.2.17) by β, we obtain the closed-form solution of (2.2.20). Consider the complexity of a full coordinate descent. Updating one coordinate of wk requires only constant time, so the main cost is updating ∇TLk, which takes O(N) given that only one coordinate in wk is changed. Hence, the total complexity is O(Np) to update all p coordinates in wk. The Second Order Variant (CD2) Unlike the above methods, which update ∇L and H as soon as one coordinate is updated, CD2 compute the optimal solution of (2.2.14) before updating the objective function. That is, iteratively applying coordinate descent update to (2.2.14) until it satisfies the optimality conditions. Each coordinate update z is computed by solving the following minimization problem min z qk(ds + zej) (2.2.22) =[ 1 2 (Hkds)j + (∇TLk)j]z+ 1 2 (Hk)jjz 2 + |(wk)j + (ds)j + z| A closed-form solution of (2.2.22) can be obtained in a very similar fashion as (2.2.17). Following similar complexity analysis of CD1, we end up with the same O(Np) flops in order to apply a full coordinate descent in CD2. Note that to achieve that complexity, we need to take advantage of some careful arrangement and caching of the computations [32]. The computational complexity per iteration for each of the above three algorithms is summarized in Table 2.1. 24 IPM CD1 CD2 Flops O(pN2) O(pN) O(pN) Table 2.1: Per Iteration Complexity 2.3 HiClass An efficient Hierarchical Classification system Introduction Hierarchical classification is an extension to the multi-class classification by taking into account the hierarchical structure of the labels. This structure is ignored in the standard, or flat multi-classification since each class is treated equally in a flat fashion. However, in many practical applications like classifying web contents [34] and text documents [35, 36], such hierarchical structure of the labels is particularly prevalent and labels can often be more appropriately understood as sets of labels with closely related labels grouped in the same set [37]. Thus labels from the same set are more difficult to discriminate than those from different sets, and it can be beneficial to train a classifier specifically for a set of labels in order to capture the subtle distinction between them. Based on this intuition, hierarchical classification exploits the additional knowledge in the structure of the labels by applying a recursive top-down approach: first discriminate the subsets of labels at the top level of the hierarchy before going down the next level to discern the labels (or sets of labels) in those subsets. This process is repeated until reaching the bottom level. In this paper, we propose a highly flexible and efficient framework for hierarchical classification by combining the strengths of first-order and second-order optimization methods through the top-down approach. Motivation The primary aim of our proposed framework, HiClass, is to provide a flexible solution which is well balanced between efficiency and accuracy for hierarchical classification. We achieve this by adapting the use of second order information to the 25 number of data presented in different levels of the hierarchy. To train classifiers in hierarchical classification, we are faced with solving Quadratic Programming (QP) problems whose sizes depend on the number of training data. Nodes that exist at higher levels of the hierarchical tree usually contain many more data than the nodes at the lower levels of the tree. Using the same optimization method to solve the classification problems at different nodes of the tree would result in computational inefficiencies and inaccuracies. A typical scenario in top-down-based hierarchical classification is that although there is a large amount of data at the high levels of the hierarchy, the size of the training sets are considerably smaller at the low levels, since the classes become much more specific [34–37]. The above observations have motivated us to apply an efficient and scalable first-order optimization algorithm to train the high-level classifiers, and an accurate second-order algorithm for the low-level classification problems. The main aim of this hybrid approach is to alleviate the shortcomings of the two methods, i.e. lower accuracy of the first-order method and slower speed of the second-order method, by adaptively adjusting the algorithms to the training data sizes that are dynamically changing from node to node. We remark that although first-order methods often are not able to solve a problem as accurately as methods using second-order information (e.g. IPM), the large amount of training data available at top-level nodes help compensate the lower accuracy in the solution. Related Methods In many applications, such as text classifications and fault detection, class labels are arranged as nodes in a tree to represent a given hierarchical relation. The traditional way to approach this kind of problem is to simply ignore the hierarchical structure and treat it as a multi-class classification problem. This allows us to take advantage of existing techniques such as multi-class Support Vector Machine (SVM) [38]. Several variants of this formulation have recently been proposed to take into account the extra information from the hierarchy of classes to improve the loss function [35, 39], but two major drawbacks still remain – inconsistencies in 26 child-parent relations are difficult to prevent and inaccuracy in solving the training problem results in an inferior generalization quality. Accordingly, several methods including the top-down procedure [34, 36, 40], the bottom-up procedure [41] and the Bayesian network [42] have been proposed. In this paper, we focus on a novel hybrid approach for training different nodes in the hierarchy. We combine different optimization techniques in the hybrid approach in order to establish a good enough balance between efficiency and accuracy. We also decouple the problem into a series of classification problems defined for each class and train each one independently with no regard to the hierarchy to obtain the classifier that is responsible for predicting membership to that particular class by means of a binary or real-valued output. This provides us with an opportunity to tailor optimization techniques to each problem and solve them in parallel, thus avoiding directly optimizing a single problem that can be much more complex and difficult to even store in memory. From there we discuss the important use of second order information in our hybrid approach and explore ways to build it into our HiClass system. 2.3.1 The Hierarchical Classification Framework Let X ⊂ Rp be the instance domain and let Y = {1, ...,m} be the index set of categories. We are interested in categories arranged in a tree structure. Each category, or node, in that tree structure is denoted by a unique number i ∈ Y (0 for the root of the tree). Ȳ = Y ∪ {0} is the set of all the nodes in the tree. Every node i has a unique parent P(i), a set of children C(i) and siblings S(i). Let S+(i) = S(i) ∪ {i}. Also let L be the set of leaf nodes with no children: L = {i ∈ Y | C(i) = ∅} (2.3.1) A training set is given as {(xk, yk)}Nk=1 ∈ X × L. Note that every instance xk is labeled as yk from leaf node set L instead of Y . One implication for this is illustrated in Figure 2.1, where an instance xk is automatically attached with node 1 and then 27 the root node if it is categorized to node 4. Let a path denotes a set of labels with parent-child relations, and for every instance xk, let Path(k) be the path that includes yk. Then in this case Path(k) will be {4, 1, 0} (remember 0 is the root node). If we further assume that all the instances are categorized until it is found in one and only one leaf node, this leaf node alone is enough to represent the whole path an instance is attached with, which is why we assign every instance only one label yk. So in previous example, yk is 4, the index of leaf node is xk classified to. The challenge here is to learn a function f : X → Y that is efficient to train and easy to test. Assuming for each node i ∈ Y a vector wi ∈ Rp has been given, a top-down procedure is utilized to classify an instance from the root to one leaf node [34, 36, 41, 42]. Algorithm 1: Top-Down Procedure for Classification 1 Inputs: {C(i)}mi=1, {wi}mi=1, xk 2 Outputs: i 3 LET i← 0 4 while C(i) is not empty do 5 LET i = arg max j∈C(i) wTj xk What Algorithm 1 is trying to do is to assign one leaf node to every unknown instance. Given any node i an instance xk belongs to, Algorithm 1 classifies xk among S+(i), updating i and repeating the procedure until i is a leaf node. In this way we decouple the problem into a series of multi-class classification problems at each level. There are different methods to train multi-class classifiers, either one-versus-rest, one-versus-one or solving a single optimization problem [38]. In this paper, we have adopted the one-against-rest approach, where for each node i, a binary classification problem is solved independently to learn a classifier that is responsible for predicting the membership of an instance against S+(i). Let T (i) = {k | (xk, yk) where P(i) ⊂ Path(k)} be the training index set for the classification problem at node i. Further 28 Figure 2.1: The hierarchy tree of the labels. we define for each node i ∈ Ȳ and k ∈ T (i) y (i) k = 1 i ∈ Path(k)−1 otherwise then the training set for node i would be X (i) = {(xk, y(i)k ) | k ∈ T (i)} (2.3.2) The procedures are schematized in Algorithm 2. Algorithm 2: Independent Classifiers Training 1 Inputs: Ȳ , {(xk, yk)}Nk=1 2 Outputs: {wi}mi=1 3 for i = 0→ m do 4 Construct training set X (i) 5 Training 6 Save wi 2.3.2 A Novel Hybrid Approach As motivated in Section 5.1, we want to build a hierarchical classification system that (1) provides scalability and portability, (2) imposes consistency in parent-child relations, (3) reduces inaccuracy in training classifier for each node, (4) adapts tailored 29 models and algorithms to each node based on different conditions and requirements. While the first two points are addressed in Section 2.3 by introducing the top-down classifying procedure and independent training approach for each node, the last problem that is essential to the system's overall speed and accuracy is taken care of by our novel hybrid approach presented in this section. As mentioned in Section 2.3, in most hierarchical classification problems there are a large amount of data at the high levels of the class hierarchy, and the sizes the training sets are considerably smaller at the low levels as the classes become much more specific. The difference in problem sizes at nodes of different levels presents us with different challenges and emphases in solving the underlying optimization problem. In particular, for those nodes with large amount of data, a small generalization error of the model is inherently easy to obtain since more information is presented to the model. But the resulting optimization problem becomes very large and thus brings about computational difficulties. On the other hand, for those nodes with scarcity of data, it is relatively easy to solve the optimization problem but a more accurate solution is required to compensate for the lack of training information. Regarding that, we present our novel hybrid approach, which combines the coordinate descent methods and the IPM algorithm in the hierarchical classification framework. Ideally we can use one algorithm from top to bottom if there exists such an algorithm that outperform others both in speed and accuracy. In fact this is what people have been using in practice most of the time, and Interior Point Methods (IPM) with its fast convergence rate makes such a good candidate, demonstrating satisfying performance as long as the size of the problem is small. However, the need to solve a Newton system markedly limits the scalability of IPM-based algorithm since the time and memory it takes to solve the Newton system at every iteration increases substantially as the problem size becomes large. Hence, we only use an IPM-based method as the problem becomes small enough, i.e. when we reach the bottom level of the hierarchy. For the top level of the hierarchy where the problem size is often much larger than that of bottom-level problems, in order to train the classifiers efficiently, 30 we consider the two variants of coordinate descent methods. Software Architecture The HiClass system is the combination and implementation of what we have discussed in previous sections, including the hierarchical classification framework and our novel hybrid approach with the implementation of IPM by [10], the first order coordinate descent by [9] called SCD and the second order variant by [32] called GLMNET, to solve the hierarchical classification problems accurately and efficiently. The strength of our HiClass system is that it is designed to be highly modular and flexible. The core of the system consists of four basic modules: 1. Hierarchy Tree The tree module implements the framework described in Section 2.3 and forms the foundation of the system. 2. Data Reader The reader module reads the data from text files of different formats (e.g. LIBSVM or UCI) and stores it in matrix and vector data structures in the memory. 3. QP Writer The writer module transforms the raw data into the appropriate matrices and vectors and writes them in the format required by different solvers as the input file. 4. Solver Engine The solver engine provides implementation for the algorithms described above. Figure 2.2 shows the general architecture of the software system. Given the training data sets and the hierarchical structure of the labels, HiClass problem read transfers those data in memory to HiTree data, which then distribute to every node a different training set (as defined in (2.3.2)) that are necessary for the solvers to compute the classifier of that node. HiClass node write interfaces with the data and the solvers that consists of GLMNET used in the top (first) level of the hierarchy along with IPM for the rest. Note that the process is easy to parallelize since every classifier for each node is trained independently, as illustrated in the figure. 31 Figure 2.2: HiClass system flow chart 2.3.3 Experiments In this section, we first discuss the measure of performance we use in every node of hierarchy and we test the performance of HiClass on real-world data set. The Evaluation Metrics To evaluate the performance of the HiClass system, we use the micro-average F1 measure [43]. On a single category the Fβ is given by Fβ = (β2 + 1)A (β2 + 1)A+B + β2C , (2.3.3) where A is the number of documents a system correctly assigns to the category (true positives), B is the number of documents a system incorrectly assigns to the category (false positives), and C is the number of documents that belong to the category but which the system does not assign to the category (false negatives). Let β = 1.0, then 32 we have the harmonic mean of recall and precision: F1 = 2A 2A+B + C = 2RP R + P , (2.3.4) where R is recall, i.e., A/(A+ C), and P is precision, i.e., A/(A+B). In our case, every category i will have its own Ai, Bi and Ci. There are two methods, called macro-average and micro-average to combine those statistics into one measure in order to evaluate the general performance on the hierarchical structure. If we have category from 1 to l, then macro-average F1 is just the unweighted mean of F-measure across all categories macroF1 = ∑l i=1(F1)i l (2.3.5) and micro-average F1 is given by microF1 = 2 ∑l i=1Ai∑l i=1 2Ai +Bi + Ci (2.3.6) In fact, (2.3.6) can be simplified. Since ∑l i=1 Ai + Ci = N where N is the number of all test instances. But we also have ∑l i=1 Ai +Bi = N . Therefore, l∑ i=1 2Ai +Bi + Ci = 2 l∑ i=1 (Ai + Ci) = 2N so (2.3.6) is in fact the ratio of true positives and total number of instances microF1 =∑l i=1 Ai N . The RCV1 Dataset The RCV1-v2/LYRL2004 is a text categorization test collection made available from Reuters Corpus Volume I (RCV1), an archive of over 800,000 manually categorized newswire stories [43]. Each RCV1-v2 document used in our experiment has been tokenized, stopworded, stemmed and represented by 47,236 features in a final form as ltc weighted vectors. Those RCV1 documents are categorized with respect to the 33 Topics, which were organized in four hierarchical groups: CCAT, ECAT, GCAT and MCAT. The RCV1-v2 documents are split chronologically into a training set and test set, with 23,149 training documents and 781,265 test documents. For our experiments, we focus on CCAT and MCAT two subtrees. The ECAT and GCAT subtrees are not considered because the former has a similar tree structure with CCAT and the latter display a structure that is flat rather than organized hierarchically. During preprocessing, we discard those that are not associated with any those leaf nodes as well as those that are categorized to multiple ones, requiring that each document used in the experiments is categorized to exactly one leaf node of either CCAT or MCAT. Next we discuss details. (a) CCAT hierarchical structure. Each node in the tree is represented here as a block in the table. For example, C17 is the child of ROOT and has four children named from C171 to C174. (b) The training data we use for classification at each node. The unbalanced data distribution problem we mentioned before is obvious here, with abundant amount of data for upper nodes yet much less at ones in the bottom. Figure 2.3: CCAT subtree Experiments on the CCAT Subtree We trim the CCAT topic group for our experiments and take out some "dumb" internal nodes, such as, those that have only 34 one child. Reserving the main structure of CCAT, we end up with topics organized in a tree illustrated in Figure 2.3(a). There are altogether 17 nodes including 13 leaf nodes, 3 internal nodes and 1 root. The number of training data for each topic is illustrated in Figure 2.3(b) where the color shifts from green to red with more green denoting larger numbers. The total number is 2062 and most of them amass in the first hierarchy, which causes the previously mentioned unbalanced data distribution problem. In the experiments, we run including our HYBRID method all four algorithms with the same stopping criteria and same regularization parameter. The accuracy (using micro-average F1 measure [43]) and cpu time are recorded for each node respectively and for the whole tree as well. The results are presented in Figure 2.4. The first thing to note from Figure 2.4 is that the more second order information we take into consideration, the more accuracy we eventually achieve as expected, which can be seen by comparing the accuracy obtained by GLMNET, a second-order algorithm, with that of SCD which completely ignores the Hessian information. It can be seen that the first-order method SCD actually takes much longer than GLMNET to solve the problem, e.g., 556s compared with 17s. Having examined the detailed printouts of each algorithm, we find that even though SCD enjoys cheap per iteration cost, it takes SCD a lot more iterations than GLMNET to be able to satisfy the same stopping criteria, mostly due to the small step size SCD is forced to take each iteration. And the step size of SCD is inevitably small mainly because the lack of second order information forces it to build a globally upper bound model locally, which is thus a very loose bound with the minimizer too close to the current iterate. Hence, SCD progresses slowly, eventually resulting in expensive total computational cost even though each step only takes constant time. On the other hand, GLMNET solves the problem very fast, e.g., nearly 35 times faster than SCD. The importance of curvature information can be further appreciated comparing the performance of GLMNET with that of SCD. Particularly, by incorporating second order information into the algorithm, GLMNET is able to achieve 35 Figure 2.4: Performance comparison at each node between above mentioned algorithms. For HYBRID, we use GLMNET in the upper level and IPM at bottom. The number in the leftmost block of each table denotes the corresponding performance measure on the whole tree. For example, as shown by the rightmost upper table, 70% aggregate accuracy is achieved using the novel HYBRID method. the fast convergence rate comparable to that of IPM; plus the cheap iteration cost similar to that of SCD, it is expected that GLMNET performs efficiently. However, GLMNET is not as accurate as IPM and although the inaccuracy can be somehow compensated by the abundance of training at the first hierarchy, the problem becomes more serious as it goes down the tree, e.g., almost 30% loss of accuracy in some cases. Finally, by combining different algorithms we obtain a HYBRID method that has better accuracy than GLMNET. It is expected that the time HYBRID takes is over 90% less than that spent by SCD, due to not only the use of GLMNET in the first hierarchy, but also the fact that IPM at the bottom performs efficiently with 36 limited amount of data as well. In fact, the saving can be much more when the data distribution is more severely unbalanced, in which case there are much more data in the first level and the number decreases dramatically at the bottom. As a result, the time IPM takes to solve the problem increases much faster than that of GLMNET, mainly due to IPM's per iteration complexity growing quadratically with the number of training points. The advantage of using HYBRID will become more obvious in those cases. Another thing to note is the accuracy HYBRID achieves, which is the highest of all. Again, as we discussed, the use of IPM at the bottom improves the accuracy of GLMNET and the use of GLMNET at the upper level returns a slightly better accuracy than IPM because the ability of GLMNET to efficiently process large scale of data. (a) MCAT hierarchical structure. (b) The training data we use for classification at each node. Figure 2.5: MCAT subtree Experiments on the MCAT Subtree The whole experiment setting is similar to that of CCAT, except that this time we have a different hierarchical structure. The MCAT subtree contains altogether 9 nodes including 7 leaf nodes, 2 internal nodes and 1 root node, as illustrated in Figure 2.5(a). The number of training documents labelled under each topics is shown in Figure 2.5(b), from which we can see that over 37 75% of the data are accumulating on M11 and M12. We therefore expect that IPM will have a hard time at the first level where the data matrix has more than 2000 rows and 47326 columns. In fact, it does take IPM too long to solve the problem such that we can't include the results in Figure 2.6. But we are still able to use IPM in the HYBRID method because the problem size at the bottom is at most 300, which is comfortably within the range of IPM. Figure 2.6: Performance comparison at each node between above mentioned algorithms. From Figure 2.6 we find that again, HYBRID outperforms other algorithm in terms of accuracy, while taking up only around one tenth of the time spent by SCD. GLMNET is still the fastest algorithm of the three, but as we discussed, lacks the high accuracy attained by most second order methods and using IPM in nodes M141, M142 and M143 helps improve the classification accuracy. 38 2.4 Distributed Learning for Short-term Forecasting Short-term load forecasting (STLF), normally with a prediction time-horizon from 24 hours up to one-week ahead, has always been an important issue in efficient power system operation and decision making. STLF are not only needed for operation decisions in power system, such as scheduling of unit commitment and energy allocation, but also the key data for the electricity pricing as well. While an underestimation of the energy demand may result in inevitably high operational costs paid to procure energy in the market, an overestimation on the other hand wastes scarce resources. Hence, accurate and reliable load forecasts are vital for the energy transactions in competitive electricity markets. However, with the emergence of deregulation and free competition of modern energy markets [44, 45], the electric load is increasingly becoming difficult to forecast, thus giving rise to the need for more sophisticated forecasting tools with higher accuracy and reliability. There are a wide variety of STLF methods proposed during the last decades, including linear or multiple regression model [46], autoregressive integrated moving average (ARIMA) [47], exponential smoothing method [48], feed-forward neural networks [49], etc. Generally speaking, all of the above mentioned methods can be categorized into two groups, the classical time series and regression methods including linear regression, ARIMA, exponential smoothing, etc., and the artificial intelligence methods containing for example artificial neural networks. While most of the classical methods are often relatively easy to implement and interpret [50], they are solely based on linear analysis, thus lacking the ability to capture the nonlinearity between the load series and the exogenous influencing factors, i.e., temperature, wind speed, etc. As a reality example illustrating the nonlinearity present in load series, we take the dataset of the North American electric utility[51, 52] and visualize them in Figure ??. And it has been shown by many works in literature, e.g.,[45, 52–55], etc., that the forecasting performance greatly relies on how effectively the prediction algorithm 39 captures this nonlinear nature of the time series, which is one of the main reasons why recently the so-called artificial intelligence methods have received more and more attention in solving the problem of loads forecasting. In particular, Support Vector Machine (SVM) or Support Vector Regression (SVR) [56], as a novel and powerful machine learning technique established on a theory of the structure risk minimization principle, has been known for its high generalization ability as well as resistant to the over-fitting problem, and several attempts have been made in recent literature to apply SVM to the field of load forecasting, e.g.,[45, 52, 53], etc. As the first well-known success of such attempts, authors in [53] embed different load-affecting factors into the feature space of SVR and from these features they formulate the predicted max load. They successfully detect in the beginning of their experiments a periodic component within the data set due to both the seasonal variation of consumer electricity demand, e.g., high demand in the winter while low in the summer, and the so-called "holiday" effect, e.g., demand variation between weekend and weekdays. Although they take advantage of those properties subsequently in their modeling, their method is rather preliminary – manually single out winter data, for example, to forecast loads in winter. Still, they report promising experiment results based on this idea of choosing training data by comparing the similarity between the forecasting points, which is further explored by later forecasting works. Notably, Authors in [45] propose a twostage hybrid method combining an unsupervised network called the self-organizing map (SOM)[57] in the first stage, partitioning the non-stationary data into several subsets of stationary time-series data, with SVRs in the second stage, each defined on one of such subsets to take advantage of the similar dynamic properties. And authors in [52] introduce the locally weighted support vector regression (LWSVR), which, rather than taking in the entire data history, focuses the training on only the neighboring points around the current query point and endows a weight factor to each of the load datum. While most of those works are able to report in their experiments improvement in forecasting accuracy, several problems remain to be addressed, among which the biggest concern is the speed of the algorithm. Applying SVM to 40 time series forecast often requires solving a constrained quadratic problem in large scale. While this alone is computationally prohibitive, combining it with the SOM in the hybrid method [45] will only make the computational complexity even worse. Similarly, for LWSVR in [52], the computation of the neighborhood before training SVM requires a traverse through the whole training set. Since that has to be done for each forecasting time point, the complexity of this preprocessing alone will be related to the product between the size of the training set and that of the forecasting set, which is impractical in most reality uses. 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Hour L o a d (a) Yearly View. One year of data is singled out by green. 4200 4400 4600 4800 5000 5200 5400 5600 5800 6000 6200 1000 1500 2000 2500 3000 Hour L o a d (b) Weekly View. Three weeks of data is singled out by green. 4640 4660 4680 4700 4720 4740 4760 1000 1500 2000 2500 3000 Hour L o a d (c) Daily View. 24 hours of data is singled out by green. Figure 2.7: Illustration of Load Series Periodicity. In this section, we develop a time series forecasting algorithm that is specifically optimized for such large-scale forecasting problems, while at the same time incorporating this similarity-based modeling idea to achieve high accuracy in nonlinear and non-stationary time series prediction. In particular, we propose a two-phase forecasting framework that includes the training phase where forecasting models are distributively constructed, and the forecasting phase where the final forecasts are 41 computed jointly from the outputs of all the models combined. Since the construction of forecasting models is the most computationally intensive part of the algorithm, the encapsulation of it into a separate phase makes it possible to reuse the models for different forecasting tasks. Another novelty of our algorithm is that the whole framework can be easily parallelized thanks to its distributed nature – we decouple the whole forecasting problem into several smaller separable ones at beginning of the algorithm and construct a forecasting model for each of those decoupled problems. Since each model is self-contained and trained on problem much smaller in size than the original one, the construction process will be much more efficient and can be easily parallelized, which significantly accelerate the whole training process. Finally, in the forecasting phase, the forecasts from all the models are taken into consideration simultaneously to form the final forecasting output, which we cast into a multi-class classification problem defined for each forecasting point such that the algorithm adaptively and intelligently determines the way to form the final forecast from outputs of each model. 2.4.1 Problem Setting Let us suppose that we have an univariate time series z(t) for t = 1, 2, ..., l, where z(t) is the electric load measurement at time step t. Our goal here is to predict the value of z(t) at the next time step t = l+ 1. Let fi denotes the ith forecasting model for i = 1, ..., k. Let Xi ⊂ Rd be the instance domain and let Yi be the set of labels. Si = Xi×Yi is the training data for the forecasting model fi. That is, for each point (xt, yt) ∈ Si, we have yt = fi(xt) (2.4.1) Here a standard approach to determine (xt, yt) is to follow Takens embedding theorem[58], which states that given any time series z(t), under certain conditions for almost all t and for some finite d,m, there is a smooth map f : Rd → R such that z(t) = f(z(t−m), z(t− 2m), ..., z(t− dm)), ∀t > dm (2.4.2) 42 where the value of d is called the embedding dimension and the smallest value for which (2.4.2) is true is called the minimum embedding dimension. Hence, according to the above theorem, the xt, yt pair is given by yt = z(t) and xt = [z(t −m), z(t − 2m), ..., z(t− dm)]′. Note that d,m are the parameters specified by users and considered as constants in the algorithm. So once the time series z(t), t = 1, ..., l is given and index t is specified, the xt, yt pair is uniquely determined. Hence, for the sake of clarity, we redefine the training set Si of the forecasting model fi as a index set such that Si ⊂ N = {1, ..., l}, and for any t ∈ Si we have yt = fi(xt), where yt = z(t) and xt = [z(t−m), z(t− 2m), ..., z(t− dm)]′. The training phase and forecasting phase framework are described in Algorithm 3 and Algorithm 4, respectively. There are three key steps in the two frameworks, which are Step 2 data distribution and Step 4 model training in Algorithm 3, and Step 3 probability estimation in Algorithm 4. And we will discuss each of them in details next. Algorithm 3: Distributed Training Framework 1 Inputs: time series z(t), t ∈ N and k. 2 Determine the data distribution S1,S2, ...,Sk such that S1 ∪ S2 ∪ ... ∪ Sk = N . 3 for each i ∈ {1, ..., k} do 4 Train forecasting model fi on Si. 5 Return: forecasting models fi for i = 1, ..., k. Algorithm 4: Joint Forecasting Framework 1 Inputs: forecasting models fi for i = 1, ..., k and query point xq 2 for i ∈ {1, ..., k} do 3 Estimate probability pi 4 Compute ŷi = fi(xq). 5 Return: forecasting value yq = ∑k i=1 piŷi. 43 2.4.2 Description of the Algorithm k-Means in F Space The k-means is a clustering method based on minimization of the total distance between all inputs from their respective cluster centers while at the same time maximizing the between-cluster-object dissimilarity. Here we extend the method into the high-dimensional F space so that it can be used to cluster data mapped to different affine segments in F . The difficulty is that in most cases F space is only implicitly defined via kernels, because given any xi in current space it is computationally infeasible to write out explicitly the expression of its counterpart Φ(x) in F , due to both polynomial features of higher order and higher dimensionality. However, as we shall see later, the k-means algorithm can be modified so that it only depends on dot products between xi. Hence, it suffices to know k(xi, xj) := Φ(xi) TΦ(xj) rather than Φ explicitly. Firstly, let us suppose we are given a set of points from original (input) d-dimensional space, {xi}Ni=1 ∈ Rd. The problem is to partition the N points' F -space counterparts {Φ(xi)}Ni=1 ∈ F into k sets (k ≤ N) X = {X1,X2, ...,Xk} such that the within-cluster sum of squares is minimized: μ∗ := arg min μ k∑ i=1 ∑ Φ(x)∈Xi ||Φ(x)− Φ(μi)||2 (2.4.3) where μ∗ ∈ Rd×k with μ∗i being the ith column in matrix μ∗; Φ(μi) ∈ F is the centroid of Xi, and can be obtained by Φ(μi) = ∑ Φ(x)∈Xi Φ(x) |Xi| (2.4.4) Note that Φ(μi) can be considered as the "best" representative of the cluster Xi since Φ(μi) = arg min z ∑ Φ(x)∈Xi ||Φ(x)− z||2 (2.4.5) Although we do not have an explicit expression of the centroid Φ(μi), we can still compute the Euclidean distance between any point Φ(x) and the centroid Φ(μi) through 44 the kernel if Φ(μi) is given by (2.4.4): ||Φ(x)− Φ(μi)||2 = (Φ(x)− Φ(μi))T (Φ(x)− Φ(μi)) = Φ(x)TΦ(x)− 2Φ(x)TΦ(μi) + Φ(μi)TΦ(μi) = k(x, x)− 2 ∑ x∈Xi k(x, x) |Xi| + ∑ x(1)∈Xi ∑ x(2)∈Xi k(x (1), x(2)) |Xi|2 (2.4.6) or if μi is known: ||Φ(x)− Φ(μi)||2 = Φ(x)TΦ(x)− 2Φ(x)TΦ(μi) + Φ(μi)TΦ(μi) = k(x, x)− 2k(x, μi) + k(μi, x) (2.4.7) The procedures of implementing k-Means in F Space are described in Algorithm 5. The structure of the algorithm is very simple, consisting mainly a while-loop to examine the termination conditions and two types of for-loops to compute the clusters X , depending on which formula to use for calculating the distance between the points and the centroids. Let us now analyze the per iteration cost of Algorithm 5. The main computational expense here is the for-loop to update the cluster Xi for each i ∈ {1, ..., k}. That is, for each one of the N training points x, we compute and compare its distance to the k centroids by (2.4.6), choosing the nearest centroids, say the ith one, and putting x in the corresponding cluster Xi. So it is k times the per evaluation cost of (2.4.6) for each training point x. Suppose that we have pre-evaluated and stored the kernel matrix K ∈ RN×N with kij = k(xi, xj). Then computing the first term in (2.4.6) takes O(k) flops and the second term takes O( ∑ i |Xi|) = O(N) flops. For N points together, it takes O(N2) flops. Note that this estimate only takes into account the first two terms in (2.4.6). We consider the last term separately because it is fixed for each cluster Xi, so in each iteration it is computed once, which takes O(kN2) flops for k clusters, and cached. Altogether, the per iteration complexity of Algorithm 5 is O(kN2). 45 Algorithm 5: k-Means in F Space 1 Set t← 0. 2 Initialize the k centroids of clusters: μ (t) 1 , μ (t) 2 , ..., μ (t) k . 3 for i ∈ {1, ..., k} do 4 Compute Xi := {j | j ∈ {1, ..., N} and i is equal to i∗ with i∗ := arg minii∈{1,...,k} ||Φ(μ(t)ii )− Φ(xj)||2} using (2.4.7). 5 while optimality test returns false do 6 for i ∈ {1, ..., k} do 7 Update Xi := {j | j ∈ {1, ..., N} and i is equal to i∗ with i∗ := arg minii∈{1,...,k} ||Φ(μ(t)ii )− Φ(xj)||2} using (2.4.6). 8 Set t← t+ 1. 9 Return X . An Accelerated Kernel k-Means Algorithm Here we propose an algorithm to efficiently solve the kernel kMeans problem (2.4.3) in large scale. While the standard iterative solution to k-Means (Algorithm 5) suffers from poor scalability and is easily trapped in local minima due to the greedy nature of the update, our proposed algorithm, built on the recent advances at the interface of machine learning and optimization research, is not only designed to improve the clustering accuracy, but also specifically tuned to large-scale clustering problem. A key step in achieving such superior performance lies in the equivalent transformation of the optimization of problem (2.4.3) into a Nonnegative Matrix Factorization (NMF) problem, which can be seen, from an optimization point of view, as a nonlinear programming with a nonconvex objective and bound constraints: min G≥0 ||K −GGT ||2 (2.4.8) where the constraints are component-wise and the matrix norm ||G||2 = ∑ i,j G 2 ij is the Frobeneus norm. To see the two problems (2.4.3) and (2.4.8) are actually 46 equivalent, let us first define the matrix G ∈ RN×k such that Gij = 1, if xi ∈ Xj,0, otherwise. (2.4.9) If we consider strict k-Means where every point xi is assigned to exactly one cluster, then G is a cluster membership indicator matrix with each row containing none but one non-zero component, which can be enforced by the following two constraints G ≥ 0 and GTG = I (2.4.10) Using G, we can write down the objective of k-Means (2.4.3) alternatively as min G≥0,GTG=I k∑ i=1 N∑ j=1 Gji||xj − μi||2 = k∑ i=1 N∑ j=1 Gji||xj − ∑N j=1Gjixj∑N j=1 Gji ||2 (2.4.11) which has a compact form matrix representation min G≥0,GTG=I tr(XTX)− tr(GTXTXG) = − tr(GTKG) (2.4.12) where K = XTX is the linear kernel matrix. In nonlinear kernel mapping, kij is simply given by the inner product of Φ(xi) and Φ(xj). Since multiplying by a positive number or adding a constant to the objective does not alter the optimal solution, (2.4.12) can be further transformed into the following problem min G≥0,GTG=I ||K||2 − 2 tr(GTKG) + ||GTG||2 = ||K −GGT ||2 (2.4.13) Relaxing the orthogonality GTG = I turns it into the NMF problem (2.4.8) presented in the beginning, and it has been shown that without the constraint the solution to the problem (2.4.8) still retains G orthogonality approximately[59]. The real challenge, however, lies in how to solve (2.4.8) for the cluster membership indicator matrix G because any such algorithm need to at least address three difficulties: 1.) retain sparsity in G for clustering interpretability. 2.) deal with the non-convexity and non-linearity in the objective to avoid trapped in local minima. 47 3.) design for large-scale clustering problems. There are mainly three novelties in our proposed algorithm to deal with those difficulties respectively. First of all, we transform the nonconvex problem (2.4.8) into a convex one by both adding the equality constraint X = G and modifying the objective function to ||K −XGT ||2, so the problem now becomes bi-affine min G≥0,X=G ||K −XGT ||2 (2.4.14) To deal with the equality constraint, we introduce the augmented Lagrangian function: L(X,G, V, μ) = ||K −XGT ||2 − tr(V T (X −G)) + μ 2 ||X −G||2 (2.4.15) where V ∈ RN×k is the matrix of untrstricted Lagrange multipliers for the equality constraint X = G and the parameter μ ∈ R is positive. Note that the augmented Lagrangian function can be seen as both the extension of the quadratic penalty function by the addition of Lagrangian multipliers to reduce the possibility of ill conditioning, and the extension of the standard Lagrangian by the addition of the penalty term involving μ to measure the infeasibility of the problem (2.4.14). The motivation for using the augmented Lagrangian algorithm is that, for an appropriate fixed choice (V ∗, μ∗), an optimal solution (X∗, G∗) can be obtained by solving the following problem: min G≥0,X L(X,G, V ∗, μ∗) (2.4.16) which is a bi-convex minimization problem, i.e., convex in X for each G and convex in G for each X. This motivates the alternating direction method which splits the nonconvex problem into the following two convex subproblems by alternatingly updating X and G: Xn+1 := arg min X L(X,Gn, V n) (2.4.17) Gn+1 := arg min G≥0 L(Xn+1, G, V n) (2.4.18) 48 and we update the Lagrange multipliers by V n+1 := V n−μ(Xn+1−Gn+1). Minimizing with respect to X (2.4.17) is easy with a closed-form solution obtained by setting the gradient of L with respect to X to zero ∇XL(X,Gn, V n) = −2KGn + 2Xn+1(Gn)TGn − V n + μ(Xn+1 −Gn) = 0 (2.4.19) ⇒ Xn+1 = (2KGn + V n + μGn)(2(Gn)TGn + μI)−1 (2.4.20) The complexity, however, is mainly in obtaining Gn+1 by (2.4.18), a boundconstrained nonlinear problem. A standard approach is using bound-constrained trust region method to compute the direction each iteration by solving a quadratic model of LG. But this would require manipulating the Hessian matrix that has as many as O(N2k2) entries, the storage of which alone is going to be computationally prohibitive in some cases. So here we consider an accelerated gradient projection descent method (Algorithm 6), which takes advantage of a Nesterov-type extrapolation technique (step 6) to accelerate the minimization of subproblem to reach the theoretically best convergence rate of first-order method. Due to both the acceleration and the Hessian-free framework, the Algorithm 6 is designed to be scalable and can be called efficiently as a subroutine. Another important property of the Algorithm 6 is that it induces sparsity in G rather aggressively thanks to the projected gradient step 7, where the projection P is defined as follows Pij(X, a) = xij, if xij > a,a, otherwise. (2.4.21) Let us now analyze the optimality condition of problem (2.4.14). We first define the standard Lagrangian function L(X,G, Veq, Vieq) = ||K −XGT ||2 − tr(V Teq (X −G))− tr(V TieqG) (2.4.22) where Veq ∈ RN×k and Vieq ∈ RN×k are the Lagrange multipliers for equality constraints X = G and inequality constraints G ≥ 0 respectively. Then the first-order 49 Algorithm 6: Accelerated Gradient Projection Descent Method for Subproblem (2.4.18) 1 INPUT: X, V, LG and G 2 INITIALIZATION: Set i← 0, Gi−1 ← G,Gi ← G, ti−1 ← 1 3 while optimality test returns false do 4 ti ← (1 + √ 1 + 4t2i−1)/2 5 ωi ← (ti−1 − 1)/ti 6 Zi ← Gi + ωi(Gi −Gi−1) 7 Gi+1 ← P (Zi −∇GL(X,Zi, V )/LG, 0) 8 i← i+ 1 9 RETURN Gi. optimality condition is given by ∇X∗L(X∗, G∗, V ∗eq, V ∗ieq) = 0 (2.4.23) ∇G∗L(X∗, G∗, V ∗eq, V ∗ieq) = 0 (2.4.24) G∗ ≥ 0 (2.4.25) V ∗ieq ≥ 0 (2.4.26) V ∗ieq *G∗ = 0 (2.4.27) which yields the optimality test ∇XL = −2KG+ 2XGTG− Veq ≤ ε (2.4.28) we use to terminate and return the solution G in Algorithm 7, the final algorithmic frame for solving the kernel kMeans problem (2.4.8) . Customized Primal SVR via Kernels In this section we describe the algorithm to train linear regression in high-dimensional F space for each affine cluster Xi, where F space is defined implicitly through the use of kernels. In particular, we discuss our customizations in standard Support Vector Regression (SVR) in order to learn a forecasting curve that gives a tight 50 Algorithm 7: Accelerated Alternating Projection Method for Kernel kMeans (2.4.8) 1 INPUT: kernel matrix K and number of clusters k 2 INITIALIZATION: Set the iteration counter n← 1, penalty parameter μ← 10 and the variables Gn, Xn, V n ∈ RN×k, Gn ← 0, Xn ← 0, V n ← 1, 3 while optimality test (2.4.28) returns false do 4 Compute Xn+1 = (2KGn + V n + μGn)(2(Gn)TGn + μI)−1. 5 Compute Gn+1 by Algorithm 6. 6 Compute V n+1 = V n − μ(Xn+1 −Gn+1) 7 n← n+ 1 8 RETURN Gn. upper bound prediction of the future load value. To solve the resulted, often largescale, nonlinear optimization problem, we deploy online learning algorithms to ensure efficiency and scalability, which however requires working directly with the primal SVR. While kernelized SVR is often considered in its dual format, we extend the online learning algorithm to kernelized primal SVR. To facilitate the discussion, we will suppose we have only one affine cluster X containing a training set {(xi, yi)}Ni=1 ⊂ X × R. Our goal here is to learn a function f as defined in (2.4.1) by minimizing the empirical loss L(w) on {(xi, yi)}Ni=1 plus an additional regularization term ||w||2 to avoid overly complex hypotheses: min w F (w) = L(w) + λ 2 ||w||2 (2.4.29) There are variety of loss functions L(w) to choose from for regression, e.g., squared loss, Huber's robust loss, Vapnik's ε-insensitive loss, etc. Particularly, Vapnik's εinsensitive loss function is given by L(w) = 1 N ∑N i=1 max(0, |yi − f(xi)| − ε), which establishes a ε tube around the data such that no points but those outside of the tube will suffer a loss. The placement of the tube is iteratively optimized in a way that maximizes the number of points inside of the tube. Normally, this ε tube is always placed in symmetric with respect to the regression function f , which allows training 51 data to deviate from f in all direction. But in certain applications like load forecasting, we would rather like the forecasting function to give a slightly overestimated prediction than a underestimated one in order to prevent a potential power outage. To that end, we modify the ε-insensitive loss function such that L(w) = 1 N N∑ i=1 max(0, |yi − f(xi)| − ε+ sgn(yi − f(xi))η), 0 < η < ε (2.4.30) which shifts the ε tube upward by the amount of η that is the so-called shifting parameter chosen by the users. To make it more clearly the purpose of η, we can formulate the above equation alternatively in the following way L(w) = 1 N N∑ i=1 li(w), with (2.4.31) li(w) = max(0, |yi − f(xi)| − (ε− η)), if yi − f(xi) ≥ 0max(0, |yi − f(xi)| − (ε+ η)), if yi − f(xi) < 0 (2.4.32) Note that when the prediction f(xi) is below the actual value yi we suffer a loss as soon as the deviation grows larger than ε− η, whereas in the case where f(xi) is over yi, we allow the deviation to be as large as ε + η before we begin to penalize. In this way, we basically give the algorithm more space to choose f over the actual load curve, while reducing the freedom of f going under the curve. Now let us consider the algorithm to solve (2.4.29) with the shifted loss (2.4.30). It has always been a challenge to solve such problem efficiently in large scale because of both the nonlinearity and non-smoothness present in the objective function. Here we consider the projected stochastic gradient descent methods, which enjoys inexpensive per iteration cost while still managing a superior rate of convergence among other first-order methods. We extend it, for the purpose of this paper, to SVR in kernel space. The algorithm mainly contains two steps, an update step and a projection step. In the update step, a stochastic gradient descent is performed such that wt+ 1 2 = wt − αt∂FAt(wt) (2.4.33) 52 where αt = 1 λt is the learning rate and −∂FAt(wt) = 1|At| ∑ i∈At Φ(xi) sgn(yi−f(xi))− λwt is the negative gradient of the sub-sampled objective function which gives a descent direction in the objective; At is the set of points for which wt suffers a non-zero loss. In the projection step, wt+ 1 2 is projected onto the set B = {w : ||w|| ≤ 1/ √ λ} such that wt+1 = min(1, 1/ √ λ ||wt+ 1 2 || )wt+ 1 2 (2.4.34) which, as analysis shows, allows the usage of a very aggressive decrease in the learning rate and yields an improved rate of convergence[60]. As simple as the algorithm seems to be, the extension to kernelized SVR is not quite straightforward. The main difficulty here is that w lies in the high dimensional space rather than the input space (recall (2.4.1)), and directly manipulating w in that space is usually intractable. We circumvent that difficulty by expressing w with dual variables a of (2.4.29) as following wt = N∑ i=1 a (t) i Φ(xi) (2.4.35) Substituting it into (2.4.1), we obtain the formula expressing f in terms of kernel k() f(x) = N∑ i=1 a (t) i k(xi, x) + b (2.4.36) Substituting (2.4.35) into (2.4.33), we obtain N∑ i=1 a (t+ 1 2 ) i Φ(xi) = ∑ i∈At [(1− αtλ)a(t)i + αt |At| sgn(yi − f(xi)]Φ(xi) + ∑ i∈Āt (1− αtλ)a(t)i Φ(xi) (2.4.37) which yields the rules to update the dual variables a (t+ 1 2 ) i ← (1− αtλ)a (t) i + αt |At| sgn(yi − f(xi), for i ∈ At (2.4.38) a (t+ 1 2 ) i ← (1− αtλ)a (t) i , for i ∈ Āt (2.4.39) 53 where f(xi) can be computed by (2.4.36) and Āt is the complementary set of At with respect to the index set N = {1, 2, ..., N}. Similarly, we can perform the projection step by updating a instead of w. Using ||wt||2 = || ∑N i=1 a (t) i Φ(xi)||2 =∑ i,j∈N a (t) i a (t) j k(xi, xj), we obtain a (t+1) i ← a (t+ 1 2 ) i min(1, 1/ √ λ√∑ i,j∈N a (t+ 1 2 ) i a (t+ 1 2 ) j k(xi, xj) ), for i ∈ N (2.4.40) Algorithm 8: online learning for kernelized primal SVR 1 INPUT: T,N , λ, s ≤ |N |, {(xi, yi)}i∈N 2 Set t← 0. 3 Initialize the dual variables a(t) to zeros. 4 for t = 1,2,...,T do 5 Compute At such that At := {i : li(w) 6= 0 and |At| = s}. 6 Update the dual variables by (2.4.38) and (2.4.39). 7 Project the dual variables by (2.4.40). 8 Return a(t). Joint Forecasting via Kernel SVM If only one regression function f is trained to model the data distribution, then the forecasted time series value f(x) is uniquely defined given any query point x. But it becomes a little complicated when there are multiple regression functions fXi learned for each cluster Xi, i = 1, ..., k, because now we have multiple forecasted values fX1(x), ..., fXk(x) all at the same point x and we need to know which one (or ones) to use. Mathematically, this problem can be formulated as one finding a unit-norm probability vector p(x) = [p1(x), ..., pk(x)] T with |p(x)| = 1, pi(x) ≥ 0 ∀i = 1, ..., k (2.4.41) such that the linear combination of those forecasted values ∑k i=1 pi(x)fXi(x) predicts better than any one of those value alone does. Here the critical step is to determine the probability vector p(x), where each component pi(x) is the indicator of the 54 prediction accuracy of fXi(x). In the extreme case, for example, pi(x) = 1 implies that fXi(x) predicts 100 percent correct, and pi(x) = 0 indicates that fXi(x) is irrelevant here and better not be included in the forecasting output. To determine the probability vector p(x), let us consider the kernel space where embedded time series data distribute in different clusters Xi, i = 1, ..., k and for each of such cluster Xi an affine regression function fXi is learned to model the data distribution in that particular cluster. Each affine function fXi is expected to forecast accurately given the data in the corresponding cluster i, and the accuracy is expected to decay as the data point moves away from that cluster. This motivates us to measure the closeness of the query point x to each cluster and use them as a surrogate of the probability vector p(x). The problem then becomes how to determine the membership of the query point x among clusters Xi, i = 1, ..., k, which can be considered as a standard multi-class classification problem. 55 Chapter 3 Hierarchical Multi-label Learning 3.1 Introduction Multi-label learning considers a class of supervised learning problems where each sample is associated with multiple labels. Hierarchical Multi-label Learning (HML), in particular, considers scenarios where labels are connected by a hierarchical structure, and consequently this structure has to be imposed on the predicted labels. Modeling and exploiting label structure has been the main focus of recent research on multi-label learning. Most work along this line is focused on label space reduction by using, for example, random projections, principal component analysis and canonical correlation analysis [61–67]. The overall framework connecting these approaches can be described as follows: • Encoding. Given a label vector y ∈ Y ⊂ RL, the codeword z is obtained from a linear projection P ∈ Rm×L: z = Py (3.1.1) An m-dimensional vector z ∈ Z ⊂ Rm is used to encode y, which is of dimension L (m L). The key assumption here is that while the labels are correlated the vector y is often sparse, so that the dimension reduction can be used and later the full label vector can be recovered without loss of information. 56 • Training. Given the encoding matrix P and the training setM = {xi, Pyi}ni=1 where xi ∈ X ⊂ Rd ∀i ∈ {1, ..., n}, a prediction model H : X → Z is trained to predict the codeword z, i.e., a linear predictor H(*) = Wx where W ∈ RL×d. • Prediction. An unknown data point x+ is observed and the prediction model H is used to obtain codeword z+ ← H(x+) • Decoding. The unknown sparse label vector y+ is recovered from the predicted codeword z+ by solving an optimization problem. Hierarchical multi-label structure appears in many real-world recommendation and targeting systems. For example, consider the following problem: in display-ads targeting each user is associated with a set of labels such as sports, football, vehicles, luxury cars, etc., and each label contains ads campaign that this user is likely to interact with. It is often the case that the labels are not independent of each other, and that their interdependence can be characterized by a connected (tree) structure, e.g., football is the child of sports. The task then is to assign each user with a set of labels that both appeal to the user and satisfy the underlying taxonomy. Overlooking such structure not only undermines the predictive performance (loss of information), but also makes the results hard to interpret, e.g., predicting that a user is interested in football but not in sports. There are two key goals of HML: to exploit the hierarchy structure of the labels during training and to preserve the same structure during prediction. The encoding step discussed above is the common way to exploit the structure of labels. However, there is no particular exploitation of the hierarchical structure, just the sparsity of y. Moreover, most methods in this field do not impose the structure when predicting the label vector. Some methods [34–37, 68] produce the structured label vector, by employing a top-down approach; i.e., they predict the part of the label vector that correspond to the upper levels of the taxonomy first, without considering the 57 lower levels. This clearly is a suboptimal approach because the optimization is not performed over the entire vector y. A recent method proposed in [69] considers a different setting where it is mandatory for each prediction to end in a leaf node (no partial path is allowed). The method finds the hierarchical labels by maximizing the posterior probability of a given point x over the whole label space, which is formulated as a combinatorial optimization problem. By exploiting the hierarchy, the total complexity of the method is O(h2L2l ) where h is the height of the tree and Ll is the number of leaf nodes. In this paper we propose a novel two-stage HML method with O(L) complexity in decoding optimal hierarchical labels. In particular, it exploits the hierarchical label structure during the training and also imposed this structure explicitly on the label during the prediction, by solving a convex optimization problem based on structured sparse learning. Most of the existing structure learning methods address the structure of the predictors (such as sparse classifiers), but not the labels. Hence our first contribution in this paper is to propose an optimal decoding scheme that efficiently induces both sparsity and the hierarchical structure on predicted labels. Given a predictor W , computed during training, and a input vector x, the product Wx is typically an unstructured dense vector. The true label vector, however, is not only sparse (with the non zeros in y denoting positive labels) but has to satisfy the hierarchical structure (if a parent label is negative, then so are the labels for its children). In prior approaches a unstructured sparse vector has been recovered given Wx, for example, by using a sparsity inducting `1 norm. Here we propose the use of the so-called hierarchical sparsity-inducing norm, parameterized by weights for each subtree in the hierarchy. Hence our method finds the optimal sparse vector that respects both the taxonomy and the unstructured prediction. The hierarchical norm that we use is a convex function and is a variant of group-lasso penalty [14]. It allows us to employ an efficient convex optimization algorithm that enjoys linear complexity in the label dimension [70]. We note that while the hierarchical norm has been explored before (see, e.g., [70–72]), this is the first time, to the best of our knowledge, that it has 58 been used in the context of inducing hierarchical structure among the output of the HML problems. In prior work on using hierarchical norm it has been frequently observed that the outcome is strongly dependent on the choice of weights corresponding to the subtrees. Several heuristic approaches to choosing such weights have been, thus, proposed. As we will see in Section 3.3, the weight parameters in hierarchical norm are not scaleinvariant to input data, which make it difficult in our case to choose the correct values for the weights independently of the input data and the classifiers. This leads to our second contribution. First, we describe an ideal learning model that jointly trains the multi-label classifier and the hierarchical weights. This model, is a bi-level optimization problem with a non smooth convex lower-level problem, which make it difficult to solve efficiently in large scale setting. We then propose a two-stage training scheme that decouples the joint learning problem into a sequence of two optimization problems, that each can be efficiently solved. In particular, we base the first-level training on the multi-label Empirical Risk Minimization (ERM) framework [15] where various convex loss functions can be employed, i.e., least squares, logistic, or hinge loss. The output of the first problem is then used as the input of the second problem, where the hierarchical weights are learned while taking into account of both the classifiers and the encoded hierarchical information present in the training data. While the second problem still is a bi-level optimization problem, its dimension is now significantly reduced to the order of the number of labels that is well under 1000 in this work. In this paper we approach the bi-level problem as a black-box optimization problem, which we solve using an efficient derivative-free optimization methods [16]. The details can be found in Section 3.4. Finally we demonstrate the performance of HML on both 1) publicly available document classification dataset and 2) a proprietary large scale ad targeting dataset constructed from 2 million users' daily activities in a taxonomy tree with 1500 labels. 59 3.2 Hierarchical sparsity-inducing norms The group-Lasso penalty, which is known to be useful in inducing structured output, is an extension of Lasso that promotes sparsity in the group level as opposed to `1-norm penalty, which promotes sparsity among individual variable. In its simplest form, the group Lasso partitions the variables into non-overlapping groups with all members of the group either being set to zero or not simultaneously. Recent studies [70–72] have used more complicated and structured groups to induce complex patterns such as tree hierarchy The idea behind the so-called hierarchical norm is based on the observation that a variable is zero⇒ its descendants variables are zeros This gives us a natural grouping of variables which aims to impose the given hierarchical structure: any given variable and all its descendants form a group. To formalize this intuition, we introduce notations on general group-Lasso penalty and hierarchical structure. Mathematically speaking, a group-Lasso penalty equipped with the Euclidean norm || * || can be described as follows: ΩG(x) = ∑ g∈G φg||xg|| where g ∈ G forms a subset of the index set of all the coordinates in x and G is a collection of such index sets. If, for example, x is a vector in space Rn, then g is a subset of {1, ..., n} and xg is a vector of dimension |g|. In this work particularly, we are interested in the case when the set G and the groups g ∈ G are associated with a given tree structure, e.g., each group g consists of nodes that form a rooted subtree of the tree. In particular, let T denote a tree that consists of a set of nodes indexed by {1, ..., L}. For each node t ∈ T , let C(t) denote the set of children of t, A(t) the set of ancestors of t (including t), and D(t) the set of descendants of t (including t). Then, we define the set G associated with such hierarchical structure T by GT := {D(t) : ∀ t ∈ T } (3.2.1) 60 0 14 59 6 78 3 2 (a) The tree structure is illustrated with blue rectangles as nodes and straight lines as edges. GT = {gt}9t=1, where gt = D(t) ∀t = 1, ..., 9, e.g., g1 = {1, 2, 3}, g4 = {4, 5, 6, 7, 8, 9}. 0 14 59 6 78 3 2 (b) Each green rectangle and the nodes it includes represents a group and the variables it contains. Setting group g1 and g6 to zeros gives us a set of non-zero nodes that also satisfy the tree definition.. Figure 3.1: The tree structure is illustrated with blue rectangles as nodes and straight lines as edges. Groups g are green translucent rectangles covering nodes that are contained in the group. and the hierarchical norm by ΩGT (x) = ∑ g∈GT φg||xg||, φg > 0 ∀g ∈ GT We use an example in Figure 3.1 to illustrate how hierarchical norm ΩT (*) works. The tree we use has 9 nodes plus one root node numbered 0. The set GT contains 9 groups and each group g includes a rooted subtree or a singleton. In Figure 3.1(b) ΩT (*) results in variables in groups g1 and g6 to equal zero, and the remaining non-zero variables form a tree hierarchy. 61 3.3 Hierarchical Multi-label Model The model we propose for hierarchical multi-label learning problem is built on the empirical risk minimization (ERM) framework, and make use of the hierarchical operator : PT (u, φ) := arg min v 1 2 ||u− v||22 + ∑ g∈GT φg||vg|| (3.3.1) where φ is a vector with each entry φg being the weight of group g in PT (*, *). It then follows from the descriptions of hierarchical norms in Section 3.2 that the non-zero pattern in PT (*, *) always satisfies the underlying hierarchical structure specified by T for any given input pair {u, φ}, although such pattern of PT (*, *) is not necessarily always the same as that of u even if the pattern of u satisfies the same hierarchical structure. In fact, the following lemma follows directly from Theorem 2 presented below: Lemma 1. We have sv ⊆ su where sv = {i : PT (u, φ)(i) 6= 0} and su = {i : u(i) 6= 0}. Remark 1. The zero pattern in PT (u, φ) is created from the pattern of a given vector u by setting some groups g ∈ G to zero, depending on the norm of ug and φg. If φg ≥ ||u||, for example, where g is the group at the root which contains all variables, then all variables in g are set to zero, and PT (u, φ) is zero. Hence choosing correct values for {φg}g∈G, taking the hierarchy and the vector u into account, is critical to recovering the structures in PT (u, φ). We now present the model for the HML problem. Ideally, we want to learn the parameters φ and W simultaneously so that W is chosen according to the penalty ΩGT (among other things) and φ are chosen according to the scale of Wx. Specifically, given PT (u, φ), any loss function `(*, *) and the training set M = {xi, yi}ni=1 where xi ∈ X ⊂ Rd, yi ∈ Y ⊂ RL ∀i ∈ {1, ..., n}, we consider the following `1-regularized problem: min φ,W ∑ (yi,xi)∈M 1 |M| `(yi, PT (Wxi, φ)) + λ||W ||1 (3.3.2) 62 where ||W ||1 := ∑L i=1 ∑d j=1 |Wij| is the element-wise `1 norm of the L × d matrix W . We use feature selection here because in most multi-label applications the data matrix X contains all features, that are relevant for the whole set of labels. it is, hence, useful to select the most relevant features for prediction of particular labels. After obtaining W and φ, given an observed unknown data point x+ we generate the predicted set of labels by the following hierarchical projection step: y+ = PT (Wx+, φ). (3.3.3) This step ensures that the output y+ complies with the underlying taxonomy structure. Note that optimizing jointly over W and φ in Problem (3.3.2) is difficult given the non-smoothness of hierarchical operator PT (*, *) and the often high dimensionality of W . Hence, instead, we propose to use the following decoupling scheme: • Solve for W ∗ without the hierarchical operator: W ∗ ← arg min W ∑ (yi,xi)∈M 1 |M| `(yi,Wxi) + λ||W ||1 (3.3.4) which is the standard ERM problem for each label, i.e., sparse logistic regression or Lasso depending on which loss function is involved. • Solve for φ∗ with W fixed to W ∗: φ∗ ← arg min φ ∑ (yi,xi)∈Mφ 1 |Mφ| `(yi, PT (W ∗xi, φ)) (3.3.5) Problem (3.3.5) minimizes the loss between the true label and the hierarchically projected point PT (*, φ), using the input values of W ∗, on the set Mφ, which can be chosen as a subset of the training set M. Problem (3.3.5) is still non-smooth, but by fixing the matrix W the dimension of the problem is greatly reduced to the number of groups in GT , which is equal to the number of labels L and is several hundreds in the examples used in our experiments. We consider Problem (3.3.5) as a black-box optimization problem, which allows us to 63 search for the optimal solution of Problem (3.3.5) without the use of the derivative information. In the next section we will see that the main effort of each step of this optimization lies in evaluating the hierarchical operator (3.3.1) that can be done efficiently by exploiting the hierarchical group structure. 3.4 Learning φ In this work we make use of the basic model-based trust region derivative-free method (DFO) [73] for learning hierarchical group weights φ. While the DFO method and the underlying convergence theory have been designed for smooth problems, it is known to perform very well in practice on non smooth problems such as Problem (3.3.5). In particular, it tends to significantly outperform to other existing approaches, i.e., direct-search methods which applies to no smooth problems (see, e.g., [74]). In fact, we observe that in our experiments DFO achieves a significant improvement of objective function value and the resulting weights φ along with W obtained by solving Problem (3.3.4) give a superior prediction performance compared with traditional multi-label classifiers. Next we give a brief introduction to the basic ideas behind DFO, as well as the efficient function evaluation of Problem (3.3.5). We refer the interested readers to the book by [16] for more details about various DFO approaches. 3.4.1 Derivative-Free Optimization DFO is based on the trust-region framework, which iteratively minimizes a local model to obtain the next step in a region around the current iterate, where the local model is believed to provide a good approximation to the objective function. The minimizer of the local model, thus, serves as the candidate for the next iterate, and depending on the quality of the minimizer, i.e., how much it decreases the objective function, the size of the region, where the local model is defined, is dynamically expanded or shrunk. The most commonly used local model is the Taylor expansion in the case when derivative is available. In the context of DFO, typically, an interpolation 64 polynomial is constructed from a maintained set of sample points. The model is deemed sufficiently accurate if its accuracy matches the order of that of a Taylor model. The main computational effort of DFO while minimizing Problem (3.3.5) is spent on objective function evaluations. Here, to construct a sufficiently accurate (firstorder) model, it takes O(L) function evaluations with L being the dimension of the objective function (number of labels in our case). After the first O(L) evaluations, DFO maintains the sample set of the O(L) most recently evaluated points and only updates one new point for each iteration while reusing most of the existing points. So after K iterations, the total number of function evaluations done by DFO is O(L+K). While the worst case complexity of DFO is not polynomial, in practice good solutions are often obtained in O(L) function evaluations. Next we discuss the complexity of each function evaluation in Problem (3.3.5). 3.4.2 Complexity of hierarchical operator It is well known that the dual problem of PT (*, φ) with fixed φ is the following: max ζ − 1 2 (||u− ∑ g∈G ζg||2 − ||u||2) s.t.∀g ∈ G ||ζg||∗ ≤ φg, ζ(j)g = 0 ∀j /∈ g (3.4.1) In the simplest case when G constitutes a partition of the index set, it is easy to see that the dual problem (3.4.1) is separable, and thus each dual variable ζg can be obtained independently of the others by projecting ug onto the dual norm ball B∗φg = {x : ||x||∗ ≤ φg}. In general, from strong duality, we have the following optimality conditions: PT (u, φ) = u− ∑ g∈G ζg (3.4.2) ∀g ∈ G, ζg = ProjB∗φg (PT (u, φ) + ζg) (3.4.3) 65 For general overlapping groups one can iteratively apply (3.4.3) to update dual variables {ζg}g∈G until the change in the update is below a given threshold. Then the primal solution from {ζg}g∈G can be recovered using (3.4.2). By exploiting the hierarchical structure in groups, however, a recursive one-pass algorithm to compute PT (u, φ) is available and is given in Theorem 2 which follows from result in [70]. Theorem 2. Let u ∈ RL, ΩT (*) be associated with `2-norm and  denote a total order relation such that gi  gj ⇒ {gi ⊆ gj or gi ∩ gj = ∅}. Then, initializing v ← u and in order  applying update vg ← max(1− φg/||vg||2, 0)vg (3.4.4) once and for all g ∈ GT returns v as PT (u, φ). The whole computation can be done in O(L) operations. Remark 2. The order  requires that updates (3.4.4) start from the bottom of the tree and propagate all the way to the root node. The `2-norm term appearing in the max(*) function can thus be accumulated as the updates are propagated upwards. This can be done recursively so that the sum of squares of the variables is maintained and updated by adding one term for each node, which is the key to the linear complexity (w.r.t. number of variables) of PT (u, φ). Remark 3. From Theorem 2 we conclude that the use of ΩGT (*) indeed imposes the tree structure on v. By definition (3.2.1) of g, the group-thresholding effect on g (vg ← 0) always applied to the whole subtree, including all groups h such that h ⊆ g. During the process (3.4.4) if a group is set to zero, then it will remain zero at the later steps, thanks to the nature of the update. Remark 4. The group-thresholding effect occurs when we have ||vg||2 ≤ φg. Hence a larger φg is more likely to set the variables in group g to zero. 66 Table 3.1: Datasets statistics used in experiments. Dataset itp and ads have the same exact dimension, but are constructed differently – itp targets users' general interest and includes users' activities from other category-related events besides ads view or ads click. datasets labels features training testing itp 380 12,787 1,775,671 761,002 ads 380 12,787 1,775,671 761,002 rcv 103 47,236 17,525 16,243 Table 3.2: The comparison of FLAT, HML, Leaf Reduce (LR) and Root Map (RM) on all three datasets, with squared and logistic loss function, and evaluated by precision, recall and f1 score. squared logistic data eval flat hml lr rm flat hml lr rm itp prec 0.2216 0.2969 0.3152 0.1603 0.2221 0.2958 0.3133 0.1545 rec 0.7343 0.4694 0.4212 0.7249 0.7383 0.4893 0.4492 0.7496 f1 0.2956 0.3331 0.3230 0.2351 0.2960 0.3354 0.3275 0.2356 ads prec 0.2065 0.2737 0.3054 0.1394 0.2079 0.2717 0.3092 0.1361 rec 0.7013 0.4432 0.3588 0.7576 0.7065 0.4843 0.3785 0.7980 f1 0.2732 0.3034 0.2926 0.2135 0.2744 0.3197 0.2942 0.2155 rcv prec 0.8737 0.7784 0.8816 0.8710 0.8664 0.8240 0.8744 0.8603 rec 0.6772 0.7651 0.6718 0.6821 0.7648 0.8193 0.7555 0.7733 f1 0.7382 0.7404 0.7357 0.7396 0.8046 0.8157 0.8015 0.8061 3.5 Other hierarchical methods Let us consider two simple hierarchical-inducing methods. They are denoted by Leaf Reduce (LR) and Root Map (RM), and presented in Algorithm 11 and Algorithm 12, respectively. To describe those two methods, we also define two recursive hierarchical operations ρ -treeMap and η -treeReduce in Algorithm 9 and Algorithm 10. We will see that LR and RM can be expressed by η -treeReduce and ρ -treeMap, respectively. 67 (a) f1 score (b) jaccard similarity score (c) hamming loss Figure 3.2: Testing and training performance under different metrics as learning objective on rcv dataset. Take (b) for example. HML equipped with jaccard similarity as the learning objective is applied on a training set of 3000 instances. Solutions after each iteration (the x-axis) of HML are recorded and later evaluated under all three metrics on the full rcv dataset (test set). The red curve denotes the objective values evaluated on the test set. In the same subplot along with the red curve are the objective values evaluated at different iterations on the training set. Other two subplots illustrate the same solutions evaluated by other two metrics on the same test set. While these are two conceptually simple operations, together they are able to produce sophisticated algorithms, as demonstrated in Algorithm 13. In fact, it can be shown that Algorithm 13 is essentially equivalent to the hierarchical operator (3.3.1). This also shed light on the underlying connections between those methods, that the hierarchical operator (3.3.1) is in fact a intricate combination of LR and RM, where the two operations ρ -treeMap and η -treeReduce are integrated under a single objective of inducing hierarchy. The resulting combination enables a richer representation and can thus be more potent than either of the individual compounds alone. In Algorithm 9 and 10 recall that for a node t its parent node is pt and its set of children is C(t). We use Γ to denote binary operators such as OR,AND,+, ∗, etc., and R to denote a general function. A simple example would be ρ -treeMap(0,+) that computes the depth for each node t, if we let ρp0 = −1 and initialize ρt = 1 for 68 Algorithm 9: ρ -treeMap(t,Γ) 1 ρt ← Γ(ρt, ρpt) 2 for c in C(t) do 3 ρ -treeMap(c,Γ) all t ∈ T . Note also that ρ -treeMap modifies global variables {ρt}t∈T in place and that η -treeReduce modifies both {ρt}t∈T and {ηt}t∈T . Algorithm 10: η -treeReduce(t,Γ1,Γ2, R) 1 for c in C(t) do 2 ηt ← Γ1(η -treeReduce(c,Γ1,Γ2, R), ηt) 3 ρt ← R(ηt, φt, ut) 4 return Γ2(ηt, ρt) LR is described in Algorithm 11. The main part is a call to η -treeReduce which calls itself recursively until a leaf node is reached. Then it returns 1 if ut ≥ φt and 0 otherwise. For all the non-leaf nodes, the return value is non-zero as long as any one of its children returns a non-zero value, or the inequality ut ≥ φt holds. RM is described in Algorithm 12. Contrary to LR, RM starts at the root of the hierarchy, and sets the whole subtree to zero if ρt = 0 (ut < φt), t the root of the subtree, due to the AND operator passed to ρ -treeMap. The procedure for evaluating hierarchical operator (3.3.1) is described in Algorithm 13 using ρ -treeMap and η -treeReduce. Note that the call to ρ -treeMap is exactly the same as that appearing in RM. The difference is in the values of ρ. In RM, they are set for each t individually based on ut and φt. In Algorithm 13, however, that is determined through a call to the η -treeReduce with a specific choice of the binary operators Γ1,Γ2 and function R. In particular, the value of ρt is returned by the function R and thus depends on the value of φt, the input parameter, and the value of ηt. And ηt is in turn determined by a recursive call to the η -treeReduce as shown in Algorithm 10. Note also that the function R returns 0 if φt > √ ηt, where 69 the value of ηt is roughly the squared norm of the subtree vector ugt rooted at node t. Algorithm 11: Leaf Reduce 1 INPUT: u, φ ∈ RL 2 η ← 0 3 R(ηt, φt, ut) := OR(I(ut ≥ φt), ηt) 4 η -treeReduce(0, OR,OR,R) // 0 the root 5 return v ← ρ Algorithm 12: Root Map 1 INPUT: u, φ ∈ RL 2 ρ0 ← 1 3 for t in T do 4 ρt ← I(ut ≥ φt) 5 for t in C(0) do // children of the root 6 ρ -treeMap(t, AND) 7 return v ← ρ Let us illustrate the difference between LR, RM and the hierarchical operator (Algorithm 13) using the following toy example. Consider a taxonomy of three nodes, indexed by 0, the root, and 1 and 2, its two children. Suppose all φ's are 0.5, i.e., φ0 = φ1 = φ2 = 0.5. Let us compare outputs of the three algorithms on this toy taxonomy given different values of input u. To start with, let us consider u = (0.3, 0.9, 0.8). Rounding each entry individually yields v = (0, 1, 1) which clearly does not respect the taxonomy. Applying RM on u we obtain v = (0, 0, 0), while both LR and Algorithm 13 produce v = (1, 1, 1). Both (0, 0, 0) and (1, 1, 1) respect the taxonomy but in this case (1, 1, 1) is probably the better prediction. In fact, the two children nodes being very close to 1, i.e., u1 = 0.9 and u2 = 0.8, is a strong indicator that the classifier at the root probably made a mistake by producing a low score, i.e., 0.3, in u. Moreover, we find that RM is not able to correct that error because it makes the decision to 70 set the whole taxonomy to 0 at the root without even looking at the values of its children. Now let us consider a different set of values for the input u = (0.3, 0.3, 0.8), where we reduce the value for node 1 from 0.9 to 0.3. Again, rounding with no regards to the taxonomy fails to produce taxonomy-compatible solution. LR, RM, and Algorithm 13 yield (1, 1, 1), (0, 0, 0) and (0, 0, 0), respectively. In this case, we think (0, 0, 0) is the better solution because both the root and one of its children, two nodes out of three in total, receive low scores in u. We find that this time RM produces the good prediction, but LR fails to do so, assigning all nodes 1 by only looking at the children, without taking into account the value at the root. Finally, we note that in both cases Algorithm 13 yields the better solution, precisely through the combination of ρ -treeMap and η -treeReduce which enables a global, and better, decision to be made. 3.6 Numerical Experiments In the first part of the experiments we compare HML, the Hierarchical Multi-Label model described in Section 3.3, with FLAT that treat each label as a separate binary classification problem. We also include the other two aforementioned hierarchical methods LR and RM in the comparison. The results are reported in Table 3.2. In the second part we demonstrate the learning effectiveness of HML by comparing the testing and the training performance for each iteration of HML. We also show that HML is able to directly optimize any objective such as f1, jaccard similarity, hamming loss, etc., even though the underlying optimization problem is both non-smooth and non-convex. The results are illustrated in Figure 3.2. We consider the following two applications (related datasets are described in Table 3.1): Online ads targeting: We created features using user activity logs collected at servers of a large internet company. The activities included visited web pages, issued search queries, as well as links and ads clicked during the period of three months. 71 Algorithm 13: Hierarchical Operator (3.3.1) 1 INPUT: u, φ,∈ RL 2 ρ0 ← 1 3 for t in T do 4 ηt ← u2t 5 R(ηt, φt, ut) := (1− φt/ √ ηt)+ 6 η -treeReduce(0,+, ∗, R) 7 for t in C(0) do 8 ρ -treeMap(t, AND) 9 return v ← ρ All activities were classified into a in-house hierarchical taxonomy consisting of 1,500 nodes. In the experiments we used two different data sets: 1) ads dataset, where we predict the probability that a user will click on an ad from a specific category; and 2) itp dataset, where we predict the probability that a user will interact with a specific category (either by clicking on the ad from that category, or clicking on a search result from that category, or any other category-related event). In order to generate labels, we selected 380 categories for which we had more than 1,000 positive examples, defined as a user who either clicked on a categorized ad in the ads dataset, or interacted with category in the itp dataset. We retained only active users (those who had positive labels in at least two paths of the taxonomy), which resulted in a data comprising around 2 million users. 70% of the data set was used to train 380 category-specific models, optimizing for both logistic and squared loss, and the remaining 30% of the data set was used for testing. Documents classification: Here we use the publicly-available text categorization test collection RCV1-v2/LYRL2004 made available from Reuters Corpus Volume I (RCV1) [43]. Each RCV1-v2 document used in our experiment has been tokenized, stopworded, stemmed and represented by 47,236 features in a final form as ltc weighted vectors. Those RCV1 documents are categorized with respect to the Topics, which were organized in four hierarchical groups. We go through the entire 72 dataset and filter out those documents with less than 5 labels. We end up with 33, 768 documents in total for training and testing. Let us look at the results presented in Table 3.2. In particular, HML is applied to maximize F1 score on the training set, hence if we only look at rows denoted by F1 score, then clearly HML outperforms FLAT and other methods in all experiments involving either squared or logistic loss functions. Note that the results are all evaluated on the full datasets. It demonstrates the generalization ability of HML, as illustrated in Figure 3.2. The messages become less clear-cut if we look at the other two rows denoted by precision and recall. In general we notice that HML and LR achieve a higher precision while FLAT and RM has a better recall. For the three hierarchical methods, LR and RM are more extreme in terms of precision and recall, and HML strike a balance between those two, achieving better performance which in this case is defined by a higher F1 score.Finally we note that the superiority of HML is more evident in datasets itp and ads, which possesses more complicated taxonomy structure among labels and also more number of labels than rcv does. Figure 3.2 includes two more metrics other than F1. The message is clear when we look at the plots with red curves. It shows that while the learning algorithm is only trained on the training set, the performance on the unseen testing set tracks closely what is observed on the smaller training set, no matter what objective is used. Optimizing one metric also improves, in general, the performance measured by other metrics. That improvement might not be monotonic as is the case of the optimized metric. While choosing which metric to optimize will depend on the particular application and other practical concerns, HML, as it shows, provides an effective learning algorithm for improving different metrics in hierarchical problems. 73 3.7 Conclusion In this paper we studied multi-label learning problems with hierarchical structure present on labels. We devise a novel hierarchical decoding scheme that efficiently imposes hierarchical structure on predictions, and we propose an ideal bi-level optimization problem for jointly learning the classifiers and the hierarchical weights that we find are highly influential on the predictive performance. We also propose an efficient decoupling scheme for learning the weights and the classifiers in a sequential two-stage process. We demonstrate the superiority of our model on large-scale real-world datasets. 74 Chapter 4 Complexity of Inexact Proximal Newton Methods 4.1 Introduction In this chapter, we are interested in the following convex optimization problem: min x∈Rn F (x) ≡ f(x) + g(x) (4.1.1) where f, g : Rn → R are both convex functions such that f is twice differentiable, with a bounded Hessian, and g(x) is such that the following problem is easy to solve (approximately) for any z ∈ Rn and some class of positive definite matrices H, relative to minimizing F (x): min x∈Rn { g(x) + 1 2 ‖x− z‖2H } . (4.1.2) Here ‖y‖2H denotes y>Hy (on occasion, in this chapter, we abuse the norm notation by using H matrix that is not positive definite). This chapter is organized as follows: in Section 4.2 we describe the algorithmic framework. Then, in Section 4.3 we present some of the assumptions and discussions and, in Section 4.3.1, convergence rate analysis based on [2] and [23]. In Section 4.4 75 we show the new convergence rate analysis using [24–26] for exact and inexact version of our framework. We then extend the analysis for the cases where inexact solution to a subproblem is random in Section 4.5 and in particular to the randomized coordinate descent in Section 4.5.1. 4.2 Basic algorithmic framework and theoretical analysis The following function is used throughout the paper as an approximation of the objective function F (x). Q(H, u, v) := f(v) + 〈∇f(v), u− v〉+ 1 2 〈u− v,H(u− v)〉+ g(u). (4.2.1) For a fixed point x, the function Q(H, x, x) serves as an approximation of F (x) around x. Matrix H controls the quality of this approximation. In particular, if f(x) is smooth and H = 1 μ I, then Q(H, x, x) is a sum of the prox-gradient approximation of f(x) at x and g(x). This particular form of H plays a key role in the design and analysis of proximal gradient methods (e.g., see [2]) and alternating direction augmented Lagrangian methods (e.g, see [8]). If H = ∇2f(x), then Q(H, x, x) is a second order approximation of F (x) [28, 75]. In this paper we assume that H is a positive definite matrix such that σI  H  MI for some positive constants M and σ. Minimizing the function Q(H, u, v) over u reduces to solving problem (4.1.2). We will use the following notation to denote the accurate and approximate solutions of (4.1.2). pH(v) := arg min u Q(H, u, v), (4.2.2) and pH,φ(v) is a vector such that : Q(H, pH,φ(v), v) ≤ Q(H, v, v) = F (v), and Q(H, pH,φ(v), v) ≤ Q(H, pH(v), v) + φ. (4.2.3) 76 The method that we consider in this paper computes iterates by (approximately) optimizing Q(H, u, v) with respect to u using some particular H which is chosen at each iteration. The basic algorithm is described in Algorithms 14 and 15. Algorithm 14: Proximal Quasi-Newton method 1 Choose 0 < ρ ≤ 1 and x0; 2 for k = 0, 1, 2, * * * do 3 Choose 0 < μk, φk > 0, Gk  0; 4 Find Hk = Gk + 1 2μk I and xk+1 := pHk(x k) 5 by applying Prox Parameter Update (μk, Gk, x k, ρ); Algorithm 15: Prox Parameter Update (μ, G, x, ρ) 1 Select 0 < β < 1 and set μ = μ; 2 for i = 1, 2, * * * do 3 Define H = G+ 1 2μ I and compute p(x) := pH(x); 4 If F (p(x))− F (x) ≤ ρ(Q(H, p(x), x)− F (x)), then output H and p(x), Exit ; 5 Else μ = βiμ; Algorithm 15 chooses Hessian approximations of the form Hk = 1 μk I + Gk. However, it is possible to consider any procedure of choosing positive definite Hk which ensures MI  Hk  σI and F (pHk(x)) − F (x) ≤ ρ(Q(Hk, pHk(x), x) − F (x)), for a given 0 < ρ ≤ 1, a step acceptance condition which is a relaxation of conditions used in [2] and [23]. An inexact version of Algorithm 14 is obtained by simply replacing pHk by pHk,φk in both Algorithms 14 and 15 for some sequence of φk values. 77 4.3 Basic results, assumptions and preliminary analysis ISTA [2] is a particular case of Algorithm 14 with Gk = 0, for all k, and ρ = 1. In this case, the value of μk is chosen so that the conditions of Lemma 4 hold with ε = 0. In other words, the reduction achieved in the objective function F (x) is at least the amount of reduction achieved in the model Q(μk, p(x k), xk). It is well known that as long as μk ≤ 1/L(f) (recall that L(f) is the Lipschitz constant of the gradient) then condition (4.3.7) holds with ρ = 1. Relaxing condition (4.3.7) by using ρ < 1 allows us to accept larger values of μk, which in turn implies larger steps taken by the algorithm. This basic idea is the cornerstone of step size selection in most nonlinear optimization algorithms. Instead of insisting on achieving "full" predicted reduction of the objective function (even when possible), a fraction of this reduction is usually sufficient. In our experiments small values of ρ provided much better performance than values close to 1. In the next three sections we present the analysis of convergence rate of Algorithm 14 under different scenarios. Recall that we assume that f(x) is convex and smooth, in other words ‖∇f(x)−∇f(y)‖ ≤ L(f)‖x−y‖ for all x and y in the domain of interest, while g(x) is simply convex. In Section 4.5.1 we assume that g(x) = λ‖x‖1. Note that we do not assume that f(x) is strongly convex or that it is twice continuously differentiable, because we do not rely on any accurate second order information in our framework. We only assume that the Hessian approximations are positive definite and bounded, but their accuracy can be arbitrary, as long as sufficient decrease condition holds. Hence we only achieve sublinear rate of convergence. To achieve higher local rates of convergence stronger assumptions on f(x) and on the Hessian approximations have to be made, see for instance, [27] and [28] for related local convergence analysis. First we present a helpful lemma which is a simple extension of Lemma 2 in [23] to the case of general positive definite Hessian estimate. This lemma establishes some simple properties of an φ-optimal solution to the proximal problem (4.1.2). It uses the 78 concept of the φ-subdifferential of a convex function a at x, ∂φa(x), which is defined as the set of vectors y such that a(x)− yTx ≤ a(t)− yT t+ φ for all t. Lemma 3. Given φ > 0, a p.d. matrix H and v ∈ Rn, let pφ(v) denote the φ-optimal solution to the proximal problem (4.1.2) in the sense that g(pφ(v)) + 1 2 ‖pφ(v)− z‖2H ≤ φ+ min x∈Rn { g(x) + 1 2 ‖x− z‖2H } , (4.3.1) where z = v −H−1∇f(v). Then there exists η such that 1 2 ‖η‖2H−1 ≤ φ and H(z − pφ(v))− η ∈ ∂φg(pφ(v)). (4.3.2) Proof. (4.3.1) indicates that pφ(v) is an φ-minimizer of the convex function a(x) := 1 2 ‖x − z‖2H + g(x). If we let a(x) = a1(x) + a2(x) with a1(x) = 12‖x − z‖ 2 H and a2(x) = g(x), then 0 ∈ ∂φa(pφ(v)) ⊆ ∂φa1(pφ(v)) + ∂φa2(pφ(v)). (4.3.3) For any y∂φa1(pφ(v)) we apply the definition of φ-subdifferential for point t = H −1y+z and we get 1 2 ‖y +H(z − pφ(v))‖2H−1 ≤ φ, from which it follows that y = η −H(z − pφ(v)) | 1 2 ‖η‖2H−1 ≤ φ, which together with (4.3.3) proves (4.3.2). We will find the following bound useful on the norm of η which follows from the above lemma. ‖ηi‖ ≤ √ 2λmax(H)φi, (4.3.4) where λmax is the largest eigenvalue of H or its upper bound. Below are the assumptions made in our analysis. 79 Assumptions 1. (i) The set of optimal solutions of (4.1.1), X∗, is nonempty and x∗ is any element of that set. (ii) The effective domain of F is defined as dom(F ) := {x ∈ Rn : F (x) < ∞}, and the level set of F at point x ∈ dom(F ) is defined by XF (x) := {y ∈ dom(F ) : F (y) ≤ F (x)}. Without loss of generality, we restrict our discussions below to the level set X0 := XF (x0) given by some x0 ∈ dom(F ), e.g., the initial iterate of the Algorithm 14. (iii) g is convex and Lipschitz continuous with constant Lg for all x, y ∈ X0: g(x)− g(y) ≤ Lg‖x− y‖, (iv) There exists positive constants M and σ such that for all k ≥ 0, at the k-th iteration of Algorithm 14: σI  σkI  Hk MkI MI (4.3.5) (v) There exists a positive constant DX0 such that for all iterates {xk} of Algorithm 14: sup x∗∈X∗ ‖xk − x∗‖ ≤ DX0 In the analysis of the proximal gradient methods Assumption 1(v) is removed by directly establishing a uniform bound on ‖xk − x∗‖ (rf. [23]). In the next section we show an outline of a proximal-gradient type analysis where this assumption is imposed for simplicity of the presentation. The proximal gradient approach in the 80 next subsection, however, requires another, stronger, assumption on Hk. In the alternative analysis that follows we impose Assumption 1(v) but relax the assumption on Hk matrices. Note that all iterates {xk} fall into the level set X0, due to the sufficient decrease condition (4.3.7) that demands a monotonic decrease on the objective values. Assumption 1(v) thus follows straightforwardly if the level set is bounded, which often holds in real-world problems or can be easily imposed. 4.3.1 Analysis via proximal gradient approach In this section we extend the analysis in [2] and [23] to our framework under additional assumptions on the Hessian approximations Hk. The following lemma, is a generalization of Lemma 2.3 in [2] and of a similar lemma in [23]. This lemma serves to provide a bound on the change in the objective function F (x). Lemma 4. Given ε, φ and H such that F (pφ(v)) ≤ Q(H, pφ(v), v) + ε (4.3.6) Q(H, pφ(v), v) ≤ min x∈Rn Q(H, x, v) + φ, where pφ(v) is the φ-approximate minimizer of Q(H, x, v), then for any η such that 1 2 ‖η‖2H−1 ≤ φ and for any u ∈ Rn, 2(F (u)− F (pφ(v))) ≥ ‖pφ(v)− u‖2H − ‖v − u‖2H − 2ε− 2φ− 2〈η, u− pφ(v)〉. Proof. The proof is an easy extension of that in [2]. Note that if φ = 0, that is the subproblems are solved accurately, then we have 2(F (u)− F (p(v))) ≥ ‖p(v)− u‖2H − ‖v − u‖2H − 2ε. If the sufficient decrease condition (F (xk+1)− F (xk)) ≤ ρ(Q(Hk, xk+1, xk)− F (xk)) (4.3.7) 81 is satisfied, then F (xk+1) ≤ Q(Hk, xk+1, xk)− (1− ρ) ( Q(Hk, x k+1, xk)− F (xk) ) ≤ Q(Hk, xk+1, xk)− 1− ρ ρ (F (xk+1)− F (xk)) and Lemma 4 holds at each iteration k of Algorithm 14 with εk = −1−ρρ (F (x k+1) − F (xk)). We now establish the sub linear convergence rate of Algorithm 14 under the following additional assumption. Assumption 2. Let {xk} be the sequence of iterates generated by Algorithm 14, then there exists a constant MH such that k−1∑ i=0 (‖xi+1 − x∗‖2Hi+1 Mi+1 − ‖xi+1 − x∗‖2Hi Mi ) ≤MH , ∀k, (4.3.8) where Mi is the upper bound on the largest eigenvalue of Hi as defined in Assumption 1(iv). The assumption above is not verifiable, hence it does not appear to be very useful. However, it is easy to see that a condition Hi+1 Mi+1  Hi Mi , for all i, easily implies (4.3.8) with MH = 0. This condition, in turn, is trivially satisfied if Hi+1 is a multiple of Hi for all i, as it is in the case of ISTA algorithm. It is also clear that Assumption 2 is a lot weaker than the condition that Hi+1 is a multiple of Hi for all i. For example, it also hold if Hi+1 is a multiple of Hi for all i except for a finite number of iterations. It also holds if Hi+1 converges to Hi (or to its multiple) sufficiently rapidly. It is also weaker than the assumption in [30] that Hi+1  Hi. Exploring different conditions on the Hessian approximations that ensure Assumption 2 is a subject of a separate study. Below we show how under this condition sub linear convergence rate is established. Theorem 5. Suppose that Assumptions 1 and 2 hold. Assume that all iterates {xk} 82 of inexact Algorithm 14 are generated with some φk ≥ 0 (cf. (4.2.3)), then F (xk)− F (x∗) ≤ 1 km ( 1 2 ‖x0 − x∗‖2H0/M0 + MH 2 + (1− ρ)(F (x0)− F (x∗)) ρσ + k−1∑ i=0 φi Mi +DX0 k−1∑ i=0 √ 2φi Mi ) , (4.3.9) where km = ∑k−1 i=0 M −1 i . Proof. Let us apply Lemma 4, sequentially, with u = x∗, pφ(v) = x i and subproblem minimization residual φi for i = 0, . . . , k − 1. Adding up resulting inequalities, we obtain k−1∑ i=0 F (xi+1)− F (x∗) Mi ≤ 1 2 k−1∑ i=0 ( ‖xi − x∗‖2Hi/Mi − ‖x i+1 − x∗‖2Hi/Mi ) + k−1∑ i=0 ( φi Mi + εi Mi + 〈ηi, x∗ − xi+1〉 Mi ) = 1 2 ( ‖x0 − x∗‖2H0 M0 + k−1∑ i=0 (‖xi+1 − x∗‖2Hi+1/Mi+1 − ‖x i+1 − x∗‖2Hi/Mi)− ‖x k − x∗‖2Hk/Mk ) (4.3.10) + k−1∑ i=0 εi Mi + k−1∑ i=0 φi Mi + k−1∑ i=0 〈ηi, x∗ − xi+1〉 Mi . From Assumptions 1 and 2 we have k−1∑ i=0 F (xi+1)− F (x∗) Mi ≤ 1 2 ‖x0 − x∗‖2H0/M0 + MH 2 + k−1∑ i=0 εi Mi + k−1∑ i=0 φi Mi +DX0 k−1∑ i=0 ‖ηi‖ Mi . From (4.3.4) and definition of Mi we know that ‖ηi‖ ≤ √ 2Miφi. Using the already established bound ∑k−1 i=0 εi ≤ (1−ρ)(F (x0)−F (x∗)) ρ we obtain k−1∑ i=0 F (xi+1)− F (x∗) Mi ≤ 1 2 ‖x0 − x∗‖2H0/M0 + MH 2 + (1− ρ)(F (x0)− F (x∗)) ρσ + k−1∑ i=0 φi Mi +DX0 k−1∑ i=0 √ 2φi Mi . 83 Hence, F (xk)− F (x∗) ≤ 1 km k−1∑ i=0 F (xi+1)− F (x∗) Mi ≤ 1 km ( 1 2 ‖x0 − x∗‖2H0/M0 + MH 2 + (1− ρ)(F (x0)− F (x∗)) ρσ + k−1∑ i=0 φi Mi +DX0 k−1∑ i=0 √ 2φi Mi ) . Let us consider the term 1 km = 1∑k−1 i=0 M −1 i . From earlier discussions, we can see that 1 km ≤ M k . Moreover, if Hk are diagonal matrices, then M = L(f) and, hence 1 km ≤ L(f) k , which established a bound similar to that of proximal gradient methods. The role of Mi is to show that if most of these values are much smaller than the global Lipschitz constant L(f), then the constant involved in the sub linear rate can be much smaller than that of the proximal gradient methods. This is a well known effect of using partial second order information and it is observed in our computational results. Lemma 6. Let v be the initial point and Q∗ := minu∈Rn Q(H, u, v). If vi is the random point generated by applying i randomized coordinate descent steps to a strongly convex function Q, then for some constant 0 < α < 1 (dependent only on n and σ the bound on the smallest eigenvalue of H) we have E[Q(H, vi, v)−Q∗] ≤ αi(Q(H, v, v)−Q∗). (4.3.11) Next, we establish a bound on the maximal possible reduction of the model function Q(*) objective function value, for any positive definite matrix H 0, and fixed point v ∈ Rn. Lemma 7. Assume that for any v ∈ Rn such that F (v) ≤ F (x0), all of the the subgradients of F at v are bounded in norm by a constant κ, i.e. ‖∇f(v)+γ‖ ≤ κ for all γ ∈ ∂g(v). Then the maximum function reduction for Q(*) is uniformly bounded from above by Q(H, v, v)−Q∗ ≤ R, with R = Mκ2 2σ2 , (4.3.12) 84 where M and σ are respectively the bounds on the largest and smallest eigenvalues of H and Q∗ := minu∈Rn Q(H, u, v). Proof. Let v∗ = arg minu∈Rn Q(H, u, v) and let γg(v ∗) be any subgradient of g(*) at v∗. From the first-order optimality conditions H(v∗ − v) +∇f(v) + γg = 0, (4.3.13) we can obtain an upper bound on ‖v∗ − v‖ ‖v∗ − v‖ = ‖H−1‖ * ‖∇f(v) + γg‖ ≤ κ/σ. (4.3.14) Now, we bound the reduction in the objective function in terms of ||v∗−v||. From the convexity of g, g(v)− g(v∗) ≤ 〈γg, v − v∗〉. (4.3.15) Multiplying (4.3.13) with v∗ − v, we obtain −〈∇f(v), v∗ − v〉 = ‖v∗ − v‖2H + 〈γg, v∗ − v〉. (4.3.16) From (4.3.16), (4.3.13), (4.3.15) and the definition of Q, we have Q(H, v, v)−Q∗ = g(v)− g(v∗)− 〈∇f(v), v∗ − v〉 − 1 2 ‖v∗ − v‖2H ≤ 〈γg, v − v∗〉+ ‖v∗ − v‖2H + 〈γg, v∗ − v〉 − 1 2 ‖v∗ − v‖2H = 1 2 ‖v∗ − v‖2H ≤ Mκ2 2σ2 , which concludes the proof of the lemma. It follows immediately from Lemma 7 that the subproblem optimization error φi is also bounded for all i, say by Φ, and that Φ ≤ R. Using this bound we now present the auxiliary result that derives the bounds on separate terms that appear on the right hand side of (4.5.2) and involve φi. 85 Lemma 8. Assume that φi are nonnegative bounded independent random variables whose value lies in an interval [0,Φ]. Assume also that E[φi] ≤ Rᾱi for all i, and some constants 0 < ᾱ < 1 and R > 0. Then the following inequalities hold E[ k∑ i=1 √ φi] ≤ √ Rᾱ 1− √ ᾱ , E[ k∑ i=1 φi] ≤ R 1− ᾱ (4.3.17) Proof. First we note that E[ √ φiφj] = E[ √ φi]E[ √ φj], (4.3.18) due to independence of φi's, and E[ √ φi] ≤ √ E[φi], (4.3.19) due to Jensen inequality and the fact that the square root function is concave and φi ≥ 0. Then, given (4.3.18) and (4.3.19), we derive (4.3.17) using a bound on E[φi]. E[ k∑ i=1 √ φi] = k∑ i=1 E[ √ φi] ≤ k∑ i=1 √ E[φi] ≤ k∑ i=1 √ Rᾱi/2 ≤ √ Rᾱ 1− √ ᾱ . And the second bound on E[ ∑k i=1 φi] can be established in a similar fashion. Theorem 9. Assume that for each iterate {xk} of Algorithm 17 the following iterate {xk+1} is generated by applying ak + b steps of Algorithm 16, then E[F (xk)]− F (x∗) ≤ ζ k , where x∗ is an optimal solution of (4.1.1) and ζ is a constant. Proof. Taking expectation on both sides of (4.5.2) in Theorem 15, we have E[F (xk)]− F (x∗) ≤ 1 k (σ 2 E[Bk] + σ 2 C + 2E[A2k] + √ CE[Ak] + E[ √ BkAk] ) . (4.3.20) Next, we show how to bound E[Bk], E[Ak], E[A 2 k] and E[ √ BkAk]. We first note a key observation, that, as a result of Lemma 19, Lemma 7 and Algorithm 16, the 86 expectation of subproblem optimization error φi can be upper bounded by a geometric progression such that E[φi] ≤ Rαai+b = Rᾱi, where 0 < α < 1 and R are specified respectively in Lemma 19 and Lemma 7 and ᾱ = αa and R = Rαb. It immediately follows that E[Bk] = 2(Mσ + 1) σ k−1∑ i=0 E[φi] ≤ 2(Mσ + 1) σ k−1∑ i=0 Rᾱ ≤ 2(Mσ + 1)R σ 1 1− ᾱ . Recalling Lemma 8, we then obtain E[Ak] = √ 2ME[ k−1∑ i=0 √ φi] ≤ √ 2MR 1− √ ᾱ E[A2k] = E[( k∑ i=1 √ 2Mφi) 2] = 2ME[( k−1∑ i=0 √ φi) 2] ≤ 4MR (1− √ ᾱ)2 E[ √ BkAk] = E[ √√√√2(Mσ + 1) σ k∑ i=1 φi k∑ i=1 √ 2Mφi] = √ 4M(Mσ + 1) σ E[ √√√√ k∑ i=1 φi k∑ i=1 √ φi] ≤ √ 4M(Mσ + 1) σ R√ 1− ᾱ(1− √ ᾱ) , and ζ is equal to ζ = R(Mσ + 1) 1− ᾱ + σ 2 C + 8MR (1− √ ᾱ)2 + √ C √ 2MR 1− √ ᾱ + √ 4M(Mσ + 1) σ R√ 1− ᾱ(1− √ ᾱ) . Remark 5. Note that the choice of the number of coordinates decent steps in each applications of RCD, in other words, the choice of constants a and b has a direct affect on the constants in the complexity bound. Clearly taking more RCD steps leads to fewer iterations of the main algorithm, as larger values of a and b lead to smaller value of ζ. On the other hand, each iterations becomes more expensive. We believe that our practical approach described in Section 5.7.3 strikes a good balance in this trade-off. 87 We conclude that under Assumption 2, Algorithm 14 converges at the rate of O(1/k) if ∑k−1 i=0 √ 2φi Mi is bounded for all k. This result in similar to those obtained in [23]. It is shown how randomized block coordinate descent and other methods can be utilized to optimize subproblems minuQ(Hi, u, x i) so that √ 2φi Mi decays sufficiently fast to guarantee such a bound (possibly in expectation). In this paper, however, we focus on a different derivation of the sub linear convergence rate, which results in a more complex dependence on the constants, but, on the other hand, does not require Assumption 2 and results in a weaker assumption on φi. 4.4 Analysis of sub linear convergence In our analysis below we will use another known technique for establishing sub linear convergence of gradient descent type methods for smooth convex functions [25]. However, due to the non smooth nature of our function the analysis requires significant extensions, especially in the inexact case. Moreover, it does not apply to the line-search algorithm, we will rely on the fact that a proximal quasi-Newton method is used in that each new iteration xk+1 is an approximate minimizer of the function Q(Hk, u, x k). Our analysis, hence, also applies to proximal gradient methods. First we prove the following simple result. Lemma 10. Consider F (*) defined in (4.1.1). Let Assumptions 1(iii) hold. Then for any three points u, v, w ∈ dom(F ), we have F (u)− F (w) ≤ ‖∇f(u) + γvg,φ‖‖u− w‖+ 2Lg‖u− v‖+ 2φ. (4.4.1) where γvg,φ ∈ ∂φg(v) is any φ-subgradient of g(*) at point v. Proof. From convexity of f and g and the definition of φ-subgradient, it follows 88 that for any points u,w and v, f(u)− f(w) ≤ 〈∇f(u), u− w〉, (4.4.2) g(v)− g(w) ≤ 〈γvg,φ, v − w〉+ φ (4.4.3) g(2v − u)− g(v) ≥ 〈γvg,φ, v − u〉 − φ. (4.4.4) Hence, F (u)− F (w) = f(u)− f(w) + g(u)− g(w) = f(u)− f(w) + g(u)− g(v) + g(v)− g(w) ≤ 〈∇f(u), u− w〉+ 〈γvg,φ, v − w〉+ g(u)− g(v) + φ (4.4.5) = 〈∇f(u), u− w〉+ 〈γvg,φ, u− w〉+ 〈γvg,φ, v − u〉+ g(u)− g(v) + φ ≤ ‖∇f(u) + γvg,φ‖‖u− w‖+ g(2v − u)− g(v) + g(u)− g(v) + 2φ. Here we applied (4.4.2), (4.4.3) and (4.4.4) to get (4.4.5). Using Assumption 1(iii) to bound the term g(2v − u)− g(v) + g(u)− g(v) we can easily derive (4.4.1). 4.4.1 The exact case We now consider the exact version of Algorithm 14, i.e., φk = 0 for all k. We have the following lemma. Lemma 11. Let xk+1 := arg minu∈Rn Q(Hk, u, xk), then Q(Hk, x k, xk)−Q(Hk, xk+1, xk) ≥ σk 2 ‖xk+1 − xk‖2. (4.4.6) Moreover, there exists a vector γk+1g ∈ ∂g(xk+1) such that the following bounds hold: 1 Mk ‖∇f(xk) + γk+1g ‖ ≤ ‖xk+1 − xk‖ ≤ 1 σk ‖∇f(xk) + γk+1g ‖. (4.4.7) Proof. The proof is a special case of Lemma 13, proved below. Theorem 12. Let Assumptions 1 hold for all k. Then the iterates {xk} generated by Algorithm 14 satisfy F (xk)− F ∗ ≤ 2M 2(DX0M + 2Lg) 2 ρσ3 1 k . (4.4.8) 89 Proof. We will denote F (xk)−F ∗ by ∆Fk. Our goal is to bound ∆Fk from above in terms of 1/k which we will achieve by deriving a lower bound on 1 ∆Fk in terms of k. Let us first show that F (xk)− F (xk+1) = ∆Fk −∆Fk+1 ≥ ck∆F 2k , (4.4.9) for some constant ck, which depends on iteration k, but will be lower bounded by a uniform constant. First we will show that ∆Fk ≤ (DX0 + 2Lg σk )‖∇f(xk) + γk+1g ‖. (4.4.10) This follows simply from (4.4.1) with u = xk, w = x∗, v = xk+1 and φ = 0, ∆Fk = F (x k)− F (x∗) ≤ ‖∇f(xk) + γk+1g ‖‖xk − x∗‖+ 2Lg‖xk − xk+1‖. (4.4.11) Substituting the bounds on ‖xk+1 − xk‖ (cf. (4.4.7)) and ‖xk − x∗‖ (cf. Assumptions 1(v)) in (4.4.11) we get the desired bound (4.4.10). Now we will show that F (xk)− F (xk+1) ≥ ρσk 2M2k ‖∇f(xk) + γk+1g ‖2. (4.4.12) Indeed F (xk) − F (xk+1) ≥ ρ(Q(Hk, xk, xk) − Q(Hk, xk+1, xk)). And a bound on the reduction in Q can be established by combining (4.4.6) and (4.4.7), Q(Hk, x k, xk)−Q(Hk, xk+1, xk) ≥ σk 2M2k ‖∇f(xk) + γk+1g ‖2, Hence, the bound (4.4.12) holds. Finally, combining the lower bound on F (xk)− F (xk+1) together with the upper bound on ∆F 2k we can conclude that F (xk)− F (xk+1) = ∆Fk −∆Fk+1 ≥ ρσk 2M2k (DX0 + 2Lg σk )2 ∆F 2k , 90 which establishes (4.4.30) with ck = ρσ3k 2M2k (DX0σk+2Lg) 2 . Dividing both sides of the inequality above by ∆Fk+1∆Fk we have 1 ∆Fk+1 − 1 ∆Fk ≥ ck ∆Fk ∆Fk+1 ≥ ck. Summing the above expression for i = 0, . . . , k − 1 we have 1 ∆Fk ≥ k−1∑ i=0 ci + 1 ∆F0 ≥ k−1∑ i=0 ci, which finally implies ∆Fk = F (x k)− F ∗ ≤ 1∑k=1 i=0 ci ≤ 2M 2(DX0M + 2Lg) 2 ρσ3 1 k . Let us note that if Hk = L(f)I for all k, as in standard proximal gradient methods, where L(f) is the Lipschitz constant of ∇f(x), then the bound becomes F (xk)− F ∗ ≤ 2(DX0L(f) + 2Lg) 2 ρL(f) 1 k ≈ 2D2X0L(f) k , if Lg DX0L(f). This bound is similar to 2‖x0−x∗‖2L(f) k established for proximal gradient methods, assuming that DX0 is comparable to ‖x0 − x∗‖. 4.4.2 The inexact case We now analyze Algorithm 14 in the case when the computation of pH(v) is performed inexactly. In other words, we consider the version of Algorithm 14 (and 15) where we compute xk+1 := pHk,φk(x k) and φk can be positive for any k. The analysis is similar to that of the exact case, with a few additional terms that need to be bounded. We begin by extending Lemma 11. Lemma 13. Let xk+1 ∈ dom(F ) be such that the inequality holds Q(Hk, x k+1, xk)− min y∈Rn Q(Hk, y, x k) ≤ φk 91 for some φk ≥ 0. Then Q(Hk, x k, xk)−Q(Hk, xk+1, xk) ≥ σk 2 ‖xk+1 − xk‖2 − √ 2Mkφk‖xk+1 − xk‖ − φk. (4.4.13) Moreover there exists a vector γk+1g,φ ∈ ∂gφk(xk+1) such that the following bounds hold: 1 Mk ‖∇f(xk) + γk+1g,φ ‖ − √ 2Mkφk Mk ≤ ‖xk+1 − xk‖ ≤ 1 σk ‖∇f(xk) + γk+1g,φ ‖+ √ 2Mkφk σk . (4.4.14) Proof. Recall Lemma 3, from (4.3.2), there exists a vector, which we will refer to as γk+1g,φ , such that γk+1g,φ = Hk(x k − xk+1)−∇f(xk)− ηk ∈ ∂φg(xk+1), (4.4.15) with 1 2 ‖ηk‖2H−1k ≤ φk, which, in turn, implies ‖ηk‖ ≤ √ 2Mkφk. The following inequality follows from the definition of φ-subdifferential, g(xk)− g(xk+1) ≥ 〈γk+1g,φ , x k − xk+1〉 − φk. (4.4.16) From (4.4.15) Hk(x k+1 − xk) +∇f(xk) + γk+1g,φ + ηk = 0, (4.4.17) hence we obtain ‖xk+1 − xk‖ ≤ 1 σk ‖∇f(xk) + γk+1g,φ ‖+ 1 σk ‖ηk‖ ≤ 1 σk ‖∇f(xk) + γk+1g,φ ‖+ √ 2Mkφk σk , ‖xk+1 − xk‖ ≥ 1 Mk ‖∇f(xk) + γk+1g,φ ‖ − 1 Mk ‖ηk‖ ≥ 1 Mk ‖∇f(xk) + γk+1g,φ ‖ − √ 2Mkφk Mk . 92 Using (4.4.16) and (4.4.17), we derive (4.4.13) as follows, Q(Hk, x k, xk)−Q(Hk, xk+1, xk) = g(xk)− g(xk+1)−∇f(xk)T (xk+1 − xk)− 1 2 ‖xk+1 − xk‖2Hk ≥ 〈γk+1g,φ , x k − xk+1〉 − φk + ‖xk+1 − xk‖2HK + 〈γk+1g,φ , x k+1 − xk〉+ 〈ηk, xk+1 − xk〉 − 1 2 ‖xk+1 − xk‖2Hk = 1 2 ‖xk+1 − xk‖2Hk + 〈ηk, x k+1 − xk〉 − φk ≥ σk 2 ‖xk+1 − xk‖2 − ‖ηk‖‖xk+1 − xk‖ − φk ≥ σk 2 ‖xk+1 − xk‖2 − √ 2Mkφk‖xk+1 − xk‖ − φk. Unlike the exact case, the inquality 1 ∆Fk+1 − 1 ∆Fk ≥ ck can no longer be guaranteed to hold on each iteration. The convergence rate is obtained by observing that when this desired inequality fails another inequality always holds, which bounds ∆Fk in terms of φk. Specifically, we have the following theorem. Lemma 14. Consider kth iteration of the inexact Algorithm 14 with 0 ≤ φk ≤ 1. Let ∆Fk := F (x k)− F (x∗). Then there exists large enough positive constant θ > 0, such that one of the following two cases must hold, ∆Fk ≤ bk √ φk, (4.4.18) 1 ∆Fk+1 − 1 ∆Fk ≥ ck. (4.4.19) where bk and ck are given below, bk = θDX0 √ 2Mk + 2(1 + θ)Lg σk √ 2Mk + 2, (4.4.20) ck = ρ(σ3k(θ − 1)2 − 2σkM2k (1 + θ)− σ2kMk) ( √ 2DX0θσkMk + 2 √ 2Lg(1 + θ)Mk + σk √ Mk)2 . (4.4.21) Proof. First, applying (4.4.1) with u = xk, w = x∗ and v = xk+1 we obtain, ∆Fk = F (x k)− F (x∗) ≤ ‖∇f(xk) + γk+1g,φ ‖‖x k − x∗‖+ 2Lg‖xk − xk+1‖+ 2φk. (4.4.22) 93 We will consider two cases that are possible at each iteration k for some fixed constant θ > 1 which we will specify later. • Case 1 ‖∇f(xk) + γk+1g,φ ‖ < θ √ 2Mkφk. (4.4.23) • Case 2 ‖∇f(xk) + γk+1g,φ ‖ ≥ θ √ 2Mkφk. (4.4.24) Let us assume that Case 1 holds, then from (4.4.14) and (4.4.23) it simply follows that ‖xk+1 − xk‖ ≤ (1 + θ) √ 2Mkφk σk . (4.4.25) Using (4.4.23), (4.4.22), the bound on ‖xk+1−xk‖ from (4.4.25) together with the bound on ‖xk − x∗‖ from Assumptions 1(v) we get ∆Fk ≤ DX0‖∇f(xk) + γk+1g,φ ‖+ 2(1 + θ)Lg √ 2Mkφk σk + 2φk (4.4.26) ≤ (θDX0 + 2(1 + θ)Lg σk ) √ 2Mkφk + 2φk ≤ (θDX0 √ 2Mk + 2(1 + θ)Lg σk √ 2Mk + 2) √ φk, (4.4.27) which ensures (4.4.18). We now consider Case 2, where (4.4.24) along with (4.4.14) from Lemma 13 imply θ − 1 θMk ‖∇f(xk) + γk+1g,φ ‖ ≤ ‖x k+1 − xk‖ ≤ 1 + θ θσk ‖∇f(xk) + γk+1g,φ ‖. (4.4.28) Substituting into (4.4.22) the upper bound on ‖xk+1 − xk‖ from (4.4.28) and the bound on ‖xk − x∗‖ from Assumptions 1(v) we now get ∆Fk ≤ (DX0 + 2Lg(1 + θ) θσk )‖∇f(xk) + γk+1g,φ ‖+ 2φk. From φk ≤ 1 it follows that ‖∇f(xk) + γk+1g,φ ‖ ≥ θ √ 2Mkφk ≥ θ √ 2Mkφk. Hence we obtain ∆Fk ≤ (DX0 + 2Lg(1 + θ) θσk + 2 θ √ 2Mk )‖∇f(xk) + γk+1g,φ ‖. (4.4.29) 94 We will show that in this case, as in the exact case, we have ∆Fk −∆Fk+1 = F (xk)− F (xk+1) ≥ ck∆F 2k (4.4.30) for some constant ck (different from that in the exact case). Towards that goal we will establish a lower bound on F (xk) − F (xk+1) in terms of ‖∇f(xk) + γk+1g,φ ‖2 as before. We still have F (xk)−F (xk+1) ≥ ρ(Q(Hk, xk, xk)−Q(Hk, xk+1, xk)). We now use the bounds (4.4.13) from Lemma 13, (4.4.24) and (4.4.28) and obtain Q(Hk, x k, xk)−Q(Hk, xk+1, xk) ≥ σk 2 ( θ − 1 θMk )2‖∇f(xk) + γk+1g,φ ‖ 2 − 1 + θ θ2σk ‖∇f(xk) + γk+1g,φ ‖ 2 − 1 2θ2Mk ‖∇f(xk) + γk+1g,φ ‖ 2 By selecting a sufficiently large θ we can ensure that F (xk)− F (xk+1) ≥ ρtk‖∇f(xk) + γk+1g,φ ‖ 2, (4.4.31) for tk = σk 2 ( θ−1 θMk )2 − 1+θ θ2σk − 1 2θ2Mk > 0. Finally, combining the lower bound (4.4.31) on F (xk)−F (xk+1) together with the upper bound (4.4.29) on ∆F 2k we can conclude that (4.4.30) holds with ck = ρtk (DX0 + 2Lg(1+θ) θσk + 2 θ √ 2Mk )2 = ρ(σ3k(θ − 1)2 − 2σkM2k (1 + θ)− σ2kMk) ( √ 2DX0θσkMk + 2 √ 2Lg(1 + θ)Mk + 2σk √ Mk)2 . Finally, (4.4.19) follows from (4.4.30) divided by ∆Fk∆Fk+1 and using the fact that ∆Fk ∆Fk+1 ≥ 1. Remark 6. Let us discuss the result of the above lemma. The lemma applies for any value of θ for which tk, and hence, ck is positive for all k. It is easy to see that large values of θ imply potentially larger values of ck. On the other hand, large θ is likely to cause Case 1 to hold (i.e., (4.4.18)) instead of Case 2 (i.e., (4.4.19)) on any given iteration, with bk also growing with the size of θ. As we will show below the overall rate of convergence of the algorithm is derived using the two bounds - (4.4.18), where the rate is controlled by the rate of φk → 0 and (4.4.19), which is 95 similar to the bound in the exact case. The overall bound, thus, will depend on the upper bound on bk's and the inverse of the lower bound on ck's. If, again, we assume that σk = Mk = L(f) for all k, then θ ≥ 5 is sufficient to ensure that ck > 0 and this results in bk ≤ O(DX0L(f)) and 1/ck ≥ O(D2X0L(f)), thus again, we obtain a bound which is comparable to that of proximal gradient methods, although with more complex constants. We now derive the overall convergence rate, under the assumption that φk decays sufficiently fast. Theorem 15. Suppose that Assumption 1 holds. Assume that all iterates {xk} of inexact Algorithm 14 are generated with some φk ≥ 0 that satisfy φk ≤ a2 k2 , with 0 < a ≤ 1. (4.4.32) Let θ be chosen as specified in Lemma 14. Then for any k F (xk)− F (x∗) ≤ max{ba, 1 c } k − 1 (4.4.33) with b, c given as follows, b = θDX0 √ 2M + 2(1 + θ)Lg σ √ 2M + 2, c = ρ ( σ2 M2 (1− θ−1)2 − (2θ−1 + 3θ−2) ) 2(DX0 + 2Lg(1+θ−1)√ σ + 2√ 2 θ−1)2 . (4.4.34) Proof. Consider all iterations before a particular iteration k. From Lemma 14, it follows that either (4.4.18) or (4.4.19) must hold for each prior iteration. Let k1 < k denote the index of the last iteration before k, for which (4.4.18) holds. If no such k1 exists, then (4.4.19) holds for all k and without loss of generality we can consider k1 = 0. From (4.4.18) and from the fact that the function value never increases ∆Fk1+1 = F (x k1+1)− F (x∗) ≤ F (xk1)− F (x∗) = ∆Fk1 ≤ bk1 √ φk1 ≤ abk1 k1 (4.4.35) 96 which is the same as 1 ∆Fk1+1 ≥ k1 abk1 . (4.4.36) For each iteration from k1 + 1 to k − 1, (4.4.19) gives 1 ∆Fk1+i+1 − 1 ∆Fk1+i ≥ ci, ∀i = 1, ..., k − k1 − 1 Summing up the above inequalities and using (4.4.36) we obtain the following bound on ∆Fk, 1 ∆Fk ≥ k−k1−1∑ i=1 ci + 1 ∆Fk1+1 ≥ k−k1−1∑ i=1 ci + k1 abk1 . (4.4.37) To derive a simple uniform bound on ∆Fk we will use b and c uniform upper and lower bounds, respectively, for bk and ck given in (4.4.20), i.e., bk ≤ b, ck ≥ c,∀k ≥ k0. From Assumptions 1(iv) we can derive the expressions for c as follows, ck = ρ ( σk 2 ( θ−1 θMk )2 − 1+θ θ2σk − 1 2θ2Mk ) (DX0 + 2Lg(1+θ) θσk + 2 θ √ 2Mk )2 ≥ ρ ( σ 2 ( θ−1 θM )2 − 1+θ θ2σ − 1 2θ2σ ) (DX0 + 2Lg(1+θ) θσ + 2 θ √ 2σ )2 ≥ ρ ( σ2 2M2 ( θ−1 θ )2 − 1+θ θ2 − 1 2θ2 ) (DX0 + 2Lg(1+θ) θ √ σ + 2 θ √ 2 )2 Bound b can be obtained in a similar fashion. Substituting bounds b and c in (4.5.5) we get 1 ∆Fk ≥ (k − k1 − 1)c+ k1 ab ≥ min{c, 1 ab }(k − 1), which is the same as (4.5.2). Remark 7. It follows that the inexact version of Algorithm 14 has sublinear convergence rate if φi ≤ a2/i2 for some a < 1 and all iterations i = 0, . . . , k. In contrast, the bounds in [23] and in Section 4.3.1 require that ∑∞ i=0 √ φi is bounded. This bound 97 on the overall sequence is clearly tighter than φi ≤ a2/i2, since ∑∞ i=0 a i =∞. On the other hand, it does not impose any particular requirement on any given iteration, except that each φi is finite, which our bound on φi is assumed to hold at each iteration, so far. In Section 4.5 we will relax this assumption. 4.4.3 Complexity in terms of subproblem solver iterations Let us discuss conditions on the sequence {φi} established above and how they can be ensured. Firstly, let us note that condition a < 1 in Theorem 15 can easily be removed. We introduced it for the sake of brevity, to ensure that φi ≤ 1 on each iteration. Clearly, an arbitrarily large a can be used and in that case φi ≤ 1 for all i ≥ a. Moreover, the condition φi ≤ 1 is only needed to replace φi with √ φi in Lemma 14 in inequality (4.4.26). Instead we can use bound (4.4.24) and upper bounds on ∇f(xi) and γi+1g,φ to replace φk with a constant multiple of √ φi. In conclusion, it is sufficient to solve the subproblem on the i-th iteration to accuracy O(1/i2). The question now is: what method and what stopping criterion should be used for subproblem optimization, so that sufficient accuracy is achieved and no excessive computations are performed, in other word, how can we guarantee the bound on φi, while maintaining efficiency of the subproblem optimization? It is possible to consider terminating the optimization of the i-th subproblem once the duality gap is smaller than the required bound on φi. However, checking duality gap can be computationally very expensive. Alternatively one can use an algorithm with a known convergence rate. This way it can be determined apriori how many iterations of such an algorithm should be applied to the i-th subproblem to achieve the desired accuracy. In particular, we note that the objective functions in our subproblems are all σ-strongly convex, so a simple proximal gradient method, or some of its accelerated versions, will enjoy linear convergence rates when applied to these subproblems. Hence, after i iterations of optimizing Qi, such a method will achieve accuracy φi that decays geometrically, i.e., φi = Cδ i, for some constants C > 0 and 0 < δ < 1, hence ∑∞ i=0 √ φi is bounded. Note that the same property holds for any linearly convergent method, such as the 98 proximal gradient or a semi-smooth Newton method. Also, it is easy to see that φi ≤ a2/i2 holds for some a > 0 for all i. One can also use FISTA [2] to optimize Qi which will ensure φi ≤ a2/i2 for some a > 0 but will not guarantee ∑∞ i=0 √ φi. The advantage of using FISTA and its resulting rate is that it does not depend on the strong convexity constant, hence the subproblem complexity does not depend on σ the lower bound on the smallest eigenvalues of the Hessian approximations. In conclusion, we have the following overall bounds. Theorem 16. Suppose that Assumptions 1 hold and that for all k, at the k-th iteration of inexact Algorithm 14 function Q(Hk, u, x k) is approximately minimized, to obtain xk+1 by applying l(k) = αk + β steps of any algorithm which guarantees that Q(Hk, x k+1, xk) ≤ Q(Hk, xk, xk) and whose convergence rate ensures the error bound φk ≤ a2/(αk + β)2 for some a > 0. Then accuracy F (xt) − F (x∗) ≤ ε is achieved after at most K = βt+ α 2 (t− 1)t (4.4.38) inner iterations (of the chosen algorithm), with t = dmax{ba, 1 c } ε + 1e and b, c given in Theorem 15. Proof. A proof follows trivially from Theorem 15. Theorem 17. Suppose that Assumptions 1 hold and that or all k at the k-th iteration of inexact Algorithm 14 function Q(Hk, u, x k) is approximately minimized, to obtain xk+1 by applying l(k) steps of an algorithm, which guarantees that Q(Hk, x k+1, xk) ≤ Q(Hk, x k, xk) and whose convergence rate ensures the error bound φk ≤ δl(k)MQ, for some constants 0 < δ < 1 and MQ > 0. Then, by setting l(k) = 2log 1 δ (k), accuracy F (xt)− F (x∗) ≤ ε is achieved after at most K = t∑ k=0 2 log 1 δ (k) ≤ 2t log 1 δ (t) (4.4.39) inner iterations (of the chosen algorithm), with t = dmax{ba, 1 c } ε + 1e and b, c given in Theorem 15. 99 Proof. A proof follows trivially from Theorem 15. Remark 8. The total complexity in terms of the inner iterations should not be viewed as a summary of the whole complexity of Algorithm 14. A key step of the algorithm is the computation of F (xk) and ∇f(xk) at each iteration. In big data applications this is often the most extensive step, hence the main complexity is defined by the number of function and gradient computations. Due to backtracking via proximal parameter update to satisfy sufficient decrease condition, the number of function and gradient computation may be larger than the number of iterations of Algorithm 14, however, it does not exceed this number by more than a logarithmic factor. In practice, only several initial iterations contain backtracking steps, hence Theorem 15 provides the bound on the complexity in terms of function and gradient computations. In the next section we extend our convergence rate results to the case of solving subproblems via randomized coordinate descent, where φ is random and hence does not satisfy required bounds on each iteration. 4.5 Analysis of the inexact case under random subproblem accuracy As we pointed out in the introduction, the most efficient practical approach to subproblem optimization, in the case when g(x) = λ‖x‖1, seems to be the coordinate descent method. One iteration of a coordinate descent step can be a lot less expensive than that of a proximal gradient or a Newton method. In particular, if matrix H is constructed via the LBFGS approach, then one step of a coordinate decent method takes a constant number of operations, m (the memory size of LBFGS, which is typically 10-20). On the other hand, one step of proximal gradient takes O(mn) operations and Newton method takes O(nm2). Unfortunately, cyclic (Gauss-Seidel) coordinate descent, does not have deterministic complexity bounds, hence it is not possible to know when the work on a particular 100 subproblem can be terminated to guarantee the desired level of accuracy. However, a randomized coordinate descent has probabilistic complexity bounds, which can be used to demonstrate the linear rate of convergence in expectation. We have the following probabilistic extension of Theorem 15. Theorem 18. Suppose that Assumption 1 holds. Assume that for all k iterates {xk} of inexact Algorithm 14 are generated with some φk ≥ 0 that satisfy P{φk ≤ a2 k2 } ≥ 1− p, for some 0 < a ≤ 1 and 0 ≤ p < 1, (4.5.1) conditioned on the past. Let θ, b and c be as specified in Theorem 15. Then for any k E(F (xk)− F (x∗)) ≤ max{ba, 1 c }(2− p) (1− p)(k − 1) . (4.5.2) Proof. As in the proof of Theorem 15 consider all iterations before a particular iteration k. From Lemma 14, it follows that either (4.4.18) or (4.4.19) must hold for each prior iteration. Let k1 < k denote the index of the last iteration before k, for which (4.4.18) holds. Let k2 denote the index of the second to last such iteration and so on, hence ki is the index of the last iteration such that there are exactly i iterations between ki and k − 1 for which (4.4.18) holds. Without loss of generality we can assume that ki exists for each i, because if it does not we can set ki = 0 and obtain a better bound. Let us now assume for a given i that φki ≤ a 2 ki 2 holds, but that φkj > a2 kj 2 for all j = 1, . . . , i− 1 (if i = 1 we have the case analyzed in the proof of Theorem 15). The analysis in the proof of Theorem 15 extends easily to the case i > 1 by observing that 1 ∆Fl+1 − 1 ∆Fl ≥ cl holds for any ki + 1 ≤ l ≤ k − 1, l 6= kj, j = 1 . . . , i− 1, that is all the iterations for which or (4.4.18) does not hold. We also have 1 ∆Fl+1 − 1 ∆Fl ≥ 0, 101 for ki+1 ≤ l ≤ k−1, l = kj, j = 1 . . . , i−1, simply from the fact that function values never increase. Summing up the above inequalities and using the fact that φki ≤ a 2 ki 2 we obtain ∆Fk, 1 ∆Fk ≥ k−ki−i∑ l=1 cl + kl abkl ≥ (k − ki − i)c+ ki ab , (4.5.3) and finally, we have ∆Fk ≤ max{ba, 1 c } k − i . (4.5.4) Now, recall that P{φki ≤ a 2 ki 2} ≥ 1 − p, for any iteration i, independently of the other iteration. This means that the probability that {φki ≤ a 2 ki 2} and {φkj > a 2 kj 2} for all j = 1 . . . , i− 1 is (1− p)pi. This implies E(∆Fk) ≤ k−1∑ i=1 max{ba, 1 c } k − i (1− p)pi−1 ≤ max{ba, 1 c }(1− p) k − 1 k−1∑ i=1 (pi−1 + i− 1 k − i pi−1). (4.5.5) To bound the term ∑k−1 i=1 (p i−1 + i−1 k−ip i−1) observe that i−1 k−ip i−1 ≤ (i−1)pi−1 and hence k−1∑ i=1 (pi−1 + i− 1 k − i pi−1) ≤ 1 (1− p) + 1 (1− p)2 = 2− p (1− p)2 , which gives us the final bound on the expected error. We note that the (2-p) factor is the result of an overestimate of the weighted geometric series and a tighter bound should be possible to obtain. Below we show that randomized coordinate descent can guarantee sufficient accuracy for subproblem solutions and hence maintain the sub linear convergence rate (in expectation) of Algorithm 14. Moreover, we show in Section 5.7.3, that the randomized coordinate descent is as efficient in practice as the cyclic one. 102 4.5.1 Analysis of Subproblem Optimization via Randomized Coordinate Descent In randomized coordinate descent the model function Q(*) is iteratively minimized over one randomly chosen coordinate, while the others remain fixed. The method is presented in Algorithm 16 and is applied for l steps, with l being an input parameter. Algorithm 16: Randomized Coordinate Descent for optimizing Model Function Q(H, v, x) over v: RCD (Q(H, v, x), x, l) 1 Set p(x)← x; 2 for i = 1, 2, * * * , l do 3 Choose j ∈ {1, 2, ..., n} with probability 1 n ; 4 z∗ = arg min z Q(H, p(x) + zej, x); 5 p(x)← p(x) + z∗ej; 6 Return p(x). Here we will show how properties of coordinate descent can be used together with the analysis in Section 4.4. Combination of coordinate descent with the weaker analysis presented can be found in Section 4.3.1. Our analysis is based on Richtarik and Takac's results on iteration complexity of randomized coordinate descent [31]. In particular, we make use of Theorem 7 in [31], which we restate below without proof, while adapting it to our context. Lemma 19. Let v be the initial point and Q∗ := minu∈Rn Q(H, u, v). If vl is the random point generated by applying l randomized coordinate descent steps to a strongly convex function Q, then for some constant we have P{Q(H, vl, v)−Q∗ ≥ φ} ≤ p, as long as i ≥ n(1 + μ(H)) log(Q(H, v, v)−Q ∗ φp ), 103 where μ(H) is a constant that measures conditioning of H along the coordinate directions and in the worst case is at most M/σ the condition number of H. Let us now state the version of Algorithms 14 and 15 which is close to what is implemented in practice and for which the complexity bound can be applied. Algorithm 17: Proximal Quasi-Newton method using randomized coordinate descent 1 Choose 0 < ρ ≤ 1, a, b > 0 and x0; 2 for k = 0, 1, 2, * * * do 3 Choose 0 < μk, θk > 0, Gk  0.; 4 Find Hk = Gk + 1 2μk I and xk+1 := pHk,φk(x k) 5 by applying Prox Parameter Update with RCD (μk, Gk, x k, ρ, a, b).; Algorithm 18: Prox Parameter Update with RCD (μ, G, x, ρ, a, b) 1 Select 0 < β < 1 and set μ = μ; 2 for i = 1, 2, * * * do 3 Define H = G+ 1 2μ I, and compute p(x) := pH,φ(x) 4 by applying RCD (Q(H, v, x), x, l(k)); 5 If F (p(x))− F (x) ≤ ρ(Q(H, p(x), x)− F (x)), then output H and p(x), Exit ; 6 Else μ = βiμ; The conclusion of Lemma 19 in application to Algorithms 16-18 is that when applying l(k) steps of randomized coordinate descent approximately to optimize Q(Hk, u, x k), φk ≤ MQδl(k), with probability p, where MQ is an upper bound on Q(Hk,x k,xk)−Q(Hk,pH(xk),xk) p and δ = e − 1 n(1+μ(Hk)) . This, together with Theorem 18 implies that to solve subproblem on iteration k it is sufficient to set l(k) = O(n(1 + μ(Hk)) log(kp/MQ)) and the resulting convergence rate will then obey Theorems 15 and 17. However, it is necessary to know constants MQ and μ(H) to be able to construct efficient expression l(k). In practice, a successful strategy is to select a slow growing linear function of k, l(k) = ak + b. This certainly guarantees convergence 104 rate of the outer iteration as in Theorem 18. In terms of overall rate this gives inferior complexity, however, we believe that the real difference in terms of the workload appears only in the limit, while in most cases the algorithm successfully terminated before our practical formula l(k) = ak+ b, described in the next section, significantly exceeds, the theoretical bound O(n(1 + μ(H))log(kp/MQ)) with appropriate constants. Moreover, as noted earlier, the number of function and gradient evaluations may be the dominant complexity in the big data cases, hence it may be worthwhile to increase workload of coordinate descent in order to reduce the constants in the bound in Theorem 18. Finally, we note that when using ISTA method for the subproblem, instead of randomized coordinate descent, the number of inner iteration does not need to depend on the dimension n. However, it depends on σk/Mk and the cost per iteration is roughly n times bigger than that of coordinate descent (with LBFGS matrices). Hence the overall complexity of using coordinate descent is better than that of ISTA if μ(Hk) σk/Mk. This indeed happens often in practical problems, as is discussed in [31] and other works on coordinate descent. 105 Chapter 5 Adaptive Active-set in Proximal Quasi-Newton Method for Large Scale Sparse Optimization 5.1 Introduction We consider convex sparse unconstrained minimization problem of the following general form min x F (x) = λ‖x‖1 + L(x) (5.1.1) where L : Rp → R is convex and twice differentiable and λ > 0 is the regularization parameter that controls the sparsity of x. More generally, the regularization term λ‖x‖1 can be extended to ‖λ ◦ x‖1 = ∑p i=1 λi|xi| to allow for different regularization weights on different entries, e.g., when there is a certain sparsity pattern desired in the solution x. We will focus on the simpler form as in (5.1.1) in this work for the sake of simplicity of presentation, as the extension to the general form is straightforward. To keep the current article short and focused, we leave the theoretical analysis of a general inexact quasi-Newton proximal scheme out of this chapter. This chapter 106 is organized as follows. In Section 5.2 we overview the algorithm and define basic terminologies. Section 5.3 we introduce compact LBFGS and its low-rank structure, paying particular attention to the mutable matrix type. Section 5.4 discusses active-set strategies and the convergence guarantee. Section 5.6 discusses different termination conditions for the inner loop and their implications on the global convergence performance. In Section 5.7 we present numerical comparisons of different active-set methods and termination conditions on algorithm running time. 5.2 The proposed algorithm Two-loop scheme. At k-th iteration of the algorithm, the descent step is computed from minimizing a quadratic function QHk,xk(*) that serves as an approximation to (5.1.1) around the current iterate xk, QHk,xk(d) := L(x k + d) + 〈∇L(xk), d〉+ 1 2 〈d,Hkd〉+ λ‖xk + d‖1. Solving the subproblem, or the minimization of Q, with a general matrix H can sometimes be as hard as solving the original problem (5.1.1) [76], and often requires taking iterative steps, which we denote as inner iterations. Low-rank structure. The difficulty of the subproblem can vary a lot with different choice of H. Sometimes it is as easy as computing a closed-form solution, e.g., with H to be a multiple of identity matrix. But such Q does not necessarily serve as a good approximation to (5.1.1), hence may not result in a good step for reducing (5.1.1). In this work, we propose to use a compact representation of L-BFGS as the matrix H, which, as we will see, strike a balance between the easiness of minimizing Q and the goodness of the approximation it provides. Moreover, such H possesses a nice low-rank structure, which can be exploited to facilitates the minimization of Q. Self-adaptive active-set selection. At every outer iteration, we divide the variables into two sets, the free set I and the active set A. We only update those 107 belonging in the free set during inner iterations, i.e., Q is minimized in a subspace formed by those variables in the free set, dk := arg min d {QHk,xk(d), s.t. di = 0 ∀i ∈ Ak}. (5.2.1) This allows us to construct a smaller H matrix, solve a smaller subproblem and in some cases optimize the memory usage to be more cache-friendly since everything now is small enough to fit within a cache line. We use active-set method to denote the strategy of constructing the two sets. For problem (5.1.1), the active set is often selected such that it approximates the set of zero entries in the optimal solution x∗. For example, if this set is known in advance, then the optimal solution of (5.1.1) can be obtained by only minimizing over the variables not in the free set while fixing those in the active set to zero. Inexactness. The definition in (5.2.1) implies that dk is the exact minimizer of the subproblem. In practice, however, the subproblems are almost never optimized exactly (and rightfully so), unless in cases when H is a multiple of diagonal so that closed-form solutions exist. Hence we also define the inexact minimizer, dkφ, where the inexactness is captured by a nonnegative scalar φk ≥ 0, QHk,xk(d k φ) ≤ QHk,xk(d k) + φk and QHk,xk(d k φ) ≤ QHk,xk(0). (5.2.2) We use inexact conditions to denote the precision the subproblems need to be minimized to. As in other classic optimization algorithms, e.g., inexact Newton methods, inexact conditions plays an important role both in convergence rate analysis and in practical performance, e.g., running time, number of function and gradient evaluations. A common intuition demands that subproblems be solved more accurately as the objective approaches optimality. In this work we will describe a particular inexact conditions we propose on φ, the simple stopping criteria that follows and its effect on the algorithm performance. 108 Globalization. Finally, to drive global convergence, the following sufficient decrease condition, a relaxation of what is used in [2] and [23], is validated on every (inexact) minimizer of the subproblem F (xk + dk)− F (xk) ≤ ρ(QHk,xk(d k)− F (xk)) (5.2.3) where ρ is a constant between 0 and 1. We use dk to denote the dkφ that satisfies the condition, and we update the iterate by xk ← xk + dk. In particular, we let H be parameterized by μ, Hk ← μ 2 +Bk Let μ iterate through sequence {μ0, μ0β, μ0β2, ...} with β as a constant larger than 1. As μ becomes larger, we have a slightly different subproblem with a larger strongly convex parameter, and the step d computed from minimizing the subproblem will eventually be a descent direction for the objective. Once we find the smallest μ that produces such dkφ satisfying the sufficient decrease condition, we return from the procedure with dk ← dkφ. The general framework described above is summarized in Algorithm 19. Next we discuss how to construct matrix Bk, the active set Ak, the algorithm for minimizing Q and the criteria for terminating the inner iteration. Algorithm 19: LHAC for sparse optimization 1 Input: x0, μ0, β 2 for k = 1, 2, ... do // outer loop 3 compute ∇L(xk), Bk,Ak 4 μ← μ0/β,Hk ← Bk, dk ← 0 5 while condition (5.2.3) is not satisfied do 6 μ← μβ,Hk ← μ2 I +Bk 7 for i = 1, 2, ... do // inner loop 8 update dk towards the minimizer of QHk,xk(*) 9 xk+1 ← xk + dk 109 5.3 Mutable matrix type in low-rank hessian approximation We make use of a specific form of the low-rank Hessian estimates, i.e., Bk = B0 − GRGT . The matrices B0, G,R are defined in Theorem 20, which, adopted from [21], shows that Bk is a compact representation of L-BFGS method. In Section 5.5 we will see how to exploit this structure in coordinate descent to facilitate the subproblem minimization. In this section we will focus on discussing a mutable matrix type we implement which is specifically designed for efficiently and repeatedly mutating matrices without consuming extra memory. This new data type will be particularly useful for implementing coordinate descent on the low-rank Hessian approximation, as we will see in the second part of this section. In the first part we briefly describe Theorem 20 which defines S ∈ Rn×m, T ∈ Rn×m, L ∈ Rm×m, D ∈ Rm×m which are the main matrices we use this data type to instantiate. Theorem 20. Let B0 be symmetric and positive definite, and assume that the k vector pairs {si, ti}k−1i=0 satisfy sTi ti > 0, si = xi+1 − xi and ti = ∇Li+1 − ∇Li. Let Bk be obtained by applying k BFGS updates with these vector pairs to B0, using the formula: Bk = Bk−1 − Bk−1sk−1s T k−1Bk−1 sTk−1Bk−1sk−1 + yk−1y T k−1 yTk−1sk−1 . (5.3.1) From the correction pairs {si, ti} define matrices Sk, Tk ∈ Rp×k, Lk, Dk ∈ Rk×k: Sk = [s0, ..., sk−1], (Lk)i,j = sTi−1tj−1 if i > j0 otherwise, (5.3.2) Tk = [t0, ..., tk−1], Dk = diag[s T 0 t0, ..., s T k−1tk−1]. Construct G ∈ Rp×2k and R ∈ R2k×2k from B0, Sk, Tk, Lk, Dk in the follow manner: G = [ B0Sk Tk ] , R = [ STk B0Sk Lk LTk −Dk ]−1 . (5.3.3) 110 We then have that Bk = B0 −GRGT . (5.3.4) For large-scale problems the matrices Sk and Tk are augmented by one column every iteration according to (5.3.2). The update (5.3.4) become inefficient as k grows, if all the previous pairs {si, ti} are used. Hence BFGS method is often used in the limited-memory setting, known as the L-BFGS method [77]. In the limited-memory setting, we maintain the set {si, ti} with the m most recent correction pairs by removing the oldest pair and adding the latest one. When the number of iteration k exceeds m, the representation of the matrices Sk, Tk, Lk, Dk needs to be slightly modified to reflect the changing nature of {si, ti}: Sk and Tk are maintained as the p×m matrices with the i-th column of Sk and Tk being sk−1−m+i and tk−1−m+i, respectively (i = 1, 2, ...,m). Also, Lk and Dk are now both the m×m matrices: Dk is a diagonal matrix with the i-th diagonal entry to be s T k−1−m+itk−1−m+i and Lk is a strictly lower triangular matrix with the i-th row given by the i−1 leftmost entries of sTk−1−m+iTk. 5.3.1 Efficient matrix mutating operations As shown in Theorem 20, the compact representation requires efficiently mutating matrices such as S ∈ Rn×m, T ∈ Rn×m, L ∈ Rm×m, D ∈ Rm×m. The mutations, however, follow certain patterns that we shall exploit, i.e., (5.3.2), to define the data type LMatrix. For example, operations that are performed most often are the deletion of the first column and the addition of the new one to the end; although row mutations are also required, they are only needed for L, which is a much smaller matrix than S, i.e., m by m where m is a small chosen constant. We summarized in Table 5.1 the complexity of four LMatrix mutating operations required in obtaining the compact representation. It can be seen from the table that column operations can be implemented using just one built-in library function call. For row operations sometimes we need m separate function calls but again m is very small so the cost is negligible. 111 We also discuss implementing matrix-vector product on LMatrix, which will be used for maintaining STS. We will see that for a n by m matrix a naive implementation requires m separate lv1 BLAS function calls. But after careful arrangement through a maintained permutation matrix we are able to improve that to a single function call to lv2 BLAS function, which enjoys better multi-core usage and is more cache-friendly. Maintaining the permutation matrix can also be done efficiently, as efficient as one pass through a linear array of m entries which fits within a single cache line of most personal computers. Finally, we will see that, besides the matrixvector product, the same mutating operations in Table 5.1, are needed again on the maintained matrix STS. Hence we also use LMatrix to instantiate matrix STS. Row Append Row Pop Column Append Column Pop Complexity O(m) O(mn) O(n) O(m) Implementations m copy m memmove 1 memcpy 1 memmove Table 5.1: Row and column operations on LMatrix type M ∈ Rn×m. Note that while one memcpy or memmove on O(n) data has a complexity of O(n), the constant in front of n can be much smaller than 1 (hence much faster than normal n flops), due to the use of cache optimizations and specialized processor instructions such as SIMD in those built-in library functions. Mutable matrix type LMatrix Column operations LMatrix is implemented such that two mutating operations on the matrix, column append (CA) and column pop (CP), are optimized for maximum possible efficiency. Here CA means inserting a column vector to the right, i.e., the inserted vector becomes the last column of the matrix; and CP means delete the first column on the left. By "maximum efficiency" we mean the complexity of both operations is linear to one dimension of the matrix. For example, assume that we initiate an instance of type LMatrix with a n by m matrix. Then CA can be done by one call to memcpy which copies O(n) elements, and CP is done by one call to memmove 112 which copies O(m) elements while taking into account overlapping. We consider these complexities to be "optimal" because we need at least O(n) to insert (and copy) a n-dimensional vector. The O(m) complexity for CP is also efficient, especially in our case where we have n m. Matrix-vector product To achieve the optimal complexity for CP and CA, LMatrix uses two arrays internally, ref and body, where body stores the nm entries of the matrix in column major and ref stores pointers to each column. In particular, ref keeps the pointers sorted to reflect the column order of the matrix so that we do not need to move around nm entries in body every time the matrix is mutated. Hence for any LMatrix instance, a naive implementation of matrix-vector product would be a loop iterating through the ref array and performing a lv1 BLAS operation to compute vector-vector product for each column. To achieve better cache uses and parallel performance, here we discuss how to utilize lv2 BLAS library and combine it with permutation to implement the matrix-vector product in LMatrix. Let us use M ∈ Rn×m to denote the matrix given by accessing body in column major, and M ∈ Rn×m to denote the one given by ref. M and M will have exactly the same mn entries but different column order. For example, the i-th entry ref (i) points to M(:, i), the i-th column in M , which might be in the j-th column of M . This connection between M and M can be captured by a permutation matrix P ∈ Rm×m, i.e., P (j, :) = eTi if M(:, i) = M(:, j). Computing the product x = M T ∗ b can thus be done by two calls to the BLAS function dgemv, the first call to compute x = MT ∗ b and the second to obtain the final result by permuting x, i.e., x = P ∗ x. P is computed on the fly, from a linear array p ∈ Rm where p(i) = j if M(:, i) = M(:, j), because maintaining p can be done more efficiently. For that we will exploit the fact that CA and CP are always performed together in Algorithm 20. To be more specific, at the beginning, we initialize p such that p(i) = i, i.e., P is an identity matrix and M = M . Later whenever CA and CP are performed, we update p, i.e., for each entry of p either decreasing it by 1, or replacing it by m if that entry is 1 (assuming that the first column of M is M(:, 1) instead of M(:, 0)). 113 The extra complexity of such arrangement consists of one O(m) for updating p and another O(m) for the matrix-vector multiplication P ∗ x, counting only the nonzero operations. Notice that once p is computed, it is valid to use until when LMatrix is mutated again. The complexity of matrix-vector product with LMatrixM ∈ Rn×m is thus O(m(n + 1)) ≈ O(mn) when m is small, which is roughly the same as the naive implementations but is much more cache-friendly. Maintain STk Sk We maintain the small matrix STk Sk ∈ Rm×m and update it every iteration by a matrix-vector product which costs O(nm) operations, rather than the required O(m2n) if we compute the matrix from scratch. Consider Sk−1 and Sk, Sk−1 = [ sk−m−1 C ] , Sk = [ C sk−1 ] , where they share the same sub-matrix denoted by C. Hence, STk−1Sk−1 = [ sTk−m−1sk−m−1 s T k−m−1C CT sk−m−1 C TC ] , STk Sk = [ CTC CT sk−1 sTk−1C s T k−1sk−1 ] . Let us contain STk−1Sk−1 in a LMatrix type structure Mk−1 and consider the complexity of updating Mk−1 to Mk such that Mk contains S T k Sk. Besides CP and CA, we introduce two mutating operations, row pop (RP) and row append (RA), to denote the corresponding row-wise operations. The complexity of the four mutations is summarized in Table 5.1. Updating Mk−1 to Mk can then be expressed by the following operation, Mk ← CA(RA(CP (RP (Mk−1)), sTk−1C), STk sk−1) (5.3.5) where RA(M, v) appends v to the last row of M and CA(M, v) appends v to the last column of M . Note that sTk−1C is simply the first m − 1 sub-vector of STk sk−1. Hence the update (5.3.5) involves a LMatrix-vector product STk sk−1 followed by the four operations applied in sequence. The complexity of maintaining STk Sk is then O(m2 +m(m− 1) +m− 1 +m− 1 +m) = O(m2). 114 Reduced Q ∈ Rn×2m and Q ∈ R2m×n A reduced matrix QI ∈ R|I|×2m indexed by set I contains only a subset of rows in the full matrix Q. We have QI = Q when the free set I contains all the coordinates. In sparse optimization, I is constructed such that only a small subset of coordinates are included to be optimized. It is thus inefficient to construct the full matrix Q while only a small number of rows in Q are actually needed in the computation. Given QI , the matrix Q can also be reduced to QI = RQ T I with R ∈ R2m×2m which is now a simpler linear system to solve with a RHS of a much smaller size. 5.4 Self-adaptive greedy active-set selection In this section we describe the strategy to construct the active set A. In the discussion that follows we will consider any point x ∈ dom(F ), thus removing the superscript in xk. We first define four index sets, which form a partition of the set P = {1, ..., n}, Z1(x) := {i ∈ P | xi 6= 0 and |∇iL(x)| ≥ λ}, Z2(x) := {i ∈ P | xi 6= 0 and |∇iL(x)| < λ}, Z3(x) := {i ∈ P | xi = 0 and |∇iL(x)| > λ}, Z4(x) := {i ∈ P | xi = 0 and |∇iL(x)| ≤ λ}. (5.4.1) We denote as Pi(x) the i-th coordinate of the least norm subgradient of F at x: Pi(x) =  ∇iL(x) + λ xi > 0, ∇iL(x)− λ xi < 0, max{|∇iL(x)| − λ, 0} xi = 0 (5.4.2) The four sets can thus be expressed using P (x) as follows: Z1(x) ∪ Z2(x) ∪ Z3(x) = {i | Pi(x) 6= 0} (5.4.3) Z4(x) = {i | Pi(x) = 0} 115 Also, from the optimality conditions of (5.1.1) it follows that, ||P (x)|| → 0 as w → x∗. (A1) We denote the active set and free set at optimality as A∗ and I∗, respectively. We define A∗ as the index set of zero entries in x∗, and I∗ as that of nonzero entries, i.e., A∗ := {i ∈ P|xi = 0}, I∗ := {i ∈ P|xi 6= 0}. From optimality condition we have, A∗ = Z4(x∗), I∗ = Z1(x∗) ∪ Z2(x∗), Z3(x∗) = ∅. Hence the goal of any active-set method discussed here is to construct A such that A → A∗ as x → x∗ (assuming I is the complement of A). Let us start with a "standard" active-set method (STD) which simply set A to Z4(x). A nice property can then be established about STD, which shows that optimizing over the subspace formed by STD is equivalent to optimizing over the full space with a block-diagonal H matrix. Lemma 21. Let x, g ∈ Rn, I := {i | xi 6= 0 or |gi| > λ} and its complement A := {i | xi = 0 and |gi| ≤ λ}. Let HI be any p.d. matrix of dimension |I| × |I|. The vector d∗ ∈ Rn defined below, d∗A := 0, d ∗ I := arg min d∈R|I| {〈gI , d〉+ 1 2 〈d,HId〉+ λ‖xI + d‖1}. (5.4.4) is also the solution for the following problem, min d∈Rn {〈g, d〉+ 1 2 〈d,Hd〉+ λ‖x+ d‖1}, (5.4.5) where H = [ HI 0 0 tI ] is a n by n matrix with t > 0 as a fixed constant. Proof. Note that (5.4.5) is block-separable with one block corresponding to set A and the other to set I. The solution of I block is exactly d∗I . And applying the 116 definition of A and the optimality conditions it is easy to see that the solution of A block is 0. In STD the free set I is constructed as Z1(x) ∪ Z2(x) ∪ Z3(x) which eventually converges to I∗ = Z1(x∗) ∪ Z2(x∗) as Z3(x) is diminishing at optimality. The idea is that for I to converge to Z1(x∗) ∪ Z2(x∗) we have to at lease include these two sets Z1(x) and Z2(x). But only including them would not converge because for an active-set strategy to work it has to allow both directions – nonzero to turn to zero and the other way, and obviously optimizing over Z1(x)∪Z2(x) only allows variables to change in one direction – from non-zero to zero. So STD also includes set Z3(x). In this work we propose another active-set method, which we denote as GRDY. The method computes two sets Z2(x) and Z1(x)∪Z3(x), and construct I differently based on the difference of the two sets. In particular, at the beginning it sorts Z1(x)∪Z3(x), pick the first |Z2(x)| and join it with Z2(x); later when |Z1(x)∪Z3(x)| is close to |Z2(x)|, it includes all entries Z1(x) ∪ Z2(x) ∪ Z3(x) in I. To explain the idea behind GRDY, let us consider how each of those sets is changing as x approaches optimality, i.e., w → x∗. A key observation is that when x is far from x∗, Z3(x) and Z4(x) contain few entries as there are only a few zero elements in x. The set Z2(x) is also small because most gradient elements are larger than λ which is often chosen to be a small number (less than one for example). In fact, Z1(x) is often the largest set at this point. Note that the size of I is only 2|Z2(x)| by GRDY. As a comparison, the I by STD contains all entries from Z1(x)∪Z2(x)∪Z3(x), which includes almost all the variables. Later as the algorithm advances, the size of Z1(x) is decreasing while the other sets are expanding. Eventually condition (A1) demands that Z4(x) contain all of the zero components in x and Z3(x) contains none. One notable observation here is that at the neighborhood of x∗, i.e., w ∈ Nσ(x∗), we have the sizes of two sets Z1(x) and Z2(x) "converging" to each other: |Z1(x)| ≈ |Z2(x)| ≈ |I∗|/2 which implies that |Z1(x) ∪ Z3(x)| is close to |Z2(x)|. At this point GRDY will automatically include in I all entries from Z1(x) ∪ Z2(x) ∪ Z3(x). 117 5.5 Fast coordinate descent on low-rank structure At k-th iteration given the current iterate xk, we apply coordinate descent method to the piecewise quadratic subproblem to obtain the direction dk. Suppose jth coordinate in dk is to be updated, i.e., dk ← dk + z∗ej (ej is the j-th vector of the identity), where z∗ is determined by solving the following one-dimensional problem, z∗ = arg min z 1 2 (Hk)jjz 2 + ((∇Lk)j + (Hkd)j)z + λ|xkj + dj + z|. (5.5.1) Closed-form solutions exist and can be obtained by one soft-thresholding step [17, 78]: z∗ = −c+ S(c− b/a, λ/a) (5.5.2) with a, b, c chosen to be a = (Hk)jj, b = (∇Lk)j + (Hkdk)j, c = xkj + dkj and the shrinkage function as S(u, r) := sgn(u) max(|u| − r, 0). As mentioned above, the special low-rank update of Hk provides us an opportunity to accelerate the coordinate descent process. To see how, let us recall that Hk = μ 2 I +Bk = μ 2 I +B0k −GRGT . The basic matrix B0 in (5.3.4) may vary from iteration to iteration, so here we use B0k to denote the basic matrix in k-th iteration. A popular choice of B0k in practice is to set it to tTk−1tk−1 tTk−1sk−1 I, which is proved effective in ensuring that the search direction is well-scaled so that less time is spent on line search. Letting γk = μ 2 + tTk−1tk−1 tTk−1sk−1 and Ĝ = RGT , we have, Hk = γkI −GĜ. where G ∈ Rp×2m, Ĝ ∈ R2m×p and m chosen as a small constant. Clearly we do not need to explicitly store or compute Hk. Instead, since Hk is only used through (Hk)ii and (Hkd)i when applying soft-thresholding steps to updating each coordinate i, we can only store the diagonal elements of Hk and compute (Hkd)i on the fly whenever it is needed. Specifically, (Hk)ii = γk − gTi ĝi (5.5.3) 118 where gi is the ith row of the matrix G and ĝi is the ith column vector of the matrix Ĝ. To compute (Hkd)i, we maintain a 2m dimensional vector dG := Ĝd, then (Hkd k)i = γkd k i − gTi dG (5.5.4) which takes O(2m) flops, instead of O(p) if we multiply d by the ith row of Hk. Notice though, that after taking the soft-thresholding step dG has to be updated each time by dG ← dG + ziĝi. This update requires little effort given that ĝi is a vector with 2m dimensions where m is often chosen to be between 3 and 20. However, we need to use extra memory for caching Ĝ and dG which takes O(2mp + 2m) space. With the other O(2p + 2mp) space for storing the diagonal of Hk, G and d, altogether we need O(4mp+ 2p+ 2m) space, which can be written as O(4mp) when p m. So far we have discussed most missing details from Algorithm 19, and the results are summarized in Algorithm 20. Particularly, in Step 8 of Algorithm 20 we use a set function Γ(S, s) which returns a new set composed of s elements from S. It returns S itself if s ≥ |S|. To determine which s elements should be chosen, greedy methods as described above or randomization can be used. In Step 9 we compute and store the diagonal entries in Hessian approximation Bk with indices contained in Ik. In Step 11 we divided μ0 by β so that we have μ = μ0 for the first while iteration. In Step 14 we apply coordinate descent, as discussed in Section 5.5, to the subproblem constructed from Ik. Finally, we introduce a function l(k) in Step 14, which denote the number of inner iterations we apply at the k-th outer iteration. We will discuss l(k) in more details next. 5.6 Inner loop termination criteria and global convergence The inner loop of Algorithm 20 is terminated after l(k) iterations, hence the termination criteria we use is simply the number computed by the function l with respect to the outer iteration counter k. Alternatively, one could consider terminating the 119 Algorithm 20: Low rank Hessian Approximation in Active-set Coordinate descent (LHAC) 1 Input: x0 initial iterate, m > 0 the LBFGS parameter, μ0 > 0, η > 1, β > 1, 1 ≥ ρ ≥ 0 2 Set x1 ← x0, θ1 ← 1, G← 0, Ĝ← 0 3 for k = 1, 2, ... do // outer loop 4 Compute Z2(xk) and Z1(xk) ∪ Z3(xk) by (5.4.1) 5 if |Z1(xk) ∪ Z3(xk)| ≤ η|Z2(xk)| then 6 Ik = Z2(xk) ∪ Z1(xk) ∪ Z3(xk) 7 else 8 Ik = Γ(Z1(xk) ∪ Z3(xk), |Z2(xk)|) ∪ Z2(xk) 9 for coordinate j ∈ Ik do 10 Compute (Bk)jj = θk − gTj ĝj 11 Set dG ← 0, dk ← 0, μ← μ0/β 12 while condition (5.2.3) is not satisfied do 13 μ← μβ, γk ← μ2 + θk 14 for i = 1, 2, ..., l(k) do // inner loop 15 for coordinate j ∈ Ik do 16 Compute a = μ 2 + (Bk)jj, b = ∇jL(xk) + γkdkj − gTj dG and c = xkj + d k j 17 Compute z = −c+ sgn(c− b/a) max(|c− b/a| − λ/a, 0) 18 Set dk ← dk + zej, dG ← dG + zĝj 19 Set xk+1 ← xk + dk, tk ← ∇L(xk+1)−∇L(xk), sk ← xk+1 − xk, θk+1 ← tTk tk tTk sk 20 Set Sk+1, Tk+1, Dk+1, Lk+1 according to Theorem 20 21 Set G← [ θk+1Sk+1 Tk+1 ] , Ĝ← [ θk+1S T k+1Sk Lk+1 LTk+1 −Dk+1 ]−1 GT . inner loop once the duality gap is smaller than a required bound on φk. Checking duality gap, however, can be computationally more expensive than simply counting the number of steps. It is also not clear how often the check needs to be performed. In Algorithm 20, for instance, each coordinate descent step is so cheap that it is inefficient to perform the expensive check after each such cheap step. 120 Let us now discuss the functional forms l(k) can take. l can be as simple as a constant, for instance. That is, the subproblems are always minimized to a fixed accuracy proportional to the value of the constant, e.g., a large constant is equivalent to exact minimization of the subproblems. In general, however, we would like l to be a monotonically increasing function so that the subproblem is minimized more accurately as outer iteration moves closer to the optimality, i.e., φk is a monotonically decreasing function of k which approaches 0 as k → ∞. This function k → φk is determined by the outer convergence rate. For example, if the outer iteration convergences with k less than 10, then it means that φk has to also go to zero in 10 iterations. Given these two functions, k → l(k) and k → φk, there is a third one. Since φk and l(k) are both monotonic function of k, φk can be alternatively expressed using l. And this function, l→ φk, is predefined by the convergence rate of the inner algorithm. Knowing any two of the three functions, k → φk, k → l(k) and l→ φk, we are able to derive the third one. In other words, how fast l should be increasing will depend on the relation between l(k) and φk, and how fast φk should be approaching zero will depend on the rate of the outer iteration. Hence the actual form l takes will be the result of taking both into account. Theorem 22, which we adopt from [20], expresses it formally and precisely. Theorem 22. Let l(k) denote the number of inner iterations taken to approximately minimize QHk,xk in Algorithm 19, φk denote the minimization error as defined in (5.2.2), and K denote the number of inner iterations (of the chosen algorithm) to achieve ε-accuracy on F after t outer iteration, i.e., F (xt)−F (x∗) ≤ ε. The relations between l(k), φk and K are stated in the following two cases, with t = d max{ba, 1 c } ε + 1e and b, c > 0 as fixed constants1, 1. if φk ≤ a2/l(k)2 for some a > 0, then, l(k) = αk + β and K = βt+ α 2 (t− 1)t; 1b and c are defined in [20] and depends on the bounds of eigenvalues of Hk. 121 2. if φk ≤ δl(k)MQ, for some constants 0 < δ < 1 and MQ > 0, then, l(k) = 2 log 1 δ (k) and K = t∑ k=0 2 log 1 δ (k) ≤ 2t log 1 δ (t). Theorem 22 shows that l has to increase at least as fast as a linear function of k, if the inner algorithm converges as fast as FISTA [2], for example, and that if the rate of the inner algorithm is linear, then l can slow down to a logarithmically-growing function. In particular, we note that the objective functions in our subproblems are all σ-strongly convex, so a simple proximal gradient method, or some of its accelerated versions, will enjoy linear convergence rates when applied to these subproblems. In other words, if we replace the coordinate descent in Algorithm 20 with proximal gradient methods, it is sufficient for l(x) to increase logarithmically because of the linear rate of proximal gradient on the strongly convex subproblems. The next question is what would l(x) be if we keep Randomized Coordinate Descent (RCD) as it is in Algorithm 20? The major distinction is that now the relation between φk and l(k) does not always hold as required by Theorem 22, due to the non-deterministic nature of RCD. We could repeatedly apply RCD until the relation is verified, but the verification tends to be computationally cumbersome and thus defeat the purpose of using l(k) in the first place. The next Theorem 23 [20] we present extends Theorem 22 to the above probabilistic case. It allows the relation between φk and l(k) to hold with certain probability. And the price we pay is a constant factor depending on the magnitude of that probability. Finally, we conclude that at k-th iteration O(log(k)) passes of coordinates are desired in the inner loop of Algorithm 20. Theorem 23. Suppose that the coordinate in Step 15 of Algorithm 20 is picked randomly with probability 1|Ik| . Let p denote the probability that φk ≤ MQe − l(k)|Ik|(1+μ(Hk)) holds at any iteration k, μ(H) denote a constant that measures conditioning of H along the coordinate directions and MQ denote an upper bound on Q Hk,x k (0)−Q∗ Hk,x k p . Then at any iteration k the number of coordinate descent steps we need is given by, l(k) = O(|Ik|(1 + μ(Hk)) log(kp/MQ)). 122 And ε-accuracy is achieved in expectation, i.e., E[F (xk)−F (x∗)] ≤ ε, after O(1 ε log(1 ε )) coordinate descent steps. 5.7 Numerical experiments We report results on training sparse logistic regression (5.7.1) with various datasets listed in Table 5.2. min w∈Rp F (w) = λ‖w‖1 + 1 N N∑ n=1 log(1 + exp(−yn * wTxn)), (5.7.1) Data p N #non-zeros Description a9a 123 32, 561 451, 592 'census Income' dataset. epsilon 2000 100, 000 200, 000, 000 PASCAL challenge 2008. gisette 5000 6, 000 29, 729, 997 handwritten digit recognition. mnist8m 784 1, 000, 000 199, 401, 957 handwritten digit recognition. slices 385 53, 500 20, 597, 500 CT slices location prediction. Table 5.2: Data sets used in the experiments. 5.7.1 Active-set methods In this experiment we compare the strategy used in Algorithm 20, denoted as GRDYAD, with six other active-set strategies. The standard approach STD described in Section 5.4 is also included in the comparison. The rest are variants derived from GRDY-AD and STD, all trying to maintain a smaller free set than STD so that the 123 subproblems can be minimized more efficiently. A smaller free set leads to cheap iteration, but also often suffers from slow convergence. An extreme case would be coordinate descent where the free set is a singleton, which takes lots of iteration/coordinate updates to converge. Hence, many of the strategies here are greedy-based, updating only those variables with greatest optimality violations, in the hope that even though other variables are fixed there are still enough "useful" information in the subproblem for determining a good descent direction. The next thing to consider is how many variables to be included in the free set. It is required that the free set converge to the non-zero subspace in the optimal solution, hence a strategy fixing the size of the free set would not work, such as coordinate descent where that size is fixed to 1. Strategies included in the experiments all come with different adaptive selection criteria. Finally, we also introduce "inexactness" into the strategies, i.e., an "inexact" version of Z1(x),Z2(x) and Z3(x). Together they make up a subset of (5.4.3) containing those indices i such that |Pi(x)| is above a small threshold ε > 0, Zε1(x) := {i | xi 6= 0 and |∇iL(x)| − λ > ε} Zε2(x) := {i | xi 6= 0 and λ− |∇iL(x)| > ε} Zε3(x) := {i | xi = 0 and |∇iL(x)| − λ > ε} Similar to (5.4.3), it can be shown by definition that the following relation holds Zε1(x) ∪ Zε2(x) ∪ Zε3(x) = {i | |Pi(x)| ≥ ε}. We describe each strategy below. Note that a set of indices/variables that are selected means that they are to be updated in the inner loop, i.e., those included in the free set I, and others that are not selected are fixed through the inner loop. • GRDY: a greedy strategy that selects Z1(x)∪Z2(x)∪Z3(x) and sort the indices based on their corresponding value |Pi(x)|, i.e., ||∇iL(x)| − λ|. • STD: selects Z1(x) ∪ Z2(x) ∪ Z3(x) • GRDY-CTZ: a greedy strategy that selects Z1(x) ∪ Z2(x) ∪ Z3(x), sorts the indices and pick the first Z1(x) ∪ Z2(x) that have the largest value in P . 124 • GRDY-EP : a greedy strategy that selects and sorts Z1(x) ∪ Z2(x) ∪ Zε3(x) for some small ε > 0 (we choose ε = 0.01 in this experiment). • GRDY-AD: the strategy used in Algorithm 20 – it selects indices from both Z2(x) and a fraction of Z1(x) ∪ Z3(x). I.e.,, if |Z1(x) ∪ Z3(x)| is close to that of |Z2(x)|, then it selects all indices from Z1(x) ∪ Z2(x) ∪ Z3(x); otherwise, it sorts Z1(x) ∪ Z3(x) and pick the first |Z2(x)| and join it with Z2(x). • STD-EPA: selects Z1(x)∪Z2(x)∪Zε3(x) with aggressively chosen ε (0.5 is used here). • STD-EP: selects Z1(x) ∪ Z2(x) ∪ Zε3(x) with ε equal to 0.01. All the strategies can be implemented very efficiently. Consider GRDY-AD for example. We maintain a vector of size n and initialize two counters α and β with 0. The free set is then computed as follows. For each coordinate i, if condition |∇iL(x)| > λ is true, then we increase the counter α by one and insert i into the end of vector and make sure that the last α entries remain sorted; if the condition is false, then we check xi – if it is non-zero, then we put i at the front of vector and increase the counter β by one. After going through all the indices, the last α entries of the vector are those indices with |∇iL(x)| > λ, and the first β are those with both |∇iL(x)| ≤ λ and xi 6= 0. That is, α = |Z1(x) ∪ Z3(x)| and β = |Z2(x)|, according to definitions in (5.4.1). We compare the number α and β. If α β, then we copy the last β entries of array, put them behind the first β elements and set the size of I to 2β; if α ≈ β, then we copy all α entries to the front and set size of active set to α+β. Note that since we maintain the last α entries to be sorted in descending order, this is a greedy method that always includes those indices from Z1(x)∪Z3(x) that have the largest dual constraints violations. The main results are presented in Table 5.3. We first point out the two strategies that do not work well – STD-EPA and STD-EP. The former does not finish within 125 Figure 5.1: Illustrate the dynamics of the four index sets (5.4.1) as x → x∗. Note how |Z1(x)| and |Z2(x)| start from the top and the bottom, respectively, and come together in the end, each containing (roughly) half the size of non-zero elements in x∗. Also note that |Z3(x)| vanishes at optimality. 0 100 200 300 400 500 600 700 0 100 200 300 400 500 600 700 800 Iters N um be r o f C oo rd in at es mnist_workset STD GRDY_AD TOTAL SIZE 800000 822500 845000 867500 890000 GRDY_AD STD 886684 805835 Column, bar, and pie charts compare values in a single category, such as the number of products sold by each salesperson. Pie charts show each category's value as a percentage of the whole. 0 50 100 150 200 250 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Iters N um be r o f C oo rd in at es gisette_workset STD GRDY_AD TOTAL SIZE 0 110,000 220,000 330,000 440,000 STD GRDY_AD 305,733 425,779 Pie Chart STD GRDY_AD Column, bar, and pie charts compare values in a single category, such as the number of products sold by each salesperson. Pie charts show each category's value as a percentage of the whole. Figure 5.2: Evolution of free sets in STD and GRDY-AD. Left: dataset mnist (p = 784, N = 1, 000, 000). Right: dataset gisette (p = 5000, N = 6000). The line plots are the size of free set against iterations and the bars denote the total sum of free set sizes over all iterations the given computing budget (1500 iterations) in four out of five experiments, and the latter fails once and performs poorly in the other four experiments. They both selects Z1(x) ∪ Z2(x) ∪ Zε3(x) but differ in the actual value of ε (0.01 v.s. 0.5). Here Zε3(x) leaves out those indices with small Pi(x) so that the corresponding xi remains 126 Table 5.3: Compare the performance of different active-set strategies. A9A SLICE gisette EPSILON MNIST time (s)iterstime (s)iterstime (s)iterstime (s) iterstime (s)iters GRDY 2.3663 360 9.9324 450 13.2527 420 29.1595 330 64.0209 810 STD 2.8944 420 10.5150 480 14.8445 480 28.8553 330 64.8016 840 GRDY-CTZ2.6271 390 10.0439 450 12.3715 420 29.4759 330 59.9645 780 GRDY-EP 2.3768 360 9.9739 450 13.2448 420 29.2295 330 63.2239 810 GRDY-AD 2.0126 330 8.4519 390 12.1911 390 29.4915 330 59.0973 780 STD-EPA 59.9110 – 32.5708 – 51.4633 – 156.2294– 62.1497 810 STD-EP 2.7650 420 10.9344 480 12.9109 420 792.5668– 63.0899 810 zero and untouched although they violates the optimality conditions. It is, however, worth pointing out that GRDY-EP using the similar strategy but in a greedy manner performs quite well and even beats STD in most of the experiments. In fact, all four greedy-based strategies in general outperform STD. The clear winner of all is GRDY-AD, which outperforms others in four out of five experiments and by measure of both time and number of iterations. Compared to other methods, GRDY-AD not only generates smaller free set but also requires fewer number of iterations to reach the same accuracy. In other words, GRDY-AD iterates fast because of smaller subproblems to minimize, without suffering from slow convergence! Figure 5.2 plots the size of the free set at each iteration and it compares GRDY-AD against STD. For STD, the size of I starts high at the top in the beginning, and slide gradually as the algorithm progresses. As a comparison, that of GRDY-AD takes another path – starts from the middle, and most of the time runs below the blue line while oscillating up and down. In total GRDY-AD requires less number of coordinate updates than STD does, as shown by the bar plots in Figure 5.2. The observations that motivate GRDY-AD is illustrated in Figure 5.1. The plot shows the evolution patterns of the four index sets introduced in Section 5.4. It can be seen from the plot that |Z1(x)| and |Z2(x)| are converging to each other, each containing roughly half the entries from I∗ in the end. But at the beginning 127 Figure 5.3: Study different inner loop termination criteria and its effect on algorithm performance. Time (top) and iterations (bottom) comparison for different l(k): log, linear, constant 5 and constant 200. they are quite different such that the initial |Z1(x)| is almost the size of the problem while |Z2(x)| is close to zero. Also included in the plots are the number of zero entries |Z3(x)∪Z4(x)| and |Z3|, the number of zero entries that violate the optimality conditions. It can be seen that |Z3(x) ∪ Z4(x)| is moving upwards in general while |Z3| oscillates, quite a bit in the beginning, but dies down in the end. 5.7.2 Inner loop termination and convergence In the first experiment (see Figure 5.3) we study different termination functions l(k), e.g., log, linear, constant etc., and its impact on overall running time and number of outer iterations required to converge. We will see that a faster l(k) decreases the total number of outer iterations, but only to a certain point. After that the number of outer iterations will stay roughly the same. In other words, solving the subproblems more accurately, up to a point, does not further improve the convergence of the outer iteration. This corresponds well to the convergence theory proposed in [20]. But 128 (a) Coordinate descent (b) Proximal gradient Figure 5.4: Consider linear function l(k) = ak + b and study the effect of a on running time and iterations for different inner algorithm, i.e., coordinate descent (a) and proximal gradient (b). solving the subproblems more accurately does affect the overall running time. In fact, choosing a proper l(k) is critical to the performance of the algorithm, since a too slow or too fast l(k) either takes too many iterations, hence more function and gradient evaluations, or spends too much time on solving the subproblems without making progress on the actual objective. The experiments show that a linearly-increasing l(k) performs the best in the context of Algorithm 20. In the second experiment (see Figure 5.4(a) and Figure 5.4(b)) we study l(k) and its relations with the inner algorithm in the context of linear functions l(k) = ak+ b. We apply both coordinate descent and proximal gradient to the subproblems and 129 we find that the best a for coordinate descent is around 0.2 and that for proximal gradient is around 1.2. In other words, at any given iteration k we need over six times more proximal gradient iterations on the subproblem than the number of passes we need if applying coordinate descent, to achieve either the best overall running time or the optimal number of outer iterations. Results are reported in Figure 5.4. Figure 5.3 plots Algorithm 20 with four different l(k) on various datasets in Table 5.2. The value of log(F (x)−F (x ∗) F (x∗) ) against both iteration and CPU time is reported, where F (x∗) is estimated by pre-solving the problem to an accuracy of 1e−10. The four functions l(k) are log(k) + 1, 0.2k + 1, 5 and 200, where subproblems are exactly minimized with l(k) = 200 and are inexactly minimized, to a fixed accuracy, with l(k) = 5. The iteration plots show that l(k) = 200 achieves the optimal outer iteration convergence. But note that the performances of l(k) = 200 and l(k) = 0.2k + 1 are very close and are almost identical in three out of four cases. Let us compare in terms of the total number of inner iterations. Summing l(k) = 0.2k + 1 for k from 1 to 400, for instance, gives 16,041 inner iterations, while 400 times l(k) = 200 yields 80, 000. In other words, with the same outer convergence rate, l(k) = 0.2k+1 requires the number of inner iterations 5 times less than l(k) = 200 does! This also explains the difference between these two in the time plots. Let us now compare l(k) = log(k) + 1 with l(k) = 0.2k + 1. The total inner iteration in the log case is 1769 with k from 1 to 400. That is almost 10 times less than the 16,041 required by the linear l(k). But the linear function is still much more efficient, as shown in the time plots. The reason can be read from the iteration plots. It shows that the objective value converges much faster on the outer iteration in the linear case. In fact, a superlinear convergence can be observed for l(k) = 0.2k + 1, while the log case exhibits a merely linear trend, hence much more time spent on function and gradient evaluations. Finally, let us look at the results in Figure 5.4(a) and Figure 5.4(b). Here we plot the final iteration number and running time against different choices of a in 130 l(k) = ak + b. As can be observed from both plots, iteration number decreases as a increases, and running time increases after decreasing to a point first. The best value for a is where running time reaches its minimum value. For coordinate descent that is a = 0.2 and for proximal gradient it is a = 1.2. While both achieves linear rate on the strongly convex subproblems, it is known in theory that proximal gradient can be a constant factor slower than coordinate descent if the condition number of the subproblems along the coordinate direction is better than the overall condition number. Hence the experiment results show that l(k) needs to increase fast if a slow-rate algorithm is used in the inner loop. 5.7.3 Performance comparison with other algorithms The aim of this section is to provide validation for our general purpose algorithm, but not to conduct extensive comparison of various inexact proximal Newton approaches. In particular, we aim to demonstrate a) that using the exact Hessian is not necessary in these methods, b) that backtracking using prox parameter, based on sufficient decrease condition, which our theory uses, does in fact work well in practice and c) that randomized coordinate descent is at least as effective as the cyclic one, which is standardly used by other methods. LHAC, for Low rank Hessian Approximation in Active-set Coordinate descent, is a C/C++ package that implements Algorithms 1617 for solving general `1 regularization problems. We conduct experiments on two of the most well-known `1 regularized models – Sparse Inverse Covariance Selection (SICS) and Sparse Logistic Regression (SLR). The following two specialized C/C++ solvers are included in our comparisons: • QUIC: the quadratic inverse covariance algorithm for solving SICS described in [17]. • LIBLINEAR: an improved version of GLMNET for solving SLR described in [4, 5]. 131 Note that both of these packages have been shown to be the state-of-the-art solvers in their respective categories (see e.g. [4, 17, 18, 79]). Both QUIC and LIBLINEAR adopt line search to ensure function reduction. We have implemented line search in LHAC as well to see how it compares to the updating of prox parameter proposed in Algorithm 15. In all the experiments presented below use the following notation. • LHAC: Algorithm 17 with backtracking on prox parameter. • LHAC-L: Algorithm 17 with Armijo line search procedure described below in (5.7.5). Experimental Settings For all of the experiments we choose the initial point x0 = 0, and we report running time results in seconds, plotted against log-scale relative objective function decrease given by log( F (x)− F ∗ F ∗ ) (5.7.2) where F ∗ is the optimal function value. Since F ∗ is not available, we compute an approximation by setting a small optimality tolerance, specifically 10−7, in QUIC and LIBLINEAR. All the experiments are executed through the MATLAB mex interface. We also modify the source code of LIBLINEAR in both its optimization routine and mex gateway function to obtain the records of function values and the running time. We note that we simply store, in a double array, and pass the function values which the algorithm already computes, so this adds little to nothing to LIBLINEAR's computational costs. We also adds a function call of clock() at every iteration to all the tested algorithms, except QUIC, which includes a "trace" mode that returns automatically the track of function values and running time, by calling clock() iteratively. For both QUIC and LIBLINEAR we downloaded the latest versions of the publicly available source code from their official websites, compiled and built the software on the machine on which all experiments were executed, and which uses 2.4GHz quad-core Intel Core i7 processor, 16G RAM and Mac OS. 132 The optimal objective values F ∗ obtained approximately by QUIC and LIBLINEAR are later plugged in LHAC and LHAC-L to terminate the algorithm when the following condition is satisfied F (x)− F ∗ F ∗ ≤ 10−8 (5.7.3) In LHAC we chose μ = 1, β = 1/2 and ρ = 0.01 for sufficient decrease (see Algorithm 18), and for LBFGS we use m = 10. When solving the subproblems, we terminate the RCD procedure whenever the number of coordinate steps exceeds (1 + b k m c)|Ik| (5.7.4) where |Ik| denotes the number of coordinates in the current working set. Condition (5.7.4) indicates that we expect to update each coordinate in Ik only once when k < m, and that when k > m we increase the number of expected passes through Il by 1 every m iterations, i.e., after LBFGS receives a full update. The idea is not only to avoid spending too much time on the subproblem especially at the beginning of the algorithm when the Hessian approximations computed by LBFGS are often fairly coarse, but also to solve the subproblem more accurately as the iterate moves closer to the optimality. Note that in practice when |Ik| is large, the value of (5.7.4) almost always dominates k, hence it can be lower bounded by l(k) = ak + b with some reasonably large values of a and b, which, as we analyzed in Section 4.5.1, guarantees the sub linear convergence rate. We also find that (5.7.4) works quite well in practice in preventing from "over-solving" the subproblems, particularly for LBFGS type algorithms. In 5.6 and 5.8 we plot the data with respect to the number of RCD iterations. In particular Figures 5.6(a) and 5.8(a) show the number of RCD steps taken at the k-th iteration, as a function of k. Figures 5.6(b) and 5.8(b) show convergence of the objective function to its optimal value as a function of the total number of RCD steps taken so far (both values are plotted in logarithmic scale). Note that RCD steps are not the only component of the CPU time of the algorithms, since gradient computation has to be performed at least once per iteration. 133 In LHAC-L, a line search procedure is employed, as is done in QUIC and LIBLINEAR, for the convergence to follow from the framework by [80]. In particular, the Armijo rule chooses the step size αk to be the largest element from {β0, β1, β2, ...} satisfying F (xk + αkdk) ≤ F (xk) + αkσ∆k (5.7.5) where 0 < β < 1, 0 < σ < 1, and ∆k := ∇fTk dk + λ‖xk + dk‖1 − λ‖xk‖1. In all the experiments we chose β = 0.5, σ = 0.001 for LHAC-L. Sparse Inverse Covariance Selection The sparse inverse covariance selection problem is defined by min X0 F (X) = − log detX + tr(SX) + λ||X||1 (5.7.6) where the input S ∈ Rp×p is the sample covariance matrix and the optimization is over a symmetric matrix X ∈ Rp×p that is required to be positive definite. For SICS we report results on four real world data sets, denoted as ER 692, Arabidopsis, Leukemia and hereditarybc, which are preprocessed from breast cancer data and gene expression networks. We refer to [81] for detailed information about those data sets. We set the regularization parameter λ = 0.5 for all experiments as suggested in [81]. The plots presented in Figure 5.5 show that LHAC and LHAC-L is almost twice as fast as QUIC, in the two largest data sets Leukemia and hereditarybc (see Figure 5.5(c) and 5.5(d)). In the other two smaller data sets the results are less clear-cut, but all of the methods solve the problems very fast and the performance of LHAC is comparable to that of QUIC. The performances of LHAC and LHAC-L are fairly similar in all experiments. Again we should note that with the sufficient decrease condition proposed in Algorithm 15 we are able to establish the global convergence rate, which has not been shown in the case of Armijo line search. 134 0 5 10 15 20 10 −8 10 −6 10 −4 10 −2 10 0 Time (sec) O b j (r e la ti v e ) Leukemia (1255) LHAC QUIC LHAC−L (a) S : 1255× 1255 0 50 100 150 200 10 −8 10 −6 10 −4 10 −2 10 0 Time (sec) O b j (r e la ti v e ) hereditarybc (1869) LHAC QUIC LHAC−L (b) S : 1869× 1869 0 0.5 1 1.5 2 10 −8 10 −6 10 −4 10 −2 10 0 Time (sec) O b j (r e la ti v e ) ER_692 (692) LHAC QUIC LHAC−L (c) S : 692× 692 0 2 4 6 10 −8 10 −6 10 −4 10 −2 10 0 Time (sec) O b j (r e la ti v e ) Arabidopsis (834) LHAC QUIC LHAC−L (d) S : 834× 834 Figure 5.5: Convergence plots on SICS (the y-axes on log scale). 135 0 50 100 0 2 4 6 8 10 12 x 10 4 Iteration C D s te p s SICS ER_692 Arabidopsis Leukemia (a) The number of coordinate descent steps on subproblems increases as iterates move towards optimality 10 4 10 5 10 6 10 −15 10 −10 10 −5 10 0 CD Steps O b j (r e la ti v e ) SICS ER_692 Arabidopsis Leukemia (b) Both axes are in log scale. Change of objective w.r.t. the number of coordinate descent steps. Figure 5.6: RCD step count of LHAC on different SICS data sets. Data set #features p #instances N #non-zeros Description a9a 123 32561 451592 'census Income' dataset. epsilon 2000 100000 200000000 PASCAL challenge 2008. gisette 5000 6000 29729997 handwritten digit recognition. slices 385 53500 20597500 CT slices location prediction. Table 5.4: Data statistics in sparse logistic regression experiments. Sparse Logistic Regression The objective function of sparse logistic regression is given by F (w) = λ‖w‖1 + 1 N N∑ n=1 log(1 + exp(−yn * wTxn)) 136 0 2 4 6 10 −8 10 −6 10 −4 10 −2 10 0 10 2 Time (sec) O b j (r e la ti v e ) a9a LHAC LIBLINEAR LHAC−L (a) a9a (p = 123, N = 32561) 0 50 100 150 10 −8 10 −6 10 −4 10 −2 10 0 10 2 Time (sec) O b j (r e la ti v e ) slice LHAC LIBLINEAR LHAC−L (b) slices (p = 385, N = 53500) 0 5 10 15 10 −8 10 −6 10 −4 10 −2 10 0 10 2 Time (sec) O b j (r e la ti v e ) gisette_scale LHAC LIBLINEAR LHAC−L (c) gisette (p = 5000, N = 6000) 0 20 40 60 80 10 −8 10 −6 10 −4 10 −2 10 0 10 2 Time (sec) O b j (r e la ti v e ) epsilon_normalized LHAC LIBLINEAR LHAC−L (d) epsilon (p = 2000, N = 100000) Figure 5.7: Convergence plots on SLR (the y-axes on log scale). where L(w) = 1 N ∑N n=1 log(1 + exp(−yn * wTxn)) is the average logistic loss function and {(xn, yn)}Nn=1 ∈ (Rp×{−1, 1}) is the training set. The number of instances in the training set and the number of features are denoted by N and p respectively. Note that the evaluation of F requires O(pN) flops and to compute the Hessian requires 137 0 200 400 600 0 2000 4000 6000 8000 10000 12000 14000 16000 Iteration C D s te p s SLR a9a slice gisette (a) The number of coordinate descent steps on subproblems increases as iterates move towards optimality 10 2 10 4 10 6 10 −10 10 −5 10 0 10 5 CD Steps O b j (r e la ti v e ) SLR a9a slice gisette (b) Both axes are in log scale. Change of objective w.r.t. the number of coordinate descent steps. Figure 5.8: RCD step count of LHAC on different SLR data sets. O(Np2) flops. Hence, we chose such training sets for our experiment with N and p large enough to test the scalability of the algorithms and yet small enough to be completed on a workstation. We report results of SLR on four data sets downloaded from UCI Machine Learning repository [82], whose statistics are summarized in Table 5.4. In particular, the first data set is the well-known UCI Adult benchmark set a9a used for income classification, determining whether a person makes over $50K/yr or not, based on census data; the second one we use in the experiments is called epsilon, an artificial data set for PASCAL large scale learning challenge in 2008; the third one, slices, contains features extracted from CT images and is often used for predicting the relative location of CT slices on the human body; and finally we consider gisette, a handwritten digit recognition problem from NIPS 2003 feature selection challenge, with the feature set of size 5000 constructed in order to discriminate between two confusable handwritten digits: the four and the nine. 138 The results are shown in Figure 5.7. In most cases LHAC and LHAC-L outperform LIBLINEAR. On data set slice, LIBLINEAR experiences difficulty in convergence which results in LHAC being faster by an order of magnitude. On the largest data set epsilon, LHAC and LHAC-L is faster than LIBLINEAR by about one third and reaches the same precision. Finally we note that the memory usage of LIBLINEAR is more than doubled compared to that of LHAC and LHAC-L, as we observed in all the experiments and is particularly notable on the largest data set epsilon. 139 Chapter 6 Conclusions Machine learning has attracted lots of interests in recent years due to both advances in the active research community and wide applications through the industry. While much progress has been made, designing scalable and efficient algorithms for complex and large-scale learning, with theoretically sound guarantee, is still a challenging problem. Optimization thus serves a critical role to the success of machine learning in the real-world application, as the general trend moving towards big data applications, e.g., in distributed system, and complex learning objectives, e.g., non-smoothness in structured learning, deep neural network, etc. This work contributes to the latest research on the analysis of global convergence rate of inexact proximal quasi-Newton framework that is applicable to a broad class of machine learning and signal processing problems. The work shows that randomized coordinate descent, among other subproblem methods, can be used effectively to find inexact quasi-Newton directions, which guarantee sub linear convergence rate of the algorithm, in expectation. This is the first global convergence rate result for an algorithm that uses coordinate descent to inexactly optimize subproblems at each iteration. This globally convergent framework enables, as proposed in the work, a fast, memoryand communication-efficient algorithm that directly addresses the big data cases when both N (samples) and n (features) are large and that combines 140 firstand second-order information within one single optimization algorithm for fast convergence in practice. The framework does not rely on or exploit the accuracy of second order information, and hence we do not obtain fast local convergence rates. We also do not assume strong convexity of our objective function, hence a sublinear conference rate is the best global rate we can hope to obtain. In [30] an accelerated scheme related to our framework is studied and an optimal sublinear convergence rate is shown, but the assumptions on the Hessian approximations are a lot stronger in [30] than in this work and appear to be impractical, hence the accelerated method is not as widely applicable. The framework studied in this work covers several existing efficient algorithms for large scale sparse optimization. However, to provide convergence rates we had to depart from some standard techniques, such as line-search, replacing it instead by a prox-parameter updating mechanism with a trust-region-like sufficient decrease condition for acceptance of iterates. We also use randomized coordinate descent instead of a cyclic one. We demonstrated that this modified framework is, nevertheless, very effective in practice and is competitive with state-of-the-art specialized methods. 141 Bibliography [1] Nesterov, Y. Gradient methods for minimizing composite objective function. CORE report (2007). [2] Beck, A. & Teboulle, M. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences 2, 183–202 (2009). [3] Wright, S. J., Nowak, R. D. & Figueiredo, M. A. T. Sparse reconstruction by separable approximation. Trans. Sig. Proc. 57, 2479–2493 (2009). [4] Yuan, G.-X., Ho, C.-H. & Lin, C.-J. An improved GLMNET for l1-regularized logistic regression and support vector machines. National Taiwan University (2011). [5] Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22 (2010). [6] Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics Oxford England 9, 432–41 (2008). [7] Scheinberg, K. & Rish, I. SINCO a greedy coordinate ascent method for sparse inverse covariance selection problem. tech. rep. (2009). [8] Scheinberg, K., Ma, S. & Goldfarb, D. Sparse inverse covariance selection via alternating linearization methods. NIPS (2010). 142 [9] Shalev-Shwartz, S. & Tewari, A. Stochastic methods for l1 regularized loss minimization. ICML 929–936 (2009). [10] Koh, K., Kim, S. & Boyd, S. An interior-point method for large-scale l1regularized logistic regression. Journal of Machine learning research 8, 1519– 1555 (2007). [11] Tang, X., Qin, Z., Akrotirianakis, I. & Chakraborty, A. Hiclass an efficient hierarchical classification system. Tech. Rep. (2012). [12] Tang, X., Akrotirianakis, I. & Chakraborty, A. Distributed learning algorithm for short-term load forecasting. Tech. Rep. (2012). [13] Qin, Z., Tang, X., Akrotirianakis, I. & Chakraborty, A. Hipad a hybrid interior-point alternating direction algorithm for knowledge-based svm and feature selection. In Pardalos, P. M., Resende, M. G., Vogiatzis, C. & Walteros, J. L. (eds.) Learning and Intelligent Optimization, Lecture Notes in Computer Science, 324–340 (Springer International Publishing, 2014). URL http://dx.doi.org/10.1007/978-3-319-09584-4_28. [14] Yuan, M. & Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67 (2006). [15] Vapnik, V. N. & Vapnik, V. Statistical learning theory, vol. 1 (Wiley New York, 1998). [16] Conn, A. R., Scheinberg, K. & Vicente, L. N. Introduction to derivative-free optimization, vol. 8 (Siam, 2009). [17] Hsieh, C.-J., Sustik, M., Dhilon, I. & Ravikumar, P. Sparse inverse covariance matrix estimation using quadratic approximation. NIPS (2011). 143 [18] Olsen, P. A., Oztoprak, F., Nocedal, J. & Rennie, S. J. Newton-Like Methods for Sparse Inverse Covariance Estimation (2012). [19] Byrd, R., Chin, G., Nocedal, J. & Oztoprak, F. A family of second-order methods for convex l1-regularized optimization. tech. rep. (2012). [20] Scheinberg, K. & Tang, X. Practical inexact proximal quasi-newton method with global complexity analysis. arXiv preprint arXiv:1311.6547 (2013). URL http://arxiv.org/abs/1311.6547. arXiv:1108.0775v2. [21] Nocedal, J. & Wright, S. J. Numerical Optimization. Springer Series in Operations Research (Springer, New York, NY, USA, 2006), 2nd edn. [22] Schmidt, M. W., Berg, E., Friedlander, M. P. & Murphy, K. P. Optimizing costly functions with simple constraints: A limited-memory projected quasi-newton algorithm. In International Conference on Artificial Intelligence and Statistics, None (2009). [23] Schmidt, M., Roux, N. L. & Bach, F. Supplementary material for the paper convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS (2011). [24] Nesterov, Y. E. & Polyak, B. T. Cubic regularization of Newton method and its global performance. mathprog 108, 177205 (2006). [25] Cartis, C., M., G. N. I. & L., T. P. Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optimization Methods and Software 27, 197–219 (2012). [26] Nesterov, Y. E. Introductory lectures on convex optimization 87, xviii+236 (2004). A basic course. [27] Byrd, R., Nocedal, J. & Oztoprak, F. An inexact successive quadratic approximation method for convex l-1 regularized optimization. Tech. Rep. (2013). 144 [28] Lee, J. D., Sun, Y. & Saunders, M. A. Proximal newton-type methods for minimizing convex objective functions in composite form. In SIAM Journal on Optimization (2014). [29] Becker, S. & Fadili, J. A quasi-newton proximal splitting method. In Pereira, F., Burges, C., Bottou, L. & Weinberger, K. (eds.) Advances in Neural Information Processing Systems 25, 2618–2626 (Curran Associates, Inc., 2012). URL http://papers.nips.cc/paper/ 4523-a-quasi-newton-proximal-splitting-method.pdf. [30] Jiang, K., Sun, D. & Toh, K.-C. An inexact accelerated proximal gradient method for large scale linearly constrained convex SDP. SIAM Journal on Optimization 3, 10421064 (2012). [31] Richtárik, P. & Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming (2012). URL http://link.springer.com/10.1007/s10107-012-0614-z. [32] Yuan, G. & Ho, C. An improved GLMNET for l1-regularized logistic regression and support vector machines. National Taiwan University (2011). [33] Tseng, P. & Yun, S. A coordinate gradient descent method for nonsmooth separable minimization,. Mathematical Programming 117, 387–423 (2009). [34] Dumais, S. & Chen, H. Hierarchical classification of web content. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 256–263 (ACM, 2000). [35] Cai, L. & Hofmann, T. Hierarchical document categorization with support vector machines. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, 78–87 (ACM, 2004). 145 [36] McCallum, A., Rosenfeld, R., Mitchell, T. & Ng, A. Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the Fifteenth International Conference on Machine Learning, 359–367 (Citeseer, 1998). [37] Zimek, A., Buchwald, F., Frank, E. & Kramer, S. A study of hierarchical and flat classification of proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7, 563–571 (2010). [38] Crammer, K. & Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research 2, 265–292 (2002). [39] Dekel, O., Keshet, J. & Singer, Y. Large margin hierarchical classification. In Proceedings of the twenty-first international conference on Machine learning, 27 (ACM, 2004). [40] Zhou, D., Xiao, L. & Wu, M. Hierarchical classification via orthogonal transfer (2010). [41] Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. Hierarchical classification: combining bayes with svm. In Proceedings of the 23rd international conference on Machine learning, 177–184 (ACM, 2006). [42] Barutcuoglu, Z., Schapire, R. & Troyanskaya, O. Hierarchical multi-label prediction of gene function. Bioinformatics 22, 830 (2006). [43] Lewis, D. D., Yang, Y., Rose, T. & Li, F. Rcv1: A new benchmark collection for text categorization research. JMLR 5, 361–397 (2004). [44] Hahn, H., Meyer-Nieberg, S. & Pickl, S. Electric load forecasting methods: Tools for decision making. European Journal Of Operational Research 199, 902–907 (2009). URL http://linkinghub.elsevier.com/retrieve/pii/ S0377221709002094. 146 [45] Fan, S. & Chen, L. Short-Term Load Forecasting Based on an Adaptive Hybrid Method. IEEE Transactions on Power Systems 21, 392–401 (2006). URL http: //ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1583738. [46] Hor, C. L., Watson, S. J. & Majithia, S. Analyzing the Impact of Weather Variables on Monthly Electricity Demand. IEEE Transactions on Power Systems 20, 2078–2085 (2005). URL http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=1525139. [47] Espinoza, M., Joye, C., Belmans, R. & DeMoor, B. Short-Term Load Forecasting, Profile Identification, and Customer Segmentation: A Methodology Based on Periodic Time Series. IEEE Transactions on Power Systems 20, 1622– 1630 (2005). URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper. htm?arnumber=1490617. [48] Taylor, J. W., De Menezes, L. M. & McSharry, P. E. A comparison of univariate methods for forecasting electricity demand up to a day ahead. International Journal of Forecasting 22, 1–16 (2006). URL http://eprints.maths.ox.ac. uk/281/. [49] González-Romera, E., Jaramillo-Morán, M. A. & Carmona-Fernández, D. Monthly Electric Energy Demand Forecasting Based on Trend Extraction. IEEE Transactions on Power Systems 21, 1946–1953 (2006). [50] Bruhns, A., Deurveilher, G. & Roy, J.-s. A Non-Linear Regression Model for Mid-Term Load Forecasting and Improvements in Seasonality. In 15th Power Systems Computation Conference, August, 22–26 (Citeseer, 2005). URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.124. 891&amp;rep=rep1&amp;type=pdf. [51] Ferreira, V. & Alves da Silva, A. Toward estimating autonomous neural networkbased electric load forecasters. Power Systems, IEEE Transactions on 22, 1554 –1562 (2007). 147 [52] Elattar, E. E., Goulermas, J. & Wu, Q. H. Electric Load Forecasting Based on Locally Weighted Support Vector Regression (2010). URL http://ieeexplore. ieee.org/xpl/freeabs_all.jsp?arnumber=5406166. [53] Chen, B.-j., Chang, M.-w. & Lin, C.-j. Load Forecasting Using Support Vector Machines : A Study on EUNITE Competition 2001. Power 19, 1821–1830 (2004). URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber= 1350819. [54] Bodyanskiy, Y., Popov, S. & Rybalchenko, T. Multilayer Neuro-fuzzy Network for Short Term Electric Load Forecasting. In Hirsch, E. A., Razborov, A. A., Semenov, A. & Slissenko, A. (eds.) Third International Computer Science Symposium CSR2008, 339–348 (Springer, 2008). URL http://www.springerlink. com/content/y46125v1651tt212/. [55] Lin, C.-J. L. C.-J., Chen, C.-H. C. C.-H. & Lin, C.-T. L. C.-T. A Hybrid of Cooperative Particle Swarm Optimization and Cultural Algorithm for Neural Fuzzy Networks and Its Prediction Applications (2009). URL http: //ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4717248. [56] Vapnik, V. N. The Nature of Statistical Learning Theory, vol. 8 of Statistics for Engineering and Information Science (Springer, 1995). URL http://portal. acm.org/citation.cfm?id=211359. [57] Kohonen, T. The self-organizing map. Proceedings of the IEEE 78, 1464–1480 (1990). URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber= 58325. [58] Takens, F. Detecting strange attractors in turbulence. Dynamical systems and turbulence Warwick 1980 898, 366–381 (1981). URL http://www. springerlink.com/index/B254X77553874745.pdf. 148 [59] Ding, C., He, X. & Simon, H. D. On the equivalence of nonnegative matrix factorization and spectral clustering. Proc SIAM Data Mining Conf 44, 606610 (2005). URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1. 1.76.9476&rep=rep1&type=pdf. [60] Shalev-shwartz, S., Singer, Y., Srebro, N. & Cotter, A. Pegasos: Primal estimated sub-gradient solver for svm. Corpus 127, 807–814 (2007). URL http://en.scientificcommons.org/56847611. [61] Sun, L., Ji, S. & Ye, J. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 194–200 (2011). [62] Chen, Y. & Lin, H. Feature-aware Label Space Dimension Reduction for Multi-label Classification. Advances in Neural Information Processing Systems 1538–1546 (2012). URL http://machinelearning.wustl.edu/mlpapers/ paper_files/NIPS2012_0728.pdf. [63] Hsu, D., Kakade, S., Langford, J. & Zhang, T. Multi-Label Prediction via Compressed Sensing. NIPS 1–14 (2009). URL http://arxiv.org/abs/0902.1284http://papers.nips.cc/paper/ 3824-multi-label-prediction-via-compressed-sensing-supplemental. zip. 0902.1284. [64] Zhang, Y. & Schneider, J. Maximum margin output coding. Proceedings of the 29th International Conference on Machine Learning 1575–1582 (2012). URL http://arxiv.org/abs/1206.6478. [65] Bi, W. & Kwok, J. Efficient Multi-label Classification with Many Labels. . . . of the 30th International Conference on . . . 28 (2013). URL http://jmlr.org/proceedings/papers/v28/bi13.pdfhttp: //machinelearning.wustl.edu/mlpapers/papers/icml2013_bi13. 149 [66] Gu, H., Gartrell, M., Zhang, L., Lv, Q. & Grunwald, D. Anchormf: Towards effective event context identification. In Proceedings of the 22Nd ACM International Conference on Conference on Information &#38; Knowledge Management, CIKM '13, 629–638 (ACM, New York, NY, USA, 2013). URL http://doi.acm.org/10.1145/2505515.2505548. [67] Gu, H., Hang, H., Lv, Q. & Grunwald, D. Fusing text and frienships for location inference in online social networks. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on, vol. 1, 158–165 (IEEE, 2012). [68] Cesa-Bianchi, N., Gentile, C. & Zaniboni, L. Incremental algorithms for hierarchical classification. The Journal of Machine Learning Research 7, 31–54 (2006). [69] Bi, W. & Kwok, J. Mandatory leaf node prediction in hierarchical multilabel classification. Advances in Neural Information Processing Systems 10–12 (2012). [70] Jenatton, R., Mairal, J., Obozinski, G., Bach, F. & Fr, I. Proximal Methods for Sparse Hierarchical Dictionary Learning. Proceedings of the 27th International Conference on Machine Learning (ICML-10) 487–494 (2010). URL http://www. icml2010.org/papers/416.pdf. arXiv:0909.0844. [71] Zhao, P., Rocha, G. & Yu, B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics 3468–3497 (2009). [72] Kim, S. & Xing, E. Tree-guided group lasso for multi-task regression with structured sparsity. Proceedings of the 27th International . . . (2010). URL http://arxiv.org/abs/0909.1373http://machinelearning. wustl.edu/mlpapers/paper_files/icml2010_KimX10.pdf. 0909.1373. 150 [73] Conn, A. R., Scheinberg, K. & Toint, P. L. On the convergence of derivative-free methods for unconstrained optimization. Approximation theory and optimization: tributes to MJD Powell 83–108 (1997). [74] Moré, J. J. & Wild, S. M. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization 20, 172–191 (2009). [75] Schmidt, M., Kim, D. & Sra, S. Projected newton-type methods in machine learning. Optimization for Machine Learning 305 (2012). [76] Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B. 58, 267–288 (1996). [77] Liu, D. C. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, Series B 45, 503–528 (1989). [78] Donoho, D. De-noising by soft-thresholding. Information Theory, IEEE Transactions on 41, 613 –627 (1995). [79] Yuan, G.-X., Chang, K.-W., Hsieh, C.-J. & Lin, C.-J. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR 11, 3183–3234 (2010). [80] Tseng, P. & Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming 117, 387–423 (2009). [81] Li, L. & Toh, K.-C. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming 2, 291–315 (2010). [82] Bache, K. & Lichman, M. UCI machine learning repository (2013). URL http: //archive.ics.uci.edu/ml. 151 Biography Xiaocheng Tang obtained his PhD in the area of optimization from Department of Industrial and Systems Engineering (ISE) at Lehigh University. During his study at Lehigh, he has been actively engaged in analyzing and designing novel optimization solutions for large-scale problems arising from various applications including Machine Learning and Data Mining. His work in proximal quasi-newton type methods was presented at the NIPS Optimization workshop and was submitted to Mathematical Programming A. Apart from conducting research at school, he was also engaged in solving real-world problems with collaborations from various industrial research institutes including Siemens Research, IBM Research and Yahoo Labs, and his work there has turned into several papers and patents. Prior to study at Lehigh, he spent 4 years in a beautiful coastal city Hangzhou in China, where he earned his B.S. in Control Science and Engineering from Chu KoChen Honors College in Zhejiang University. Xiaocheng was born and raised in Guiyang, which is a lovely southern city in China blessed by its pleasant weather, unique landscape called Karst topography, and the most famous/best wine in the country.