Abstract

Adaptive dynamic programming (ADP), which belongs to the field of computational intelligence, is a powerful tool for addressing optimal control problems. To overcome the bottleneck of solving Hamilton–Jacobi–Bellman equations, several state-of-the-art ADP approaches are reviewed in this paper. First, two model-based offline iterative ADP methods, policy iteration (PI) and value iteration (VI), are presented, and their respective advantages and shortcomings are discussed in detail. Second, the multistep heuristic dynamic programming (HDP) method is introduced, which avoids the requirement of an initial admissible control and achieves fast convergence; this method retains the advantages of PI and VI while overcoming their drawbacks. Finally, the discrete-time optimal control strategy is tested on a power system.

1. Introduction

Adaptive dynamic programming (ADP) [14], which integrates the advantages of reinforcement learning (RL) [58] and adaptive control, has become a powerful tool for solving optimal control problems. Over decades of development, ADP has also provided approaches for many other control problems, such as robust control [9, 10], optimal control with input constraints [11, 12], optimal tracking control [13, 14], zero-sum games [15], and non-zero-sum games [16]. Furthermore, ADP methods have been widely applied to real-world systems, such as the water-gas shift reaction [17], battery management [18], microgrid systems [19, 20], and the Quanser helicopter [21]. All of the aforementioned works were inspired by and developed from the basic results on ADP-based optimal control; that is, optimal control is the core research topic of ADP.

The bottleneck in solving nonlinear optimal control problems is obtaining the solutions of the Hamilton–Jacobi–Bellman (HJB) equations. However, these equations are generally difficult or even impossible to solve analytically. To overcome this difficulty, ADP has produced several important iterative learning frameworks, such as policy iteration (PI) [2, 22, 23] and value iteration (VI) [24–26]. The PI algorithm starts from an initial admissible control policy and then performs the policy evaluation step and the policy improvement step successively until convergence. The main advantage of PI is that it ensures all the iterative control policies are admissible and it achieves fast convergence. The drawback of PI is also obvious: the requirement of an initial admissible control is a strict condition in practice, which seriously limits its applications. Different from PI, VI can start from an arbitrary positive semidefinite value function, which is an easy-to-realize initial condition. Although the easier initial condition makes VI more practical, it also leads to a longer iterative learning process; that is, VI converges much more slowly than PI. Thus, it is desirable to develop a new method that avoids the requirement of an initial admissible control and converges faster than the VI algorithm. To realize these purposes, the multistep heuristic dynamic programming (HDP) approach [27] is presented to integrate the merits of the PI and VI algorithms and overcome their drawbacks at the same time.

This paper reviews the state-of-the-art ADP algorithms for the optimal control of discrete-time (DT) systems. The rest of this paper is arranged as follows. In Section 2, the problem formulation is derived. Three iterative model-based offline learning algorithms along with comprehensive comparisons are presented in Sections 3 and 4. The proposed DT optimal control strategy is tested on a power system in Section 5. Finally, a brief conclusion is drawn in Section 6.

2. Problem Formulation

In this paper, we consider the general nonlinear DT system

$$x_{k+1} = f(x_k) + g(x_k) u_k, \quad k = 0, 1, 2, \ldots, \tag{1}$$

where $x_k \in \mathbb{R}^n$ represents the system state, $u_k \in \mathbb{R}^m$ denotes the control input, and $f(\cdot)$ and $g(\cdot)$ are the system functions.

The purpose of the optimal control problem is to find a state feedback control policy $u(x_k)$ that can not only stabilize system (1) but also minimize the following performance index function:

$$J(x_0) = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{2}$$

where $U(x_k, u_k) = x_k^{\top} Q x_k + u_k^{\top} R u_k$ is the utility function. The matrices $Q$ and $R$ determine the performance weighting of the system states and control inputs, respectively. Given an admissible control policy $u(x_k)$, the value function can be described by

$$V(x_k) = \sum_{j=k}^{\infty} U\big(x_j, u(x_j)\big). \tag{3}$$

According to the definition of optimal control, the optimal value function can be defined by

$$V^*(x_k) = \min_{u(\cdot)} \sum_{j=k}^{\infty} U\big(x_j, u(x_j)\big). \tag{4}$$

By using the stationarity condition [28], the optimal control policy can be derived as

$$u^*(x_k) = -\frac{1}{2} R^{-1} g^{\top}(x_k) \nabla V^*(x_{k+1}), \tag{5}$$

where $\nabla V^*(x_{k+1}) = \partial V^*(x_{k+1}) / \partial x_{k+1}$.

The key to obtaining the optimal control policy is to solve the following DT HJB equation [27]:

$$V^*(x_k) = U\big(x_k, u^*(x_k)\big) + V^*\big(f(x_k) + g(x_k) u^*(x_k)\big). \tag{6}$$

Remark 1. Figure 1 provides the relationship and difference between discrete-time and continuous-time optimal control. Real-world systems generally exist in continuous-time form. After mathematical modeling, they are formulated as continuous-time system models. Through sampling and discretization, the continuous-time models are converted into discrete-time ones. Accordingly, the performance indexes and HJB equations of discrete-time systems are the discretized counterparts of their continuous-time versions. The key to solving the discrete-time optimal control problem is the discrete-time HJB equation (6), which is a nonlinear partial difference equation. Existing works on continuous-time systems are considerably more numerous than those on discrete-time systems. To overcome this bottleneck, several ADP learning algorithms along with their neural network (NN) implementations will be introduced.
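To make the above formulation concrete, the following minimal sketch (added here as an illustration, not taken from the paper) specializes the DT HJB equation (6) to a linear system with quadratic utility, where the optimal value function is quadratic and (6) reduces to the discrete-time algebraic Riccati equation. The system matrices and the test state are assumed values.

```python
# Minimal illustration (assumed linear system): for x_{k+1} = A x_k + B u_k and
# U(x, u) = x'Qx + u'Ru, V*(x) = x'Px with P solving the discrete algebraic Riccati
# equation, and u*(x) = -Kx. The DT HJB equation (6) can then be checked numerically.
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.95, 0.1], [0.0, 0.8]])   # example system matrices (assumed)
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)               # weighting matrices of the performance index

P = solve_discrete_are(A, B, Q, R)                   # V*(x) = x' P x
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # u*(x) = -K x

x = np.array([[1.0], [-0.5]])                        # a sample state
u = -K @ x
x_next = A @ x + B @ u
lhs = (x.T @ P @ x).item()                                            # V*(x_k)
rhs = (x.T @ Q @ x + u.T @ R @ u + x_next.T @ P @ x_next).item()      # U(x_k, u*) + V*(x_{k+1})
print(lhs, rhs)   # both sides of (6) agree
```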

3. Model-Based PI Algorithm for the Optimal Control Problem of DT Systems

In this section, the model-based PI algorithm along with its NN implementation will be introduced in detail. The model-based PI algorithm [2, 23] is shown in Algorithm 1.

Step 1: (Initialization)
Let the iteration index $i = 0$.
Select an initial admissible control policy $u^{(0)}(x_k)$.
Choose a small enough computation precision $\varepsilon > 0$.
Step 2: (Policy Evaluation)
With $u^{(i)}(x_k)$, compute the iterative value function $V^{(i)}(x_k)$ by
$$V^{(i)}(x_k) = U\big(x_k, u^{(i)}(x_k)\big) + V^{(i)}\big(f(x_k) + g(x_k) u^{(i)}(x_k)\big).$$
Step 3: (Policy Improvement)
With $V^{(i)}(x_k)$, update the iterative control policy by
$$u^{(i+1)}(x_k) = -\frac{1}{2} R^{-1} g^{\top}(x_k) \nabla V^{(i)}(x_{k+1}).$$
Step 4: If $\big|V^{(i)}(x_k) - V^{(i-1)}(x_k)\big| \le \varepsilon$ (for $i \ge 1$), stop and the optimal control policy is acquired;
Else, let $i = i + 1$ and go back to Step 2.
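For intuition, here is a compact sketch of Algorithm 1 specialized to the linear-quadratic case, where policy evaluation becomes a discrete Lyapunov equation and policy improvement becomes a gain update. The system matrices, the initial admissible (here stabilizing) gain, and the precision are assumptions made for illustration.

```python
# Sketch of model-based PI (Algorithm 1) for a linear-quadratic example (assumed matrices).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.95, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

K = np.zeros((1, 2))       # Step 1: initial admissible policy u = -K x (A is already stable)
eps = 1e-8                 # computation precision
P_prev = np.zeros((2, 2))

for i in range(100):
    Ac = A - B @ K
    # Step 2 (policy evaluation): V^(i)(x) = x'Px solves Ac'P Ac - P + Q + K'RK = 0
    P = solve_discrete_lyapunov(Ac.T, Q + K.T @ R @ K)
    # Step 3 (policy improvement): u^(i+1)(x) = -K x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Step 4: stopping criterion
    if np.max(np.abs(P - P_prev)) < eps:
        break
    P_prev = P

print("PI iterations:", i, "\noptimal gain K:\n", K)
```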

The actor-critic dual-network structure with the gradient-descent updating law is employed to implement Algorithm 1. First, construct the critic NN to approximate the iterative value function:

$$\hat{V}_l^{(i)}(x_k) = \hat{\omega}_{c,l}^{\top} \phi_c(x_k), \tag{7}$$

where $\hat{\omega}_{c,l}$ and $\phi_c(\cdot)$ denote the NN weights and NN activation functions of the critic network and $l$ is the iteration index for the following gradient-descent method.

Define the error function for the critic NN:

$$e_{c,l} = V_{\mathrm{tar}}^{(i)}(x_k) - \hat{V}_l^{(i)}(x_k), \tag{8}$$

where $V_{\mathrm{tar}}^{(i)}(x_k)$ is the target value given by the policy evaluation step of Algorithm 1. If we select a large enough integer $N$, then, with the admissible control $u^{(i)}(x_k)$, one has $V^{(i)}(x_{k+N}) \approx 0$ [2]; that is, $V_{\mathrm{tar}}^{(i)}(x_k)$ can be expressed as $V_{\mathrm{tar}}^{(i)}(x_k) = \sum_{j=k}^{k+N-1} U\big(x_j, u^{(i)}(x_j)\big)$.

In order to minimize the error performance $E_{c,l} = \frac{1}{2} e_{c,l}^2$, the gradient-descent-based updating law for the critic NN is given by

$$\hat{\omega}_{c,l+1} = \hat{\omega}_{c,l} - \alpha_c \frac{\partial E_{c,l}}{\partial \hat{\omega}_{c,l}} = \hat{\omega}_{c,l} + \alpha_c e_{c,l} \phi_c(x_k), \tag{9}$$

where $\alpha_c > 0$ is the learning rate of the critic NN.

Similar to the design of the critic NN, the actor network, which is used to approximate the iterative control policy, is expressed as

$$\hat{u}_l^{(i)}(x_k) = \hat{\omega}_{a,l}^{\top} \phi_a(x_k), \tag{10}$$

where $\hat{\omega}_{a,l}$ and $\phi_a(\cdot)$ are the NN weights and activation functions of the actor network.

The error function for the actor NN is defined as

$$e_{a,l} = \hat{u}_l^{(i)}(x_k) - u^{(i+1)}(x_k), \tag{11}$$

where the target control policy $u^{(i+1)}(x_k)$ can be attained according to Algorithm 1.

To minimize the error performance $E_{a,l} = \frac{1}{2} e_{a,l}^{\top} e_{a,l}$, using the chain rule, the updating law for the actor NN is designed as

$$\hat{\omega}_{a,l+1} = \hat{\omega}_{a,l} - \alpha_a \frac{\partial E_{a,l}}{\partial \hat{\omega}_{a,l}} = \hat{\omega}_{a,l} - \alpha_a \phi_a(x_k) e_{a,l}^{\top}, \tag{12}$$

where $\alpha_a > 0$ is the learning rate of the actor NN.
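The sketch below illustrates how the gradient-descent laws (9) and (12) behave with linear-in-parameter approximators. The feature maps, target weights, and learning rates are assumptions chosen for illustration; in the actual algorithm the targets come from the policy evaluation and policy improvement steps.

```python
# Gradient-descent critic/actor updates with linear-in-parameter NNs (assumed features).
import numpy as np

def phi_c(x):                                 # critic activation functions (assumed quadratic)
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

def phi_a(x):                                 # actor activation functions (assumed linear)
    return np.array(x)

rng = np.random.default_rng(0)
w_c = rng.normal(scale=0.1, size=3)           # critic weights
w_a = rng.normal(scale=0.1, size=2)           # actor weights
alpha_c, alpha_a = 0.05, 0.05                 # learning rates

# Stand-ins for the targets produced by Algorithm 1 (assumed exactly representable).
w_c_star = np.array([2.0, 0.4, 1.5])
w_a_star = np.array([-0.6, -0.3])

for l in range(5000):                         # gradient-descent iteration index l
    x = rng.uniform(-1, 1, size=2)            # sampled training state
    e_c = w_c_star @ phi_c(x) - w_c @ phi_c(x)        # critic error (8)
    w_c = w_c + alpha_c * e_c * phi_c(x)              # critic update (9)
    e_a = w_a @ phi_a(x) - w_a_star @ phi_a(x)        # actor error (11)
    w_a = w_a - alpha_a * e_a * phi_a(x)              # actor update (12)

print(np.round(w_c, 2), np.round(w_a, 2))     # both approach the ideal weights
```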

Remark 2. Figure 2 displays the NN implementation diagram of the PI algorithm. First, the NN weights of the actor network should be chosen so that an admissible control is generated. Second, the critic and actor networks are updated via the gradient-descent-based learning laws (9) and (12) to realize the policy evaluation and policy improvement steps, respectively. After the iterative learning, the critic and actor networks converge, at which point the NN-based approximate optimal control can be obtained. Many stability proofs of this NN implementation procedure have been given in the existing works. Here, we introduce the following proof to demonstrate the optimality and convergence.

Theorem 1. Let the target iterative value function and control policy be described by $V^{(i)}(x_k) = \omega_c^{*\top} \phi_c(x_k)$ and $u^{(i+1)}(x_k) = \omega_a^{*\top} \phi_a(x_k)$, respectively. Let the critic and actor NNs be updated via (9) and (12), respectively. If the learning rates $\alpha_c$ and $\alpha_a$ are selected to be appropriately small, then the NN weights $\hat{\omega}_{c,l}$ and $\hat{\omega}_{a,l}$ will asymptotically converge to the ideal values $\omega_c^*$ and $\omega_a^*$, respectively.

Proof. Let $\tilde{\omega}_{c,l} = \hat{\omega}_{c,l} - \omega_c^*$ and $\tilde{\omega}_{a,l} = \hat{\omega}_{a,l} - \omega_a^*$. According to (9) and (12), it can be acquired that

$$\tilde{\omega}_{c,l+1} = \big(I - \alpha_c \phi_c(x_k) \phi_c^{\top}(x_k)\big) \tilde{\omega}_{c,l}, \qquad \tilde{\omega}_{a,l+1} = \big(I - \alpha_a \phi_a(x_k) \phi_a^{\top}(x_k)\big) \tilde{\omega}_{a,l}, \tag{13}$$

where $e_{c,l} = -\phi_c^{\top}(x_k) \tilde{\omega}_{c,l}$ and $e_{a,l} = \phi_a^{\top}(x_k) \tilde{\omega}_{a,l}$ have been used.
Construct the following Lyapunov function candidate:

$$L_l = \mathrm{tr}\big(\tilde{\omega}_{c,l}^{\top} \tilde{\omega}_{c,l}\big) + \mathrm{tr}\big(\tilde{\omega}_{a,l}^{\top} \tilde{\omega}_{a,l}\big). \tag{14}$$

The difference of the Lyapunov function (14) can be derived as

$$\Delta L_l = L_{l+1} - L_l = -\alpha_c \big(2 - \alpha_c \|\phi_c(x_k)\|^2\big) \big\|\phi_c^{\top}(x_k) \tilde{\omega}_{c,l}\big\|^2 - \alpha_a \big(2 - \alpha_a \|\phi_a(x_k)\|^2\big) \big\|\phi_a^{\top}(x_k) \tilde{\omega}_{a,l}\big\|^2. \tag{15}$$

If the learning rates are selected to satisfy $0 < \alpha_c < 2/\|\phi_c(x_k)\|^2$ and $0 < \alpha_a < 2/\|\phi_a(x_k)\|^2$, then one has $\Delta L_l \le 0$, which implies the NN weights $\hat{\omega}_{c,l}$ and $\hat{\omega}_{a,l}$ will asymptotically converge to the ideal values.
This completes the proof.
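As a quick numerical illustration of the learning-rate condition in Theorem 1, the snippet below iterates the error dynamics (13) (as reconstructed above) for a single critic weight vector, once with a learning rate inside the bound 2/||phi_c||^2 and once outside it; the activation vector and initial error are assumed values.

```python
# Numerical check of the learning-rate bound under the error dynamics (13):
# w_tilde_{l+1} = (I - alpha * phi phi') w_tilde_l. The component of the weight error
# along phi decays when 0 < alpha < 2/||phi||^2 and diverges otherwise.
import numpy as np

phi = np.array([1.0, 0.5, -0.8])          # a fixed activation vector (assumed)
bound = 2.0 / (phi @ phi)

for alpha in (0.5 * bound, 1.5 * bound):  # one rate inside the bound, one outside
    w_tilde = np.array([1.0, -2.0, 0.5])  # initial weight estimation error (assumed)
    for _ in range(50):
        w_tilde = w_tilde - alpha * (phi @ w_tilde) * phi
    print(f"alpha/bound = {alpha / bound:.1f}: ||w_tilde|| after 50 steps = {np.linalg.norm(w_tilde):.3e}")
```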

4. Model-Based VI Algorithm and Multistep HDP Algorithm

With the help of the initial admissible control, the PI algorithm achieves fast convergence. However, the weakness of the PI algorithm is obvious: it requires the initial control policy to be admissible, which is a strict condition. How to find an initial admissible control policy is still an open problem, which limits the real-world applications of the PI algorithm. To relax this strict condition, the model-based VI algorithm [24–26] is shown in Algorithm 2, where the initial condition becomes much easier to satisfy.

Step 1: (Initialization)
Let the iteration index $i = 0$.
Select an initial positive semidefinite value function $V^{(0)}(x_k)$.
Choose a small enough computation precision $\varepsilon > 0$.
Step 2: (Policy Improvement)
With $V^{(i)}(x_k)$, compute the iterative control policy by
$$u^{(i)}(x_k) = -\frac{1}{2} R^{-1} g^{\top}(x_k) \nabla V^{(i)}(x_{k+1}).$$
Step 3: (Policy Evaluation)
With $u^{(i)}(x_k)$, calculate the iterative value function by
$$V^{(i+1)}(x_k) = U\big(x_k, u^{(i)}(x_k)\big) + V^{(i)}\big(f(x_k) + g(x_k) u^{(i)}(x_k)\big).$$
Step 4: If $\big|V^{(i+1)}(x_k) - V^{(i)}(x_k)\big| \le \varepsilon$, stop and the optimal control policy is acquired;
Else, let $i = i + 1$ and go back to Step 2.
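For comparison with the PI sketch given after Algorithm 1, the following applies Algorithm 2 to the same assumed linear-quadratic example. Starting from V^(0) = 0, the value update is iterated to the same precision; in this example it takes far more iterations than PI, illustrating the trade-off discussed below.

```python
# Sketch of model-based VI (Algorithm 2) for the same linear-quadratic example.
import numpy as np

A = np.array([[0.95, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))          # Step 1: V^(0)(x) = x'P^(0)x = 0 (easy-to-realize initial condition)
eps = 1e-8
for i in range(100_000):
    # Step 2 (policy improvement): u^(i)(x) = -K x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # Step 3 (value update): V^(i+1)(x) = U(x, u^(i)(x)) + V^(i)(x_{k+1})
    Ac = A - B @ K
    P_next = Q + K.T @ R @ K + Ac.T @ P @ Ac
    # Step 4: stopping criterion
    if np.max(np.abs(P_next - P)) < eps:
        P = P_next
        break
    P = P_next

print("VI iterations:", i)    # noticeably more than the PI sketch for the same precision
```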

Remark 3. Different from the PI algorithm, the VI algorithm does not require an initial admissible control; one only needs to provide a specific initial value function, which makes the VI algorithm more practical in real-world applications. However, without the help of an initial admissible control, the VI algorithm generally suffers from a low convergence speed. From the above, it can be observed that the PI and VI algorithms have their own advantages and disadvantages. The PI algorithm achieves fast convergence but requires an initial admissible control policy. The VI algorithm can start from an easy-to-realize initial condition but generally suffers from a low convergence speed. Thus, it is desirable to design a new approach that makes a trade-off between the PI and VI algorithms.
That is, it is desired to develop an algorithm that converges faster than the VI algorithm and does not require an initial admissible control policy. To realize this goal, the multistep HDP method [27] will be introduced in Algorithm 3.
Construct the critic and actor NNs to approximate the iterative value function and control policy as follows:

$$\hat{V}^{(i)}(x_k) = \hat{\omega}_c^{(i)\top} \phi_c(x_k), \qquad \hat{u}^{(i)}(x_k) = \hat{\omega}_a^{(i)\top} \phi_a(x_k), \tag{16}$$

where $\hat{\omega}_c^{(i)}$ and $\hat{\omega}_a^{(i)}$ are the NN weights and $\phi_c(\cdot)$ and $\phi_a(\cdot)$ are the associated NN activation functions.
According to Algorithm 3, using the NNs to estimate the solutions will yield the following error:

$$e_k^{(i)} = \sum_{j=0}^{M-1} U\big(x_{k+j}, \hat{u}^{(i)}(x_{k+j})\big) + \hat{\omega}_c^{(i)\top} \phi_c(x_{k+M}) - \hat{\omega}_c^{(i+1)\top} \phi_c(x_k). \tag{17}$$

Let $\theta_k^{(i)} = \sum_{j=0}^{M-1} U\big(x_{k+j}, \hat{u}^{(i)}(x_{k+j})\big) + \hat{\omega}_c^{(i)\top} \phi_c(x_{k+M})$ and $\psi_k = \phi_c(x_k)$. Equation (17) becomes

$$e_k^{(i)} = \theta_k^{(i)} - \hat{\omega}_c^{(i+1)\top} \psi_k. \tag{18}$$

To minimize $e_k^{(i)}$, we employ the least-square method to update $\hat{\omega}_c^{(i+1)}$. Collect $P$ different data sets $\{x_{k_1}, x_{k_2}, \ldots, x_{k_P}\}$ for training, where $P$ is a large enough number. Then, one has $\Psi = [\psi_{k_1}, \psi_{k_2}, \ldots, \psi_{k_P}]^{\top}$ and $\Theta^{(i)} = [\theta_{k_1}^{(i)}, \theta_{k_2}^{(i)}, \ldots, \theta_{k_P}^{(i)}]^{\top}$. The least-square-based updating law for $\hat{\omega}_c^{(i+1)}$ is given by

$$\hat{\omega}_c^{(i+1)} = \big(\Psi^{\top} \Psi\big)^{-1} \Psi^{\top} \Theta^{(i)}. \tag{19}$$

To minimize the actor error $e_{a,k}^{(i)} = \hat{u}^{(i)}(x_k) - u^{(i)}(x_k)$, where $u^{(i)}(x_k)$ is given by the policy improvement step of Algorithm 3, the gradient-descent-based updating law for the actor NN is given by

$$\hat{\omega}_{a,l+1}^{(i)} = \hat{\omega}_{a,l}^{(i)} - \alpha_a \phi_a(x_k) e_{a,k}^{(i)\top}, \tag{20}$$

where $\alpha_a > 0$ is the learning rate of the actor NN.
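A small sketch of the least-square critic update (19), with assumed quadratic activation functions and a synthetic target standing in for theta_k; it only illustrates the batch regression step, not the full multistep evaluation.

```python
# Least-square critic update: w_c = (Psi'Psi)^{-1} Psi'Theta, here solved via lstsq.
import numpy as np

def phi_c(x):                                  # critic activation functions (assumed)
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

rng = np.random.default_rng(1)
w_true = np.array([2.0, 0.4, 1.5])             # stands in for the multistep evaluation target

n_sets = 200                                   # a large enough number of data sets
Psi = np.zeros((n_sets, 3))
Theta = np.zeros(n_sets)
for p in range(n_sets):
    x = rng.uniform(-1, 1, size=2)             # sampled state x_{k_p}
    Psi[p] = phi_c(x)                          # regressor psi_{k_p}
    Theta[p] = w_true @ phi_c(x)               # target theta_{k_p}

w_c, *_ = np.linalg.lstsq(Psi, Theta, rcond=None)   # least-square updating law (19)
print(np.round(w_c, 3))
```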

Step 1: (Initialization)
Let the iteration index $i = 0$.
Select an initial positive semidefinite value function $V^{(0)}(x_k)$.
Choose a small enough computation precision $\varepsilon > 0$.
Step 2: (Policy Improvement)
With $V^{(i)}(x_k)$, compute the iterative control policy by
$$u^{(i)}(x_k) = -\frac{1}{2} R^{-1} g^{\top}(x_k) \nabla V^{(i)}(x_{k+1}).$$
Step 3: (Multistep Policy Evaluation)
With $u^{(i)}(x_k)$, calculate the iterative value function by the multistep update
$$V^{(i+1)}(x_k) = \sum_{j=0}^{M-1} U\big(x_{k+j}, u^{(i)}(x_{k+j})\big) + V^{(i)}(x_{k+M}),$$
where $M \ge 1$ is the number of evaluation steps and $x_{k+1}, \ldots, x_{k+M}$ are generated by system (1) under $u^{(i)}$.
Step 4: If $\big|V^{(i+1)}(x_k) - V^{(i)}(x_k)\big| \le \varepsilon$, stop and the optimal control policy is acquired;
Else, let $i = i + 1$ and go back to Step 2.
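The sketch below applies Algorithm 3 to the same assumed linear-quadratic example, using the multistep evaluation as written above (M is an assumed number of evaluation steps). Setting M = 1 recovers the VI sketch, while a larger M reduces the number of outer iterations without requiring an initial admissible policy.

```python
# Sketch of multistep HDP (Algorithm 3) for the linear-quadratic example.
import numpy as np

A = np.array([[0.95, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
eps, M = 1e-8, 10                  # M: number of evaluation steps per iteration (assumed)

P = np.zeros((2, 2))               # V^(0)(x) = 0, same easy initial condition as VI
for i in range(100_000):
    # Policy improvement: u^(i)(x) = -K x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    Ac, W = A - B @ K, Q + K.T @ R @ K
    # Multistep policy evaluation: apply the one-step backup M times under the same policy
    P_next = P
    for _ in range(M):
        P_next = W + Ac.T @ P_next @ Ac
    if np.max(np.abs(P_next - P)) < eps:
        P = P_next
        break
    P = P_next

print("multistep HDP iterations:", i)   # fewer outer iterations than the plain VI sketch
```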

Remark 4. From Table 1 and Figure 3, we can see the performance comparison and relationship among the PI algorithm, the VI algorithm, and multistep HDP. Due to the existence of an initial admissible control, the PI algorithm achieves fast convergence. However, the condition of an initial admissible control is difficult to realize. Different from the PI algorithm, the initial condition of the VI algorithm is easy to realize. However, the iterative control policies may not be admissible, which weakens the stability properties during learning. Multistep HDP follows the initial condition of the VI algorithm and develops the multistep policy evaluation step to exploit more history data. Therefore, multistep HDP is easy to realize and achieves fast convergence at the same time; that is, multistep HDP successfully combines the advantages of the PI and VI algorithms.

5. Application to a Benchmark Power System

The benchmark power system investigated in this paper is illustrated in Figure 4. This power system can be regarded as a microgrid, which is composed of nonpolluting energy sources (subsystems I and II), load demand sides (subsystem III), and regular generation (subsystem IV). The core control unit is the management center, which maintains frequency stability against load variations.

5.1. System Model and Application

As shown in Figure 5, the real-world power system is first formulated as a state-space model via mathematical modeling. After sampling and discretization, the system model can be controlled by computers. Through iterative ADP learning, the approximate optimal control is obtained. Substituting the approximate optimal control into the system model yields the simulation results. To test the effectiveness of the proposed DT optimal control strategy, let us consider the following power system [19, 20]:

$$\begin{aligned}
\Delta \dot{f} &= -\frac{1}{T_p} \Delta f + \frac{K_p}{T_p} \Delta P_g, \\
\Delta \dot{P}_g &= -\frac{1}{T_t} \Delta P_g + \frac{1}{T_t} \Delta X_g, \\
\Delta \dot{X}_g &= -\frac{1}{R_g T_g} \Delta f - \frac{1}{T_g} \Delta X_g + \frac{1}{T_g} u,
\end{aligned} \tag{21}$$

where $\Delta f$ is the frequency deviation; $\Delta P_g$ denotes the turbine power; $\Delta X_g$ represents the governor position value; $T_t$, $T_g$, and $T_p$ denote the time constants of the turbine, governor, and power system, respectively; $K_p$ represents the gain of the power system; $R_g$ is the speed regulation coefficient; $u$ denotes the control input; and $x = [\Delta f, \Delta P_g, \Delta X_g]^{\top}$ is the state variable. After sampling with a suitable sampling period, the system (21) can be discretized into the form of (1). The weighting matrices $Q$ and $R$ in the performance index function are set accordingly.
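The following sketch shows one way a continuous-time load-frequency model with the structure of (21) can be discretized into the form of (1) via zero-order hold. The parameter values, sampling period, and weighting matrices below are illustrative placeholders, not the values used in the paper.

```python
# Zero-order-hold discretization of a third-order load-frequency-control model (assumed values).
import numpy as np
from scipy.linalg import expm

T_t, T_g, T_p = 0.30, 0.08, 20.0   # turbine, governor, and power-system time constants (assumed)
K_p, R_g = 120.0, 2.4              # power-system gain and speed regulation coefficient (assumed)
dt = 0.1                           # sampling period (assumed)

# State x = [delta_f, delta_P_g, delta_X_g]': frequency deviation, turbine power,
# and governor position value; u is the control input.
A_c = np.array([[-1 / T_p, K_p / T_p, 0.0],
                [0.0, -1 / T_t, 1 / T_t],
                [-1 / (R_g * T_g), 0.0, -1 / T_g]])
B_c = np.array([[0.0], [0.0], [1 / T_g]])

# x_{k+1} = A_d x_k + B_d u_k, a linear instance of system (1).
aug = expm(np.block([[A_c, B_c], [np.zeros((1, 4))]]) * dt)
A_d, B_d = aug[:3, :3], aug[:3, 3:]

Q, R = np.eye(3), np.eye(1)        # weighting matrices of the performance index (assumed)
print(np.round(A_d, 4), np.round(B_d, 4), sep="\n")
```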

5.2. Simulation Results

Simulation results are shown in Figure 6. Figure 6(a) shows that the system states cannot be stabilized without control. Then, the optimal control strategy is applied to the system. Figure 6(b) indicates that the system states are stabilized within 8 time steps under the optimal control. Comparing the state trajectories, the superior performance of the optimal control strategy can be observed. Figure 6(c) shows a 2D plot of the convergence trajectory in detail, and Figure 6(d) provides the evolution of the control input. These simulation results demonstrate the high stability, fast convergence, and low control cost of the DT optimal control strategy.

6. Conclusions

In this paper, several state-of-the-art ADP-based methods have been reviewed to address the optimal control problem of DT systems. A comprehensive comparison has been made between PI and VI. The multistep HDP method has been introduced to integrate the advantages of the PI and VI algorithms while avoiding their respective drawbacks, namely the strict requirement of an initial admissible control and the long iterative learning process. The simulation results have demonstrated the effectiveness of the presented schemes.

Data Availability

Data are available upon request to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Science and Technology Foundation of SGCC (Grant no. SGLNDK00DWJS1900036).