Abstract

Rapid and precise air operation mission planning is a key technology for unmanned aerial vehicle (UAV) autonomous combat. In this paper, an end-to-end UAV intelligent mission planning method based on deep reinforcement learning (DRL) is proposed to overcome the shortcomings of traditional intelligent optimization algorithms, such as their reliance on simple, static, low-dimensional scenarios and their poor scalability. Specifically, suppression of enemy air defense (SEAD) mission planning is described as a sequential decision-making problem and formalized as a Markov decision process (MDP). Then, a SEAD intelligent planning model based on the proximal policy optimization (PPO) algorithm is established and a general intelligent planning architecture is proposed. Furthermore, three policy training tricks, i.e., domain randomization, maximization of policy entropy, and underlying network parameter sharing, are introduced to improve the learning performance and generalizability of PPO. Experimental results show that the model in this work is efficient and stable and can adapt to unknown, continuous, high-dimensional environments. It can be concluded that the UAV intelligent mission planning model based on DRL has powerful intelligent planning performance and provides a new idea for research on UAV autonomy.

1. Introduction

Mission planning is the process of making an operational plan, including a route plan, weapon plan, and avionics plan [1, 2]. Intelligent planning capability is an important symbol of unmanned aerial vehicle (UAV) autonomy. With the development of UAV technology, UAVs can fly independently and complete simple missions, such as reconnaissance and strike, which greatly improves efficiency and reduces labor costs. However, for complex cooperative missions, UAV intelligent planning is still a key research issue.

Mission planning is a decision optimization problem that seeks the optimal solution of a mission objective function under certain constraints, such as the shortest route [3], minimum threat [4], and maximum efficiency [5, 6]. Because the mission planning problem involves many mutually coupled factors, a large decision space, and nonlinear constraints, traditional mission planning is mostly solved by intelligent optimization algorithms. Xin et al. [3] modeled route planning as a shortest-path problem from the starting point to the target point, established an optimization model based on an ant colony optimization (ACO) algorithm, and searched for the shortest route. Zhang et al. [4] studied the tactical maneuver planning problem: taking the minimum threat to a fighter and the maximum damage effectiveness against ground targets as the optimization objectives, and considering the capability constraints of weapons and equipment, the optimal flight route and weapon delivery time were solved using the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [5]. For the task of allocating UAVs to identify, attack, and evaluate targets, a genetic algorithm (GA) was used to solve for the optimal allocation [6]. Zhang et al. [7] studied electronic warfare mission planning: taking the route safety width and the electronic jamming effect of a jammer as the objective function, the multiobjective particle swarm optimization (PSO) algorithm was used to obtain the optimal jamming array model. A search-based intelligent optimization algorithm can find the global optimum or a suboptimal solution of a complex objective function through its parallel optimization mechanism, but its essence is still random search. Each solution can only be searched for in a static, known environment (explicit objective function and constraints) and cannot be generalized to dynamic, unknown environments, and the computational complexity grows exponentially with the problem scale. Therefore, its applicability to rapid planning in the dynamic scenarios of future large-scale operations is limited.

In recent years, owing to the expansion of computing power, the emergence of big data [8], and the development of artificial intelligence (AI) algorithms, learning-based methods such as neural networks [9, 10] and reinforcement learning (RL) [11] have driven the second wave of AI. From AlphaGo [12] to AlphaGo Zero [13], AlphaZero [14], and AlphaStar [15], DRL has achieved a series of breakthroughs in challenging domains. Among such methods, deep learning (DL) is used to solve high-dimensional mapping problems, RL is used to solve sequential decision-making problems, and DRL has been successfully applied to robotics [16, 17], autonomous driving [18], real-time strategy (RTS) games [19], and optimization and scheduling [20] problems. A learning-based method, also known as a data-driven method, improves the prediction or decision-making performance of a model by feeding it data. Such a method uses a neural network to learn, or fit, the complex, high-dimensional, nonlinear relationship between input and output so as to minimize the mean square error or obtain optimal prediction and decision results, and it saves the learned network parameters so that training can be performed offline and inference online. It also has a certain robustness and generalization to new input data, which makes it suitable for fast and dynamic mission planning. The comparison of the two methods is shown in Table 1.

Therefore, by taking a high-risk typical suppression of enemy air defense (SEAD) mission planning [21, 22] as an example, we propose an end-to-end UAV intelligent mission planning method based on DRL. First, the mission planning problem of SEAD operation is formalized, then the basic principles of a DRL algorithm are introduced, and a DRL-based intelligent planning model is established. Finally, the superiority and potential value of this method are analyzed and verified by simulation experiments.

2. SEAD Mission Planning Problem Formulation

A SEAD mission is an operational style of offensive counter-air (OCA) combat carried out by an air force. Its goal is to break through the enemy's surface-to-air missile (SAM) threat and strike the enemy's radar or target through cooperative combat between an attacking UAV, called the fighter, and a jamming UAV, called the jammer. A schematic of a SEAD mission is presented in Figure 1.

In Figure 1, the mission is to use a fighter to safely destroy an enemy SAM. However, since the detection range of the SAM is longer than the attack range of the fighter, the fighter faces the threat of being detected and attacked by the SAM. Therefore, a jammer is required to jam the SAM and reduce its detection range; only then can the fighter take the opportunity to attack and destroy the SAM. The jammer must therefore jam at the right position and time, and the fighter must simultaneously attack at the right position and time; the two cooperate to complete the mission.

To summarize, mission planning is essentially a sequential decision-making problem: under different space-time states, a combat unit adopts the optimal decision sequence to transfer from the initial situation to the termination situation and achieve the mission objectives. Therefore, SEAD mission planning can be modeled as an end-to-end sequence optimization problem from the state (position and situation) to the decision (maneuver, attack, and jamming). The optimization goal is to solve for an optimal state-decision sequence that allows the fighter and jammer to destroy the enemy SAM while ensuring their own safety through tactical cooperation, as shown in Figure 2.

3. Deep Reinforcement Learning

3.1. Principles of Reinforcement Learning

RL is a machine learning approach for teaching agents how to solve tasks by trial and error. The main characters of RL are the agent [23] and the environment. The agent perceives an initial state of the environment and then decides on an action to take. The environment changes when the agent acts on it and gives the agent an instant reward signal, while the agent transfers to the next state and continues to choose new actions until reaching a termination state. The goal of the agent is to maximize the sum of rewards in the entire decision-making process, i.e., to find the optimal policy. Therefore, RL is a decision optimization method. The reinforcement learning framework is shown in Figure 3.

RL is usually modeled as a finite Markov decision process, which is represented by the five-tuple $(S, A, R, P, \gamma)$, in which $S$ is a finite state set, $A$ is a finite action set, $R$ is a reward function, $P$ is a state transition function, and $\gamma$ is a discount factor used to calculate the long-term discounted reward. Assuming that the agent is in state $s_t$ at time $t$ and takes action $a_t$ according to policy $\pi$, the environment feeds back an instant reward $r_t$ to the agent, and the agent transfers to a new state $s_{t+1}$. Tabular reinforcement learning evaluates the state-action value through a discrete Q table, but for continuous and high-dimensional problems it encounters the "curse of dimensionality", which spurred the development of DRL [23-26].

3.2. Deep Reinforcement Learning

DRL refers to the combination of RL with DL. DRL uses a neural network to approximate the policy and value functions to solve the high-dimensional mapping problem. The goal of the agent is to find the parameterized policy $\pi_\theta$ that maximizes the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, i.e., the discounted cumulative reward along a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$, where $\theta$ denotes the policy parameters. The optimal policy is as follows:

$$\pi^{*} = \arg\max_{\theta} J(\pi_\theta). \tag{1}$$

A DRL algorithm follows one of three learning paradigms: value-based, policy gradient, and actor-critic. The actor-critic paradigm integrates the value function and the policy gradient, and uses the value-function error to guide the policy gradient update so as to accelerate learning. The policy is updated through the gradient of the expected return, which can be written as follows:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Psi_t\right], \tag{2}$$

where $\pi_\theta$ is the actor and $\Psi_t$ is the critic. The critic can also take other forms, such as the state-action value function $Q^{\pi}(s_t, a_t)$, the advantage function $A^{\pi}(s_t, a_t)$, or the temporal difference (TD) residual $\delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$. When the critic takes the TD residual and the value function is approximated by a neural network with parameter $\omega$, the derivative of equation (2) is obtained, the critic is updated according to equations (3) and (4), and the actor is updated according to equation (5):

$$\delta_t = r_t + \gamma V_\omega(s_{t+1}) - V_\omega(s_t), \tag{3}$$

$$\omega \leftarrow \omega + \alpha_\omega\, \delta_t\, \nabla_\omega V_\omega(s_t), \tag{4}$$

$$\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{5}$$
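As an illustration of these update rules, the following PyTorch-style sketch performs one actor-critic step using the TD residual as the critic signal. The networks, optimizers, and variable names are illustrative assumptions, not the implementation used in this paper.

```python
import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        s, a, r, s_next, gamma=0.99):
    """One actor-critic step with the TD residual as the advantage signal."""
    # Critic: TD residual delta = r + gamma * V(s') - V(s)
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)
    delta = td_target - v_s

    # Critic update: minimize the squared TD residual (in the spirit of Eqs. (3)-(4))
    critic_loss = delta.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient weighted by the detached TD residual (Eq. (5))
    dist = actor(s)                      # assumed to return a torch distribution
    log_prob = dist.log_prob(a)
    actor_loss = -(delta.detach() * log_prob).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```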

3.3. Proximal Policy Optimization Algorithm

The proximal policy optimization (PPO) [27] algorithm is a simple, stable, and easy-to-implement actor-critic algorithm; both the Dota 2 AI OpenAI Five [28] and Tencent's Honor of Kings AI Juewu [29] were implemented with PPO. PPO addresses the computational expense of the trust region policy optimization (TRPO) [30] algorithm, which requires extensive computation to guarantee monotonic policy improvement. Through a first-order approximation, PPO optimizes a surrogate loss function: a new policy, kept close to the old policy, is computed in each iteration, and the policy is optimized in the direction that minimizes the loss function (maximizes the expected return). The PPO algorithm thus achieves a balance among sampling efficiency, algorithm performance, and engineering implementation complexity.

In the PPO algorithm, the critic uses the advantage function $A^{\pi}(s_t, a_t)$ to measure the quality of an action, and equation (2) becomes the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi}(s_t, a_t)\right]. \tag{6}$$

Because PPO is an on-policy method, importance sampling is introduced to improve sample utilization, and the old policy $\pi_{\theta_{\text{old}}}$ is used to generate samples, which yields the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right]. \tag{7}$$

Because $\nabla_\theta \pi_\theta(a_t \mid s_t) = \pi_\theta(a_t \mid s_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, equation (7) becomes the following:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right]. \tag{8}$$

The optimization objective function corresponding to this gradient is as follows:

$$J^{\pi_{\theta_{\text{old}}}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right]. \tag{9}$$

In practical application, the expectation is estimated from samples, and the optimization objective of PPO, i.e., the surrogate loss function, is simplified as shown in equation (10), where the policy update amplitude is limited by the truncation (CLIP) operation to ensure training stability:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \tag{10}$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the ratio of the new and old policies and $\epsilon$ is a hyperparameter. Generalized advantage estimation [31] is used to compute the advantage while keeping the variance and bias of the value-function estimate small, as shown in the following:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{11}$$
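The clipped surrogate of equation (10) and the generalized advantage estimation of equation (11) can be sketched in PyTorch as follows; the tensor layout and default coefficients are assumptions rather than the paper's settings.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory (Eq. (11)).
    rewards and values are 1-D tensors of equal length; the trajectory is
    assumed to end in a terminal state (bootstrap value 0)."""
    advantages = torch.zeros_like(rewards)
    gae_t = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of Eq. (10), returned as a loss to minimize."""
    ratio = torch.exp(log_prob_new - log_prob_old)   # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```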

The PPO algorithm is executed as shown in Table 2.

4. Modeling of SEAD Intelligent Planning Based on the PPO Algorithm

4.1. UAV Kinematics Equation

In this paper, we aim to study the feasibility and future value of an end-to-end DRL method in intelligent mission planning. Therefore, we construct a simple two-dimensional (2-D) environment in which the fighter and jammer adopt a three-degree-of-freedom (3-DOF) model, whose kinematic equations are as follows:

$$\dot{x}_f = v_f \cos\psi_f, \quad \dot{y}_f = v_f \sin\psi_f, \qquad \dot{x}_j = v_j \cos\psi_j, \quad \dot{y}_j = v_j \sin\psi_j, \tag{12}$$

where $\dot{x}_f$, $\dot{y}_f$, $v_f$, and $\psi_f$ represent the derivatives of coordinates X and Y, the speed, and the heading of the fighter, respectively, and $\dot{x}_j$, $\dot{y}_j$, $v_j$, and $\psi_j$ represent the corresponding quantities for the jammer. The heading takes any continuous value over the full circle, and the speed takes continuous values within its admissible range.
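A minimal sketch of one Euler-integration step of this 3-DOF model follows; the time step and symbol names are illustrative.

```python
import math

def step_kinematics(x, y, v, psi, dt=1.0):
    """Euler integration of the planar 3-DOF model:
    x_dot = v * cos(psi), y_dot = v * sin(psi)."""
    x_new = x + v * math.cos(psi) * dt
    y_new = y + v * math.sin(psi) * dt
    return x_new, y_new
```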

4.2. MDP Modeling
4.2.1. State Space

The states of the fighter, jammer, and SAM are defined as their coordinates in 2-D space, which are continuous values represented by $(x_f, y_f)$, $(x_j, y_j)$, and $(x_s, y_s)$, respectively, as shown in Table 3.

4.2.2. Action Space

The actions of the fighter and jammer are their respective headings and speeds, as shown in Table 4. A UAV changes its position in 2-D space by controlling its heading and reaches the attack position by controlling its speed. Missile firing by the fighter and jamming activation by the jammer are completed automatically by default once distance conditions are met; in an actual mission, the firing time, position, and parameters would need to be calculated in detail.

4.2.3. Reward Function

The design of the reward function follows the principle of Occam's razor, i.e., it should be simple and effective. If the jammer suppresses the SAM without entering its missile range and the fighter destroys the enemy's SAM radar without entering its missile range, a reward of +1 is given. If the jammer or fighter enters the range of the SAM or flies out of the environmental boundary, a reward of −1 is given. In all other circumstances, reward shaping based on experience knowledge is adopted, and a continuous reward based on the relative distances is given to guide the agent's learning. The reward function is expressed as follows:

$$r = \begin{cases} +1, & d_{js} \le R_j \text{ and } d_{fs} \le R_f, \text{ with neither UAV inside } R_s, \\ -1, & \text{the fighter or jammer enters } R_s \text{ or leaves the boundary}, \\ f(d_{fs}, d_{js}), & \text{otherwise (distance-based shaping term)}, \end{cases} \tag{13}$$

where $d_{fs}$ and $d_{js}$ represent the distances between the fighter and the SAM and between the jammer and the SAM, respectively, $R_s$ represents the attack range of the SAM, $R_f$ represents the attack range of the fighter, and $R_j$ represents the jamming range of the jammer.

4.2.4. Environment Class Development

The environment class mainly includes two functions: step() and reset(). The step() function realizes the deterministic state transition according to the UAV kinematics equation and returns the new state, the reward, and the termination judgment. The reset() function resets the fighter and jammer to their initial positions.
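A minimal, gym-style sketch of such an environment class is given below. The class name SEADEnv, the time step, the reward-shaping coefficient, and the success/failure checks are simplified assumptions consistent with Section 4.2.3 and the ranges of Section 5.1 (scaled by 100), not the authors' exact implementation.

```python
import numpy as np

class SEADEnv:
    """Minimal sketch of the SEAD environment (not the authors' exact code).

    State: positions of fighter, jammer, and SAM in the 2-D plane.
    Action: heading and speed of the fighter and jammer.
    """

    def __init__(self, r_sam=0.40, r_sam_jammed=0.25, r_fighter=0.30,
                 r_jammer=0.50, dt=1.0, bound=1.0):
        self.r_sam, self.r_sam_jammed = r_sam, r_sam_jammed
        self.r_fighter, self.r_jammer = r_fighter, r_jammer
        self.dt, self.bound = dt, bound
        self.reset()

    def reset(self):
        # Initial positions, normalized by 100 km as in Section 5.1
        self.fighter = np.array([0.40, 0.40])
        self.jammer = np.array([0.30, 0.30])
        self.sam = np.array([0.80, 0.80])
        return self._obs()

    def _obs(self):
        return np.concatenate([self.fighter, self.jammer, self.sam])

    def step(self, action):
        # action = [psi_f, v_f, psi_j, v_j]; speeds are assumed pre-scaled
        psi_f, v_f, psi_j, v_j = action
        self.fighter += v_f * np.array([np.cos(psi_f), np.sin(psi_f)]) * self.dt
        self.jammer += v_j * np.array([np.cos(psi_j), np.sin(psi_j)]) * self.dt

        d_fs = np.linalg.norm(self.fighter - self.sam)
        d_js = np.linalg.norm(self.jammer - self.sam)
        jamming = d_js <= self.r_jammer
        sam_range = self.r_sam_jammed if jamming else self.r_sam
        out_of_bounds = (np.any(self.fighter < 0) or np.any(self.fighter > self.bound)
                         or np.any(self.jammer < 0) or np.any(self.jammer > self.bound))

        if jamming and d_js > sam_range and d_fs <= self.r_fighter and d_fs > sam_range:
            reward, done = 1.0, True       # cooperative kill outside the SAM range
        elif d_fs <= sam_range or d_js <= sam_range or out_of_bounds:
            reward, done = -1.0, True      # shot down or left the area
        else:
            reward, done = -0.001 * (d_fs + d_js), False   # placeholder shaping term

        return self._obs(), reward, done, {}
```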

4.3. Policy Training Tricks
4.3.1. Domain Randomization

To improve the robustness of the agent's policy and its adaptability to diversified inputs, disturbances are added to the inputs during the training stage [32], as shown in equation (14); that is, the agent is trained in environments with different parameter disturbances under each random seed. In this way, the agent abstracts higher-level policy features and avoids overfitting to a single environment and policy, so the final learned policy is more robust and generalizes better to new environments.
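One way to realize such domain randomization is to perturb the initial positions at every reset, as sketched below using the attributes of the SEADEnv sketch above; the noise range is an illustrative assumption, not the paper's value.

```python
import numpy as np

def randomized_reset(env, pos_noise=0.05, rng=None):
    """Domain randomization sketch: perturb the initial positions at every reset
    so the agent trains on a family of perturbed environments rather than one
    fixed scenario (pos_noise is illustrative)."""
    rng = rng or np.random.default_rng()
    env.reset()
    for unit in (env.fighter, env.jammer, env.sam):   # attributes of the SEADEnv sketch
        unit += rng.uniform(-pos_noise, pos_noise, size=2)
    return env._obs()
```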

4.3.2. Maximization Policy Entropy

Entropy measures the randomness of a random variable: the greater the entropy, the more random the variable. Therefore, while the cumulative reward is maximized, the entropy of the policy is also maximized so that the policy remains as random as possible. The agent can then fully explore the state space, avoid the policy falling into local optima, and discover multiple feasible schemes for completing the mission, which improves the exploration ability, robustness, and generalization of the policy. The entropy is calculated as follows:

$$H\!\left(\pi_\theta(\cdot \mid s_t)\right) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\!\left[-\log \pi_\theta(a \mid s_t)\right]. \tag{15}$$
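In implementation terms, this amounts to adding an entropy bonus to the policy loss, as sketched below; the coefficient is illustrative and not the paper's value.

```python
import torch

def entropy_regularized_policy_loss(dist, policy_loss, ent_coef=0.01):
    """Subtracting the mean policy entropy from the loss is equivalent to maximizing
    the expected return plus ent_coef times the policy entropy."""
    entropy = dist.entropy().mean()   # e.g., dist = torch.distributions.Normal(mu, sigma)
    return policy_loss - ent_coef * entropy
```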

4.3.3. Parameter Sharing

The actor and critic networks share their underlying network parameters. The loss function is composed of the policy loss, the value-function loss, and the policy entropy term, and its gradient is backpropagated through the shared network; sharing the underlying network features in this way reduces the training difficulty. The loss function is expressed as follows:

$$L(\theta, \omega) = L^{\text{policy}}(\theta) + c_1 L^{\text{value}}(\omega) - c_2 H(\pi_\theta), \tag{16}$$

where $c_1$ and $c_2$ are weighting coefficients; the negative sign on the entropy term means that minimizing the loss maximizes the policy entropy, consistent with Section 4.3.2.
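A sketch of a shared-backbone actor-critic network and the combined loss described above is given below; the layer sizes, the Gaussian policy head, and the weighting coefficients are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic share the underlying fully connected layers (Section 4.3.3)."""

    def __init__(self, obs_dim=6, act_dim=4, hidden=64):
        super().__init__()
        # obs_dim = 3 positions x 2 coordinates; act_dim = heading and speed for 2 UAVs
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)       # action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value_head = nn.Linear(hidden, 1)          # state value

    def forward(self, obs):
        h = self.backbone(obs)
        dist = torch.distributions.Normal(self.mu_head(h), self.log_std.exp())
        return dist, self.value_head(h).squeeze(-1)

def total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    """Combined loss in the spirit of Eq. (16); coefficients are illustrative."""
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```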

4.3.4. Construction of Intelligent Planning Model

To summarize, an end-to-end intelligent mission planning model for the SEAD mission is established. The input is the positions of the fighter, jammer, and SAM, which are normalized by zero-mean (Z-score) standardization according to equation (17) and then fed into a two-layer fully connected neural network; the output is the headings and speeds of the fighter and jammer. The Z-score normalization is

$$\hat{s} = \frac{s - \mu}{\sigma}, \tag{17}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the input state. The optimization goal is to maximize the cumulative reward, the solution is the optimal or suboptimal policy, and the policy network architecture of the intelligent planning model is shown in Figure 4.
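A sketch of the Z-score normalization of equation (17) applied to the input positions is shown below; fixed statistics are used here for simplicity, although running estimates are also common in practice.

```python
import numpy as np

def z_score(obs, mean, std, eps=1e-8):
    """Zero-mean (Z-score) normalization of the input state (Eq. (17))."""
    obs = np.asarray(obs, dtype=np.float64)
    return (obs - mean) / (np.asarray(std) + eps)
```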

Therefore, this paper adopts the idea of offline training and online inference. First, the agent is trained in the environment; after training is completed, the inference module is run to test the planning performance of the agent. The research framework is shown in Figure 5.

A general intelligent planning architecture is proposed, including environment, planner (agent), and controller. First, the planner inputs the initial situation and interacts with the environment, that is, offline training. The trained planner can be directly applied and input the new initial situation for online inferencing. The inference decision sequence, i.e., planning results, is put into the controller for execution. The architecture realizes the entire process of intelligent planning, as shown in Figure 6.

5. Evaluation in Simulation Experiments

5.1. Experimental Setup and Training Tricks

In this paper, the SEAD environment is a 100 km × 100 km square area. The fighter's initial position is (40, 40) and its attack range is 30 km; the jammer's initial position is (30, 30) and its jamming range is 50 km; the SAM's position is (80, 80) and its attack range is 40 km, which is reduced to 25 km under jamming, as shown in Figure 7. In the experiments, the above quantities are scaled down by a factor of 100 for normalization, which eases neural network training and helps prevent vanishing gradients. The simulation environment uses Python 3.6 and PyCharm. Intelligent planning experiments in three different scenarios are completed separately, the method is compared with other classical DRL algorithms, and robustness and ablation studies are carried out to verify the algorithm's performance.

The hyperparameter settings used in this work are shown in Table 5. The neural network adopts orthogonal initialization, the optimizer is Adam [33], and the remaining training details are described in Sections 5.1.1 and 5.1.2.

5.1.1. Normalization Tricks


(1) Advantage function normalization. The advantage function values are normalized to improve training stability and policy learning. The formula is as follows:

$$\hat{A}_t \leftarrow \frac{\hat{A}_t - \operatorname{mean}(\hat{A})}{\operatorname{std}(\hat{A})}. \tag{18}$$

(2) Value function normalization. Similarly, the value-function loss is also normalized; these training tricks are studied and compared in the subsequent ablation experiments. A code sketch covering both normalization tricks is given below.
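The following sketches illustrate both normalization tricks under a common formulation (batch-wise zero-mean, unit-variance scaling); the paper's exact formulas may differ.

```python
import torch
import torch.nn.functional as F

def normalize_advantages(adv, eps=1e-8):
    """Batch-normalize advantage estimates to zero mean and unit variance (trick (1))."""
    return (adv - adv.mean()) / (adv.std() + eps)

def normalized_value_loss(values, returns, eps=1e-8):
    """Value-function loss on normalized return targets, one common reading of trick (2)."""
    targets = (returns - returns.mean()) / (returns.std() + eps)
    return F.mse_loss(values, targets)
```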

5.1.2. Adaptive Adjustment Parameters

(1) Adaptive learning rate. A larger learning rate is adopted in the early training stage to accelerate convergence, and a smaller learning rate is adopted in the later stage to approach the optimum.

(2) Adaptive clip value. The clip value changes adaptively in the same way as the learning rate: a larger clip value is allowed in the early training stage to accelerate the policy update, and a smaller clip value is used in the later stage to keep the policy update stable. A sketch of both schedules is given below.
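One simple way to realize both adaptive schedules is a linear decay over training progress; the linear form and the initial values in the usage comments are assumptions, not the paper's settings.

```python
def linear_decay(initial, progress, final=0.0):
    """Linearly anneal a coefficient as training progress goes from 0 to 1;
    usable for both the learning rate and the clip value."""
    progress = max(0.0, min(1.0, progress))
    return final + (initial - final) * (1.0 - progress)

# Usage sketch: larger values early in training, smaller values later.
# lr_now   = linear_decay(3e-4, step / total_steps)   # illustrative initial learning rate
# clip_now = linear_decay(0.2,  step / total_steps)   # illustrative initial clip value
```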

5.2. Intelligent Planning for Fighter-Jammer Scenario

In Experiment 1, we set up the fighter-jammer scenario depicted in Section 5.1. To complete the mission, the fighter must therefore be able to destroy the SAM safely under the cover of jamming.

Therefore, the intelligent planning model is trained in the simulation for 300 episodes, and the trained model is then used for online inference to test its intelligent planning and cooperative attack performance.

5.2.1. Offline Training

Three different random seeds are used to train the PPO policy and value-function networks. The average reward over timesteps during training is recorded and compared with that of the classical advantage actor-critic (A2C) and TRPO algorithms to obtain the cumulative reward learning curves, as shown in Figure 8.

As can be seen from the figure, the A2C model has a large variance and the TRPO model converges poorly, whereas the PPO model used in this work achieves higher episode rewards, more stable training, a smaller episode-reward variance, and good robustness, so its performance is better.

5.2.2. Online Planning

The trained model is given the initial situation information of the environment to test its online inference performance, and the resulting cooperative attack process is shown in Figure 9.

As can be seen from Figure 9(a), the fighter and jammer cleverly complete a cooperative attack. Before the jammer successfully jams, the fighter does not enter the detection range of the SAM or start an attack, for its own safety. As can be seen from Figure 9(b), once the jammer jams the SAM and degrades its detection range, the fighter attacks quickly and successfully destroys the SAM, which reflects strong intelligent cooperative planning performance.

5.3. Intelligent Planning for Fighter-Decoy Scenario

To further test the intelligent planning performance of the proposed model, a fighter-decoy cooperative attack scenario is designed in which the SAM attack range is 20 km and the fighter attack range is also 20 km, so the fighter cannot directly and safely destroy the SAM and the target. The decoy is a low-cost, expendable UAV that can be sacrificed. Therefore, the fighter has to attack cooperatively with the decoy, sacrificing the decoy first and exploiting the interval in which the SAM attacks the decoy to destroy the SAM and the target and complete the mission. The reward function is modified accordingly; in the modified formulation, $d_{fs}$ and $d_{ds}$ represent the distances between the fighter and the SAM and between the decoy and the SAM, respectively, $R_s$ represents the attack range of the SAM, $R_f$ represents the attack range of the fighter, and $(x_d, y_d)$ represent the X and Y coordinates of the decoy.

5.3.1. Offline Training

Similarly, the PPO policy and value-function networks are trained on three different random seeds, and the cumulative reward of each episode during training is recorded and compared with the intelligent planning models based on A2C and TRPO to obtain the episode-reward learning curves, as shown in Figure 10.

It can be seen from the figure that the A2C and TRPO models have large variance, low training stability, and poor convergence, while the PPO model proposed in this paper obtains higher cumulative episode rewards, a smooth and stable learning curve, and excellent convergence performance.

5.3.2. Online Planning

The trained PPO model is tested in the environment to verify the planning performance of the model. The results are as follows:

In Figure 11(a), the decoy flies directly toward the SAM to attract the SAM radar's tracking and attack, while the fighter circles and waits for an opportunity. In Figure 11(b), the fighter exploits the interval during which the SAM is tracking and locking onto the decoy to complete its attack quickly, successfully destroying the SAM and the targets. The fighter sacrifices the decoy but completes the mission, which shows that the PPO-based intelligent planning model has a certain tactical cooperative planning capability.

5.4. Robustness Studies

Next, the robustness of the intelligent planning model is tested by adding a certain amount of randomness to the training environment. As shown in Figure 12, the starting positions of the fighter, decoy, and SAM change randomly within a certain surrounding area, so as to test the generalization and robustness of the trained model in an unknown environment.

The test results are shown in Figure 13, in which the initial position of the fighter is (20, 20), that of the decoy is (85, 95), and that of the SAM is (55, 60). When this situation is input to the trained model, the fighter still intelligently waits for the decoy to enter the SAM range first and then takes the opportunity to attack quickly, successfully destroying the SAM and the target. This shows that the model has a certain robustness and generalization to unknown situations, can adapt to an uncertain environment, and has strong practical application value.

5.5. Ablation Studies

Finally, the effectiveness of the different training tricks is compared; that is, the performance of the model that uses all tricks is compared against the individual effects of advantage function normalization, layer orthogonal initialization, value function normalization, the adaptive learning rate, and the adaptive clip. The ablation studies are completed, and the results are shown in Figure 14.

As can be seen from Figure 14, the episode rewards are highest when all training tricks are used. Advantage function normalization greatly improves the early convergence speed of the model, layer orthogonal initialization improves the final performance of the model, and the other tricks have relatively little impact on model performance.

6. Conclusions

In this paper, we propose an end-to-end DRL-based UAV intelligent mission planning method. First, a SEAD mission is selected as the research object and mission planning is described as a sequential decision-making problem. Then, the SEAD intelligent planning model based on the PPO algorithm is established, including the design of the UAV state space, action space, and reward function; three policy training tricks, namely domain randomization, maximization of policy entropy, and parameter sharing, are introduced; and a general intelligent planning architecture is constructed. Finally, two scenario experiments, robustness studies, and an ablation study are completed. We conclude that the DRL-based intelligent planning model, which adopts an end-to-end architecture with offline training and online inference, can adapt to dynamic situations and is advanced and valuable. In future work, this end-to-end method will be extended to large-scale complex combat scenarios, and the problem of multi-agent cooperative planning will be studied in depth.

Data Availability

The data used to support the findings of this paper are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of the paper.

Authors’ Contributions

Longfei Yue planned the work, completed the simulation experiments, and drafted the main part of the paper. Rennong Yang, Ying Zhang, and Lixin Yu contributed to the error analysis. Zhuangzhuang Wang contributed to the typesetting.

Acknowledgments

The work described in this paper is partially supported by the Natural Science Foundation of Shaanxi Province of China under Grant no. 2021JQ-370 and the National Natural Science Foundation of China under Grant no. 62106284.