Deep reinforcement learning for real-world quadrupedal locomotion: a comprehensive review

Building agile and intelligent controllers for legged robots has long been one of the central challenges in the pursuit of artificial intelligence (AI). As an important branch of AI, deep reinforcement learning (DRL) can realize sequential decision making without physical modeling through end-to-end learning, and it has achieved a series of major breakthroughs in quadrupedal locomotion research. In this review article, we systematically organize and summarize the most relevant literature, covering DRL algorithms from problem settings to advanced learning methods. These algorithms alleviate, to a certain extent, the specific problems encountered in the practical deployment of robots. We first elaborate on the general development trends in this field from several aspects, including DRL algorithms, simulation environments, and hardware platforms. We then highlight and summarize the core components of algorithm design, such as state and action spaces, reward functions, and solutions to the reality gap problem. Finally, we discuss open problems and propose promising future research directions to help open up new areas of research.


INTRODUCTION
Wheeled and tracked robots are still unable to traverse the most challenging terrain in natural environments, and their stability may be severely compromised. Quadrupedal locomotion, on the other hand, can greatly expand the agility of robot behavior: legged robots can choose safe and stable footholds within their kinematic reach and rapidly change their kinematic state according to the environment. To further study quadrupedal locomotion on uneven terrain, the complexity of traditional control methods has gradually increased as more scenarios are considered [1-4]. As a result, the associated development and maintenance become rather time-consuming and labor-intensive, and the resulting controllers remain vulnerable to extreme situations.

Figure 1. DRL-based quadrupedal locomotion systems: (A) recovering from a fall [6]; (B) a robust controller for quadrupedal locomotion in challenging natural environments [7]; (C) learning agile locomotion skills by imitating real-world animals [10]; (D) producing adaptive behaviors in response to changing situations [9]; (E) coupling vision and proprioception for navigation tasks [11]; (F) integrating exteroceptive and proprioceptive perception for quadrupedal locomotion in a variety of challenging natural and urban environments over multiple seasons [8]; (G) utilizing prior knowledge of human and animal movement to learn reusable locomotion and dribbling skills [12]; and (H) leveraging both proprioceptive states and visual observations for locomotion control [13].
With the rapid development of the artificial intelligence field, deep reinforcement learning (DRL) has recently emerged as an alternative method for developing legged motor skills. The core idea of DRL is that the control policy learns to make decisions to obtain the maximum benefit based on the reward received from the environment [5] . DRL has been used to simplify the design of locomotion controllers, automate parts of the design process, and learn behaviors that previous control methods could not achieve [6][7][8][9] . Research on DRL algorithms for legged robots has gained wide attention in recent years. Meanwhile, several well-known research institutions and companies have publicly revealed their implementations of DRL-based legged robots, as shown in Figure 1.
Several reviews already cover the application of DRL algorithms to robots. One line of work summarizes the main types of DRL algorithms and their deployment on robots such as manipulator arms, bipeds, and quadrupeds [14], discussing in detail the theoretical background and advanced learning algorithms of DRL, presenting the key current challenges in this field, and suggesting future research directions to stimulate new research interests. Another work summarizes case studies of robotic DRL together with open problems [15]; based on these case studies, the authors discuss common challenges in DRL and how each work addresses them, and they provide an overview of other prominent challenges, many of which are unique to real-world robotics settings. Furthermore, a common paradigm for DRL in robotics is to train policies in simulation and then deploy them on real machines, which leads to the reality gap [16] (also known as the sim-to-real gap) problem; this is surveyed for robotic arms in [17]. That review introduces the background behind sim-to-real transfer in DRL and outlines the main methods currently used: domain randomization, domain adaptation, imitation learning, meta-learning, and knowledge distillation. It categorizes the most relevant recent works, outlines the main application scenarios, and discusses the opportunities and challenges of the different approaches while pointing out the most promising directions. The work closest to ours surveys current research on learning motor skills via DRL algorithms [18], but it neither systematically combs through the relevant literature nor analyzes in depth the existing open problems and future research directions.
In this survey, we focus on quadrupedal locomotion research from the perspectives of algorithm design, key challenges, and future research directions. The remainder of this review is organized as follows. Section 2 formulates the basic settings of DRL and lists several important issues that need to be alleviated. Section 3 introduces the classification and core components of current algorithm designs (e.g., the DRL algorithm, simulation environment, hardware platform, observation and action spaces, and reward function). Finally, Section 4 summarizes open problems and offers perspectives on potential future research directions in this field.

BASIC SETTINGS AND LEARNING PARADIGM
In this section, we first formulate the basic settings of standard reinforcement learning problems and then introduce the common learning paradigm.
Quadrupedal locomotion is commonly formulated as a reinforcement learning (RL) problem, which in the framework of Markov decision processes (MDPs) is specified by the tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, r, P, \rho_0, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively; $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; $P(s' \mid s, a)$ is the stochastic transition dynamics; $\rho_0(s)$ is the initial state distribution; and $\gamma \in [0, 1]$ is the discount factor. The objective is to learn a control policy that enables a legged robot to maximize its expected return for a given task [19]. At each time step $t$, the robot observes a state $s_t$ from the environment, and an action $a_t \sim \pi(a_t \mid s_t)$ is drawn from the robot's policy $\pi$. The robot then applies this action, which results in a new state $s_{t+1}$ and a scalar reward $r_t = r(s_t, a_t)$. Repeating this interaction process yields a trajectory $\tau := (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. Formally, the RL problem requires the robot to learn a decision-making policy $\pi(a \mid s)$ that maximizes the expected discounted return

$$J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \pi)}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right],$$

where $T$ denotes the time horizon of each episode and $p(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$ represents the likelihood of the trajectory $\tau$ under a given policy $\pi$, with $\rho_0(s_0)$ being the initial state distribution.
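As a concrete illustration of the return objective, the following minimal Python sketch computes the discounted return of a single trajectory from its per-step rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute J = sum_{t=0}^{T-1} gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a short trajectory with per-step rewards r_0, ..., r_4.
print(discounted_return([1.0, 1.0, 0.5, 0.0, 2.0]))  # 0.99-discounted sum
```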
For quadrupedal locomotion tasks, most current research follows a similar learning paradigm, as shown in Figure 2. First, a simulation environment (e.g., flat ground, steps, and stairs) is built, and the state and action spaces, the reward function, and other essential elements are designed. A DRL-based algorithm is then designed and used to train policies in simulation. The trained policy is finally deployed on the real robot to complete the assigned task.
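The paradigm can be summarized in a few lines of Python; `SimEnv`, the agent interface, and `RealRobot` below are hypothetical placeholders for illustration, not an actual library API:

```python
# Hypothetical interfaces for illustration: the simulator environment,
# agent, and robot objects are placeholders, not a real library API.

def train_in_simulation(env, agent, num_iterations):
    """Stage 1: learn a policy entirely in simulation."""
    for _ in range(num_iterations):
        rollouts = agent.collect_rollouts(env)  # interact with the simulator
        agent.update(rollouts)                  # gradient-based policy update
    return agent.policy

def deploy_on_robot(policy, robot):
    """Stage 2: run the frozen policy on the physical robot."""
    obs = robot.reset()
    while not robot.task_done():
        action = policy(obs)      # network inference runs onboard
        obs = robot.step(action)  # e.g., joint position targets to a PD loop
```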

DRL-BASED CONTROL POLICY DESIGN FOR QUADRUPEDAL LOCOMOTION
In this section, we detail the key components of a DRL-based controller. The classification results are presented in Tables 1 and 2 in the Appendix. After summarizing the most relevant publications in this field, we further condense their key parts. As shown in Figure 3, we first review and analyze the general state and development trends of current research (e.g., DRL algorithms, simulators, and hardware platforms). Then, important components of the algorithm design (state and action space design, reward function design, solutions to the reality gap, etc.) are presented, as shown in Figure 4. These specific designs help alleviate the open problems discussed further in Section 4. Please refer to the Appendix for more details.

DRL algorithm
Although many novel algorithms have been developed in the DRL community, most current quadrupedal locomotion controller designs still use model-free DRL algorithms, especially PPO and TRPO [20,21]. For a complex, high-dimensional nonlinear system such as a legged robot, stable control is the fundamental requirement. Most researchers choose PPO (or TRPO) due to its simplicity, stability, theoretical justification, and empirical performance [20-22].
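At the heart of PPO is the clipped surrogate objective, which keeps each policy update close to the data-collecting policy; a minimal PyTorch sketch of this loss follows (tensor shapes and the clipping coefficient follow common practice):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO.

    log_probs_*: log pi(a_t | s_t) under the new/old policies, shape (N,).
    advantages:  advantage estimates A_t, shape (N,).
    """
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximize the clipped objective == minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```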
Like other on-policy algorithms, PPO (TRPO) has been criticized for its sample inefficiency; thus, more efficient model-free algorithms (ARS [23], SAC [24], V-MPO [25], etc.) are sometimes considered. Some researchers have also recently applied advanced algorithms to more challenging tasks. For example, the multi-objective variant of the V-MPO algorithm (MO-VMPO) [26] has been used to train a policy to track planned trajectories [27]. Others have introduced the guided constrained policy optimization (GCPO) method for tracking base velocity commands while respecting predefined constraints [28]. Moreover, for more efficient real-world fine-tuning and to avoid overestimation problems, REDQ [29], an off-policy algorithm, has been used on real robots [30].

Simulator
A robot simulator should realistically simulate the dynamics of the robot itself and efficiently resolve the collisions that arise when the robot interacts with the environment. Over the past few years, the PyBullet [31] and RaiSim [32] simulation platforms have been the choice of most researchers. However, current academic robotic simulators are still relatively simple, and their fidelity falls far short of that of game engines. Without an accurate and realistic simulator, it is difficult for robots to directly realize end-to-end decision making from perception to control. Common robotic simulators, such as PyBullet and RaiSim, can handle control-level simulation, but they are stretched thin when simulating realistic environments, and they were developed to run on CPUs, which limits parallelism. On the other hand, while MuJoCo [33] is a popular simulator for verifying DRL algorithms, it is rarely used as a deployment and testing platform for real-world quadrupedal locomotion algorithms. A possible explanation is that MuJoCo is highly encapsulated, making it difficult for researchers to extend.
Recently, NVIDIA released a new simulator, Isaac Gym [34], which simulates the environment with much higher fidelity than the aforementioned simulators and can simulate and train directly on GPUs. The simulator is scalable and can run a large number of environment instances in parallel, allowing researchers to apply DRL algorithms to large-scale training. It can also build large-scale, realistic, complex scenes, and its underlying PhysX engine can accurately and realistically model and simulate the motion of objects. Consequently, more researchers have begun to use Isaac Gym as the implementation and verification platform for DRL algorithms [35-38].
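Isaac Gym's API is GPU-specific, but the batched-rollout pattern that makes large-scale training effective can be illustrated with Gymnasium's generic vector-environment interface; the environment id below is hypothetical:

```python
import gymnasium as gym

# "QuadrupedFlat-v0" is a hypothetical environment id used purely for
# illustration; Isaac Gym's own API differs and runs the physics on the GPU.
num_envs = 1024
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("QuadrupedFlat-v0") for _ in range(num_envs)]
)
obs, info = envs.reset(seed=0)
for _ in range(100):                      # one batched rollout segment
    actions = envs.action_space.sample()  # batch of actions, one per env
    obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()
```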

Hardware platform
In the early research stage, the Minitaur [39], with only eight degrees of freedom, was used to verify the feasibility of DRL algorithms in simple experimental scenarios. To accomplish more complex tasks, researchers use robots with 12 or more degrees of freedom (Unitree Laikago, Unitree A1, ANYmal [40], etc.). While the ANYmal series robots are well known, their hardware costs are high, so low-cost robots such as the Unitree A1 are a more prevalent choice among researchers; lower-cost hardware platforms allow DRL algorithms to be adopted more widely. More recently, a wheel-legged quadruped robot [38] demonstrated skills learned from existing DRL controllers and trajectory optimization, such as ducking and walking, as well as new skills, such as switching between quadrupedal and humanoid configurations.

Publication venues
DRL-based quadrupedal locomotion research is an emerging and promising field, and many papers have not yet been formally published. The published papers appear mainly in robotics journals and conferences, with four outstanding works [6-9] published in Science Robotics. It is worth noting that the field lies at the intersection of several disciplines, and some excellent studies have been published at machine learning conferences.

State, action, reward, and others
The state, action, and reward are integral components for training controllers, and their design directly affects controller performance. However, there is no fully unified standard or method for their specific design.
For the design of the state space, considering too few observations leads to a partially observable problem, while providing all available sensor readings results in a brittle controller that overfits to the simulation environment. Both degrade the controller's performance on the real machine, so researchers must make trade-offs based on the practical problem at hand. In current research, for simple tasks (walking or turning on flat ground, etc.), proprioception alone (base orientation, angular velocity, joint positions and velocities, etc.) is sufficient [10,39,41]. For more complex tasks (walking on uneven ground, climbing stairs or slopes, avoiding obstacles, etc.), exteroception, such as visual information, must be introduced [8,13,42]. Adding such sensors alleviates the partial observability issue to some extent.
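A typical observation vector simply concatenates proprioceptive readings, optionally followed by exteroceptive features; in the sketch below, the robot fields and dimensions are illustrative assumptions:

```python
import numpy as np

def build_observation(robot, height_map=None):
    """Assemble a policy observation; the robot fields are illustrative."""
    proprio = np.concatenate([
        robot.base_orientation,       # e.g., projected gravity (3 dims)
        robot.base_angular_velocity,  # 3 dims
        robot.joint_positions,        # 12 dims for a 12-DoF quadruped
        robot.joint_velocities,       # 12 dims
        robot.last_action,            # previous policy output, 12 dims
    ])
    if height_map is None:            # simple tasks: proprioception suffices
        return proprio
    # Complex terrain: append exteroception such as a sampled height map.
    return np.concatenate([proprio, height_map.ravel()])
```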
Most researchers use desired joint positions (or residuals) as the action space and compute torques through a PD controller to drive the robot. Early studies [43] experimentally demonstrated that controllers with such an action space achieve better performance. However, recent studies have also attempted to bypass the PD controller and use lower-level commands, controlling torques directly to obtain highly dynamic motion [44]. Although current DRL-based controllers have achieved outstanding performance [6-8], their stability still lags behind that of conventional control methods such as MPC controllers [45]; the hybrid force-position control adopted by MPC is worth drawing on and studying further. Furthermore, in some studies based on hierarchical DRL, latent commands serve as the action space of a high-level policy that guides the behavior of low-level policies [46,47].
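A minimal sketch of this action interface follows: the policy outputs desired joint positions (or residuals around a nominal pose), and a PD law converts them to torques; the gains and torque limit are illustrative values, not a specific robot's tuned parameters:

```python
import numpy as np

def pd_torques(q_des, q, qdot, kp=40.0, kd=0.5, tau_max=33.5):
    """Convert desired joint positions into torques with a PD law.

    Gains and the torque limit are illustrative; real controllers tune
    them per joint and per robot.
    """
    tau = kp * (q_des - q) - kd * qdot
    return np.clip(tau, -tau_max, tau_max)

# A common pattern: the policy outputs residuals around a nominal pose,
# q_des = q_nominal + action_scale * policy_action.
```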
In general, the design of the reward function is fairly laborious, especially for complex systems such as robots. Small changes in the reward function's hyperparameters can have a large impact on the final performance of the controller. For the robot to complete more complex tasks, the reward function must be designed in sufficient detail [6-8,48]. Commonly used reward terms include the desired direction, base orientation, angular velocity, base linear velocity, joint positions and velocities, foot contact states, policy output, and motor torque.
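To make the weighted-sum structure of such reward functions concrete, here is a minimal Python sketch; the state fields, term choices, and weights are illustrative assumptions rather than any specific paper's design:

```python
import numpy as np

def locomotion_reward(state, action, v_cmd):
    """Weighted sum of common locomotion reward terms; the fields and
    weights below are illustrative assumptions."""
    r_track = np.exp(-np.sum((state.base_lin_vel[:2] - v_cmd) ** 2))  # follow command
    p_tilt = np.sum(state.projected_gravity[:2] ** 2)                 # body orientation
    p_energy = np.sum(np.abs(state.joint_torques * state.joint_velocities))
    p_rate = np.sum((action - state.last_action) ** 2)                # smoothness
    return 1.0 * r_track - 0.5 * p_tilt - 0.001 * p_energy - 0.01 * p_rate
```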
Many studies also incorporate additional prior information, such as trajectory generators [46,49-51], conventional control methods [52-54], and motion data [10,12,55,56]. Trajectory generators and control methods mainly introduce prior knowledge into the action space, narrowing the search space of DRL control policies and thereby greatly improving sample efficiency under a simple reward function. Motion data are often generated by other (possibly suboptimal) controllers or obtained from public datasets. Through imitation learning on such motion data, the robot can master behaviors and skills such as walking and turning. In both simulation and real-world deployment, the robot eventually generates natural and agile movement patterns and completes the assigned tasks according to the external reward function.
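As a concrete illustration of imitation-style objectives over motion data, the following sketch computes an exponentiated pose-tracking reward in the spirit of motion-imitation methods such as [10]; the joint-space error form and the sensitivity parameter are assumptions for illustration:

```python
import numpy as np

def pose_imitation_reward(q_robot, q_ref, sigma=5.0):
    """Exponentiated joint-space tracking error with respect to a
    reference motion frame; sigma is an illustrative sensitivity."""
    err = np.sum((np.asarray(q_robot) - np.asarray(q_ref)) ** 2)
    return np.exp(-sigma * err)  # 1.0 at perfect tracking, -> 0 as error grows
```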

Solution to reality gap
Under the current mainstream learning paradigm, the reality gap is an unavoidable problem that must be addressed. Domain randomization is used by most researchers due to its simplicity and effectiveness. The difference between simulation and the real environment is mainly reflected in physical parameters and sensors. Therefore, researchers typically randomize physical parameters (mass, inertia, motor strength, latency, ground friction, etc.), add Gaussian noise to observations, and apply disturbance forces [35,48,50,57,58]. However, domain randomization trades optimality for robustness, which can lead to conservative controllers [59]. Some studies instead use domain adaptation methods, that is, they use real data to identify the environment [60,61] or to obtain accurate physical parameters [62]. Such methods can improve the generalization (adaptation) performance of robots in challenging environments. For more solutions to the reality gap, please refer to the relevant review paper [63].
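A minimal sketch of episode-level domain randomization is shown below; the setter methods and parameter ranges are hypothetical placeholders rather than any specific simulator's API:

```python
import numpy as np

rng = np.random.default_rng()

def randomize_episode(env):
    """Resample physics at the start of each episode; the setter methods
    and ranges below are hypothetical placeholders."""
    env.set_body_mass(env.nominal_mass * rng.uniform(0.8, 1.2))
    env.set_ground_friction(rng.uniform(0.4, 1.25))
    env.set_motor_strength_scale(rng.uniform(0.9, 1.1))
    env.set_action_latency(rng.uniform(0.0, 0.02))  # seconds

def noisy_observation(obs, std=0.01):
    """Gaussian sensor noise on observations."""
    return obs + rng.normal(0.0, std, size=obs.shape)
```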

OPEN PROBLEMS AND FUTURE PROSPECTS
In this section, we discuss long-standing open problems in the DRL-based quadrupedal locomotion field and the promising future research directions organized around them, as shown in Figure 5. Partial solutions to these open problems were described in Section 3.

Figure 5. Based on the current research state of quadrupedal locomotion, we expound the future research prospects from multiple perspectives. In particular, world models, skill data, and pre-trained models require significant attention, as these directions will play an integral role in realizing legged robot intelligence.

Sample efficiency
Many popular DRL algorithms require millions or even billions of environment steps and gradient updates to train policies that can accomplish the assigned task [64-66]. For real robotic tasks, such a learning process requires an infeasibly large number of interactions. As robotic tasks grow more complex, without improvements in the sample efficiency of algorithms, the number of training samples needed will only increase with model size and task complexity. Furthermore, a sample-efficient DRL algorithm can cope with sparse-reward tasks, which greatly reduces the difficulty of designing reward functions and relieves researchers of the heavy time burden of tuning reward parameters.

Generalization and adaptation
Generalization is another fundamental problem of DRL algorithms. Current algorithms perform well in single-task and static environments but struggle in multi-task settings and dynamic, unstructured environments; that is, it is difficult for robots to acquire novel skills and quickly adapt to unseen environments or tasks. Generalization (or adaptation) to new scenarios remains a long-standing unsolved problem in the DRL community. In general, robotics tasks exhibit two broad categories of such problems: observational generalization (adaptation) and dynamics generalization (adaptation). The former concerns learning with high-dimensional state spaces, such as raw visual sensor observations; high-dimensional observations may contain redundant, task-irrelevant information that impairs the generalization ability of robot learning. Many related studies exist for physical manipulation [67-71], but only a few cutting-edge works address quadrupedal locomotion tasks [8,11,13]. The latter mainly concerns dynamic changes in the environment (e.g., robot body mass and ground friction coefficient) [72-74], which alter the transition probability of the environment; i.e., the robot takes the same action in the same state but transitions to a different next state.

Partial observation
Simulators can significantly reduce the training difficulty of the DRL algorithms because we have access to the ground-truth state of the robots. However, due to the limitations of the onboard sensors of real robots, the policies are limited to partial observations that are often noisy and delayed. For example, it is difficult to accurately measure the root translation and body height of a legged robot. This problem is more pronounced when faced with locomotion or navigation tasks in complex and unstructured environments. Several approaches have been proposed to alleviate this problem, such as applying system identification [75] , removing inaccessible states during training [39] , adding more sensors [8,11,13] , and learning to infer privileged information [7,76] .
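One common, lightweight mitigation is to feed the policy a short history of observations so that it can implicitly infer state that a single noisy, delayed reading cannot reveal; a minimal sketch (with assumed dimensions) follows:

```python
from collections import deque
import numpy as np

class ObservationHistory:
    """Stack the last k observations so the policy can implicitly infer
    state that a single noisy, delayed reading cannot reveal."""
    def __init__(self, k, obs_dim):
        self.buf = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def push(self, obs):
        self.buf.append(obs)
        return np.concatenate(self.buf)  # flattened input for the policy
```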

Reality gap
This problem is caused by differences between simulated and real-world physics [16]. The discrepancy has many sources, including incorrect physical parameters, unmodeled dynamics, and stochastic real-world environments, and there is no general consensus on which of these sources plays the most important role. A straightforward approach is domain randomization, a class of methods that uses a wide range of environmental parameters and sensor noise to learn robust robot behaviors [39,77,78]. Since this method is simple and effective, most studies on quadrupedal locomotion use it to alleviate the reality gap problem.

Accelerate learning via model-based planning
For sequential decision-making problems, model-based planning is a powerful approach to improving sample efficiency and has achieved great success in domains such as game playing [79-81] and continuous control [82,83]. These methods, however, are costly when planning over long horizons and struggle to obtain accurate world models. More recently, the strengths of model-free and model-based methods have been combined to achieve superior sample efficiency and asymptotic performance on continuous control tasks [84], especially on fairly challenging, high-dimensional humanoid and dog tasks [85]. How to use model-based planning in DRL-based quadrupedal locomotion research is an issue worthy of further exploration.
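To illustrate the basic idea of shooting-based planning with a learned model, here is a minimal random-shooting sketch; the `model.predict` interface and all hyperparameters are assumptions for illustration:

```python
import numpy as np

def random_shooting_plan(model, state, horizon=10, num_samples=256, act_dim=12):
    """Return the first action of the best sampled action sequence under a
    learned dynamics/reward model (model.predict is a hypothetical interface)."""
    rng = np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(num_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, act_dim))
        s, total = state, 0.0
        for a in actions:
            s, r = model.predict(s, a)  # one-step model rollout
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```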

Reuse of motion priors data
Current vanilla DRL algorithms have difficulty producing life-like, natural behaviors for legged robots. Furthermore, reward functions capable of accomplishing complex tasks often require a tedious, labor-intensive tuning process, and robots still struggle to generalize or adapt to other environments or tasks. To alleviate these problems to a certain extent, recent DRL studies have built on motion priors [86-90], which have been successfully applied to quadrupedal locomotion tasks [12,56,91]. However, the variety of motion priors in these studies is insufficient, and the robot's behavior is neither agile nor natural, which makes it difficult for robots to cope with complex, unstructured natural environments. Improving the diversity of motion priors is therefore an interesting direction in quadrupedal locomotion research. On the other hand, there is currently a lack of general real-world legged motor skills datasets and benchmarks, which would have significant value for DRL-based quadrupedal locomotion research. If large amounts of real-world data were available, we could study and verify offline RL [92] algorithms for quadrupedal locomotion. The main feature of offline RL algorithms is that the robot does not need to interact with the environment during the training phase, allowing us to bypass the notorious reality gap problem.

Large-scale pre-training of DRL models
The paradigm of pre-training followed by fine-tuning on new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. Pre-trained DRL-based models enable robots to rapidly and efficiently acquire new skills and respond to non-stationary, complex environments. Meta-learning methods appear to be a popular solution for improving the generalization (adaptation) performance of robots in new environments; however, current meta-reinforcement learning algorithms are limited to simple environments with narrow task distributions [93-96]. A recent study showed that multi-task pre-training with fine-tuning on new tasks performs as well as or better than meta-pre-training with meta test-time adaptation [97]. Work on large-scale pre-trained models for quadrupedal locomotion is still in its infancy and needs further exploration. Furthermore, this direction is inseparable from the motor skills dataset mentioned above, but it focuses more on large-scale pre-training of DRL-based models and online fine-tuning for downstream tasks.

CONCLUSIONS
In the past few years, there have been several breakthroughs in quadrupedal locomotion research. However, due to the limitations of algorithms and hardware, robot behavior is still not agile and intelligent. This review provides a comprehensive survey of DRL algorithms in this field. We first introduce basic concepts and formulations and then condense the open problems in the literature. Subsequently, we sort through previous works and summarize the algorithm designs and core components in detail, including DRL algorithms, simulators, hardware platforms, observation and action space design, reward function design, prior knowledge, and solutions to the reality gap problem. While this review considers as many factors as possible in systematically collating the relevant literature, many subtle factors may still affect the performance of DRL-based control policies in real-world robotics tasks. Finally, we point out future research directions around the open problems to drive important research forward.

Authors' contributions
Made substantial contributions to the conception and design of the study and performed data analysis and interpretation: Zhang H, Wang D. Performed data acquisition and provided administrative, technical, and material support: He L.

Availability of data and materials
Please refer to Table 1 and Table 2 in the Appendix. Representative entries from these tables include the following.

[50] (arXiv 2020): a quadrupedal sim-to-real framework utilizing offline RL with dynamics and domain randomization to traverse uneven terrain (action space: foot position residuals; ANYmal).
Learning a Contact-Adaptive Controller for Robust, Efficient Legged Locomotion [101] (CoRL 2020): a hierarchical framework combining model-based control and RL to synthesize robust quadrupedal controllers.
Zero-Shot Terrain Generalization for Visual Locomotion Policies [103] (arXiv 2020): a learning approach for terrain locomotion using exteroceptive inputs without ground-truth height maps.
Learning Agile Robotic Locomotion Skills by Imitating Animals [10] (RSS 2020): a system enabling legged robots to learn agile locomotion skills by imitating real-world animals.
Learning Quadrupedal Locomotion over Challenging Terrain [7] (Science Robotics 2020): a novel sim-to-real solution incorporating proprioception and showing remarkable zero-shot generalization.
Terrain-Aware Risk-Assessment-Network-Aided (RAN) DRL for Quadrupedal Locomotion in Tough Terrain [51] (IROS 2021): a terrain-aware DRL-based controller integrating a risk-assessment network to guarantee action stability.
Learning Robust Perceptive Locomotion for Quadrupedal Robots in the Wild [8] (Science Robotics 2022): a quadrupedal locomotion solution integrating exteroceptive and proprioceptive perception (RaiSim; ANYmal-C).
Imitate and Repurpose: Learning Reusable Robot Movement Skills from Human and Animal Behaviors [12] (arXiv 2022): learning reusable locomotion skills for real legged robots using prior knowledge of human and animal movement (MuJoCo; ANYmal).
Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers [13] (ICLR 2022): an end-to-end RL method leveraging both proprioceptive states and visual observations for locomotion control.