
Russian Journal of Nonlinear Dynamics, 2024, vol. 20, no. 2, pp. 295-310. Full-texts are available at http://nd.ics.org.ru DOI: 10.20537/nd240501

NONLINEAR ENGINEERING AND ROBOTICS

MSC 2010: 70Q05, 68T40, 70E55, 68T05, 93E35

Reinforcement Learning in the Task of Spherical Robot Motion Control

N. V. Nor

This article discusses one of the DDPG (Deep Deterministic Policy Gradient) reinforcement learning algorithms applied to the problem of motion control of a spherical robot. Inside the spherical robot shell there is a platform with a wheel, and the robot is simulated in the MuJoCo physical simulation environment.

The goal is to teach the robot to move along an arbitrary closed curve with minimal error. The output control algorithm is a pair of trained neural networks — actor and critic, where the actor-network is used to obtain the control torques applied to the robot wheel and the critic-network is only involved in the learning process. The results of the training are shown below, namely how the robot performs the motion along ten arbitrary trajectories, where the main quality functional is the average error magnitude over the trajectory length scale. The algorithm is implemented using the PyTorch machine learning library.

Keywords: control, control of a mechanical system, spherical robot, mechanics, artificial intelligence, reinforcement learning, Q-learning, DDPG, actor-critic, multilayer neural network, MuJoCo, PyTorch

1. Introduction

Currently, there are many spherical robots with different mechanical devices inside. For example, rotors [1] (the so-called Chaplygin ball), omniwheel platforms [2], pendulums, etc. can be located inside a spherical shell. The review articles [3-5] give more details about the internal devices of various spherical robots. One of the tasks of such robots is to move along a predetermined trajectory.

Received November 28, 2022
Accepted March 19, 2024

This work was supported by the Noncommercial Foundation for the Advancement of Science and Education INTELLECT.

Nikita V. Nor [email protected]

Lomonosov Moscow State University
Leninskie Gory 1, Moscow, 119991 Russia

The current literature describes controls built by classical methods, without machine learning, under significant simplifying assumptions in the derivation of the dynamic and kinematic equations, such as neglecting friction, slip, and center-of-mass displacement. The resulting equations are large and complex, and the problem of controlling the motion of a spherical robot considered in this article is no exception.


Fig. 1. Photo of the robot (a) and 3-D model (b) built in the MuJoCo engine

In the scientific literature, there are works [6-9] that consider the construction of control for trajectory tracking of four-wheeled or three-wheeled rovers and similar vehicles using the DDPG algorithm. The main similarity of these papers (apart from the algorithm used) is that several points closest to the trajectory, with various parameters added (e.g., a heading angle, a deviation from the trajectory), are used as the state vector of the environment; the papers differ, however, in how the reward function is constructed and in how the environment is explored. Below we consider the reward functions and the results of applying the algorithm to the problems solved in [6] and [8].

In [6], the control algorithm is designed so that a four-wheeled cart with two leading front wheels moves along an arbitrary trajectory. In this task, the reward function (1.1) is used,

$$
r = \begin{cases}
-1, & \text{if } cte > 0.2,\\
1.5 - P_{cte} - P_{\delta} - P_{v} - P_{acc}, & \text{otherwise},
\end{cases} \qquad (1.1)
$$

where $P_{cte} = 0.8\,cte$, $P_{\delta} = 0.1\,\dfrac{\delta_t}{\delta_{\max}}$, $P_{v} = 0.8\,\dfrac{|v_t - v_{ref}|}{v_{ref}}$, $P_{acc} = 0.2\,|a_{req}|$ are penalties imposed on the agent if the deviation of the robot from the trajectory ($cte$) does not exceed 0.2 m, $\delta_t$ is the steering wheel angle, $\delta_{\max}$ is the maximum possible wheel steering angle, and $v_{ref}$, $a_{req}$ are the required velocity and acceleration, bounded in absolute value (the results are shown in Fig. 2).

Fig. 2. (a) test track No. 6 with the cross-track error [m] and velocity [m/s] deviation shown over time; (b) performance over a set of 10 test tracks

Track  Avg cte  Max cte  Avg Δv  Max Δv  % of path
1      0.103    0.264    0.473   1.466   100.0
2      0.141    0.32     0.465   1.813   100.0
3      0.101    0.299    0.608   2.523   100.0
4      0.119    0.335    0.658   2.519   100.0
5      0.113    0.641    0.62    2.34    100.0
6      0.114    0.464    0.752   2.623   100.0
7      0.141    0.54     0.65    2.28    100.0
8      0.107    0.31     0.609   1.937   100.0
9      0.111    0.385    0.496   1.84    100.0
10     0.105    0.336    0.626   2.076   100.0
Avg    0.115    0.39     0.596   2.142   100.0

In [8], a more complex reward function is considered: for a two-wheeled cart robot, the agent is trained so that the object can move along a given trajectory, taking into account the obstacles placed on it (Fig. 3a, schematic representation). The reward function is calculated using formula (1.2) when the robot has no obstacle in its path and using formula (1.3) for the case when the robot needs to avoid an obstacle and continue its movement along the given trajectory,

$$
r_{eb} = -\bigl(|x_e| + |y_e| + |\theta_e|\bigr), \qquad (1.2)
$$

where $|x_e|$ and $|y_e|$ denote the deviations from a given trajectory in the $x$ and $y$ coordinates, respectively, and $|\theta_e|$ is the robot's deviation from the course,

$$
r_{ee} = \begin{cases}
r_{eb} + \max\left\{\vartheta_{rz}\left(1 - \tanh\dfrac{d_{obs} - r_{obs}}{r_e - r_c}\right),\ \xi\right\}, & \text{rewards zone},\\[2mm]
r_{eb} - \max\left\{\vartheta_{pz}\tanh\dfrac{d_{obs} - r_{obs}}{r_e - r_c},\ \xi\right\}, & \text{penalty zone},\\[2mm]
r_{eb} - \xi, & \text{obstacle}.
\end{cases} \qquad (1.3)
$$

The constants $\vartheta_{rz}$, $\vartheta_{pz}$, $\xi$ are introduced to limit the reward and penalty, and $d_{obs}$ is the distance from the center of the obstacle to the center of the robot. The rest of the labeling can be seen in Fig. 3b. One of the training results can be observed in Fig. 4.


Fig. 3. (a) schematic illustration of a two-wheeled cart following a trajectory with an obstacle, (b) explanatory illustration of the reward function (1.3)


Fig. 4. The simulation result of the robot's motion for a sinusoidal trajectory with two obstacles on it, obtained in [8]

The originality of this work lies in the fact that a neural network approach to control design was applied to this type of spherical robot. The basis of the DDPG algorithm is the same as in all other works [6-9], but there are some differences: for each task, it was necessary to adjust the training hyperparameters, select the reward function and the neural network architectures, and create an environment for training.

The goal was to find out what the result would be if we adapted the basic algorithm to build a control so that the robot could move with minimal error along a random, closed trajectory in the plane. The real robot is shown in Fig. 1a, and its model created in the MuJoCo physics simulation engine [10] is shown in Fig. 1b.

The real robot is designed as follows: inside the spherical shell there is a platform on which the sensors and the microcontrollers, led by a single-board computer, are mounted. The wheel is located under the platform and is mounted on the rotors of two motors, which are attached to a fork and can rotate the wheel about an axis perpendicular to the plane of the wheel. The fork, in turn, is connected directly to the rotor of a large motor and can rotate, together with the wheel and the two motors, about a vertical axis. Consequently, the robot has two control torques. The power supply is provided by three batteries located under the platform. A simplified model of the robot is built in the physical simulator. We assume that the platform, the shell, the wheel and the fork to which the wheel is attached are dynamically symmetric rigid bodies with no displaced center of mass. The wheel rolls on the inside of the spherical shell without slipping, as does the shell on the supporting plane.

2. Reinforcement learning

Reinforcement learning is successfully applied to problem-solving in various domains, such as robotics, recommendation systems, the gaming industry, and autonomous vehicles. The main idea of reinforcement learning is that there is an agent that interacts with the environment $E$ by performing actions $a_t$ at each time step $t$, and in return receives the environment state $s_t$ and some reward $r_t$ in response to its actions. The agent must maximize the cumulative reward

$$
R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i,\ a_i) \to \max.
$$

All reinforcement learning algorithms are based on the concept of a Markov decision process and the dynamic Bellman equation. The definition of the Q-function (2.1), the Bellman equation (2.2) expressed in terms of the Q-function, and the quality functional to be maximized (2.3) are given below:

$$
Q^{\pi}(s_t,\ a_t) = \mathbb{E}_{r_{i\geqslant t},\, s_{i>t}\sim E,\, a_{i>t}\sim \pi}\bigl[R_t \mid s_t,\ a_t\bigr], \qquad (2.1)
$$

$$
Q^{\pi}(s_t,\ a_t) = \mathbb{E}_{r_t,\, s_{t+1}\sim E}\Bigl[r(s_t,\ a_t) + \gamma \cdot \mathbb{E}_{a_{t+1}\sim \pi}\bigl[Q^{\pi}(s_{t+1},\ a_{t+1})\bigr]\Bigr], \qquad (2.2)
$$

$$
J = \mathbb{E}_{r_i,\, s_i\sim E,\, a_i\sim \pi}\bigl[R_1\bigr] \to \max. \qquad (2.3)
$$

The policy $\pi = \pi(a\,|\,s)$ represents the distribution of actions¹ under the current state, $\gamma$ is the discount factor, which is always taken in the range (0, 1], and the closer it is to zero, the greedier the agent will be. The maximization of the functional (2.3) is the essence of reinforcement learning. According to the Bellman optimality criterion, the optimal policy can be computed via the Q-function: $\pi(s) = \arg\max\limits_{a} Q(s,\ a)$.

In the "tabular" case, where all possible states of the environment are known in advance and the probability p(st+1|st, at), the problem is solved iteratively by dynamic programming, improving the policy with each iteration and bringing the Q-function closer to the optimal value (these two alternating steps are called policy evaluation and policy improvement).

In the problem under consideration, the state space of the environment has infinite dimensionality, so in such cases the Q-function is approximated by a neural network. Let us denote the parameters of this network by $\theta$; then the parameterized Q-function is denoted by $Q_{\theta}$. In this approach, learning takes place through minimizing the loss function on a random batch $B$ from some buffer of bounded size (replay buffer), which stores the agent's experience, i.e., the

1 In the context of this paper, actions refer to the two control moments applied to the wheel.

set of transitions $(s_t,\ a_t,\ s_{t+1},\ r_t)$. In the formulas below, $T$ denotes one such transition included in the batch:

$$
y(T) = r_t + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1},\ a_{t+1}), \qquad (2.4)
$$

$$
\mathrm{Loss}(\theta) = \frac{1}{|B|}\sum_{T\in B}\bigl(Q_{\theta}(s_t,\ a_t) - y(T)\bigr)^2. \qquad (2.5)
$$

This is the well-known Deep Q-Networks (DQN) algorithm; more details can be found in [11]. DQN operates under the assumption that the action space is discrete. In this and all other algorithms, an additional target network is introduced, which by structure completely repeats the original one. Let us denote its parameters by $\theta'$. This network is needed to deal with the overestimation of the Q-function value, as well as with the noisiness of its values. The weights of this network are updated once every few episodes by copying the parameters from the original network. The update frequency is set experimentally.

The control torques applied to the robot wheel and platform are bounded and vary continuously over a certain range, so the DQN algorithm is not suitable. For the case of continuous control, there is the Deep Deterministic Policy Gradient algorithm, in which the policy $\pi$ and the Q-function are approximated by neural networks with parameters $w$ and $\theta$. Similarly, target networks with parameters $w'$ and $\theta'$ are introduced. The policy attempts to predict the action based on the current state, and it is worth noting that, in this algorithm, the policy is deterministic, not stochastic. DDPG is an actor-critic model, where the actor is the neural network approximating the policy and the critic is the neural network approximating the Q-function.


The DDPG training is similar to the DQN training. Gradient ascent is performed over the parameters $w$ using the policy gradient formula (2.6), where $J$ is the functional (2.3), and gradient descent is performed over $\theta$ to minimize the functional (2.7), which is similar to (2.5):

$$
\nabla_w J \approx \frac{1}{|B|}\sum_{T\in B}\nabla_a Q_{\theta}(s,\ a)\big|_{s=s_t,\ a=\pi_w(s_t)}\,\nabla_w \pi_w(s)\big|_{s_t}, \qquad (2.6)
$$

$$
\mathrm{Loss}(\theta) = \frac{1}{|B|}\sum_{T\in B}\bigl(Q_{\theta}(s_t,\ a_t) - y(T)\bigr)^2 \to \min, \qquad (2.7)
$$

$$
y(T) = r_t + \gamma \cdot Q_{\theta'}\bigl(s_{t+1},\ \pi_{w'}(s_{t+1})\bigr). \qquad (2.8)
$$

In DDPG, unlike the DQN algorithm, the weights of the target networks are updated "gradually" (2.9), (2.10), where $\beta,\ \tau \in (0,\ 1]$ are chosen and fixed before training starts. A detailed description of the algorithm is given in [12]:

$$
\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad (2.9)
$$

$$
w' \leftarrow \beta w + (1-\beta)w'. \qquad (2.10)
$$
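A minimal PyTorch sketch of one DDPG update step corresponding to (2.6)-(2.10); the networks, optimizers and the sampled batch tensors are assumed to exist, the critic is assumed to take the pair $(s, a)$, and all names are illustrative rather than taken from the paper's code:

```python
import torch

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005, beta=0.005):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic target (2.8): y = r + gamma * Q_theta'(s_{t+1}, pi_w'(s_{t+1})).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss (2.7): mean squared error between Q_theta(s, a) and y.
    critic_loss = ((critic(s, a) - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update (2.6): gradient ascent on Q_theta(s, pi_w(s)),
    # implemented as descent on its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates (2.9), (2.10).
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), target_critic.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), target_actor.parameters()):
            p_t.mul_(1 - beta).add_(beta * p)
```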

3. Exploration/exploitation tradeoff

The most common problem in reinforcement learning is that of finding an optimal balance between exploration of the environment and utilization of experience already accumulated. In the classical DDPG algorithm, throughout training, Ornstein-Uhlenbeck noise is added to the action obtained by the neural network responsible for the policy [12]. It should be noted that, in

the original description of the algorithm, the authors allow adding their own noise and creating their own policy for exploring the environment.

In the context of this problem, the following approach to exploring the environment is used, which differs a little from the approaches described in [6, 9]. At the beginning of each training episode, three normally distributed random variables are randomly selected and fixed, $A,\ \omega,\ \sigma^2 \sim N(0,\ 1)$, with zero mean and standard deviation 1, along with two numbers $\varphi_p,\ \varphi_w \sim \mathrm{Uniform}(-\pi,\ \pi)$, uniformly distributed on the interval $(-\pi,\ \pi)$, which represent phase shifts:

$$
\mu_p = A \cdot \sin(\omega\, t + \varphi_p), \qquad (3.1)
$$

$$
\mu_w = A \cdot \sin(\omega\, t + \varphi_w). \qquad (3.2)
$$

The above values of $\mu_p$ and $\mu_w$ are used as means for the normal distributions $N(\mu_p,\ \sigma^2)$ and $N(\mu_w,\ \sigma^2)$, from which two random variables are obtained and added as noise to the actions produced by the actor neural network:

$$
a_p = a_p + \varepsilon_p, \quad \text{where } \varepsilon_p \sim N(\mu_p,\ \sigma^2), \qquad (3.3)
$$

$$
a_w = a_w + \varepsilon_w, \quad \text{where } \varepsilon_w \sim N(\mu_w,\ \sigma^2). \qquad (3.4)
$$

Random variables are added with some probability p(n), depending on the episode number in the training n, where p decreases linearly to 0.25, and then remains constant throughout the training (3.5). This approach to exploring the environment was chosen empirically and showed a significant increase in the average cumulative reward, while reducing p to 0 resulted in the average reward becoming approximately half as much:

$$
p(n) = \begin{cases}
-18.75\cdot 10^{-5}\cdot n + 1, & n \leqslant 1500,\\
0.25, & n > 1500.
\end{cases} \qquad (3.5)
$$
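A sketch of how this exploration scheme could be coded, following (3.1)-(3.5); the absolute value of the sampled quantity is used as the standard deviation, which is an implementation choice not specified in the text, and all names are illustrative:

```python
import numpy as np

def episode_noise_params(rng):
    # Drawn once at the beginning of each training episode, see (3.1)-(3.4).
    A, omega, sigma = rng.normal(0.0, 1.0, size=3)
    phi_p, phi_w = rng.uniform(-np.pi, np.pi, size=2)
    return A, omega, abs(sigma), phi_p, phi_w

def exploration_prob(n):
    # Linear decay of the probability of adding noise, formula (3.5).
    return -18.75e-5 * n + 1.0 if n <= 1500 else 0.25

def noisy_action(a_p, a_w, t, params, n, rng):
    A, omega, sigma, phi_p, phi_w = params
    if rng.random() < exploration_prob(n):
        mu_p = A * np.sin(omega * t + phi_p)   # (3.1)
        mu_w = A * np.sin(omega * t + phi_w)   # (3.2)
        a_p = a_p + rng.normal(mu_p, sigma)    # (3.3)
        a_w = a_w + rng.normal(mu_w, sigma)    # (3.4)
    return a_p, a_w

# Example: rng = np.random.default_rng(); params = episode_noise_params(rng)
```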

4. Environment state vector. Reward function

The quality of the resulting model, i.e., the quality of how the robot will follow a given trajectory, depends on how well the reward function is chosen. The following vector of values was chosen as the state vector that describes the environment:

$$
s = \bigl(v_w^{abs}\ \ v_p^{abs}\ \ v_s^{abs}\ \ M_w\ \ r_1\ \cdots\ r_n\ \ M_p\ \ \omega_w^{abs}\ \ \omega_p^{abs}\ \ \omega_s^{abs}\bigr), \qquad (4.1)
$$

$$
\dim s = 36 + 2n, \qquad (4.2)
$$

where $n$ is the number of closest points from the robot to the trajectory; when $n = 7$, the best passage of the trajectory is achieved. An experimental justification of this result can be found in the Appendix.

$r_i = \bigl(r_i^x,\ r_i^y\bigr)$, $i = 1,\ \ldots,\ n$, are the coordinates of the $i$-th radius vector connecting the center of the robot platform to the nearest points on the trajectory, written in the frame attached to the platform and schematically depicted in Fig. 5. $v_w^{abs} = \bigl(v_{w1},\ v_{w2},\ v_{w3}\bigr)$, $v_p^{abs}$, $v_s^{abs}$ are the absolute velocities of the wheel center, the platform, and the spherical shell referred to the inertial frame, $\omega_w^{abs} = \bigl(\omega_{w1},\ \omega_{w2},\ \omega_{w3}\bigr)$, $\omega_p^{abs}$, $\omega_s^{abs}$ are the absolute angular velocities of the wheel, the platform, and the spherical shell referred to the inertial frame, and $M_w$,

Fig. 5. Top view. Nearest points from the robot to the curve (here $n = 2$); $h_e$ is the distance from the robot to the curve, $r_i$ are the coordinates of the curve point in the frame $O'x'y'$ attached to the robot, $v_p^{abs}$ is the absolute velocity of the platform center referred to the $Oxy$ frame, $v_d$ is the vector defining the direction of the robot's movement, always taken with the origin at the 0th and the end at the 1st point

Mp are the components of the orientation matrices of the wheel and the platform written as a row vector.

The reward function (4.3) takes into account how much the robot has deviated from the trajectory, $h_e$ (4.4), and how well it keeps the correct course (4.5). Finding the relationship between the reward and the state-action pair is handled by the neural networks. The episode ends when the robot's deviation from the trajectory exceeds 0.1 meter, i.e., $h_e \geqslant 0.1$ m:

$$
r\bigl(h_e,\ v_p^{abs},\ v_d\bigr) = \begin{cases}
10, & \text{if } 0 \leqslant h_e < 0.01,\\
r(h_e) + r\bigl(v_p^{abs},\ v_d\bigr), & \text{if } 0.01 \leqslant h_e < 0.1,\\
-15, & \text{if } 0.1 \leqslant h_e \text{ or } v_p^{abs} = 0,
\end{cases} \qquad (4.3)
$$

$$
r(h_e) = -\frac{4\,h_e}{15}, \qquad (4.4)
$$

$$
r\bigl(v_p^{abs},\ v_d\bigr) = \begin{cases}
10\cdot d^{1/3}, & \text{if } d \geqslant 0,\\[1mm]
-\dfrac{20}{d+1}, & \text{if } d < 0,
\end{cases}
\qquad \text{where } d = \frac{\bigl(v_p^{abs},\ v_d\bigr)}{\bigl|v_p^{abs}\bigr|\cdot\bigl|v_d\bigr|}. \qquad (4.5)
$$
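For illustration, a minimal sketch of this reward computation, assuming the piecewise form of (4.3)-(4.5) as reconstructed above; the thresholds and the terminal penalty follow (4.3), while the exact shaping terms of (4.4) and (4.5) may differ from the author's implementation, and all names are illustrative:

```python
import numpy as np

def course_term(v_p, v_d):
    # d is the cosine of the angle between the platform velocity v_p and the
    # direction vector v_d, as in (4.5).
    norm = np.linalg.norm(v_p) * np.linalg.norm(v_d)
    if norm == 0.0:
        return 0.0
    d = float(np.dot(v_p, v_d)) / norm
    # Reward for moving in the right direction, penalty otherwise
    # (second branch as reconstructed above, guarded against d -> -1).
    return 10.0 * d ** (1.0 / 3.0) if d >= 0.0 else -20.0 / (d + 1.0 + 1e-8)

def reward(h_e, v_p, v_d):
    # Piecewise reward (4.3): bonus close to the curve, shaping in the band
    # 0.01 <= h_e < 0.1, penalty (and episode termination) beyond 0.1 m.
    if h_e < 0.01:
        return 10.0
    if h_e < 0.1 and np.linalg.norm(v_p) > 0.0:
        return -4.0 * h_e / 15.0 + course_term(v_p, v_d)  # (4.4) + (4.5)
    return -15.0
```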

Let us specify what trajectories are generated for the robot. We assume that the trajectories are closed and have no discontinuities of the second kind and no self-intersections. Since everything happens in discrete time, we take 100 points to generate a random trajectory:

$$
\kappa = \{(x(t_i),\ y(t_i))\} = \{(r_i \cdot \sin(t_i),\ r_i \cdot \cos(t_i))\}, \qquad t_i \in [0,\ 2\pi), \qquad (4.6)
$$

where $r$ is a random vector, $\dim r = 100$, obtained as follows: first, a vector $r^0$ is generated, $\dim r^0 = 100$, where each coordinate of $r^0$ is a normally distributed random variable $N(0,\ 1)$. Each element of the vector $r^0$ is componentwise multiplied by the coordinates of another vector of the same dimension, $v_{\log}$ (4.7). Similarly, vectors $\theta^0$ and $\omega^0$ are introduced, as well as a vector $r^1$ of the same dimension in which all coordinates are equal to 1. In the end, we obtain formula (4.8) for $r$:

$$
v_{\log} = \{10^{x_i},\ -2.1 \leqslant x_i \leqslant 1.61\}, \qquad (4.7)
$$

$$
r = r^1 + r^0 \cdot \sin(\theta^0 \cdot t + \omega^0), \quad \text{where } r^0,\ \theta^0,\ \omega^0 \sim N(0,\ 1); \quad r^1_i = 1,\ i = 1,\ \ldots,\ 100. \qquad (4.8)
$$
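A sketch of the generator (4.6)-(4.8), assuming that $v_{\log}$ is a logarithmically spaced grid between $10^{-2.1}$ and $10^{1.61}$ and that $\theta^0$ and $\omega^0$ are scaled by $v_{\log}$ in the same way as $r^0$ (the text states this explicitly only for $r^0$); names are illustrative:

```python
import numpy as np

def random_trajectory(rng, m=100):
    t = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
    v_log = np.logspace(-2.1, 1.61, m)          # (4.7): values 10^x, -2.1 <= x <= 1.61

    # Componentwise-scaled normal vectors, as described in the text.
    r0 = rng.normal(size=m) * v_log
    theta0 = rng.normal(size=m) * v_log
    omega0 = rng.normal(size=m) * v_log

    r = 1.0 + r0 * np.sin(theta0 * t + omega0)  # (4.8), r^1 is the all-ones vector
    x, y = r * np.sin(t), r * np.cos(t)         # (4.6)
    return np.column_stack([x, y])

# Example: points = random_trajectory(np.random.default_rng(0))
```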


Algorithm 1: Algorithm for training the robot to move along trajectories
1: Inputs:
EpisodeCount — the number of episodes for training, ReplayBuffSize — size of the replay buffer, batchSize — size of the batch sampled randomly from the replay buffer for training, freqLearn — frequency at which the network weights are updated, $t_{max}$ — the maximum time for the robot's motion along the trajectory
2: Initialize:
Initialize the critic network $Q_{\theta}$, the actor network $\pi_w$ and the corresponding target networks $Q_{\theta'}$, $\pi_{w'}$, as well as the replay buffer $B_0$ of size ReplayBuffSize
3: for episode = 1 to EpisodeCount do
4:   Generate a random trajectory $\kappa = \{(x_i,\ y_i)\}_{i=0}^{99}$ (4.6)
5:   Place the robot at a random point on trajectory $\kappa$
6:   do
7:     Form action $a_t \leftarrow \pi_w(s_t)$ with added noise according to (3.3)-(3.5)
8:     Apply control inputs $a_t$ to the robot and obtain the new robot state $s_{t+1}$ and the reward $r_t$ (4.3) for the action taken
9:     Store transition $(s_{t-1},\ a_t,\ s_t,\ r_t)$ in buffer $B_0$
10:    if $|B_0| \geqslant$ batchSize and $|B_0|$ mod freqLearn $= 0$ then
11:      Randomly sample a batch $B \subset B_0$ of size batchSize
12:      Update critic and actor network weights according to (2.6)-(2.8)
13:      Update target critic and actor network weights (2.9), (2.10)
14:    end if
15:  while $h_e < 0.1$ m and trajectory traversal time $t < t_{max}$
16: end for

5. Results of learning and simulation

The optimizer used for both the actor and critic networks is the Adam stochastic gradient method [13]. When the replay buffer is empty, the agent acts randomly, accumulating experience in the buffer until it gathers the required number of transitions $(s_{t-1},\ a_t,\ s_t,\ r_t)$, equal to the batch size specified at the beginning of the algorithm. This process is summarized in Algorithm 1.
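A minimal replay-buffer sketch matching this behavior (transitions accumulate until there are at least batchSize of them, after which batches are sampled uniformly at random); the class and its names are illustrative, not taken from the paper's code:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, s_prev, a, s, r):
        self.storage.append((s_prev, a, s, r))

    def ready(self, batch_size):
        # The agent keeps acting randomly until this many transitions are collected.
        return len(self.storage) >= batch_size

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)
```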

The DDPG algorithm includes two neural networks: the actor and the critic, along with their target network copies. Schematically, the actor and the critic are depicted in Fig. 6. Both neural networks are fully connected, and a rectified linear unit function max(0, x) is applied to each neuron's output in the hidden layers.

The fundamental difference in the structure of the critic versus the actor is that the actor neural network "branches out" into two subnetworks after the third layer. This is done because the robot has two different control torques: the first one is fed to the wheel fork, which allows the robot to rotate about the vertical axis, and the second one is fed to the wheel, which allows the robot to move forward.

The subnetwork responsible for the wheel steering output applies the hyperbolic tangent function $\tanh(x)$, giving $a_{platform}$, while the other output uses the sigmoid function $\sigma(x)$, giving $a_{wheel}$. The moment $a_{wheel}$ is made nonnegative in order to simplify the problem in some sense, namely, to reduce the training time of the model: since the robot is symmetric and has no front or back part, restricting this moment to nonnegative values ensures that the robot moves only "forwards".

These two outputs are combined into a vector $a = (a_{platform},\ a_{wheel})$, which in turn is concatenated with the state vector $s$ and fed to the input of the critic.


Fig. 6. Actor $\pi_w$ (a) and critic $Q_{\theta}$ (b) neural networks

After each hidden layer of the actor to which the activation function is applied, a layer with batch normalization [14] is added. As for the critic, it is an ordinary fully connected neural network with one output and 5 hidden layers. All dimensions can be seen in Fig. 6. The algorithm is implemented using the PyTorch machine learning library [15].
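A PyTorch sketch of an actor with the branched output heads described above (tanh for the fork torque, sigmoid for the nonnegative wheel torque) and normalization after the activated hidden layers; the layer widths are placeholders, since the exact dimensions are given only in Fig. 6:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        # Shared trunk: fully connected layers with ReLU and batch normalization.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.BatchNorm1d(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.BatchNorm1d(hidden),
        )
        # Two heads after the third layer: fork (steering) and wheel torques.
        self.platform_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())
        self.wheel_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s):
        h = self.trunk(s)
        # a = (a_platform, a_wheel): signed fork torque, nonnegative wheel torque.
        return torch.cat([self.platform_head(h), self.wheel_head(h)], dim=-1)
```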

$$
\mathrm{Error}(\kappa,\ \kappa_d) = \frac{\sum\limits_{i=1}^{m} h_e(t_i)}{m \cdot l_d}, \qquad (5.1)
$$

$$
\%\,\mathrm{path} = \frac{\text{number of passed points}}{\text{total number of points on the required trajectory}} \cdot 100\,\%. \qquad (5.2)
$$

To evaluate the quality of learning we use the dimensionless metric (5.1). The meaning of this metric is the average error magnitude in the scale of the trajectory length, where $\kappa_d$ is the required trajectory along which the robot should move and $\kappa$ is the motion trajectory obtained. The length of the trajectory traveled by the robot $l$, the length of the desired trajectory $l_d$ and the percentage of the path traveled are calculated. The percentage of the path traveled (5.2) is an important metric because the model can be trained in such a way that the robot deviates at some point from the trajectory beyond an acceptable distance ($h_e > 0.1$ m) before reaching the vicinity of the point from which it started its motion, see Fig. 7. The results for the 10 random trajectories are shown in Table 1.
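A sketch of how these metrics could be computed from sampled points, under the reading of (5.1) given above (mean deviation normalized by the length of the reference trajectory); the nearest-sample deviation and the simplified "passed point" test are assumptions of this sketch, cf. the exact definition in Fig. 7:

```python
import numpy as np

def path_metrics(robot_pts, ref_pts, h_max=0.1):
    # Pairwise distances between robot samples and reference samples.
    dists = np.linalg.norm(robot_pts[:, None, :] - ref_pts[None, :, :], axis=2)

    h_e = dists.min(axis=1)                                        # deviation of each robot point
    l_d = np.linalg.norm(np.diff(ref_pts, axis=0), axis=1).sum()   # reference trajectory length
    error = h_e.mean() / l_d                                       # metric (5.1)

    # A reference point counts as passed if some robot point comes within h_max of it.
    passed = int((dists.min(axis=0) < h_max).sum())
    percent_path = 100.0 * passed / len(ref_pts)                   # metric (5.2)
    return error, percent_path
```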

Figure 8 shows the results of how the robot performed the motion along trajectories 2, 4, 5, 7, 8, 9. The metrics are summarized in Table 1, and Fig. 9 shows the control that was constructed by the algorithm when the robot was moving along the path shown in Fig. 8b.

The results of learning the robot to move along random trajectories are given above. It is interesting to see how the algorithm coped with "classical" trajectories, such as the "rectangle", "circle" and Bernoulli's lemniscate. The results of the simulations for these trajectories can be seen in Fig. 10 and the main related metrics in Table 2. It can be seen from the movement along the rectangle that the robot moved along the straight lines quite accurately, excluding those sections where it had to make a turn. It is relatively easy for the real model to move in a straight

Fig. 7. The figure shows a schematic representation of the path traversed by the robot. The required trajectory is marked with numbered support points. A point on the trajectory is considered passed if a point H matches with or is "ahead" of a reference point on the trajectory, and he < 0.1 [m]. The percentage of the path traveled is the ratio of the number of points traveled to the total number of reference points of the required trajectory (5.2)


Table 1. Metrics for the robot's motion along 10 random trajectories

No.  Error(κ, κ_d)  l_d  l  % path

1 0.004675 13.007 12.944 100.0

2 0.005463 14.148 13.97 100.0

3 0.005484 14.246 14.083 100.0

4 0.004565 14.263 14.198 100.0

5 0.005247 13.891 13.791 100.0

6 0.005243 13.576 13.498 100.0

7 0.005465 12.92 12.889 100.0

8 0.004697 13.111 13.063 100.0

9 0.005507 12.998 12.953 100.0

10 0.004898 13.496 13.423 100.0

Table 2. Metrics for robot motion along "classic" trajectories

No.  Error(κ, κ_d)  l_d  l  % path

1. Rectangle 0.004893 7.0 6.688 100.0

2. Circle 0.003729 12.564 12.622 100.0

3. Bernoulli lemniscate 0.004866 13.627 13.643 100.0

line (see video [16]) if a constant torque is applied to the wheel. However, some deviations in the motion of the real model are noticeable. This is due to the fact that in a real physical system there is always a displacement of the center of mass of the platform and also due to the internal connection of the two hemispheres that form the spherical shell of the robot. Below, Fig. 11 demonstrates how the agent was trained every 100 episodes using Algorithm 1. After each 100th episode of training, the robot with the current actor-network weights attempted to move around a circle.

Fig. 8. The plots depict the movement along random curves on the plane, the solid lines indicate the desired trajectory, and the dashed line represents the robot trajectory. (a-f) are trajectories 2, 4, 5, 7, 8, 9 from Table 1

Fig. 9. Time-dependent control built by the algorithm when the robot moved along the trajectory from Fig. 8b; the upper graph is the dimensionless moment $a_{platform}$ applied to the wheel fork, and the lower one is the dimensionless moment $a_{wheel}$ applied to the wheel axis


Fig. 10. Motion along a square (a), a circle (b), and Bernoulli's lemniscate (c)


Fig. 11. The evolution of the motion in a circle

6. Conclusions

The DDPG algorithm was used to control the robot's motion along an arbitrary closed curve on a plane with minimal error. To generate control torques, the agent used information about the seven closest points on the trajectory, expressed relative to a reference frame attached to the robot's platform, as well as information about the orientation of the platform and the shell and about the angular and linear velocities. The training has shown very promising results: the average error value (5.1) over 10 random trajectories is approximately 0.005 (Table 1). The robot learned to move not only along random curves, but also along "classical" trajectories such as a circle, a rectangle, and Bernoulli's lemniscate. The novelty lies in the fact that reinforcement learning was applied to build a control for a spherical robot of this type. It should be noted that the algorithm has not been tested on a real robot yet.

Appendix. Choosing the number of points to the curve

Let us elaborate on the issue of selecting the number of closest points from the robot to the curve, which are included in the state vector of the environment (4.1), as the distance from the center of the platform to these points in the coordinate system attached to the platform. For this purpose, 13 models were trained using Algorithm 1, each model considering a different number of closest points, i.e., n = 1, ..., 13. To find out which model was better, a test set of 100 random trajectories was prepared. The criteria for selecting the desired model and setting the desired n are: trajectory traversability (5.2) — the maximum number of trajectories fully traversed by the robot among the test set, and the minimum average error (5.1) across the entire test set.

Table 3. Minimum and maximum error values, average error value and standard deviation. For each model, the calculations consider errors from fully traversed trajectories

number of nearest points  min  max  mean Error  std  number of 100 % paths followed

3 0.000085 0.031778 0.003632 0.006708 88

4 0.000210 0.773589 0.043003 0.163012 75

5 0.000129 0.784409 0.021916 0.114405 87

7 0.000146 0.031125 0.002794 0.005102 100

8 0.000237 0.964662 0.022405 0.121418 83

9 0.000147 0.057071 0.004528 0.008816 89

10 0.000226 0.773679 0.012520 0.080740 100

11 0.000106 0.877702 0.014017 0.092854 100

12 0.000079 0.303388 0.006883 0.032468 85

The results of the trained models on the test set are shown in Table 3. The results of the models trained with 1, 2, 6, and 13 closest points are not shown in Table 3 because none of their trajectories were completely traversed. For the other models, the table presents the average error, standard deviation, minimum, and maximum error. From the data in Table 3, it can be seen that the models that accounted for the 7, 10 and 11 closest points on the trajectory were the best in terms of traversability. We use the one-tailed Student's t-test to select the model with

the minimum average error. Let us formulate two hypotheses (A.1) and their alternatives (A.2):

$$
H_{01}\colon\ \mathbb{E}[\mathrm{Error}_7] \leqslant \mathbb{E}[\mathrm{Error}_{10}], \qquad H_{02}\colon\ \mathbb{E}[\mathrm{Error}_7] \leqslant \mathbb{E}[\mathrm{Error}_{11}], \qquad (\mathrm{A}.1)
$$

$$
H_{11}\colon\ \mathbb{E}[\mathrm{Error}_7] > \mathbb{E}[\mathrm{Error}_{10}], \qquad H_{12}\colon\ \mathbb{E}[\mathrm{Error}_7] > \mathbb{E}[\mathrm{Error}_{11}], \qquad (\mathrm{A}.2)
$$

where $H_{01}$ is the null hypothesis that the average error of the 7-point model does not exceed the average error of the 10-point model, and $H_{02}$ is defined similarly for the 11-point model. $H_{11}$ and $H_{12}$ are the alternative hypotheses for $H_{01}$ and $H_{02}$, respectively.

We choose the classical significance level $\alpha = 0.05$. The results of the Student's test are $\mathrm{pvalue}_1 \approx 0.85$ and $\mathrm{pvalue}_2 \approx 0.83$, so we do not reject the null hypotheses, which is consistent with the data from Table 3; hence $n = 7$.
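A sketch of such a comparison with SciPy, assuming the per-trajectory error samples of the compared models are available as arrays; Welch's one-tailed test is used here, which is an assumption, since the paper does not specify the exact variant:

```python
import numpy as np
from scipy import stats

def compare_models(errors_a, errors_b, alpha=0.05):
    # One-tailed test of H1: E[errors_a] > E[errors_b] (Welch's t-test).
    _, pvalue = stats.ttest_ind(errors_a, errors_b,
                                alternative="greater", equal_var=False)
    return pvalue, pvalue <= alpha  # reject H0 when pvalue <= alpha

# Example with placeholder data (not the paper's measurements):
rng = np.random.default_rng(0)
p, reject = compare_models(rng.normal(0.0028, 0.005, 100),
                           rng.normal(0.0125, 0.08, 100))
```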

Conflict of interest

The author declares that he has no conflict of interest.

References

[1] Borisov, A. V., Kilin, A. A., and Mamaev, I. S., How to Control Chaplygin's Sphere Using Rotors, Regul. Chaotic Dyn., 2012, vol. 17, nos. 3-4, pp. 258-272.

[2] Karavaev, Yu. L. and Kilin, A. A., The Dynamics and Control of a Spherical Robot with an Internal Omniwheel Platform, Regul. Chaotic Dyn., 2015, vol. 20, no. 2, pp. 134-152.

[3] Karavaev, Yu.L., Spherical Robots: An Up-to-Date Overview of Designs and Features, Russian J. Nonlinear Dyn., 2022, vol. 18, no. 4, pp. 699-740.

[4] Li, M., Sun, H., Ma, L., Gao, P., Huo, D., Wang, Zh., and Sun, P., Special Spherical Mobile Robot for Planetary Surface Exploration: A Review, Int. J. Adv. Robot. Syst., 2023, vol. 20, no. 2, 20 pp.

[5] Diouf, A., Belzile, B., Saad, M., and St.-Onge, D., Spherical Rolling Robots — Design, Modeling, and Control: A Systematic Literature Review, Rob. Auton. Syst., 2024, vol. 175, 104657, 15 pp.

[6] Hess, G. and Ljungbergh, W., Deep Deterministic Path Following, arXiv:2104.06014 (2021).

[7] Kamran, D., Zhu, J., and Lauer, M., Learning Path Tracking for Real Car-Like Mobile Robots from Simulation, in European Conf. on Mobile Robots (ECMR, Prague, Czech Republic, 2019), 6 pp.

[8] Cheng, X., Zhang, S., Cheng, S., Xia, Q., and Zhang, J., Path-Following and Obstacle Avoidance Control of Nonholonomic Wheeled Mobile Robot Based on Deep Reinforcement Learning, Appl. Sci., 2022, vol. 12, Art. 6874, 14 pp.

[9] Gou, W. and Liu, Y., Trajectory Tracking Control of Wheeled Mobile Robot Based on Improved LSTM-DDPG Algorithm, J. Phys. Conf. Ser., 2022, vol. 2303, Art. 012069, 7 pp.

[10] Todorov, E., Erez, T., and Tassa, Y., MuJoCo: A Physics Engine for Model-Based Control, in 2012 IEEE/RSJ Internat. Conf. on Intelligent Robots and Systems (Vilamoura-Algarve, Portugal, Oct 2012), pp. 5026-5033.

[11] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M., Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602 (2013).

[12] Lillicrap, T.P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., Continuous Control with Deep Reinforcement Learning, arXiv:1509.02971 (2015).

[13] Kingma, D. P. and Ba, J.L., Adam: A Method for Stochastic Optimization, arXiv:1412.6980 (2014).

[14] Ioffe, S. and Szegedy, Ch., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv:1502.03167 (2015).

[15] PyTorch, https://pytorch.org/

[16] https://disk.yandex.ru/i/f-ik3ki-pQMxuA (2022).
