NANOSYSTEMS: Lobanov I.S. Nanosystems:
PHYSICS, CHEMISTRY, MATHEMATICS Phys. Chem. Math., 2023,14 (6), 613-625.
http://nanojournal.ifmo.ru
Original article DOI 10.17586/2220-8054-2023-14-6-613-625
Toward nanomagnetic implementation of energy-based machine learning
IgorS. Lobanov1'"'6
1Facultyof Physics, ITMO University, Lomonosova Str. 9, Saint Petersburg, 191002 Russia [email protected], [email protected]
PACS 75.78.-n, 05.65.+b
Abstract Some approaches to machine learning (ML) such as Boltzmann machines (BM) can be reformulated as energy based models, which are famous for being trained by minimization of free energy. In the standard contrastive divergence (CD) learning the model parameters optimization is driven by competition of relaxation forces appearing in the target system and the model one. It is tempting to implement a physical device having natural relaxation dynamics matching minimization of the loss function of the ML model. In the article, we propose a general approach for the design of such devices. We systematically reduce the BM, the restricted BM and BM for classification problems to energy based models. For each model we describe a device capable of learning model parameters by relaxation. We compare simulated dynamics of the models using CD, Monte-Carlo method and Langevin dynamics. Benchmarks of the proposed devices on generation and classification of hand-written digits from MNIST dataset are provided. Keywords Machine learning, Boltzmann machine, energy based model, dissipative training. Acknowledgements The work is supported by Russian Science Foundation grant 22-22-00565: https://rscf.ru/en/project/22-22-00565/
For citation Lobanov I.S. Toward nanomagnetic implementation of energy-based machine learning. Nanosystems: Phys. Chem. Math., 2023,14 (6), 613-625.
1. Introduction
The recent machine learning (ML) development revolutionized the field of artificial intelligence and unleashed rapid progress in natural language processing, robot motion planning, art generation and so on. Part of the rapid development can be attributed to the appearance of new ML models such as transformers, but the appearance of human-like behavior is connected with the increase of capacity of the models [1]. At the moment, training of large models is an expensive task requiring 105 hours of A100 GPU and costs millions of dollars. To obtain even more complex models further increase of performance and memory size of hardware simulating the models is required. The state of the art hardware originates from computer graphics accelerators and is not optimal for computation of essentially analogue artificial neural networks (ANN) and similar models. This leads in particular to the huge energy consumption of electric computers based ML compared to biological neural networks.
Various new approaches for implementation of ML using new physical principles are proposed, e.g. photoelectronic circuits speed up image processing thousands times faster than Tesla A100 [2]. One of the approaches lies in designing ML models on top of magnetic devices [3,4]. The approach extends the boundaries of spintronics, which is a likely replacement of modern electronics. Some ML models are especially appealing for implementation as magnetic devices since they originated from physical models. One approach uses topological solitons to transmit impulses between artificial neurons [5]. Probably before the emergence of such neuromorphic devices, we will see the appearance of magnetic racetrack memory working on the same principle [6,7]. The ML model, which is most close to a physical system is the Boltzmann machine (BM) [8], that can be considered the stochastic Ising machine, and therefore can be physically implemented as interconnected spin islands [9]. BM has a variety of uses including associative memory, solving combinatorial optimization problems, generation of images, features extraction and so on.
BM consists of spins, which interact with each other and the environment, resulting in Boltzmann distribution of the spins energy. The BM is trained by variation of the interaction constants to match the probability distribution (PD) to a target one. In the simulation, sequence of states of BM are computed by Monte-Carlo methods and its weights are optimized using contrastive divergence (CD) method. The CD method implements a variant of Hebbian rule, which makes training of BM similar to the learning of biological organisms. Earlier attempts to train BM showed that it is not practical for any hard problems, but BM with restricted connections (RBM) can be efficiently simulated and are in use now especially in the form of stacked BMs.
Besides numerical simulation of BM as a variant of ANN, BM has been implemented as a number of physical devices, including the one based on tunnel magnetic junctions [10]. The Ising model also forms a basis for D-Wave quantum computer, which however uses quantum annealing instead of stochastic dynamics for computations [11]. The vast majority of BM implementations do not support training in hardware. Instead they rely on traditional computers to
© Lobanov I.S., 2023
solve the optimization problem. We will pursue the goal to develop a self-training device capable of adapting to continuous feed of training data. An especially simple device can be obtained, if training can be reformulated as a naturally occurring physical process.
BM belongs to the class of so-called energy-based models (EBM) [12]. A useful feature of the models is that minimization of the loss function is equivalent to energy minimization [13,14]. The observation makes it possible to apply a physically inspired method to ML, e.g. simulation of Langevin dynamics achieving better results than other likelihood models [15].
Our initiative is to implement BM loss minimization as a relaxation in a magnetic device. In the article [16], we proposed an extended BM machine whose weights are included to degrees of freedom (DoF). We showed that the energy minimization of the system leads to memorization of the trained samples. This kind of device can be used as an associative memory, but it is not suitable for answering questions with a non-deterministic answer. In the present article we generalize the approach to a stochastic model capable of generation of arbitrary probability distributions. Similar idea was carried out in [17] that made it possible to implement plasticity of the energy landscape of a few atom BM demonstrating self-learning to some extent.
In the article, we systematically derive EBMs for BM with only visible neurons (Section 2), BM containing hidden neurons (Section 3) and BM with loss function specialized for classification problems (Section 4). For all the variants of BM, we propose an approach for physical implementation of a device, whose relaxation matches minimization of the corresponding loss functions. We also provide examples of application of the approach, studying a simple model of tuning variation of normal distribution in Section 2.1 and benchmarking hand-written digits generation (Section 3.2) and classification (Section 4.1).
2. Dissipative training
Consider an arbitrary physical system with energy functional E[x] depending on the system state x. Suppose the system is in thermal equilibrium with a thermal reservoir, hence the state of the system is random and is described by Boltzmann probability distribution:
p(x) = Z-1Z (x), Z (x) = e-^x], where fi = 1/kBT is the inverse temperature, and Z is the partition function:
Z = £ Z (x) ^ £ p(x) = 1.
x x
We use notation for summation over a discrete state space for simplicity of presentation, however, the theory is valid for arbitrary measurable state space and the Lebesgue-Stieltjes integral can be substituted for sums, and an appropriate probability measure for Z (x). Helmholtz free energy is defined by
F = -fi log Z ^ p(x) = e-^(E[x]-F(1)
The Shannon entropy is defined by
H (p) = Ex_p[- log p(x)] = p(x)log p(x).
x
It coincides with the Gibbs entropy up to the Boltzmann constant factor: S = kBH. Due to (1), the entropy is related to the free energy and mean value of energy with respect to the distribution p:
H(p) = fiEp[E - F] = fi(Ep[E] - F). The mean energy can be computed in terms of the partition function:
d(fiF) = d log Z = i dZ = i^
= d~ = dfi = Z
= -Z= Z-1Y) Z (x)E [x] = Ep [E]. (2)
In machine learning, the system can be used as a generator of samples with PD p. In practice, we want the distribution to match a given distribution p. To compare the model distribution p with a target one p, the Kullback-Leibler (KL) divergence can be used:
-p(x)1 = £ p(x) log pM = H (p,p) - H (p),
Dkl(p || p) = Ex
log
p(x)
where H(p) is the entropy for the distribution p and the cross-entropy is defined by:
H(p,p) = Ex^p[- logp(x)] = - £p(x) logp(x).
x
The KL divergence is not-negative and is equal to zero if and only if the distributions p and p coincide. Although DKL is not a metric, since it is not symmetric, the two mentioned properties make the KL divergence a good choice for loss function for system parameters optimization for the system to learn the target distribution p. The learning is possible, if
the system depends on parameters 6 then we can minimize loss function with respect to the parameters. The parameters will be discussed below, for the moment, we can work in very general settings.
Since the model PD p is generated by the considered system, the cross-entropy can be expressed in terms of the mean and free energy using (1):
H(p,p) = /3Ep[E - F] = J(Ep[E] - F).
The expression for the cross-entropy differs from the entropy of p only by the distribution used for averaging the energy.
Suppose that the distribution p is generated by an analogous system, but with a different expression for the energy. We will put tilde above all values related to the system with the PD p over the states, in particular,
p(x) = exp(-J(E(x) - F)) .
The KL divergence simplifies to:
H(p,p)
H(p)
Dkl(p || p) = J(E p[E] - F) - J(Ep[E ] - F) = JEP[E - E]+ J(F - F).
Therefore, the learning of the distribution p is equivalent to equalization of the free energies for p and p. Let both energies E and E be two implementations of the same model for different parameters:
E(x) = E(x; 6), E(x) = E(x; 6).
Then the learning is done by optimizing the loss with respect to 6. The simplest optimization procedure is the gradient descend, when the parameters are updated according to the rule:
6 ^ 6 - v
dD
KL
d6
where v defines learning rate. Since the values with tilde do not depend on 6, the KL divergence simplifies:
dD
KL
d6
d
= pqq (Ep[E] - F).
In virtue of independence of p on 6, the differentiation and averaging can be swapped. The free energy derivative can be computed as follows:
dF = 1 dZ = 1 dZ(x)
JZ d6
d6
ZE Z (x) d-Exl = E p(x)dE(x)
d6
E„
dE
~ae
Finally, the parameter update rule is driven by the velocity f:
6 ^ 6 + (vJ)f, f = Ep
dE ~dB
- E„
dE ~30
(3)
(4)
It is worth noting that the parameter 6 does not present in the final expression. Moreover, the origin of p does not matter below, so we will not restrict p in any way below.
The key observation of the article is that the update rule coincides with the relaxation dynamics for the variable 6. Indeed, suppose 6 is a DoF for the considered system, but we assume 6 to be a slow variable, so that at each moment of time, the distribution of x is given by the p for the fixed 6. The relaxation dynamics of 6 is described by:
6 dE
6 = -a~do,
where a is the damping parameter. Averaging over the ensemble, we obtain the second addendum in (4). Moreover, assuming ergodicity, the addendum can be approximated by the time average:
"dE(x; 6)'
E
'x^p
d6
dE\ ~d0/1
The first addendum in (4) is more complex, since the distribution p is external for the system. To take p into account, we extend our system, including new degrees of freedom x. The vector x belongs to the same state space as x, but we assume x be externally driven, so that its distribution at every moment of time is given by p and they are uncorrelated at different moments of time. Then the first addendum can be obtained as time averaging of:
E
x^p
dE(x; 6)
d6
/dE(x; 6)
d6
We define total energy of the new system according to the requirements:
(1) Energy E[x] is opposite to the energy E[x].
(2) x and x interact with the same parameters state 6.
t
Fig. 1. Proposed designs of self-training BM. Circles represent DoF: bold line marks slow variable/weights, double-line marks fast variables, punctured boundary marks externally driven DoF. The system is split to two lobes: the right one is a copy of the left one with externally driven DoFs substituted for some fast variables. Segments between circles indicate interactions. (a) BM learning PD demonstrated on x and generating the PD on x. (b) BM contains hidden neurons y and y; training data is demonstrated on X, the result is read from x. (c) BM for classification problem; features X and labels y from the training set are driven by external forces; predicted PD over labels is read from y.
Then the total energy is defined by:
ET = E(x; 0) - E(X; 0).
Summing up all properties above, we conclude that f can be obtained by averaging of the relaxation force:
f
/ dEl \ \ d0 /t•
We conclude that the relaxation dynamics of the extended system is a continuous version of the stochastic gradient descent for the minimization of the KL divergence. Since the approach is quite general, it opens many opportunities for implementation of self-learning devices. In the following sections, we consider several examples.
2.1. Standard deviation learning
Consider simple case of a single continuous random variable x having normal distribution with mean m and standard deviation a = 0- 2:
p(x) = Z 1 exp ( - -
(x — m)2
The PD can be considered Boltzmann distribution for the energy
"(x — m)2
E=
2
and the constant temperature fi = 1. The partition function in the case is well known:
Z =^2^/0,0.
The optimization of the mean m is relatively simple and was already solved in [16]. Here we let m = 0 and focus on optimization of 0. The free energy of the model system:
F = (log 0 + logfi — log(2n)) •
According to (2) the mean energy is:
d (fiF) = 1
dfi = 2fi'
Suppose target distribution p is also normal with standard deviation a = 2. The expectation value of E with respect to p is obtained by rescaling:
' 0 ^ \ 1 0
P W
KL divergence between the distributions is given by:
1 0 1
Ep[E] = ^ = ^ • (5)
Ep[E] = ( ~E ) =—
Dkl(P || P) = fiE^p[E(x) — E(x)] + fi(F — F) = 2 (0 — ^ + 2 (log0 — log^ •
2
3.0-
2.5
CD Ql
I 2.0-
ro to
Q.
1.5-
1.0-
25
50 75 100 Iteration xlO3
125 150
25
50 75 100 Iteration xlO3
125 150
0
0
Fig. 2. Convergence of estimation of standard deviation parameter 0 using EBM: CD learning (solid line), free energy minimization using Metropolis-Hastings algorithm (dashed line) and Langevin dynamics (dotted line). Target value of the parameter is 1, initial value is 3. Mean value of the parameter estimate (left panel) and standard deviation of (right panel) obtained by averaging over 20 runs and 1000 consecutive samples.
The divergence has only one minimum at 0 = 0 as expected. The derivative of the KL divergence will be used below to make optimization algorithms:
odkl 1 f1 1\
(6)
d0 2 V0 0/
2.2. Dissipative learning
Above we considered the parameter 0 to be fixed. Here we suppose that 0 a slow DoF of the system. Then relaxation dynamics will push 0 in the direction opposite to
dE = x2
~3e = - T'
Assuming x is a fast variable, the derivative can averaged over x:
" dE(x; 0)"
Ex^p
80
e r x2 i 0x2\ i "V Texp( — dx = - 20-
It is worth noting that the derivative of the mean energy with respect to 0 is zero according to (5), that is the expectation value and the differentiation do not commute. It turns out that the dissipation generates the same dynamics as the second addendum in (6). The first addendum here gives a reference point and is essential to obtain the correct stationary state. The first addendum can also be interpreted as mean energy, but with respect to the target distribution p:
"dE(x; 0)"
EXtp
de
i 20'
Then optimization direction will coincide with mean relaxation force if we expand the system with the second part depending on x and assuming the second part has opposite energy to the first one:
dDKL = E - -
de ^x^p^xt-^p
dE T
de
ET(x, x; 0) = E(x; 0) - E(X; i
In the real system, the expectation values can be obtained either by averaging over ensembles or over time. The distribution p can be enforced to the variable x by an external force demonstrating the training set to the device. Obtaining opposite signs of energy for two parts of the system for the same value of the parameter 0 is the most challenging task in designing real devices. Consider one example of such an implementation. Energy of liquid crystal interaction with an external electric field has the following form [18]:
= _ Ae(n ■ E)2 8n '
where E is electric field, n is the director vector, and Ae is anisotropy of permittivity. Let the electric field have fixed direction z and be proportional to the parameter 0. Let x be projection of the director n to z. Then the energy coincides with the considered above up to a constant. Sign of the constant is determined by Ae which can be both positive and negative depending on the exact choice of the liquid crystal. It means that by choosing Ae of opposite signs for two lobes x and x we will obtain opposite energies for the same value of the external field 0.
2.3. Contrastive divergence training
Contrastive divergence (CD) is probably the most popular method used for training EBMs. The method belongs to the family of gradient descent methods sharing the update rule:
n n , j- £ 9dkl
0 ^ 0 + nf f «—qT'
where n > 0 is the learning rate constant and f is an approximation to the steepest descent direction on the loss surface. The optimization algorithm here is a variant of stochastic gradient descent:
(1) Get a sample x from the training set.
(2) Sample x from the distribution p given for the current 0.
(3) Compute dissipation force f = dE(x; 0)/d0 — dE(x; 0)/d0.
(4) Update weights 0 ^ 0 + nf.
(5) Repeat.
To estimate the force f more accurately, one can sample values in batches and then average the force over all batches. Nevertheless, if the learning rate n is sufficiently small, the averaging of the generated force in time gives a good enough estimate of the expectation value to obtain convergence.
2.4. Metropolis-Hastings algorithm
For complex systems sampling values x from the PD p is generally difficult since the partition function Z is not known. It is common in this case to sample values using Metropolis-Hastings (MH) algorithm. In this case, values of x are not independent between steps, but rather form a Markov chain. To define the chain we need an auxiliary probability density g(y|x) called proposal density, which gives us a new candidate y given a previous value x. We will assume g symmetric: g(x|y) = g(y|x). The generation of x according to MH algorithm is done as follows:
(1) Sample y according to the distribution g(y|x) for the given previous value of x.
(2) Calculate the acceptance rate a = Z(y; 0)/Z(x; 0).
(3) Generate a uniform random number u e [0,1].
(4) If u < a, accept y as a new value of x.
(5) Otherwise let x preserve its value.
The optimization algorithm with MH sampling is the same as in the previous section, but the previous value of x is used to generate its new value. The emerging correlations between adjacent samples can increase spread of the optimization direction estimates f, which will require a decrease of the training rate constant to maintain convergence. For the benchmark below we took proposal density g(y |x) to be normal with the mean value y and standard deviation
0.5.
2.5. Langevin dynamics
Dynamics arising from the Monte-Carlo method described in the previous section is suitable to obtain correct estimates of expectation values, however it does not correspond to the real dynamics. In many cases, the dynamics is described by a variant Langevin equation, such as the stochastic Landau-Lifschitz-Gilbert (LLG) equation for magnetic systems. In the discretized form the Langevin equation for one variable can be written as follows:
m t ' dE(x4; 04) T .
xi+1 = x4 — n-VJ—- + W %
dx
where W4 has a normal distribution with zero mean and standard deviation \J2/n' and W4 for different t are independent. The distribution of x4 is the same Boltzmann distribution as above for the inverse temperature fi. Dynamics of the slow variable 0 are described by relaxation in the same way as above:
dE(x4; 04) dE(x4; 04)
0i+1 = 04 — nf4, f4
<90 <90
For the benchmark below we generated training samples x by Langevin equation with parameter 0 set to the target value 0. Time scale for the fast variable was defined by the ratio: n'/n = 500.
2.6. Benchmark
We compared performance of CD, dissipative learning with MH sampling and Langevin dynamics on the learning parameter 0 of the normal distribution introduced in Section 2.1. The initial value of parameter was set to 3, while the target value equals 1. We set learning rate to 2 • 10-4 and made 1.6 • 105 steps by all algorithm repeating the procedure 20 times. For each of the algorithms we averaged estimates of the parameter 0 over 1000 consecutive samples, and also estimated standard deviation of the estimates on the same intervals. The results are presented in Fig. 2.2. All the methods perform equally well on average having the same convergence rate. Correlations between adjacent samples in Monte-Carlo based method and in Langevin dynamics produce 3 — 4 times large variation of the estimate as expected. We can conclude that real-world physical dynamics is as suitable for training the ML model as a generally accepted CD method.
3. Hidden neurons
In Section 2, we described a general approach for turning arbitrary dissipative systems into a self-training machine. The class of generated PDs is defined by energy functional E. Although we have not imposed any restrictions on the energy (except of being bounded from below) and arbitrary PD can be generated under appropriate choice of the energy, in practice we have rather restricted choice of energies available for physical implementations. Standard BM is based on the Ising model, having the quadratic energy functional:
E[x] = 2 E WjkXjxfc + £ bjXj, (7)
jk j
with Xj representing the state of a spin j (playing the role of a neuron) taking values Xj = ±1. The parameters 6 = (w, b) consist of connection strengths wjk between neurons j and k and biases bj for every neuron j. The functional E allows to tune mean values and covariance of components of x, but it is impossible to obtain a nontrivial joint distributions of three and more neurons. More complex inter-dependencies between neurons can be obtained using the same quadratic energy functional, if we include new hidden DoF.
Suppose the variables X describe visible neurons, whose state is described by the training set, and which are output of the model. Introduce new DoF denoted y playing the role of hidden variables. Now the energy functional depends on x, y and 6. We do not impose any constraints on the energy functional, but for clarity of presentation we recall an example of such functional used in restricted BM (RBM):
E[x] = \Ej wjk Xj yk + £ aj Xj + £ bk yk, (8)
jk j k
where a and b are biases for visible and hidden neurons, respectively, and the interaction is nonzero only between units of different classes (visible, hidden, weight). The energy functional for RBM coincides with (7) with the vector x extended by y with additional constraints forbidding interaction between neurons of the same type. As in Section 2, we assume 6 to be slow variables, then for given 6 we have joint PD
p(x,y; 6) = Z-1Z(x,y), Z(x,y) = exp(-^E[x, y; 6]), Z = £ Z(x,y). (9)
The visible neurons state is described by the marginal PD:
p(x) = Ep(x, y) = Z-1Z(x), Z(x) = £ Z(x, y). (10)
y y
For example, the marginal distribution for RBM is notably easy to compute, since for a fixed x the components of y are independent. Easy algebra leads us to the result:
P(x|y) = Z-1 H e-^' x n (E e-"A3 + ' Aj = E Wjk yk- (11)
j k yj J k
Introducing sufficiently many hidden neurons, an arbitrary PD p(x) can be attained.
The training problem is to find parameters 6 such that the marginal distribution p(x; 6) minimizes distinction with a target PD p(x). The loss function can be defined as cross-entropy, which as we have seen above gives the same minimum as KL divergence, since entropy of p does not depend on 6 and does not affect the result:
H(p;p) = logp(x)] = - £p(x) logp(x).
x
The only change here compared to Section 2 is that p(x) is a marginal distribution. In virtue of (10),
H(p;p) = F p(x) log Z(x), F = £p(x) log Z, (12)
xx
where F is the Helmholtz free energy.
Now we compute the gradient of the loss with respect to parameters, which is used for the loss minimization. The derivative of the free energy is the same as in Equation (3):
dF ( ) dE(x,y) I"dE(x,y)'
x,y
For a fixed x the second addendum in (12) is a conditional expectation of the relaxation force:
d
log Z (X) = Z-- E ^ = -D E p(y|x) ^^ = -№..
x,yr^p
dE(x, y)
d6
(13)
x
where the conditional probability is given by:
p(y|x) =
p(x,y) Z (x,y)
p(x) Z (x)
Taking mean over the target PD we obtain the derivative over the second addendum:
d
— Ep(x) l°g Z(x) = -^Ep(x)P(y|x)
dE(x, y) ¡90
-ß Ex
Ey
dE(x, y)
<90
Finally, the loss derivative is mean difference between energy gradients averaged on the target and model distributions:
= ß I E
= ß (
Ey
dE (x, y)
<90
- Ex
dE (x, y)
¡90
(14)
The averaging over hidden variables y both times happens with respect to the model distribution p, hence we obtain the same result as in Section 2, but all values should be considered as values averaged over hidden DoF.
The system whose relaxation coincides with minimization of the loss can be constructed applying the same principles as in Section 2.
(1) The average over the ensemble is changed to the time average.
(2) Every addendum in (14) is generated by one of two copies of the system, which share weights 0 and exchanges, but has its own examples of x and y. The energies of the copies are chosen opposite.
(3) The variables x are controlled by external forces, which recreate distribution p.
(4) The dynamics of the weights 0 are dominated by relaxation. The weights must be slow variables, x and y are fast variables.
A schematic of the proposed device with hidden variables is shown in Fig. 2.b. The exact dynamics of the device is not important as soon as it guaranties the condition (4). For example, Monte-Carlo simulation results in the contrastive divergence method widely used for training RBM in ML. Langevin dynamics can be a more adequate method for real physical system simulation.
3.1. Monte-Carlo simulation
Correct distribution of variables x, y and y for a fixed 0 can be generated using Metropolis-Hastings algorithm. The approach does not take into account exact dynamics, but generates correct mean values of energy and energy gradients, which are the only values affecting the dynamics of slow variables 0. This method allows us to cover wide range of physical implementations of BM including magnetic in the approximation of the Ising model.
Here we focus on RBM, which energy is defined by (8). By definition visible and hidden neurons in RBM do not interact, therefore for a fixed hidden neurons state y the individual components of the visible neurons state x are independent and vice versa. In particular, conditional distribution of x given y is given by (11). This allows us to efficiently generate in the parallel manner samples of y given x and x given y. For example, the following algorithm is used to generate samples of x.
(1) Compute difference of energies of states that differs by single bit j:
Aj = 2^2 Ejkyk + 2aj. jk
(2) Compute conditional probability of xj = 1 given y:
Pj = P(xj = 1|y) =
1
(3)
(4)
1 + eA '
Generate vector £ of uniformly distributed values £j e [0,1].
Set component j of the generated vector xj to 1 if £j <= pj and to _1 otherwise.
The simulation is done step by step. We put the upper index t over all values for the time step t. The initial value of c0 is arbitrary. On each iteration all the values are updated as follows:
(1) xi+1 is sampled from the training set.
(2) Random yi+1 is generated by the above algorithm for the given xi+1 and 04.
(3) Random y4 is generated by the algorithm above for the given x4 and 04.
(4) Random xi+1 is generated by the algorithm above for given y4 and 04.
(5) The relaxation force is computed:
ft+1 = ¿E(xt+1,yt+1; 0) _ |0E(xi+1,yi+1; 0).
(6) Update the parameters:
(7) Repeat.
0i+1 = 0* + nft+1 + W
x
x
5 a W / < * i 3 t J 3 £ 1 ? 3- f 6 f / i Z i 3 1 T ** /
& A t £ 4 * * 1 1 r 1 A t £■ i s 3 0 A n A r 0 A H t M f 6
tf r 6 O a. t A 1 3 a 7. L J A 8 0 a i> 1 n A 0 ¥ £ A H i t 0
A t 3 i -7 i A I 6 3 O i- 1 3 i i o V A □ 0 a ? a 2 Z / I e ¥
1 4 * J A \ 3 3 A ¥ 7 z V ■Z- B f ? t, 9 a + & 1 A A i o 3 a 3-
t £ 9 H 6 M 1 7 A * i J- f A I c? 3 i 0 S A 1 "7 fc £ 8 z. JM f
O f 1 r 3. 1 / b / 0 J / D 0 t 1 EJ 1 3 0 1 b S 2 A
4 J 1 a ? i 0 a 3 r t t Z 3 ? isifloi T V O t ! £ fi J
2 5 a 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 t 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 9 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ■J 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 8
1 8 t t f * ¥ t * *f *f * H H H H Y H H *f V V H *r f f V Y ¥ H H
f f f 9 r T ? 1 T t ? ? ? ? 1 1 1 * 9 f f 9 t f t f f
1 1 f ? * * <t 9 * 9 4 4 4 H H 4 4 H H H 4 4 4 H H
Fig. 3. (Upper panel) First 256 images of MNISTdataset scaled to 14 x 14 and converted to black and white. (Lower panel) State of visible neurons of the RBM after 800 epoch of training shown on every 16-th simulation step. Images are listed from left to right and from top to down.
The parameter update on step (6) is done simulating Langevin dynamics for slow variables 6. The "learning rate" parameter n should be small enough to ensure convergence, which in physical terms means that dynamics of 6 must be much slower than of fast variables x and y. The addendum W is a random noise, whose amplitude is proportional to the temperature T. Since the smallest value of the error is reached at minimum of the energy, the temperature should not be large. On the other hand, small variations of 6 due to the noise allows one to avoid overfitting and can be beneficial.
It is worth noting that the presented algorithm is called contrastive divergence in ML, if no noise is appended. The algorithm with the noise was earlier proposed in [15], where it was shown that it demonstrates better convergence than the CD. In most implementations of the CD the value xi+1 is sampled using conditional distribution p(x|yi+1). In our approach, dynamics of x and x are not directly connected, which allows us to estimate positive and negative parts of the loss gradient independently in two copies of the system.
The most difficult part of the physical implementation of the training algorithm is to obtain both positive and negative parts of the gradient as relaxation forces. In Section 2, the problem is almost non-existent, since the state of X is externally driven and does not depend on 6, therefore, to obtain negative gradient, one can change ferromagnetic exchange to antiferromagnetic leading to negation of the energy. The trick however does not work, if there are hidden variables. Indeed, change of the exchanges energy landscape and therefore, PD of the hidden variables, which invalidates mean values of the gradient.
Pure mathematically the positive and negative terms can be considered an adversarial learning procedure, where discriminator time-flow is reversed [19]. Effective reversed time can appear in quantum mechanics [20], but it is hard to use in practice. Fortunately, the same negative gradient can be achieved by formal change of sign of D. Negative temperatures of spin systems were reported in a number of works [21-23]. Although the statistical physics for negative temperatures is studied for a long time [24,25], a practical realization of local negative temperatures is extremely challenging.
Another possible strategy for generation of negative gradients is to shield hidden variables from the thermostat allowing its interaction only with weights and visible neurons. In the case dynamics of the hidden state y is determined exclusively by x and 6. Therefore the change of sign of all interaction constants negates energy and its gradient, but does not affect PD of y. Following this approach the self-trained system should contain two copies of x and y, with ferromagnetic exchange in one copy and antiferromagnetic in another, both interacting with the same weights 6.
One more strategy is based on reformulation of the optimization problem, abandoning the dynamics defined by (14). Considering the loss minimization method as an iterative procedure with the fixed point p = p, other energy functionals can be proposed, whose relaxation procedure has the same property. For example, let the hidden variables be shared by two parts of the system y = y. Define the total energy
E = —(x — X) • wy — (x — X) • a.
Here X is defined by external impulses and x and y are subjected to Langevin dynamics. If the distributions of x and X coincide then the system is in thermal equilibrium. The relaxation with respect to w and a can be used to adjust the PD of x, but the error measure is no longer KL divergence, and in practice leads to much worse results than the CD method. A practical choice for implementation of the negative gradient in presence of hidden variables is still an open question.
3.2. MNIST digits generation
We benchmarked performance of the proposed approach for implementation of RBM on generation of images of hand-written digits trained in MNIST dataset. MNIST dataset [26] is split to the training set of 60 000 and the test set of 10 000 gray scale images of ten different digits together with labels, which digit the image represents. We downsampled the images to size 14 x 14 and converted to black and white, considering the pixel black, if its luminosity is smaller than 90% of the maximum. All 60 000 images were used for the training, setting a new image to x on each iteration. The beginning of the dataset is shown in the upper panel in Fig. 3.2.
The images were flattened to vectors of 196 visible neurons states, and 1000 more hidden neurons were introduced. Looping over all 60 000 images is considered one epoch, the training took 800 epoch in total. The training step was set n = 10-4. To track convergence we calculated mean _ log Z(x; 0) over samples x taken from the test set for 0 obtained on the corresponding epoch. This value is an estimation of Ex^p[_ logp(x; 0)], which is different from the cross entropy by the absence of the free energy term log Z. The applied training method does not compute free energy directly, which makes it fast, but this also prevents us from accurate estimation of the loss function. The obtained metrics are not reliable, if a lot of noise is introduced in the parameters 0 on each iteration, showing rapidly decreasing loss, while real cross entropy can increase. However, in our experience the free energy does not vary significantly close to the minimum, and the metric can be used for comparison of different methods. The success of the convergence was checked by inspection of the generated images, which indeed showed close resemblance with real hand-written digits.
We run two numerical experiments: one without noise and another with normally distributed W with amplitude 0.001. The generated PD in both cases converges to the target distribution. In the presence of the noise the convergence was smooth, while without noise the loss decrease happens in step-like manner stagnating between the steps. Overall convergence rate is higher without noise. While trained BM in both cases produces correctly looking digits, the BM obtained by training without loss demonstrates faster switching between digits classes, that is smaller correlation between adjacent states.
The dynamics of the dissipative BM trained without noise is shown in Fig. 3.2 (bottom panel). The BM started from a random state, then after a burnout period of 16 steps. We recorded 256 states of the visible neurons skipping every 15 intermediate steps. The generated images reconstruct distribution of pixel colors matching MNIST dataset, but adjacent images are correlated resulting in smooth change of the digit shape. Visual inspection confirms that the dissipative BM correctly generalizes the training set producing images indistinguishable from those produced by humans.
4. BM solving classification problem
In previous sections, we considered generative models producing random values w with a learned distribution p(x). For classification problems, it is often convenient to have a discriminative model, which produce random values of labels y distributed according to a learned conditional probability p(y |x) given an observation of features x. This discriminative model can be implemented on top of BM with some modifications required.
The BM energy in the case is a function of features x, labels y and weights 0. The joint PD of x and y is given by Boltzmann distribution (9). In contrast to Section 3 both x and y are states of visible neurons. As shown above, addition of hidden neurons allows complex PDs, but in all formulas they are averaged out, therefore we do not list the hidden neurons explicitly in this section.
Given target distribution of labels p(y |x) for fixed features values x, the model distribution can be compared with it using cross-entropy:
H (p,p|x) = p(y|x) log p(y|x) = Ey_p[- log p(y|x)|x].
y
The loss function is commonly defined as mean of the cross-entropy over all x:
L = ^2 p(x)H (p, p|x) = p(x,y)log p(y|x) = Ex,y~p-[_ log p(y|x)].
x x,y
Noticing (see Eq. (9)) that
p(y|x) = Z (x)-1Z (x,y), the loss function can be written in the following form:
L = ^p( x, y)E(x, y) + ^p(x, y) log Z(x).
x,y x,y
The first term here is the mean energy with respect to the target distribution. The second addendum is the Helmholtz free energy for a fixed x averaged over x. Consider the gradient of the loss over parameters 0. Since p does not depend on 0, the averaging over p can be swapped with differentiation over 0, which gives us the derivative of the first term. Using
Precision Recall ---Accuracy
25 50 75 Epoch
100
Fig. 4. Results of classification of hand-written digits from MNIST dataset. (Left panel) Confusion matrix on the test data after 100 epoch of training. (Right panel) Convergence history for accuracy and averaged over all classes precision and recall, demonstrating good agreement of all metrics. Metrics on the test set (red lines) and the training set (black lines) are almost identical.
earlier computed in (13) derivative of log Z(x), we obtain the loss derivative:
dL n I v-^ -r-^ . dE(x, y')
= p I 2^pp(x,y)—go--^p(x,y)2_^p(y lx)_
x,y x,y
dO
dO
= pe:
'x~p
E
yr^p
dE (x, y)
dO
- E,
yr p
dE(x, y)
dO
The argument of the outermost expectation value is exactly the derivative over O of the cross-entropy H(p,p) obtained in Section 2 assuming x fixed. Therefore, the discriminative model differs from the generative model in the way we treat the features state x: for the discriminative model x is always distributed according to the target distribution p.
The physical system whose dynamics coincides with the minimization of loss procedure, is designed using the principles stated in Section 2.
(1) The parameters O are treated as slow DoF of the system.
(2) While the state x = x is always driven by external forces defining target PD, a copy y of y is introduced in such a way that interactions between x, y, O are the same as for x, y, O, except for the opposite sign.
(3) Assuming ergodicity the expectation values are approximated by the time averages.
(4) The energy derivatives are treated as relaxation forces for the weight O.
The proposed schematics of the self-training device is shown in Fig. 2.c. Its part involving labels variables y, y is essentially the same as the schematic of the generative dissipative BM shown in Fig. 2.a, but now we have additional externally driven DoF encoding target distribution for the features x.
4.1. Digits classification benchmark
To validate the capability of the proposed method to solve complex problems, we do classification of hand-written digits from MNIST dataset using the dissipative BM. For the test, we used full size 28 x 28 black and white images, where the color was encoded in z projection of the spin: 1 for pure white and -1 for pure black pixel. 60 000 samples were used for training and 10 000 for testing. One loop over all training samples is considered an epoch. The training procedure took 100 epoch.
The state of the used BM consists of the features vector x, the labels vector y and the hidden units vector z. The size of the feature vector equals 784 and matches the number of pixels of the image. The labels vector y has 10 components, each component encodes the probability of the image to belong to the corresponding class, that represents one of ten digits. Both x and y compose the vector of visible neurons. For the benchmark, we set the number of hidden neurons to 200.
The energy of BM is defined by the functional:
E(x, y, z; i
E
wxzxjZfc + w
yz
yj zk
+E 6xxi+E 6y y + E 6z
x
x
z
0 25 50 75 Epoch
100
0
25 50 75 Epoch
100
Fig. 5. Convergence history for precision (left panel) and recall (right panel) for individual classes computed on the test set.
where interaction matrices w and biases b form parameters vector 6 = (wxz,wyz, bx,by, bz). The proposed dissipative BM contains two copies of x, y and z; we mark the second copy by tilde above the letter. The total energy of spin BM consists of energies of the two copies sharing the same parameters:
ET = E(x, y, z; 6) - E(X, y, 5; 6).
The possible strategies to negate energy of the second contribution are discussed in Section 3. During the training X and y are controlled by the external forces specifying training samples. The variables x, y, z and 5 are distributed according to Boltzmann distribution. The dynamics was approximated by Monte-Carlo sampling of fast variables and relaxation dynamics of slow variables 6 according to the algorithm:
(1) Sample new X, y from the training set.
(2) Sample 5 according to (11).
dET
(3) Update 6 ^ 6 - n-—.
d6
(4) Repeat.
To reduce jitter we sample all variables in batches of 10 elements, that correspond to the physical system with 10 independent copies of each subsystem sharing the same weights.
The simulation was run on hand-written Python code using Numpy and JAX for GPU acceleration. The relaxation rate n was set to 10-4. Initial weights 6 were randomly sampled from the normal distribution with zero mean and standard deviation 0.1. Results of the training are shown in Fig. 4.1 and Fig. 4.1.
After 100 epoch we reached ~ 73% accuracy and very close values of precision and recall averaged over all classes. Although the metrics are much smaller than attainable by other methods, the result is comparable with the result of RBM with 200 hidden units. Increasing the number of hidden units and changing architecture of the machine including more hidden layers can significantly improve the result [15]. In the article, our main concern was comparison of the performance classical RBM and the proposed adaptation suitable for nanomagnetic implementation.
Confusion matrix shown in the left panel in Fig. 4.1 demonstrates that the most errors arise in misclassification of 9 as 4, 3 as 5 or 8, which is not much different from the errors of other methods. The same errors are confirmed by the convergence history plots shown in Fig. 4.1, pointing out that digits 4 and 5 are most confused with other digits. The metrics on the training set and on the test set are very close with precision and recall of individual digits continue to grow with every epoch, indicating the prolongation of training can lead to even better results.
Since in the proposed dissipative BM subsystem generating positive and negative update steps for the weights are generated by independent subsystems, the time required for averaging the contributions is larger than in CD method. In the BM, we are forced to use smaller training steps than in CD, therefore, convergence of simulated SBM is slower. In real magnetic nanosystems, the natural time scale defined by Larmor precession is orders of magnitude smaller than obtainable in simulation. Despite a slower convergence rate in simulation, the physical implementation of the BM is expected to be drastically faster than the ML model of BM used currently.
5. Discussion
Above, we proposed several approaches to implement energy-based ML models as physical devices. In particular, nanomagentic-devices are a natural candidate for the implementation of BM. Two problems are not completely solved and remain challenging for experimental implementation of the devices. The first problem is the necessity of three spin interactions between two neurons and the connection weight. Multispin connections in some cases give significant contribution to the total energy [27], but at the moment, it is not clear how to design magnetics with the desired multi-spin interactions. Probably more promising is creation of an effective multispin exchange by introducing auxiliary spins and averaging over them in the spirit of the work [28].
The second challenging problem is learning the weights of the hidden spins by relaxation. In this case, the introduction of lobes with opposite energies also affects the PD of the hidden neurons. The solution here can lay in avoidance of the hidden neurons interaction with the thermostat, and considering stochastic LLG dynamics of the spins, instead of Fokker-Planck equation. At least a similar approach allowed one to create a simple self-learning device experimentally [17].
References
[1] Min B., Ross H., Sulem E., et al. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey. ACM Computing Surveys, 2023, 56(2), P. 1-40.
[2] Chen Y., Nazhamaiti M., Xu H., et al. All-analog photoelectronic chip for high-speed vision tasks. Nature, 2023, 623(7985), P. 48-57.
[3] Grollier J., Querlioz D., Camsari K.Y., et al. Neuromorphic spintronics. Nature Electronics, 2020, 3(7), P. 360-370.
[4] Li S., Kang W., Zhang X., et al. Magnetic skyrmions for unconventional computing. Materials Horizons, 2021, 8(3), P. 854-868.
[5] Yokouchi T., Sugimoto S., Rana B., et al. Pattern recognition with neuromorphic computing using magnetic fieldinduced dynamics of skyrmions. Science Advances, 2022, 8(39), P. eabq5652.
[6] Gu K., Guan Y., Hazra B.K., et al. Three-dimensional racetrack memory devices designed from free- standing magnetic heterostructures. Nature Nanotechnology, 2022, 17(10), P. 1065-1071.
[7] Blasing R., Khan A.A., Filippou P.C., et al. Magnetic Racetrack Memory: From Physics to the Cusp of Applications Within a Decade. Proceedings of the IEEE, 2020, 108(8), P. 1303-1321.
[8] Marullo C., Agliari E. Boltzmann Machines as Generalized Hopfield Networks: A Review of Recent Results and Outlooks. Entropy, 2020, 23(1), 34 pp.
[9] Khater A., Abou Ghantous M. Magnetic properties of 2D nano-islands I: Ising spin model. Journal of Magnetism and Magnetic Materials, 2011, 323(22), P. 2717-2726.
[10] Yang S., Shin J., Kim T., et al. Integrated neuromorphic computing networks by artificial spin synapses and spin neurons. NPG Asia Materials, 2021,13(1), P. 1-10.
[11] Dixit V., Selvarajan R., Alam M.A., et al. Training Restricted Boltzmann Machines With a D-Wave Quantum Annealer. Frontiers in Physics, 2021, 9,589626, 10 pp.
[12] Lecun Y., Chopra S., Hadsell R., et al. A tutorial on energy-based learning. In G. Bakir, T. Hofman, B. Scholkopt, A. Smola, B. Taskar, editors, Predicting structured data. MIT Press. 2006.
[13] Dayan P., Hinton G.E., Neal R.M., et al. The Helmholtz Machine. 1999. Neural Comput., 1995, 7(5), P. 889-904.
[14] Teh Y.W., Welling M., Osindero S., et al. Energy-Based Models for Sparse Overcomplete Representa- tions. Journal of Machine Learning Research, 2003, 4, P. 1235-1260.
[15] Du Y., Mordatch I. Implicit Generation and Modeling with Energy Based Models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
[16] Lobanov I. Spin Boltzmann machine. Nanosystems: Physics, Chemistry, Mathematics, 2022,13(6), P. 593-607.
[17] Kiraly B., Knol E.J., Van Weerdenburg W.M.J., et al. An atomic Boltzmann machine capable of self- adaption. Nature Nanotechnology, 2021, 16(4), P. 414-420.
[18] AksenovaE., Dobrun L., Kovshik A., et al. Magnetic Field-Induced Macroscopic Alignment of Liquid- Crystalline Lanthanide Complexes. Crystals, 2019, 9(10), P. 499.
[19] Yair O., Michaeli T. Contrastive Divergence Learning is a Time Reversal Adversarial Game. URL/arXiv: https://doi.org/10.48550/arXiv.2012.03295.
[20] Colombo S., Pedrozo-Peafiel E., Adiyatullin A.F., et al. Time-reversal-based quantum metrology with many-body entangled states. Nature Physics, 2022, 18(8), P. 925-930.
[21] Purcell E.M., Pound R.V. A Nuclear Spin System at Negative Temperature. Physical Review, 1951, 81(2), P. 279-280.
[22] Hakonen P., Lounasmaa O.V. Negative Absolute Temperatures: "Hot" Spins in Spontaneous Magnetic Order. Science, 1994, 265(5180), P. 18211825.
[23] Medley P., Weld D.M., Miyake H., et al. Spin Gradient Demagnetization Cooling of Ultracold Atoms. Physical Review Letters, 2011, 106(19), 195301, 4 pp.
[24] Ramsey N.F. Thermodynamics and Statistical Mechanics at Negative Absolute Temperatures. Physical Review, 1956, 103(1), P. 20-28.
[25] Baldovin M., Iubini S., Livi R., et al. Statistical mechanics of systems with negative temperature. Physics Reports, 2021, 923, P. 1-50.
[26] Deng L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 2012, 29(6), P. 141-142.
[27] Hoffmann M., Blgel S. Systematic derivation of realistic spin models for beyond-Heisenberg solids. Physical Review B, 2020, 101(2), P. 024418.
[28] Yoshioka N., Akagi Y., Katsura H. Transforming generalized Ising models into Boltzmann machines. Physical Review E, 2019, 99(3), P. 032113.
Submitted 10 October 2023; revised 11 November 2023; accepted 7 December 2023
Information about the authors:
IgorS. Lobanov - Faculty of Physics, ITMO University, Lomonosova Str. 9, Saint Petersburg, 191002 Russia; ORCID 0000-0001-8789-3267; [email protected]; [email protected]