Contributions to Game Theory and Management, XII, 273-281

Optimal Incentive Strategy in a Discounted Stochastic Stackelberg Game*

Dmitry B. Rokhlin and Gennady A. Ougolnitsky

I.I. Vorovich Institute of Mathematics, Mechanics and Computer Sciences of Southern Federal University, 8a, Milchakova, Rostov-on-Don, Russia
E-mail: dbrohlin@sfedu.ru
E-mail: gaugolnickiy@sfedu.ru

* The research is supported by the Russian Science Foundation, project 17-19-01038.

Abstract. We consider a game where the manager's (leader's) aim is to maximize the gain of a large corporation by distributing funds between m producers (followers). The manager selects a tuple of m non-negative incentive functions, and the producers play a discounted stochastic game, which results in a Nash equilibrium. The manager's aim is to maximize her related payoff over the class of admissible incentive functions. It is shown that this problem reduces to a Markov decision process.

1. Introduction

The problem of incentives plays a key role in economics and management. Its mathematical formalization is provided by the theory of incentives (Laffont and Martimort, 2002), mechanism design (Myerson, 1983), and the theory of control in organizational systems (Novikov, 2013). However, the majority of the respective problem formulations are static.

A natural dynamic incentive model is provided by dynamical inverse Stackelberg games, where the leader's strategy depends on the followers' actions (an incentive mechanism). A general review can be found in (Olsder, 2009). In the paper (Rokhlin and Ougolnitsky, 2018) (inspired by Novikov and Shokhina, 2003) we formalized the incentive problem as a stochastic inverse Stackelberg game and obtained a simple description of the leader's optimal strategy in the case of a single follower. In the present paper we extend this result to the case of multiple followers.

Consider a game where the manager's (leader's) aim is to maximize the gain of a large corporation by the distribution of funds between $m$ producers (followers). To each follower the leader reports a non-negative stimulating (incentive) function $c^i(x,a)$, depending on the state of the system $x$ (e.g., the market price of the produced good) and the actions $a=(a^1,\dots,a^m)$ of the producers (e.g., the production levels). At each stage of the game the producers select their actions $a_t^i$ independently and get the rewards $r^i(x_t,a_t)=c^i(x_t,a_t)-g^i(x_t,a_t)$, where $g^i$ are the production costs. The one-stage gain of the manager, or the corporation, equals $f(x_t,a_t)-\sum_{i=1}^m c^i(x_t,a_t)$, where $f$ can be regarded as the sales revenue. The dynamics of the state $x_t$ is governed by a transition kernel $q$: informally, $\mathsf P(x_{t+1}\in B\,|\,x_t,a_t)=q(B|x_t,a_t)$.

Each player's gain is evaluated over the infinite horizon with a common discount factor $\beta$. So,

$$\mathsf E\sum_{t=0}^\infty \beta^t\Bigl( f(x_t,a_t)-\sum_{i=1}^m c^i(x_t,a_t)\Bigr)\to\max$$

is the objective functional of the leader, and

$$\mathsf E\sum_{t=0}^\infty \beta^t\bigl( c^i(x_t,a_t)-g^i(x_t,a_t)\bigr)\to\max$$

are the objective functionals of the followers. For each tuple $(c^1,\dots,c^m)$ the pool of producers responds by a Nash equilibrium in the corresponding discounted stochastic game. The leader maximizes her objective functional over the incentive functions $c^i$ from an appropriate class. From the previous work (Novikov and Shokhina, 2003; Rokhlin and Ougolnitsky, 2018) it is known that it is optimal for the leader to economically motivate the followers to implement the strategies $a_t^i=u^i(x_t)$, where $u=(u^1,\dots,u^m)$ is an optimal stationary deterministic Markov strategy in the Markov decision problem where the production costs are attributed to the leader (problem (10) below).

Passing to the case of multiple followers raises some technical difficulties related to the existence of a stationary Markov equilibrium. To overcome these issues we modify the class of incentive functions considered in (Rokhlin and Ougolnitsky, 2018) to make them continuous in actions. Furthermore, we confine ourselves to games with a coarser transition kernel (He and Sun, 2017). Other related assumptions on the transition kernel $q$ providing the existence of a stationary Markov equilibrium (see Jaskiewicz and Nowak, 2018) would also suffice.

In Section 2 we give a general formal description of a discounted stochastic game and a Markov decision process. In Section 3 we use this formalism to describe precisely an $\varepsilon$-optimal strategy of the leader and her value function in our model, formulated as a Stackelberg game: see Theorem 4. In two final remarks we compare this theorem with the results of (Rokhlin and Ougolnitsky, 2018), and mention that the technical coarser transition kernel condition can be dropped by passing to a correlated equilibrium.

2. Basics of discounted stochastic games

Let $(\Omega,\mathcal F)$ be a measurable space, and let $(Y,\tau)$ be a topological space. A function $F:\Omega\times Y\to\mathbb R$ is called a Carathéodory function if the function $F(\cdot,y)$ is $\mathcal F$-measurable for each $y\in Y$ and the function $F(\omega,\cdot)$ is $\tau$-continuous for each $\omega\in\Omega$ (Aliprantis and Border, 2006, Definition 4.50). If $(Y,\tau)$ is a separable metrizable space, then any Carathéodory function $F$ is jointly measurable (Aliprantis and Border, 2006, Lemma 4.51). Denote by $C_b(\Omega\times Y)$ the set of uniformly bounded Carathéodory functions. Also, recall that a standard Borel space is a measurable space isomorphic to a Borel subset of a Polish space (a separable, completely metrizable topological space): see (Srivastava, 1998).

Let I = {1,...,m} be the set of players. The discounted stochastic game is determined by

— A standard Borel state space $(X,\mathcal B(X))$ with its Borel $\sigma$-algebra $\mathcal B(X)$.

— Separable metrizable spaces $(A^i,\tau^i)$, $i\in I$, of players' actions.

— Compact-valued mappings $x\mapsto A^i(x)\subset A^i$. The set $A^i(x)$ describes the admissible actions of the $i$-th player in the state $x\in X$. It is assumed that the multivalued mappings $x\mapsto A^i(x)$ are measurable (Hu and Papageorgiou, 1997, Chapter 2, Definition 1.1), that is, $\{x\in X: A^i(x)\cap U\neq\emptyset\}\in\mathcal B(X)$ for any open set $U\subset A^i$.

— Reward functions $r^i\in C_b(X\times A)$, where $A=A^1\times\dots\times A^m$ is endowed with the product topology $\tau$.

— A transition probability $Q(\cdot|\cdot)$ from $X\times A$ to $X$ (Bogachev, 2007, Definition 10.7.1), that is,

• the function $(x,a)\mapsto Q(B|x,a)$ is $\mathcal B(X)\otimes\mathcal B(A)$-measurable for every $B\in\mathcal B(X)$,

• the function $B\mapsto Q(B|x,a)$ is a probability measure on $\mathcal B(X)$ for every $(x,a)\in X\times A$.

It is assumed that the function $a\mapsto\int_X w(y)\,Q(dy|x,a)$ is continuous for any $x\in X$ and any bounded Borel measurable function $w$ on $X$.

— A discount factor $\beta\in[0,1)$.

We assume that the players use stationary Markov strategies, which can be identified with transition probabilities $\sigma^i$ from $X$ to $A^i$ such that $\sigma^i(x)(A^i(x))=1$. For $x\in X$ each tuple $\sigma=(\sigma^i)_{i\in I}$ induces the probability measure

$$\mathsf P_{x,\sigma}(dx_0\,da_0\dots dx_t\,da_t)=\delta_x(dx_0)\prod_{i\in I}\sigma^i(x_0)(da_0^i)\times Q(dx_1|x_0,a_0)\dots Q(dx_t|x_{t-1},a_{t-1})\prod_{i\in I}\sigma^i(x_t)(da_t^i) \qquad (1)$$

on the space of sequences $(x_t,a_t)_{t\in\mathbb Z_+}$, $(x_t,a_t)\in X\times A$, endowed with the product $\sigma$-algebra.

The expected discounted payoff of player $i$ equals

$$J^i(x,\sigma)=\mathsf E_{x,\sigma}\sum_{t=0}^\infty \beta^t r^i(x_t,a_t). \qquad (2)$$

A tuple $\sigma^*=(\sigma^{*,i})_{i\in I}$ is called a Nash equilibrium if

$$J^i(x,\sigma^*)\ge J^i(x,(\sigma^i,\sigma^{*,-i})),\qquad i\in I,$$

for any strategies $\sigma^i$ and any $x\in X$. We use the standard notation $\sigma^{-i}=(\sigma^j)_{j\in I\setminus\{i\}}$. Formally, a Markov decision process is a stochastic game involving a single player ($m=1$). Dropping the player index, we write the payoff functional (2) as follows:

$$J(x,\sigma)=\mathsf E_{x,\sigma}\sum_{t=0}^\infty \beta^t r(x_t,a_t).$$

Denote by $v(x)=\sup_\sigma J(x,\sigma)$ the related value function.

Theorem 1. For the described Markov decision process the following assertions hold true:

(i) $v$ is the unique solution of the Bellman equation

$$v(x)=\sup_{a\in A(x)}\Bigl\{ r(x,a)+\beta\int_X v(y)\,Q(dy|x,a)\Bigr\}. \qquad (3)$$

(ii) There exists an optimal strategy $\sigma^*(x)(da)=\delta_{u^*(x)}(da)$, which can be identified with a Borel measurable selector

$$u^*(x)\in\arg\max_{a\in A(x)}\Bigl\{ r(x,a)+\beta\int_X v(y)\,Q(dy|x,a)\Bigr\}.$$

(iii) If $\sigma^*$ is an optimal strategy: $v(x)=J(x,\sigma^*)$, then

$$v(x)=\int_{A(x)}\Bigl( r(x,a)+\beta\int_X v(y)\,Q(dy|x,a)\Bigr)\,\sigma^*(x)(da). \qquad (4)$$

The proof of (i), (ii) can be found, e.g., in (Himmelberg et al., 1976; Hu and Papageorgiou, 2000, Section VII.2). The relation (4) is known as the dynamic programming principle or the "fundamental equation" (Dynkin and Yushkevich, 1979, Chapter 6).
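To make assertions (i)-(ii) concrete, the following minimal sketch performs value iteration for a finite-state, finite-action instance of the Bellman equation (3); the function and the toy data are purely illustrative and not part of the paper's model.

```python
import numpy as np

def value_iteration(r, Q, beta, tol=1e-10, max_iter=10_000):
    """Value iteration for the Bellman equation (3) on a finite MDP.

    r[x, a]    : one-stage reward, shape (n_states, n_actions)
    Q[x, a, y] : transition probabilities, shape (n_states, n_actions, n_states)
    beta       : discount factor in [0, 1)
    Returns the value function v and a greedy selector u, cf. assertion (ii).
    """
    n_states, _ = r.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Right-hand side of (3): r(x, a) + beta * sum_y v(y) Q(y | x, a)
        q_values = r + beta * Q @ v            # shape (n_states, n_actions)
        v_new = q_values.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    u = (r + beta * Q @ v).argmax(axis=1)       # measurable selector u*(x)
    return v, u

# Toy instance with made-up data: 3 states, 2 actions.
rng = np.random.default_rng(0)
r = rng.uniform(size=(3, 2))
Q = rng.uniform(size=(3, 2, 3))
Q /= Q.sum(axis=2, keepdims=True)               # normalize to probabilities
v, u = value_iteration(r, Q, beta=0.9)
```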

If $(\sigma^{*,i})_{i\in I}$ is a Nash equilibrium in an $m$-player game, $m>1$, then each $\sigma^{*,i}$ is an optimal solution of the problem

$$J^i(x,(\sigma^i,\sigma^{*,-i}))\to\max_{\sigma^i}. \qquad (5)$$

A simple calculation shows that

$$J^i(x,(\sigma^i,\sigma^{*,-i}))=\mathsf E_{x,\sigma^i,\sigma^{*,-i}}\sum_{t=0}^\infty\beta^t\, r^i(x_t,a^i_t;\sigma^{*,-i}),$$

where

$$r^i(x,a^i;\sigma^{*,-i})=\int_{A^{-i}} r^i(x,a^i,a^{-i})\,\sigma^{*,-i}(x)(da^{-i}) \qquad (6)$$

and the expectation $\mathsf E_{x,\sigma^i,\sigma^{*,-i}}$ is taken with respect to the measure generated by the transition probabilities

$$Q_{\sigma^{*,-i}}(B|x,a^i)=\int_{A^{-i}} Q(B|x,a^i,a^{-i})\,\sigma^{*,-i}(x)(da^{-i}) \qquad (7)$$

and by the strategy $\sigma^i$ on the space of sequences $(x_t,a^i_t)_{t\in\mathbb Z_+}$, $(x_t,a^i_t)\in X\times A^i$, in the same way as in (1).

It follows that (5) is a Markov decision process satisfying the assumptions of Theorem 1. Let $V_{\sigma^{*,-i}}(x)=\sup_{\sigma^i} J^i(x,(\sigma^i,\sigma^{*,-i}))$ be the related value function. Since $\sigma^{*,i}$ is an optimal solution: $V_{\sigma^{*,-i}}(x)=J^i(x,(\sigma^{*,i},\sigma^{*,-i}))$, from the optimality principle (4) and the Bellman equation (3) we get

$$V_{\sigma^{*,-i}}(x)=\int_{A^i(x)}\Bigl( r^i(x,a^i;\sigma^{*,-i})+\beta\int_X V_{\sigma^{*,-i}}(y)\,Q_{\sigma^{*,-i}}(dy|x,a^i)\Bigr)\sigma^{*,i}(x)(da^i)$$
$$\ge r^i(x,a^i;\sigma^{*,-i})+\beta\int_X V_{\sigma^{*,-i}}(y)\,Q_{\sigma^{*,-i}}(dy|x,a^i),\qquad a^i\in A^i(x). \qquad (8)$$

For fixed $x$ and $(\sigma^{*,i})_{i\in I}$ consider the one-shot game $\Gamma(x,\sigma^*)$ on $A^1(x)\times\dots\times A^m(x)$, where the payoff of the $i$-th player equals

$$H^i(x,a)= r^i(x,a^i;\sigma^{*,-i})+\beta\int_X V_{\sigma^{*,-i}}(y)\,Q_{\sigma^{*,-i}}(dy|x,a^i).$$

From (8) it follows that $(\sigma^{*,i}(x))_{i\in I}$ is a mixed strategy Nash equilibrium in the game $\Gamma(x,\sigma^*)$. So, we have proved the following well-known result: see (Jaskiewicz and Nowak, 2018; He and Sun, 2017) for similar statements.

Theorem 2. For a discounted stochastic game with payoff functionals (2) and stationary Markov strategies $(\sigma^{*,i})_{i\in I}$ the following conditions are equivalent.

(i) $(\sigma^{*,i})_{i\in I}$ is a Nash equilibrium.

(ii) Each $\sigma^{*,i}$, $i\in I$, is an optimal solution of the Markov decision process with the objective functional (5), transition kernel (7), and reward (6).

(iii) $(\sigma^{*,i}(x))_{i\in I}$ is a mixed strategy Nash equilibrium in the game $\Gamma(x,\sigma^*)$ for each $x\in X$.
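Condition (ii) also yields a direct numerical check of whether a given pure stationary profile is a Nash equilibrium: freeze the opponents, solve the induced Markov decision process with reward (6) and kernel (7), and compare its value with the value of the candidate strategy. Below is a sketch for a finite two-player game; the array layout and tolerances are illustrative assumptions, not part of the paper.

```python
import numpy as np

def is_nash(u, r, Q, beta, tol=1e-8):
    """Check condition (ii) of Theorem 2 for a finite two-player game.

    u[i][x]          : candidate pure stationary action index of player i in state x
    r[i][x, a0, a1]  : one-stage reward of player i
    Q[x, a0, a1, y]  : transition probabilities over joint actions
    """
    n_states = Q.shape[0]
    for i in (0, 1):
        # Induced single-player MDP (reward (6), kernel (7)): the opponent plays u.
        n_actions = Q.shape[1 + i]
        if i == 0:
            r_i = np.array([[r[0][x, a, u[1][x]] for a in range(n_actions)]
                            for x in range(n_states)])
            Q_i = np.array([[Q[x, a, u[1][x]] for a in range(n_actions)]
                            for x in range(n_states)])
        else:
            r_i = np.array([[r[1][x, u[0][x], a] for a in range(n_actions)]
                            for x in range(n_states)])
            Q_i = np.array([[Q[x, u[0][x], a] for a in range(n_actions)]
                            for x in range(n_states)])
        # Best-response value by value iteration (Theorem 1).
        v = np.zeros(n_states)
        for _ in range(10_000):
            v_new = (r_i + beta * Q_i @ v).max(axis=1)
            if np.max(np.abs(v_new - v)) < tol:
                v = v_new
                break
            v = v_new
        # Value of actually following u[i]: policy evaluation.
        r_u = r_i[np.arange(n_states), u[i]]
        P_u = Q_i[np.arange(n_states), u[i]]
        v_u = np.linalg.solve(np.eye(n_states) - beta * P_u, r_u)
        if np.max(v - v_u) > 1e-6:   # a profitable deviation exists
            return False
    return True
```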

There are several additional assumptions that ensure the existence of a stationary Markov equilibrium: see the survey (Jaskiewicz and Nowak, 2018). We rely on the results of (He, 2014; He and Sun, 2017). Assume that

(A) $Q(\cdot|x,a)$ is absolutely continuous with respect to a probability measure $\lambda$ on $(X,\mathcal B(X))$ for each $(x,a)\in X\times A$. The related density $(x,a,y)\mapsto q(y|x,a)$ is assumed to be $\mathcal B(X)\otimes\mathcal B(A)\otimes\mathcal B(X)$-measurable. Here $\mathcal B(A)$ is the Borel $\sigma$-algebra of the topological space $(A,\tau)$.

(B) For each $x\in X$ the mapping $a\mapsto Q(\cdot|x,a)$ is continuous in the total variation norm:

$$\lim_{a_n\to a}\ \sup_{B\in\mathcal B(X)} |Q(B|x,a_n)-Q(B|x,a)|=0.$$

Consider a probability space $(\Omega,\mathcal F,\mathsf P)$. A sub-$\sigma$-algebra $\mathcal G\subset\mathcal F$ is said to be setwise coarser (He, 2014) than $\mathcal F$ if for every $D\in\mathcal F$ with $\mathsf P(D)>0$ there exists a set $D_0\subset D$, $D_0\in\mathcal F$, such that $\mathsf P(D_0\,\triangle\, D_1)>0$ for any $D_1\in\{D'\cap D: D'\in\mathcal G\}$ (equivalently, $\mathcal F$ has no $\mathcal G$-atom under $\mathsf P$: see (He and Sun, 2017; He and Sun, 2018) for this terminology). A stochastic game has a coarser transition kernel if there exists a $\sigma$-algebra $\mathcal G\subset\mathcal B(X)$ such that $q(\cdot|x,a)$ is $\mathcal G$-measurable for all $(x,a)\in X\times A$ and $\mathcal G$ is setwise coarser than $\mathcal B(X)$ (cf. Remark 2 below, where the $\sigma$-algebra $\mathcal B(X)\otimes\{\emptyset,[0,1]\}$ is setwise coarser than $\mathcal B(X)\otimes\mathcal B([0,1])$).

Theorem 3 (He and Sun, 2017, Theorem 1). Assume that the assumptions (A), (B) are satisfied and the game has a coarser transition kernel. Then there exists a Nash equilibrium $(\sigma^{*,i})_{i\in I}$.

In (He and Sun, 2017) Theorem 3 was formulated and proved under a more general assumption that there exists a decomposable coarser transition kernel.

3. The result

We use the notation of Section 2 and assume that the conditions of Theorem 3 are satisfied. A formal description of the game in question is as follows.

(I) The manager selects a tuple $c=(c^i)_{i\in I}$ of non-negative stimulating functions $c^i\in C_b(X\times A)$.

(II) The pool of m > 1 producers with the reward functions


$$r^i(x,a)=c^i(x,a)-g^i(x,a),\qquad 0\le g^i\in C_b(X\times A),$$

and payoffs

$$J^i(x,\sigma,c)=\mathsf E_{x,\sigma}\sum_{t=0}^\infty\beta^t\bigl(c^i(x_t,a_t)-g^i(x_t,a_t)\bigr)$$

play the described stochastic game, which results in a (stationary Markov) Nash equilibrium $(\sigma^{*,i})_{i\in I}$.

(III) The manager gets the payoff

$$J_L(x,\sigma^*,c)=\mathsf E_{x,\sigma^*}\sum_{t=0}^\infty\beta^t\Bigl( f(x_t,a_t)-\sum_{i=1}^m c^i(x_t,a_t)\Bigr), \qquad (9)$$

where $f\in C_b(X\times A)$.

Denote by $T(c)$ the set of Nash equilibria for a given tuple of stimulating functions $c$. The leader's aim is to maximize her payoff for the worst Nash equilibrium:

$$G(x,c)=\inf_{\sigma^*\in T(c)} J_L(x,\sigma^*,c).$$

A problem of this sort is known as a weak Stackelberg game: see (Breton et al., 1988). Let us call

$$V_L(x)=\sup\{G(x,c): c^i\in C_b(X\times A),\ i\in I\}$$

the value of the leader. A tuple $c_\varepsilon$ is called an $\varepsilon$-Stackelberg solution if

$$V_L(x)-\varepsilon\le G(x,c_\varepsilon),\qquad x\in X.$$

A pair $(c_\varepsilon,\sigma^*)$, $\sigma^*\in T(c_\varepsilon)$, is called an $\varepsilon$-Stackelberg equilibrium. Consider an auxiliary Markov decision process:

$$J(x,\sigma)=\mathsf E_{x,\sigma}\sum_{t=0}^\infty\beta^t\Bigl( f(x_t,a_t)-\sum_{i=1}^m g^i(x_t,a_t)\Bigr)\to\max_{(\sigma^i)_{i\in I}}. \qquad (10)$$

This problem is attributed to the leader, who performs the maximization over the tuples $\sigma=(\sigma^i)_{i\in I}$. By Theorem 1 this problem has an optimal solution of the form $\sigma=(\delta_{u^i(x)})_{i\in I}$, which can be identified with a Borel measurable selector

$$(u^i(x))_{i\in I}\in\arg\max_{a\in A(x)}\Bigl\{ f(x,a)-\sum_{i=1}^m g^i(x,a)+\beta\int_X V(y)\,Q(dy|x,a)\Bigr\}.$$

Here $V(x)=\sup_\sigma J(x,\sigma)$ is the value function of the problem (10). It would coincide with the optimal payoff of the leader if she were engaged in the production without resorting to the services of the producers.
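In the finite-state, finite-action case the auxiliary problem (10) is a single-agent problem over joint actions, so the value iteration of Theorem 1 applies verbatim. A minimal sketch, with an illustrative flattened joint-action index that is not part of the paper's notation:

```python
import numpy as np

def leader_auxiliary_mdp(f, g, Q, beta, n_iter=5_000):
    """Solve problem (10) for finite spaces: reward f - sum_i g^i over joint actions.

    f[x, a]    : leader's revenue, a = flattened joint action index
    g[i][x, a] : production cost of follower i
    Q[x, a, y] : transition probabilities
    Returns V and the maximizing joint action u(x), to be decoded into
    (u^1(x), ..., u^m(x)) by the caller.
    """
    r = f - sum(g)                              # one-stage reward of (10)
    V = np.zeros(f.shape[0])
    for _ in range(n_iter):
        V = (r + beta * Q @ V).max(axis=1)
    u = (r + beta * Q @ V).argmax(axis=1)       # selector (u^i(x))_{i in I}
    return V, u
```

If the flat joint index was built with `np.ravel_multi_index`, it can be decoded into the per-follower actions with `np.unravel_index`.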

We will assume that any producer suffers no cost if his production level is zero:

$$g^i(x,0,a^{-i})=0. \qquad (11)$$

Theorem 4. Under the assumptions of Theorem 3 the following assertions hold true.

(i) $V_L(x)=V(x)$.

(ii) The tuple $c_\varepsilon=(c^i_\varepsilon)_{i\in I}$,

$$c^i_\varepsilon(x,a)=g^i(x,a)+\frac{\varepsilon}{m}(1-\beta)\bigl(1-|a^i-u^i(x)|\bigr)^+,\qquad y^+:=\max\{y,0\},$$

is an $\varepsilon$-Stackelberg solution. The related Nash equilibrium is unique: $T(c_\varepsilon)=\{(\delta_{u^i})_{i\in I}\}$.

Proof. For any $\sigma^*\in T(c)$, $0\le c^i\in C_b(X\times A)$, we have

$$J^i(x,\sigma^*,c)\ge J^i(x,(\delta_0,\sigma^{*,-i}),c)\ge 0$$

in view of (11). It follows that

$$J_L(x,\sigma^*,c)\le J_L(x,\sigma^*,c)+\sum_{i=1}^m J^i(x,\sigma^*,c)=\mathsf E_{x,\sigma^*}\sum_{t=0}^\infty\beta^t\Bigl( f(x_t,a_t)-\sum_{i=1}^m g^i(x_t,a_t)\Bigr)=J(x,\sigma^*)\le V(x). \qquad (12)$$

Hence, $V_L(x)\le V(x)$.

Let $\sigma^*\in T(c_\varepsilon)$. As was mentioned in Section 2, each component of the tuple $(\sigma^{*,i})_{i\in I}$ is an optimal solution of the Markov decision process

$$J^i(x,(\sigma^i,\sigma^{*,-i}),c_\varepsilon)=\mathsf E_{x,\sigma^i,\sigma^{*,-i}}\sum_{t=0}^\infty\beta^t\bigl(c^i_\varepsilon(x_t,a^i_t;\sigma^{*,-i})-g^i(x_t,a^i_t;\sigma^{*,-i})\bigr)$$
$$=\frac{\varepsilon}{m}(1-\beta)\,\mathsf E_{x,\sigma^i,\sigma^{*,-i}}\sum_{t=0}^\infty\beta^t\bigl(1-|a^i_t-u^i(x_t)|\bigr)^+\to\max_{\sigma^i}.$$

Clearly,

$$J^i(x,(\sigma^i,\sigma^{*,-i}),c_\varepsilon)\le\frac{\varepsilon}{m}(1-\beta)\sum_{t=0}^\infty\beta^t=\frac{\varepsilon}{m}.$$

Hence,

$$V_{\sigma^{*,-i}}(x,c_\varepsilon):=\sup_{\sigma^i}J^i(x,(\sigma^i,\sigma^{*,-i}),c_\varepsilon)=J^i(x,(\delta_{u^i},\sigma^{*,-i}),c_\varepsilon)=\frac{\varepsilon}{m}.$$

By Theorem 2, for each $x$ the tuple $(\sigma^{*,i}(x))_{i\in I}$ is a mixed strategy Nash equilibrium in the one-shot game on $A^1(x)\times\dots\times A^m(x)$, where the payoff of the $i$-th player equals

$$H^i(x,a,c_\varepsilon)=c^i_\varepsilon(x,a^i;\sigma^{*,-i})-g^i(x,a^i;\sigma^{*,-i})+\beta\int_X V_{\sigma^{*,-i}}(y,c_\varepsilon)\,Q_{\sigma^{*,-i}}(dy|x,a^i)$$
$$=\frac{\varepsilon}{m}(1-\beta)\bigl(1-|a^i-u^i(x)|\bigr)^+ +\frac{\beta\varepsilon}{m}.$$

In this trivial game the tuple $(\sigma^{*,i}(x))_{i\in I}=(\delta_{u^i(x)})_{i\in I}$ of pure strategies is the unique Nash equilibrium.


Finally, substituting $c_\varepsilon$ in (9), we get

$$G(x,c_\varepsilon)=J_L(x,(\delta_{u^i})_{i\in I},c_\varepsilon)=\mathsf E_{x,(\delta_{u^i})_{i\in I}}\sum_{t=0}^\infty\beta^t\Bigl( f(x_t,a_t)-\sum_{i=1}^m g^i(x_t,a_t)\Bigr)$$
$$-\,(1-\beta)\frac{\varepsilon}{m}\,\mathsf E_{x,(\delta_{u^i})_{i\in I}}\sum_{t=0}^\infty\beta^t\sum_{i=1}^m\bigl(1-|a^i_t-u^i(x_t)|\bigr)^+ = V(x)-\varepsilon.$$

By (12) it follows that $c_\varepsilon$ is an $\varepsilon$-Stackelberg solution, and $V_L(x)=V(x)$.
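As a sanity check of part (ii), one can tabulate the follower's stage incentive under $c_\varepsilon$: the difference $c^i_\varepsilon-g^i$ equals the premium alone, so it is maximized exactly at $a^i=u^i(x)$ regardless of the other players' actions. A small sketch with made-up cost and target functions (not taken from the paper):

```python
def incentive(g_i, u_i, x, a_i, eps, beta, m):
    """Stimulating function of Theorem 4 (ii): c_eps^i = g^i + premium."""
    premium = (eps / m) * (1 - beta) * max(1.0 - abs(a_i - u_i(x)), 0.0)
    return g_i(x, a_i) + premium

# Hypothetical cost and target; the follower's stage reward c_eps - g is the
# premium, maximized at a_i = u_i(x).
g_i = lambda x, a: 0.5 * a            # zero cost at zero production, cf. (11)
u_i = lambda x: 2.0                   # leader's target action from problem (10)
eps, beta, m, x = 0.1, 0.9, 3, 0.0
rewards = {a: incentive(g_i, u_i, x, a, eps, beta, m) - g_i(x, a)
           for a in (0.0, 1.0, 1.5, 2.0, 2.5)}
best_action = max(rewards, key=rewards.get)    # -> 2.0 == u_i(x)
```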

Remark 1. In the case of a single follower ($m=1$) similar results were obtained in (Rokhlin and Ougolnitsky, 2018). However, the incentive premium in (Rokhlin and Ougolnitsky, 2018) was discontinuous in the follower's action $a$:

$$c_\varepsilon(x,a)-g(x,a)=\varepsilon(1-\beta)\,I_{\{a=u(x)\}} \qquad (13)$$

(we drop the index "1"). Furthermore, an analogue of Theorem 4 was proved either under the strong assumption that in the auxiliary problem (10) there exists a continuous optimal strategy $u$ (by the way, this requires a topology on the state space), or by working with the notion of an $(\varepsilon,\eta)$-Stackelberg solution and with another class of stimulating functions $c$. The present assumption that the production costs $g^i$, the revenue function $f$ and the stimulating functions $c^i$ belong to the class $C_b(X\times A)$ leads to more natural and simple results. On the other hand, in (Rokhlin and Ougolnitsky, 2018, Theorem 2) it was shown that for the incentive premium (13) the follower can deviate from $u$ only at the expense of "large losses". Thus, such a premium has its own merits.

In the case of finite state and action spaces, where no measure theoretic difficulties arise, Theorem 4 remains valid for the discontinuous incentive premium

$$c^i_\varepsilon(x,a)-g^i(x,a)=\frac{\varepsilon}{m}(1-\beta)\,I_{\{a^i=u^i(x)\}}.$$

Remark 2. In (He and Sun, 2017) it is mentioned that Theorem 3 implies the existence of a stationary Markov correlated equilibrium under the assumptions (A), (B). Closely following (He and Sun, 2017), we succinctly describe this point as follows. Consider the extended state space $X'=X\times[0,1]$ endowed with the product $\sigma$-algebra $\mathcal B(X)\otimes\mathcal B([0,1])$ and the product measure $\lambda'=\lambda\otimes\eta$, where $\eta$ is the Lebesgue measure on $\mathcal B([0,1])$. In the related model at each stage all players obtain a signal $z_t\in[0,1]$. These signals are independent random variables, uniformly distributed on $[0,1]$, and the extended transition kernel is given by

$$Q'(B\times C|x,z,a)=Q(B|x,a)\,\eta(C).$$

For the density $q'(\cdot|x,z,a)$ of $Q'(\cdot|x,z,a)$ with respect to $\lambda'$ we have

$$q'(y,u|x,z,a)=q(y|x,a).$$

The function $q'(\cdot|x,z,a)=q(\cdot|x,a)$ is measurable with respect to the $\sigma$-algebra $\mathcal G'=\mathcal B(X)\otimes\{\emptyset,[0,1]\}$ for all $x,a$, and this $\sigma$-algebra is setwise coarser than $\mathcal B(X)\otimes\mathcal B([0,1])$. Hence, the new model satisfies the coarser transition kernel condition and possesses a stationary Markov equilibrium. By definition, this means the existence of a correlated equilibrium in the original model, satisfying the assumptions (A), (B).

So, if in the scheme (I)-(III) describing the Stackelberg game we replace the Nash equilibrium at stage (II) by a correlated equilibrium, then all assertions of Theorem 4 remain valid with the coarser transition kernel condition dropped. The proof in fact does not change, since it is insensitive to the form of the state space.
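Computationally, the extension of Remark 2 only augments the state with a payoff-irrelevant public signal. A toy sketch of one transition of the extended model (the sampler `Q_sample` is a placeholder for a simulator of $Q$, not an object defined in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def extended_step(Q_sample, x, z, a):
    """One step of the extended model: Q'(dy, dv | x, z, a) = Q(dy | x, a) * eta(dv).

    The next state is drawn from Q(.|x, a) as before, while the public signal
    z_{t+1} is an independent Uniform[0, 1] draw that the players may use to
    correlate their actions.
    """
    return Q_sample(x, a), rng.uniform()
```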

4. Conclusion

The present paper contributes to the development of the theory of incentives in a dynamic stochastic formulation. Essentially, it generalizes the results of (Rokhlin and Ougolnitsky, 2018) to the case of multiple followers. Overall, the leader should act as if she does not rely on the producers' services and attribute their production costs to herself. After determining the optimal production strategies from the corresponding Markov decision process, she should economically motivate the producers to follow these policies. The closely related problems of multiple leaders and continuous time deserve further study.

References

Aliprantis, C. D., Border, K. C. (2006). Infinite dimensional analysis: a hitchhiker's guide. Springer-Verlag, Berlin.

Bogachev, V. (2007). Measure theory. Volume II. Springer-Verlag, Berlin.

Breton, M., Alj, A., Haurie, A. (1988). Sequential Stackelberg equilibria in two-person games. J. Optim. Theory Appl., 59(1), 71-97.

Dynkin, E. B., Yushkevich, A. A. (1979). Markov control processes and their applications. Springer-Verlag, New York.

He, W. (2014). Theory of correspondences and games. Ph.D. Thesis, National University of Singapore.

He, W., Sun, Y. (2017). Stationary Markov perfect equilibria in discounted stochastic games. J. Econ. Theory, 169, 35-61.

He, W., Sun, Y. (2018). Conditional expectation of correspondences and economic applications. Econ. Theory, 66(2), 265-299.

Himmelberg, C. J., Parthasarathy, T., VanVleck, F. S. (1976). Optimal plans for dynamic programming problems. Math. Oper. Res., 1(4), 390-394.


Hu, S., Papageorgiou, N. S. (1997). Handbook of multivalued analysis. Volume 1: Theory. Kluwer, Dordrecht.

Hu, S., Papageorgiou, N. S. (2000). Handbook of multivalued analysis. Volume 2: Applications. Kluwer, Dordrecht.

Jaskiewicz, A., Nowak, A. S. (2018). Non-zero-sum stochastic games. In Basar, T., Zaccour, G. (Eds.), Handbook of Dynamic Game Theory. Springer, Cham.

Laffont, J.-J., Martimort, D. (2002). The theory of incentives: the principal-agent model. Princeton University Press, Princeton, NJ.

Myerson, R. B. (1983). Mechanism design by an informed principal. Econometrica, 51(6), 1767-1797.

Novikov, D. A. (2013). Theory of control in organizations. Nova Science Publishers, New York.

Novikov, D. A., Shokhina, T. E. (2003). Incentive mechanisms in dynamic active systems. Autom. Remote Control, 64, 1912-1921.

Olsder, G. J. (2009). Phenomena in inverse Stackelberg games, part 2: Dynamic problems. J. Optim. Theory Appl., 143(3), 601-618.

Rokhlin, D. B., Ougolnitsky, G. A. (2018). Stackelberg equilibrium in a dynamic stimulation model with complete information. Autom. Remote Control, 79, 701-712.

Srivastava, S. M. (1998). A course on Borel sets. Springer, New York.
