Вычислительные технологии (Computational Technologies)
Vol. 11, No. 2, 2006
COMPUTATION OF THE REGRESSION KERNEL MATRIX USING SEMIDEFINITE PROGRAMMING*
T. B. TRAFALIS, A. M. MALYSCHEFF
School of Industrial Engineering, University of Oklahoma, Norman, USA
e-mail: [email protected], [email protected]
The optimal regression kernel matrix is computed by semidefinite programming using three basis matrices. Preliminary results on standard benchmark data are presented, for which the optimal parameters of a linear combination of three basis kernel matrices are identified.
Introduction
The problem of identifying an optimal kernel for a specific class of data has recently attracted increasing attention in the machine learning community. Chapelle et al. [6] have investigated the selection of optimal parameters using a traditional steepest descent method. Their approach locates a local minimum in the space of parameters. Cristianini et al. [8] have suggested an approach which finds the kernel matrix that best describes the labels of the training set (kernel target alignment). Lanckriet et al. [9] employed ideas from semidefinite programming for computing the optimal kernel matrix for pattern classification problems. Bach, Lanckriet and Jordan [1] use sequential minimal optimization techniques to improve computational efficiency for solving support vector machine classification problems which are based on a combination of kernel matrices. Trafalis and Malyscheff [16] computed the optimal kernel matrix for regression analysis problems using semidefinite programming without basis matrices. In this paper we extend these ideas by computing the optimal parameters for a linear combination of three basis regression kernel matrices using semidefinite programming techniques [10, 11, 17]. We illustrate our findings with some examples and apply them to standard benchmark data.
The paper is organized as follows: in section 1 we will provide a brief introduction to support vector machine learning and describe the primal and dual versions for regression analysis problems. Based on the dual formulation we will then compute the dual of the dual support vector regression problem, since this formulation has some computational advantages for our purposes. Section 2 introduces the semidefinite programming framework and incorporates the results from the previous section resulting in two formulations, which were used for experimentation. In section 3 we will compute a few simple examples illustrating our formulation and then present results using standard benchmark data.
*This research has been supported partially by the National Science Foundation, NSF Grant ECS-0099378.
© Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, 2006.
1. Support Vector Machine Learning for Regression Analysis
Support vector machine (SVM) learning and other kernel-based learning algorithms can be implemented either for classification or for regression analysis. This discussion focuses on the regression setting; for further information on the subject, in particular on support vector machines in classification, we refer the reader to the texts [5, 7, 13, 18]. Regression analysis problems focus on the computation of a scalar output based on one or more input values (attributes). Mathematically, let the training data consist of $l$ vectors $x_j \in \mathbb{R}^d$ with a priori known output values $y_j \in \mathbb{R}$, where $j = 1, \ldots, l$. Hence, the training set can be written as $T = \{(x_j, y_j)\}_{j=1}^{l} \subset \mathbb{R}^{d+1}$. Vapnik [19] has shown that in the case of linear support vector machine regression the following primal optimization problem must be solved:
$$
\begin{aligned}
\text{(P)}\qquad \min\ \ & \tfrac{1}{2}\,\|w\|^2 \\
\text{subject to}\quad & y_j - w^T x_j - b \le \varepsilon \quad \forall j = 1,\ldots,l \quad (\lambda_j), \\
& w^T x_j + b - y_j \le \varepsilon \quad \forall j = 1,\ldots,l \quad (\lambda_j^*),
\end{aligned}
\qquad (1)
$$
where $w \in \mathbb{R}^d$ is the slope of the regression function and $b \in \mathbb{R}$ the offset with respect to the origin. Note that for $d$-dimensional input data $w \in \mathbb{R}^d$. The parameter $\varepsilon$ can be interpreted as the precision that is required from the regression function. Geometrically, it creates a tube of width $2\varepsilon$ around the regression function, within which all measured data samples $(x_j, y_j)$ must be contained.
Let $\lambda_j$ be the Lagrangian multiplier corresponding to the first set of constraints and $\lambda_j^*$ the Lagrangian multiplier corresponding to the second set ($j = 1, \ldots, l$). Define the vectors $\Lambda^T = \lambda^T - \lambda^{*T} = (\lambda_1 - \lambda_1^*, \lambda_2 - \lambda_2^*, \ldots, \lambda_l - \lambda_l^*)$, $\mathbf{1}^T = (1, 1, \ldots, 1)$, and $y^T = (y_1, y_2, \ldots, y_l)$, as well as the matrix $K_{ij} = x_i^T x_j$ in the linear case and $K_{ij} = k(x_i, x_j)$ in the general case, where $k(x_i, x_j)$ is the kernel function. Popular kernel functions include a $d$-degree polynomial kernel $k(x_i, x_j) = (x_i^T x_j + 1)^d$ or a radial-basis function kernel $k(x_i, x_j) = \exp\left(-0.5\,(x_i - x_j)^T (x_i - x_j)/\sigma^2\right)$. The dual problem can then be written in closed form as:
$$
\begin{aligned}
\text{(D)}\qquad W(K_{tr}) = \max\ \ & -\tfrac{1}{2}\,\Lambda^T K_{tr}\,\Lambda - \varepsilon\,(\lambda^T \mathbf{1} + \lambda^{*T} \mathbf{1}) + y_{tr}^T\,\Lambda \\
\text{subject to}\quad & \Lambda^T \mathbf{1} = 0 \quad (\eta), \\
& \lambda_j \ge 0 \quad \forall j = 1,\ldots,l \quad (\gamma_j), \\
& \lambda_j^* \ge 0 \quad \forall j = 1,\ldots,l \quad (\gamma_j^*).
\end{aligned}
\qquad (2)
$$
By writing $K_{tr}$ we emphasize that for the solution of this problem we rely entirely on the training set, excluding kernel products from the test set for the computation of $W(K_{tr})$. We will elaborate on $K_{tr}$ in section 2. We also denote the vector of the training labels by $y_{tr}$, again in order to show that this variable is solely based on training data. Note that the regression function can be expressed as
$$
f(x_i) = \sum_{j=1}^{l} \Lambda_j\, k(x_j, x_i) + b \qquad \forall i = 1,\ldots,l
\qquad (3)
$$
for the training data. In this paper the kernel values $k(x_j, x_i)$ for the test data are also identified during optimization; thus, the prediction regression function for the test set can be found from
$$
f(x_i) = \sum_{j=1}^{l} \Lambda_j\, k(x_j, x_i) + b \qquad \forall i = l+1,\ldots,l+n_{ts},
\qquad (4)
$$
where $n_{ts}$ indicates the number of samples in the test set. The bias $b$ can be computed by solving the complementary slackness conditions [12, 13]. We will further discuss $b$ in section 2.
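As an aside, for a fixed training kernel matrix the dual problem (2) is an ordinary quadratic program and can be prototyped directly. The following is a minimal sketch under the assumption that NumPy and cvxpy are available; neither is part of the paper's MATLAB/SeDuMi toolchain, and the function name is hypothetical.

```python
# A minimal sketch of the dual problem (2) for a fixed training kernel matrix.
# cvxpy and NumPy are assumptions of this sketch (the paper itself works with
# SeDuMi/YALMIP under MATLAB).
import numpy as np
import cvxpy as cp

def solve_svr_dual(K_tr, y_tr, eps):
    """Return (lambda, lambda_star, W(K_tr)) for problem (2)."""
    l = len(y_tr)
    lam = cp.Variable(l, nonneg=True)       # lambda_j
    lam_s = cp.Variable(l, nonneg=True)     # lambda_j^*
    Lam = lam - lam_s                       # Lambda = lambda - lambda^*
    # factor K_tr so that Lambda^T K_tr Lambda becomes a sum of squares
    R = np.linalg.cholesky(K_tr + 1e-10 * np.eye(l))
    obj = (-0.5 * cp.sum_squares(R.T @ Lam)
           - eps * (cp.sum(lam) + cp.sum(lam_s))
           + y_tr @ Lam)
    prob = cp.Problem(cp.Maximize(obj), [cp.sum(Lam) == 0])
    prob.solve()
    return lam.value, lam_s.value, prob.value
```

The prediction in (3)-(4) then only needs $\Lambda = \lambda - \lambda^*$ and the bias $b$ discussed above.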
In the next step we will compute the dual problem of the dual support vector regression formulation in (2), as this will facilitate the computational analysis. Using the variables $\Gamma = (\gamma_1, \ldots, \gamma_l)$, $\Gamma^* = (\gamma_1^*, \ldots, \gamma_l^*)$, $\lambda = (\lambda_1, \ldots, \lambda_l)$, $\lambda^* = (\lambda_1^*, \ldots, \lambda_l^*)$, and $\Lambda = \lambda - \lambda^*$, the Lagrangian of the (dual) support vector machine regression problem can be written as:
$$
L(\lambda, \lambda^*, \eta, \Gamma, \Gamma^*) = y_{tr}^T \lambda - y_{tr}^T \lambda^* - \tfrac{1}{2}\,(\lambda - \lambda^*)^T K_{tr}\,(\lambda - \lambda^*) - \varepsilon\,(\lambda + \lambda^*)^T \mathbf{1} + \Gamma^T \lambda + (\Gamma^*)^T \lambda^* + \eta\,(\lambda - \lambda^*)^T \mathbf{1}.
\qquad (5)
$$
From duality theory [2, 4] we know:
$$
\max_{\lambda \ge 0,\ \lambda^* \ge 0} \left\{ \min_{\Gamma \ge 0,\ \Gamma^* \ge 0,\ \eta} L(\lambda, \lambda^*, \Gamma, \Gamma^*, \eta) \right\}
\ \le\
\min_{\Gamma \ge 0,\ \Gamma^* \ge 0,\ \eta} \left\{ \max_{\lambda \ge 0,\ \lambda^* \ge 0} L(\lambda, \lambda^*, \Gamma, \Gamma^*, \eta) \right\}.
\qquad (6)
$$
Computing the gradients $\nabla_{\lambda} L$ and $\nabla_{\lambda^*} L$ results in:
$$
\nabla_{\lambda} L = y_{tr} - K_{tr}(\lambda - \lambda^*) - \varepsilon\mathbf{1} + \Gamma + \eta\mathbf{1} = 0;
\qquad (7)
$$
$$
\nabla_{\lambda^*} L = -y_{tr} + K_{tr}(\lambda - \lambda^*) - \varepsilon\mathbf{1} + \Gamma^* - \eta\mathbf{1} = 0.
\qquad (8)
$$
Upon combining equations (7) and (8) one finds:
$$
\Gamma + \Gamma^* = 2\varepsilon\mathbf{1}.
\qquad (9)
$$
Since $K_{tr}$ is positive definite, expression (7) can be solved for $\lambda$:
$$
\lambda = K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) + \lambda^*.
\qquad (10)
$$
We can recompute the Lagrangian by using the results from (9) and (10):
$$
\begin{aligned}
L(\lambda, \lambda^*, \eta, \Gamma, \Gamma^*) =\ & y_{tr}^T K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) - \varepsilon\,\mathbf{1}^T \left[ K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) + 2\lambda^* \right] \\
& + \eta\,\mathbf{1}^T K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) + \Gamma^T \left[ K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) + \lambda^* \right] \\
& + (\Gamma^*)^T \lambda^* - \tfrac{1}{2} \left[ K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) \right]^T K_{tr}\,K_{tr}^{-1}(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma),
\end{aligned}
$$
and after simplifying we obtain:
$$
L(\lambda^{opt}, \lambda^{*,opt}, \eta, \Gamma, \Gamma^*) = \tfrac{1}{2}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma).
\qquad (11)
$$
Therefore, the “dual of the dual” for support vector machine regression reduces to:
$$
\begin{aligned}
W(K_{tr}, \eta^{opt}, \Gamma^{opt}) = \min_{\eta,\,\Gamma}\ \ & \tfrac{1}{2}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) \\
\text{subject to}\quad & 2\varepsilon\mathbf{1} - \Gamma \ge 0, \\
& \Gamma \ge 0, \\
& \eta \ \text{unrestricted}.
\end{aligned}
\qquad (12)
$$
Note that the regression parameters can be obtained from $\Lambda = \lambda - \lambda^*$. Taking into account equation (10) one finds for the regression parameters:
$$
\Lambda = \lambda - \lambda^* = K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma).
\qquad (13)
$$
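For a fixed $K_{tr}$, the "dual of the dual" (12) is itself a small quadratic program in $(\eta, \Gamma)$, and (13) then recovers $\Lambda$. The following sketch again assumes NumPy/cvxpy (not the paper's toolchain) and a hypothetical function name.

```python
# A sketch of problem (12) for a fixed K_tr, followed by the recovery of the
# regression parameters via equation (13); cvxpy and NumPy are assumed here.
import numpy as np
import cvxpy as cp

def dual_of_dual(K_tr, y_tr, eps):
    l = len(y_tr)
    one = np.ones(l)
    eta = cp.Variable()                  # multiplier of Lambda^T 1 = 0 in (2)
    Gamma = cp.Variable(l)               # vector of the gamma_j
    g = y_tr - eps * one + eta * one + Gamma
    K_inv = np.linalg.inv(K_tr)          # K_tr is assumed positive definite, as in the text
    R = np.linalg.cholesky(K_inv)        # K_tr^{-1} = R R^T
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(R.T @ g)),
                      [2 * eps * one - Gamma >= 0, Gamma >= 0])
    prob.solve()
    Lam = K_inv @ (y_tr - eps * one + eta.value * one + Gamma.value)   # eq. (13)
    return prob.value, eta.value, Gamma.value, Lam
```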
2. Formulation using Semidefinite Programming
In the previous section we briefly discussed support vector machine learning in regression analysis and presented the dual of the dual formulation. We will now introduce the semidefinite programming (SDP) framework in which problem (12) will be embedded.
Let us begin by first decomposing the kernel matrix K. This matrix contains mappings of scalar products of the input data for both the training and the test set. Since a part of the analysis extracts information solely contained in the training set, while other parts will require information of test set input data, we will write the kernel matrix K as:
$$
K = \begin{bmatrix} K_{tr} & K_{tr,t} \\ K_{tr,t}^T & K_t \end{bmatrix}.
\qquad (14)
$$
The matrix Ktr reflects information of the training data, Ktr,t describes mappings of scalar products between training and test set, while Kt represents mappings solely of test set input vectors. The expression in equation (2) for example operates only on scalar product mappings Ktr from the training inputs:
$$
W(K_{tr}) = \max \left\{ -\tfrac{1}{2}\,(\lambda - \lambda^*)^T K_{tr}\,(\lambda - \lambda^*) - \varepsilon\,(\lambda^T \mathbf{1} + \lambda^{*T} \mathbf{1}) + y_{tr}^T (\lambda - \lambda^*) \ :\ (\lambda - \lambda^*)^T \mathbf{1} = 0,\ \lambda \ge 0,\ \lambda^* \ge 0 \right\}.
$$
It can be observed that $W(K_{tr})$ is convex in $K_{tr}$. Moreover, since regression analysis problems can be interpreted as classification problems [3], we can use the concept of the margin and apply it in this context. Thus, since $W(K_{tr})$ is the inverse of the margin for classification problems, we can follow the same reasoning and minimize $W(K_{tr})$ under the assumption that $K \succeq 0$ and $\mathrm{trace}(K) = \mathrm{const}$ [8]. Thus, we require semidefiniteness for all data samples, while optimality is enforced on the training data:
$$
\begin{aligned}
\min\ \ & W(K_{tr}) \\
\text{subject to}\quad & K \succeq 0, \quad \mathrm{trace}(K) = c.
\end{aligned}
\qquad (15)
$$
By introducing the variable t the above problem can be formulated in terms of Ktr and K as follows:
$$
\begin{aligned}
\min\ \ & t \\
\text{subject to}\quad & K \succeq 0, \quad t \ge W(K_{tr}), \quad \mathrm{trace}(K) = c.
\end{aligned}
\qquad (16)
$$
Moreover, considering both equations (12) and (16) one can write:
$$
\begin{aligned}
\min\ \ & t \\
\text{subject to}\quad & K \succeq 0, \\
& t \ge \min_{\eta,\,\Gamma} \left\{ \tfrac{1}{2}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) \ :\ 2\varepsilon\mathbf{1} - \Gamma \ge 0,\ \Gamma \ge 0 \right\}, \\
& \mathrm{trace}(K) = c.
\end{aligned}
\qquad (17)
$$
The constraint imposed on $\Gamma$ can be shifted from the subproblem to the global problem:
$$
\begin{aligned}
\min\ \ & t \\
\text{subject to}\quad & K \succeq 0, \\
& t \ge \min_{\eta} \ \tfrac{1}{2}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma), \\
& 2\varepsilon\mathbf{1} - \Gamma \ge 0, \\
& \Gamma \ge 0, \\
& \mathrm{trace}(K) = c.
\end{aligned}
\qquad (18)
$$
From Schur’s complement we know that for the symmetric matrix
$$
X = X^T = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}
\qquad (19)
$$
it holds that, if $A \succ 0$, then $X \succeq 0$ if and only if $S = C - B^T A^{-1} B \succeq 0$. Using this additional information, equation (18) can be written as:
$$
\begin{aligned}
\min_{K,\,t,\,\eta,\,\Gamma}\ \ & t \\
\text{subject to}\quad & \mathrm{trace}(K) = c, \\
& K \succeq 0, \\
& \begin{bmatrix} K_{tr} & y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma \\ (y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T & 2t \end{bmatrix} \succeq 0, \\
& 2\varepsilon\mathbf{1} - \Gamma \ge 0, \\
& \Gamma \ge 0.
\end{aligned}
\qquad (20)
$$
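To make the construction concrete, the sketch below states problem (20) directly, with $K$ a free positive semidefinite matrix variable constrained only through its trace, exactly as above, and the linear matrix inequality built as a block matrix. cvxpy and NumPy are assumptions of this sketch; the paper's experiments were run with SeDuMi/YALMIP.

```python
# A sketch of problem (20): K is an (l + n_ts) x (l + n_ts) PSD variable whose
# upper-left l x l block plays the role of K_tr. cvxpy/NumPy are assumed.
import numpy as np
import cvxpy as cp

def sdp_free_kernel(y_tr, n_total, eps, c):
    l = len(y_tr)
    one = np.ones((l, 1))
    K = cp.Variable((n_total, n_total), PSD=True)
    t, eta = cp.Variable(), cp.Variable()
    Gamma = cp.Variable((l, 1))
    g = y_tr.reshape(-1, 1) - eps * one + eta * one + Gamma
    M = cp.bmat([[K[:l, :l], g],
                 [g.T, cp.reshape(2 * t, (1, 1))]])
    constraints = [cp.trace(K) == c,
                   M >> 0,                       # Schur-complement LMI of (20)
                   2 * eps * one - Gamma >= 0,
                   Gamma >= 0]
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve()
    return K.value, t.value, eta.value, Gamma.value.ravel()
```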
The matrices $K_{tr}$ and $K$ contain as elements input vector scalar products or mappings thereof. Various different mappings exist (polynomial, radial-basis function), and one aspect of research in support vector machine learning addresses the issue of selecting an efficient parameter for these mappings. Here, these parameters will be preselected for three basis kernel matrices, which are subsequently optimally combined using a set of multipliers $\mu_j \in \mathbb{R}$. Mathematically, consider
$$
K = \mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3.
\qquad (21)
$$
Here, K1 describes a polynomial kernel with entries
$$
k_1(x_i, x_j) = (x_i^T x_j + 1)^d.
\qquad (22)
$$
For the experiments in this paper a value of $d = 2$ was selected. Next, $K_2$ implements a radial-basis function kernel with entries
$$
k_2(x_i, x_j) = \exp\left(-0.5\,(x_i - x_j)^T (x_i - x_j)/\sigma^2\right).
\qquad (23)
$$
For this analysis we chose $\sigma = 0.5$. Finally, $K_3$ realizes a linear kernel with entries
$$
k_3(x_i, x_j) = x_i^T x_j.
\qquad (24)
$$
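The three basis matrices can be precomputed on the stacked training and test inputs, so that the blocks of equation (14) are simply submatrices. A small NumPy helper with a hypothetical name, sketched under those assumptions:

```python
# Basis kernel matrices (22)-(24) evaluated on all inputs (training and test),
# so K_tr, K_tr,t and K_t from (14) are submatrices of each returned array.
import numpy as np

def basis_kernels(X, d=2, sigma=0.5):
    """X holds one input vector per row."""
    G = X @ X.T                                  # scalar products x_i^T x_j
    K1 = (G + 1.0) ** d                          # polynomial kernel, eq. (22)
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * G  # ||x_i - x_j||^2
    K2 = np.exp(-0.5 * dist2 / sigma ** 2)       # radial-basis kernel, eq. (23)
    K3 = G                                       # linear kernel, eq. (24)
    return K1, K2, K3
```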
Taking into account the decomposition described in equations (21)-(24) we can rewrite problem (20):
$$
\begin{aligned}
\min_{\mu_1,\mu_2,\mu_3,\,t,\,\eta,\,\Gamma}\ \ & t \\
\text{subject to}\quad & \mathrm{trace}(\mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3) = c, \\
& \mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3 \succeq 0, \\
& \begin{bmatrix} \mu_1 K_{1,tr} + \mu_2 K_{2,tr} + \mu_3 K_{3,tr} & y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma \\ (y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T & 2t \end{bmatrix} \succeq 0, \\
& 2\varepsilon\mathbf{1} - \Gamma \ge 0, \\
& \Gamma \ge 0, \\
& \mu_1, \mu_2, \mu_3 \ \text{free}.
\end{aligned}
\qquad (25)
$$
The semidefinite programming problem in (25) leaves the parameters $\mu_j$ unrestricted, and one set of experiments on standard benchmark data was conducted using this formulation. In addition, we also formulated a kernel-based learning algorithm with the additional requirement that the $\mu_j$ be nonnegative. This optimization problem is spelled out in (26), and benchmark tests were also performed using this formulation.
$$
\begin{aligned}
\min_{\mu_1,\mu_2,\mu_3,\,t,\,\eta,\,\Gamma}\ \ & t \\
\text{subject to}\quad & \mathrm{trace}(\mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3) = c, \\
& \mu_1 K_1 + \mu_2 K_2 + \mu_3 K_3 \succeq 0, \\
& \begin{bmatrix} \mu_1 K_{1,tr} + \mu_2 K_{2,tr} + \mu_3 K_{3,tr} & y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma \\ (y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma)^T & 2t \end{bmatrix} \succeq 0, \\
& 2\varepsilon\mathbf{1} - \Gamma \ge 0, \\
& \Gamma \ge 0, \\
& \mu_1, \mu_2, \mu_3 \ge 0.
\end{aligned}
\qquad (26)
$$
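Both formulations can be prototyped in a few lines. The sketch below (again assuming cvxpy/NumPy and the hypothetical basis_kernels helper from the earlier sketch) treats (25) and (26) uniformly, with a flag switching the nonnegativity constraint on the $\mu_j$.

```python
# A sketch of problems (25)/(26): the kernel is a linear combination of the
# precomputed basis matrices; nonneg=True adds mu >= 0 and yields (26).
import numpy as np
import cvxpy as cp

def sdp_basis_kernel(K_list, y_tr, eps, c, nonneg=False):
    l = len(y_tr)
    one = np.ones((l, 1))
    mu = cp.Variable(len(K_list))
    t, eta = cp.Variable(), cp.Variable()
    Gamma = cp.Variable((l, 1))
    K = sum(mu[i] * K_list[i] for i in range(len(K_list)))
    K_tr = sum(mu[i] * K_list[i][:l, :l] for i in range(len(K_list)))
    g = y_tr.reshape(-1, 1) - eps * one + eta * one + Gamma
    M = cp.bmat([[K_tr, g], [g.T, cp.reshape(2 * t, (1, 1))]])
    constraints = [cp.trace(K) == c, K >> 0, M >> 0,
                   2 * eps * one - Gamma >= 0, Gamma >= 0]
    if nonneg:
        constraints.append(mu >= 0)      # switches (25) into (26)
    prob = cp.Problem(cp.Minimize(t), constraints)
    prob.solve()
    return mu.value, t.value, eta.value, Gamma.value.ravel()
```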
In order to compute the predicted output $f(x_i)$ for both the training and the test set we require the regression parameters $\Lambda$, which can be computed from equations (13), (14), and (21). For the computation of the bias $b$ in equation (3) complementary slackness imposes:
$$
\gamma_j \lambda_j = 0 \quad \forall j = 1,\ldots,l,
\qquad (27)
$$
and
$$
\gamma_j^* \lambda_j^* = 0 \quad \forall j = 1,\ldots,l.
\qquad (28)
$$
In addition, we know from (9) that $\Gamma + \Gamma^* = 2\varepsilon\mathbf{1}$. Therefore, if $\gamma_j^* = 0$, we conclude that $\gamma_j \ne 0$, requiring $\lambda_j = 0$. Finally, going back to the original problem in (1) we can postulate that $b = y_j - w^T x_j - \varepsilon$ if $\lambda_j \ne 0$. A similar analysis can be conducted for $\gamma_j = 0$, which yields overall:
$$
\gamma_j = 0 \ \Longrightarrow\ \lambda_j \ne 0 \ \Longrightarrow\ b = y_j - w^T x_j - \varepsilon
\qquad (29)
$$
and
$$
\gamma_j^* = 0 \ \Longrightarrow\ \lambda_j^* \ne 0 \ \Longrightarrow\ b = y_j - w^T x_j + \varepsilon.
\qquad (30)
$$
The scalar product $w^T x_j$ can be replaced by $\sum_{i=1}^{l} \Lambda_i\, k(x_i, x_j)$. Note that it is possible to have both $\gamma_j \ne 0$ and $\gamma_j^* \ne 0$, resulting in data points which are located entirely inside the regression tube.
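A short sketch of the bias recovery implied by (29)-(30): scan for a training point whose $\lambda_j$ or $\lambda_j^*$ is clearly nonzero and read $b$ off the corresponding active constraint of (1). NumPy is assumed, the function name is hypothetical, and the tolerance is an implementation choice not taken from the paper.

```python
# Bias recovery from complementary slackness, equations (29)-(30).
import numpy as np

def recover_bias(Lam, K_tr, y_tr, eps, tol=1e-6):
    """Lam is lambda - lambda^* from eq. (13); K_tr holds k(x_i, x_j) on the training set."""
    f_no_bias = K_tr @ Lam                        # sum_i Lambda_i k(x_i, x_j)
    for j in range(len(y_tr)):
        if Lam[j] > tol:                          # lambda_j != 0: first constraint of (1) active
            return y_tr[j] - f_no_bias[j] - eps   # eq. (29)
        if Lam[j] < -tol:                         # lambda_j^* != 0: second constraint active
            return y_tr[j] - f_no_bias[j] + eps   # eq. (30)
    raise ValueError("no training point with a nonzero multiplier was found")
```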
3. Computational Results
3.1. Examples
In this section we will employ the semidefinite programming problem from (25) to compute the optimal kernel matrix for several simple regression problems. For our computations we used SeDuMi 1.05 [14] and Yalmip [15].
In the first experiment consider the quadratic function $f(x) = x^2$. The training set spanned $x_{tr}^T = (-2, -1, 0, 1, 2)$, with the test set consisting of $x_{ts} = (1.5)$. Thus, the training output assumed the values $y_{tr}^T = (4, 1, 0, 1, 4)$, while the predicted output $y_{ts,pr}$ was compared to the value $y_{ts} = (2.25)$. For this experiment we selected $\varepsilon = 0.01$ and $c = 1$, thus requiring $\mathrm{trace}(K) = 1$. Solving problem (25) for these values yields the following result for the parameters $\mu$:
$$
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} = \begin{pmatrix} 0.0222 \\ 0.0000 \\ -0.0444 \end{pmatrix}.
\qquad (31)
$$
Keeping in mind that $K_1$ corresponds to a polynomial kernel of degree 2, $K_2$ to a radial-basis function kernel with $\sigma = 0.5$, and $K_3$ to a linear kernel, the overall training matrix becomes:
$$
K_{tr} = \mu_1 K_{1,tr} + \mu_2 K_{2,tr} + \mu_3 K_{3,tr} = \begin{pmatrix}
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 \\
0.0222 & 0.0222 & 0.0222 & 0.0222 & 0.0222 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 \\
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773
\end{pmatrix}.
\qquad (32)
$$
The overall matrix also includes the test set; for this simple example the test set contains only one pattern, therefore $K \in \mathbb{R}^{6 \times 6}$, carrying the values:
$$
K = \begin{pmatrix}
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773 & 0.2219 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 & 0.0721 \\
0.0222 & 0.0222 & 0.0222 & 0.0222 & 0.0222 & 0.0222 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 & 0.0721 \\
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773 & 0.2219 \\
0.2219 & 0.0721 & 0.0222 & 0.0721 & 0.2219 & 0.1345
\end{pmatrix}.
\qquad (33)
$$
Indeed, the trace of this matrix adds up to one. Moreover, for the value of the objective function one finds $t = 22.31$. The unrestricted Lagrangian multiplier assumes a value of $\eta = -0.01$. For $\Gamma$ and $\Gamma^*$ the following vectors are calculated, respectively:
$$
\Gamma = \begin{pmatrix} 0.0000 \\ 0.0150 \\ 0.0200 \\ 0.0150 \\ 0.0000 \end{pmatrix}
\quad \text{and} \quad
\Gamma^* = \begin{pmatrix} 0.0200 \\ 0.0050 \\ 0.0000 \\ 0.0050 \\ 0.0200 \end{pmatrix}.
\qquad (34)
$$
The vector of $\Lambda$'s is obtained using the following identity:
$$
\Lambda = K_{tr}^{-1}\,(y_{tr} - \varepsilon\mathbf{1} + \eta\mathbf{1} + \Gamma) = \begin{pmatrix} 5.0953 \\ 2.0405 \\ -14.2701 \\ 2.0405 \\ 5.0953 \end{pmatrix}
\qquad (35)
$$
and the predicted training outputs $y_{tr,pr}$ can be computed from
$$
y_{tr,pr} = K_{tr}\Lambda + b\mathbf{1} = \begin{pmatrix} 3.9967 \\ 1.0117 \\ 0.0167 \\ 1.0117 \\ 3.9967 \end{pmatrix},
\qquad (36)
$$
where b = 0.0167 was derived using the complementary slackness conditions in (29) and (30).
In order to identify the predicted test output, introduce the matrix $K_{ts} = [K_{tr},\ K_{tr,t}]^T$ with
$$
K_{ts} = \begin{pmatrix}
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 \\
0.0222 & 0.0222 & 0.0222 & 0.0222 & 0.0222 \\
0.1110 & 0.0444 & 0.0222 & 0.0444 & 0.1110 \\
0.3773 & 0.1110 & 0.0222 & 0.1110 & 0.3773 \\
0.2219 & 0.0721 & 0.0222 & 0.0721 & 0.2219
\end{pmatrix},
\qquad (37)
$$
where the last row corresponds to the matrix $K_{tr,t}$ in equation (14). Indeed, the expression $K_{ts}\Lambda + b\mathbf{1}$ now becomes
$$
K_{ts}\Lambda + b\mathbf{1} = \begin{pmatrix} y_{tr,pr} \\ y_{ts,pr} \end{pmatrix} = \begin{pmatrix} 3.9967 \\ 1.0117 \\ 0.0167 \\ 1.0117 \\ 3.9967 \\ 2.2554 \end{pmatrix},
\qquad (38)
$$
where the last value is the predicted value for $y_{ts} = (2.25)$. Equation (38) can also be interpreted as the closed-form version of equations (3) and (4).
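For completeness, the toy experiment above can be assembled from the hypothetical helpers sketched in the previous sections (basis_kernels, sdp_basis_kernel, recover_bias); the numbers returned depend on the solver and its tolerances and need not reproduce (31)-(38) exactly.

```python
# Assembling the quadratic toy example with the hypothetical helpers above.
import numpy as np

X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 1.5]).reshape(-1, 1)  # 5 training + 1 test input
y_tr = np.array([4.0, 1.0, 0.0, 1.0, 4.0])
l, eps, c = 5, 0.01, 1.0

K1, K2, K3 = basis_kernels(X, d=2, sigma=0.5)
mu, t, eta, Gamma = sdp_basis_kernel([K1, K2, K3], y_tr, eps, c)

K = mu[0] * K1 + mu[1] * K2 + mu[2] * K3                       # eq. (21)
Lam = np.linalg.solve(K[:l, :l], y_tr - eps + eta + Gamma)     # eq. (13)
b = recover_bias(Lam, K[:l, :l], y_tr, eps)
y_pred = K[:, :l] @ Lam + b                                    # eqs. (3)-(4), cf. (38)
print(mu, y_pred[-1])                                          # prediction for x = 1.5
```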
Next, let us discuss an approximation of the function $f(x) = \exp(x)$. We computed $f(x)$ for values from $-5$ to $+5$ at increments of 1.0. We chose $\varepsilon = 0.5$ and $c = 1$. The semidefinite programming approach identified a feasible optimal solution with an objective function value of $t = 76100$. For the Lagrangian multiplier we found $\eta = -5.5912$. For the bias we calculated $b = 5.5903$, while the parameters for the basis matrices attained the values:
$$
\mu = \begin{pmatrix} 0.0001 \\ 0.0558 \\ 0.0016 \end{pmatrix}.
\qquad (39)
$$
Fig. 1. Comparison for f(x) = exp(x).
Fig. 2. Comparison for f(x) = sin(x).
Figure 1 shows the graphs for the labels ytr and the predicted values ypr computed by SeDuMi.
Note that in this experiment the predicted training labels are almost identical to the true training labels, as the two curves are very close to each other.
For the third experiment the function $f(x) = \sin(x)$ was examined. Here, we computed labels from 0 to 6.28 at increments of 0.4. Once more, we chose $\varepsilon = 0.5$ and $c = 1$. The solution of the semidefinite programming problem yielded an objective function value of $t = 3.5102$ and a Lagrangian multiplier of $\eta = -0.0132$. For the sinusoidal function the bias assumes a significantly smaller value of $b = 0.0086$ than was the case for the exponential function. For the parameters of the basis matrices we retrieve the values:
$$
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} = \begin{pmatrix} 0.0002 \\ 0.0634 \\ -0.0041 \end{pmatrix}.
\qquad (40)
$$
Figure 2 shows the two graphs for labels ytr and the predicted values ypr.
Notice that for both the exponential and the sinusoidal function most of the weight of the $\mu$ is placed on the radial-basis function kernel, while for the quadratic function in the first experiment the polynomial kernel is disproportionately favored.
3.2. Benchmark tests
Hereafter, generalization performance for the semidefinite programming approach was computed using problems (25) and (26). Subsets with 100 samples of the publicly available datasets abalone¹, add10², and boston³ were selected as benchmark datasets. All datasets were randomly split into a training and a test set with a training-to-test ratio of 80 % : 20 %. For each dataset 30 different scenarios were created, and results are presented as an average over these 30 scenarios. As a reference, the best radial-basis function kernel support vector machine is also displayed; its kernel parameter was tuned using cross-validation over 30 training set scenarios. The abalone dataset was first preprocessed by converting nonnumeric information and subsequently normalizing inputs and target. The goal is to predict the age of abalone from physical measurements. The value of $\varepsilon$ was set to $\varepsilon = 0.05$. Performance was also evaluated on the first of the three synthetic Friedman functions (add10), which are all popular benchmark datasets in regression analysis [20]. The first Friedman model has 10 attributes; however, the output value is computed using only the first five inputs and includes normally distributed noise $\zeta$. More specifically, the function $y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \zeta$ is examined. The 10 input variables are uniformly distributed in $[0, 1]$. A regression tube of $\varepsilon = 5$ was selected. The dataset boston, with 13 attributes describing various input characteristics (e. g. crime rate, pupil-teacher ratio, highway accessibility), predicts as output the value of a home. The samples are normalized and $\varepsilon = 0.1$ was selected.
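As an aside, the first Friedman benchmark used above can be generated directly from the formula just given. A NumPy sketch follows; the noise level is an assumption, since the paper does not state it.

```python
# Synthetic Friedman #1 data ("add10"): ten uniform inputs, of which only the
# first five enter the target, plus Gaussian noise of an assumed standard deviation.
import numpy as np

def friedman1(n_samples, noise_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 10))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + noise_std * rng.standard_normal(n_samples))
    return X, y
```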
The table displays the mean square generalization error and the standard deviation for the semidefinite programming formulations (25) and (26). The table also lists the mean square error for the best radial-basis function support vector regression tuned using cross-validation. The second column shows the $\varepsilon$-values for which the experiments were conducted. Also, the corresponding values for the $\mu_j$ are displayed.
¹ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone
²http://www.cs.toronto.edu/~delve/data/add10/desc.html
³http://lib.stat.cmu.edu/datasets/boston
Mean Square Error on Standard Benchmark Data

Dataset | ε    | MSE(SDP)                                               | MSE(SDP, μ ≥ 0)                            | MSE(RBF-SVM)
abalone | 0.05 | μ1/μ2/μ3: −0.0014/0.0111/0.0043; (10.400 ± 1.923)·10⁻⁴ | μ1/μ2/μ3: 0/0.01/0; (10.387 ± 1.921)·10⁻⁴  | (10.456 ± 1.923)·10⁻⁴
add10   | 5    | μ1/μ2/μ3: 0.0009/0.0057/−0.0040; 15.111 ± 3.594        | μ1/μ2/μ3: 0.0003/0.0046/0; 15.453 ± 4.906  | 11.170 ± 2.314
boston  | 0.1  | μ1/μ2/μ3: 0.0041/0.0121/−0.0186; (44.013 ± 6.483)·10⁻⁴ | μ1/μ2/μ3: 0/0.01/0; (44.061 ± 6.441)·10⁻⁴  | (40.897 ± 9.601)·10⁻⁴
Performance of the two SDP formulations is comparable; the difference between MSE(SDP) and MSE(SDP, $\mu \ge 0$) is rather marginal for all three datasets. For the dataset abalone nonnegative $\mu$'s lead to a slight improvement of the mean square error. Furthermore, for the datasets abalone and boston nonnegative $\mu$'s result in a pure radial-basis function solution with $\mu_1 = \mu_3 = 0$ for both datasets. For unrestricted multipliers the dataset abalone shows a negative coefficient for the polynomial kernel, while the datasets add10 and boston display a negative coefficient for the linear kernel. Compared to the best support vector machine the SDP approach is competitive for the datasets abalone and boston. Nonetheless one needs to keep in mind that the computational effort for evaluating problems (25) and (26) is significantly smaller than that of support vector machine learning tuned using cross-validation. The solution for the dataset abalone for SDP ($\mu \ge 0$) is governed by a pure radial-basis function solution ($\mu_1$ and $\mu_3$ are zero). The mean square error of the corresponding support vector machine solution is very similar ($10.387 \cdot 10^{-4}$ versus $10.456 \cdot 10^{-4}$), since tuning for the dataset abalone resulted in an optimal radial-basis function parameter of $\sigma = 0.5$, which is identical to the $\sigma$ used for $K_2$.
Conclusion and Outlook
In this paper we have presented a new method for calculating the regression kernel matrix as a linear combination of three basis kernel matrices using semidefinite programming techniques.
The coefficients for the three matrices were first left unconstrained and subsequently constrained to be nonnegative. The parameter selection problem is equivalent to a convex optimization problem, guaranteeing that the algorithm identifies the global optimum. Preliminary experimentation on standard benchmark data shows promising results for these techniques when compared to the best radial-basis function support vector machine.
References
[1] Bach F.R., Lanckriet G.R.G., Jordan M.I. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm // Proc. of the 21st Intern. Conf. on Machine Learning, Banff, Canada, 2004.
[2] Bazaraa M.Z., Sherali H.D., Shetty C.M. Nonlinear Programming: Theory and Algorithms. N.Y.: John Wiley & Sons, 1993.
[3] Bi J., Bennett K.P. A geometric approach to support vector regression // Neurocomputing. 2003. Vol. 55. P. 79-108.
[4] Bertsekas D.P. Nonlinear Programming. Belmont, Massachusetts: Athena Scientific, 1999.
[5] Burges C.J.C. A tutorial on support vector machines for pattern recognition // Data Mining and Knowledge Discovery. 1998. Vol. 2(2). P. 121-167.
[6] Chapelle O., Vapnik V., Bousquet O., Mukherjee S. Choosing multiple parameters for support vector machines // Machine Learning. 2002. Vol. 46(1/3). P. 131-159.
[7] Cristianini N., Shawe-Taylor J. An Introduction to Support Vector Machines. Cambridge: Cambridge Univ. Press, 2000.
[8] Cristianini N., Shawe-Taylor J., Kandola J., Elisseef A. On kernel target alignment // Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2001.
[9] Lanckriet G., Cristianini N., El-Ghaoui L. et al. Learning the kernel matrix with semidefinite programming // J. of Machine Learning Research. 2004. Vol. 5. P. 27-72.
[10] Pardalos P.M., Wolkowicz H. (Eds) Topics in semidefinite and interior-point methods // Fields Institute Communications Series. 1998. Vol. 18. Amer. Math. Society.
[11] Ramana M., Pardalos P.M. Semidefinite programming // Interior point methods of mathematical programming / T. Terlaky (Ed.). N.Y.: Kluwer Acad. Publ., 1996. P. 369-398.
[12] Schölkopf B., Smola A.J. Learning with Kernels. Cambridge, Massachusetts: MIT Press, 2002.
[13] Smola A., Schölkopf B. A tutorial on support vector regression // Statistics and Computing. 2004. Vol. 14(3). P. 199-222.
[14] Sturm J.F. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones // Optimization Methods and Software. 1999. Vol. 11-12. P. 625-653.
[15] Löfberg J. YALMIP: a Toolbox for Modeling and Optimization in MATLAB // Proc. of the CACSD Conf., 2004. Taipei, Taiwan. http://control.ee.ethz.ch/~joloef/yalmip.php.
[16] Trafalis T.B., Malyscheff A.M. Optimal selection of the regression kernel matrix with semidefinite programming // Frontiers In Global Optimization. Ser. on Nonconvex Optimization and Its Applications / C.A. Floudas, P.M. Pardalos (eds). N.Y.: Kluwer Acad. Publ., 2004. P. 575-584.
[17] Vandenberghe L., Boyd S. Semidefinite programming // SIAM Review. 1996. Vol. 38(1).
[18] Vapnik V. Estimation of Dependencies Based on Empirical Data. Berlin: Springer Verlag, 1982.
[19] Vapnik V. The Nature of Statistical Learning Theory. Berlin: Springer Verlag, 1995.
[20] Vapnik V. Statistical Learning Theory. N.Y.: John Wiley & Sons, 1998.
Received for publication September 18, 2005