Problems of Higher Education
UDC 519.234
M. M. Lutsenko, N. V. Shadrintseva
Department of «Mathematics and Simulation»
RELIABILITY OF TEST SCORE
In this paper we develop several game-theoretic models of testing and determine their reliability (the probability of a correct assessment of the person being tested, hereinafter the «Examinee»), as well as the optimal decision function and the worst prior distribution. In developing the models we use the results of the authors' earlier work [1-3] and concepts of statistical decision theory, whose main object is a statistical game between Nature and a Statistician. The problems are solved with MS Excel for tests of 10 items. The reliability of test scores is found without any assumption about the prior distribution of the Examinee's level of knowledge. In many important cases the reliability of the assessment turns out to be very low.
educational testing, antagonistic game, statistical game, randomized decision function, worst prior distribution.
Introduction
The main objective of any test is assessment: measuring test-takers' knowledge, skills or aptitudes or, in many other problems, classifying them. This objective becomes a high priority when administrative decisions, such as issuing a certificate of education or enrolling a student in an educational institution, are taken on the basis of the test results. There is an extensive literature on test theory (see [4-6]), but all of these works rely, directly or indirectly, on an interval estimate of the parameter of a binomial (hypergeometric) distribution. Moreover, all assessments of test reliability assume that the knowledge level of an Examinee is normally distributed, and it is difficult to agree with this assumption. When students' knowledge is checked by tests, an essential part of the learning process is devoted to test preparation. If the students know the form in which the test is to be conducted, as well as its subjects and types of tasks, they can adapt their knowledge so that objective assessment is impeded. Therefore a conflict situation (a game) arises whose participants are an Examinee, who wants to spend as little time as possible preparing for the test and to get the highest
2012/2
Proceedings of Petersburg Transport University
score, and a decision maker (a Statistician) who is to assess the level of knowledge as accurately as possible.
In classical test theory, reliability is (indirectly) an estimate of the error one would expect if a student took a hypothetical parallel test. In generalizability theory it is an estimate of the difference between the «universe» score and the score on any particular test. In our model the educational test is a measurement instrument, and we want to find its accuracy or reliability, that is, the probability of a correct assessment of an Examinee. This approach was developed by the authors in [1-3]; it differs from the one considered in the literature [4-6], where values of the item response function are estimated.
Classification and Assessment
Let us formulate the problem of assessing a student's knowledge level more precisely. Assume that, according to the test results, a group of pupils is divided into $N$ subgroups, and a pupil's level of knowledge is determined by the number of the subgroup into which the pupil falls. The subgroup number can be chosen in different ways, for example by the number of correct responses to the test tasks or, if the tasks have different «weights», by the number of score points awarded for the solved tasks. Let $X_\theta$ denote the number of the subgroup into which a pupil with knowledge level $\theta$ (the type of the Examinee) falls. Thus the Statistician, observing a value $x$ of the random variable $X_\theta$, has to assess the type $\theta$ of the Examinee.
Let us reduce the classification problem to a problem of statistical estimation. For this purpose we introduce the following notation. We denote the finite set of possible levels of a pupil's knowledge by $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ (the set of parameters); the set of values of the random variable $X_\theta$ by $X = \{x_1, x_2, \ldots, x_N\}$ (the set of observations); and the family of distributions of $X_\theta$ on the set $X$ by $\{P_\theta(x)\}_{\theta \in \Theta}$. So $P_\theta(x)$ is the probability that the test score $X_\theta$ of a pupil equals $x$ if his level of knowledge equals $\theta$. We denote the set of admissible grades of a pupil's knowledge by $D = \{d_1, d_2, \ldots, d_n\}$ (the set of decisions). The Statistician's decision when the random variable $X_\theta$ takes the value $x$ is denoted by $\delta(x)$. The function $\delta: X \to D$ is called a decision function. We denote the set of decision functions by $\mathcal{D} = D^X$. Obviously, every decision function can be represented as a vector $\delta = (\delta_1, \delta_2, \ldots, \delta_N)$ with $\delta_k = \delta(x_k) \in D$.
In this notation, the Statistician, observing the value of the random variable $X_\theta$ with an unknown value of the parameter $\theta$, should make the decision $\delta(X_\theta) \in D$ that gives the most accurate estimate of $\theta$; in other words, he should find a decision function $\delta$ whose value is closest to $\theta$.
In statistics two kinds of estimates are considered: point estimates and interval estimates. To construct the first kind one needs to know the losses of the Statistician when the estimate of the unknown parameter $\theta$ is incorrect, i.e. a loss function of the Statistician $L(\theta, d)$. Unfortunately, from the data of the problem it is hard to construct such a function that is convex in the variable $d$.
To construct an interval estimate we associate with each grade $d \in D$ a subset of knowledge levels $\Theta(d) \subset \Theta$ that is acceptable for this decision. Given the family of subsets $\{\Theta(d)\}_{d \in D}$, we construct the payoff function of the Statistician as follows:
$$h(d, \theta) = \mathbf{1}_{\Theta(d)}(\theta) = \begin{cases} 1, & \text{if } \theta \in \Theta(d), \\ 0, & \text{if } \theta \notin \Theta(d). \end{cases}$$
Thus, the payoff of the Statistician equals one only when he has assessed the Examinee correctly, that is, when the type $\theta$ of the Examinee belongs to the set $\Theta(d)$ of types acceptable for the given decision $d \in D$.
Let us fix a family of acceptable intervals $\{\Theta(d)\}_{d \in D}$. Each decision function $\delta: X \to D$ generates a family of confidence intervals $\{\Theta(\delta(x))\}_{x \in X}$.
For each parameter $\theta$ and decision function $\delta$ let us find the probability that the family of confidence intervals $\{\Theta(\delta(x))\}_{x \in X}$ covers the unknown parameter $\theta$. For this purpose
ISSN 1815-588Х. Известия ПГУПС
let us use the law of total probability:
$$P(\theta \in \Theta(\delta(X_\theta))) = \sum_{x \in X} P(X_\theta = x) \cdot P(\theta \in \Theta(\delta(x)) \mid X_\theta = x).$$
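Using the notation introduced above for the function $h$ and the family of distributions $\{P_\theta(x)\}_{\theta \in \Theta}$, we get:
$$P(\theta \in \Theta(\delta(X_\theta))) = \sum_{x \in X} P_\theta(x) \cdot h(\delta(x), \theta) = H(\delta, \theta).$$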
Let us call the function $H(\delta, \theta)$ a success function, by analogy with the Wald risk function.
The smallest probability that the family of confidence intervals $\{\Theta(\delta(x))\}_{x \in X}$ generated by the decision function $\delta$ covers the unknown parameter $\theta$ is called the confidence probability of this family (of the decision function $\delta$), i.e.
$$\gamma = \gamma(\delta) = \min_{\theta \in \Theta} P(\theta \in \Theta(\delta(X_\theta))).$$
The aim of the Statistician is to find a decision function $\delta$ (in other words, a family of confidence intervals) for which the confidence probability $\gamma$ is maximal.
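The definitions of the payoff, the success function and the confidence probability can be traced on a small example. The following is only a sketch on hypothetical data (two knowledge levels, three possible scores, two grades); none of the names or numbers come from the paper.

```python
# Toy illustration (all numbers hypothetical): two knowledge levels, three scores.
THETA = ["weak", "strong"]             # parameter set Theta
X = [0, 1, 2]                          # observation set X
P = {                                  # P[theta][x] = P_theta(x)
    "weak":   [0.6, 0.3, 0.1],
    "strong": [0.1, 0.3, 0.6],
}
ACCEPT = {"fail": {"weak"}, "pass": {"strong"}}   # acceptable set Theta(d) per decision d


def h(d, theta):
    """Payoff h(d, theta): 1 if theta lies in the acceptable set Theta(d), else 0."""
    return 1 if theta in ACCEPT[d] else 0


def success(delta, theta):
    """Success function H(delta, theta) = sum over x of P_theta(x) * h(delta(x), theta)."""
    return sum(P[theta][x] * h(delta[x], theta) for x in X)


def confidence(delta):
    """Confidence probability gamma(delta) = min over theta of H(delta, theta)."""
    return min(success(delta, theta) for theta in THETA)


delta = {0: "fail", 1: "fail", 2: "pass"}    # a pure decision function X -> D
print({theta: success(delta, theta) for theta in THETA})
print(confidence(delta))
```

In this toy case the rule covers a weak examinee with probability about 0.9 and a strong one with probability 0.6, so its confidence probability is the smaller of the two.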
On the other hand, let us assume that the parameter $\theta$ itself is a random variable with a known distribution $\nu$, i.e. the Statistician observes a random variable $X_\nu$ with distribution
$$P(X_\nu = x) = \int_\Theta P_\theta(x)\, d\nu(\theta).$$
The distribution $\nu$ is called the prior distribution, the random variable $X_\nu$ the posterior random variable, and its distribution the posterior distribution.
The weighted mean of the success function for a given decision function $\delta$ and prior distribution $\nu$ equals
$$H(\nu, \delta) = \int_\Theta H(\delta, \theta)\, d\nu(\theta) = \sum_{x \in X} \int_\Theta P_\theta(x) \cdot h(\delta(x), \theta)\, d\nu(\theta).$$
It equals the probability that the unknown parameter $\theta$ falls into the confidence interval generated by the decision function $\delta$ when the parameter has the known distribution $\nu$.
A function $\delta_\nu$ maximizing $H(\nu, \delta)$ is called a Bayesian decision (Bayesian decision function) with respect to the distribution $\nu$, and the corresponding value is called the Bayesian success for the prior distribution $\nu$, i.e. the Bayesian success equals
$$H(\nu, \delta_\nu) = \max_{\delta} H(\nu, \delta).$$
The Bayesian decision function $\delta_\nu$ generates the family of intervals whose average coverage probability is maximal for the given prior distribution $\nu$ of the parameter $\theta$.
In our case the set $\mathcal{D}$ is the $N$-fold Cartesian power of the set $D$, and the function $\varphi(\delta) = \varphi(\delta_1, \delta_2, \ldots, \delta_N) = H(\nu, \delta)$ is separable in the variables $\delta_1, \delta_2, \ldots, \delta_N$. Hence, for the Bayesian success we obtain:
$$H(\nu, \delta_\nu) = \sum_{k=1}^{N} \max_{\delta_k \in D} \int_\Theta P_\theta(x_k) \cdot h(\delta_k, \theta)\, d\nu(\theta), \qquad \delta_k = \delta(x_k).$$
In other words, the value of the Bayesian decision function $\delta_\nu$ at a point $x_k$ maximizes the term of the sum that depends only on $\delta_k$. Consequently, the values of the function $\delta_\nu$ can be found at each point independently of its values at the other points.
The worst distribution for the Statistician is the prior distribution $\nu^*$ of the parameter $\theta$ for which the Bayesian success is minimal. In this case the minimal Bayesian success equals
$$H(\nu^*, \delta_{\nu^*}) = \min_{\nu} \max_{\delta} H(\nu, \delta).$$
The prior distribution $\nu^*$ at which this minimum is attained is called the worst prior distribution.
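Because the Bayesian success is a sum of terms that each depend on a single observation, the Bayesian decision function can be computed one observation at a time. The sketch below illustrates this on hypothetical data (the sets, probabilities and the prior are invented for illustration and are not taken from the paper).

```python
# Toy data (hypothetical): two types, three observations, two decisions.
THETA = ["weak", "strong"]
X = [0, 1, 2]
D = ["fail", "pass"]
P = {"weak": [0.6, 0.3, 0.1], "strong": [0.1, 0.3, 0.6]}   # P[theta][x]
ACCEPT = {"fail": {"weak"}, "pass": {"strong"}}            # Theta(d)


def bayes_decision(nu):
    """Return (delta_nu, Bayesian success) for a prior nu given as {theta: weight}."""
    delta, total = {}, 0.0
    for x in X:
        # Term of the separable sum that depends only on the decision taken at x:
        # sum over acceptable theta of nu(theta) * P_theta(x).
        score = {d: sum(nu[t] * P[t][x] for t in ACCEPT[d]) for d in D}
        best = max(D, key=score.get)     # ties are resolved in favour of the first decision
        delta[x] = best
        total += score[best]
    return delta, total


nu = {"weak": 0.5, "strong": 0.5}
delta_nu, bayes_success = bayes_decision(nu)
print(delta_nu, bayes_success)
```

Minimizing `bayes_success` over all priors `nu` would give the worst prior distribution; the sketch only evaluates one fixed prior.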
Testing as a statistical game
For the classification problem considered above we construct the statistical game «Testing» $\Gamma = \langle \mathcal{D}, \Theta, H \rangle$ between the Statistician
and Nature. In this game the set of the Statistician's strategies $\mathcal{D} = D^X$ is the space of decision functions, the set of Nature's strategies $\Theta$ is the parameter set, and the payoff function has the following form:
$$H(\delta, \theta) = \sum_{x \in X} P_\theta(x) \cdot h(\delta(x), \theta), \qquad \delta \in \mathcal{D} = D^X.$$
The Statistician (player 1) wants to increase the coverage probability $H(\delta, \theta)$, while Nature (player 2) wants to get the best mark, i.e. it wants to distort the result of the exam.
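The lower value of the game $\Gamma$ equals the maximum confidence level that the Statistician can guarantee regardless of the actions of Nature:
$$\underline{v} = \max_{\delta \in \mathcal{D}} \min_{\theta \in \Theta} H(\delta, \theta) = \max_{\delta \in \mathcal{D}} \gamma(\delta).$$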
The decision function $\delta^*$ at which the maximum is attained generates the optimal family of confidence intervals. Note that the upper value of the game $\Gamma$ equals one, i.e.
$$\overline{v} = \min_{\theta \in \Theta} \max_{\delta \in \mathcal{D}} H(\delta, \theta) = 1.$$
Since the upper value of the game $\Gamma$ is greater than the lower one, we shall look for a solution of the game in mixed strategies. We denote by $\bar{\mathcal{D}}$, $\bar{\Theta}$ the spaces of probability measures (distributions) defined on the respective sets and containing all the degenerate measures. Then the payoff function of the mixed extension $\bar{\Gamma} = \langle \bar{\mathcal{D}}, \bar{\Theta}, \bar{H} \rangle$ of the game $\Gamma$ is
$$\bar{H}(\mu, \nu) = \iint_{\mathcal{D} \times \Theta} H(\delta, \theta)\, d\mu(\delta)\, d\nu(\theta).$$
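If $\mu^*$, $\nu^*$ are degenerate measures with supports $\delta^*$, $\theta^*$ respectively, then we can write
$$\bar{H}(\delta^*, \nu) = \bar{H}(\mu^*, \nu); \qquad \bar{H}(\mu, \theta^*) = \bar{H}(\mu, \nu^*); \qquad \bar{H}(\delta^*, \theta^*) = \bar{H}(\mu^*, \nu^*) = H(\delta^*, \theta^*).$$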
Mixed strategies (probability measures) $\mu \in \bar{\mathcal{D}}$, $\nu \in \bar{\Theta}$ assign probabilities to the pure strategies of the players and thus allow the players to select pure strategies at random. The payoff function $\bar{H}(\mu, \nu)$ is the expectation of the payoff function $H(\delta, \theta)$ when the players use their mixed strategies $\mu \in \bar{\mathcal{D}}$, $\nu \in \bar{\Theta}$ respectively.
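The solution of the statistical game $\Gamma = \langle \mathcal{D}, \Theta, H \rangle$ in mixed strategies is a solution of the game $\bar{\Gamma} = \langle \bar{\mathcal{D}}, \bar{\Theta}, \bar{H} \rangle$, that is, a triple $(\mu^*, \nu^*, v)$ for which the following inequalities hold:
$$\bar{H}(\mu, \nu^*) \le v \le \bar{H}(\mu^*, \nu) \quad \text{for any } \mu \in \bar{\mathcal{D}},\ \nu \in \bar{\Theta}.$$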
It is easy to prove that in these inequalities we can restrict ourselves to degenerate measures $\mu$, $\nu$. This means that it is sufficient to verify the following inequalities:
$$\bar{H}(\delta, \nu^*) \le v \le \bar{H}(\mu^*, \theta) \quad \text{for any } \delta \in \mathcal{D},\ \theta \in \Theta.$$
These two inequalities are equivalent to the following equalities:
$$v = \min_{\nu} \max_{\delta} \bar{H}(\delta, \nu) = \max_{\mu} \min_{\theta} \bar{H}(\mu, \theta).$$
Thus, the optimal strategy $\mu^*$ of player 1 is a randomized decision function for which the probability of covering an unknown parameter $\theta$ is the greatest.
The optimal strategy of Nature (the worst prior distribution) is a distribution for which the Bayesian decision function is the least effective.
The value of the statistical game $\Gamma$ is the probability with which any student is assessed correctly, that is, with which his type is determined correctly.
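One consequence of the saddle-point relations is worth writing out, since it explains how lower and upper bounds on the game value arise when an exact solution is not available. The chain below is a sketch that follows from the max-min and min-max characterizations of the value; here $\mu$ stands for any randomized decision function of the Statistician and $\nu$ for any prior distribution, not necessarily optimal ones.

```latex
\min_{\theta \in \Theta} \bar{H}(\mu, \theta)
  \;\le\; \max_{\mu'} \min_{\theta \in \Theta} \bar{H}(\mu', \theta)
  \;=\; v
  \;=\; \min_{\nu'} \max_{\delta \in \mathcal{D}} \bar{H}(\delta, \nu')
  \;\le\; \max_{\delta \in \mathcal{D}} \bar{H}(\delta, \nu)
```

So every randomized decision function yields a lower bound on the value and every prior distribution yields an upper bound (its Bayesian success); the two bounds coincide exactly when both strategies are optimal.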
Solution of finite statistical games
It is well known that the Statistician can use mixed strategies of the following form: $\mu = (\mu_1, \mu_2, \ldots, \mu_N)$, where $N = |X|$ is the number of elements of the set $X$ and $\mu_k$, $k = 1, \ldots, N$, are probability measures on the decision set $D$.
If the sets $X$, $D$, $\Theta$ are finite and the numbers of their elements are $N$, $n$, $m$ respectively, then $\Gamma$ is a matrix game with a payoff matrix of size $Nn \times m$. We denote the elements of the matrix $B$ by $b_{ij} = h(d_j, \theta_i)$, $i = 1, \ldots, m$, $j = 1, \ldots, n$; the nonzero elements of the diagonal matrix $\Lambda_k$ by $\lambda_i^k = P_{\theta_i}(x_k)$, $i = 1, \ldots, m$, $k = 1, \ldots, N$; the components of the randomized decision function by $\mu = (\mu_1, \mu_2, \ldots, \mu_N)$ with $\mu_k = (\mu_k^1, \mu_k^2, \ldots, \mu_k^n)$, $k = 1, \ldots, N$; the vector of the prior distribution of the parameter $\theta$ by $\nu = (\nu_1, \nu_2, \ldots, \nu_m)$; and a column of $m$ ones by $\mathbf{1}_m$.
To solve the matrix game we construct a pair of dual linear programming problems. From the solutions of the first and the second problems we find the best randomized decision function $\mu = (\mu_1, \mu_2, \ldots, \mu_N)$ and the worst prior distribution $\nu$. The common value of the two problems is the value of the game $\Gamma$.
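Primal problem
$$v \to \max; \qquad \sum_{k=1}^{N} \Lambda_k B \mu_k \ge v \mathbf{1}_m; \qquad \sum_{j=1}^{n} \mu_k^j = 1; \quad \mu_k^j \ge 0; \quad k = 1, \ldots, N; \ j = 1, \ldots, n.$$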
Dual problem
$$v = \sum_{k=1}^{N} u_k \to \min; \qquad \nu^{T} \Lambda_k B \le u_k \mathbf{1}_n^{T}; \quad k = 1, \ldots, N; \qquad \sum_{i=1}^{m} \nu_i = 1.$$
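The two problems can be read as recipes for bounds: for a fixed feasible $\mu$ the largest admissible $v$ in the primal problem is the smallest component of $\sum_k \Lambda_k B \mu_k$, and for a fixed prior $\nu$ the smallest admissible $u_k$ in the dual problem is the largest component of $\nu^T \Lambda_k B$. The sketch below evaluates both on hypothetical toy data (the matrices and strategies are invented for illustration); it does not solve the linear programs, it only checks a candidate pair.

```python
# Toy data (hypothetical): m = 2 types, n = 2 decisions, N = 3 observations.
B = [[1, 0],             # B[i][j] = h(d_j, theta_i): decision j is acceptable for type i
     [0, 1]]
LAM = [[0.6, 0.1],       # LAM[k][i] = P_{theta_i}(x_k), the diagonal of Lambda_k
       [0.3, 0.3],
       [0.1, 0.6]]


def primal_bound(mu):
    """Largest v feasible for a given mu: min over i of (sum_k Lambda_k B mu_k)_i."""
    m, n = len(B), len(B[0])
    rows = [
        sum(LAM[k][i] * sum(B[i][j] * mu[k][j] for j in range(n))
            for k in range(len(LAM)))
        for i in range(m)
    ]
    return min(rows)


def dual_bound(nu):
    """Smallest sum of u_k feasible for a given nu: u_k = max over j of (nu^T Lambda_k B)_j."""
    m, n = len(B), len(B[0])
    return sum(
        max(sum(nu[i] * LAM[k][i] * B[i][j] for i in range(m)) for j in range(n))
        for k in range(len(LAM))
    )


mu = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # mu[k][j]: probability of decision j at x_k
nu = [0.5, 0.5]                              # prior over the two types
lo, hi = primal_bound(mu), dual_bound(nu)
print(lo, hi)   # lo <= game value <= hi for any feasible pair
```

For this toy pair the two bounds coincide, which certifies that both strategies are optimal for the toy game; in general the gap between them measures how far a candidate solution is from optimal.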
There are many methods for solving linear programming problems. The dynamic method developed for statistical games with threshold payoff functions would be the most convenient here, see [2, 3]. In our cases, however, the statistical game can be solved by the standard tools of MS Excel. Although this method often does not give the exact solution, it always gives acceptable solutions of the problems together with upper and lower estimates of the value of the matrix game.
Solutions of testing games
Here we consider examples of testing games and give an interpretation of the solutions.
Example 1. According to the results of testing, a group of students is divided into 10 subgroups $X = \{A_0, A_1, A_2, \ldots, A_9\}$ (the space of observations). It is necessary to divide the original group into four classes so that the first class consists only of excellent students, the second only of good students, and the third and the fourth only of fair and poor students respectively.
We denote the set of types of students (the space of parameters) by $\Theta = \{\text{excellent}, \text{good}, \text{fair}, \text{poor}\}$ and by $P_\theta(x)$ the probability that a student of type $\theta$ belongs to subgroup $x \in X$. We suppose that these probabilities are known and given in Table 1.
Suppose that subgroup $A_i$ consists of the students who solved correctly from $10i$ % to $10(i + 1)$ % of the test items. Then the data of Table 1 are interpreted in the following way. Excellent students solve over 90 % of the test items with probability 0.9 and from 80 % to 90 % with probability 0.1. Good students solve from 70 % to 80 % of the tasks with probability 0.8, and so on. Poor students solve less than 40 % of the test tasks with probability 0.95.
Therefore, Table 1 is compiled so that the different types of students are well separated from each other.
Table 1. Probability distributions $P_\theta(x)$ for Example 1

| $\theta$ \ $x$ | $A_0$ | $A_1$ | $A_2$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ | $A_7$ | $A_8$ | $A_9$ |
|---|---|---|---|---|---|---|---|---|---|---|
| excellent | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.9 |
| good | 0 | 0 | 0 | 0 | 0 | 0 | 0.05 | 0.8 | 0.1 | 0.05 |
| fair | 0 | 0 | 0.05 | 0.1 | 0.7 | 0.1 | 0.05 | 0 | 0 | 0 |
| poor | 0.1 | 0.15 | 0.6 | 0.1 | 0.05 | 0 | 0 | 0 | 0 | 0 |
We denote by $D = \{\text{excellent}, \text{good}, \text{fair}, \text{poor}\}$ the Statistician's decision set. Thus, the set of parameters and the set of decisions are equal, i.e. $D = \Theta$. The payoff function is given by the following formula:
$$h(d, \theta) = \begin{cases} 1, & \text{if } \theta = d, \\ 0, & \text{if } \theta \ne d. \end{cases}$$
In other words, the Statistician wins one unit if he identifies the level of a student correctly. Hence, with every decision $d$ we associate an interval that consists of the single point $\theta = d$.
The elements of the set of decision functions $\mathcal{D} = D^X$ are vectors $d = (d_0, d_1, \ldots, d_9)$, whose coordinate $d_k$ is the decision that the Statistician makes if he observes $A_k$ ($k = 0, \ldots, 9$). The expectation of the payoff function when the decision function $d = (d_0, d_1, \ldots, d_9)$ is used has the following form:
$$H(d, \theta) = \sum_{i=0}^{9} P_\theta(A_i)\, h(d_i, \theta).$$
Here $\theta$ is the type of a student.
So we construct a statistical game $\Gamma = \langle \mathcal{D}, \Theta, H \rangle$ whose components are defined above. This is a matrix game with a payoff matrix of size $40 \times 4$ that can be solved using MS Excel.
We now give the solution of this game. The value of the game $\Gamma$ equals 0.900. This means that the Statistician assesses the knowledge level correctly for only 90 % of examinees. The randomized decision function of the Statistician $\mu = (\mu_0, \mu_1, \ldots, \mu_9)$ has the following form. An examinee is regarded as an excellent student if he solved over 90 % of the test tasks correctly ($\mu_9$ = excellent with probability 1); as a good student if he solved from 70 % to 90 % of the test tasks ($\mu_8 = \mu_7$ = good with probability 1); as a fair student if he solved from 40 % to 70 % of the test tasks ($\mu_6 = \mu_5 = \mu_4$ = fair with probability 1); and as a poor student if he solved less than 30 % of the test tasks ($\mu_2 = \mu_1 = \mu_0$ = poor with probability 1). If an examinee solved from 30 % to 40 % of the test tasks, we regard him as a fair or a poor student with equal probabilities ($\mu_3$ = fair with probability 0.5 and $\mu_3$ = poor with probability 0.5).
Table 2 gives an optimal strategy of Nature (a recommendation for students). So, if the group of examinees contains 17.5 % excellent students and 27.5 % each of good, fair and poor students, then the Statistician assesses the knowledge level correctly in only 90 % of cases.
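The reported value 0.900 can be checked directly from the data of Table 1, without solving the linear programs again. The sketch below (variable names are ours) evaluates the guaranteed success of the randomized decision function described above and the Bayesian success of the prior from Table 2; since one is a lower bound and the other an upper bound on the game value, agreement of the two confirms the solution.

```python
TYPES = ["excellent", "good", "fair", "poor"]

# Table 1: P[theta][k] is the probability that a student of type theta falls into A_k.
P = {
    "excellent": [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.9],
    "good":      [0, 0, 0, 0, 0, 0, 0.05, 0.8, 0.1, 0.05],
    "fair":      [0, 0, 0.05, 0.1, 0.7, 0.1, 0.05, 0, 0, 0],
    "poor":      [0.1, 0.15, 0.6, 0.1, 0.05, 0, 0, 0, 0, 0],
}

# Randomized decision function from the text: MU[k][d] is the probability of grade d at A_k.
MU = [
    {"poor": 1.0}, {"poor": 1.0}, {"poor": 1.0},
    {"fair": 0.5, "poor": 0.5},
    {"fair": 1.0}, {"fair": 1.0}, {"fair": 1.0},
    {"good": 1.0}, {"good": 1.0},
    {"excellent": 1.0},
]

# Table 2: the prior distribution of Nature.
NU = {"excellent": 0.175, "good": 0.275, "fair": 0.275, "poor": 0.275}

# Here h(d, theta) = 1 only for d = theta, so H(mu, theta) = sum_k P_theta(A_k) * mu_k(theta).
lower = min(sum(P[t][k] * MU[k].get(t, 0.0) for k in range(10)) for t in TYPES)

# Bayesian success of NU: at each A_k pick the grade with the largest nu(theta) * P_theta(A_k).
upper = sum(max(NU[t] * P[t][k] for t in TYPES) for k in range(10))

print(lower, upper)
```

Both quantities come out equal to 0.9 up to floating-point rounding, in agreement with the stated value of the game.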
Table 2. The worst prior distribution $\nu$ of the parameter $\theta$ for Example 1

| $\theta$ | excellent | good | fair | poor |
|---|---|---|---|---|
| $\nu$ | 0.175 | 0.275 | 0.275 | 0.275 |

Example 2. Assume that the test consists of 10 items and the Statistician makes a decision based on the test results. The space of observations $X$ consists of the 11 numbers from zero to 10 (the numbers of solved tasks). The probability $\theta$ of a correct answer to one test item is the knowledge level of a student. Suppose the set of parameters $\Theta = \{0.95;\ 0.85;\ 0.75;\ 0.65;\ 0.55;\ 0.45;\ 0.35;\ 0.25;\ 0.15;\ 0.05\}$ contains all the possible knowledge levels of students. Then the probability of $x$ correct answers can be found by the Bernoulli formula
$$P_\theta(x) = \binom{10}{x} \theta^{x} (1 - \theta)^{10 - x}, \qquad x = 0, \ldots, 10.$$
For the assessment of the student's knowledge level the Statistician has the following four grades: $D = \{\text{excellent}, \text{good}, \text{fair}, \text{poor}\}$. A student is regarded as excellent if his knowledge level is 95 % or 85 %, and as good if his knowledge level is between 55 % and 75 %. If the level of the student's knowledge is 45 % or 35 %, then he is fair. In all other cases we regard him as poor. After that, we construct the statistical game $\Gamma = \langle \mathcal{D}, \Theta, H \rangle$ and solve it in mixed strategies. The payoff matrix in this game has size $44 \times 10$. Unfortunately, MS Excel does not allow us to solve the two linear programming problems exactly. But we get upper and lower bounds of the game value as well as the randomized decision function and the worst prior distribution of the parameter $\theta$. As a result of the calculations we obtain the following lower (0.519) and upper (0.562) bounds for the game value.
Tables 3 and 4 contain the optimal strategies of the players. The columns of Table 3 give the probabilities with which the Statistician makes each decision depending on his observation.
Thus, we get the correct assessment of the student's knowledge level with a probability that lies between 0.52 and 0.56. Therefore, in the worst case nearly half of the Statistician's decisions about the level of a student's knowledge are wrong.
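The lower bound can be reproduced from the binomial model and the entries of Table 3. The sketch below is only an illustration under stated assumptions: the function and variable names are ours, the grade boundaries follow the description of Example 2, and the probabilities are the rounded values printed in Table 3, so the result can only be expected to match the reported bound approximately.

```python
from math import comb

THETAS = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]


def grade(theta):
    """Grade assigned to knowledge level theta in Example 2."""
    if theta >= 0.85:
        return "excellent"
    if theta >= 0.55:
        return "good"
    if theta >= 0.35:
        return "fair"
    return "poor"


def binom_pmf(theta, x, n=10):
    """Bernoulli formula: probability of x correct answers out of n."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


# Table 3 (rounded): MU[x][d] is the probability of grade d after x correct answers.
MU = {
    10: {"excellent": 1.00},
    9: {"excellent": 0.49, "good": 0.51},
    8: {"excellent": 0.75, "good": 0.24},
    7: {"good": 0.95, "fair": 0.05},
    6: {"good": 0.75, "fair": 0.25},
    5: {"good": 0.70, "fair": 0.30},
    4: {"fair": 1.00},
    3: {"fair": 1.00},
    2: {"poor": 1.00},
    1: {"poor": 1.00},
    0: {"poor": 1.00},
}


def success(theta):
    """Probability that a student of level theta receives the grade acceptable for theta."""
    return sum(binom_pmf(theta, x) * MU[x].get(grade(theta), 0.0) for x in range(11))


guaranteed = min(success(theta) for theta in THETAS)
print(guaranteed)
```

With these rounded entries the minimum should come out close to the reported lower bound of 0.519; any small discrepancy would be due to the rounding of the table.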
Example 3. Suppose that a test contains 10 items and the Statistician makes a decision based on the test results. The observation set $X$ consists of the 11 numbers from zero to ten. The probability $\theta$ of a correct answer to a test item is a measure of the respondent's knowledge. The possible knowledge levels form the parameter set $\Theta = \{0.95;\ 0.85;\ 0.75;\ 0.65;\ 0.55;\ 0.45;\ 0.35;\ 0.25;\ 0.15;\ 0.05\}$. The probability that the examinee gives exactly $x$ correct answers is given by the formula
$$P(X_\theta = x) = \binom{10}{x} \theta^{x} (1 - \theta)^{10 - x}, \qquad x = 0, \ldots, 10.$$
Thus, each examinee has one of 10 possible knowledge levels, whose values vary from 95 % to 5 %. In this example the decision set $D$ and the parameter set $\Theta$ are equal ($\Theta = D$). The acceptable interval $\Theta(d)$ includes only those parameters $\theta$ that lie no further than 10 % from $d$, i.e.
$$h(d_j, \theta_i) = \mathbf{1}_{\Theta(d_j)}(\theta_i) = \begin{cases} 1, & \text{if } |\theta_i - d_j| \le 0.1, \\ 0, & \text{if } |\theta_i - d_j| > 0.1. \end{cases}$$
Now we construct the statistical game $\Gamma = \langle \mathcal{D}, \Theta, H \rangle$ and solve it in mixed strategies by means of MS Excel. The payoff matrix in the game has size $110 \times 10$.
As a result we get the upper (0.788) and lower (0.771) bounds of the game value as well as the randomized decision function $\mu = (\mu_0, \mu_1, \ldots, \mu_{10})$ (Table 5) and the worst prior distribution of the parameter $\theta$ (Table 6).
We point out that the Statistician observes only the random variable $X_\nu$ with the following distribution:
$$P(X_\nu = x) = \sum_{i=1}^{m} P_{\theta_i}(x)\, \nu_i.$$
The value of the random variable $X_\nu$ is the number of correct answers to the test items under the prior distribution $\nu$. Figure 1 shows the histogram of the random variable $X_\nu$ for the worst prior distribution $\nu$ together with its normal approximation. Not surprisingly, the null hypothesis that $X_\nu$ is normally distributed is accepted.
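The distribution of $X_\nu$ is a mixture of binomial distributions and is easy to tabulate. The sketch below computes it for the prior of Table 6, together with the mean and variance that a normal approximation would use. The names are ours, and because the rounded entries of Table 6 sum to 0.98, the sketch renormalizes them, which is an assumption made here for illustration.

```python
from math import comb

THETAS = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
NU_TABLE = [0.03, 0.05, 0.12, 0.14, 0.18, 0.11, 0.14, 0.10, 0.08, 0.03]  # Table 6 (rounded)

# Renormalize the rounded weights so that they sum to one (an assumption of this sketch).
total = sum(NU_TABLE)
NU = [w / total for w in NU_TABLE]


def binom_pmf(theta, x, n=10):
    """Probability of x correct answers out of n for knowledge level theta."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


# P(X_nu = x) = sum_i P_{theta_i}(x) * nu_i for x = 0, ..., 10.
mixture = [sum(w * binom_pmf(t, x) for t, w in zip(THETAS, NU)) for x in range(11)]

# Moments used by a normal approximation of the histogram.
mean = sum(x * p for x, p in enumerate(mixture))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(mixture))

for x, p in enumerate(mixture):
    print(x, round(p, 4))
print(mean, variance)
```

Plotting `mixture` as a bar chart against a normal density with the computed mean and variance would give a picture of the kind described for Figure 1.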
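Table 3. Components of the randomized decision function $\mu$ for Example 2

| Decision | $\mu_{10}$ | $\mu_9$ | $\mu_8$ | $\mu_7$ | $\mu_6$ | $\mu_5$ | $\mu_4$ | $\mu_3$ | $\mu_2$ | $\mu_1$ | $\mu_0$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| excellent | 1.00 | 0.49 | 0.75 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| good | 0 | 0.51 | 0.24 | 0.95 | 0.75 | 0.70 | 0 | 0 | 0 | 0 | 0 |
| fair | 0 | 0 | 0 | 0.05 | 0.25 | 0.30 | 1.00 | 1.00 | 0 | 0 | 0 |
| poor | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.00 | 1.00 | 1.00 |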
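Table 4. The worst prior distribution $\nu$ of the parameter $\theta$ for Example 2

| $\theta_i$ | 0.95 | 0.85 | 0.75 | 0.65 | 0.55 | 0.45 | 0.35 | 0.25 | 0.15 | 0.05 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\nu_i$ | 0.00 | 0.11 | 0.01 | 0.04 | 0.25 | 0.19 | 0.15 | 0.26 | 0.00 | 0.00 |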
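Table 5. Components of the randomized decision function $\mu$ for Example 3

| Decision | $\mu_{10}$ | $\mu_9$ | $\mu_8$ | $\mu_7$ | $\mu_6$ | $\mu_5$ | $\mu_4$ | $\mu_3$ | $\mu_2$ | $\mu_1$ | $\mu_0$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.95 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.85 | 1.00 | 0.56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.75 | 0 | 0.41 | 0.89 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.65 | 0 | 0.03 | 0.11 | 0.95 | 0.05 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0.55 | 0 | 0 | 0 | 0.05 | 0.92 | 0.52 | 0 | 0 | 0 | 0 | 0 |
| 0.45 | 0 | 0 | 0 | 0 | 0.03 | 0.48 | 0.99 | 0.03 | 0 | 0 | 0 |
| 0.35 | 0 | 0 | 0 | 0 | 0 | 0 | 0.01 | 0.96 | 0.10 | 0 | 0 |
| 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.01 | 0.90 | 0.40 | 0 |
| 0.15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.60 | 1.00 |
| 0.05 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |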
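Table 6. The worst prior distribution $\nu$ of the parameter $\theta$ for Example 3

| $\theta_i$ | 0.95 | 0.85 | 0.75 | 0.65 | 0.55 | 0.45 | 0.35 | 0.25 | 0.15 | 0.05 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\nu_i$ | 0.03 | 0.05 | 0.12 | 0.14 | 0.18 | 0.11 | 0.14 | 0.10 | 0.08 | 0.03 |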
Figure 1. Histogram of the number of correct answers (from 0 to 10) for the worst prior distribution and its normal approximation
Conclusions
If a test-taker knows the criteria for test scoring, then he is able to organize his preparation so that the score does not reflect his true knowledge level. Consequently, testing cannot be the sole criterion for the assessment of the knowledge level of students.
The problems considered in the paper are usually solved by statistical methods, for example by constructing confidence intervals. Such methods work well only if the group of examinees is large. The proposed method works equally well for all groups, large and small. However, the mathematical model (the statistical game) is closely tied to the testing procedure (decision making). If the decision set or the payoff function is changed, then the solution of the game (its value and optimal strategies) changes significantly as well.
Although the mathematical models discussed here are quite simple (a small number of test tasks, an artificial family of distributions), for tests with a large number of tasks the results will be the same and the game value will be significantly less than one. At the same time, the Bayesian solution is stable under small deviations from the worst prior distribution.
References
1. Testing and Statistical Games / M. M. Lutsenko. - Abstracts of the Fourth International Conference «Game Theory and Management». - St. Petersburg University. - PP. 115-118.
2. Minimax Confidence Intervals for the Binomial Parameter / M. M. Lutsenko, S. G. Maloshevsky. - Journal of Statistical Planning and Inference 113. - PP. 67-77.
3. Minimax Confidence Intervals for the Parameter of the Hypergeometric Distribution / M. M. Lutsenko, M. A. Ivanov (2000). - Automation and Remote Control 61(7), part 1. - PP. 1125-1132 (Avtomatika i Telemekhanika, 2000, No. 7, PP. 68-76).
4. Handbook of Modern Item Response Theory / Editors Wim J. van der Linden, R. K. Hambleton (1997). - N. Y. : Springer-Verlag. - 510 p.
5. How to Make Achievement Tests and Assessments / N. Gronlund (1993). - 5th edition. - N. Y. : Allyn and Bacon.
6. Can There Be Validity Without Reliability? / P. A. Moss (1994). - Educational Researcher, 23(2). - PP. 5-12.