## Info


While the standard error is known and easily accessible when $M = 2$ [43,45-47], to date when $M > 2$ it is known and easily accessible only under the null hypothesis of randomness [43]. Fleiss described the calculation of the standard error in general when $M > 2$ as 'too complicated for presentation' (reference [43], p. 232) and referred readers to Landis and Koch [48]. Not only is this standard error difficult to access, but it is also not known exactly how accurate it is for small to moderate sample sizes. Part of the problem lies in attempting to obtain a general solution when there are more than two categories (where intraclass kappa may be misleading), and when the number of ratings per patient is itself a variable from patient to patient (which may be problematic). The situation with the 2 x M intraclass kappa is much simpler.

For patient $i$, with probability $p_{i1}$ of a positive rating, the probability that $s$ of the $M$ interchangeable independent ratings will be positive is the binomial probability, say $\mathrm{Bin}(s; p_{i1}, M)$, $s = 0, 1, 2, \ldots, M$. The probability that a randomly sampled subject will have $s$ positive ratings is the expected value of $\mathrm{Bin}(s; p_{i1}, M)$ over the unknown distribution of $p_{i1}$. This involves moments of the $p_{i1}$ distribution up to order $M$. Since $P$ and $\kappa$ involve only the first two moments, the distribution of the number of positive responses is determined by $P$ and $\kappa$ only when $M = 2$. Consequently the quest for a standard error of the 2 x M intraclass sample kappas for $M > 2$ that involves only the parameters $P$ and $\kappa$, that is, only moments up to order 2, is one of those futile quests [49]. One might have many different distributions of $p_{i1}$ that have the same first two moments ($P$ and $\kappa$) but that differ in the higher moments. For each such distribution the sampling distribution of the 2 x M intraclass sample kappa would differ. This fact differentiates the distribution theory of the intraclass kappa for binary data from that of the intraclass correlation coefficient, $\rho$, to which it is closely computationally related, for interchangeable normal variates: in the latter case, the distribution is determined by $\rho$, however large the number of raters, $M$.

In Table II, for example, we present a 'sensitivity/specificity' model and a 'know/guess' model selected to have almost exactly the same $P = 0.10$ and $\kappa = 0.50$, and show the distribution of response for $M = 2, 4, 6$. It can be seen that the population distributions are almost the same for $M = 2$, slightly different for $M = 4$ and very different for $M = 6$. Thus, unless $M = 2$, one would not expect that the distributions of the 2 x M intraclass kappa would be the same in these two cases, much less in all cases with $P = 0.10$ and $\kappa = 0.50$.
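The same point can be checked numerically. The sketch below does not reproduce Table II: it uses two hypothetical discrete distributions of $p_{i1}$ (chosen here purely for illustration, not the 'sensitivity/specificity' or 'know/guess' models) that share $P = 0.10$ and $\kappa = 0.50$ exactly, taking the intraclass kappa to be $\mathrm{var}(p_{i1})/(PP')$, and compares the resulting distributions of the number of positive ratings.

```python
from fractions import Fraction as F
from math import comb

def count_distribution(p_dist, M):
    """P(s positive ratings out of M), s = 0..M: the expectation of
    Bin(s; p, M) over a discrete distribution given as (p, weight) pairs."""
    return [
        sum(w * comb(M, s) * p**s * (1 - p) ** (M - s) for p, w in p_dist)
        for s in range(M + 1)
    ]

def prevalence_and_kappa(p_dist):
    """First moment P and intraclass kappa, assumed to be var(p) / (P * P')."""
    P = sum(w * p for p, w in p_dist)
    var = sum(w * (p - P) ** 2 for p, w in p_dist)
    return P, var / (P * (1 - P))

# Hypothetical distributions (illustrative assumptions, not from Table II).
dist_a = [(F(0), F(9, 11)), (F(11, 20), F(2, 11))]
dist_b = [(F(1, 20), F(18, 19)), (F(1), F(1, 19))]

for name, dist in (("A", dist_a), ("B", dist_b)):
    P, kappa = prevalence_and_kappa(dist)
    print(name, "P =", float(P), "kappa =", float(kappa))

for M in (2, 4, 6):
    da = [round(float(x), 4) for x in count_distribution(dist_a, M)]
    db = [round(float(x), 4) for x in count_distribution(dist_b, M)]
    print(f"M = {M}\n  A: {da}\n  B: {db}")
```

Exact rational arithmetic is used so that the equality of the two distributions at $M = 2$, and their divergence at $M = 4$ and $M = 6$, is not blurred by rounding.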

The vector of observed frequencies of the numbers of positive responses has a multinomial distribution with probabilities determined by the expected values of $\mathrm{Bin}(s; p_{i1}, M)$. Thus one can use the methods derived by Fisher [50] to obtain an approximate (asymptotic) standard error of kappa. An approximate standard error of the sample kappa can also be obtained very easily using jack-knife procedures omitting one patient at a time [45,47,51-53], as shown in Table I. These results correspond closely to those derived in various ways for the 2 x 2 intraclass kappas [43,46,47,54]. (As a 'rule of thumb', the minimum number of patients should exceed both $10/P$ and $10/P'$. When $P = 0.5$, 20 patients are minimal; when $P = 0.01$, no fewer than 1000 patients are needed.) A generalized version of the SAS program (SAS Institute Inc., Cary, NC) that performs the calculations can be located at http://mirecc.stanford.edu
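The leave-one-patient-out jack-knife can be sketched as follows. This is not the SAS program referred to above, and the point estimator used here is an assumption: a Fleiss-type 2 x M intraclass kappa computed from the number of positive ratings per patient, which may differ in detail from the calculation in Table I. The function names and the data are hypothetical.

```python
from math import sqrt

def intraclass_kappa(counts, M):
    """Fleiss-type 2 x M intraclass kappa from per-patient counts of
    positive ratings (an assumed estimator; Table I may differ)."""
    N = len(counts)
    p_bar = sum(counts) / (N * M)
    disagreement = sum(s * (M - s) for s in counts)
    return 1 - disagreement / (N * M * (M - 1) * p_bar * (1 - p_bar))

def jackknife(counts, M, estimator=intraclass_kappa):
    """Jack-knife estimate and standard error, omitting one patient at a time."""
    N = len(counts)
    full = estimator(counts, M)
    # Pseudo-values: N * (full estimate) - (N - 1) * (leave-one-out estimate).
    pseudo = [
        N * full - (N - 1) * estimator(counts[:i] + counts[i + 1:], M)
        for i in range(N)
    ]
    estimate = sum(pseudo) / N
    variance = sum((v - estimate) ** 2 for v in pseudo) / (N - 1)
    return estimate, sqrt(variance / N)

# Hypothetical data: positive ratings out of M = 2 for 12 patients.
counts = [0, 0, 0, 1, 2, 0, 0, 2, 1, 0, 0, 2]
estimate, se = jackknife(counts, M=2)
print(f"kappa = {intraclass_kappa(counts, 2):.3f}, "
      f"jack-knife estimate = {estimate:.3f}, SE = {se:.3f}")
```

The toy sample is far smaller than the rule of thumb allows and serves only to show the mechanics. The estimator is undefined if every rating in a (sub)sample is positive or every rating is negative, so a real implementation would need to guard against that case.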

When there are a variable number of raters per patient, the problem becomes more complicated, since the exact distribution of responses changes as $M$ varies, involving more or fewer unknown moments of the $p_{i1}$ distribution. If the patient's number of ratings is totally independent of his/her $p_{i1}$, one could stratify the patients by the number of ratings, obtain a 2 x 2 intraclass kappa from those with $M = 2$, a 2 x 3 intraclass kappa from those with $M = 3$, etc., and a standard error for each. Since these are independent samples from the same parent population, one could then obtain a weighted average of the kappas and its standard error using standard methods.
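One common choice of 'standard method' for combining independent estimates is inverse-variance weighting; the text does not say which method is intended, so the sketch below is an assumption, and the per-stratum values are hypothetical.

```python
from math import sqrt

def pooled_kappa(strata):
    """Inverse-variance weighted mean of independent kappa estimates.
    `strata` is a list of (kappa, standard_error) pairs, one per value of M."""
    weights = [1 / se**2 for _, se in strata]
    total = sum(weights)
    pooled = sum(w * k for w, (k, _) in zip(weights, strata)) / total
    return pooled, sqrt(1 / total)

# Hypothetical results for the strata with M = 2, 3 and 4 ratings per patient.
strata = [(0.50, 0.10), (0.60, 0.20), (0.40, 0.20)]
pooled, pooled_se = pooled_kappa(strata)
print(f"pooled kappa = {pooled:.3f}, SE = {pooled_se:.3f}")
```

This pooling is meaningful only under the stated condition that the number of ratings is unrelated to $p_{i1}$.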

However, often the variation of the number of ratings is related to $p_{i1}$. Patients with more serious illnesses, for example, are more likely to have a positive diagnosis and less likely to provide the greater number of ratings. In that case, the subsamples of patients with 2, 3, 4, ... ratings may represent different populations and thus have different reliabilities that should not be muddled. This raises some serious questions about the practical application of the standard error derived by Landis and Koch [48] or any solution in which the number of ratings is variable.

To summarize, for the purpose of measuring reliability of a binary measure, the 2 x M ($M \ge 2$) intraclass kappa is highly recommended, but the use of the K x M kappa for $K > 2$ is questionable. To this it should be added that useful standards have been suggested for evaluation of the 2 x M kappa as a measure of reliability [24], with $\kappa \le 0.2$ considered slight, $0.2 < \kappa \le 0.4$ fair, $0.4 < \kappa \le 0.6$ moderate, $0.6 < \kappa \le 0.8$ substantial and $\kappa > 0.8$ almost perfect. It is important to realize that a kappa coefficient below 0.2 is slight, no matter what the p-value is of a test of the null hypothesis of randomness. Moreover, a kappa coefficient above 0.6 that is not 'statistically significant' on such a test indicates inadequate sample size, not a definitive conclusion about the reliability of the measure. It is the magnitude of $\kappa$ that matters, and how precisely it is estimated, not the p-value of a test of the null hypothesis of randomness [55].
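The suggested standards amount to a simple lookup. A minimal sketch follows (the function name is hypothetical); note that it deliberately takes no p-value as input, in keeping with the point that the magnitude of kappa is what matters.

```python
def interpret_kappa(kappa):
    """Map a 2 x M kappa to the descriptive bands quoted from reference [24]."""
    if kappa <= 0.2:
        return "slight"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "almost perfect"

for value in (0.15, 0.35, 0.55, 0.75, 0.90):
    print(value, interpret_kappa(value))
```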

## 3. VALIDITY OF CATEGORICAL MEASURES: THE K x M WEIGHTED KAPPAS

The validity of a measure is defined as the proportion of the observed variance that reflects variance in the construct the measure was intended to measure [36,38], and is thus always no greater than the reliability of a measure. Validity is generally assessed by a correlation coefficient between a criterion or 'gold standard' ($X$) and the measure ($Y$) for each patient in a representative sample from the population to which the results are to be generalized. (Once again, a measure might be more valid in one population than in another.) If a measure is completely valid against a criterion, there should be a 1:1 mapping of the values of $Y_i$ onto the values of $X_i$. With categorical measures, the hope is to be able to base clinical or research decisions on $Y_i$ that would be the same as if those decisions were based on the 'gold standard' $X_i$. That would require not only that the number of categories of $Y_i$ match the number of categories of $X_i$, but that the labels be the same.

The 'gold standard' is the major source of difficulty in assessing validity, for there are very few true 'gold standards' available. Instead, many 'more-or-less gold standards' are considered, each somewhat flawed, but each of which provides some degree of challenge to the validity of the measure. Thus, as in the case of reliability, there are many types of validity, depending on how the 'gold standard' is selected: face validity; convergent validity; discriminative validity; predictive validity; construct validity.

While there are many problems in medical research that follow this paradigm, few of which are actually labelled 'validity' studies, we will for the moment focus on medical test evaluation. In medical test evaluation, one has a 'gold standard' evaluation of the presence/absence or type of disease, usually the best possible determination currently in existence, against which a test is assessed. To be of clinical and policy importance the test result for each patient should correspond closely to the results of the 'gold standard', for treatment decisions for patients are to be based on that result.

### 3.1. A 2 x 2 weighted kappa coefficient

Once again the most common situation is with two ratings per patient, say $X_i$ and $Y_i$, each having only two categories of response. We use different designations here for the two ratings, $X_i$ and $Y_i$, in order to emphasize that the decision processes underlying the 'gold standard' ($X_i$) and the diagnosis under evaluation ($Y_i$) are, by definition, not the same. For the same reason, we focus on the probability of a positive result (category 1) in each case, with probability $p_{i1}$ for $X_i$ and $q_{i1}$ for $Y_i$, using different notation for the probabilities.

The distributions of $p_{i1}$ and $q_{i1}$ in the population of patients may be totally different, even if $P = E(p_{i1})$ and $Q = E(q_{i1})$ are equal. The equality of $P$ and $Q$ cannot be used to justify the use of the intraclass kappa in this situation, for the intraclass kappa is appropriate only to the situation in which all the moments, not just the first, are equal (interchangeable variables).

Since $X_i$ and $Y_i$ are 'blinded' to each other, the probability that for patient $i$ both $X_i$ and $Y_i$ are positive is $p_{i1} q_{i1}$. Thus in the population, the probability that a randomly selected patient has both $X_i$ and $Y_i$ positive is $E(p_{i1} q_{i1}) = PQ + \rho\sigma_p\sigma_q$, where $P = E(p_{i1})$, $Q = E(q_{i1})$, $\rho$ is the product moment correlation coefficient between $p_{i1}$ and $q_{i1}$, $\sigma_p^2 = \mathrm{variance}(p_{i1})$ and $\sigma_q^2 = \mathrm{variance}(q_{i1})$. All the probabilities similarly computed are presented in Table III.

It can be seen in Table III that the association between $X_i$ and $Y_i$ becomes stronger as $\rho\sigma_p\sigma_q$ increases from zero. At zero, the results in the table are consistent with random decision making. Any function of $\rho\sigma_p\sigma_q$, $P$ and $Q$ that is strictly monotonic in $\rho\sigma_p\sigma_q$, that takes on the value zero when $\rho = 0$, the value $+1$ when the probabilities on the cross diagonal are both 0, and the value $-1$ when the probabilities on the main diagonal are both 0, is a type of correlation coefficient between $X$ and $Y$. The difficulty is that there are an infinite number of such functions (some of the most common defined in Table III), and therefore an infinite number of correlation coefficients that yield results not necessarily concordant with each other.

Table III. The 2 x 2 weighted kappa: probabilities and weights. Definitions of some common measures used in medical test evaluation or in risk assessment.

Probabilities:

| | $Y = 1$ | $Y = 0$ | Total |
|---|---|---|---|
| $X = 1$ | $a = PQ + \rho\sigma_p\sigma_q$ | $b = PQ' - \rho\sigma_p\sigma_q$ | $P$ |
| $X = 0$ | $c = P'Q - \rho\sigma_p\sigma_q$ | $d = P'Q' + \rho\sigma_p\sigma_q$ | $P'$ |
| Total | $Q$ | $Q'$ | 1 |

Weights indicating loss or regret ($0 < r < 1$, $r' = 1 - r$):

| | $Y = 1$ | $Y = 0$ |
|---|---|---|
| $X = 1$ | 0 | $r$ |
| $X = 0$ | $r'$ | 0 |

$\kappa(r) = (ad - bc)/(PQ'r + P'Qr') = \rho\sigma_p\sigma_q/(PQ'r + P'Qr')$, $(0 < r < 1)$.

$\kappa(1/2) = 2(ad - bc)/(PQ' + P'Q) = (p_0 - p_c)/(1 - p_c)$, $(p_0 = a + d,\ p_c = PQ + P'Q')$.

Predictive value of a positive test: $\mathrm{PVP} = a/Q = P + P'\kappa(0)$.

Predictive value of a negative test: $\mathrm{PVN} = d/Q' = P' + P\kappa(1)$.

Attributable risk $= \kappa(0)$.

Odds ratio $= ad/bc = (\mathrm{Se}\,\mathrm{Sp})/(\mathrm{Se}'\,\mathrm{Sp}') = (\mathrm{PVP}\,\mathrm{PVN})/(\mathrm{PVP}'\,\mathrm{PVN}')$.

There is one such correlation coefficient, a certain 2 x 2 weighted kappa, unique because it is based on an acknowledgement that the clinical consequences of a false negative ($X_i$ positive, $Y_i$ negative) may be quite different from the clinical consequences of a false positive ($X_i$ negative, $Y_i$ positive) [47]. For example, a false negative medical test might delay or prevent a patient from obtaining needed treatment in timely fashion. If the test were to fail to detect the common cold, that might not matter a great deal, but if the test were to fail to detect a rapidly progressing cancer, that might be fatal. Similarly a false positive medical test may result in unnecessary treatment for the patient. If the treatment involved taking two aspirin and calling in the morning, that might not matter a great deal, but if it involved radiation, chemotherapy or surgical treatment, that might cause severe stress, pain, costs and possible iatrogenic damage, even death, to the patient. The balance between the two types of errors shifts depending on the population, the disorder and the medical sequelae of a positive and negative test. This weighted kappa coefficient is unique among the many 2 x 2 correlation coefficients in that in each context of its use, it requires that this balance be explicitly assessed a priori and incorporated into the parameter.

For this particular weighted kappa, a weight indicating the clinical cost of each error is attributed to each outcome (see Table III); an index $r$ is set that ranges from 0 to 1 indicating the relative importance of false negatives to false positives. When $r = 1$, one is primarily concerned with false negatives (as with a screening test); when $r = 0$, one is primarily concerned with false positives (as with a definitive test); when $r = 1/2$, one is equally concerned with both (as with a discrimination test). The definition of $\kappa(r)$ in this case [47, 56] is

$$\kappa(r) = \frac{\rho\sigma_p\sigma_q}{PQ'r + P'Qr'}, \qquad P' = 1 - P,\ Q' = 1 - Q,\ r' = 1 - r.$$

The sample estimator is $k(r) = (ad - bc)/(PQ'r + P'Qr')$, where $a, b, c, d$ are the proportions of the sample in the cells so marked in Table III, and $P$ and $Q$ are estimated by the sample proportions. Cohen's kappa [40], often called the 'unweighted' kappa, is

$$\kappa(1/2) = \frac{p_0 - p_c}{1 - p_c},$$

where $p_0$ again is the proportion of agreement, and here $p_c = PQ + P'Q'$, once again a PACC (see Table III for a summary of definitions). When papers or programs refer to 'the' kappa coefficient, they are almost inevitably referring to $\kappa(1/2)$, but it must be recognized that $\kappa(1/2)$ reflects a decision (conscious or unconscious) that false negatives and false positives are equally clinically undesirable, and $\kappa(r)$ equals PACC only when $r = 1/2$.
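As an illustration, the sample estimator can be computed directly from the four cell proportions. In this sketch the cell labelling ($a$ = both positive, $b$ = $X$ positive and $Y$ negative, $c$ = $X$ negative and $Y$ positive, $d$ = both negative) is an assumption about the layout of Table III, and the proportions are hypothetical.

```python
from fractions import Fraction as F

def weighted_kappa(a, b, c, d, r):
    """Sample weighted kappa k(r) = (ad - bc) / (P Q' r + P' Q r').
    Assumed cells: a = (X+, Y+), b = (X+, Y-), c = (X-, Y+), d = (X-, Y-)."""
    P, Q = a + b, a + c
    return (a * d - b * c) / (P * (1 - Q) * r + (1 - P) * Q * (1 - r))

def cohen_kappa(a, b, c, d):
    """Unweighted kappa in its usual form (p0 - pc) / (1 - pc)."""
    P, Q = a + b, a + c
    p0 = a + d
    pc = P * Q + (1 - P) * (1 - Q)
    return (p0 - pc) / (1 - pc)

# Hypothetical cell proportions (they sum to 1).
a, b, c, d = F(3, 10), F(1, 10), F(2, 10), F(4, 10)

for r in (F(0), F(1, 2), F(1)):
    print(f"k({r}) = {weighted_kappa(a, b, c, d, r)}")
print("Cohen's kappa =", cohen_kappa(a, b, c, d))
```

With these proportions $P \ne Q$, so the three values of $k(r)$ differ, and only $k(1/2)$ coincides with Cohen's kappa.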

Different researchers are familiar with different measures of 2 x 2 association, and not all readers will be familiar with all the following. However, it is important to note the strong interrelationships among the many measures of 2 x 2 association. Risk difference (Youden's index) is $\kappa(Q')$, and attributable risk is $\kappa(0)$, reflecting quite different decisions about the relative importance of false positives and negatives. The phi coefficient is the geometric mean of $\kappa(0)$ and $\kappa(1)$: $(\kappa(0)\kappa(1))^{1/2}$. Sensitivity and predictive value of a negative test, rescaled to equal 0 for random decision making and 1 when there are no errors, equal $\kappa(1)$. The specificity and predictive value of a positive test, similarly rescaled, equal $\kappa(0)$. For any $r$ between 0 and 1, $\kappa(r)/\max\kappa(r)$ and $\phi/\max\phi$ [57], where $\max\kappa(r)$ and $\max\phi$ are the maximal achievable values of $\kappa(r)$ and $\phi$, respectively, equal either $\kappa(0)$ or $\kappa(1)$, depending on whether $P$ is greater or less than $Q$. This briefly demonstrates that most of the common measures of 2 x 2 association either (i) equal $\kappa(r)$ for some value of $r$, or (ii) when rescaled, equal $\kappa(r)$ for some value of $r$, or (iii) equal some combination of the $\kappa(r)$. Odds ratio and measures of association closely related to odds ratio seem the notable exceptions.
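Several of these identities can be verified on a single hypothetical 2 x 2 table. The sketch below assumes that 'rescaled' means (value minus its value under random decision making) divided by (1 minus that chance value), and it assumes the cell labelling $a$ = both positive, $b$ = $X$ positive and $Y$ negative, $c$ = $X$ negative and $Y$ positive, $d$ = both negative.

```python
from fractions import Fraction as F

def weighted_kappa(a, b, c, d, r):
    """k(r) = (ad - bc) / (P Q' r + P' Q r') from cell proportions."""
    P, Q = a + b, a + c
    return (a * d - b * c) / (P * (1 - Q) * r + (1 - P) * Q * (1 - r))

# Hypothetical cell proportions (they sum to 1).
a, b, c, d = F(3, 10), F(1, 10), F(2, 10), F(4, 10)
P, Q = a + b, a + c
k0 = weighted_kappa(a, b, c, d, F(0))
k1 = weighted_kappa(a, b, c, d, F(1))

sens, spec = a / P, d / (1 - P)   # sensitivity, specificity
pvp, pvn = a / Q, d / (1 - Q)     # predictive values
phi_squared = (a * d - b * c) ** 2 / (P * (1 - P) * Q * (1 - Q))

checks = {
    "rescaled sensitivity = k(1)": (sens - Q) / (1 - Q) == k1,
    "rescaled PVN = k(1)": (pvn - (1 - P)) / P == k1,
    "rescaled specificity = k(0)": (spec - (1 - Q)) / Q == k0,
    "rescaled PVP = k(0)": (pvp - P) / (1 - P) == k0,
    "phi^2 = k(0) * k(1)": phi_squared == k0 * k1,
}
for label, ok in checks.items():
    print(label, "->", ok)
```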

Researchers sometimes see the necessity of deciding a priori on the relative clinical importance of false negatives versus false positives as a problem with $\kappa(r)$, since other measures of 2 x 2 association do not seem to require any such a priori declaration. In fact, the opposite is true. It has been demonstrated [58] that every measure of 2 x 2 association has implicit in its definition some weighting of the relative importance of false positives and false negatives, often unknown to the user. The unique value of this weighted kappa as a measure of validity is that it explicitly incorporates the relative importance of false positives and false negatives, whereas users of other 2 x 2 measures of association make that same choice by choosing one measure rather than another, and often do so unaware of the choice they have de facto made. If they are unaware of the choice, that is indeed a problem, for there is risk of misleading clinical and policy decisions in the context in which the user applies it [58].

However, unlike the situation with reliability, it cannot be argued that $\kappa(r)$, in any sense, defines validity, for the appropriate choice of a validity measure depends on what the user stipulates as the relative importance of false positives and false negatives. How these are weighted may indicate a choice of index not directly related to any $\kappa(r)$ (the odds ratio, for example).

It is important to note how the relative clinical importance ($r$) and the reliabilities of $X$ and $Y$ (the intraclass $\kappa_X$ and $\kappa_Y$ defined above for $X$ and $Y$) influence the magnitude of $\kappa(r)$:

$$\kappa(r) = \frac{\rho\,(\kappa_X \kappa_Y)^{1/2}\,(PP'QQ')^{1/2}}{PQ'r + P'Qr'}, \qquad P' = 1 - P,\ Q' = 1 - Q,\ r' = 1 - r.$$

Here, as defined above, $\rho$ is the correlation between $p_{i1}$ and $q_{i1}$ (which does not change with $r$), and $\kappa_X$ and $\kappa_Y$ are the test-retest reliabilities of $X$ and $Y$ (which do not depend on $r$). As is always expected of a properly defined reliability coefficient, the correlation between $X$ and $Y$ reflected in $\kappa(r)$ suffers attenuation due to the unreliabilities of $X$ and $Y$, here measured by the intraclass kappas $\kappa_X$ and $\kappa_Y$. Only the relationship between $P$ and $Q$ affects $\kappa(r)$ differently for different values of $r$. When $P = Q$, $\kappa(r)$ is the same for all values of $r$ and estimates the same population parameter as does the intraclass kappa, although the distribution of the sample intraclass kappa is not exactly the same as that of the sample weighted kappa. For that matter, when $P = Q$, the sampling distributions of $k(r)$ for different values of $r$ are not all the same, even though all estimate the same parameter. Otherwise, in effect, too many positive tests ($Q > P$) are penalized by $\kappa(r)$ when false positives are of more concern ($r$ nearer 0), and too many negative tests ($Q < P$) are penalized by $\kappa(r)$ when false negatives are of more concern ($r$ nearer 1).
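The behaviour just described can be illustrated by evaluating the formula directly. The parameter values below are hypothetical and were chosen only so that all cell probabilities remain non-negative.

```python
from math import sqrt

def kappa_r(rho, kappa_x, kappa_y, P, Q, r):
    """kappa(r) = rho * sqrt(kappa_X * kappa_Y) * sqrt(P P' Q Q')
                  / (P Q' r + P' Q r')."""
    numerator = rho * sqrt(kappa_x * kappa_y) * sqrt(P * (1 - P) * Q * (1 - Q))
    return numerator / (P * (1 - Q) * r + (1 - P) * Q * (1 - r))

rho, kx, ky = 0.5, 0.6, 0.6  # hypothetical correlation and reliabilities

# P = Q: kappa(r) does not depend on r and equals rho * sqrt(kx * ky).
equal = [kappa_r(rho, kx, ky, 0.3, 0.3, r) for r in (0.0, 0.25, 0.5, 1.0)]
print("P = Q:", [round(v, 4) for v in equal])

# Q > P (too many positive tests): kappa(r) is smallest when r is near 0.
unequal = [kappa_r(rho, kx, ky, 0.2, 0.4, r) for r in (0.0, 0.5, 1.0)]
print("Q > P:", [round(v, 4) for v in unequal])
```

When $P = Q$ the denominator reduces to $PP'$, which cancels against the numerator; that is why the weighting index $r$ drops out.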

A major source of confusion in the statistical literature related to kappa is the assignment of weights [13]. Here we have chosen to use weights that indicate loss or regret, with zero loss for agreements. Fleiss [43] used weights that indicate gain or benefit, with maximal weights of 1 for agreements. Here we propose that false positives and false negatives may have different weights. Fleiss required that they be the same. Both approaches are viable for different medical research problems, as indeed are many other sets of weights, including sets that assign different weights to the two types of agreements.

If the weights reflect losses or regrets, $\kappa(r) = (E_c(r) - E_o(r))/(E_c(r) - \min)$, while if the weights reflect gains or benefits, $\kappa(r) = (E_o(r) - E_c(r))/(\max - E_c(r))$, where $E_c(r)$ is the expected weight when $\rho = 0$ and $E_o(r)$ the expected weight with the observed probabilities. The scaling factor $\min$ is the ideal minimal value of $E_o(r)$ when losses are considered, and $\max$ is the ideal maximal value of $E_o(r)$ when gains are considered, for the particular research question. Here $\min$ is 0, where there are no disagreements; Fleiss' $\max$ is 1, also when there are no disagreements. Regardless of the weight assigned to disagreements in Fleiss' version of kappa, his weighted kappas in the 2 x 2 situation all correspond to what is here defined as $\kappa(1/2)$, while if $P$ and $Q$ are unequal, here $\kappa(r)$ changes with $r$, and generally equals $\kappa(1/2)$ only when $r = 1/2$.
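Both formulations can be written as a single expected-weight calculation. In the sketch below the cell order (both positive; $X$ positive and $Y$ negative; $X$ negative and $Y$ positive; both negative) and the cell proportions are assumptions made for illustration, and the Fleiss-style weights are represented by one common weight for the two kinds of disagreement.

```python
from fractions import Fraction as F

def expected_weight(cells, weights):
    """Expected weight: sum over the four cells of probability * weight."""
    return sum(p * w for p, w in zip(cells, weights))

def chance_cells(a, b, c, d):
    """Cell probabilities under random decision making (rho = 0)."""
    P, Q = a + b, a + c
    return (P * Q, P * (1 - Q), (1 - P) * Q, (1 - P) * (1 - Q))

def kappa_loss(cells, r):
    """Loss weights: 0 for agreements, r for (X+, Y-), 1 - r for (X-, Y+); min = 0."""
    w = (0, r, 1 - r, 0)
    Eo = expected_weight(cells, w)
    Ec = expected_weight(chance_cells(*cells), w)
    return (Ec - Eo) / (Ec - 0)

def kappa_gain(cells, w_disagree):
    """Gain weights: 1 for agreements, one common weight (< 1) for both
    kinds of disagreement; max = 1."""
    w = (1, w_disagree, w_disagree, 1)
    Eo = expected_weight(cells, w)
    Ec = expected_weight(chance_cells(*cells), w)
    return (Eo - Ec) / (1 - Ec)

cells = (F(3, 10), F(1, 10), F(2, 10), F(4, 10))  # hypothetical a, b, c, d

print("loss form:", [kappa_loss(cells, r) for r in (F(0), F(1, 2), F(1))])
print("gain form:", [kappa_gain(cells, w) for w in (F(0), F(1, 3), F(1, 2))])
```

In this example the loss form changes with $r$ because $P \ne Q$, whereas the gain form returns the same value whatever common weight is given to the disagreements, matching the loss form at $r = 1/2$.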

How the weights, $\min$ and $\max$, are assigned changes the sampling distribution of $k(r)$, which may be one of the reasons finding its correct standard error has been so problematic. Since the weights should be dictated by the nature of the medical research question, they should and will change from one situation to another. It is not possible to present a formula for the standard error that would be correct for all possible future formulations of the weights. For the particular weights used here (Table III) the Fisher procedure [50] could be used to obtain an asymptotic standard error. However, given the difficulties engendered by the wide choice of weights, and the fact that the jack-knife estimator is both easier to obtain and apparently about as accurate [54] when sample size is adequate, we recommend the jack-knife estimator instead. That would guarantee that the estimate of the standard error is accurate for the specific set of weights selected and would avoid further errors.

### 3.2. The K x 2 multi-category kappa

In the validity context, as noted above, if the 'gold standard' has K categories, any candidate valid measure must also have K categories with the same labels. Thus, for example,

Table IV. Example: the joint probability distribution of a three-category $X$ and a three-category $Y$, with one perfectly valid category ($Y = 1$ for $X = 1$) and two invalid categories ($Y = 2$ for $X = 2$, and $Y = 3$ for $X = 3$) because of an interchange of $Y = 2$ and $Y = 3$ ($P_1 + P_2 + P_3 = 1$).