## Introduction

'Many human endeavors have been cursed with repeated failures before final success is achieved. The scaling of Mount Everest is one example. The discovery of the Northwest Passage is a second. The derivation of a correct standard error for kappa is a third'. This wry comment by Fleiss et al. in 1979 [1] continues to characterize the situation with regard to the kappa coefficients up to the year 2001, including not only derivation of correct standard errors, but also the formulation, interpretation and application of kappas.

* Correspondence to: Helena Chmura Kraemer, Department of Psychiatry and Behavioral Sciences, MC 5717, Stanford University, Stanford, CA 94305, U.S.A. E-mail: [email protected]

Contract/grant sponsor: National Institute of Mental Health; contract/grant number: MH40041. Contract/grant sponsor: National Institute of Aging; contract/grant number: AG17824. Contract/grant sponsor: Department of Veterans Affairs Sierra-Pacific MIRECC. Contract/grant sponsor: Medical Research Service of the Department of Veterans Affairs.

Tutorials in Biostatistics Volume 1: Statistical Methods in Clinical Studies Edited by R. B. D'Agostino © 2004 John Wiley & Sons, Ltd. ISBN: 0-470-02365-1

The various kappa coefficients are measures of association or correlation between variables measured at the categorical level. The first formal introductions of kappa were those, more than 40 years ago, by Scott [2] and Cohen [3]. Since then, the types of research questions in medical research that are well addressed with kappas (for example, reliability and validity of diagnosis, risk factor estimation) abound, and such areas of research have become of ever growing interest and importance [4]. Not surprisingly, numerous papers both using and criticizing the various forms of kappas have appeared in the statistical literature, as well as in the psychology, education, epidemiology, psychiatry and other medical literature. It is thus appropriate, despite the many existing 'revisits' of kappas [5-15], to take stock of what kappas are, what they are well-designed or ill-designed to do, and to bring up to date where kappas stand with regard to their applications in medical research.

To set the stage for discussion let us consider five major issues concerning kappas that are often forgotten or misinterpreted in the literature:

1. Kappa has meaning beyond percentage agreement corrected for chance (PACC). Sir Alexander Fleming in 1928 discovered penicillin by noticing that bacteria failed to grow on a mouldy Petri dish. However, in summarizing current knowledge of penicillin and its uses, a mouldy Petri dish is at most a historical curiosity, not of current relevance to knowledge about penicillin. In much the same way, Jacob Cohen discovered kappa by noticing that this statistic represented percentage agreement between categories corrected for chance (PACC). Since then, there has also been much expansion and refinement of our knowledge about kappa, its meaning and its use. Whether to use or not use kappa has very little to do with its relationship to PACC. With regard to kappa, that relationship is a historical curiosity. Just as some scientists study moulds, and others bacteria, to whom penicillin is a side issue, there are scientists specifically interested in percentage agreement. To them whether rescaling it to a kappa is appropriate to its understanding and use is a side issue [16-20]. Consequently there are now two separate and distinct lines of inquiry, sharing historical roots, one concerning use and interpretation of percentage agreement that will not be addressed here, and that concerning use and interpretation of kappa which is here the focus.

2. Kappas were designed to measure correlation between nominal, not ordinal, measures. While the kappas that emerged from consideration of agreement between non-ordered categories can be extended to ordinal measures [21-23], there are better alternatives to kappas for ordered categories. Technically, one can certainly compute kappas with ordered categories, for example, certain, probable, possible and doubtful diagnosis of multiple sclerosis [24], and the documentation of many statistical computer programs (for example, SAS) seems to support this approach, but the interpretation of the results can be misleading. In all that follows, the measures to be considered will be strictly nominal, not ordered categories.

3. Even restricted to non-ordered categories, kappas are meant to be used, not only as descriptive statistics, but as a basis of statistical inference. RBI or batting averages in baseball are purely descriptive statistics, not meant to be used as a basis of statistical inference. Once one understands how each is computed, it is a matter of personal preference and subjective judgement which statistic would be preferable in evaluating the performance of batters. In contrast, means, variances, correlation coefficients etc., as they are used in medical research, are descriptive statistics of what is seen in a particular sample, but are also meant to estimate certain clinically meaningful population characteristics, and to be used as a basis of inference from the sample to its population. To be of value to medical research, kappas must do likewise.

Nevertheless, presentations of kappas often do not define any population or any parameter of the population that sample kappas are meant to estimate, and treat kappas purely as descriptive statistics [7]. Then discussions of bias, standard error, or any other such statistical inference procedures from sample to population are compromised. Many of the criticisms of kappas have been based on subjective opinions as to whether kappas are 'fair to the raters' or 'large enough', behave 'as they should', or accord with some personal preference as to what 'chance' means [7,13,25,26]. These kinds of discussions of subjective preferences are appropriate to discussing RBI versus batting average, but not to estimation of a well-defined parameter in a population. We would urge that the sequence of events leading to use of a kappa coefficient should be: (i) to start with an important problem in medical research; (ii) to define the population and the parameter that the problem connotes; (iii) to discuss how (or whether) sample kappa might estimate that parameter, and (iv) to derive its statistical properties in that population. When this procedure is followed, it becomes clear that there is not one kappa coefficient, but many, and that which kappa coefficient is used in which situation is of importance. Moreover, there are many situations in which kappa can be used, but probably should not be.

4. In using kappas as a basis of statistical inference, whether or not kappas are consistent with random decision making is usually of minimal importance. Tests of the null hypothesis of randomness (for example, chi-square contingency table analyses) are well established and do not require kappa coefficients for implementation. Kappas are designed as effect sizes indicating the degree or strength of association. Thus bias of the sample kappas (relative to their population values), their standard errors (in non-random conditions), computation of confidence intervals, tests of homogeneity etc. are the statistical issues of importance [27-30]. However, because of overemphasis on testing null hypotheses of randomness, much of the kappa literature that deals with statistical inference focuses not on kappa as an effect size, but on testing whether kappas are random or not. In this discussion no particular emphasis will be placed on the properties of kappas under the assumption of randomness.

5. The use of kappas in statistical inference does not depend on any distributional assumptions on the process underlying the generation of the classifications. However, many presentations impose such restricting assumptions on the distribution of the patients' classification probabilities, $p_i$, assumptions that may not well represent what is actually occurring in the population.

The population model for a nominal rating is as follows. Patients in a population are indexed by $i$, $i = 1, 2, 3, \ldots$. A single rating of a patient is a classification of patient $i$ into one of $K$ ($K > 1$) mutually exclusive and exhaustive non-ordered categories and is represented by a $K$-dimensional vector $X_i = (X_{i1}, X_{i2}, \ldots, X_{iK})$, where $X_{ij} = 1$ if patient $i$ is classified into category $j$, and all other entries equal 0. For each patient, there might be $M$ ($M > 1$) such ratings, each done blinded to all the others. Thus any correlation between the ratings arises from correlation within the patients and not because of the influence of one rater or rating on another. The probability that patient $i$ ($i = 1, 2, \ldots$) is classified into category $j$ ($j = 1, 2, \ldots, K$) is denoted $p_{ij}$, and $p_i$ is the $K$-dimensional vector $(p_{i1}, p_{i2}, \ldots, p_{iK})$ with non-negative entries summing to 1. In a particular population of which patient $i$ is a member, $p_i$ has some, usually unknown, distribution over the $(K - 1)$-dimensional unit cube.
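As a rough illustration of this population model, the following Python sketch draws blinded ratings for a single patient. The function names `draw_rating` and `simulate_patient`, the example probability vector and the seed are assumptions made only for this illustration; they do not come from the original presentation.

```python
import random

def draw_rating(p_i, rng):
    """Classify one patient into one of K categories; return the one-hot vector X_i."""
    k = len(p_i)
    j = rng.choices(range(k), weights=p_i, k=1)[0]
    return [1 if c == j else 0 for c in range(k)]

def simulate_patient(p_i, m, rng):
    """Return M ratings of the same patient, drawn independently given p_i (blinded)."""
    return [draw_rating(p_i, rng) for _ in range(m)]

rng = random.Random(0)            # fixed seed, for reproducibility only
p_i = [0.7, 0.2, 0.1]             # hypothetical classification probabilities, K = 3
ratings = simulate_patient(p_i, m=4, rng=rng)
print(ratings)
```

Because each rating depends only on the patient's own probability vector, any agreement among the simulated ratings reflects the patient and not one rating's influence on another.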

For example, when there are two categories ($K = 2$), such as a diagnosis of disease positive or negative, one common assumption is that the probability that a patient actually has the disease is $\pi$, and that if s/he has the disease, there is a fixed probability of a positive diagnosis ($X_{i1} = 1$), the sensitivity (Se) of the diagnosis ($p_{i1} = \mathrm{Se}$); if s/he does not have the disease, there is a fixed probability of a negative diagnosis ($X_{i2} = 1$), the specificity (Sp) of the diagnosis ($1 - p_{i1} = \mathrm{Sp}$). This limits the distribution of $p_{i1}$ to two points, Se and $1 - \mathrm{Sp}$ ($p_{i2} = 1 - p_{i1}$): the 'sensitivity/specificity model' [31].

In the same situation, another model suggested has been the 'know/guess' model [25, 32, 33]. In this case, it is assumed that with a certain probability, $\pi_1$, a patient will be known with certainty to have the disease ($p_{i1} = 1$); with a certain probability, $\pi_0$, a patient will be known with certainty not to have the disease ($p_{i1} = 0$). For these patients, there is no probability of classification error. Finally, with the remaining probability, $1 - \pi_1 - \pi_0$, the diagnosis will be guessed with probability $p_{i1} = \alpha$. This limits the distribution of $p_{i1}$ to three points ($1$, $\alpha$, $0$).
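Both models can be written down as discrete distributions of the probability of a positive rating. The sketch below is only an illustration: the function names and all parameter values are hypothetical, and each distribution is represented as a list of (value, probability) pairs.

```python
def se_sp_model(prevalence, se, sp):
    """Sensitivity/specificity model: p_i1 equals Se (diseased) or 1 - Sp (not diseased)."""
    return [(se, prevalence), (1.0 - sp, 1.0 - prevalence)]

def know_guess_model(prob_known_pos, prob_known_neg, guess_prob):
    """Know/guess model: p_i1 equals 1, 0, or the guessing probability."""
    prob_guess = 1.0 - prob_known_pos - prob_known_neg
    return [(1.0, prob_known_pos), (0.0, prob_known_neg), (guess_prob, prob_guess)]

def mean_p(dist):
    """Expected value of p_i1: the overall probability of a positive rating."""
    return sum(value * prob for value, prob in dist)

# Hypothetical parameter values, chosen only for illustration.
se_sp = se_sp_model(prevalence=0.2, se=0.9, sp=0.8)
know_guess = know_guess_model(prob_known_pos=0.3, prob_known_neg=0.5, guess_prob=0.5)
print(mean_p(se_sp), mean_p(know_guess))
```

The first model puts all of its mass on two points and the second on three, which is the restriction the text describes.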

One can check the fit of any such model by obtaining multiple blinded replicate diagnoses per patient. For these two models, three blinded diagnoses per patient would be required to estimate the three parameters in each model, ($\pi$, Se, Sp) or ($\pi_1$, $\pi_0$, $\alpha$), and at least one additional diagnosis per patient to test the fit of the model. In practice, it is hard to obtain four or more diagnoses per patient for a large enough sample size for adequate power, but in the rare cases where this has been done, such restrictive models are often shown to fit the data poorly [34]. If inferences are based on such limiting distributional assumptions that do not hold in the population, no matter how reasonable those assumptions might seem, or how much they simplify the mathematics, the conclusions drawn on that basis may be misleading. Kappas are based on no such limiting assumptions. Such models merely represent special cases often useful for illustrating certain properties of kappa, or for disproving certain general statements regarding kappa, as they here will be.

## 2. Assessment of Reliability of Nominal Data: The Intraclass Kappa

The reliability of a measure, as technically defined, is the ratio of the variance of the 'true' scores to that of the observed scores, where the 'true' score is the mean over independent replications of the measure [35,36]. Since the reliability of a measure, so defined, indicates how reproducible that measure will be, how attenuated correlations against that measure will be, what loss of power of statistical tests use of that measure will cause, as well as how much error will be introduced into clinical decision making based on that measure [37], this is an important component of the quality of a measure both for research and clinical use. Since one cannot have a valid measure unless the measure has some degree of reliability, demonstration of reliability is viewed as a necessary first step to establishing the quality of a measure [14, 38].

The simplest way to estimate the reliability of a measure is to obtain a representative sample of $N$ patients from the population to which results are to be generalized. (The same measure may have different reliabilities in different populations.) Then $M$ ratings are sampled from the finite or infinite population of ratings/raters to which results are to be generalized, each obtained blinded to every other. Thus the ratings might be $M$ ratings by the same pathologist of tissue slides presented over a period of time in a way that ensures blindness: intra-observer reliability. The ratings might be diagnoses by $M$ randomly selected clinicians from a pool of clinicians all observing the patient at one point in time: inter-observer reliability. The ratings might be observations by randomly selected observers from a pool of observers, each observing the patient at one of $M$ randomly selected time points over a span of time during which the characteristic of the patient being rated is unlikely to change: test-retest reliability. Clearly there are many different types of reliability depending on when, by whom, and how the multiple blinded ratings for each patient are generated. What all these problems have in common is that because of the way ratings are generated, the $M$ successive ratings per patient are 'interchangeable', that is, the process underlying the $M$ successive ratings per patient has the same underlying distribution of $p_i$, whatever that distribution might be [39].

### 2.1. The 2 x 2 intraclass kappa

The simplest and most common reliability assessment with nominal data is that of two ratings ($M = 2$), with two categories ($K = 2$). In that case, we can focus on $X_{i1}$, since $X_{i2} = 1 - X_{i1}$, and on $p_{i1}$, since $p_{i2} = 1 - p_{i1}$. Then $E(X_{i1}) = p_{i1}$, the 'true score' for patient $i$, $E(p_{i1}) = P$, and $\mathrm{variance}(p_{i1}) = \sigma_p^2$. Thus by the classical definition of reliability, the reliability of $X$ is $\mathrm{variance}(p_{i1})/\mathrm{variance}(X_{i1}) = \sigma_p^2/PP'$, where $P' = 1 - P$.

This intraclass kappa, $\kappa$, may also be expressed as

$$\kappa = \frac{p_0 - p_c}{1 - p_c},$$

where $p_0$ is the probability of agreement, and $p_c = P^2 + P'^2$, that is, the PACC, for this has been shown to equal $\sigma_p^2/PP'$ [31]. So accustomed are researchers to estimating the reliability of ordinal or interval level measures with a product-moment, intraclass or rank correlation coefficient, that one frequently sees 'reliability' there defined by the correlation coefficient between test-retest data. In the same sense, for binary data the reliability coefficient is defined by the intraclass kappa.
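The equivalence of the two expressions can be checked numerically for any discrete distribution of the probability of a positive rating. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the distribution is given as (value, probability) pairs, and the example values correspond to a sensitivity/specificity model with Se = 0.9, Sp = 0.8 and prevalence 0.2, chosen only for illustration.

```python
def kappa_variance_ratio(dist):
    """Intraclass kappa as variance(p_i1) / (P * P')."""
    p = sum(v * w for v, w in dist)
    var = sum(w * (v - p) ** 2 for v, w in dist)
    return var / (p * (1.0 - p))

def kappa_pacc(dist):
    """Intraclass kappa as (p0 - pc) / (1 - pc) for two blinded ratings."""
    p = sum(v * w for v, w in dist)
    # Given p_i1, two independent ratings agree with probability p^2 + (1 - p)^2.
    p0 = sum(w * (v ** 2 + (1.0 - v) ** 2) for v, w in dist)
    pc = p ** 2 + (1.0 - p) ** 2
    return (p0 - pc) / (1.0 - pc)

# Hypothetical distribution: p_i1 = 0.9 with probability 0.2, p_i1 = 0.2 with probability 0.8.
dist = [(0.9, 0.2), (0.2, 0.8)]
print(kappa_variance_ratio(dist), kappa_pacc(dist))
```

The two functions agree because the difference between the probability of agreement and its chance value is twice the variance of the true scores, while one minus the chance value is twice the product of P and 1 - P.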

The original introductions of kappa [3, 40] defined not the population parameter, $\kappa$, but the sample estimate, $k$, where the probability of agreement is replaced by the observed proportion of agreement, and $P$ is estimated by the proportion of the classifications that selected category 1. This was proposed as a measure of reliability long before it was demonstrated that it satisfied the classical definition of reliability [31]. Fortunately, the results were consistent. However, that sequence of events spawned part of the problems surrounding kappa, since it opened the door for others to propose various sample statistics as measures of binary reliability, without demonstration of the relationship of their proposed measure with reliability as technically defined. Unless such a statistic estimates the same population parameter as does the intraclass kappa, it is not an estimate of the reliability of a binary measure. However, there are other statistics when $M = 2$ that estimate the same parameter in properly designed reliability studies (random sample from the population of subjects, and a random sample of blinded raters/ratings for each subject), such as all weighted kappas (not the same as an intraclass kappa, as will be seen below), or the sample phi coefficient, the risk difference or the attributable risk. Typically these provide less efficient estimators than does the sample intraclass kappa.
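The sample estimate described here can be sketched in a few lines of Python. The function name, its argument names and the counts are hypothetical; the sketch assumes each of N patients has exactly two blinded ratings and that both categories occur at least once, since otherwise the denominator is zero.

```python
def sample_intraclass_kappa(both_pos, discordant, both_neg):
    """Sample 2 x 2 intraclass kappa from N patients, each rated twice."""
    n = both_pos + discordant + both_neg
    p0 = (both_pos + both_neg) / n                 # observed proportion of agreement
    p_hat = (2 * both_pos + discordant) / (2 * n)  # share of all 2N ratings in category 1
    pc = p_hat ** 2 + (1.0 - p_hat) ** 2           # chance agreement from the pooled proportion
    return (p0 - pc) / (1.0 - pc)

# Hypothetical counts: 20 patients rated positive twice, 10 rated discordantly, 70 rated negative twice.
k = sample_intraclass_kappa(both_pos=20, discordant=10, both_neg=70)
print(round(k, 3))
```

Pooling both ratings to estimate P is what distinguishes this intraclass form from statistics that use a separate marginal proportion for each rater.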

It is useful to note that $\kappa = 0$ indicates either that the heterogeneity of the patients in the population is not well detected by the raters or ratings, or that the patients in the population are homogeneous. Consequently it is well known that it is very difficult to achieve high reliability of any measure (binary or not) in a very homogeneous population ($P$ near 0 or 1 for binary measures). That is not a flaw in kappa [26] or any other measure of reliability, or a paradox. It merely reflects the fact that it is difficult to make clear distinctions between the patients in a population in which those distinctions are very rare or fine. In such populations, 'noise' quickly overwhelms the 'signals'.
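The effect of homogeneity can be illustrated under the sensitivity/specificity model, used here purely as a convenient special case. In the sketch below the function name and the values Se = Sp = 0.9 are assumptions for illustration; the rating procedure is held fixed while the prevalence of the disease shrinks.

```python
def kappa_se_sp(prevalence, se, sp):
    """Population intraclass kappa under the sensitivity/specificity model."""
    p = prevalence * se + (1.0 - prevalence) * (1.0 - sp)   # P = E(p_i1)
    # p_i1 takes two values, Se and 1 - Sp, so its variance has a closed form.
    var = prevalence * (1.0 - prevalence) * (se + sp - 1.0) ** 2
    return var / (p * (1.0 - p))

# Same hypothetical rating procedure, increasingly homogeneous populations.
for prevalence in (0.5, 0.1, 0.01):
    print(prevalence, round(kappa_se_sp(prevalence, se=0.9, sp=0.9), 3))
```

With these illustrative values the reliability falls steadily as the population becomes more homogeneous, even though the sensitivity and specificity never change.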

### 2.2. The K x 2 intraclass kappa

When there are more than two categories ($K > 2$) both $X_i$ and $p_i$ are $K$-dimensional vectors. The classical definition of reliability requires that the covariance matrix of $p_i$, $\Sigma_p$, be compared with the covariance matrix of $X_i$, $\Sigma_X$. The diagonal elements of $\Sigma_p$ are $\kappa_j P_j P_j'$, where $\kappa_j$ is the 2 x 2 intraclass kappa with category $j$ versus 'not-$j$', a pooling of the remaining categories, $P_j = E(p_{ij})$ and $P_j' = 1 - P_j$. The off-diagonal elements are $\rho_{jj^*}\sigma_j\sigma_{j^*}$, $j \neq j^*$, with $\rho_{jj^*}$ the correlation coefficient between $p_{ij}$ and $p_{ij^*}$, and $\sigma_j$ the standard deviation of $p_{ij}$. The diagonal elements of $\Sigma_X$ are $P_j P_j'$, and the off-diagonal elements are $-P_j P_{j^*}$.

What has been proposed as a measure of reliability is the K x 2 intraclass kappa

$$\kappa = \frac{\mathrm{trace}(\Sigma_p)}{\mathrm{trace}(\Sigma_X)} = \frac{\sum_j P_j P_j' \kappa_j}{\sum_j P_j P_j'}.$$

Again it can be demonstrated that this is equivalent to PACC with $p_0$ again the probability of agreement, now with $p_c = \sum_j P_j^2$, so that $1 - p_c = \sum_j P_j P_j'$.

From the above, it is apparent that obtaining a non-zero K x 2 intraclass kappa requires only that one of the $K$ categories have non-zero $\kappa_j$. If that one category has reasonable heterogeneity in the population ($P_j P_j'$ large) and has large enough $\kappa_j$, the K x 2 intraclass kappa may be large.

Consider the special case for $K = 3$, when $p_i = (1, 0, 0)$ with probability $\pi$, and $p_i = (0, 0.5, 0.5)$ with probability $\pi' = 1 - \pi$. In this case category 1 is completely discriminated from categories 2 and 3, but the decisions between 2 and 3 are made randomly. Then $\kappa_1 = 1$, and $\kappa_2 = \kappa_3 = \pi/(\pi + 1)$, and the 3 x 2 intraclass kappa is $3\pi/(3\pi + 1)$. When $\pi = 0.5$, for example, $\kappa = 0.60$, and $\kappa_2 = \kappa_3 = 0.33$, even though categories 2 and 3 are here randomly assigned. Such a large overall $\kappa$ can be mistakenly interpreted as a good reliability for all three categories, where here clearly only category 1 is reliably measured.
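The figures in this special case can be reproduced with a short calculation. The sketch below is illustrative only: the function name and the representation of the population as (probability vector, probability) pairs are assumptions, and the overall coefficient is computed as the ratio of summed variances to summed products of each category mean and its complement.

```python
def category_kappas(dist):
    """Return the per-category 2 x 2 kappas and the overall K x 2 intraclass kappa."""
    k = len(dist[0][0])
    means = [sum(w * p[j] for p, w in dist) for j in range(k)]
    variances = [sum(w * (p[j] - means[j]) ** 2 for p, w in dist) for j in range(k)]
    pq = [m * (1.0 - m) for m in means]
    kappas = [v / d for v, d in zip(variances, pq)]
    overall = sum(variances) / sum(pq)     # ratio of the traces of the two covariance matrices
    return kappas, overall

prob_cat1 = 0.5   # probability that a patient is certain to be classified into category 1
dist = [((1.0, 0.0, 0.0), prob_cat1), ((0.0, 0.5, 0.5), 1.0 - prob_cat1)]
kappas, overall = category_kappas(dist)
print([round(x, 2) for x in kappas], round(overall, 2))
```

Reporting the per-category values next to the overall coefficient makes it plain that only the first category is reliably distinguished.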

No one index, the K x 2 intraclass kappa or any other, clearly indicates the reliability of a multi-category $X$. For categorical data, one must consider not only how distinct each category is from the pooled remaining categories (as reflected in the $\kappa_j$, $j = 1, 2, \ldots, K$), but how easily each category can be confused with each other [13, 41]. Consequently, we would suggest that: (i) multi-category kappas not be used as a measure of reliability with $K > 2$ categories; (ii) seeking any single measure of multi-category reliability is a vain effort; and (iii) at least the $K$ individual category $\kappa_j$'s be reported, but that, better yet, methods be further developed to evaluate the entire misclassification matrix [42]. In particular, the decision to recommend kappa with two categories, but to recommend against kappa with more than two categories, is not influenced by the fact that kappa is related to PACC in both cases.

Table I. Estimation of the 2 x M intraclass correlation coefficient in the Periyakoil et al. data, with $s$ the number of positive (grief) classifications from the $M = 4$ raters, $f_s$ the proportion of items with that number, fa the kappa coefficient based on omitting one subject with $s$ positive classifications, and $w_s$ the weight needed to calculate the asymptotic variance.

| $s$ | $f_s$ | $s/M$ | $1 - s/M$ | fa |
|---|---|---|---|---|
