
$Z_{111} = 28$

to other capture histories. The missing cell $Z_{000} = N - M$ denotes the uncounted. When we add over a sample, the subscript corresponding to that sample is replaced by a '+' sign. For example, $Z_{+11} = Z_{011} + Z_{111}$ and $Z_{++1} = Z_{001} + Z_{011} + Z_{101} + Z_{111}$, and $Z_{+++} = N$. Let $n_j$, $j = 1, 2, \ldots, t$, be the number of individuals listed in sample $j$. For $t = 3$, we have $n_1 = Z_{1++}$, $n_2 = Z_{+1+}$, $n_3 = Z_{++1}$.
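The '+' notation can be sketched in a few lines of Python. The cell counts below are hypothetical and chosen only for illustration (they are not the HAV data), and the helper name `marginal` is an assumption of this sketch.

```python
def marginal(counts, pattern):
    """Sum observed cell counts that match a pattern such as '+11' or '1++'.

    A '+' in a position means "sum over that sample"; a '0' or '1' fixes
    absence or presence in that sample.
    """
    total = 0
    for history, z in counts.items():
        if all(p == "+" or p == h for p, h in zip(pattern, history)):
            total += z
    return total


# Hypothetical observed counts for t = 3 lists (illustrative only).
# The cell '000' is unobserved, so it is absent from the dictionary.
counts = {"100": 30, "010": 25, "001": 20,
          "110": 10, "101": 8, "011": 6, "111": 4}

M = sum(counts.values())           # number of distinct identified individuals
n1 = marginal(counts, "1++")       # Z_{1++}: size of list 1
n2 = marginal(counts, "+1+")       # Z_{+1+}: size of list 2
n3 = marginal(counts, "++1")       # Z_{++1}: size of list 3
z_plus11 = marginal(counts, "+11") # Z_{+11} = Z_{011} + Z_{111}

print(M, n1, n2, n3, z_plus11)
```

Any pattern that admits the history '000' (for example '+++' or '0+0') involves the unobserved cell, so this helper returns only the observed part of such a sum; in particular it returns $M$, not $N$, for '+++'.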

For the HAV data in Table II, there were 63 people listed in the E-list only, 55 people listed in the Q-list only, and 18 people listed in both lists Q and E but not in the P-list. The other records can be interpreted similarly. The purpose here is to estimate the total number of individuals (that is, $N$) who were infected in the outbreak. This is equivalent to predicting the number missed by all three sources (that is, $Z_{000} = N - M$).

In a typical approach in epidemiology, cases in various lists are merged and any duplicate cases are eliminated. That is, the capture histories in Table I and the categories in Table II are ignored in the analysis and only the final merged list is obtained. This typical approach assumes complete ascertainment and does not correct or adjust for under-ascertainment. However, many epidemiological surveillance studies have a non-negligible number of uncounted cases. For example, before the serological screening of all students of that college, epidemiologists suspected that the observed number of cases (271) in Table II considerably undercounted the true number of infected, and an evaluation of the degree of undercount was needed [1, 29].

3.2. Dependence among samples

A crucial assumption in the traditional statistical approach is that the samples are independent. Since individuals can be cross-classified according to their presence or absence in each list, the dependence for any two samples is usually interpreted from a $2 \times 2$ categorical data analysis in human applications. In animal studies, the traditional 'equal-catchability assumption' is even more restrictive: in each fixed sample, all animals, marked and unmarked, have the same capture probability. (The equal-catchability assumption implies independence among samples but the converse is not true; see Section 4.3.) Non-independence or unequal catchabilities may arise from the following two sources:

(i) Local dependence (also called list dependence or local list dependence) within each individual; conditional on any individual, the inclusion in one source has a direct causal effect on his/her inclusion in other sources. That is, the response of a selected individual to one source depends on his/her response to the other sources. For example, the probability that an individual goes to a hospital for treatment depends on his/her result on the serum test for HAV. The ascertainment of the serum sample and that of the hospital sample then become dependent. We remark that 'local independence' has been a fundamental assumption in many statistical methodologies [30].

(ii) Heterogeneity between individuals; even if the two lists are independent within individuals, the ascertainment of the two sources may become dependent if the capture probabilities are heterogeneous among individuals. This phenomenon is similar to Simpson's paradox in categorical data analysis. That is, aggregating two independent 2x2 tables might result in a dependent table. Hook and Regal [31] presented an interesting

These two types of dependencies are usually confounded and cannot be easily disentangled in a data analysis. Lack of independence leads to a bias (called 'correlation bias' in census undercount estimation [32]) for the usual estimator which assumes independence. We use a two-sample animal experiment to explain the direction of the bias. Assume that a first sample of $n_1$ animals is captured and marked. The marked rate in the population is therefore $n_1/N$. A second sample of $n_2$ animals is subsequently drawn and there are $m_2$ (that is, $Z_{11}$ in our notation for grouped data) previously marked. The capture rate for the marked (recapture rate, overlap rate) in the second sample can be estimated by $m_2/n_2$. If the two samples are independent, then the recapture rate should be approximately equal to the marked rate in the population. Therefore we have $m_2/n_2 \approx n_1/N$, which yields an estimate of the population size under independence: $\hat{N} = n_1 n_2/m_2$ (the well-known Petersen estimator or dual-system estimator). However, if the two samples are positively correlated, then those individuals captured in the first sample are more easily captured in the second sample. The recapture rate in the second sample tends to be larger than the marked rate in the population. That is, we would expect that $m_2/n_2 > n_1/N$, which yields $N > n_1 n_2/m_2$. As a result, Petersen's estimator underestimates the true size if the two samples are positively dependent. Conversely, it overestimates for negatively dependent samples. A similar argument is also valid for a general number of samples. That is, a higher (lower) overlap rate is observed for positively (negatively) dependent samples, which implies fewer (more) estimated missing cases. Therefore, a negative (positive) bias exists for any estimator which assumes independence.
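The direction of the bias can be checked numerically with expected counts. The following is a minimal sketch, not a simulation of any real data set: the population size, the group sizes and the capture probabilities are all hypothetical, the two lists are taken to be independent within each individual, and dependence is induced only through heterogeneity between two groups. The function names are assumptions of this sketch.

```python
def petersen(n1, n2, m2):
    """Petersen (dual-system) estimate n1 * n2 / m2 of the population size."""
    if m2 <= 0:
        raise ValueError("the estimate is undefined when there is no overlap")
    return n1 * n2 / m2


def expected_two_list_counts(groups):
    """Expected (n1, n2, m2) for groups given as (size, p1, p2).

    Within each individual the two lists are independent, so the expected
    overlap contributed by a group is size * p1 * p2.
    """
    n1 = sum(size * p1 for size, p1, p2 in groups)
    n2 = sum(size * p2 for size, p1, p2 in groups)
    m2 = sum(size * p1 * p2 for size, p1, p2 in groups)
    return n1, n2, m2


N = 1000  # hypothetical true population size

# Homogeneous population: the independence argument holds exactly.
homogeneous = [(1000, 0.5, 0.5)]

# Positive dependence: half the population is easy to catch in both samples.
positive = [(500, 0.8, 0.8), (500, 0.2, 0.2)]

# Negative dependence: each half is easy to catch in only one sample.
negative = [(500, 0.8, 0.2), (500, 0.2, 0.8)]

for label, groups in [("homogeneous", homogeneous),
                      ("positive", positive),
                      ("negative", negative)]:
    n1, n2, m2 = expected_two_list_counts(groups)
    print(label, "recapture rate:", m2 / n2, "marked rate:", n1 / N,
          "Petersen:", petersen(n1, n2, m2))
```

In all three scenarios the expected list sizes are about 500 each, so the marked rate is about 0.5. Only the overlap changes: the recapture rate exceeds the marked rate in the positive case, giving an estimate below 1000, and falls short of it in the negative case, giving an estimate above 1000.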

When only two lists are available, three cells are observable: people identified in list 1 only, people identified in list 2 only, and people listed in both. However, there are four parameters: $N$, two mean capture probabilities and a dependence measure. The data are insufficient for estimating dependence unless additional covariates are available. All existing methods unavoidably encounter this problem and adopt the independence assumption. This independence assumption has become the main weak point in the use of the capture-recapture method

A variety of models incorporating dependence among samples have been proposed in the literature. We will review three classes of models: ecological models, log-linear models, and the sample coverage approach. The latter two approaches can be used to provide estimates for some ecological models, but they are considered separately because of their different ways

Table III. Two types of capture probabilities for ecological models.

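Model | Multiplicative model in log-linear form | Logistic model
$M_{tbh}$ | $\log(P_{ij}) = \alpha_i + \beta_j + \gamma Y_{ij}$ | $\operatorname{logit}(P_{ij}) = \alpha_i + \beta_j + \gamma Y_{ij}$
$M_{bh}$ | $\log(P_{ij}) = \alpha_i + \gamma Y_{ij}$ | $\operatorname{logit}(P_{ij}) = \alpha_i + \gamma Y_{ij}$
$M_{tb}$ | | $\operatorname{logit}(P_{ij}) = \beta_j + \gamma Y_{ij}$
$M_{th}$ | $\log(P_{ij}) = \alpha_i + \beta_j$ | $\operatorname{logit}(P_{ij}) = \alpha_i + \beta_j$ (Rasch model)
$M_{h}$ | $\log(P_{ij}) = \alpha_i$ |
$M_{b}$ | $\log(P_{ij}) = \alpha + \gamma Y_{ij}$ $(\alpha_i = \alpha)$ |
$M_{t}$ | $\log(P_{ij}) = \beta_j$ |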
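The two columns of Table III differ only in the link applied to the same linear predictor. A minimal sketch of how a capture probability would be computed under each form is given below. It assumes that $Y_{ij}$ is an indicator taking the value 1 if individual $i$ was captured before sample $j$ and 0 otherwise; the function names and the parameter values are hypothetical.

```python
import math


def p_multiplicative(alpha_i, beta_j, gamma, y_ij):
    """Capture probability when log(P_ij) = alpha_i + beta_j + gamma * Y_ij."""
    p = math.exp(alpha_i + beta_j + gamma * y_ij)
    if p > 1.0:
        # The log form does not constrain P_ij to [0, 1] by itself.
        raise ValueError("parameters give a probability greater than 1")
    return p


def p_logistic(alpha_i, beta_j, gamma, y_ij):
    """Capture probability when logit(P_ij) = alpha_i + beta_j + gamma * Y_ij."""
    eta = alpha_i + beta_j + gamma * y_ij
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical parameter values, for illustration only.
alpha_i = math.log(0.2)   # individual effect
beta_j = 0.0              # sample (time) effect
gamma = math.log(2.0)     # effect of a previous capture

print(p_multiplicative(alpha_i, beta_j, gamma, 0))  # not previously captured
print(p_multiplicative(alpha_i, beta_j, gamma, 1))  # previously captured
print(p_logistic(alpha_i, beta_j, gamma, 1))
```

Sub-models follow by dropping terms: setting `gamma = 0` removes the dependence on previous capture (in the logistic form this is the Rasch model entry of the table), and using a common value for `alpha_i` removes heterogeneity between individuals. In the multiplicative form, $e^{\gamma}$ acts as a factor by which the capture probability changes after a first capture, whereas in the logistic form it multiplies the odds.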