Covrml XXr1 xjy fti y X XTX

where the N x K matrix D (N being the overall length of the data vector y and K the length of the parameter vector P) contains the derivatives dE(yij)/d[ik, and V = cov(y) is the covariance matrix of the yi/s, which contains known functions of E(yi}) and possibly other unknown parameters. Note that, for our logistic regression model, E(yij) = ni} = (1 + e~XJ/!)"1, where Xy denotes the column vector of covariate values xijk. These estimating equations are the weighted least squares 'normal equations' that result from minimizing (y — £(y))F_1(y — E(y)) when V contains known constants. Where V is functionally dependent on E(ytj) (that is, on P) and possibly on other parameters, the solution of these equations has been shown to have a number of attractive properties.10,16 The appeal of the quasi-likelihood approach is that it requires only the specification of second-order structure, in the form of the variance V, but does not demand a full

For our logistic regression specification, the estimating equations (7) become

Clearly if V = A the equations reduce to the independence ML equations. More generally, the numerical solution of the 'score-type' estimating equations (7) is readily accomplished using the same iteratively reweighted least squares algorithm (IRLS) used for the likelihood equations.10

In the context of repeated measurements (clustered or longitudinal) on i = 1,..., n individuals, a generalization of quasi-likelihood that has become widely used is the generalized estimating equations (GEE) method.5,17 The key to GEE is the representation of V in the estimating equations (7) as a block diagonal matrix consisting of n submatrices Vt having the form where A( is the diagonal submatrix of A corresponding to individual i and R{n) is termed a 'working' correlation matrix and is a function of further unknown parameters a. The role of the working correlation matrix is to provide a guess at the true marginal covariance matrix F(y) and it turns out that as long as the robust information-sandwich method is used for standard errors, the GEE method works well in large samples even if R(a) is misspecified. Incorporating a guess at the correlation structure increases the efficiency of estimation of p (that is, the variance of the estimate should be smaller). If we set R(<x) = I, the identity matrix, then once again we recover the ML estimates. More generally, for an initial value of a, IRLS provides an initial estimate of P by solving (7). Given this p the 'GEE method' is to construct standardized residuals for the ytJs from which an updated estimate of a is obtained (using the method of moments). These iterations are

If the first and second moment specifications are correct, the formal parallel between quasi-likelihood and likelihood can be used to show that large-sample standard errors for the GEE

where the subscript 'M' stands for 'model-based', since the formula will not produce consistent estimates if the assumed form of V is wrong. In practice, of course, estimates of D and V based on substituting the estimated values of P and a must be used. Again, however, the information-sandwich idea can be used to 'robustify' these standard errors, leading to covRrEE) = (DTV-'D)"1 £ (Dj vr1 (y( - aofo - Tti)TVi~1Di) (D?V~ 4))"1

which is simply a generalized version of the previous formula (6), which robustified the standard error of the independence ML estimate.

To capitalize on the possible efficiency gains of the GEE method, appropriate specification of R(ol) is required. In some situations a single-parameter 'exchangeable' model, in which corr(7¿j, Yik) = a for all j # k may provide substantial improvements over the independence assumption; for longitudinal data an autoregressive or other 'banded' correlation structure may be more appropriate. With large n and small J an unstructured correlation with J(J - l)/2 parameters may be feasible and provides maximum flexibility.

None of the previously described methods produces valid estimates if there are missing data, unless the data are missing 'completely at random',18 meaning that the chance that a particular response value is missing is independent of the observed responses and covariates for that subject. Heuristically, the methods break down because the estimating equations on which they are based no longer have zero expectation if there are subjects missing in a non-random fashion, so that consistency of the resulting estimates cannot be guaranteed. This view of the problem suggests a simple method of modifying the estimating equations by applying weights to allow for the pattern of missingness.19~22 We use a weighted version of (3):

where wu is a weight equal to the inverse probability that the jth response for individual i is observed (supposing for the moment that these probabilities are known). These weights can be expressed more formally as wy = l/PrQfy observed) = l/£(/u), where ¡¡j denotes a random variable that takes the value 1 if yu is observed and 0 if not. The weighted estimating equations are unbiased under repeated sampling of the /s, assuming the same sampling probabilities or weights, since

Y^m^ijUiAk)

(where Ez(.) denotes taking the expectation under the distribution of the random variable Z). The wtj serve to weight the score equations in such a way that we compensate for missing data values.

Weighting the estimating equations in this way is formally similar to quasi-likelihood in that the weighted estimating equations can be written in exactly the same form as (8) with AV_1 replaced by W, a diagonal matrix containing the weights wu. The parallel with quasi-likelihood again indicates that the same numerical methods can be used to solve the weighted estimating equations, and similarly a robust variance estimate allowing for possible dependence within individuals can be obtained. It should be noted, however, that the weighted estimating equations do not incorporate a within-subject correlation structure in the way that the GEEs do. It would be attractive to combine these two features into the estimating equations, but it is not clear how to handle the estimation of the a parameters when weights are present.

In practice, the probability of being observed is not known, so we need to estimate the weights in some reasonable way.23 We do this using a method related to that of'post-stratification' in the survey sampling literature.24 An initial analysis was performed using logistic regression to determine which of a number of fixed covariates were most strongly predictive of subjects completing only one or two waves. The covariates identified were sex, country of birth (Australia or not), parental divorce and smoking status at recruitment. These four dichotomous variables created a cross-classification of subjects into 16 cells. At each wave empirical weights were estimated within each of these cells as the inverse of the response rate achieved within that cell. Note that these weights will not be reliably estimated unless reasonable numbers are available within each cell. Estimation of the weights implies that the consistency argument given above no longer strictly holds, but it still provides an appealing heuristic justification for the method, assuming that the post-stratification is in fact predictive of the missing data mechanism. The method is valid as long as the missing data are a random subsample of responses within the cells

A final consideration that may be accommodated within the marginal modelling analysis is the complex design of the survey from which the data arose. Recall that the initial cohort was identified from a two-stage sampling process where schools were selected by stratified random sampling at the first stage. Full consideration of the issues raised by sample survey design is beyond the scope of this tutorial, partly because these issues rapidly become tangled with philosophical questions about the nature of statistical inference concerning finite populations, the traditional starting point of sample survey theory.25 We retain a pragmatic view for the current exposition and simply point out that the recruitment of a study group via a known sampling process in a finite population may involve one or all of the following:

(i) sampling weights, which reflect differential probabilities of being included for different

(ii) stratification, where sampling was carried out independently within sections (strata) of the

(iii) clustering, where a multi-stage selection process may produce a sample with higher levels of correlation within identified subgroups or clusters of the sample than between such

The effect of these features can be accommodated within the estimating equation framework. In particular, the sampling weights can be handled in exactly the same way as described in the previous section. Once the weights have been allowed for, stratification and clustering do not affect the expected value of the 'pseudo-score' statistic and so the estimating equations continue to provide consistent estimates. The sampling design may, however, affect the standard errors, in that stratification will lead to more precise estimation if between-strata variance is larger than within, while clustering may lead to less precise estimation because of dependencies within clusters. These effects may be allowed for in the information-sandwich approximation.3'19'25 First, the effect of stratification is dealt with in the calculation of the empirical variance of the observed that is used in the 'middle' of the sandwich, using standard sample survey concepts of variance estimation under stratification. Second, the clustering may be allowed for by expanding the unit on which the empirical variance of the sandwich formula is based from the individual level to the cluster (school) level. Thus, ignoring the other issues, allowance for cluster effects could be achieved in (6) by replacing the summation over i with a summation over (say) s, where s indexes schools, and the corresponding Xs, ys and ns are the school-specific components of X, y and n.

3.5. Results

Model (1) was fitted to the adolescent cohort data using the succession of methods discussed above. The analysis was carried out using the statistical package Stata3 and we present Stata commands as they were used with our data. Note that lines beginning with '*' are comments in Stata. The following labelling commands serve to introduce the variables used in the analysis:

* Label variables for prevalence analysis

* (HB Stata variable names are limited to 8 characters)

label variable Id label variable wave label variable regsmoke label variable c_age label variable wgt_mv label variable wgt_str label variable sregion

"identification of individuals" "Wave of data collection: 1 to 6" "Indicator of regular smoking" "Age centred at mean of cohort" "Missing data weights" "Strata sampling weights" "Stratum to which school belongs"

The first method of fitting model (1) was maximum likelihood, which corresponds to solving the estimating equations (4), with model-based standard errors given by the inverse information matrix, (5). This is accomplished in Stata by the logit (or closely related logistic) command, which is a special case of the glm command for performing ML estimation in generalized linear models. The command is coupled with the xi command, which generates dummy and interaction variables:

* Method (1): Maximum Likelihood logistic regression with model-based SEs xi: logit regsmoke i.sex*c_age

The results from this and subsequent methods are presented in Table I, where we present age effects for males and females separately, that is, fi2 and + h, respectively. The latter effect is not standard output from Stata regression commands but can be obtained after fitting the model, by using the lincom command, which obtains the standard error of linear combinations of estimated parameters, using the estimated variance-covariance matrix of the vector /J:

* Obtain age effect in females = age effect in males + interaction effect lincom _b[c_age] + _bfIsXc_a_l]

* Stata's xi command automatically generates the name IsXc_a_l for the

* interaction variable sex.age

The second method of fitting the model, maximum likelihood with robust standard errors, also solves the estimating equations (4) to estimate the model parameters, but standard errors are

Table I. Parameter estimates (with standard errors) from six different methods of estimating the expanded version of model (1) that includes the interaction between sex and age*

Survey estimation weighted for strata Maximum likelihood (ML) and missingness (Section 3.4)

0 0