# T Ry ey0

where Ty = (1, Ty; Ty )' is a vector and Ry = (A0y, fay fa2y )' is a random vector and independent from ey, the first-level residual. We assume that E(ey) = 0 and cov(ey,e/g ) = / {i = f & y = g} a2, where I {.} is an indicator function, that is, I {i = f & y = g} = 1, if i = f and y = g, and = 0, otherwise; variance of the first-level error e is assumed to be homogeneous and independent among subjects as well as among different time points within subjects. This assumption implies that the first-level residuals are uncorrelated after adjusting for trend, and var(ey ) = c2 for all i and y. However, this assumption does not necessarily imply that the first-level observations within individuals are uncorrelated - see (4).

The random coefficients fa's are further modelled (as functions of covariates Z) for the second-level individuals by decomposing into fixed and random components as follows:

faqy = 700 + 701Z1 y + 702 Z2y +-----+ yopo Zpo y + «0y fay = 710 + ynZy + ynZ2y +-----+ y1pi Zpy + My and fa2y = 720 + 721 Zy + 722Z2y +-----+ 72p Zp2y + W2y where M0j, Miy and u2j represent residuals for intercepts, slope and quadratic terms, respectively. It follows that for k = 0,1 and 2

where Zkj 's are vectors with different lengths depending on k and their first component is 1,

Zkj = (1,Zij,Z2j,...,Zpkj)' and yk = ()'ko,7ki,7k2,...,7kpk)'. The other components of Zkj's are the covariate values for the jth individual's baseline characteristic. The vector y's are fixed parameter column vectors, and E(uj) = 0 = (0,0 , 0)' , Uj = (u0j, u1j, u2j)' and

for all j. This covariance is modelled for the second-level variance of the growth curve. In sum, by introducing the second-level model (2) into the first-level model (1), we have

Yy = (Z0y Y0> Z1yS1 , Z2yS2 )Ty + Tyuy + ey the first term in the right hand side (RHS) representing the fixed effects and the other terms in the RHS the random effects (or parameters). Thus, the hierarchical linear model can be represented in terms of expectation and covariance in general as

where Y = (Y' Yj YN)' and Yj = (Yij,Y1},...,Y„j})'. The expectation XT and the co-variance matrix Q represent fixed and random effects, respectively. Specifically, for the fixed effects, the row of the design matrix X corresponding to Yj is (Zj.Tj. Zj Tj2Z'2j), and r = (s0. Y'. y2)', the fixed coefficient vector. On the other hand, regarding elements of Q

cov( Y/j. Yfg) = I {i = f & j = g]a2 +1 {j = g}T/ X„Tfg (4)

In other words, the variance of the first-level observation is var(Yj ) = c2 + TjSuTij-, which is the sum of the first- and the second-level variance, and the covariance between observations at two different time points i and f within each individual is Tj£uTfg. This implies that the covariance matrix Q is block-diagonal with size £/ni=1 „j x £/ni=1 „j, that is, Q ^ 0/N=1(^2I„/ + Vj), where © denotes the direct sum of matrices, I„j is the „j x „j identity matrix, and Vj is the „j x „j matrix with (i. f )th element Tj£uTfg for the jth individual. In this case, the ith time point is not necessarily different from the fth one. This covariance structure is crucial to the estimation procedure described in the following section, because the second level covariance must be taken into consideration, which traditional approaches such as OLS estimation do not accommodate. OLS assumes that Vj is a zero matrix for all j.

### 5.2. Estimation

Under model (3) with unknown Q, which is a general case, estimation of both fixed and random effects would require iterative generalized least squares (IGLS; for example, reference [37]). Briefly, the generalized least squares estimate of T is r = (X'QX)_1X'iQY for some initial estimate Q. With the newly estimated r and residuals calculated from it, estimates Q can be updated. With the updated estimates Q, estimates r will again be obtained. Such iteration continues until some convergence criteria are met. For further details of estimation procedures and properties of resulting estimates, see Goldstein (reference [7, Chapter 2]).

With a final Q, the variance at the first level, the variance at the second level at time Tj = t, and the covariance of r will be estimated as <r2, (1.t.t2)'t,u(1.t.t2)' and (X'Q^X)-1, respectively. The IGLS estimates are shown to be equivalent to maximum likelihood estimates [37] when the error terms are normally distributed, that is, eij~N(0.c2) and Uj~MN(0.£u), which in turn implies

However, even when the data are not multivariate normal, the IGLS still yields consistent estimates [7].

### 5.3. Goodness of fit

For comparisons of several competing models with different first- and second-level variables X (especially with different Z), relative goodness of fit can be measured by the resulting log-likelihood. That is, under normality assumption (5) apart from constants logL(Y; Xr . Q) = -tr[Q-1(Y - XT)(Y - XT)'] - log |Q|

where tr is the trace, that is, sum of diagonal terms of a square matrix, and |Q| is the determinant of the matrix Q. Even individuals with only single time point measurements can contribute not only to the estimation of the intercept time course of BMI change but also to the estimation of slopes, that is, the longitudinal changes in BMI because of the iterative nature of fitting and thus calculation of the likelihood. Alternatively, for the comparisons, we used the correlation r between predicted values from estimated fixed effects and observations, that is r = corr(Y, XI") and the mean squared error (MSE), that is

where /(Y) is the length of vector Y.

Significance testing of joint effects of additional covariates was performed by comparing the so-called 'deviance', that is, twice minus the log-likelihood difference, which is asymptotically distributed as x2 with the number of additional covariates for the degrees of freedom. To see a relative increment of goodness of fit by those additional covariates, we calculated the pseudo multiple R2 improvement described by Everitt (reference [38], p. 267), that is, the per cent increment in log-likelihoods due to addition of more independent variables in a model:

x 100

where W and A are subsets of X and r, respectively. In other words, the model with WA is nested in the model with XT.

For comparisons among non-nested models (that is, when W and A are not necessarily subsets of X and T, respectively), both Akaike's information criterion (AIC) [14] and the Bayesian information criterion (BIC) [15] can be used. The AIC is simply the reduction in log-likelihood minus the increase in number of parameters; positive AIC indicates an improvement in fit (reference [39], p. 272). Specifically, AIC can be defined as

AIC(Xr : WA) = logL(Y; Xr, Q) - logL(Y; WA, Q) - [/(r) - /(A)] Similarly, BIC can be defined as

BIC(Xr : WA) = logL(Y; Xr, Q) - logL(Y; WA, Q) - [/(r) - /(A)] log ^£ j

Positive BIC also indicates an improvement in fit.

### 5.4. Models considered

We considered five models, namely M1 to M5. The associated coefficients and parameters can be found in Table V. These five models have the same form as in (1). However, they

Table V. Results of hierarchical linear model fitting with 75 351 data units from 14 257 subjects; the TMFS subjects are the referent.

 Baseline covariate (Z)