Model Validation Methods

As mentioned before, examining the apparent accuracy of a multivariable model on the training dataset is not very useful. The most stringent test of a model (and of the entire data collection system) is an external validation: the application of the 'frozen' model to a new population. The failure of a model to validate externally could often have been predicted from an honest (unbiased) 'internal' validation. In other words, because overfitting is such a common problem, many clinical models that failed to validate would probably have been found to fail on another series of subjects from the original source. The principal methods for obtaining nearly unbiased internal assessments of accuracy are data-splitting,53 cross-validation54 and bootstrapping.55-58

In data-splitting, a random portion, for example 2/3, of the sample is used for all model development (data transformations, stepwise variable selection, testing interactions, estimating regression coefficients, etc.). That model is 'frozen' and applied to the remaining sample for computing calibration statistics, c, etc. The size of the validation sample must be such that the relationship between predicted and observed outcomes can be estimated with good accuracy, and the remaining data are used as the training (model development) sample. Data-splitting is simple, because all the modelling steps, which may include subjective judgements, are done only once. Data-splitting also has an advantage when it is feasible to make the single split with respect to geographical location or time, resulting in a more stringent validation that demonstrates generalizability. However, in addition to the severe difficulties listed below, data-splitting does not validate the final model if one desires to recombine the training and test data to derive a model for others to use.

Cross-validation is repeated data-splitting. To obtain accurate estimates using cross-validation, more than 200 models may need to be developed and tested,54 with results averaged over the repetitions. For example, in a sample of size n = 1000, the modelling process (all components of it!) could be done 400 times, leaving out a random 50 subjects each time and developing the model on the 950 remaining subjects. Cross-validation has two clear benefits over data-splitting. First, the training samples can be much larger, so less data are discarded from the estimation process. Secondly, cross-validation reduces variability by not relying on a single sample split.
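The leave-50-out scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the simulated data, the one-predictor least-squares fit (fit_line), the mean squared error accuracy index (mse) and the function name repeated_holdout are all hypothetical stand-ins and do not appear in the original text. In a real analysis every modelling step, including variable selection, would be repeated inside the loop.

```python
import random
import statistics


def fit_line(xs, ys):
    """Hypothetical 'modelling process': simple least-squares line."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return lambda x: ybar + slope * (x - xbar)


def mse(model, xs, ys):
    """Mean squared prediction error, used here as the accuracy index."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def repeated_holdout(xs, ys, n_test=50, n_rep=400, seed=0):
    """Leave out a random n_test subjects n_rep times; refit each time."""
    rng = random.Random(seed)
    n = len(ys)
    scores = []
    for _ in range(n_rep):
        test = set(rng.sample(range(n), n_test))
        train = [i for i in range(n) if i not in test]
        # All modelling steps are redone on the training subjects only.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # The frozen model is scored on the subjects that were left out.
        scores.append(mse(model, [xs[i] for i in test], [ys[i] for i in test]))
    return scores


# Simulated data (an assumption for illustration): n = 1000 subjects.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

scores = repeated_holdout(xs, ys)
cv_estimate = statistics.fmean(scores)   # averaged over the repetitions
split_spread = statistics.stdev(scores)  # variability between single splits
print(f"cross-validated MSE: {cv_estimate:.3f}")
print(f"SD across single splits: {split_spread:.3f}")
```

The spread across single splits gives a rough sense of how much an accuracy index can move when only one split is used, which is why the results are averaged over many repetitions.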

Efron has shown that cross-validation is relatively inefficient due to high variation of accuracy estimates when the entire validation process is repeated.54 Data-splitting is far worse; the indexes of accuracy will vary greatly with different splits. Bootstrapping is an alternative method of internal validation that involves taking a large number of samples with replacement from the original sample. Bootstrapping provides nearly unbiased estimates of predictive accuracy that are of relatively low variance, and it requires fewer model fits than cross-validation. Bootstrapping has the additional advantage that the entire dataset is used for model development. As others have shown, data are too precious to waste.59,60

Suppose that we wish to estimate the expected value (for new patient samples similar to the derivation sample) of the Somers' D rank correlation coefficient between predicted and observed survival time. The following steps can be used (see references 55, 58 and 60 for the basic method when applied to binary outcomes):

1. Develop the model using all n subjects and whatever stepwise testing is deemed necessary. Let Dapp denote the apparent D from this model, i.e., the rank correlation computed on the same sample used to derive the fit.

2. Generate a sample of size n with replacement from the original sample (for both predictors and the response).

3. Fit the full or possibly stepwise model, using the same stopping rule as was used to derive Dapp.

4. Compute the apparent D for this model on the bootstrap sample with replacement. Call it Dboot.

5. 'Freeze' this reduced model, and evaluate its performance on the original dataset. Let Dorig denote the D.

6. The optimism in the fit from the bootstrap sample is Dboot - Dorig.

7. Repeat steps 2 to 6 100-200 times.

8. Average the optimism estimates to arrive at O.

9. The bootstrap-corrected performance of the original stepwise model is Dapp - O. This difference is a nearly unbiased estimate of the expected value of the external predictive discrimination of the process which generated Dapp. In other words, Dapp - O is an honest estimate of internal validity, penalizing for overfitting.
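Steps 1 to 9 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are not in the original text: the outcome is uncensored, so Somers' D is computed from simple pairwise comparisons; the 'stepwise' modelling process is replaced by a hypothetical stand-in (fit_best_single_predictor) that keeps the single best predictor; and the data are simulated. The names somers_d and bootstrap_corrected_d are likewise illustrative.

```python
import random


def somers_d(pred, obs):
    """Somers' D of pred with respect to obs, assuming no censoring.

    (concordant - discordant) / number of pairs not tied on obs.
    """
    conc = disc = usable = 0
    n = len(obs)
    for i in range(n):
        for j in range(i + 1, n):
            if obs[i] == obs[j]:
                continue  # pairs tied on the outcome carry no information
            usable += 1
            s = (pred[i] - pred[j]) * (obs[i] - obs[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / usable if usable else 0.0


def fit_best_single_predictor(X, y):
    """Hypothetical stand-in for stepwise selection: keep one predictor."""
    n, p = len(y), len(X[0])
    ybar = sum(y) / n
    best = None
    for k in range(p):
        col = [row[k] for row in X]
        xbar = sum(col) / n
        sxx = sum((v - xbar) ** 2 for v in col)
        if sxx == 0:
            continue
        sxy = sum((v - xbar) * (w - ybar) for v, w in zip(col, y))
        explained = sxy * sxy / sxx  # regression sum of squares
        if best is None or explained > best[0]:
            best = (explained, k, sxy / sxx, xbar)
    if best is None:
        return lambda x: ybar
    _, k, slope, xbar = best
    return lambda x: ybar + slope * (x[k] - xbar)


def bootstrap_corrected_d(X, y, fit, n_boot=200, seed=0):
    rng = random.Random(seed)
    n = len(y)
    model = fit(X, y)                                  # step 1
    d_app = somers_d([model(x) for x in X], y)
    optimism_values = []
    for _ in range(n_boot):                            # step 7
        idx = [rng.randrange(n) for _ in range(n)]     # step 2
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        mb = fit(Xb, yb)                               # step 3 (selection redone)
        d_boot = somers_d([mb(x) for x in Xb], yb)     # step 4
        d_orig = somers_d([mb(x) for x in X], y)       # step 5
        optimism_values.append(d_boot - d_orig)        # step 6
    o = sum(optimism_values) / n_boot                  # step 8
    return d_app, o, d_app - o                         # step 9


# Simulated data (an assumption): one real predictor and four noise columns.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [row[0] + rng.gauss(0, 2) for row in X]

d_app, optimism, d_corrected = bootstrap_corrected_d(
    X, y, fit_best_single_predictor, n_boot=50
)
print(f"Dapp = {d_app:.3f}, O = {optimism:.3f}, Dapp - O = {d_corrected:.3f}")
```

Passing the whole fitting procedure in as fit is a deliberate choice: it ensures that variable selection is repeated inside every bootstrap sample, which is what step 3 requires. Only 50 repetitions are used here to keep the run short; step 7 calls for 100-200.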

As an example, suppose we want to validate a stepwise Cox model developed from, say, a sample of size n = 300 with 30 events. The candidate regressors are age, age², sex, mean arterial blood pressure (MBP), and a non-linear interaction between age and sex with the terms age × sex and age² × sex. MBP is assumed to be linear and additive. Denote these variables by the numbers 1-6. The model χ² is 45 with 6 d.f., so the approximate expected shrinkage is (45 - 6)/45 = 0.87, or 0.13 overfitting. Some caution therefore needs to be exercised in using the estimated model coefficients, and hence in using extreme predicted survival probabilities, without calibration (shrinkage). The D for the full model is 0.42. A step-down variable selection using Akaike's information criterion (AIC)34,61 as a stopping rule (χ² for set of variables tested > 2 × d.f.) resulted in a model with the variables age, age², sex, age × sex. The reduced model had D = 0.39, a typical loss due to deleting marginally important but statistically insignificant variables. Two hundred bootstrap repetitions are done, repeating the variable selection for each sample using the same stopping rule. We want to detect whether the D = 0.39 is likely to validate in a new series of subjects from the same population. The first five samples might yield the results shown in Table I.

Re-sample    Variables retained    Dboot    Dorig    Optimism