## Summary Of Modelling Strategy

1. Assemble accurate, pertinent data and as large a sample as possible. For survival time data, follow-up must be sufficient to capture enough events as well as the clinically meaningful phases if dealing with a chronic disease.

2. Formulate focused clinical hypotheses that lead to specification of relevant candidate predictors, the form of expected relationships, and possible interactions.

3. Discard observations having missing Y after characterizing whether they are missing at random (see note I). See reference 62 for a study of imputation of Y when it is not missing at random.

4. If there are any missing Xs, analyse factors associated with missingness. If the fraction of observations that would be excluded due to missing values is very small, or one of the variables that is sometimes missing is of overriding importance, exclude observations with missing values (see note II). Otherwise impute missing Xs using individual predictive models that take into account the reasons for missingness, to the extent possible.

5. If the number of terms fitted or tested in the modelling process (counting non-linear and cross-product terms) is too large in comparison with the number of outcomes in the sample, use data reduction that ignores Y (references 20, 23) until the number of remaining free variables needing regression coefficients is tolerable. Assessment of likely shrinkage (overfitting) can be useful in deciding how much data reduction is adequate. Alternatively, build shrinkage into the initial model fitting (reference 19).

6. Use the entire sample in the model development as data are too precious to waste. If steps listed below are too difficult to repeat for each bootstrap or cross-validation sample, hold out test data from all model development steps which follow.

7. Check linearity assumptions and make transformations in Xs as needed.

8. Check additivity assumptions and add clinically motivated interaction terms.

9. Check to see if there are overly influential observations (reference 30). Such observations may indicate overfitting, the need for truncating the range of highly skewed variables or making other pre-fitting transformations, or the presence of data errors.

Note I (step 3): For survival time data, no observations should be missing on Y. They should only have curtailed follow-up.

Note II (step 4): Alternatively, impute missing values for the predictor but perform secondary analyses later to estimate the strength of association between X and Y after deleting observations with that predictor imputed, as imputation will attenuate the relationship.

10. Check distributional assumptions and choose a different model if needed (in the case of Cox models, stratification or time-dependent covariables can be used if proportional hazards is violated).

11. Do limited backwards step-down variable selection (reference 63). Note that since stepwise techniques do not really address overfitting and can result in a loss of information, full model fits (that is, leaving all hypothesized variables in the model regardless of P-values) are frequently more discriminating than fits after screening predictors for significance (references 2, 40). They also provide confidence intervals with the proper coverage, unlike models that are reduced using a stepwise procedure (references 60, 64, 65), from which confidence intervals are falsely narrow. A compromise would be to test a pre-specified subset of predictors, deleting them if their total $\chi^2 < 2 \times \text{d.f.}$ If the $\chi^2$ is that small, the subset would likely not improve model accuracy.

12. This is the 'final' model.

13. Validate this model for calibration and discrimination ability, preferably using bootstrapping. Steps 7 to 11 must be repeated for each bootstrap sample, at least approximately. For example, if age was transformed when building the final model, and the transformation was suggested by the data using a fit involving age and age$^2$, each bootstrap repetition should include both age variables with a possible step-down from the quadratic to the linear model based on automatic significance testing at each step.

14. If doing stepwise variable selection, present a summary table depicting the variability of the list of 'important factors' selected over the bootstrap samples or cross-validations. This is an excellent tool for understanding why data-driven variable selection is inherently ambiguous.

15. Estimate the likely shrinkage of predictions from the model, either using equation (2) or by bootstrapping an overall slope correction for the predictions (reference 34). Consider shrinking the predictions to make them calibrate better, unless shrinkage was built in. That way, a predicted mortality of 0.4 is more likely to validate in a new patient series, instead of finding that the actual mortality is only 0.2 because of regression to the mean mortality of 0.1.
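
The variability table described in step 14 can be sketched in a few lines. This is only an illustration under stated assumptions: the data are simulated, the predictor names `x1` to `x4` are hypothetical, and the selection rule (keep a predictor when its absolute correlation with the outcome exceeds a threshold) is a simple stand-in for backwards step-down selection, not the procedure itself. The point is the tally: rerun the same selection rule on each bootstrap sample and report how often each predictor is chosen.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select(rows, names, threshold):
    """Stand-in selection rule (an assumption): keep predictors with |r| > threshold."""
    ys = [row["y"] for row in rows]
    return [nm for nm in names
            if abs(pearson([row[nm] for row in rows], ys)) > threshold]


def selection_frequencies(rows, names, n_boot=200, threshold=0.2, seed=1):
    """Fraction of bootstrap samples in which each predictor is selected."""
    rng = random.Random(seed)
    counts = {nm: 0 for nm in names}
    for _ in range(n_boot):
        # Resample observations with replacement, same size as the original data.
        sample = [rng.choice(rows) for _ in rows]
        for nm in select(sample, names, threshold):
            counts[nm] += 1
    return {nm: counts[nm] / n_boot for nm in names}


# Simulated data: x1 has a strong effect, x2 a weak one, x3 and x4 none.
data_rng = random.Random(0)
names = ["x1", "x2", "x3", "x4"]
rows = []
for _ in range(100):
    row = {nm: data_rng.gauss(0, 1) for nm in names}
    row["y"] = 0.5 * row["x1"] + 0.2 * row["x2"] + data_rng.gauss(0, 1)
    rows.append(row)

freq = selection_frequencies(rows, names)
for nm in names:
    print(f"{nm}: selected in {freq[nm]:.0%} of bootstrap samples")
```

Predictors with weak or no true effect are typically selected in some bootstrap samples and dropped in others, which is the instability the summary table is meant to expose.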
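
The arithmetic behind step 15 can be illustrated with a short sketch. It rests on two assumptions that are not stated in this section: that equation (2) is the heuristic shrinkage estimate $\hat{\gamma} = (\text{model } \chi^2 - p)/\text{model } \chi^2$, where $p$ is the number of fitted parameters, and that the slope is applied on the log-odds scale around the average risk. A full recalibration would also re-estimate the intercept. The function names and the example numbers are hypothetical.

```python
import math


def heuristic_shrinkage(model_chi2, n_params):
    """Assumed form of the heuristic shrinkage estimate: (chi2 - p) / chi2."""
    if model_chi2 <= 0:
        raise ValueError("model chi-square must be positive")
    # Floor at zero: a chi-square below p suggests the model carries no usable signal.
    return max(0.0, (model_chi2 - n_params) / model_chi2)


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def shrink_risk(predicted, mean_risk, slope):
    """Pull a predicted risk toward the mean risk on the log-odds scale."""
    centre = logit(mean_risk)
    return expit(centre + slope * (logit(predicted) - centre))


# Hypothetical model: likelihood ratio chi-square of 50 with 20 parameters.
gamma = heuristic_shrinkage(model_chi2=50.0, n_params=20)
shrunk = shrink_risk(predicted=0.4, mean_risk=0.1, slope=gamma)
print(f"shrinkage factor: {gamma:.2f}")
print(f"predicted 0.40 shrinks to {shrunk:.3f}")
```

A slope of 1 leaves the prediction unchanged and a slope of 0 returns the mean risk, so values in between pull an extreme prediction such as 0.4 part of the way back toward 0.1.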
