Gary S. Collins, Maarten van Smeden, Richard D. Riley
European Respiratory Journal 2020; DOI: 10.1183/13993003.02643-2020
COVID-19 prediction models should adhere to methodological and reporting standards
Letter Re: Development of a clinical decision support system for severity risk prediction and triage of COVID-19 patients at hospital admission: an international multicentre study.
The covid-19 pandemic has led to a proliferation of clinical prediction models to aid diagnosis, disease severity assessment and prognosis. A systematic review has identified sixty-six covid-19 prediction models – concluding all, with no exception, are at high risk of bias due to concerns surrounding the data quality, statistical analysis and reporting, and none are recommended for use. Therefore, we read with interest the recent paper by Wu and colleagues describing the development of a model to identify covid-19 patients with severe disease on admission to facilitate triage. However, our enthusiasm was dampened by a number of concerns surrounding the design, analysis and reporting of the study which deserve highlighting to readers.
Our first point relates to design. The authors randomly split their dataset into a training set and a test set. This has long been shown to be an inefficient use of the data – reducing the size of the training set (increasing the risk of model overfitting), and creating a test set too small for model evaluation. There are stronger alternative approaches, based on cross-validation or bootstrapping, that use the entire dataset to both develop and internally validate a model. This naturally leads us to elaborate further on the sample size. The sample size in a prediction model study is largely influenced by the number of individuals experiencing the event to be predicted (in Wu’s study, those with severe disease). Using published sample size formulae for developing prediction models, and based on information reported in the Wu study (75 predictors, outcome prevalence of 0.237), the minimum sample size depends on the anticipated model R-squared; in the most optimistic scenario (i.e., assuming the model achieves the highest anticipated R-squared) it would be 1285 individuals (306 events). To precisely estimate the intercept alone requires 279 individuals (66 events). After splitting their data, the authors developed their model with a sample size of 239 individuals (57 events) – clearly insufficient to estimate even the model intercept, let alone develop a prediction model.
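The intercept requirement quoted above can be illustrated with a minimal sketch. It assumes the commonly used precision criterion for the overall outcome proportion (a margin of error of 0.05 at the 95% level); this choice of formula, and the reconstruction of the prevalence from the training and test counts given in this letter (with the test-set event count taken as 14), are our assumptions for illustration rather than a restatement of the exact calculation. Small differences can arise from how the prevalence is rounded.

```python
import math

def n_for_intercept(prevalence, margin=0.05, z=1.96):
    """Minimum sample size so that the overall outcome proportion
    (and hence the model intercept) is estimated to within +/- margin."""
    return math.ceil((z / margin) ** 2 * prevalence * (1 - prevalence))

# Assumption: prevalence reconstructed from the counts quoted in this letter
# (training set: 239 individuals, 57 events; test set: 60 individuals, ~14 events).
events = 57 + 14
total = 239 + 60
prevalence = events / total  # approximately 0.237

n_min = n_for_intercept(prevalence)
events_min = round(n_min * prevalence)

print(f"prevalence       = {prevalence:.3f}")
print(f"minimum n        = {n_min}")
print(f"minimum events   = {events_min}")
print(f"development n    = 239 (shortfall of {n_min - 239})")
```

Under these assumptions the development sample of 239 falls short of the minimum needed for the intercept alone, before any of the 75 candidate predictors are considered.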
The test set, comprising 60 individuals of whom ∼14 experienced the event, was then used to evaluate the performance of their model. To put this in perspective, current sample size recommendations to evaluate model performance suggest a minimum of 100 events. The performance of the model was also evaluated separately in each of five external validation datasets, where the number of events ranged from 7 to 98, none of which met this minimum requirement.
Other concerns include the handling of missing data; it is hard to believe all patients had complete information on all 75 predictors, and indeed the flow chart reveals that 38 individuals with missing data were simply excluded, which can lead to bias. Continuous predictors were assumed to be linearly associated with the outcome, which can reduce predictive accuracy. Model overfitting (a clear concern given the small sample size) was not addressed, either by adjusting the performance measures for optimism or by shrinking the regression coefficients that are likely overestimated (e.g. using penalisation techniques). “Synthetic sampling” was used to address imbalanced data, but this is inappropriate since artificially balancing data will produce an incorrect estimate of the model intercept (unless it is re-adjusted post-estimation), leading to incorrect model predictions (miscalibration). Model performance was poorly and inappropriately assessed, including presenting a confusion matrix (inappropriate for evaluating prediction models), reporting sensitivity/specificity (where net benefit would be more informative), and assessing model calibration using weak and, again, discredited approaches (e.g. the Hosmer-Lemeshow test, rather than calibration plots with graphical loess curves). We also question the arbitrary choice of risk groupings, and why individuals with a predicted risk of 0.21 are considered the same (“middle risk”) as those with a predicted risk of 0.80.
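To illustrate why artificial balancing distorts the intercept, the following is a minimal sketch of a post-estimation re-adjustment for a logistic model. It assumes the standard prior correction for outcome-dependent sampling, in which the intercept is shifted by the difference in log-odds between the sample and true prevalence. The intercept value of 0.0 and the 50% balanced prevalence are hypothetical, chosen only for illustration, and for synthetic (rather than simple random) resampling such a correction is approximate at best.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

def expit(x):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1 / (1 + math.exp(-x))

def corrected_intercept(intercept, true_prev, sample_prev):
    """Shift an intercept estimated at sample_prev back to true_prev
    (prior correction for outcome-dependent sampling)."""
    return intercept - (logit(sample_prev) - logit(true_prev))

# Hypothetical model fitted on data balanced to 50% events, intercept 0.0.
b0_balanced = 0.0
b0 = corrected_intercept(b0_balanced, true_prev=0.237, sample_prev=0.5)

# Predicted risk for an individual whose predictors contribute 0
# to the linear predictor, before and after the correction.
risk_uncorrected = expit(b0_balanced)
risk_corrected = expit(b0)

print(f"uncorrected risk = {risk_uncorrected:.3f}")
print(f"corrected risk   = {risk_corrected:.3f}")
```

Without the correction, the average predicted risk tracks the artificial 50% prevalence rather than the observed 0.237, which is the miscalibration described above.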
Arguably the most important aspect of a prediction model article is the presentation of the model so that others can use or evaluate it in their own setting. The authors have presented a nomogram and (prematurely) linked to a web calculator. Whilst both these formats can be used to apply the model to individual patients (though given our concerns we urge against this), for independent validation the prediction model needs to be reported in full – namely all the regression coefficients and the intercept, but these are noticeably absent.