Abstract
Predictors of distant metastasis in colorectal cancer: a multimodal statistical and machine‑learning analysis
1 Collegium Medicum, Jan Kochanowski University, Kielce, Poland; 2 Meduniv Sp. z o.o., Kielce, Poland
Introduction
Distant metastasis (M1) is the principal determinant of outcome in colorectal cancer (CRC). Identifying robust molecular and clinical predictors at diagnosis may improve risk stratification.
Aim of the research
To develop and compare complementary statistical and machine‑learning models for predicting metastatic status and to synthesize convergent predictors.
Material and methods
In a single-centre cohort (N = 54; M1 = 26, M0 = 28), we modelled a binary outcome, any type of CRC metastasis (M_any), using multivariable logistic regression (LR), Random Forest (RF), LASSO-penalized logistic regression, and a Support Vector Machine (SVM). In a secondary analysis, we fitted a multinomial logistic regression with three outcome categories: Early_No_Mets (pT1-2M0), Advanced_No_Mets (pT3-4M0), and Metastatic (M1). Discrimination was summarized as the area under the ROC curve (AUC); classification metrics were obtained with each method's native validation scheme (LR: in-sample; RF: out-of-bag [OOB]; SVM: 10-fold cross-validation [CV]). Reporting follows TRIPOD guidance.
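The three validation schemes named above can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy and uses synthetic data with the same class balance as the cohort; the feature columns, hyperparameters, and random seeds are hypothetical, and this is not the authors' code or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the cohort: 26 patients with M1, 28 with M0.
y = np.array([1] * 26 + [0] * 28)
X = np.column_stack([
    rng.integers(0, 2, size=54),   # hypothetical binary mutation flag
    rng.integers(0, 2, size=54),   # hypothetical binary mutation flag
    rng.normal(65, 10, size=54),   # hypothetical age in years
])

# 1) Logistic regression scored in-sample: the same patients are used for
#    fitting and scoring, so this AUC is optimistic.
#    A very large C makes the fit effectively unpenalized.
lr = make_pipeline(StandardScaler(), LogisticRegression(C=1e6, max_iter=1000))
lr.fit(X, y)
auc_lr = roc_auc_score(y, lr.predict_proba(X)[:, 1])

# 2) Random forest scored out-of-bag: each patient is predicted only by
#    trees whose bootstrap sample did not contain that patient.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
oob_error = 1.0 - rf.oob_score_

# 3) SVM scored by stratified 10-fold CV: each patient is predicted by a
#    model fitted on the other nine folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_scores = cross_val_predict(svm, X, y, cv=cv, method="decision_function")
auc_svm = roc_auc_score(y, svm_scores)

print(f"LR in-sample AUC: {auc_lr:.3f}")
print(f"RF OOB error:     {oob_error:.3f}")
print(f"SVM 10-fold AUC:  {auc_svm:.3f}")
```

Because the features here are random noise, the printed values carry no clinical meaning. The sketch only shows why metrics from the three schemes are not directly comparable: the in-sample estimate reuses the training data, while the OOB and CV estimates are computed on held-out patients.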
Results
LR identified TP53 mutation as the strongest predictor, although the association did not reach significance at the conventional 0.05 level (OR = 3.47; 95% CI: 0.92–14.5; p = 0.073; AUC = 0.713). RF achieved an OOB error of 40.74% (accuracy = 59.3%); the top-ranked features were NRAS, TP53_pathway, TP53, WNT_pathway, and n_genes. LASSO (10‑fold CV) retained NRAS, TP53, and age (coefficients +1.15, +0.85, and −0.02, respectively). SVM yielded AUC = 0.678 and accuracy = 55.6%. The multinomial model revealed complete separation for NRAS in the Advanced_No_Mets group, precluding standard maximum-likelihood estimation (MLE) inference.
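The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes a Wald (normal) approximation on the log-odds scale, which may not be the interval method the authors used; all variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values reported in the Results above.
odds_ratio, ci_low, ci_high = 3.47, 0.92, 14.5
n_patients = 54
oob_error = 0.4074
svm_accuracy = 0.556

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)  # about 1.96 for a 95% interval

# Assumption: a symmetric Wald interval on the log scale, so the standard
# error is the log-scale width of the interval divided by 2 * z_crit.
log_or = math.log(odds_ratio)
se_approx = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
z_stat = log_or / se_approx
p_approx = 2 * (1 - std_normal.cdf(abs(z_stat)))

# An interval that contains 1 is consistent with no association.
ci_excludes_null = ci_low > 1.0 or ci_high < 1.0

# Convert the error and accuracy rates back into patient counts.
rf_misclassified = round(oob_error * n_patients)
rf_accuracy = 1 - oob_error
svm_correct = round(svm_accuracy * n_patients)

print(f"approximate SE(log OR): {se_approx:.3f}")
print(f"approximate p-value:    {p_approx:.3f}")
print(f"CI excludes OR = 1:     {ci_excludes_null}")
print(f"RF misclassified:       {rf_misclassified} of {n_patients}")
print(f"SVM correct:            {svm_correct} of {n_patients}")
```

Under this approximation the p-value comes out at about 0.077, close to but not identical with the reported 0.073; the reported interval is not symmetric around log(OR), which suggests it was obtained by a different method such as profile likelihood. The counts correspond to 22 of 54 patients misclassified out-of-bag by RF and 30 of 54 classified correctly by SVM.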
Conclusions
TP53 (gene/pathway) is a consistent risk signal across methods; NRAS carries high importance in ensemble/regularized models. Overall discrimination is modest, consistent with a small sample size; findings are hypothesis‑generating and warrant validation.
Keywords
colorectal cancer, metastasis, TP53, NRAS, logistic regression, random forest, LASSO, SVM, TRIPOD