An interpretable weighted ensemble based on routinely collected clinical data for the accurate prediction of axillary lymph node metastasis
Introduction
The management of the axilla in patients with early-stage breast cancer remains a critical and evolving clinical challenge. Axillary lymph node (ALN) status is one of the most powerful independent prognostic factors for patients with breast cancer and a cornerstone of oncological staging that informs adjuvant systemic therapy, radiotherapy planning, and the extent of surgery (1,2). Traditionally, ALN dissection (ALND) has been relied upon to provide definitive staging but is associated with debilitating long-term morbidity, most notably lymphedema, chronic pain, and sensory deficits. The advent of sentinel lymph node biopsy (SLNB) represented a major paradigm shift, substantially reducing this morbidity. However, SLNB is itself an invasive procedure, and a large proportion of patients subjected to it are found to be node-negative. Consequently, there is an urgent and unmet need for accurate, noninvasive, preoperative models that can reliably identify patients with a low probability of ALN metastasis, which would safely de-escalate surgery and spare a significant portion of this population from any invasive axillary procedure.
Recent research efforts toward achieving this goal have largely focused on computationally intensive modalities, particularly the deep learning-based analysis of whole-slide images (WSIs) from biopsies (3-5). These attention-based multiple-instance learning (AMIL) frameworks have set high benchmarks, achieving areas under the curve (AUCs) in the range of 0.81 to 0.86 (3,6). This has been improved upon through the use of advanced multimodal learning approaches that fuse WSI-derived features with clinical information (7,8) or advanced deep learning architectures in histopathology analysis (4,5). Other work has investigated the use of advanced imaging modalities, such as the development of deep learning radiomics for magnetic resonance imaging (MRI) data (9). This focus on high-complexity, high-cost data modalities, however, requires a sophisticated informatics infrastructure, slide digitization, and extensive pathologist annotation, limiting their widespread clinical adoption.
Despite these advances, the vast majority of clinical decisions globally are made with the most ubiquitous, standardized, and cost-effective data source: structured, tabular clinicopathological records. This dataset includes parameters such as patient age, tumor size, histological grade, and biomarker status [e.g., estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki-67]. Some studies have compared standard machine learning algorithms, such as extreme gradient boosting (XGBoost) and logistic regression (LR), for this type of data (10,11), and recent work by Song et al. (12) and Zhang et al. (13) has established predictive models for cohorts of breast cancer patients with similar clinicopathological characteristics, while others have developed ensemble learning approaches (14-16). Nonetheless, a systematic framework for this widely available tabular clinicopathological data modality is urgently needed. Furthermore, predictive accuracy alone is insufficient for clinical adoption. The “black-box” nature of complex models is a major, well-documented barrier to ensuring clinical trust and implementation (17,18).
Crucially, in the context of quantitative imaging, establishing a robust baseline with standardized clinicopathological data is a prerequisite for evaluating the true incremental value of advanced imaging markers. A transparent, high-performance clinical model can serve not only as a standalone screening tool but also as a necessary “clinical benchmark” against which complex radiomic models can be compared. Furthermore, understanding the biological consistency between routine clinical risk factors and imaging phenotypes is essential for developing effective multimodal fusion models.
To address these issues, we developed, validated, and comprehensively interpreted a two-stage weighted ensemble model for ALN metastasis prediction based only on preprocessed, tabular clinicopathological data. Our approach is guided by a theoretical framework that holds the following assumptions: (I) superior prediction requires capturing the complex, nonlinear interactions inherent in these data; (II) model validation demands rigorous benchmarking against both interpretable linear baselines and state-of-the-art nonlinear models; (III) clinical translation is contingent upon explainable artificial intelligence (XAI)-driven transparency. We hypothesized that a weighted ensemble, by integrating the stability of a linear model (through LR) with the robust nonlinear feature-interaction capture of a tree-based model (XGBoost), can achieve predictive performance superior to its constituent components. Specifically, we defined two sub-hypotheses: H1, that the weighted ensemble would achieve a higher AUC on the held-out test set than either individual base learner alone; and H2, that this performance gain would reflect true model synergy—captured by the meta-learner—rather than overfitting to training data. Furthermore, we hypothesized that our XAI-driven approach can validate the model’s clinicopathological reasoning (H3) and provide a transparent, patient-specific attribution framework, thereby fulfilling a key requirement for clinical trust and adoption (H4). We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/rc) (19).
Methods
Study design and theoretical framework
A retrospective cohort study was conducted to develop and validate machine learning models for predicting ALN metastasis. The models were trained and evaluated on a processed, tabular dataset derived from preoperative clinical and pathological patient records. The entire analytical process was guided by a three-pillar theoretical framework established a priori according to established standards for clinical prediction models (20). The tenets of this framework are as follows: (I) predictive models must be capable of capturing complex, nonlinear feature representations beyond simple linear associations; (II) model superiority must be established through rigorous benchmarking in a multi-model comparison paradigm against both classical statistical baselines and state-of-the-art machine learning models; (III) clinical adoption requires high performance to be paired with model transparency, achieved through XAI to build clinical trust, in accordance with established principles for the development of clinical prediction models (20).
Data acquisition and preprocessing
Data from patients with pathologically confirmed ALN status were retrieved from a single-center retrospective database at Huaihe Hospital of Henan University (Kaifeng, China). These data were processed into standardized training and testing cohorts. The raw data underwent a rigorous, multistep preprocessing pipeline to ensure data quality and standardization. This included the consolidation of data from multiple sources, removal of duplicate entries, and harmonization of feature names.
Feature cleaning was extensive. For numerical features (e.g., age, maximum diameter, and ER, PR, and Ki-67 expression), values were standardized; for example, percentage strings were converted to float values (0.0–1.0), and nonnumeric entries or placeholders were coerced to missing values. For categorical and ordinal features [e.g., histological grade, menopause, HER2 expression, location, pathology type, time-intensity curve (TIC), and lymphovascular invasion], values were normalized by mapping text-based variations to a single integer or category and removing extraneous characters. The target variable, ALN, was binarized, with positive status mapped to 1 and negative status to 0.
The final processed dataset included 12 primary features. This dataset was then transformed by a final preprocessing pipeline, which included median imputation for missing numerical values, most-frequent imputation for categorical values, standard scaling (z-score normalization) for all numerical and ordinal features, and one-hot encoding for nominal categorical features. This resulted in a final feature set of 39 dummy variables used for model training.
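The pipeline described above can be sketched with scikit-learn. Note that the column names used here (“Age”, “MaxDiameter”, “PathologyType”) are illustrative assumptions, not the study’s actual feature names, and the toy data stand in for the clinical records:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Age", "MaxDiameter"]  # numerical/ordinal features (illustrative)
nominal_cols = ["PathologyType"]       # nominal categorical features (illustrative)

preprocessor = ColumnTransformer(
    [
        # median imputation + z-score scaling for numerical/ordinal features
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # most-frequent imputation + one-hot (dummy) encoding for nominal features
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), nominal_cols),
    ],
    sparse_threshold=0.0,  # force a dense output array
)

df = pd.DataFrame({
    "Age": [53, np.nan, 47, 61],
    "MaxDiameter": [1.8, 2.4, np.nan, 3.1],
    "PathologyType": ["IDC", "IDC", np.nan, "Other"],
})
X = preprocessor.fit_transform(df)  # scaled numeric columns + dummy variables
```

Setting `sparse_threshold=0.0` forces a dense NumPy array, so downstream estimators and NaN checks behave uniformly regardless of how many dummy columns the encoder produces.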
This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and was approved by the Institutional Review Board of Huaihe Hospital of Henan University (No. 2025130). The requirement for informed consent was waived due to the retrospective nature of the analysis and the use of deidentified data.
Experimental model pipeline
The entire experimental pipeline was implemented in Python version 3.9 (Python Software Foundation, Wilmington, DE, USA). Specifically, data manipulation and preprocessing were conducted with the pandas and NumPy libraries. Model training, hyperparameter tuning, and performance evaluation were performed with the scikit-learn library, while the gradient boosting architecture was implemented with the xgboost library. XAI analyses were executed with the Shapley additive explanations (SHAP) package. All models were trained and tuned on the full training dataset and evaluated on the held-out test dataset.
Class imbalance
The target variable was imbalanced in the training set (positive class prevalence ~27.7%). We addressed this by applying a balancing weight to the loss function calculation, inversely proportional to class frequencies (21,22). This weight (~2.61 for the positive class) was applied to the XGBoost model and to the LR models.
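As a quick sanity check on the quoted weight: the inverse-frequency weight for the positive class equals the negative-to-positive ratio, which is the convention used by XGBoost’s `scale_pos_weight`. This is a sketch, not the study’s code:

```python
# With a positive-class prevalence of ~27.7%, the inverse-frequency weight
# for the positive class is the negative:positive ratio.
prevalence = 0.277
pos_weight = (1.0 - prevalence) / prevalence  # ~2.61, as quoted above
# In xgboost this is passed as scale_pos_weight; scikit-learn's
# class_weight="balanced" uses the related n_samples / (n_classes * class_count)
# formula, which rescales both classes rather than only the positive one.
print(round(pos_weight, 2))
```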
Model 1: LR (baseline)
A standard L2-penalized LR model was used as the interpretable baseline. The model predicts the probability of the positive class via the sigmoid function as follows:

$$P(y_i = 1 \mid \mathbf{x}_i) = \sigma\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x}_i + b)}}$$
The model was optimized by minimizing the L2-regularized binary cross-entropy cost function, with the hyperparameter C (inverse of regularization strength) being tuned using fivefold cross-validation grid search. The model with the highest mean cross-validation AUC was selected.
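A minimal sketch of this tuning procedure, run on synthetic data since the clinical dataset is not public; the grid values for C are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the clinical training set (~72% negative class)
X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

grid = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # inverse regularization strength
    scoring="roc_auc",
    cv=5,  # fivefold cross-validation; best model chosen by mean CV AUC
)
grid.fit(X, y)
best_lr = grid.best_estimator_
```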
Model 2: XGBoost (state-of-the-art baseline)
We used an XGBoost classifier (10) as the state-of-the-art baseline for tabular data. XGBoost is an ensemble of $K$ decision trees, $f_k \in \mathcal{F}$, in which the final prediction for a patient $i$ is the sum of the predictions from all trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}$$

The model is trained by sequentially adding trees that minimize a regularized objective function, as follows:

$$\mathcal{L} = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the loss function (log loss for classification) and $\Omega$ is a regularization term that penalizes tree complexity. We performed an extensive fivefold cross-validation grid search to optimize key hyperparameters, including the number of estimators, maximum tree depth, learning rate, and subsample ratios. The best-performing hyperparameter set (yielding a cross-validation AUC of 0.781) was selected for the final model.
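A sketch of the grid search over the named hyperparameters. Here scikit-learn’s GradientBoostingClassifier stands in for xgboost.XGBClassifier to keep the example dependency-free, the grid values are illustrative, and the resulting CV score will not match the paper’s 0.781:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the clinical training set
X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],  # number of trees K
        "max_depth": [2, 3],        # maximum tree depth
        "learning_rate": [0.1],
        "subsample": [0.8, 1.0],    # row subsample ratio
    },
    scoring="roc_auc",
    cv=5,  # fivefold cross-validation
)
grid.fit(X, y)
```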
Model 3: Stacking ensemble (proposed model)
Our final proposed model was a two-stage weighted ensemble classifier (14). This architecture incorporated the previously optimized LR and XGBoost models as first-level base learners. The predictions of these base learners on the training data were generated with an internal fivefold cross-validation scheme to prevent data leakage. These “out-of-fold” predictions were then used as a new feature set to train a second-level meta-learner, which was a separate, L2-penalized LR classifier. This meta-learner’s regularization strength was optimized via a threefold cross-validation grid search. The best-performing ensemble (cross-validation AUC 0.787) was selected as our final model.
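The two-stage architecture maps directly onto scikit-learn’s StackingClassifier, shown here on synthetic data with untuned base learners (GradientBoostingClassifier again stands in for XGBoost, so this is a structural sketch rather than the study’s model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(penalty="l2"),  # L2-penalized meta-learner
    cv=5,  # out-of-fold base-learner predictions prevent data leakage
    stack_method="predict_proba",
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]  # ensemble probability of positive ALN status
```

With `cv=5`, the meta-learner is trained only on out-of-fold base predictions, which is exactly the leakage-prevention scheme described above.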
Statistical and explainability analysis
All performance analyses were conducted on the held-out test set. We calculated the AUC, area under the precision-recall curve (AUPRC), accuracy, precision, recall (sensitivity), F1-score, and specificity.
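These metrics can all be computed with scikit-learn except specificity, which is derived from the confusion matrix; a sketch on toy labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.3, 0.8, 0.6, 0.2, 0.7, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

auc = roc_auc_score(y_true, y_prob)
auprc = average_precision_score(y_true, y_prob)  # AUPRC
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)               # sensitivity
f1 = f1_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
spec = tn / (tn + fp)                            # specificity
```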
Calibration
Model calibration was assessed by plotting the mean predicted probability against the observed fraction of positive cases across 10 decile bins, with a diagonal line representing perfect calibration.
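A sketch of this assessment using `sklearn.calibration.calibration_curve` with 10 uniform bins, on synthetic probabilities that are well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=500)                         # predicted probabilities
y_true = (rng.uniform(size=500) < y_prob).astype(int)  # calibrated by construction

# frac_pos: observed fraction of positives per bin; mean_pred: mean predicted
# probability per bin. Points near the diagonal indicate good calibration.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```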
Decision boundaries
To visualize model logic, we first applied principal component analysis to reduce the training data dimensionality to two components. The three models were then retrained on this two-dimensional (2D) data. A mesh grid was created over the 2D space, and the prediction function of each model was used to generate a decision contour, which was overlaid on a scatter plot of the 2D training data.
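The procedure can be sketched as follows (plotting omitted); a single LogisticRegression stands in for each of the three models, and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

X2 = PCA(n_components=2).fit_transform(X)  # reduce training data to 2 components
model = LogisticRegression().fit(X2, y)    # retrain the model on the 2D data

# Mesh grid over the 2D space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
    np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200),
)
# Decision surface to overlay as a contour on the 2D scatter plot
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
```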
Ablation studies
We performed two ablation studies on the weighted ensemble model. The component ablation was conducted by comparing the final weighted ensemble model’s performance to that of its individual base learners (Table 1). The feature-group ablation was conducted through a dynamic identification of feature subsets based on column name semantics (“clinical” vs. “pathology”). The weighted ensemble model architecture was reinstantiated and retrained de novo on these feature subsets, and the performance was evaluated on the corresponding subsets of the test data (Table 2).
Table 1
| Model | AUC | AUPRC | Accuracy | Precision | Recall (sensitivity) | F1-score | Specificity |
|---|---|---|---|---|---|---|---|
| LR (tuned) | 0.741 | 0.477 | 0.576 | 0.365 | 0.760 | 0.494 | 0.507 |
| XGB (tuned) | 0.752 | 0.505 | 0.674 | 0.444 | 0.800 | 0.571 | 0.627 |
| Weighted ensemble (LR + XGB) | 0.762 | 0.575 | 0.598 | 0.385 | 0.800 | 0.519 | 0.522 |
Performance metrics for the three optimized models. The proposed weighted ensemble (LR + XGB) model achieved the highest AUC and AUPRC. The italicized values indicate the best-performing model for the given metric. AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; LR, logistic regression; XGB, extreme gradient boosting.
Table 2
| Model (feature set) | AUC | AUPRC | Accuracy | Precision | Recall (sensitivity) | F1-score | Specificity |
|---|---|---|---|---|---|---|---|
| Clinical features only | 0.690 | 0.432 | 0.565 | 0.347 | 0.680 | 0.459 | 0.522 |
| Pathological features only | 0.667 | 0.517 | 0.489 | 0.323 | 0.800 | 0.460 | 0.373 |
| All features (full model) | 0.765 | 0.596 | 0.609 | 0.392 | 0.800 | 0.526 | 0.537 |
Performance of the weighted ensemble model when retrained with only the subsets of features. The “all features” model substantially outperformed the models trained on either subset alone, highlighting the synergistic value of integrating both clinical and pathological data. AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
Explainability (XAI)
Feature importance for the LR model was derived from the absolute value of its learned coefficients. For XGBoost, it was derived from the gain metric. For the final weighted ensemble model, we employed the SHAP methodology (23). SHAP assigns each feature an importance value for a given prediction based on game-theoretic principles. This value is the feature’s marginal contribution averaged across all possible feature subsets (coalitions) $S$:

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{j\}}\left(\mathbf{x}_{S \cup \{j\}}\right) - f_S\left(\mathbf{x}_S\right) \right]$$

where $F$ is the set of all features, and $f_S(\mathbf{x}_S)$ is the model’s prediction using only the features in subset $S$. We used a model-agnostic kernel-based explainer initialized with a 100-sample background dataset drawn from the training data. SHAP values were then calculated for all test set samples to generate global summary plots, local waterfall plots for individual patients, and dependence plots to visualize feature interactions.
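To make the coalition formula concrete, the following sketch computes exact Shapley values by brute force for a toy three-feature linear model, replacing absent features with background values (the same idea the kernel-based explainer approximates); the model, instance, and background are all illustrative:

```python
import numpy as np
from itertools import combinations
from math import factorial

background = np.array([0.0, 0.0, 0.0])  # background (expected) feature values
x = np.array([1.0, 2.0, -1.0])          # instance to explain
w = np.array([0.5, -0.3, 0.8])          # toy linear model f(x) = w @ x

def f_subset(S):
    """Model output with features outside coalition S set to background values."""
    z = background.copy()
    z[list(S)] = x[list(S)]
    return float(w @ z)

n = len(x)
phi = np.zeros(n)
for j in range(n):
    others = [k for k in range(n) if k != j]
    for size in range(n):
        for S in combinations(others, size):
            # Shapley coalition weight |S|! (|F| - |S| - 1)! / |F|!
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            # marginal contribution of feature j to coalition S
            phi[j] += weight * (f_subset(S + (j,)) - f_subset(S))

# Efficiency property: the contributions sum to f(x) minus the baseline prediction
assert np.isclose(phi.sum(), f_subset(range(n)) - f_subset(()))
```

For a linear model, each Shapley value reduces to the coefficient times the feature’s deviation from background, which makes the brute-force result easy to verify by hand.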
Results
Dataset stratification and feature distribution
The study cohort (n=915) was divided into a training set (n=732) and a held-out test set (n=183) via a stratified random split based on the target variable (ALN status). This approach ensured that the prevalence of ALN metastasis was proportionally consistent (~27.7%) across both cohorts. Because stringent data de-identification protocols were implemented prior to modeling, standardized processed features were employed. To assess potential dependencies among the clinicopathological variables, we performed a Pearson correlation analysis (Figure S1); we observed expected clinical correlations (e.g., between ER and PR status), and, importantly, no features exhibited prohibitive multicollinearity (correlation coefficients <0.8), confirming the suitability of the selected features for the weighted ensemble framework. To transparently verify the integrity of the data split, we visualized the comparative distribution of all input features across the training and test cohorts (Figure S2). The comparative boxplots and proportional bar charts demonstrated that the feature distributions were well balanced between the two sets, preventing distribution shift bias during model evaluation.
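The stratified split can be sketched as follows, using a synthetic binary target with ~27.7% prevalence (the actual patient data are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.uniform(size=915) < 0.277).astype(int)  # synthetic ALN status
X = rng.normal(size=(915, 12))                   # synthetic feature matrix

# Stratification on y keeps the positive-class prevalence nearly identical
# in the training (n=732) and held-out test (n=183) cohorts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=183, stratify=y, random_state=0
)
```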
Synergy and superior performance of the weighted ensemble model
Our primary objective was to determine the optimal modeling strategy for predicting ALN metastasis from tabular data. We first compared the diagnostic performance of three optimized models on the held-out test cohort: the baseline LR, the state-of-the-art tree-based XGBoost model, and our proposed weighted ensemble (LR + XGB) model.
The weighted ensemble model achieved the highest overall diagnostic discrimination, with an AUC of 0.762. As detailed in Table 1, this result also served as a model component ablation, confirming our H1 and H2. The full weighted ensemble model (AUC =0.762) synergistically outperformed its individual, fully optimized base learners: the LR model (AUC =0.741) and the XGBoost model (AUC =0.752). This indicated that the meta-learner successfully synthesized the predictions of the linear and nonlinear base models to achieve a superior performance gain.
A similar pattern was observed for the AUPRC, a metric particularly suited to imbalanced datasets. The weighted ensemble model (AUPRC =0.575) demonstrated marked superiority over both XGBoost (AUPRC =0.505) and LR (AUPRC =0.477), suggesting a more stable performance profile across this imbalanced cohort. Analysis of other performance metrics revealed important clinical tradeoffs. The XGBoost model, driven by its high specificity (0.627), yielded the highest F1-score (0.571) and accuracy (0.674). Critically, however, both the weighted ensemble model and the XGBoost model achieved a higher recall (sensitivity) value (0.800) than did the LR model (0.760). The superior discriminative power of the ensemble model is visually presented in the AUCs and AUPRCs in Figure 1A. We further assessed model reliability via calibration curves (Figure 1B). The error distribution across the three optimized models for the test cohort was further visualized via confusion matrices (Figure 2A), which confirmed that, despite some variability, all models consistently provided high recall. Furthermore, although all models exhibited miscalibration to some extent (lying slightly below the “perfectly calibrated” diagonal), indicating a mild tendency to overestimate risk, the LR and weighted ensemble models yielded a calibration slope closer to 1.0. This suggests that their raw predicted probabilities are more reliable and more closely aligned with observed event frequencies than are those from the standalone XGBoost model.
Data synergy confirmed through feature-group ablation
Having established the superiority of the ensemble architecture, we next investigated the contribution of the input data streams themselves. We performed feature-group ablation by retraining the final weighted ensemble model on distinct, mutually exclusive feature subsets: clinical features only (e.g., “age” and “maximum tumor diameter”) and pathological features only (e.g., ER expression, Ki-67 proliferation index, and histological grade).
The results, as presented in Table 2, demonstrate the critical value of data integration. The model trained on all features (AUC =0.765) substantially outperformed the models trained on clinical features only (AUC =0.690) or pathological features only (AUC =0.667). This result strongly suggests that both clinical and pathological data streams provide unique, complementary predictive information. Neither feature set alone is sufficient, and thus integrating them is essential for maximizing accuracy. This finding is presented visually in the ablation bar plot in Figure 2B.
Model interpretability and clinically driven logic
To understand the differences in the model’s logic qualitatively, we visualized the decision boundaries after retraining the models on the first two principal components of the training data (Figure 3). This visualization highlighted the fundamental differences in their mechanisms. The LR model (Figure 3A) learns a single, linear boundary, cleanly separating the two classes but failing to capture local complexities. The XGBoost model (Figure 3B) creates a complex, nonlinear, axis-parallel checkerboard pattern, effectively isolating small clusters of positive cases. The weighted ensemble model (Figure 3C) generates a hybrid boundary, capturing the nonlinear clusters identified by XGBoost but smoothing them into a more generalized and robust separation, which likely accounts for its superior and more stable performance.
A primary objective of this study was to ensure model transparency. We first compared the global feature contributions of the two base learners (Figure 4). For the XGBoost model, the gain metric identified Ki-67 proliferation index, maximum tumor diameter, and histological grade as the top three predictive features (Figure 4A). For the LR model, the absolute magnitude of its coefficients (representing the contribution to log odds) indicated a similar, albeit differently ranked, set of features, with lymphovascular invasion (“CatLVIYes”) and Ki-67 proliferation index being the most impactful (Figure 4B). This strong congruence for the most important features (Ki-67 proliferation index, maximum tumor diameter, lymphovascular invasion, and histological grade) between the two architecturally distinct models supports the biological validity of the underlying signals being captured.
We then applied SHAP, a game-theoretic XAI method, to the final weighted ensemble model to fully analyze its behavior in the test set. The SHAP summary plot in Figure 5A provides a global view of feature impact and directionality. It shows that “NumMaxDiameter” has the largest mean impact on model output, followed by “CatPathologyTypeIDC” and “CatTICPlateau”, and that the high values (red dots) of these features (e.g., large tumor diameter and IDC pathology type) strongly push the model’s prediction toward a positive ALN status (positive SHAP value). Notably, while Ki-67 proliferation index was the dominant feature in the standalone XGBoost model (Figure 4A), the ensemble model’s SHAP analysis relegates it to a secondary role, suggesting that the meta-learner re-weights feature contributions to prioritize morphological and hemodynamic indicators. Conversely, high PR expression and high ER expression are associated with negative predictions (negative SHAP values, represented as blue dots in Figure 5A), a fact consistent with their known role as favorable prognostic markers.
Regarding feature interactions, the SHAP dependence plot for “CatLocationNan” in Figure 5B provides further insight into model behavior. The plot reveals that the missingness pattern of the tumor location variable carries predictive information within the ensemble framework. The vertical color dispersion, representing interaction with “CatPathologyTypeOther”, demonstrates that this effect is further modulated by pathology type, highlighting the model’s capacity to capture complex, clinically relevant feature interactions beyond simple univariate associations.
Finally, waterfall plots were drawn to demonstrate clinical utility at the patient level. The SHAP waterfall plot in Figure 5C depicts a single prediction for an example patient from the test set. The model’s baseline prediction (E[f(x)] =0.497) is adjusted by the impact of each feature. For this patient, HER2 2+ status, a plateau TIC pattern, and IDC pathology type were the primary factors increasing the predicted risk, while HER2 1+ status partially mitigated the risk, resulting in a final model output (f[x] =0.527) above the baseline, indicating a positive prediction. This granular, patient-specific attribution provides a transparent and interpretable basis for clinical review, directly addressing the challenge of black-box medical practice.
Representative clinical case
To demonstrate the real-world manifestation of the model-identified features, we present a representative clinical case (Figure 6). This case was not included in model training or testing.
Clinical presentation and pathological profile
A 53-year-old postmenopausal woman had invasive breast carcinoma (nonspecial type) and presented the following high-risk features, as indicated by our SHAP analysis: a Ki-67 proliferation index of 30% (a recognized risk factor for ALN metastasis), histological grade II, a tumor diameter of 1.8 cm, minimal hormone receptor expression (ER 3% and PR-negative), and HER2 3+ overexpression (HER2-enriched subtype). Lymphovascular invasion was not definitively assessed on core biopsy. Subsequent axillary evaluation confirmed metastatic carcinoma in the ALNs (pathologically node-positive). This outcome supports the risk stratification capacity of the model-prioritized features. Each of these parameters—Ki-67, tumor size, histological grade, and receptor status—contributed to an elevated predicted risk according to the established clinical evidence and the model’s feature importance analysis (Figures 4,5A), with their combined effect correctly predicting nodal involvement.
The findings on contrast-enhanced breast MRI, including irregular margins, heterogeneous enhancement, and rapid initial contrast uptake, were consistent with the aggressive clinicopathological profile (Figure 6A,6B). These radiological findings independently corroborated the high-risk molecular phenotype, demonstrating that routine clinicopathological variables effectively capture the underlying tumor biology that also manifests across imaging modalities.
Discussion
In this study, we demonstrated that a two-stage weighted ensemble model, trained exclusively on 12 routine clinicopathological variables, can accurately and interpretably predict ALN metastasis. Our primary finding is that this interpretable framework, which strategically integrates a linear model with a nonlinear gradient boosting model, achieved diagnostic performance (AUC =0.762) superior to both other well-tuned baseline models. The performance of this hybrid approach aligns with recent evidence demonstrating that gradient boosting and stacking ensembles provide robust diagnostic performance across a variety of medical tabular datasets (24,25). This result, combined with the observation from the comprehensive XAI analysis, provides a powerful, transparent, and practical tool for noninvasive ALN risk stratification.
Our findings constitute strong empirical support for the use of our three-pillar theoretical framework. The weighted ensemble model (AUC =0.762) surpassed both its linear (AUC =0.741) and nonlinear (AUC =0.752) base learners, validating H1 and H2. This confirms that while nonlinear interactions are critical for accurate ALN metastasis prediction from clinicopathological tabular data—as evidenced by the inferior performance of the purely linear LR model (AUC =0.741) compared to the nonlinear XGBoost (AUC =0.752)—a simple tree-based model may overfit to nonlinear noise. The ensemble succeeds by leveraging the meta-learner to balance the robust, nonlinear predictions of XGBoost with the stable, generalized, linear predictions of LR. This creates a hybrid, smoothed decision boundary (Figure 3C) that generalizes better to unseen data, as evidenced by its superior AUC and AUPRC.
Beyond predictive accuracy, our XAI-driven approach validated the model’s clinical utility. The SHAP analysis (Figure 5A) confirmed H3, demonstrating that the model’s logic is nonspurious and rooted in established pathobiology. Consistent with the quantitative feature importance scores, the highest-ranked predictive features were tumor maximum diameter (“NumMaxDiameter”), the invasive ductal carcinoma (IDC) pathology type (“CatPathologyTypeIDC”), and the TIC plateau pattern (“CatTICPlateau”). These top contributors are well-established, independent risk factors associated with tumor aggressiveness and lymphatic spread. Naturally, larger tumors carry a higher risk of metastasis due to the increased surface area for vascular or lymphatic invasion and their longer evolution time. Furthermore, the prominence of the IDC pathology type and the plateau TIC pattern—often indicative of rapid, sustained tumor angiogenesis—highlights the model’s reliance on core morphological and hemodynamic indicators of malignancy. Conversely, while previous clinical assumptions might heavily emphasize variables such as the Ki-67 proliferation index or histological grade, our SHAP analysis accurately relegates these lower-scoring features to appropriate secondary predictive roles in this specific ensemble framework. Finally, the patient-specific waterfall plot in Figure 5C directly addresses H4 by providing a transparent, patient-specific risk attribution. This moves beyond a single, opaque risk score and offers a “glass box” explanation that clinicians can review, critique, and integrate into their holistic patient assessment, fulfilling a key recommendation from recent XAI reviews (14).
The representative clinical case described above (Figure 6) further illustrates the relevance of our model to the field of quantitative imaging. The high-risk features identified by our weighted ensemble model—specifically high Ki-67 proliferation and high histological grade—manifest radiologically as heterogeneous enhancement and rapid wash-out kinetics (Figure 6). This biological concordance confirms that accessible clinicopathological data can capture the same aggressive tumor phenotypes often targeted by radiomics. By establishing this link, our study provides a validated “clinical ground truth” that supports and complements future quantitative imaging and multimodal fusion research.
Our results contextualize the broader field of artificial intelligence in ALN prediction. Recent studies have further underscored the value of clinicopathological data. Song et al. (12) and Zhang et al. (13) successfully developed predictive models for ALN status. However, these approaches primarily focused on linear associations or standard machine learning. Our approach is distinct by virtue of its weighted ensemble architecture, which synergistically captures the nonlinear feature interactions (as evidenced by our SHAP dependence plots) that simpler models may overlook. Furthermore, we prioritize “white-box” transparency through patient-level SHAP explanations, addressing the interpretability gap often present in previous works. Finally, our framework contributes to the growing body of literature emphasizing that stacking ensembles combined with XAI can bridge the gap between high accuracy and clinical interpretability (26).
We acknowledge, however, that our absolute performance metrics—specifically an accuracy of approximately 0.67 and an F1-score of 0.57—are modest. These metrics reflect the inherent biological limitations of predicting microscopic nodal metastasis through an exclusive reliance on macroscopic, routine clinicopathological tabular data without advanced molecular profiling. Although several studies have reported models with higher AUCs, often in the 0.81–0.86 range (3,5,7), these models frequently rely on deep learning analysis of WSI histopathology slides (3-5), multimodal learning approaches (7,27,28), or advanced MRI sequences (9). These modalities are computationally expensive, require specialized expertise, and are not universally available.
Although our model trades some overall accuracy for these practical benefits, its primary clinical value lies in its high sensitivity (recall = 0.800). In a preoperative screening context, minimizing false negatives (i.e., missed metastases) is paramount. Our model, which achieved a reliable AUC of 0.762 using only 12 standard, tabular data points, represents a highly practical, low-cost, and scalable alternative. It can serve as an interpretable “first-pass” triage tool—demonstrating that clinically meaningful baseline predictions can be achieved with universally available data—and offers an immediately implementable alternative to resource-intensive approaches. Although advanced multiple-instance learning and channel-attention frameworks for WSI analysis offer high performance, their computational complexity remains a hurdle to routine integration compared with our streamlined tabular approach (29,30).
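As a minimal illustration of this kind of weighted probability ensemble, the sketch below blends the predicted probabilities of a linear model and a gradient-boosted tree model and scores the blend by AUC and recall. This is a hedged sketch on synthetic data, not the study's actual pipeline: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the feature count, class balance, and equal 0.5/0.5 weights are illustrative assumptions.

```python
# Illustrative weighted ensemble of a linear model and a boosted-tree model.
# GradientBoostingClassifier stands in for XGBoost; data, weights, and
# feature count are assumptions for illustration, not the study's pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 12 routinely collected clinicopathological features
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

w = 0.5  # ensemble weight on the linear model (assumed; normally tuned)
p = w * lr.predict_proba(X_te)[:, 1] + (1 - w) * gbt.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, p)
# The decision threshold can be lowered below 0.5 to favor sensitivity
recall = recall_score(y_te, p >= 0.5)
print(f"AUC={auc:.3f}, recall={recall:.3f}")
```

In a screening setting, the threshold on the blended probability would be tuned on validation data to reach the desired sensitivity rather than fixed at 0.5.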
Several limitations of this study should be acknowledged. First, we employed a retrospective dataset derived from a single-center data pipeline. The model’s generalizability and performance on external, multicenter cohorts must be confirmed before widespread clinical application. Second, the feature set, while effective, was predefined. Future work could benefit from the use of more granular data, such as specific receptor percentages rather than binarized status, or the inclusion of additional molecular biomarkers. Third, our feature group ablation (Table 2 and Figure 2B) confirmed the value of both clinical and pathological features, but the automatic grouping based on feature semantics is a proxy, and a more rigorous, expert-defined feature grouping could yield deeper insights. Finally, while a kernel-based SHAP explainer is a powerful, model-agnostic tool for interpreting the weighted ensemble, it is computationally intensive and yields only an approximation of the true SHAP values.
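To make this final limitation concrete, the snippet below computes exact Shapley values by enumerating every feature coalition, which requires O(2^d) model evaluations per feature and becomes infeasible as d grows; kernel SHAP instead samples and weights coalitions, trading exactness for tractability. The toy linear model, all-zero background, and three-feature setup here are illustrative assumptions only.

```python
# Exact Shapley values by brute-force coalition enumeration (O(2^d) model
# calls), illustrating why kernel SHAP resorts to a sampled approximation.
# Toy model and background are assumptions, not the study's ensemble.
import itertools
import math
import numpy as np

def exact_shapley(f, x, background, j):
    """Exact Shapley value of feature j for instance x under model f,
    replacing 'absent' features with a background (reference) value."""
    d = len(x)
    others = [k for k in range(d) if k != j]
    phi = 0.0
    for r in range(d):  # coalition sizes 0 .. d-1
        for S in itertools.combinations(others, r):
            z = background.copy()
            z[list(S)] = x[list(S)]          # coalition S present
            without = f(z)
            z[j] = x[j]                      # add feature j
            with_j = f(z)
            weight = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            phi += weight * (with_j - without)
    return phi

# For a linear model, Shapley values recover w_j * (x_j - background_j)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, -2.0])
b = np.zeros(3)

phi = [exact_shapley(f, x, b, j) for j in range(3)]
print(phi)  # approximately w * (x - b)
```

With 12 features the exact sum already needs 2,048 model evaluations per feature per patient, which is why sampled kernel-SHAP estimates were the practical choice.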
Conclusions
This study successfully developed and validated a two-stage weighted ensemble model that accurately and noninvasively predicts ALN metastasis in patients with breast cancer. By synergistically integrating the linear stability of LR with the robust nonlinear feature interaction capture of XGBoost, our model achieved a superior AUC of 0.762 on the test set, outperforming its individual constituent components.
The inclusion of SHAP ensures clinical transparency by providing glass-box interpretability, identifying tumor diameter, IDC pathology type, and TIC patterns as the primary drivers of the model’s risk predictions. Given its high sensitivity of 0.800, this framework serves as a practical, low-cost, and scalable triage tool that can be implemented with universally accessible clinicopathological data. It offers a significant clinical opportunity to safely de-escalate surgical procedures for low-risk patients, thereby sparing them the morbidity associated with invasive axillary staging. Although our findings suggest strong potential, future multicenter, prospective validation remains essential for translating this baseline clinicopathological framework into routine precision oncology practice.
Acknowledgments
We would like to thank the Institutional Review Board of Huaihe Hospital of Henan University for approving this study (approval No. 2025130) and for waiving the requirement for informed consent given the retrospective nature of the analysis and the use of de-identified data. We also thank the clinical and technical staff at the Department of Medical Imaging, Huaihe Hospital of Henan University, and The Third Affiliated Hospital of Zhengzhou University for their professional support in data acquisition and the maintenance of the patient database.
Footnote
Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/dss
Funding: This work was supported by “
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Institutional Review Board of Huaihe Hospital of Henan University (No. 2025130). Informed consent was waived in this retrospective study due to the use of de-identified data.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Giuliano AE, Ballman K, McCall L, Beitsch P, Whitworth PW, Blumencranz P, Leitch AM, Saha S, Morrow M, Hunt KK. Locoregional Recurrence After Sentinel Lymph Node Dissection With or Without Axillary Dissection in Patients With Sentinel Lymph Node Metastases: Long-term Follow-up From the American College of Surgeons Oncology Group (Alliance) ACOSOG Z0011 Randomized Trial. Ann Surg 2016;264:413-20.
- Slamon D, Eiermann W, Robert N, Pienkowski T, Martin M, Press M, et al. Adjuvant trastuzumab in HER2-positive breast cancer. N Engl J Med 2011;365:1273-83. [Crossref] [PubMed]
- Xu F, Zhu C, Tang W, Wang Y, Zhang Y, Li J, Jiang H, Shi Z, Liu J, Jin M. Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides. Front Oncol 2021;11:759007. [Crossref] [PubMed]
- Ilse M, Tomczak JM, Welling M. Attention-based deep multiple instance learning. ICML 2018. Available online: https://proceedings.mlr.press/v80/ilse18a/ilse18a.pdf
- Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, Brogi E, Reuter VE, Klimstra DS, Fuchs TJ. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019;25:1301-9. [Crossref] [PubMed]
- Zheng X, Yao Z, Huang Y, Yu Y, Wang Y, Liu Y, Mao R, Li F, Xiao Y, Wang Y, Hu Y, Yu J, Zhou J. Deep learning radiomics can predict axillary lymph node status in early-stage breast cancer. Nat Commun 2020;11:1236. [Crossref] [PubMed]
- Park D, Lee YM, Eo T, An HJ, Kang H, Park E, Cha YJ, Park H, Kwon D, Kwon SY, Jung HR, Shin SJ, Park H, Lee Y, Park S, Kim JM, Choi SE, Cho NH, Hwang D. Multimodal AI model for preoperative prediction of axillary lymph node metastasis in breast cancer using whole slide images. NPJ Precis Oncol 2025;9:131. [Crossref] [PubMed]
- Windsor GO, Bai H, Lourenco AP, Jiao Z. Application of artificial intelligence in predicting lymph node metastasis in breast cancer. Front Radiol 2023;3:928639. [Crossref] [PubMed]
- Mao N, Dai Y, Lin F, Ma H, Duan S, Xie H, Zhao W, Hong N. Radiomics Nomogram of DCE-MRI for the Prediction of Axillary Lymph Node Metastasis in Breast Cancer. Front Oncol 2020;10:541849. [Crossref] [PubMed]
- Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016:785-94.
- Zheng J, Li J, Zhang Z, Yu Y, Tan J, Liu Y, Gong J, Wang T, Wu X, Guo Z. Clinical Data based XGBoost Algorithm for infection risk prediction of patients with decompensated cirrhosis: a 10-year (2012-2021) Multicenter Retrospective Case-control study. BMC Gastroenterol 2023;23:310. [Crossref] [PubMed]
- Song L, Zhang F, Ma K, Wang B, Zhang T, Sun S. Analysis of factors affecting axillary lymph node metastasis in breast cancer and the establishment and validation of a predictive model. Sci Rep 2025;15:43630. [Crossref] [PubMed]
- Zhang X, Zhang C, Zhang J, Zhang X, Dou X. Establishment of Prediction Model of Axillary Lymph Node Metastasis Before Operation for Early-Stage Breast Cancer. Cancer Control 2025;32:10732748251363328. [Crossref] [PubMed]
- Wolpert DH. Stacked generalization. Neural Networks 1992;5:241-59.
- Adekoya A, Saeed F, Ghaban W, Qasem SN. Ensemble learning approach with explainable AI for improved heart disease prediction. Front Pharmacol 2025;16:1654681. [Crossref] [PubMed]
- Kablan R, Miller HA, Suliman S, Frieboes HB. Evaluation of stacked ensemble model performance to predict clinical outcomes: A COVID-19 study. Int J Med Inform 2023;175:105090. [Crossref] [PubMed]
- Ponce-Bobadilla AV, Schmitt V, Maier CS, Mensing S, Stodtmann S. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin Transl Sci 2024;17:e70056. [Crossref] [PubMed]
- Salih AM, Raisi-Estabragh Z, Galazzo IB, Radeva P, Petersen SE, Lekadir K, Menegaz G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. Adv Intell Syst 2025;7:2400304.
- Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
- Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Cham: Springer; 2019.
- He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21:1263-84.
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 2002;16:321-57.
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NeurIPS); 2017. doi:10.5555/3295222.3295230.
- Dinh A, Miertschin S, Young A, Mohanty SD. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med Inform Decis Mak 2019;19:211. [Crossref] [PubMed]
- Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems 2022;35:507-20.
- Sultan SQ, Javaid N, Alrajeh N, Aslam M. Machine Learning-Based Stacking Ensemble Model for Prediction of Heart Disease with Explainable AI and K-Fold Cross-Validation: A Symmetric Approach. Symmetry 2025;17:185.
- Guo F, Sun S, Deng X, Wang Y, Yao W, Yue P, Wu S, Yan J, Zhang X, Zhang Y. Predicting axillary lymph node metastasis in breast cancer using a multimodal radiomics and deep learning model. Front Immunol 2024;15:1482020. [Crossref] [PubMed]
- Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, Walliander M, Lundin M, Haglund C, Lundin J. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 2018;8:3395. [Crossref] [PubMed]
- Afonso M, Bhawsar PMS, Saha M, Almeida JS, Oliveira AL. Multiple Instance Learning for WSI: A comparative analysis of attention-based approaches. J Pathol Inform 2024;15:100403. [Crossref] [PubMed]
- Mao J, Xu J, Tang X, Liu Y, Zhao H, Tian G, Yang J. CAMIL: channel attention-based multiple instance learning for whole slide image classification. Bioinformatics 2025;41:btaf024. [Crossref] [PubMed]