An interpretable weighted ensemble based on routinely collected clinical data for the accurate prediction of axillary lymph node metastasis
Introduction
The management of the axilla in patients with early-stage breast cancer remains a critical and evolving clinical challenge. Axillary lymph node (ALN) status is one of the most powerful independent prognostic factors for patients with breast cancer and a cornerstone of oncological staging that informs adjuvant systemic therapy, radiotherapy planning, and the extent of surgery (1,2). Traditionally, ALN dissection (ALND) has been relied upon to provide definitive staging but is associated with debilitating long-term morbidity, most notably lymphedema, chronic pain, and sensory deficits. The advent of sentinel lymph node biopsy (SLNB) represented a major paradigm shift, substantially reducing this morbidity. However, SLNB is itself an invasive procedure, and a large proportion of patients subjected to it are found to be node-negative. Consequently, there is an urgent and unmet need for accurate, noninvasive, preoperative models that can reliably identify patients with a low probability of ALN metastasis, which would safely de-escalate surgery and spare a significant portion of this population from any invasive axillary procedure.
Recent research efforts toward achieving this goal have largely focused on computationally intensive modalities, particularly the deep learning-based analysis of whole-slide images (WSIs) from biopsies (3-5). These attention-based multiple-instance learning (AMIL) frameworks have set high benchmarks, achieving areas under the curve (AUCs) in the range of 0.81 to 0.86 (3,6). This has been improved upon through the use of advanced multimodal learning approaches that fuse WSI-derived features with clinical information (7,8) or advanced deep learning architectures in histopathology analysis (4,5). Other work has investigated the use of advanced imaging modalities, such as the development of deep learning radiomics for magnetic resonance imaging (MRI) data (9). This focus on high-complexity, high-cost data modalities, however, requires a sophisticated informatics infrastructure, slide digitization, and extensive pathologist annotation, limiting their widespread clinical adoption.
Despite these advances, the vast majority of clinical decisions globally are made with the most ubiquitous, standardized, and cost-effective data source: structured, tabular clinicopathological records. This dataset includes parameters such as patient age, tumor size, histological grade, and biomarker status [e.g., estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki-67]. Some studies have compared standard machine learning algorithms, such as extreme gradient boosting (XGBoost) and logistic regression (LR), for this type of data (10,11), and recent work by Song et al. (12) and Zhang et al. (13) has established predictive models for cohorts of breast cancer patients with similar clinicopathological characteristics, while others have developed ensemble learning approaches (14-16). Nonetheless, a systematic framework for this widely available tabular clinicopathological data modality is urgently needed. Furthermore, predictive accuracy alone is insufficient for clinical adoption. The “black-box” nature of complex models is a major, well-documented barrier to ensuring clinical trust and implementation (17,18).
Crucially, in the context of quantitative imaging, establishing a robust baseline with standardized clinicopathological data is a prerequisite for evaluating the true incremental value of advanced imaging markers. A transparent, high-performance clinical model can serve not only as a standalone screening tool but also as a necessary “clinical benchmark” against which complex radiomic models can be compared. Furthermore, understanding the biological consistency between routine clinical risk factors and imaging phenotypes is essential for developing effective multimodal fusion models.
To address these issues, we developed, validated, and comprehensively interpreted a two-stage weighted ensemble model for ALN metastasis prediction based only on preprocessed, tabular clinicopathological data. Our approach is guided by a theoretical framework that holds the following assumptions: (I) superior prediction requires capturing the complex, nonlinear interactions inherent in these data; (II) model validation demands rigorous benchmarking against both interpretable linear baselines and state-of-the-art nonlinear models; (III) clinical translation is contingent upon explainable artificial intelligence (XAI)-driven transparency. We hypothesized that a weighted ensemble, by integrating the stability of a linear model (through LR) with the robust nonlinear feature-interaction capture of a tree-based model (XGBoost), can achieve predictive performance superior to its constituent components. Specifically, we defined two sub-hypotheses: H1, that the weighted ensemble would achieve a higher AUC on the held-out test set than either individual base learner alone; and H2, that this performance gain would reflect true model synergy—captured by the meta-learner—rather than overfitting to training data. Furthermore, we hypothesized that our XAI-driven approach can validate the model’s clinicopathological reasoning (H3) and provide a transparent, patient-specific attribution framework, thereby fulfilling a key requirement for clinical trust and adoption (H4). We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/rc) (19).
Methods
Study design and theoretical framework
A retrospective cohort study was conducted to develop and validate machine learning models for predicting ALN metastasis. The models were trained and evaluated on a processed, tabular dataset derived from preoperative clinical and pathological patient records. The entire analytical process was guided by a three-pillar theoretical framework established a priori according to established standards for clinical prediction models (20). The tenets of this framework are as follows: (I) predictive models must be capable of capturing complex, nonlinear feature representations beyond simple linear associations; (II) model superiority must be established through rigorous benchmarking in a multi-model comparison paradigm against both classical statistical baselines and state-of-the-art machine learning models; (III) clinical adoption requires high performance to be paired with model transparency, achieved through XAI to build clinical trust, in accordance with established principles for the development of clinical prediction models (20).
Data acquisition and preprocessing
Data from patients with pathologically confirmed ALN status were retrieved from a single-center retrospective database at Huaihe Hospital of Henan University (Kaifeng, China). These data were processed into standardized training and testing cohorts. The raw data underwent a rigorous, multistep preprocessing pipeline to ensure data quality and standardization. This included the consolidation of data from multiple sources, removal of duplicate entries, and harmonization of feature names.
Feature cleaning was extensive. For numerical features (e.g., age, maximum diameter, and ER, PR, and Ki-67 expression), values were standardized; for example, percentage strings were converted to float values (0.0–1.0), and nonnumeric entries or placeholders were coerced to missing values. For categorical and ordinal features [e.g., histological grade, menopause, HER2 expression, location, pathology type, time-intensity curve (TIC), and lymphovascular invasion], values were normalized by mapping text-based variations to a single integer or category and removing extraneous characters. The target variable, ALN, was binarized, with positive status mapped to 1 and negative status to 0.
The final processed dataset included 12 primary features. This dataset was then transformed by a final preprocessing pipeline, which included median imputation for missing numerical values, most-frequent imputation for categorical values, standard scaling (z-score normalization) for all numerical and ordinal features, and one-hot encoding for nominal categorical features. This resulted in a final feature set of 39 dummy variables used for model training.
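The pipeline described above can be sketched with scikit-learn. Note that the column names used here (“Age”, “MaxDiameter”, “PathologyType”) are illustrative assumptions, not the study’s actual feature names, and the toy data stand in for the clinical records:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Age", "MaxDiameter"]  # numerical/ordinal features (illustrative)
nominal_cols = ["PathologyType"]       # nominal categorical features (illustrative)

preprocessor = ColumnTransformer(
    [
        # median imputation + z-score scaling for numerical/ordinal features
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # most-frequent imputation + one-hot (dummy) encoding for nominal features
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), nominal_cols),
    ],
    sparse_threshold=0.0,  # force a dense output array
)

df = pd.DataFrame({
    "Age": [53, np.nan, 47, 61],
    "MaxDiameter": [1.8, 2.4, np.nan, 3.1],
    "PathologyType": ["IDC", "IDC", np.nan, "Other"],
})
X = preprocessor.fit_transform(df)  # scaled numeric columns + dummy variables
```

Setting `sparse_threshold=0.0` forces a dense NumPy array, so downstream estimators and NaN checks behave uniformly regardless of how many dummy columns the encoder produces.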
This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and was approved by the Institutional Review Board of Huaihe Hospital of Henan University (No. 2025130). The requirement for informed consent was waived due to the retrospective nature of the analysis and the use of deidentified data.
Experimental model pipeline
The entire experimental pipeline was implemented in Python version 3.9 (Python Software Foundation, Wilmington, DE, USA). Specifically, data manipulation and preprocessing were conducted with the pandas and NumPy libraries. Model training, hyperparameter tuning, and performance evaluation were performed with the scikit-learn library, while the gradient boosting architecture was implemented with the xgboost library. XAI analyses were executed with the Shapley additive explanations (SHAP) package. All models were trained and tuned on the full training dataset and evaluated on the held-out test dataset.
Class imbalance
The target variable was imbalanced in the training set (positive class prevalence ~27.7%). We addressed this by applying a balancing weight to the loss function calculation, inversely proportional to class frequencies (21,22). This weight (~2.61 for the positive class) was applied to the XGBoost model and to the LR models.
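As a quick sanity check on the quoted weight: the inverse-frequency weight for the positive class equals the negative-to-positive ratio, which is the convention used by XGBoost’s `scale_pos_weight`. This is a sketch, not the study’s code:

```python
# With a positive-class prevalence of ~27.7%, the inverse-frequency weight
# for the positive class is the negative:positive ratio.
prevalence = 0.277
pos_weight = (1.0 - prevalence) / prevalence  # ~2.61, as quoted above
# In xgboost this is passed as scale_pos_weight; scikit-learn's
# class_weight="balanced" uses the related n_samples / (n_classes * class_count)
# formula, which rescales both classes rather than only the positive one.
print(round(pos_weight, 2))
```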
Model 1: LR (baseline)
A standard L2-penalized LR model was used as the interpretable baseline. The model predicts the probability of the positive class via the sigmoid function as follows:

$$P(y_i = 1 \mid \mathbf{x}_i) = \sigma\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x}_i + b)}}$$
The model was optimized by minimizing the L2-regularized binary cross-entropy cost function, with the hyperparameter C (inverse of regularization strength) being tuned using fivefold cross-validation grid search. The model with the highest mean cross-validation AUC was selected.
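A minimal sketch of this tuning procedure, run on synthetic data since the clinical dataset is not public; the grid values for C are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the clinical training set (~72% negative class)
X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

grid = GridSearchCV(
    LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # inverse regularization strength
    scoring="roc_auc",
    cv=5,  # fivefold cross-validation; best model chosen by mean CV AUC
)
grid.fit(X, y)
best_lr = grid.best_estimator_
```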
Model 2: XGBoost (state-of-the-art baseline)
We used an XGBoost classifier (10) as the state-of-the-art baseline for tabular data. XGBoost is an ensemble of $K$ decision trees, $f_k \in \mathcal{F}$, in which the final prediction for a patient $i$ is the sum of the predictions from all trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}$$

The model is trained by sequentially adding trees that minimize a regularized objective function, as follows:

$$\mathcal{L} = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the loss function (log loss for classification) and $\Omega$ is a regularization term that penalizes tree complexity. We performed an extensive fivefold cross-validation grid search to optimize key hyperparameters, including the number of estimators, maximum tree depth, learning rate, and subsample ratios. The best-performing hyperparameter set (yielding a cross-validation AUC of 0.781) was selected for the final model.
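A sketch of the grid search over the named hyperparameters. Here scikit-learn’s GradientBoostingClassifier stands in for xgboost.XGBClassifier to keep the example dependency-free, the grid values are illustrative, and the resulting CV score will not match the paper’s 0.781:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the clinical training set
X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [50, 100],  # number of trees K
        "max_depth": [2, 3],        # maximum tree depth
        "learning_rate": [0.1],
        "subsample": [0.8, 1.0],    # row subsample ratio
    },
    scoring="roc_auc",
    cv=5,  # fivefold cross-validation
)
grid.fit(X, y)
```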
Model 3: Stacking ensemble (proposed model)
Our final proposed model was a two-stage weighted ensemble classifier (14). This architecture incorporated the previously optimized LR and XGBoost models as first-level base learners. The predictions of these base learners on the training data were generated with an internal fivefold cross-validation scheme to prevent data leakage. These “out-of-fold” predictions were then used as a new feature set to train a second-level meta-learner, which was a separate, L2-penalized LR classifier. This meta-learner’s regularization strength was optimized via a threefold cross-validation grid search. The best-performing ensemble (cross-validation AUC 0.787) was selected as our final model.
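The two-stage architecture maps directly onto scikit-learn’s StackingClassifier, shown here on synthetic data with untuned base learners (GradientBoostingClassifier again stands in for XGBoost, so this is a structural sketch rather than the study’s model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=12, weights=[0.72],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(penalty="l2"),  # L2-penalized meta-learner
    cv=5,  # out-of-fold base-learner predictions prevent data leakage
    stack_method="predict_proba",
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]  # ensemble probability of positive ALN status
```

With `cv=5`, the meta-learner is trained only on out-of-fold base predictions, which is exactly the leakage-prevention scheme described above.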
Statistical and explainability analysis
All performance analyses were conducted on the held-out test set. We calculated the AUC, area under the precision-recall curve (AUPRC), accuracy, precision, recall (sensitivity), F1-score, and specificity.
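These metrics can all be computed with scikit-learn except specificity, which is derived from the confusion matrix; a sketch on toy labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.3, 0.8, 0.6, 0.2, 0.7, 0.55])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

auc = roc_auc_score(y_true, y_prob)
auprc = average_precision_score(y_true, y_prob)  # AUPRC
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)               # sensitivity
f1 = f1_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
spec = tn / (tn + fp)                            # specificity
```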
Calibration
Model calibration was assessed by plotting the mean predicted probability against the observed fraction of positive cases across 10 decile bins, with a diagonal line representing perfect calibration.
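A sketch of this assessment using `sklearn.calibration.calibration_curve` with 10 uniform bins, on synthetic probabilities that are well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=500)                         # predicted probabilities
y_true = (rng.uniform(size=500) < y_prob).astype(int)  # calibrated by construction

# frac_pos: observed fraction of positives per bin; mean_pred: mean predicted
# probability per bin. Points near the diagonal indicate good calibration.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
```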
Decision boundaries
To visualize model logic, we first applied principal component analysis to reduce the training data dimensionality to two components. The three models were then retrained on this two-dimensional (2D) data. A mesh grid was created over the 2D space, and the prediction function of each model was used to generate a decision contour, which was overlaid on a scatter plot of the 2D training data.
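The procedure can be sketched as follows (plotting omitted); a single LogisticRegression stands in for each of the three models, and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

X2 = PCA(n_components=2).fit_transform(X)  # reduce training data to 2 components
model = LogisticRegression().fit(X2, y)    # retrain the model on the 2D data

# Mesh grid over the 2D space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
    np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200),
)
# Decision surface to overlay as a contour on the 2D scatter plot
Z = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)
```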
Ablation studies
We performed two ablation studies on the weighted ensemble model. The component ablation was conducted by comparing the final weighted ensemble model’s performance to that of its individual base learners (Table 1). The feature-group ablation was conducted through a dynamic identification of feature subsets based on column name semantics (“clinical” vs. “pathology”). The weighted ensemble model architecture was reinstantiated and retrained de novo on these feature subsets, and the performance was evaluated on the corresponding subsets of the test data (Table 2).
Table 1
| Model | AUC | AUPRC | Accuracy | Precision | Recall (sensitivity) | F1-score | Specificity |
|---|---|---|---|---|---|---|---|
| LR (tuned) | 0.741 | 0.477 | 0.576 | 0.365 | 0.760 | 0.494 | 0.507 |
| XGB (tuned) | 0.752 | 0.505 | 0.674 | 0.444 | 0.800 | 0.571 | 0.627 |
| Weighted ensemble (LR + XGB) | 0.762 | 0.575 | 0.598 | 0.385 | 0.800 | 0.519 | 0.522 |
Performance metrics for the three optimized models. The proposed weighted ensemble (LR + XGB) model achieved the highest AUC and AUPRC. The italicized values indicate the best-performing model for the given metric. AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; LR, logistic regression; XGB, extreme gradient boosting.
Table 2
| Model (feature set) | AUC | AUPRC | Accuracy | Precision | Recall (sensitivity) | F1-score | Specificity |
|---|---|---|---|---|---|---|---|
| Clinical features only | 0.690 | 0.432 | 0.565 | 0.347 | 0.680 | 0.459 | 0.522 |
| Pathological features only | 0.667 | 0.517 | 0.489 | 0.323 | 0.800 | 0.460 | 0.373 |
| All features (full model) | 0.765 | 0.596 | 0.609 | 0.392 | 0.800 | 0.526 | 0.537 |
Performance of the weighted ensemble model when retrained with only the subsets of features. The “all features” model substantially outperformed the models trained on either subset alone, highlighting the synergistic value of integrating both clinical and pathological data. AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve.
Explainability (XAI)
Feature importance for the LR model was derived from the absolute value of its learned coefficients. For XGBoost, it was derived from the gain metric. For the final weighted ensemble model, we employed the SHAP methodology (23). SHAP assigns each feature an importance value for a given prediction based on game-theoretic principles. This value is the feature’s marginal contribution averaged across all possible feature subsets (coalitions) $S$:

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f_{S \cup \{j\}}\left(\mathbf{x}_{S \cup \{j\}}\right) - f_S\left(\mathbf{x}_S\right) \right]$$

where $F$ is the set of all features, and $f_S(\mathbf{x}_S)$ is the model’s prediction using only the features in subset $S$. We used a model-agnostic kernel-based explainer initialized with a 100-sample background dataset drawn from the training data. SHAP values were then calculated for all test set samples to generate global summary plots, local waterfall plots for individual patients, and dependence plots to visualize feature interactions.
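To make the coalition formula concrete, the following sketch computes exact Shapley values by brute force for a toy three-feature linear model, replacing absent features with background values (the same idea the kernel-based explainer approximates); the model, instance, and background are all illustrative:

```python
import numpy as np
from itertools import combinations
from math import factorial

background = np.array([0.0, 0.0, 0.0])  # background (expected) feature values
x = np.array([1.0, 2.0, -1.0])          # instance to explain
w = np.array([0.5, -0.3, 0.8])          # toy linear model f(x) = w @ x

def f_subset(S):
    """Model output with features outside coalition S set to background values."""
    z = background.copy()
    z[list(S)] = x[list(S)]
    return float(w @ z)

n = len(x)
phi = np.zeros(n)
for j in range(n):
    others = [k for k in range(n) if k != j]
    for size in range(n):
        for S in combinations(others, size):
            # Shapley coalition weight |S|! (|F| - |S| - 1)! / |F|!
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            # marginal contribution of feature j to coalition S
            phi[j] += weight * (f_subset(S + (j,)) - f_subset(S))

# Efficiency property: the contributions sum to f(x) minus the baseline prediction
assert np.isclose(phi.sum(), f_subset(range(n)) - f_subset(()))
```

For a linear model, each Shapley value reduces to the coefficient times the feature’s deviation from background, which makes the brute-force result easy to verify by hand.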
Results
Dataset stratification and feature distribution
The study cohort (n=915) was divided into a training set (n=732) and a held-out test set (n=183) via a stratified random split based on the target variable (ALN status). This approach ensured that the prevalence of ALN metastasis was proportionally consistent (~27.7%) across both cohorts. Because stringent data de-identification protocols were implemented prior to modeling, standardized processed features were employed. To assess potential dependencies among the clinicopathological variables, we performed a Pearson correlation analysis (Figure S1); we observed expected clinical correlations (e.g., between ER and PR status), and, importantly, no features exhibited prohibitive multicollinearity (correlation coefficients <0.8), confirming the suitability of the selected features for the weighted ensemble framework. To transparently verify the integrity of the data split, we visualized the comparative distribution of all input features across the training and test cohorts (Figure S2). The comparative boxplots and proportional bar charts demonstrated that the feature distributions were well balanced between the two sets, preventing distribution shift bias during model evaluation.
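The stratified split can be sketched as follows, using a synthetic binary target with ~27.7% prevalence (the actual patient data are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.uniform(size=915) < 0.277).astype(int)  # synthetic ALN status
X = rng.normal(size=(915, 12))                   # synthetic feature matrix

# Stratification on y keeps the positive-class prevalence nearly identical
# in the training (n=732) and held-out test (n=183) cohorts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=183, stratify=y, random_state=0
)
```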
Synergy and superior performance of the weighted ensemble model
Our primary objective was to determine the optimal modeling strategy for predicting ALN metastasis from tabular data. We first compared the diagnostic performance of three optimized models on the held-out test cohort: the baseline LR, the state-of-the-art tree-based XGBoost model, and our proposed weighted ensemble (LR + XGB) model.
The weighted ensemble model achieved the highest overall diagnostic discrimination, with an AUC of 0.762. As detailed in Table 1, this result also served as a model component ablation, confirming our H1 and H2. The full weighted ensemble model (AUC =0.762) synergistically outperformed its individual, fully optimized base learners: the LR model (AUC =0.741) and the XGBoost model (AUC =0.752). This indicated that the meta-learner successfully synthesized the predictions of the linear and nonlinear base models to achieve a superior performance gain.
A similar pattern was observed for the AUPRC, a metric particularly suited to imbalanced datasets. The weighted ensemble model (AUPRC =0.575) demonstrated marked superiority over both XGBoost (AUPRC =0.505) and LR (AUPRC =0.477), suggesting a more stable performance profile across this imbalanced cohort. Analysis of other performance metrics revealed important clinical tradeoffs. The XGBoost model, driven by its high specificity (0.627), yielded the highest F1-score (0.571) and accuracy (0.674). Critically, however, both the weighted ensemble model and the XGBoost model achieved a higher recall (sensitivity) value (0.800) than did the LR model (0.760). The superior discriminative power of the ensemble model is visually presented in the AUCs and AUPRCs in Figure 1A. We further assessed model reliability via calibration curves (Figure 1B). The error distribution across the three optimized models for the test cohort was further visualized via confusion matrices (Figure 2A), which confirmed that, despite some variability, all models consistently provided high recall. Furthermore, although all models exhibited miscalibration to some extent (lying slightly below the “perfectly calibrated” diagonal), indicating a mild tendency to overestimate risk, the LR and weighted ensemble models yielded a calibration slope closer to 1.0. This suggests that their raw predicted probabilities are more reliable and more closely aligned with observed event frequencies than are those from the standalone XGBoost model.
Data synergy confirmed through feature-group ablation
Having established the superiority of the ensemble architecture, we next investigated the contribution of the input data streams themselves. We performed feature-group ablation by retraining the final weighted ensemble model on distinct, mutually exclusive feature subsets: clinical features only (e.g., “age” and “maximum tumor diameter”) and pathological features only (e.g., ER expression, Ki-67 proliferation index, and histological grade).
The results, as presented in Table 2, demonstrate the critical value of data integration. The model trained on all features (AUC =0.765) substantially outperformed the models trained on clinical features only (AUC =0.690) or pathological features only (AUC =0.667). This result strongly suggests that both clinical and pathological data streams provide unique, complementary predictive information. Neither feature set alone is sufficient, and thus integrating them is essential for maximizing accuracy. This finding is presented visually in the ablation bar plot in Figure 2B.
Model interpretability and clinically driven logic
To understand the differences in the model’s logic qualitatively, we visualized the decision boundaries after retraining the models on the first two principal components of the training data (Figure 3). This visualization highlighted the fundamental differences in their mechanisms. The LR model (Figure 3A) learns a single, linear boundary, cleanly separating the two classes but failing to capture local complexities. The XGBoost model (Figure 3B) creates a complex, nonlinear, axis-parallel checkerboard pattern, effectively isolating small clusters of positive cases. The weighted ensemble model (Figure 3C) generates a hybrid boundary, capturing the nonlinear clusters identified by XGBoost but smoothing them into a more generalized and robust separation, which likely accounts for its superior and more stable performance.
A primary objective of this study was to ensure model transparency. We first compared the global feature contributions of the two base learners (Figure 4). For the XGBoost model, the gain metric identified Ki-67 proliferation index, maximum tumor diameter, and histological grade as the top three predictive features (Figure 4A). For the LR model, the absolute magnitude of its coefficients (representing the contribution to log odds) indicated a similar, albeit differently ranked, set of features, with lymphovascular invasion (“CatLVIYes”) and Ki-67 proliferation index being the most impactful (Figure 4B). This strong congruence for the most important features (Ki-67 proliferation index, maximum tumor diameter, lymphovascular invasion, and histological grade) between the two architecturally distinct models supports the biological validity of the underlying signals being captured.
We then applied SHAP, a game-theoretic XAI method, to the final weighted ensemble model to fully analyze its behavior in the test set. The SHAP summary plot in Figure 5A provides a global view of feature impact and directionality. It shows that “NumMaxDiameter” has the largest mean impact on model output, followed by “CatPathologyTypeIDC” and “CatTICPlateau”, and that the high values (red dots) of these features (e.g., large tumor diameter and IDC pathology type) strongly push the model’s prediction toward a positive ALN status (positive SHAP value). Notably, while Ki-67 proliferation index was the dominant feature in the standalone XGBoost model (Figure 4A), the ensemble model’s SHAP analysis relegates it to a secondary role, suggesting that the meta-learner re-weights feature contributions to prioritize morphological and hemodynamic indicators. Conversely, high PR expression and high ER expression are associated with negative predictions (negative SHAP values, represented as blue dots in Figure 5A), a fact consistent with their known role as favorable prognostic markers.
Regarding feature interactions, the SHAP dependence plot for “CatLocationNan” in Figure 5B provides further insight into model behavior. The plot reveals that the missingness pattern of the tumor location variable carries predictive information within the ensemble framework. The vertical color dispersion, representing interaction with “CatPathologyTypeOther”, demonstrates that this effect is further modulated by pathology type, highlighting the model’s capacity to capture complex, clinically relevant feature interactions beyond simple univariate associations.
Finally, waterfall plots were drawn to demonstrate clinical utility at the patient level. The SHAP waterfall plot in Figure 5C depicts a single prediction for an example patient from the test set. The model’s baseline prediction (E[f(x)] =0.497) is adjusted by the impact of each feature. For this patient, HER2 2+ status, a plateau TIC pattern, and IDC pathology type were the primary factors increasing the predicted risk, while HER2 1+ status partially mitigated the risk, resulting in a final model output (f[x] =0.527) above the baseline, indicating a positive prediction. This granular, patient-specific attribution provides a transparent and interpretable basis for clinical review, directly addressing the challenge of black-box medical practice.
Representative clinical case
To demonstrate the real-world manifestation of the model-identified features, we present a representative clinical case (Figure 6). This case was not included in model training or testing.
Clinical presentation and pathological profile
A 53-year-old postmenopausal woman had invasive breast carcinoma (nonspecial type) and presented the following high-risk features, as indicated by our SHAP analysis: a Ki-67 proliferation index of 30% (a recognized risk factor for ALN metastasis), histological grade II, a tumor diameter of 1.8 cm, minimal hormone receptor expression (ER 3% and PR-negative), and HER2 3+ overexpression (HER2-enriched subtype). Lymphovascular invasion was not definitively assessed on core biopsy. Subsequent axillary evaluation confirmed metastatic carcinoma in the ALNs (pathologically node-positive). This outcome supports the risk stratification capacity of the model-prioritized features. Each of these parameters—Ki-67, tumor size, histological grade, and receptor status—contributed to an elevated predicted risk according to the established clinical evidence and the model’s feature importance analysis (Figures 4,5A), with their combined effect correctly predicting nodal involvement.
The findings on contrast-enhanced breast MRI, including irregular margins, heterogeneous enhancement, and rapid initial contrast uptake, were consistent with the aggressive clinicopathological profile (Figure 6A,6B). These radiological findings independently corroborated the high-risk molecular phenotype, demonstrating that routine clinicopathological variables effectively capture the underlying tumor biology that also manifests across imaging modalities.
Discussion
In this study, we demonstrated that a two-stage weighted ensemble model, trained exclusively on 12 routine clinicopathological variables, can accurately and interpretably predict ALN metastasis. Our primary finding is that this interpretable framework, which strategically integrates a linear model with a nonlinear gradient boosting model, achieved diagnostic performance (AUC =0.762) superior to both other well-tuned baseline models. The performance of this hybrid approach aligns with recent evidence demonstrating that gradient boosting and stacking ensembles provide robust diagnostic performance across a variety of medical tabular datasets (24,25). This result, combined with the observation from the comprehensive XAI analysis, provides a powerful, transparent, and practical tool for noninvasive ALN risk stratification.
Our findings constitute strong empirical support for the use of our three-pillar theoretical framework. The weighted ensemble model (AUC =0.762) surpassed both its linear (AUC =0.741) and nonlinear (AUC =0.752) base learners, validating H1 and H2. This confirms that while nonlinear interactions are critical for accurate ALN metastasis prediction from clinicopathological tabular data—as evidenced by the inferior performance of the purely linear LR model (AUC =0.741) compared to the nonlinear XGBoost (AUC =0.752)—a simple tree-based model may overfit to nonlinear noise. The ensemble succeeds by leveraging the meta-learner to balance the robust, nonlinear predictions of XGBoost with the stable, generalized, linear predictions of LR. This creates a hybrid, smoothed decision boundary (Figure 3C) that generalizes better to unseen data, as evidenced by its superior AUC and AUPRC.
Beyond predictive accuracy, our XAI-driven approach validated the model’s clinical utility. The SHAP analysis (Figure 5A) confirmed H3, demonstrating that the model’s logic is nonspurious and rooted in established pathobiology. Consistent with the quantitative feature importance scores, the highest-ranked predictive features were tumor maximum diameter (“NumMaxDiameter”), the invasive ductal carcinoma (IDC) pathology type (“CatPathologyTypeIDC”), and the TIC plateau pattern (“CatTICPlateau”). These top contributors are well-established, independent risk factors associated with tumor aggressiveness and lymphatic spread. Naturally, larger tumors carry a higher risk of metastasis due to the increased surface area for vascular or lymphatic invasion and their longer evolution time. Furthermore, the prominence of the IDC pathology type and the plateau TIC pattern—often indicative of rapid, sustained tumor angiogenesis—highlights the model’s reliance on core morphological and hemodynamic indicators of malignancy. Conversely, while previous clinical assumptions might heavily emphasize variables such as the Ki-67 proliferation index or histological grade, our SHAP analysis accurately relegates these lower-scoring features to appropriate secondary predictive roles in this specific ensemble framework. Finally, the patient-specific waterfall plot in Figure 5C directly addresses H4 by providing a transparent, patient-specific risk attribution. This moves beyond a single, opaque risk score and offers a “glass box” explanation that clinicians can review, critique, and integrate into their holistic patient assessment, fulfilling a key recommendation from recent XAI reviews (14).
The representative clinical case described above (Figure 6) further illustrates the relevance of our model to the field of quantitative imaging. The high-risk features identified by our weighted ensemble model—specifically high Ki-67 proliferation and high histological grade—manifest radiologically as heterogeneous enhancement and rapid wash-out kinetics (Figure 6). This biological concordance confirms that accessible clinicopathological data can capture the same aggressive tumor phenotypes often targeted by radiomics. By establishing this link, our study provides a validated “clinical ground truth” that supports and complements future quantitative imaging and multimodal fusion research.
Our results contextualize the broader field of artificial intelligence in ALN prediction. Recent studies have further underscored the value of clinicopathological data. Song et al. (12) and Zhang et al. (13) successfully developed predictive models for ALN status. However, these approaches primarily focused on linear associations or standard machine learning. Our approach is distinct by virtue of its weighted ensemble architecture, which synergistically captures the nonlinear feature interactions (as evidenced by our SHAP dependence plots) that simpler models may overlook. Furthermore, we prioritize “white-box” transparency through patient-level SHAP explanations, addressing the interpretability gap often present in previous works. Finally, our framework contributes to the growing body of literature emphasizing that stacking ensembles combined with XAI can bridge the gap between high accuracy and clinical interpretability (26).
We acknowledge, however, that our absolute performance metrics—specifically an accuracy of approximately 0.67 and an F1-score of 0.57—are modest. These metrics reflect the inherent biological limitations of predicting microscopic nodal metastasis through an exclusive reliance on macroscopic, routine clinicopathological tabular data without advanced molecular profiling. Although several studies have reported models with higher AUCs, often in the 0.81–0.86 range (3,5,7), these models frequently rely on deep learning analysis of WSI histopathology slides (3-5), multimodal learning approaches (7,27,28), or advanced MRI sequences (9). These modalities are computationally expensive, require specialized expertise, and are not universally available.
Although our model trades some overall accuracy for these practical benefits, its primary clinical value lies in its high sensitivity (recall = 0.800). In a preoperative screening context, minimizing false negatives (i.e., missed metastases) is paramount. Our model, which achieved a reliable AUC of 0.762 using only 12 standard, tabular data points, represents a highly practical, low-cost, and scalable alternative. It can serve as an interpretable “first-pass” triage tool—demonstrating that clinically meaningful baseline predictions can be achieved with universally available data—and offers an immediately implementable alternative to resource-intensive approaches. Although advanced multiple-instance learning and channel-attention frameworks for WSI analysis offer high performance, their computational complexity remains a hurdle to routine integration compared with our streamlined tabular approach (29,30).
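As a minimal illustration of this kind of weighted probability ensemble, the sketch below blends the predicted probabilities of a linear model and a gradient-boosted tree model and scores the blend by AUC and recall. This is a hedged sketch on synthetic data, not the study's actual pipeline: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the feature count, class balance, and equal 0.5/0.5 weights are illustrative assumptions.

```python
# Illustrative weighted ensemble of a linear model and a boosted-tree model.
# GradientBoostingClassifier stands in for XGBoost; data, weights, and
# feature count are assumptions for illustration, not the study's pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 12 routinely collected clinicopathological features
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

w = 0.5  # ensemble weight on the linear model (assumed; normally tuned)
p = w * lr.predict_proba(X_te)[:, 1] + (1 - w) * gbt.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, p)
# The decision threshold can be lowered below 0.5 to favor sensitivity
recall = recall_score(y_te, p >= 0.5)
print(f"AUC={auc:.3f}, recall={recall:.3f}")
```

In a screening setting, the threshold on the blended probability would be tuned on validation data to reach the desired sensitivity rather than fixed at 0.5.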
Several limitations of this study should be acknowledged. First, we employed a retrospective dataset derived from a single-center data pipeline. The model’s generalizability and performance on external, multicenter cohorts must be confirmed before widespread clinical application. Second, the feature set, while effective, was predefined. Future work could benefit from the use of more granular data, such as specific receptor percentages rather than binarized status, or the inclusion of additional molecular biomarkers. Third, our feature group ablation (Table 2 and Figure 2B) confirmed the value of both clinical and pathological features, but the automatic grouping based on feature semantics is a proxy, and a more rigorous, expert-defined feature grouping could yield deeper insights. Finally, while a kernel-based SHAP explainer is a powerful, model-agnostic tool for interpreting the weighted ensemble, it is computationally intensive and yields only an approximation of the true SHAP values.
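To make this final limitation concrete, the snippet below computes exact Shapley values by enumerating every feature coalition, which requires O(2^d) model evaluations per feature and becomes infeasible as d grows; kernel SHAP instead samples and weights coalitions, trading exactness for tractability. The toy linear model, all-zero background, and three-feature setup here are illustrative assumptions only.

```python
# Exact Shapley values by brute-force coalition enumeration (O(2^d) model
# calls), illustrating why kernel SHAP resorts to a sampled approximation.
# Toy model and background are assumptions, not the study's ensemble.
import itertools
import math
import numpy as np

def exact_shapley(f, x, background, j):
    """Exact Shapley value of feature j for instance x under model f,
    replacing 'absent' features with a background (reference) value."""
    d = len(x)
    others = [k for k in range(d) if k != j]
    phi = 0.0
    for r in range(d):  # coalition sizes 0 .. d-1
        for S in itertools.combinations(others, r):
            z = background.copy()
            z[list(S)] = x[list(S)]          # coalition S present
            without = f(z)
            z[j] = x[j]                      # add feature j
            with_j = f(z)
            weight = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            phi += weight * (with_j - without)
    return phi

# For a linear model, Shapley values recover w_j * (x_j - background_j)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, -2.0])
b = np.zeros(3)

phi = [exact_shapley(f, x, b, j) for j in range(3)]
print(phi)  # approximately w * (x - b)
```

With 12 features the exact sum already needs 2,048 model evaluations per feature per patient, which is why sampled kernel-SHAP estimates were the practical choice.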
Conclusions
This study successfully developed and validated a two-stage weighted ensemble model that accurately and noninvasively predicts ALN metastasis in patients with breast cancer. By synergistically integrating the linear stability of LR with the robust nonlinear feature interaction capture of XGBoost, our model achieved a superior AUC of 0.762 on the test set, outperforming its individual constituent components.
The inclusion of SHAP ensures clinical transparency by providing glass-box interpretability, identifying tumor diameter, IDC pathology type, and TIC patterns as the primary drivers of the model’s risk predictions. Given its high sensitivity of 0.800, this framework serves as a practical, low-cost, and scalable triage tool that can be implemented with universally accessible clinicopathological data. It offers a significant clinical opportunity to safely de-escalate surgical procedures for low-risk patients, thereby sparing them the morbidity associated with invasive axillary staging. Although our findings suggest strong potential, future multicenter, prospective validation remains essential for translating this baseline clinicopathological framework into routine precision oncology practice.
Acknowledgments
We would like to thank the Institutional Review Board of Huaihe Hospital of Henan University for approving this study (approval No. 2025130) and for waiving the requirement for informed consent given the retrospective nature of the analysis and the use of de-identified data. We also thank the clinical and technical staff at the Department of Medical Imaging, Huaihe Hospital of Henan University, and The Third Affiliated Hospital of Zhengzhou University for their professional support in data acquisition and the maintenance of the patient database.
Footnote
Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/dss
Funding: This work was supported by “
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2525/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Institutional Review Board of Huaihe Hospital of Henan University (No. 2025130). Informed consent was waived in this retrospective study due to the use of de-identified data.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Giuliano AE, Ballman K, McCall L, Beitsch P, Whitworth PW, Blumencranz P, Leitch AM, Saha S, Morrow M, Hunt KK. Locoregional Recurrence After Sentinel Lymph Node Dissection With or Without Axillary Dissection in Patients With Sentinel Lymph Node Metastases: Long-term Follow-up From the American College of Surgeons Oncology Group (Alliance) ACOSOG Z0011 Randomized Trial. Ann Surg 2016;264:413-20.
- Slamon D, Eiermann W, Robert N, Pienkowski T, Martin M, Press M, et al. Adjuvant trastuzumab in HER2-positive breast cancer. N Engl J Med 2011;365:1273-83. [Crossref] [PubMed]
- Xu F, Zhu C, Tang W, Wang Y, Zhang Y, Li J, Jiang H, Shi Z, Liu J, Jin M. Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides. Front Oncol 2021;11:759007. [Crossref] [PubMed]
- Ilse M, Tomczak JM, Welling M. Attention-based deep multiple instance learning. ICML 2018. Available online: https://proceedings.mlr.press/v80/ilse18a/ilse18a.pdf
- Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, Brogi E, Reuter VE, Klimstra DS, Fuchs TJ. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019;25:1301-9. [Crossref] [PubMed]
- Zheng X, Yao Z, Huang Y, Yu Y, Wang Y, Liu Y, Mao R, Li F, Xiao Y, Wang Y, Hu Y, Yu J, Zhou J. Deep learning radiomics can predict axillary lymph node status in early-stage breast cancer. Nat Commun 2020;11:1236. [Crossref] [PubMed]
- Park D, Lee YM, Eo T, An HJ, Kang H, Park E, Cha YJ, Park H, Kwon D, Kwon SY, Jung HR, Shin SJ, Park H, Lee Y, Park S, Kim JM, Choi SE, Cho NH, Hwang D. Multimodal AI model for preoperative prediction of axillary lymph node metastasis in breast cancer using whole slide images. NPJ Precis Oncol 2025;9:131. [Crossref] [PubMed]
- Windsor GO, Bai H, Lourenco AP, Jiao Z. Application of artificial intelligence in predicting lymph node metastasis in breast cancer. Front Radiol 2023;3:928639. [Crossref] [PubMed]
- Mao N, Dai Y, Lin F, Ma H, Duan S, Xie H, Zhao W, Hong N. Radiomics Nomogram of DCE-MRI for the Prediction of Axillary Lymph Node Metastasis in Breast Cancer. Front Oncol 2020;10:541849. [Crossref] [PubMed]
- Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016:785-94.
- Zheng J, Li J, Zhang Z, Yu Y, Tan J, Liu Y, Gong J, Wang T, Wu X, Guo Z. Clinical Data based XGBoost Algorithm for infection risk prediction of patients with decompensated cirrhosis: a 10-year (2012-2021) Multicenter Retrospective Case-control study. BMC Gastroenterol 2023;23:310. [Crossref] [PubMed]
- Song L, Zhang F, Ma K, Wang B, Zhang T, Sun S. Analysis of factors affecting axillary lymph node metastasis in breast cancer and the establishment and validation of a predictive model. Sci Rep 2025;15:43630. [Crossref] [PubMed]
- Zhang X, Zhang C, Zhang J, Zhang X, Dou X. Establishment of Prediction Model of Axillary Lymph Node Metastasis Before Operation for Early-Stage Breast Cancer. Cancer Control 2025;32:10732748251363328. [Crossref] [PubMed]
- Wolpert DH. Stacked generalization. Neural Networks 1992;5:241-59.
- Adekoya A, Saeed F, Ghaban W, Qasem SN. Ensemble learning approach with explainable AI for improved heart disease prediction. Front Pharmacol 2025;16:1654681. [Crossref] [PubMed]
- Kablan R, Miller HA, Suliman S, Frieboes HB. Evaluation of stacked ensemble model performance to predict clinical outcomes: A COVID-19 study. Int J Med Inform 2023;175:105090. [Crossref] [PubMed]
- Ponce-Bobadilla AV, Schmitt V, Maier CS, Mensing S, Stodtmann S. Practical guide to SHAP analysis: Explaining supervised machine learning model predictions in drug development. Clin Transl Sci 2024;17:e70056. [Crossref] [PubMed]
- Salih AM, Raisi-Estabragh Z, Galazzo IB, Radeva P, Petersen SE, Lekadir K, Menegaz G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. Adv Intell Syst 2025;7:2400304.
- Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
- Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Cham: Springer; 2019.
- He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21:1263-84.
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res 2002;16:321-57.
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30 (NeurIPS); 2017. doi:10.5555/3295222.3295230.
- Dinh A, Miertschin S, Young A, Mohanty SD. A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Med Inform Decis Mak 2019;19:211. [Crossref] [PubMed]
- Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems 2022;35:507-20.
- Sultan SQ, Javaid N, Alrajeh N, Aslam M. Machine Learning-Based Stacking Ensemble Model for Prediction of Heart Disease with Explainable AI and K-Fold Cross-Validation: A Symmetric Approach. Symmetry 2025;17:185.
- Guo F, Sun S, Deng X, Wang Y, Yao W, Yue P, Wu S, Yan J, Zhang X, Zhang Y. Predicting axillary lymph node metastasis in breast cancer using a multimodal radiomics and deep learning model. Front Immunol 2024;15:1482020. [Crossref] [PubMed]
- Bychkov D, Linder N, Turkki R, Nordling S, Kovanen PE, Verrill C, Walliander M, Lundin M, Haglund C, Lundin J. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 2018;8:3395. [Crossref] [PubMed]
- Afonso M, Bhawsar PMS, Saha M, Almeida JS, Oliveira AL. Multiple Instance Learning for WSI: A comparative analysis of attention-based approaches. J Pathol Inform 2024;15:100403. [Crossref] [PubMed]
- Mao J, Xu J, Tang X, Liu Y, Zhao H, Tian G, Yang J. CAMIL: channel attention-based multiple instance learning for whole slide image classification. Bioinformatics 2025;41:btaf024. [Crossref] [PubMed]