Development and validation of an integrated model combining deep learning, radiomics, and clinical and breast ultrasound features for Breast Imaging Reporting and Data System 4A lesion malignancy classification
Introduction
Breast cancer is the most common malignancy among women worldwide and is the second most frequent cause of cancer-related mortality (1). Timely detection and diagnosis are critical to optimizing patient prognosis (2,3). The modalities for diagnosing breast lesions primarily comprise magnetic resonance imaging (MRI), mammography, and ultrasound (US) (4). Although MRI provides exceptional soft tissue contrast, its practical limitations include substantial costs and prolonged examination times; moreover, the necessity for contrast administration restricts its application in population-based screening (5). US imaging is superior to mammography in detecting intraductal and nodular pathologies (6,7) and, with the additional benefit of radiation-free operation, is particularly suitable for younger or pregnant patients. The anatomical characteristics of Chinese women, who typically have smaller breast volumes with increased glandular density, coupled with earlier disease onset patterns (8), have established US as the primary imaging method for both screening and preoperative evaluation, while mammography remains an important adjunctive technique (9).
The Breast Imaging Reporting and Data System (BI-RADS) classification system, developed by the American College of Radiology (ACR) for breast US imaging, provides a standardized framework for describing the findings from breast examinations. This system aids radiologists and breast surgeons in assessing the likelihood of malignancy in observed breast abnormalities and guides subsequent diagnostic and therapeutic strategies (10,11). Lesions categorized as BI-RADS 4A indicate a low probability of malignancy, estimated at 2–10% (12). BI-RADS 4A is the most frequently encountered subcategory within category 4, accounting for approximately 55.6% of the cases (13). However, the 2013 version of BI-RADS does not provide explicit guidelines for subcategories within BI-RADS 4 (10,11). Given the overlapping radiological characteristics of atypical benign and malignant lesions, diagnosis depends heavily on radiologists’ subjective experience, and specific criteria are lacking (14,15). Although the majority of BI-RADS 4A lesions are pathologically benign, routine biopsy has inherent limitations. Notably, because of intratumoral heterogeneity, biopsy results may reflect only localized tumor characteristics, even with multiple sampling. Biopsy may also induce patient anxiety during the waiting period, incur unnecessary healthcare costs, and occasionally lead to complications. Consequently, a more precise method for differentiating among 4A lesions that could facilitate safe monitoring and minimize unnecessary biopsies would significantly improve clinical decision-making.
Artificial intelligence (AI) is extensively used in the field of medical imaging to improve diagnostic accuracy and decision-making processes. Radiomics, which involves converting medical images into large volumes of data, allows for a comprehensive characterization of lesion heterogeneity (16) and has shown promising results in breast lesion diagnosis (17-20) and in predicting treatment response and prognosis (21,22) in patients with breast cancer. The application of deep learning (DL) algorithms, which use multiple layers of transformations to process input data, represents a significant advancement in breast cancer diagnosis (23,24), staging (25,26), and treatment response and prognostic prediction (27-29). The development of an integrated model that incorporates both DL and radiomics features allows for a more comprehensive extraction of information within medical images, thereby enhancing the precision and reliability of diagnosis. However, despite the potential of both approaches, studies that integrate DL with handcrafted radiomics for a comprehensive analysis are scarce, particularly for the specific task of differentiating BI-RADS 4A lesions (24,30-34). The bulk of the related research has focused on broader categories (e.g., all BI-RADS 4 lesions) or used either radiomics or DL in isolation. Therefore, to address this deficiency, we conducted a study to develop and validate an integrated model combining DL, handcrafted radiomics, and clinical and US features for differentiating benign and malignant BI-RADS 4A breast lesions in a two-center dataset. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1463/rc) (35).
Methods
Patient population
This retrospective study was approved by the Ethics Committee of Foshan First People’s Hospital (approval No. 2024-64) and the Medical Ethics Committee of Nanfang Hospital of Southern Medical University (approval No. NFEC-2025-322) and was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The requirement for informed consent was waived due to the retrospective nature of the analysis.
A total of 1,423 patients with histologically confirmed BI-RADS 4A lesions were recruited from two centers in China: Foshan First People’s Hospital and Nanfang Hospital of Southern Medical University. Each patient contributed a single BI-RADS 4A lesion to the study. For patients with multiple lesions, only the most representative index lesion was selected and analyzed to maintain data independence. Subsequently, 935 patients who attended Foshan First People’s Hospital between February 2018 and December 2021 were reviewed and randomly divided into a training cohort (n=654) and an internal validation cohort (n=281) at a ratio of 7:3. Furthermore, 488 patients who attended Nanfang Hospital of Southern Medical University between April 2020 and October 2023 were enrolled as an external validation cohort. The inclusion criteria were as follows: (I) females with BI-RADS 4A breast lesions confirmed by two board-certified radiologists through consensus review; (II) clearly visible lesions on B-mode US; (III) lesions adequately visualized via high-frequency transducers; and (IV) histopathological verification of all cases. The exclusion criteria were as follows: (I) unsatisfactory US image quality; (II) lesions only partially visible on US images due to excessively large diameters; (III) biopsy or anticancer treatment administered before US examination; and (IV) pregnant or lactating patients. The flowchart of the enrollment process for this study is shown in Figure 1.
Image acquisition
All patients underwent US examination within 1 month prior to surgical resection or US-guided core needle biopsy. Three board-certified radiologists specializing in breast imaging (with 6, 8, and 10 years of experience, respectively) conducted all US examinations. The US images were acquired in both centers with the following scanners: the MyLabTwice (Esaote, Genoa, Italy); LOGIQ E8, E9, and P9 (GE HealthCare, Chicago, IL, USA); Aplio 800 and 500 (Canon Medical Systems, Otawara, Japan); Aixplorer (SuperSonic Imagine, Aix-en-Provence, France); Resona 7 (Mindray, Shenzhen, China); EPIQ 5 and IU22 (Philips, Amsterdam, the Netherlands); and Acuson Sequoia (Siemens Healthineers, Erlangen, Germany). The largest longitudinal section of the targeted lesion on grayscale US was retrieved from the picture archiving and communication systems (PACS) for further analysis.
Acquisition of clinicoradiological features
Clinical characteristics, including age, lesion laterality, maximum diameter, and pathological results of lesions, were retrieved from the medical records. The identification of US features was based on the ACR BI-RADS lexicon. The evaluated parameters included lesion growth orientation, lesion shape, and the presence of macrocalcification, microcalcification, cystic areas, and posterior acoustic attenuation. The US features were independently evaluated by two radiologists (with 5 and 10 years of experience in breast US, respectively). Both radiologists were blinded to the clinical information and pathological diagnoses of the patients. Interobserver agreement between the two radiologists was assessed. Discrepant evaluations were resolved through consensus discussion.
Image segmentation
Using ITK-SNAP software (ver. 4.0.2, http://www.itksnap.org/), an experienced radiologist (5 years in breast imaging) manually annotated the regions of interest (ROIs) for each lesion on US images. The segmented ROIs were then reviewed and confirmed by a second radiologist with 10 years of experience. Subsequently, the US images were cropped based on the delineated ROIs and resized to 224×224 and 299×299 pixels to match the input sizes of the networks used.
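The cropping step can be sketched as follows. This is a minimal NumPy illustration with nearest-neighbour resizing, not the study's exact preprocessing pipeline; the toy image, mask, and helper name are hypothetical:

```python
import numpy as np

def crop_and_resize(image: np.ndarray, mask: np.ndarray, size: int) -> np.ndarray:
    """Crop the image to the bounding box of the ROI mask, then resize
    with nearest-neighbour interpolation (a simple stand-in for the
    resampling used in practice)."""
    ys, xs = np.nonzero(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    # Nearest-neighbour index mapping from the target grid to the source grid
    row_idx = (np.arange(size) * h / size).astype(int)
    col_idx = (np.arange(size) * w / size).astype(int)
    return crop[np.ix_(row_idx, col_idx)]

# Toy 2-D "ultrasound" image with a small rectangular lesion mask
img = np.arange(100, dtype=float).reshape(10, 10)
msk = np.zeros((10, 10), dtype=bool)
msk[2:6, 3:8] = True

patch224 = crop_and_resize(img, msk, 224)
patch299 = crop_and_resize(img, msk, 299)
print(patch224.shape, patch299.shape)  # (224, 224) (299, 299)
```

In practice a library resizer with anti-aliasing (e.g., bilinear interpolation) would normally be preferred; the nearest-neighbour mapping is used here only to keep the sketch dependency-free.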
Handcrafted radiomics feature extraction
Before the feature extraction process, image resampling and normalization were performed to improve the standardization of the US images. The Neuroimaging Informatics Technology Initiative (NIfTI) images (nii.gz format) and ROIs obtained from ITK-SNAP software were imported into Python (Python Software Foundation, Wilmington, DE, USA; http://www.python.org). All handcrafted features were extracted with PyRadiomics (http://pyradiomics.readthedocs.io). These features were classified into three categories: (I) geometric; (II) intensity-based; and (III) texture-related. The texture features included the gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run-length matrix (GLRLM), gray-level size-zone matrix (GLSZM), and neighborhood gray-tone difference matrix (NGTDM). In total, 1,561 handcrafted radiomics features, comprising intensity, shape, texture, and wavelet features, were extracted.
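To illustrate the kind of texture descriptor involved, the sketch below computes one GLCM-based feature (contrast) on a toy quantized image. PyRadiomics evaluates many such matrices and statistics automatically, so this is a conceptual illustration only, not the library's implementation:

```python
import numpy as np

def glcm(image: np.ndarray, levels: int, dx: int = 1, dy: int = 0) -> np.ndarray:
    """Symmetric, normalized gray-level co-occurrence matrix for one offset."""
    m = np.zeros((levels, levels))
    h, w = image.shape
    for y in range(h - dy):
        for x in range(w - dx):
            i, j = image[y, x], image[y + dy, x + dx]
            m[i, j] += 1  # count the pair in both directions (symmetric GLCM)
            m[j, i] += 1
    return m / m.sum()

def glcm_contrast(p: np.ndarray) -> float:
    """Contrast = sum over (i, j) of p[i, j] * (i - j)^2."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

# Toy 4x4 image already quantized to 4 gray levels
quantized = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 2, 2, 2],
                      [2, 2, 3, 3]])
p = glcm(quantized, levels=4)
print(round(glcm_contrast(p), 3))  # 0.583
```

Higher contrast indicates larger local gray-level differences, i.e., coarser or more heterogeneous texture within the ROI.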
DL feature extraction
Transfer learning was used to provide the networks with robust low-level image representations. We evaluated four convolutional neural networks (CNNs), namely residual network 50 (ResNet50), ResNet101, DenseNet121, and Inception_V3, along with a vision transformer (ViT) network. Initially, these networks were pretrained on the extensive ImageNet dataset (https://image-net.org/) to establish robust initial parameters, and the model parameters were then fine-tuned using images from the training cohort. To mitigate potential bias from a single random split, we further performed fivefold cross-validation on the training cohort to evaluate the performance of each DL model. The model with the highest average area under the curve (AUC) across the folds was selected as the best-performing model. ResNet101 consistently outperformed the other models (Table S1) and was therefore chosen for subsequent feature extraction. Features from the penultimate fully connected layer were extracted for further analysis. Training hyperparameters included a batch size of 32, 50 epochs, and the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01. The extracted deep features had a high dimensionality of 2,048. To mitigate overfitting and improve generalizability, principal component analysis (PCA) was applied to reduce the feature dimension to 64. The resulting PCA-transformed features were then standardized with the Z-score method: for each feature, the mean was subtracted and the result was divided by the standard deviation to achieve a distribution with zero mean and unit variance.
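The dimensionality-reduction and standardization steps can be sketched with scikit-learn as follows; the toy array sizes stand in for the actual n × 2,048 deep-feature matrix and the 64 retained components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for the n x 2,048 ResNet101 penultimate-layer feature matrix
# (toy sizes: 120 lesions x 256 dimensions)
deep_features = rng.normal(size=(120, 256))

# Reduce dimensionality with PCA (the study kept 64 components; 16 here)
pca = PCA(n_components=16, random_state=0)
compressed = pca.fit_transform(deep_features)

# Z-score standardization: subtract the mean and divide by the standard
# deviation so each component has zero mean and unit variance
z = StandardScaler().fit_transform(compressed)
print(z.shape)
```

When applying such a pipeline to validation data, the PCA and scaler must be fitted on the training cohort only and then applied unchanged (via `transform`) to the validation cohorts to avoid information leakage.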
Construction of the prediction model based on DL and handcrafted radiomics
We developed a three-step methodology in the training cohort to select robust features from the compressed DL features and the extracted handcrafted radiomics features and to build predictive models: (I) all extracted features underwent initial screening via analysis of variance to select those meeting the dual criteria of statistical significance and an intraclass correlation coefficient (ICC) >0.8; (II) the screened features were further refined via least absolute shrinkage and selection operator (LASSO) regression, with 10-fold cross-validation for parameter optimization, to determine the final feature subset with nonzero coefficients; and (III) predictive models were constructed with six machine learning algorithms: logistic regression (LR), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and light gradient boosting machine (LightGBM). Thus, two distinct predictive models were developed: a handcrafted radiomics model and a DL model, each capturing unique phenotypic features of breast lesions for malignancy prediction. All DL and radiomics analyses were performed on a workstation equipped with a GeForce RTX 4090 GPU (Nvidia, Santa Clara, CA, USA), an i9-14900HX CPU (Intel, Santa Clara, CA, USA), and 32-GB RAM running the Windows 11 operating system (Microsoft Corp., Redmond, WA, USA). Key software included Python (version 3.9.7) and PyRadiomics (version 1.3.0).
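Steps (II) and (III) can be sketched with scikit-learn as below. The synthetic data, the train/validation split, and the choice of plain logistic regression as the downstream classifier are illustrative assumptions, not the study's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined feature matrix and malignancy labels
X, y = make_classification(n_samples=600, n_features=80, n_informative=10,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

# Step (II): LASSO with 10-fold CV; keep features with nonzero coefficients
lasso = LassoCV(cv=10, random_state=0).fit(X_tr, y_tr)
selected = np.flatnonzero(lasso.coef_)

# Step (III): fit one of the six candidate classifiers (LR shown here)
clf = LogisticRegression(max_iter=1000).fit(X_tr[:, selected], y_tr)
val_auc = roc_auc_score(y_va, clf.predict_proba(X_va[:, selected])[:, 1])
print(len(selected), round(val_auc, 3))
```

The same selected-feature subset would then be reused unchanged when scoring the internal and external validation cohorts.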
Integrated model construction
Multivariable logistic analysis was conducted to identify significant clinical and breast US features (including age, lesion size, shape, orientation, presence of macrocalcification, microcalcification, cystic areas, and posterior attenuation) in the training cohort. The significant features identified were retained for subsequent model integration.
Additionally, the six aforementioned machine learning algorithms were used to construct an integrated model through the combination of the selected handcrafted radiomics features, DL features, and the significant clinical and breast US features. For comparative purposes, we also developed a baseline model incorporating solely clinical and breast US features. Model efficacy was assessed through receiver operating characteristic (ROC) curve analysis, with statistical comparison of AUC values between models being performed via the DeLong test. Figure 2 summarizes the protocol of this study.
Three decision thresholds of the integrated model were determined through ROC analysis to evaluate biopsy reduction potential. These sensitivity-prioritized cutoffs (S1: ≥97%; S2: ≥95%; S3: ≥90%), optimized in the training cohort, were validated in the internal and external validation cohorts to quantify diagnostic performance metrics, including specificity and sensitivity.
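The sensitivity-prioritized cutoffs can be derived as sketched below; the toy scores and the helper function are hypothetical, but the logic (choose the largest threshold that still meets the target sensitivity on the training data) reflects the approach described above:

```python
import numpy as np

def threshold_for_sensitivity(scores, labels, target):
    """Largest threshold whose sensitivity still meets the target
    (scores at or above the threshold trigger biopsy)."""
    pos = np.sort(scores[labels == 1])
    # Allow at most floor((1 - target) * n_pos) missed cancers
    k = int((1 - target) * len(pos))
    return pos[k]

# Toy cohort: 400 benign, 80 malignant lesions with overlapping score ranges
rng = np.random.default_rng(1)
labels = np.array([0] * 400 + [1] * 80)
scores = np.concatenate([rng.beta(2, 5, 400), rng.beta(5, 2, 80)])

for target in (0.97, 0.95, 0.90):  # the study's S1, S2, S3 operating points
    t = threshold_for_sensitivity(scores, labels, target)
    sens = (scores[labels == 1] >= t).mean()
    spec = (scores[labels == 0] < t).mean()
    print(f"target {target:.0%}: sensitivity {sens:.1%}, specificity {spec:.1%}")
```

The specificity obtained at each cutoff corresponds to the fraction of benign lesions that could safely be spared biopsy at that sensitivity level.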
Model explanation and visualization
To address the black-box nature of AI models, we employed SHapley Additive exPlanations (SHAP) (36,37) to interpret the best-performing integrated model. Specifically, density plots were used to visualize the impact of each feature on the model’s predictions. The SHAP value represents the average marginal contribution of a feature value across all possible feature subsets, quantifying its impact on the prediction outcome. The importance of each feature was ranked based on the mean absolute SHAP value. This approach improves the clinical interpretability of the integrated model’s predictions.
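The quantity SHAP estimates can be illustrated by computing exact Shapley values for a tiny model by brute force over feature coalitions. In practice the shap library approximates this efficiently for models such as LightGBM; the three-feature linear "model", observation, and background below are purely illustrative:

```python
import itertools
import math
import numpy as np

def coalition_value(f, x, background, subset):
    """Model output when only the features in `subset` take their observed
    values; absent features are set to the background (reference) values."""
    z = background.copy()
    for j in subset:
        z[j] = x[j]
    return f(z)

def exact_shap(f, x, background):
    """Exact Shapley values: weighted average marginal contribution of each
    feature over all subsets of the remaining features."""
    m = len(x)
    phi = np.zeros(m)
    for i in range(m):
        rest = [j for j in range(m) if j != i]
        for r in range(m):
            for s in itertools.combinations(rest, r):
                w = math.factorial(r) * math.factorial(m - r - 1) / math.factorial(m)
                gain = (coalition_value(f, x, background, set(s) | {i})
                        - coalition_value(f, x, background, set(s)))
                phi[i] += w * gain
    return phi

# Toy linear "model" over three standardized features
f = lambda z: 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]
x = np.array([1.0, 2.0, -1.0])
background = np.zeros(3)

phi = exact_shap(f, x, background)
print(phi)  # for a linear model, phi_j = w_j * (x_j - background_j)
```

The values satisfy the efficiency property: they sum to the difference between the prediction for this observation and the prediction for the background reference, which is what makes per-feature contributions directly interpretable.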
Statistical analysis
All statistical analyses were performed with SPSS version 22.0 (IBM Corp., Armonk, NY, USA), MedCalc version 20.010 (MedCalc Software, Ostend, Belgium), R version 2.15.3 (The R Foundation for Statistical Computing; https://www.r-project.org), and Python version 3.9.7 (http://www.python.org). Continuous variables were tested for normality of distribution; the independent t-test was used to compare normally distributed variables, and the Mann-Whitney test was used to compare nonnormally distributed variables. For categorical variables, comparisons were made with either the Chi-squared test or the Fisher exact test, as appropriate. Kappa statistics were used to analyze the interobserver agreement of US feature evaluation. A two-tailed P value <0.05 was considered to indicate a statistically significant result.
AUCs of the four models (i.e., the clinical model, the DL model, the handcrafted radiomics model, and the integrated model) were compared with the DeLong test. Where the overall differences were significant, pairwise comparisons against the integrated model were conducted with post hoc Bonferroni correction, which adjusted the significance threshold for each hypothesis to 0.017 (0.05/3); under this condition, values of P<0.017 were considered statistically significant.
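The model-comparison logic can be sketched as follows. The study used the DeLong test, which has a closed-form variance estimate; the sketch substitutes a simple percentile-bootstrap test for the AUC difference purely for illustration, applying the same Bonferroni-adjusted threshold of 0.05/3:

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC, equivalent to the normalized Mann-Whitney U statistic
    (assumes no tied scores, as with continuous predicted probabilities)."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def bootstrap_auc_p(y, s1, s2, n_boot=2000, seed=0):
    """Two-sided percentile-bootstrap P value for the difference between the
    AUCs of two models scored on the same cases (an illustrative stand-in
    for the DeLong test)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = []
    while len(diffs) < n_boot:
        b = rng.integers(0, n, n)
        if 0 < y[b].sum() < n:  # resample must contain both classes
            diffs.append(auc(y[b], s1[b]) - auc(y[b], s2[b]))
    diffs = np.asarray(diffs)
    return min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))

# Toy cohort: model 1 is informative, model 2 is near-random guessing
rng = np.random.default_rng(2)
y = np.array([0] * 150 + [1] * 150)
s1 = y + rng.normal(0, 0.5, 300)
s2 = rng.normal(0, 1, 300)

p = bootstrap_auc_p(y, s1, s2)
alpha = 0.05 / 3  # Bonferroni-adjusted threshold for three pairwise comparisons
print(round(auc(y, s1), 3), round(auc(y, s2), 3), p < alpha)
```

Because both models are scored on the same cases, resampling whole cases preserves the correlation between the two AUC estimates, which is the same consideration the paired DeLong test addresses analytically.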
Results
Baseline characteristics
Our analysis comprised 1,423 matched case-image pairs. Table S2 details the histological distribution of BI-RADS 4A lesions, including 1,196 benign (84.0%) and 227 malignant (16.0%) cases. Comparative analysis of demographic and sonographic characteristics across cohorts (Table S3) indicated significant intergroup differences in age, lesion size, lesion orientation, posterior acoustic features, and pathological outcomes. Interobserver reliability analysis between the two radiologists for US parameters yielded excellent agreement (κ=0.827–0.885; Table S4). A comparative analysis of demographic and US imaging characteristics between benign and malignant cases in the three cohorts is presented in Table 1. Multivariable LR in the training cohort identified four independent predictors of malignancy: older age, larger lesion diameter, irregular shape, and microcalcifications (Table S5).
Table 1 Comparison of clinical and US characteristics between benign and malignant lesions in the three cohorts

| Characteristic | Training: malignant (n=123) | Training: benign (n=531) | P value | Internal validation: malignant (n=44) | Internal validation: benign (n=237) | P value | External validation: malignant (n=60) | External validation: benign (n=428) | P value |
|---|---|---|---|---|---|---|---|---|---|
| Age (years) | 48.0 (40.0, 56.0) | 44.0 (37.5, 50.5) | <0.001 | 47.0 (39.0, 55.0) | 45.0 (38.5, 51.5) | 0.013 | 44.5 (36.0, 53.0) | 41.0 (33.5, 48.5) | 0.451 |
| Lesion size (cm) | 1.60 (0.95, 2.25) | 1.10 (0.75, 1.45) | <0.001 | 1.30 (0.80, 1.80) | 1.10 (0.70, 1.50) | 0.004 | 1.80 (1.25, 2.35) | 1.20 (0.85, 1.55) | <0.001 |
| Lesion laterality | | | 0.912 | | | 0.068 | | | 0.842 |
| Left | 66 (53.7) | 282 (53.1) | | 17 (38.6) | 127 (53.6) | | 29 (48.3) | 201 (47.0) | |
| Right | 57 (46.3) | 249 (46.9) | | 27 (61.4) | 110 (46.4) | | 31 (51.7) | 227 (53.0) | |
| Shape | | | <0.001 | | | 0.010 | | | 0.015 |
| Round/oval | 18 (14.6) | 162 (30.5) | | 6 (13.6) | 78 (32.9) | | 9 (15.0) | 129 (30.1) | |
| Irregular | 105 (85.4) | 369 (69.5) | | 38 (86.4) | 159 (67.1) | | 51 (85.0) | 299 (69.9) | |
| Orientation | | | 0.217 | | | 0.217 | | | 0.305 |
| Parallel | 88 (71.5) | 408 (76.8) | | 31 (70.5) | 187 (78.9) | | 47 (78.3) | 358 (83.6) | |
| Not parallel | 35 (28.5) | 123 (23.2) | | 13 (29.5) | 50 (21.1) | | 13 (21.7) | 70 (16.4) | |
| Presence of macrocalcification | | | 0.158 | | | 0.170 | | | 0.131 |
| No/normal | 115 (93.5) | 474 (89.3) | | 42 (95.5) | 210 (88.6) | | 58 (96.7) | 389 (90.9) | |
| Suspicious | 8 (6.5) | 57 (10.7) | | 2 (4.5) | 27 (11.4) | | 2 (3.3) | 39 (9.1) | |
| Presence of microcalcification | | | <0.001 | | | 0.002 | | | <0.001 |
| No/normal | 86 (69.9) | 487 (91.7) | | 36 (81.8) | 225 (94.9) | | 38 (63.3) | 396 (92.5) | |
| Suspicious | 37 (30.1) | 44 (8.3) | | 8 (18.2) | 12 (5.1) | | 22 (36.7) | 32 (7.5) | |
| Presence of cystic areas | | | 0.999 | | | 0.462 | | | 0.870 |
| No/normal | 104 (84.6) | 449 (84.6) | | 36 (81.8) | 204 (86.1) | | 50 (83.3) | 353 (82.3) | |
| Suspicious | 19 (15.4) | 82 (15.4) | | 8 (18.2) | 33 (13.9) | | 10 (16.7) | 75 (17.7) | |
| Presence of posterior attenuation | | | 0.908 | | | 0.648 | | | 0.462 |
| No/normal | 111 (90.2) | 481 (90.6) | | 39 (88.6) | 204 (86.1) | | 59 (98.3) | 408 (95.3) | |
| Suspicious | 12 (9.8) | 50 (9.4) | | 5 (11.4) | 33 (13.9) | | 1 (1.7) | 20 (4.7) | |
Data are shown as n (%) or median (interquartile range).
Validation of the radiomics and DL prediction models
Following dimensionality reduction and feature selection, 29 handcrafted radiomics features and 22 DL features were selected to build the respective models (Figure S1). The selected features for these models are listed in Appendix 1 and Table S6. The performance of the integrated models, constructed from different machine learning algorithms, in differentiating benign and malignant BI-RADS 4A lesions is presented in Table 2. In the external validation cohort, the integrated model developed with LightGBM obtained the highest AUC [0.861; 95% confidence interval (CI): 0.803–0.919], followed by XGBoost (0.846; 95% CI: 0.789–0.903), SVM (0.810; 95% CI: 0.746–0.874), LR (0.782; 95% CI: 0.714–0.851), MLP (0.754; 95% CI: 0.682–0.826), and RF (0.750; 95% CI: 0.678–0.823). As shown in Table 3, the DL model achieved moderate predictive performance, with AUCs of 0.905 (95% CI: 0.880–0.930), 0.772 (95% CI: 0.700–0.844), and 0.763 (95% CI: 0.695–0.830) in the training, internal validation, and external validation cohorts, respectively; similarly, the handcrafted radiomics model showed acceptable predictive performance, with corresponding AUCs of 0.894 (95% CI: 0.865–0.923), 0.734 (95% CI: 0.649–0.818), and 0.719 (95% CI: 0.649–0.789) across the three cohorts (Table 3 and Figure 3).
Table 2 Performance of the integrated models constructed with different machine learning algorithms in differentiating benign and malignant BI-RADS 4A lesions

| Machine learning algorithm | Cohort | AUC (95% CI) | Sensitivity | Specificity |
|---|---|---|---|---|
| LR | Training | 0.933 (0.910–0.956) | 0.847 | 0.887 |
| | Internal validation | 0.836 (0.767–0.905) | 0.674 | 0.853 |
| | External validation | 0.782 (0.714–0.851) | 0.750 | 0.752 |
| SVM | Training | 0.986 (0.975–0.997) | 0.968 | 0.949 |
| | Internal validation | 0.844 (0.775–0.913) | 0.651 | 0.903 |
| | External validation | 0.810 (0.746–0.874) | 0.717 | 0.778 |
| RF | Training | 0.899 (0.869–0.927) | 0.718 | 0.906 |
| | Internal validation | 0.851 (0.801–0.900) | 0.930 | 0.626 |
| | External validation | 0.750 (0.678–0.823) | 0.600 | 0.836 |
| XGBoost | Training | 0.974 (0.963–0.985) | 0.887 | 0.923 |
| | Internal validation | 0.862 (0.801–0.922) | 0.721 | 0.870 |
| | External validation | 0.846 (0.789–0.903) | 0.767 | 0.813 |
| MLP | Training | 0.977 (0.966–0.987) | 0.927 | 0.940 |
| | Internal validation | 0.821 (0.743–0.899) | 0.558 | 0.954 |
| | External validation | 0.754 (0.682–0.826) | 0.633 | 0.797 |
| LightGBM | Training | 0.938 (0.914–0.961) | 0.847 | 0.862 |
| | Internal validation | 0.870 (0.814–0.927) | 0.721 | 0.870 |
| | External validation | 0.861 (0.803–0.919) | 0.833 | 0.808 |
AUC, area under the curve; BI-RADS, Breast Imaging Reporting and Data System; CI, confidence interval; LightGBM, light gradient boosting machine; LR, logistic regression; MLP, multilayer perceptron; RF, random forest; SVM, support vector machine; XGBoost, extreme gradient boosting.
Table 3 Performance of the clinical, handcrafted radiomics, deep learning, and integrated models in the three cohorts
| Model | AUC (95% CI) | Sensitivity | Specificity |
|---|---|---|---|
| Training cohort | |||
| Clinical model | 0.819 (0.778–0.860) | 0.468 | 0.928 |
| Handcrafted radiomics model | 0.894 (0.865–0.923) | 0.831 | 0.802 |
| Deep learning model | 0.905 (0.880–0.930) | 0.935 | 0.725 |
| Integrated model | 0.938 (0.914–0.961) | 0.847 | 0.862 |
| Internal validation cohort | |||
| Clinical model | 0.791 (0.719–0.863) | 0.721 | 0.655 |
| Handcrafted radiomics model | 0.734 (0.649–0.818) | 0.698 | 0.693 |
| Deep learning model | 0.772 (0.700–0.844) | 0.791 | 0.651 |
| Integrated model | 0.870 (0.814–0.927) | 0.721 | 0.870 |
| External validation cohort | |||
| Clinical model | 0.781 (0.714–0.849) | 0.717 | 0.731 |
| Handcrafted radiomics model | 0.719 (0.649–0.789) | 0.683 | 0.671 |
| Deep learning model | 0.763 (0.695–0.830) | 0.650 | 0.794 |
| Integrated model | 0.861 (0.803–0.919) | 0.833 | 0.808 |
AUC, area under the curve; CI, confidence interval.
Performance and validation of the integrated model
Ultimately, the integrated model was constructed with 29 radiomics features, 22 DL features, and 4 clinical/US features (Table S6). This multimodal model demonstrated consistent superiority in the training and internal validation cohorts, with statistically significant improvements over all comparator models (Figure S2A,S2B). Furthermore, in the external validation cohort, it significantly outperformed the DL model (P=0.011), the handcrafted radiomics model (P<0.001), and the clinical model (P=0.015) (Table 3 and Figure S2C). The selected features were weakly correlated or uncorrelated in the integrated model (Figure S3). The integrated model demonstrated strong predictive performance in the training cohort, with an AUC of 0.938 (95% CI: 0.914–0.961). This robust performance was maintained in both validation cohorts, with AUC values exceeding 0.860 (Figure 3). The calibration curve indicated good agreement between the predicted probabilities and observed outcomes in all three cohorts (Figure 4A-4C). Decision curve analysis confirmed that the integrated model provided the highest net clinical benefit across all cohorts (Figure 4D-4F).
Explanation of the integrated model
A SHAP density plot was drawn to demonstrate the features with significant contributions to the integrated model (Figure 5). The findings revealed that wavelet_LHL_firstorder_Mean, a radiomic feature that represents the average intensity value within the lesion after a specific wavelet filter [low-high-low (LHL)] was applied to the image, and lesion diameter had the greatest contributions to the prediction of malignant 4A breast lesions. The SHAP values for wavelet_LHL_firstorder_Mean and lesion diameter varied within the cohort. The color gradient visualization demonstrated an inverse relationship between wavelet_LHL_firstorder_Mean values and malignancy probability while showing a positive correlation between lesion diameter measurements and predicted cancer risk.
Assessment of unnecessary biopsy prevention
Evaluation of biopsy reduction indicated that at the S1–S3 thresholds, the model correctly identified 43.3–66.4% and 20.6–48.1% of the benign lesions as biopsy-unnecessary in the internal and external validation cohorts, respectively (Table 4). Importantly, even at the S3 cutoff, only four malignancies were missed in the internal validation cohort. These results suggest that the integrated model can substantially reduce unnecessary biopsies while maintaining diagnostic accuracy.
Table 4 Sensitivity and specificity of the integrated model at the three sensitivity-prioritized operating points

| Operating point | Target sensitivity | Training: actual sensitivity | Training: specificity | Internal validation: actual sensitivity | Internal validation: specificity | External validation: actual sensitivity | External validation: specificity |
|---|---|---|---|---|---|---|---|
| S1 | 97% | 97.6% | 63.8% (338/530) | 97.7% | 43.3% (103/238) | 98.3% | 20.6% (88/428) |
| S2 | 95% | 95.2% | 68.3% (368/530) | 95.4% | 50.8% (121/238) | 95.0% | 29.9% (128/428) |
| S3 | 90% | 90.3% | 78.7% (417/530) | 90.7% | 66.4% (158/238) | 90.0% | 48.1% (206/428) |
Discussion
In this study, we developed an integrated model combining DL, handcrafted radiomics, and clinical and US features and independently validated its ability to differentiate benign and malignant BI-RADS 4A breast lesions in an external validation cohort. The integrated model demonstrated superior performance compared with the DL model and the handcrafted radiomics model, significantly reducing false-positive rates while maintaining high cancer detection sensitivity.
Breast lesions classified as BI-RADS 4A exhibit radiological heterogeneity on US despite being associated with a low risk of malignancy. This highlights the need for precise differentiation of benign and malignant BI-RADS 4A lesions to guide appropriate treatment decisions and minimize unnecessary biopsies. A growing body of evidence suggests that various features can predict the malignancy of BI-RADS 4A lesions. Studies have found that texture features, such as those derived from wavelet transforms and GLCM, can predict malignancy in BI-RADS 4A lesions (19,32,38). However, these findings are limited in clinical applicability due to the small sample sizes and the absence of validation in multicenter cohorts. In our study, we developed a radiomics model based on a larger cohort, achieving an AUC of 0.719 in an external validation cohort of 488 patients. This result is more compelling and may have broader clinical relevance. Notably, all 29 features selected for our radiomics model were wavelet transform and GLCM features, offering detailed insights into lesion heterogeneity.
We employed a ResNet101-based DL framework for feature extraction. Unlike traditional methods, DL autonomously learns discriminative features through its hidden layer architectures without requiring predefined parameters. Although previous research (24,39-41) has established DL’s capability in breast lesion malignancy assessment, BI-RADS 4A-specific applications remain underexplored. Our DL model achieved clinically relevant predictive performance and marginally outperformed the handcrafted radiomics model in terms of discrimination ability in all cohorts. These findings indicate that DL provides rich information reflecting the spatial heterogeneity related to the nature of the lesion.
Nevertheless, the DL model and the handcrafted radiomics model appeared prone to overfitting, which could be explained by the dataset being heavily imbalanced toward benign lesions; this imbalance, however, arises from the nature of BI-RADS 4A lesions. Furthermore, the efficacy and generalizability of a single model constructed from features of a single domain may remain suboptimal. Thus, we developed an integrated model by using a feature fusion approach via a machine learning algorithm. Notably, our integrated model significantly outperformed the handcrafted radiomics model (P<0.001), the DL model (P=0.011), and the clinical model (P=0.015) in the external validation cohort. Overfitting was also markedly reduced after the fusion model was implemented. Combining DL and radiomics features provides a synergistic effect in which the strengths of both methodologies are leveraged: DL recovers intricate patterns missed by predefined features, while radiomics anchors predictions in robust biophysical priors, collectively reducing dependence on any single feature source and consequently diminishing the overfitting risk. The distributions of age, lesion diameter, shape, and microcalcification, which were incorporated into our integrated model, showed significant variation across all cohorts. Previous studies have indicated that these four clinical/US features are associated with the malignancy of BI-RADS 4 lesions (17,19,42,43); however, the reported associations are inconsistent across these studies. The restricted diagnostic scope of clinical/US features likely contributes to the clinical model’s inconsistent performance. In contrast, our integrated framework (I) captures high-dimensional imaging biomarkers; (II) performs comprehensive heterogeneity quantification; and (III) maintains strong generalizability despite cohort variations. Heatmap analysis verified the feature complementarity, as indicated by the predominantly low correlation coefficients.
SHAP analysis offers a transparent method for interpreting the contributions of features within the LightGBM model, enabling clinicians to intuitively understand feature contributions through summary plot visualizations. Notably, in our study, wavelet_LHL_firstorder_Mean and wavelet_HLL_firstorder_Mean were two of the most influential nonsemantic imaging biomarkers in the integrated model identified by this explainability analysis. These wavelet features’ higher values reveal specific tissue organization patterns. The LHL feature’s increased mean value detects smooth horizontal texture changes, appearing as regular layered structures on US. Similarly, the high-low-low (HLL) feature’s higher mean indicates uniform vertical patterns with strong low-frequency signals, which appear as a consistent background tissue texture. Pathologically, these patterns match classic benign characteristics. The LHL component shows well-organized gland structures maintained by an intact fibrous capsule. Meanwhile, the HLL component demonstrates neatly arranged collagen fibers in the supporting tissue. When cancer develops, tumor cells invade and damage this orderly structure. The LHL values decrease as gland organization breaks down. The HLL values also drop when the supporting tissue’s structure weakens. This makes these wavelet measurements especially useful for spotting early cancerous changes.
Our study has several limitations that should be acknowledged. First, because of its retrospective nature and differences in data distribution between centers, inherent biases could not be avoided despite an adequate sample size in the external validation cohort. Consequently, well-designed prospective studies are needed to confirm the generalizability and clinical utility of our integrated model. Second, our model was developed using grayscale US images only; data from advanced techniques such as elastography or contrast-enhanced US were not included, and incorporating these multimodal features in the future may enhance diagnostic performance. Third, the generalizability of our model may be limited because it was developed and validated in Chinese cohorts; owing to spectrum bias, its performance may differ in populations with different genetic backgrounds, disease prevalence, or imaging protocols. External validation in multiethnic and multinational cohorts is therefore essential before clinical application to ensure robustness and equity. Finally, the biological significance of the integrated model, particularly of the DL features, warrants further investigation. Future research integrating imaging with molecular or genetic data could provide deeper insights into microlevel information and its interrelationships.
Conclusions
We developed and validated an integrated model incorporating DL, handcrafted radiomics, and clinical and US features to differentiate benign from malignant BI-RADS 4A breast lesions. This model has the potential to reduce unnecessary biopsies while maintaining diagnostic accuracy.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1463/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1463/dss
Funding: This study was supported by grants from
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1463/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This retrospective study was approved by the Ethics Committee of Foshan First People’s Hospital (approval No. 2024-64) and the Medical Ethics committee of Nanfang Hospital of Southern Medical University (approval No. NFEC-2025-322), and individual consent for this retrospective analysis was waived. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Siegel RL, Giaquinto AN, Jemal A. Cancer statistics, 2024. CA Cancer J Clin 2024;74:12-49. [Crossref] [PubMed]
- Rivera-Franco MM, Leon-Rodriguez E. Delays in Breast Cancer Detection and Treatment in Developing Countries. Breast Cancer (Auckl) 2018;12:1178223417752677. [Crossref] [PubMed]
- Koh J, Kim MJ. Introduction of a New Staging System of Breast Cancer for Radiologists: An Emphasis on the Prognostic Stage. Korean J Radiol 2019;20:69-82. [Crossref] [PubMed]
- Mann RM, Hooley R, Barr RG, Moy L. Novel Approaches to Screening for Breast Cancer. Radiology 2020;297:266-85. [Crossref] [PubMed]
- Guo R, Lu G, Qin B, Fei B. Ultrasound Imaging Technologies for Breast Cancer Detection and Management: A Review. Ultrasound Med Biol 2018;44:37-70. [Crossref] [PubMed]
- Wang Y, Li Y, Song Y, Chen C, Wang Z, Li L, Liu M, Liu G, Xu Y, Zhou Y, Sun Q, Shen S. Comparison of ultrasound and mammography for early diagnosis of breast cancer among Chinese women with suspected breast lesions: A prospective trial. Thorac Cancer 2022;13:3145-51. [Crossref] [PubMed]
- von Euler-Chelpin M, Lillholm M, Vejborg I, Nielsen M, Lynge E. Sensitivity of screening mammography by density and texture: a cohort study from a population-based screening program in Denmark. Breast Cancer Res 2019;21:111. [Crossref] [PubMed]
- He J, Chen WQ, Li N, Shen HB, Li J, Wang Y, Li J, Tian JH, Zhou BS; Consulting Group of China Guideline for the Screening and Early Diagnosis and Treatment of Female Breast Cancer; Expert Group of China Guideline for the Screening and Early Diagnosis and Treatment of Female Breast Cancer; Working Group of China Guideline for the Screening and Early Diagnosis and Treatment of Female Breast Cancer. China guideline for the screening and early detection of female breast cancer (2021, Beijing). Zhonghua Zhong Liu Za Zhi 2021;43:357-82. [Crossref] [PubMed]
- Prasad SN, Houserkova D. The role of various modalities in breast imaging. Biomed Pap Med Fac Univ Palacky Olomouc Czech Repub 2007;151:209-18. [Crossref] [PubMed]
- Mendelson E, Böhm-Vélez M, Berg W, Whitman G, Feldman M, Madjar H, D’Orsi C, Sickles E, Morris E. ACR BI-RADS® atlas, breast imaging reporting and data system. Reston, VA: American College of Radiology; 2013:97-9.
- Mercado CL. BI-RADS update. Radiol Clin North Am 2014;52:481-7. [Crossref] [PubMed]
- Stavros AT, Freitas AG, deMello GGN, Barke L, McDonald D, Kaske T, Wolverton D, Honick A, Stanzani D, Padovan AH, Moura APC, de Campos MCV. Ultrasound positive predictive values by BI-RADS categories 3-5 for solid masses: An independent reader study. Eur Radiol 2017;27:4307-15. [Crossref] [PubMed]
- Elezaby M, Li G, Bhargavan-Chatfield M, Burnside ES, DeMartini WB. ACR BI-RADS Assessment Category 4 Subdivisions in Diagnostic Mammography: Utilization and Outcomes in the National Mammography Database. Radiology 2018;287:416-22. [Crossref] [PubMed]
- Yoon JH, Lee HS, Kim YM, Youk JH, Kim SH, Jeong SH, Hwang JY, Moon JH, Park YM, Kim MJ. Effect of training on ultrasonography (US) BI-RADS features for radiology residents: a multicenter study comparing performances after training. Eur Radiol 2019;29:4468-76. [Crossref] [PubMed]
- He P, Cui LG, Chen W, Yang RL. Subcategorization of Ultrasonographic BI-RADS Category 4: Assessment of Diagnostic Accuracy in Diagnosing Breast Lesions and Influence of Clinical Factors on Positive Predictive Value. Ultrasound Med Biol 2019;45:1253-8. [Crossref] [PubMed]
- Mayerhoefer ME, Materka A, Langs G, Häggström I, Szczypiński P, Gibbs P, Cook G. Introduction to Radiomics. J Nucl Med 2020;61:488-95. [Crossref] [PubMed]
- Niu S, Huang J, Li J, Liu X, Wang D, Zhang R, Wang Y, Shen H, Qi M, Xiao Y, Guan M, Liu H, Li D, Liu F, Wang X, Xiong Y, Gao S, Wang X, Zhu J. Application of ultrasound artificial intelligence in the differential diagnosis between benign and malignant breast lesions of BI-RADS 4A. BMC Cancer 2020;20:959. [Crossref] [PubMed]
- Debbi K, Habert P, Grob A, Loundou A, Siles P, Bartoli A, Jacquier A. Radiomics model to classify mammary masses using breast DCE-MRI compared to the BI-RADS classification performance. Insights Imaging 2023;14:64. [Crossref] [PubMed]
- Wang SJ, Liu HQ, Yang T, Huang MQ, Zheng BW, Wu T, Qiu C, Han LQ, Ren J. Automated Breast Volume Scanner (ABVS)-Based Radiomic Nomogram: A Potential Tool for Reducing Unnecessary Biopsies of BI-RADS 4 Lesions. Diagnostics (Basel) 2022;12:172. [Crossref] [PubMed]
- Wang SJ, Liu HQ, Yang T, Huang MQ, Zheng BW, Wu T, Han LQ, Zhang Y, Ren J. Machine learning based on automated breast volume scanner (ABVS) radiomics for differential diagnosis of benign and malignant BI-RADS 4 lesions. Int J Imaging Syst Technol 2022;32:1577-87.
- Lin Y, Wang J, Li M, Zhou C, Hu Y, Wang M, Zhang X. Prediction of breast cancer and axillary positive-node response to neoadjuvant chemotherapy based on multi-parametric magnetic resonance imaging radiomics models. Breast 2024;76:103737. [Crossref] [PubMed]
- Lee J, Kim SH, Kim Y, Park J, Park GE, Kang BJ. Radiomics Nomogram: Prediction of 2-Year Disease-Free Survival in Young Age Breast Cancer. Cancers (Basel) 2022;14:4461. [Crossref] [PubMed]
- Yang Y, Zhong Y, Li J, Feng J, Gong C, Yu Y, Hu Y, Gu R, Wang H, Liu F, Mei J, Jiang X, Wang J, Yao Q, Wu W, Liu Q, Yao H. Deep learning combining mammography and ultrasound images to predict the malignancy of BI-RADS US 4A lesions in women with dense breasts: a diagnostic study. Int J Surg 2024;110:2604-13. [Crossref] [PubMed]
- Zhao Z, Hou S, Li S, Sheng D, Liu Q, Chang C, Chen J, Li J. Application of Deep Learning to Reduce the Rate of Malignancy Among BI-RADS 4A Breast Lesions Based on Ultrasonography. Ultrasound Med Biol 2022;48:2267-75. [Crossref] [PubMed]
- Liang R, Li F, Yao J, Tong F, Hua M, Liu J, Shi C, Sui L, Lu H. Predictive value of MRI-based deep learning model for lymphovascular invasion status in node-negative invasive breast cancer. Sci Rep 2024;14:16204. [Crossref] [PubMed]
- Yang X, Wu L, Ye W, Zhao K, Wang Y, Liu W, Li J, Li H, Liu Z, Liang C. Deep Learning Signature Based on Staging CT for Preoperative Prediction of Sentinel Lymph Node Metastasis in Breast Cancer. Acad Radiol 2020;27:1226-33. [Crossref] [PubMed]
- Gu J, Tong T, He C, Xu M, Yang X, Tian J, Jiang T, Wang K. Deep learning radiomics of ultrasonography can predict response to neoadjuvant chemotherapy in breast cancer at an early stage of treatment: a prospective study. Eur Radiol 2022;32:2099-109. [Crossref] [PubMed]
- Tong T, Li D, Gu J, Chen G, Bai G, Yang X, Wang K, Jiang T, Tian J. Dual-Input Transformer: An End-to-End Model for Preoperative Assessment of Pathological Complete Response to Neoadjuvant Chemotherapy in Breast Cancer Ultrasonography. IEEE J Biomed Health Inform 2023;27:251-62. [Crossref] [PubMed]
- Mondol RK, Millar EKA, Sowmya A, Meijering E. BioFusionNet: Deep Learning-Based Survival Risk Stratification in ER+ Breast Cancer Through Multifeature and Multimodal Data Fusion. IEEE J Biomed Health Inform 2024;28:5290-302. [Crossref] [PubMed]
- Yang L, Zhang N, Jia J, Ma Z. Deep learning radiomics on grayscale ultrasound images assists in diagnosing benign and malignant of BI-RADS 4 lesions. Sci Rep 2024;14:31479. [Crossref] [PubMed]
- Shen Y, Shamout FE, Oliver JR, Witowski J, Kannan K, Park J, et al. Artificial intelligence system reduces false-positive findings in the interpretation of breast ultrasound exams. Nat Commun 2021;12:5645. [Crossref] [PubMed]
- Ye J, Chen Y, Pan J, Qiu Y, Luo Z, Xiong Y, He Y, Chen Y, Xie F, Huang W. US-based Radiomics Analysis of Different Machine Learning Models for Differentiating Benign and Malignant BI-RADS 4A Breast Lesions. Acad Radiol 2025;32:67-78. [Crossref] [PubMed]
- Zhong L, Shi L, Zhou L, Liu X, Gu L, Bai W. Development of a nomogram-based model combining intra- and peritumoral ultrasound radiomics with clinical features for differentiating benign from malignant in Breast Imaging Reporting and Data System category 3-5 nodules. Quant Imaging Med Surg 2023;13:6899-910. [Crossref] [PubMed]
- Ji H, Zhu Q, Ma T, Cheng Y, Zhou S, Ren W, et al. Development and validation of a transformer-based CAD model for improving the consistency of BI-RADS category 3-5 nodule classification among radiologists: a multiple center study. Quant Imaging Med Surg 2023;13:3671-87. [Crossref] [PubMed]
- Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385:e078378. [Crossref] [PubMed]
- Rodríguez-Pérez R, Bajorath J. Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values. J Med Chem 2020;63:8761-77. [Crossref] [PubMed]
- Lundberg S, Lee SI. A Unified Approach to Interpreting Model Predictions. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017.
- Ma Q, Wang J, Xu D, Zhu C, Qin J, Wu Y, Gao Y, Zhang C. Automatic Breast Volume Scanner and B-Ultrasound-Based Radiomics Nomogram for Clinician Management of BI-RADS 4A Lesions. Acad Radiol 2023;30:1628-37. [Crossref] [PubMed]
- Gu Y, Xu W, Liu T, An X, Tian J, Ran H, et al. Ultrasound-based deep learning in the establishment of a breast lesion risk stratification system: a multicenter study. Eur Radiol 2023;33:2954-64. [Crossref] [PubMed]
- He P, Chen W, Bai MY, Li J, Wang QQ, Fan LH, Zheng J, Liu CT, Zhang XR, Yuan XR, Song PJ, Cui LG. Deep Learning-Based Computer-Aided Diagnosis for Breast Lesion Classification on Ultrasound: A Prospective Multicenter Study of Radiologists Without Breast Ultrasound Expertise. AJR Am J Roentgenol 2023;221:450-9. [Crossref] [PubMed]
- Yi M, Lin Y, Lin Z, Xu Z, Li L, Huang R, Huang W, Wang N, Zuo Y, Li N, Ni D, Zhang Y, Li Y. Biopsy or Follow-up: AI Improves the Clinical Strategy of US BI-RADS 4A Breast Nodules Using a Convolutional Neural Network. Clin Breast Cancer 2024;24:e319-e332.e2.
- Gu Y, Tian JW, Ran HT, Ren WD, Chang C, Yuan JJ, et al. The Utility of the Fifth Edition of the BI-RADS Ultrasound Lexicon in Category 4 Breast Lesions: A Prospective Multicenter Study in China. Acad Radiol 2022;29 Suppl 1:S26-34.
- Buzatto IPC, Recife SA, Miguel L, Bonini RM, Onari N, Faim ALPA, Silvestre L, Carlotti DP, Fröhlich A, Tiezzi DG. Machine learning can reliably predict malignancy of breast lesions based on clinical and ultrasonographic features. Breast Cancer Res Treat 2025;211:581-93. [Crossref] [PubMed]

