Development and validation of radiomics models for the diagnoses of breast lesions with calcification on ultrasound

Xinyi Wang; Nan Zhang; Jieling Ma; Wangyan Qin; Shengri Liao; Hongjing Chang; Jianbo Liu; Ling Huo

doi:10.21037/qims-2025-1-517

Original Article


Xinyi Wang¹, Nan Zhang¹, Jieling Ma¹, Wangyan Qin¹, Shengri Liao¹, Hongjing Chang¹, Jianbo Liu², Ling Huo¹

¹Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Breast Center, Peking University Cancer Hospital & Institute, Beijing, China; ²Huafang Hanying Medical Technology Co., Ltd., Beijing, China

Contributions: (I) Conception and design: X Wang, L Huo; (II) Administrative support: J Liu; (III) Provision of study materials or patients: L Huo; (IV) Collection and assembly of data: X Wang, N Zhang, J Ma, W Qin, S Liao, H Chang, L Huo; (V) Data analysis and interpretation: X Wang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Ling Huo, MD. Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Breast Center, Peking University Cancer Hospital & Institute, 52 Fucheng Rd., Beijing 100142, China. Email: hlbcus@163.com.

Background: High-frequency ultrasound (US) technology has enhanced the visualization of breast calcification; however, the understanding of these calcifications among sonographers remains limited. This study aimed to investigate the diagnostic potential of US images for breast lesions with calcification via a radiomics approach.

Methods: This retrospective study analyzed breast lesions with calcification (verified on mammography) at a single center. Additionally, 56 lesions from the public Breast Ultrasound Image Dataset (BUSI) served as the external test set. Regions of interest (ROIs) for lesions and for calcifications were delineated separately on each US image and used for feature extraction. Pathological findings and follow-up assessments served as the reference standard for classifying lesions as benign or malignant. Radiomics models were developed using lesion, calcification, and fusion features based on six machine learning algorithms. In addition to the receiver operating characteristic curve and other metrics, a reader study was performed to evaluate model performance.

Results: A total of 508 US images of breast lesions from 490 women (mean age, 52.1±11.0 years) were included. Among the 919 extracted two-dimensional radiomics features, 44, 14, and 23 features were selected to construct the calcification-based, lesion-based, and fusion-based models, respectively. The fusion strategy outperformed the other two strategies across most algorithms and exhibited enhanced diagnostic performance on both test sets. In terms of the area under the receiver operating characteristic curve (AUC), the fusion models significantly outperformed the calcification-based models for the Decision Tree (DT) and Random Forest (RF) algorithms on the internal test set (both P<0.05), and significantly outperformed the lesion-based models for the Bootstrap Aggregation (BAG) and Logistic Regression (LR) algorithms on the external test set (both P<0.05). Notably, the AUC of the representative BAG model improved significantly after incorporating calcification features, increasing from 0.679 to 0.770 on the external test set (P<0.05), which was comparable to the performance of most experienced sonographers.

Conclusions: Fusing lesion and calcification features from US images enhances the predictive efficacy of radiomics models, indicating the additional value of calcification in diagnosing breast lesions.

Keywords: Breast lesions; calcification; cancer; ultrasound (US); radiomics

Submitted Nov 26, 2025. Accepted for publication Mar 12, 2026. Published online Apr 08, 2026.


Introduction

According to the latest global cancer burden estimate from the World Health Organization, breast cancer has emerged as the most common cancer among women in most countries (1). Consequently, early detection of breast cancer is important. Mammography and breast ultrasound (US) are effective imaging modalities for breast cancer screening. Among various imaging features, calcification is considered an important reference for distinguishing between benign and malignant breast tumors. The size, morphology, quantity, and distribution of breast calcifications provide valuable diagnostic clues. Large, amorphous, or isolated calcifications often indicate a low risk of malignancy, whereas fine pleomorphic or linear branching calcifications are highly suspicious for malignancy. Mammography remains the gold standard for investigating breast calcifications; however, Asian women typically have denser breast tissue than women of other ethnicities (2), which poses challenges for mammography screening, particularly in those under 40 years old. Consequently, US examination has become the primary breast cancer screening method in China.

When diagnosing breast lesions with US, a wide range of sonographic features can be assessed; however, calcifications have traditionally received less emphasis in this context. According to the 5th edition of the Breast Imaging Reporting and Data System (BI-RADS), calcifications on US images are categorized into only three types: calcifications within a mass, calcifications outside of a mass, and intraductal calcifications (3). With recent advancements in high-frequency US technology, the visualization of subtle structures within breast lesions has improved (4), including the detection of calcifications (5,6). Although US cannot display all breast calcifications as comprehensively as mammography, it is now capable of identifying most intralesional calcifications. Meanwhile, sonographers have become increasingly attentive to calcifications in patients with suspected malignancies (7,8). Nevertheless, the current understanding of breast calcifications on US remains limited, underscoring the need for further investigation into their diagnostic value.

With the rapid evolution of computer technology, artificial intelligence (AI) has been increasingly applied in medical imaging to aid in diagnosis and prognosis prediction. Radiomics, a promising AI method, extracts numerous image features and converts them into mineable high-dimensional data for statistical modeling and decision support. Several studies (9-11) have developed AI models to detect calcification on US images, showing promising results. However, few have systematically explored the diagnostic value of these calcifications for breast malignancy. Given the interpretability of radiomics methods, we investigated the value of calcifications in diagnosing breast malignancies by developing radiomics models based on different regions of interest (ROIs) and evaluating their performance on internal and external test sets. A reader study was also performed to measure the clinical applicability of the radiomics models. We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-517/rc).

Methods

Patients’ enrollment

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Peking University Cancer Hospital & Institute (No. 2024YJZ62), and individual consent for this retrospective analysis was waived. A total of 1,548 women presenting with breast abnormalities between January and September 2021 underwent breast US examinations for further diagnosis in Peking University Cancer Hospital & Institute and were consecutively enrolled in this study. Eligible patients were further screened with the following inclusion and exclusion criteria.

Inclusion criteria:

Lesions exhibiting a mass effect and measuring <5 cm in maximum diameter;
US evidence of calcifications within the lesion, and confirmation of corresponding calcifications by mammography;
Lesions categorized as BI-RADS 4 or higher on US necessitated pathological confirmation through biopsy or subsequent surgical excision;
Lesions categorized as BI-RADS 3 or lower on US required a minimum 24-month follow-up, during which no significant changes in lesion size, morphology, or imaging characteristics were observed; lesions showing interval growth or suspicious morphological changes during follow-up were reclassified for pathological evaluation.

Exclusion criteria:

Lesions with unclear visualization of calcifications on US images or absence of corresponding calcifications on mammography;
Lesions with missing key clinical or imaging information, including unavailable mammography, pathological results or insufficient follow-up duration for definitive diagnosis. Cases with incomplete data were excluded from analysis, and no data imputation was performed.

US images acquisition

The US diagnostic equipment used in the study included the RS80A (Samsung Medison, Seoul, South Korea), Logiq E9 (GE HealthCare, Chicago, IL, United States), MyLab Class C (Esaote, Genoa, Italy), and MyLab 9 (Esaote, Genoa, Italy), all equipped with 3–12 MHz linear array probes. The US examinations for all included patients were performed by six senior sonographers with at least 10 years of experience (three with over 10 years and three with over 20 years of experience). Two-dimensional grayscale images of targeted lesions were saved during examinations. Following image acquisition, the examining sonographer classified the lesions using the BI-RADS system for clinical reference.

Delineation of ROIs

Figure 1 presents the workflow of model establishment. Routinely stored two-dimensional grayscale images of the largest orthogonal or characteristic section of each lesion were utilized in this study. When multiple images of a lesion were available, the image displaying calcifications most clearly was chosen for further analysis. One of the sonographers, with over 10 years of experience in breast US diagnosis, used 3D Slicer software (version 5.6.0, Slicer Community) to annotate the images. The ROIs corresponding to breast lesions and calcifications were delineated separately on each image and used for subsequent radiomics analysis. For lesions containing multiple calcifications, a single merged ROI was constructed by combining the individual calcification masks. Radiomics features were then extracted from this integrated ROI, capturing the collective distribution and density characteristics of the calcification cluster.
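The merging of individual calcification masks into one integrated ROI can be illustrated with a minimal sketch. It assumes each mask is a binary grid of the same size as the image and that "combining" means a pixel-wise union; the function name `merge_masks` and the toy masks are hypothetical and are not taken from the study's code.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = inside ROI, 0 = background


def merge_masks(masks: List[Mask]) -> Mask:
    """Pixel-wise union (logical OR) of equally sized binary masks."""
    if not masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(masks[0]), len(masks[0][0])
    for m in masks:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all masks must share the same shape")
    # A pixel belongs to the merged ROI if any individual mask marks it.
    return [
        [1 if any(m[r][c] for m in masks) else 0 for c in range(cols)]
        for r in range(rows)
    ]


# Two hypothetical calcification masks on a 3x4 image grid.
calc_a = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
calc_b = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
merged = merge_masks([calc_a, calc_b])
```

Because the merged ROI may consist of several disconnected regions, shape features computed on it describe the cluster as a whole rather than any single calcification.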

Figure 1 Workflow of model establishment. Ultrasound images were collected from eligible patients, followed by manual annotation of lesion and calcification regions of interest by experienced sonographers. Individual calcification masks were merged into a single integrated region of interest for feature extraction. Radiomics features from seven categories were extracted, and a three-step feature selection strategy was applied to identify the most relevant features. Six machine learning algorithms representing distinct modeling paradigms were used to develop radiomics models. Model performance was evaluated on internal and external test sets using receiver operating characteristic analysis and decision curve analysis. AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; DCA, decision curve analysis; DT, Decision Tree; GLCM, Gray Level Co-occurrence Matrix; GLDM, Gray Level Dependence Matrix; GLRLM, Gray Level Run Length Matrix; GLSZM, Gray Level Size Zone Matrix; LASSO, least absolute shrinkage and selection operator; LR, Logistic Regression; MLP, Multi-layer Perceptron; NGTDM, Neighbouring Gray Tone Difference Matrix; RF, Random Forest; ROC, receiver operating characteristic; ROI, region of interest; SVM, Support Vector Machine; US, ultrasound.

Of note, feature stability was assessed on 56 calcified breast lesions selected from the publicly available Breast Ultrasound Image Dataset (BUSI) (12). For each case, the same sonographer re-annotated the lesion ROIs. Radiomics features extracted from the provided ROIs and re-annotated ROIs were then compared by calculating intraclass correlation coefficients (ICCs) to assess feature reproducibility.
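As an illustration of the reproducibility check, the sketch below computes an ICC for one feature measured on the provided and the re-annotated ROIs. The ICC form is not specified in the text, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is assumed here; the function name and the toy values are hypothetical.

```python
from typing import List


def icc_2_1(ratings: List[List[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    ratings[i][j] is the value of one feature for lesion i under
    annotation j (e.g. j=0 provided ROI, j=1 re-annotated ROI).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for lesions (rows), annotations (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data: identical annotations vs. a constant offset of +1.
identical = icc_2_1([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
offset = icc_2_1([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
```

In this toy case the identical annotations give an ICC of 1, while the constant offset lowers it to 2/3, since an absolute-agreement ICC penalizes systematic differences between annotations. A feature would count as reproducible under the study's criterion when its ICC exceeds 0.75.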

Establishment of the gold standard

Lesions categorized as BI-RADS 4 or higher on US underwent core-needle biopsy using 14-gauge biopsy needles. Additionally, a subset of lesions categorized as BI-RADS 3 underwent biopsy based on comprehensive clinical and imaging evaluation by physicians. For lesions diagnosed as borderline or malignant on biopsy and subsequently removed via surgical resection, postoperative pathology findings were used as the reference standard for final lesion classification. For patients without pathological confirmation, lesions were classified as benign if no significant progression in size, morphology, or imaging characteristics was observed during a minimum follow-up period of 24 months, in accordance with standard clinical practice for probably benign breast lesions.

Data partition

We randomly selected 80% of all included lesions for model training, with the remaining 20% for internal testing. The external test set consisted of 56 breast lesions presenting calcifications selected from the dataset BUSI.
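A random 80%/20% partition of this kind could be sketched as follows. The fixed seed, the rounding rule, and the use of 100 placeholder lesion IDs are assumptions made for illustration only; the study does not report how the split was implemented.

```python
import random
from typing import List, Sequence, Tuple


def split_80_20(ids: Sequence[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle lesion IDs reproducibly and split them 80% / 20%."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example with 100 placeholder lesion IDs.
train_ids, test_ids = split_80_20(range(100))
```

Splitting by lesion ID before any feature selection keeps the internal test set untouched during model development.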

Radiomics feature extraction

Radiomics features were extracted from each delineated ROI using the open-source PyRadiomics software package (v3.0.1, https://pypi.org/project/pyradiomics), including first-order features, shape-based features, gray-level co-occurrence matrix features, gray-level dependence matrix features, gray-level run-length matrix features, gray-level size zone matrix features, and neighbouring gray-tone difference matrix features. Python (version 3.8, https://www.python.org) was used for the normalization of original images and radiomics feature extraction.

Radiomics feature selection

Feature selection proceeded in three steps. First, each feature underwent significance testing to identify those with high predictive power, and features with statistically significant differences (P<0.05) were retained. Second, the retained features were compared pairwise to reduce redundancy: if the Pearson correlation coefficient between two features exceeded 0.85, the feature with the larger P value in the significance test was discarded. Finally, the least absolute shrinkage and selection operator (LASSO) was applied to the remaining features, and those with non-zero coefficients were selected.
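The first two selection steps can be sketched as below. This is one possible reading of the pairwise rule rather than the study's implementation: features are visited from the smallest to the largest P value, and a feature is dropped when it correlates strongly with one already accepted. The use of the absolute correlation, the feature names, and the P values are assumptions for illustration.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_features(features: Dict[str, List[float]],
                   p_values: Dict[str, float],
                   alpha: float = 0.05,
                   r_max: float = 0.85) -> List[str]:
    """Significance filter followed by a correlation-based redundancy filter."""
    # Step 1: keep features whose univariate test gave P < alpha.
    kept = [f for f in features if p_values[f] < alpha]
    # Step 2: visit features from smallest to largest P value and drop any
    # feature that is highly correlated with an already accepted one, so
    # the member of each correlated pair with the larger P value is removed.
    kept.sort(key=lambda f: p_values[f])
    selected: List[str] = []
    for f in kept:
        if all(abs(pearson(features[f], features[g])) <= r_max for g in selected):
            selected.append(f)
    return selected


# Hypothetical feature values for five lesions and hypothetical P values.
features = {
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_b": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly collinear with f_a
    "f_c": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly correlated with f_a
    "f_d": [1.0, 1.0, 2.0, 1.0, 2.0],    # not significant
}
p_values = {"f_a": 0.001, "f_b": 0.01, "f_c": 0.03, "f_d": 0.40}
selected = prune_features(features, p_values)
```

Here `f_d` fails the significance filter and `f_b` is removed as redundant with `f_a`, leaving `f_a` and `f_c` as input to the LASSO step.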

Radiomics models establishment

Three sets of radiomics models were developed using calcification features, lesion features, and fusion features. The Python scikit-learn machine learning package (version 0.20.4, https://scikit-learn.org/stable) was used to implement the classifiers for each downstream task. Six supervised classification algorithms were selected: Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), Multi-layer Perceptron (MLP), Random Forest (RF), and Bootstrap Aggregation (BAG). To select the best hyperparameters for each model, five-fold cross-validation was performed on the development set. The grid-search settings for the hyperparameters are presented in Table S1. In the model testing phase, an ensemble model built from the five-fold cross-validation was employed to produce the final classification.
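How a five-fold ensemble might yield a final classification can be sketched as follows. The text does not describe the combination rule, so this example assumes soft voting: the malignancy probabilities from the five fold-specific models are averaged and thresholded at 0.5. The function name, the threshold, and the probabilities are hypothetical.

```python
from typing import List


def ensemble_predict(fold_probs: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Average per-fold malignancy probabilities, then threshold.

    fold_probs[k][i] is the probability from the model trained in fold k
    for test lesion i.
    """
    n_folds = len(fold_probs)
    n_cases = len(fold_probs[0])
    mean_probs = [
        sum(fold[i] for fold in fold_probs) / n_folds for i in range(n_cases)
    ]
    # 1 = predicted malignant, 0 = predicted benign.
    return [1 if p >= threshold else 0 for p in mean_probs]


# Hypothetical outputs of five fold-specific models for three test lesions.
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.7],
    [0.9, 0.2, 0.3],
    [0.6, 0.4, 0.6],
]
labels = ensemble_predict(probs)
```

Averaging across folds tends to smooth out the variance of any single fold's model, which is one common motivation for this kind of ensemble.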

Reader study for malignancy identification

Four of the six senior sonographers who participated in image acquisition were included in the reader study. Sonographers 1 and 2 had more than 10 years of experience in breast US diagnosis, while sonographers 3 and 4 had more than 20 years of experience. To minimize inter-reader variability, all readers underwent a pilot training session to standardize the diagnostic criteria prior to the study. The reader study was conducted using the external test set (BUSI), comprising 56 breast lesions. A rigorous blinding process was implemented: all images were fully anonymized by a non-reading coordinator, with personal identifiers and clinical metadata removed, and were presented to each reader in a randomized order. Each sonographer independently reviewed all lesions, blinded to the pathological results and model predictions, and classified each lesion as benign or malignant based solely on the US images. Lesions categorized as BI-RADS ≤3 were considered benign, while those categorized as BI-RADS ≥4 were considered malignant. Diagnostic performance was compared between the sonographers and the representative radiomics model. To further evaluate inter-reader reliability, the four sonographers were paired by clinical experience into two groups (sonographers 1 and 2, and sonographers 3 and 4), and inter-reader agreement was assessed using Cohen’s kappa coefficient.
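The dichotomization rule and the agreement statistic can be illustrated with a short sketch. The BI-RADS categories are simplified to integers here, and the two readers' ratings are invented for illustration; they are not data from the reader study.

```python
from typing import List, Sequence


def birads_to_binary(category: int) -> int:
    """BI-RADS <=3 is treated as benign (0), BI-RADS >=4 as malignant (1)."""
    return 1 if category >= 4 else 0


def cohen_kappa(reader_a: Sequence[int], reader_b: Sequence[int]) -> float:
    """Cohen's kappa for two readers rating the same cases.

    Undefined (division by zero) if chance agreement equals 1.
    """
    a, b = list(reader_a), list(reader_b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each reader's marginal label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)


# Hypothetical BI-RADS categories assigned by two readers to ten lesions.
birads_a: List[int] = [4, 5, 4, 4, 5, 4, 3, 2, 3, 3]
birads_b: List[int] = [4, 4, 5, 4, 4, 3, 3, 3, 2, 4]
calls_a = [birads_to_binary(c) for c in birads_a]
calls_b = [birads_to_binary(c) for c in birads_b]
kappa = cohen_kappa(calls_a, calls_b)
```

In this toy case the readers agree on 8 of 10 lesions, chance agreement is 0.52, and kappa is about 0.58, showing how kappa discounts the agreement expected by chance.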

Model performance evaluation and statistical analysis

Differences in continuous variables were analyzed using the independent-sample t-test or Mann-Whitney U test. The Chi-squared test and Fisher’s exact test were employed to assess differences in categorical variables. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity on both the internal and the external test sets. The DeLong test was employed to assess the differences in AUCs. The Hosmer-Lemeshow test was conducted to assess the calibration of the models. Decision curve analysis (DCA) was performed to evaluate the clinical utility. P<0.05 was considered statistically significant. Statistical analyses were conducted using SPSS Statistics (version 26, IBM Corporation) and Python (version 3.8, https://www.python.org).
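For readers less familiar with these metrics, the sketch below shows how they follow from a confusion matrix and from ranked scores. The counts and scores are illustrative only and are not taken from the study. The AUC is computed through its pairwise-ranking interpretation, which is equivalent to the area under the empirical ROC curve.

```python
from typing import Dict, Sequence


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard diagnostic metrics, with malignant as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc_pairwise(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a malignant case scores higher than a benign one.

    Ties between a malignant and a benign score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Illustrative counts: 50 malignant and 50 benign lesions.
metrics = confusion_metrics(tp=40, fp=20, fn=10, tn=30)
# Illustrative model scores for two malignant (1) and two benign (0) lesions.
auc = auc_pairwise([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

With these counts the accuracy is 0.70, sensitivity 0.80, specificity 0.60, and NPV 0.75; the toy scores give an AUC of 0.75 because three of the four malignant-benign pairs are ranked correctly.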

Results

Clinical characteristics

We retrospectively enrolled 490 female patients with a total of 508 breast lesions, including 18 patients with bilateral lesions. Figure 2 illustrates the flowchart of patients’ enrollment. All of these lesions were associated with calcifications, and 66.3% (337/508) were confirmed to be malignant. Median patient age was 57 years [interquartile range (IQR), 44–60 years]. There was no statistically significant age difference between patients with benign and malignant lesions (51.5±10.9 vs. 52.4±11.1 years; P>0.05). However, malignant lesions were significantly larger than benign lesions (3.1±1.3 vs. 1.4±0.8 cm; P<0.001). Patient characteristics are summarized in Table 1. A comparison between the development and test sets revealed no significant differences in patient characteristics.

Figure 2 The flow chart of patients’ enrollment. US, ultrasound.

Table 1

Patient characteristics

Characteristics	Total (n=508)	Development set (n=409)	Test set (n=99)	P
Age (years)	52.1±11.0	52.4±11.1	50.7±10.7	0.158
Maximum diameter (cm)	2.5±1.4	2.5±1.4	2.6±1.5	0.370
Final diagnosis
Malignant (pathological)	337 (66.3)	274 (67.0)	63 (63.6)	0.450
Invasive ductal carcinoma	264	216	48
Ductal carcinoma in situ	46	39	7
Invasive lobular carcinoma	8	6	2
Solid papillary carcinoma	6	4	2
Mucinous carcinoma	2	1	1
Others^†	11	8	3
Benign (pathological)	54 (10.6)	40 (9.8)	14 (14.1)
Proliferative disease	19	14	5
Fibroadenoma	20	15	5
Inflammation	5	3	2
Others^‡	10	8	2
Benign (follow-up)	117 (23.0)	95 (23.2)	22 (22.2)

Data are presented as mean ± standard deviation or n (%) or n. ^†, including borderline phyllodes tumor, lymphoma, tubular carcinoma, etc.; ^‡, including atypical ductal hyperplasia, mammary duct ectasia, intraductal papilloma, etc.

Selection of radiomics features

A total of 407 lesions were randomly selected for model training, with the remaining 101 used for internal testing. From each ROI set, 919 two-dimensional radiomics features were extracted. Among the features derived from the provided and re-annotated ROIs in the BUSI dataset, 83.4% showed an ICC greater than 0.75, which is usually considered the threshold for excellent agreement (13), demonstrating high feature reproducibility. Following the three-step feature selection process, 44 features were selected for calcification-based modeling, 14 for lesion-based modeling, and 23 for fusion modeling. Extracted and selected features per category are presented in Table S2; LASSO coefficient profiles and λ-tuning results are presented in Figure S1.

Of note, eight of the ten most task-relevant features in the fusion model were calcification-related (Figure 3), underscoring the substantial diagnostic value of calcifications on US images for distinguishing benign from malignant breast lesions.

Figure 3 Weight map of the top 10 features most relevant to the fusion model’s predictive objective.

Performance evaluation of radiomics models based on different ROI features in predicting lesion malignancy

As shown in Figure 4, radiomics models developed solely based on either calcification features or lesion features exhibited good performance on the internal test set (average AUC of 0.849 and 0.883, respectively). However, both showed noticeable declines on the external test set (average AUC of 0.701 and 0.676, respectively), indicating a tendency toward overfitting in the single-scale feature-based models.

Figure 4 The AUCs for six radiomics models constructed based on calcification features, lesion features, and fusion features on the internal test set (A) and the external test set (B). AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; BUSI, Breast Ultrasound Image Dataset; DT, Decision Tree; LR, Logistic Regression; MLP, Multi-layer Perceptron; RF, Random Forest; SVM, Support Vector Machine.

Notably, on the internal test set, the lesion-based DT model achieved a significantly higher AUC than the calcification-based DT model (0.859 vs. 0.739, P<0.05). However, on the external test set, all lesion-based models showed reduced performance, falling slightly below the calcification-based models, though these differences were not statistically significant (all P>0.05).

Performance evaluation of fusion radiomics models in predicting lesion malignancy

Given the value of calcifications in predicting breast lesion malignancy, we further combined lesion and calcification features for fusion radiomics modeling. Of note, the fusion models outperformed the other two model types across most algorithms and demonstrated decent efficacy in identifying malignant lesions on both the internal and external test sets. Statistically, the fusion models significantly outperformed the calcification-based models for the DT and RF algorithms on the internal test set (both P<0.05, Figure 4A). Meanwhile, the fusion models significantly outperformed the lesion-based models for the BAG and LR algorithms on the external test set (both P<0.05, Figure 4B).

Since the BAG fusion model exhibited the optimal performance with an AUC of 0.770 on the external test set, it was selected as the representative model for subsequent analyses. The receiver operating characteristic curves of the BAG models indicated that the fusion model outperformed single-feature models on both internal and external test sets (Figure 5A,5B). Additionally, the Hosmer-Lemeshow tests showed that the BAG fusion model yielded non-significant results on both internal and external test sets (P=0.885 and 0.122, respectively), indicating no substantial deviation from perfect calibration. Regarding clinical utility, DCA demonstrated that the fusion model provided a higher net benefit compared with the other two models (Figure 5C,5D).
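The net benefit plotted in a decision curve follows the standard definition: the true-positive rate per patient minus the false-positive rate per patient weighted by the odds of the threshold probability. A minimal sketch, using invented probabilities and labels rather than study data, is given below.

```python
from typing import Sequence


def net_benefit(probs: Sequence[float], labels: Sequence[int], pt: float) -> float:
    """Net benefit at threshold probability pt (0 < pt < 1).

    NB = TP/n - (FP/n) * pt / (1 - pt), treating cases with
    predicted probability >= pt as test-positive.
    """
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= pt and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= pt and y == 0)
    return tp / n - (fp / n) * pt / (1 - pt)


# Hypothetical predicted malignancy probabilities and true labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
nb_model = net_benefit(probs, labels, pt=0.5)
# "Treat all" reference strategy: every lesion is called positive.
nb_treat_all = net_benefit([1.0] * len(labels), labels, pt=0.5)
```

At a threshold of 0.5 the toy model yields a net benefit of 1/6, whereas the treat-all strategy yields 0. A decision curve repeats this comparison over a range of thresholds.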

Figure 5 The ROC and DCA curves of BAG models constructed based on calcification features, lesion features, and fusion features. Panels (A) and (C) represent the internal test set results, while panels (B) and (D) represent the external test set results. AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; CI, confidence interval; DCA, decision curve analysis; ROC, receiver operating characteristic.

Detailed metrics of the BAG models constructed using different ROI features on the internal and external test sets are summarized in Table 2.

Table 2

The performance metrics of the BAG models constructed using different ROI features on internal and external test sets

Models	Accuracy (95% CI)	AUC (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	PPV (95% CI)	NPV (95% CI)
Lesion
Internal	0.851 (0.769, 0.908)	0.889 (0.819, 0.943)	0.873 (0.769, 0.934)	0.816 (0.666, 0.908)	0.887 (0.785, 0.944)	0.795 (0.645, 0.892)
External-BUSI	0.679 (0.548, 0.786)	0.679 (0.486, 0.839)	0.690 (0.540, 0.809)	0.643 (0.388, 0.837)	0.853 (0.699, 0.936)	0.409 (0.233, 0.613)
Calcification
Internal	0.861 (0.781, 0.916)	0.872 (0.788, 0.944)	0.937 (0.848, 0.975)	0.737 (0.580, 0.850)	0.855 (0.753, 0.919)	0.875 (0.719, 0.950)
External-BUSI	0.643 (0.512, 0.755)	0.687 (0.521, 0.839)	0.595 (0.445, 0.730)	0.786 (0.524, 0.924)	0.893 (0.728, 0.963)	0.393 (0.236, 0.576)
Fusion
Internal	0.871 (0.792, 0.923)	0.911 (0.837, 0.970)	0.841 (0.732, 0.911)	0.921 (0.792, 0.973)	0.946 (0.854, 0.982)	0.778 (0.637, 0.875)
External-BUSI	0.786 (0.662, 0.873)	0.770 (0.602, 0.911)^†	0.810 (0.667, 0.900)	0.714 (0.454, 0.883)	0.895 (0.759, 0.958)	0.556 (0.337, 0.754)

^†, significantly higher than the AUC of the lesion-based model on the external-BUSI test set. AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; BUSI, Breast Ultrasound Images Dataset; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value; ROI, region of interest.

Performance evaluation of lesion malignancy prediction by the sonographer

The performance of the four sonographers on the external test set was evaluated using the same metrics as those for the BAG fusion model, with comparisons presented in Figure 6 and Table 3. The AUC of the BAG fusion model did not differ significantly from that of sonographers 1, 2, and 3 (all P>0.05) but was significantly lower than that of sonographer 4 (0.770 vs. 0.905, P<0.05). In terms of sensitivity, the BAG fusion model performed similarly to sonographers 1, 3, and 4, but significantly worse than sonographer 2 (P<0.01). The specificity of the BAG fusion model exceeded that of most sonographers, although the difference did not reach statistical significance (P=0.510). These findings indicate that the BAG fusion model exhibited malignancy prediction performance comparable to that of most senior sonographers.

Figure 6 The ROC curve of the BAG fusion model and comparison with sonographers’ performance. AI, artificial intelligence; AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; CI, confidence interval; ROC, receiver operating characteristic.

Table 3

Performance of the sonographers and BAG fusion model on the external test set

Readers	AUC (95% CI)	Accuracy	Sensitivity	Specificity	PPV	NPV
Sonographer_1	0.798 (0.663–0.932)	0.875	0.952	0.643	0.889	0.818
Sonographer_2	0.786 (0.651–0.920)	0.893	1.000^†	0.571	0.875	1.000
Sonographer_3	0.774 (0.636–0.912)	0.839	0.905	0.643	0.884	0.692
Sonographer_4	0.905 (0.804–1.000)^‡	0.929	0.952	0.857	0.952	0.857
BAG fusion model	0.770 (0.616–0.924)	0.786	0.810	0.714	0.895	0.556

^†, sensitivity significantly higher than that of the other three sonographers and the BAG fusion model; ^‡, AUC significantly higher than that of the other three sonographers and the BAG fusion model. AUC, area under the receiver operating characteristic curve; BAG, Bootstrap Aggregation; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value.

The reader study of two groups demonstrated substantial inter-reader agreement, with Cohen’s kappa coefficients of 0.685 and 0.659, indicating good consistency between readers.

Discussion

In this study, we retrospectively analyzed 508 breast lesions with calcifications visible on both US and mammography. Three distinct sets of radiomics models were developed based on the calcification features, lesion features, and fusion features. Across most algorithms, there was no significant performance difference between calcification- and lesion-based models, and both showed reduced AUCs on the external test set. Notably, the fusion strategy achieved superior performance across most algorithms, with the representative BAG fusion model demonstrating significant improvements in malignancy prediction on the external test set. Its diagnostic performance was also comparable to that of most experienced sonographers, underscoring its potential for clinical application.

Mammography is widely recognized as the gold standard for evaluating breast calcifications in medical imaging. Although US does not match the efficacy of mammography in detecting breast calcifications, advancements in imaging post-processing techniques have improved its calcification detection capabilities. For example, the MicroPure™ technique (Toshiba Medical Systems Corp., Tochigi, Japan) has demonstrated improvements in visualizing breast microcalcifications on US. Microcalcifications were defined as calcifications less than 0.5 mm in diameter in the 4th edition of BI-RADS (14), a definition that was removed from the 5th edition. Meanwhile, other studies have reported that the detection rate of microcalcifications using high-frequency B-mode US is higher than that achieved with MicroPure™ (6) and comparable to mammography (41/44, 93%) (15). High-frequency US enhances B-mode resolution by increasing the transducer center frequency, typically to above 7 MHz. When combined with the hypoechoic background of the lesion, this technique allows for clearer visualization of calcifications within the lesion compared to those located outside the lesion (16). For this reason, the analysis in this study was restricted to calcifications located within lesions. Although numerous studies (6,15-17) have demonstrated that high-frequency US offers a higher detection rate for calcifications and can assist in guiding biopsies, there is a scarcity of research focusing on the diagnostic value of calcifications in US images. The 5th edition of BI-RADS only classifies breast calcifications on US images based on their location, and there is currently no unified standard for further describing and characterizing the nature of these calcifications. To address this gap, this study employed a radiomics approach to provide a more detailed analysis and characterization of the calcifications.

Previous studies exploring the roles of breast calcifications in malignancy diagnosis have primarily relied on mammography (18,19). To the best of our knowledge, this is the first study to evaluate the diagnostic value of calcifications for breast malignancy using US, leveraging recent advancements that have markedly improved the detection of calcifications on US images. In this study, both lesion-based models and calcification-based models exhibited decent performance on the internal test set but showed decreased performance on the external test set. This indicates that models relying solely on lesion or calcification features are insufficient for accurately predicting the malignancy of calcified breast lesions under current circumstances. Inspired by a previous study showing that combining multi-scale ROI-derived features enhanced breast lesion diagnosis in mammography (20), we adopted a fusion strategy integrating lesion and calcification features to predict breast lesion malignancy. Across the algorithms employed, the fusion models generally outperformed the single-scale feature-based models, especially on the external test set, underscoring the added diagnostic value of calcifications in US.

For feature selection, we employed a multi-step process consisting of significance testing, correlation analysis, and LASSO regression. LASSO regression was selected as the final feature selection step for its ability to address residual multicollinearity that may persist after preliminary redundancy reduction. Through L1 regularization, LASSO shrinks the coefficients of less informative or correlated features toward zero, thereby enhancing model stability and reducing overfitting risk (21,22). This approach has been widely applied for feature selection in moderately high-dimensional data following initial screening (23-25).
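For reference, the L1-penalized objective behind this step can be written in its standard linear-regression form. The notation is assumed here and does not appear in the text: y_i is the label of lesion i, x_ij is the value of feature j, β_0 is the intercept, β_j are the coefficients, and λ is the tuning parameter. The loss function actually used in the study is not stated, so the squared-error form below is only one common instance.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Larger values of λ drive more coefficients exactly to zero, and the features whose coefficients remain non-zero at the selected λ are the ones retained.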

In addition, eight of the top ten features for fusion modeling were calcification-related, underscoring the pivotal role of calcifications in distinguishing benign from malignant breast lesions. Notably, these features correspond well to established imaging and pathological characteristics of breast calcifications, thereby enhancing model interpretability and mitigating concerns related to the “black-box” nature of deep learning-based approaches. For instance, shape-based features in the fusion models, including Sphericity and PerimeterSurfaceRatio, characterize the spatial arrangement of calcification clusters. The Calci_original_shape2D_Sphericity (weight: −0.318) reflects the spatial compactness of the cluster’s planar projection. Its negative association with malignancy indicates that clusters with lower sphericity, characterized by elongated or irregular morphologies, are more likely to be malignant. This finding is consistent with pathological evidence showing that malignant calcifications tend to distribute linearly or segmentally along ducts or stromal structures, resulting in irregular two-dimensional contours (26,27). In contrast, the Calci_original_shape2D_PerimeterSurfaceRatio (weight: 0.203) represents boundary complexity. Within the merged two-dimensional ROI, a higher perimeter-to-surface ratio corresponds to increased irregularity and a more fragmented or diffuse distribution, which aligns with the scattered calcification patterns commonly observed in malignant nodules (26). Additionally, the Calci_wavelet-LH_glszm_LargeAreaHighGrayLevelEmphasis (weight: −0.326) captures large, high-intensity regions within the cluster. Its negative correlation with malignancy suggests that malignant nodules are less likely to contain large, homogeneous, and highly echogenic calcification areas, consistent with the clinical observation that malignant calcifications are typically small and discretely distributed (26). 
Such consistency between model-derived features and established clinical knowledge provides biological plausibility for the fusion model and supports its potential utility as a transparent decision-support tool in clinical practice.
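
As a worked illustration of the two shape descriptors discussed above, the sketch below computes both for idealized two-dimensional shapes. It assumes the definitions commonly used for 2D shape radiomics, namely sphericity = 2√(πA)/P and perimeter-to-surface ratio = P/A, where A is the area and P is the perimeter of the ROI. The exact formulas applied by the feature extraction software in this study are not restated here, and the shapes, dimensions, and function names are hypothetical.

```python
import math


def shape_descriptors(area, perimeter):
    """Return assumed 2D sphericity and perimeter-to-surface ratio."""
    return {
        # Equals 1 for a perfect circle and decreases for elongated shapes
        "sphericity": 2.0 * math.sqrt(math.pi * area) / perimeter,
        # Increases as the boundary becomes longer relative to the area
        "perimeter_surface_ratio": perimeter / area,
    }


# Compact ROI: circle of radius 2 (arbitrary units)
radius = 2.0
circle = shape_descriptors(math.pi * radius ** 2, 2.0 * math.pi * radius)

# Elongated ROI of similar area: 12 x 1 rectangle
length, width = 12.0, 1.0
elongated = shape_descriptors(length * width, 2.0 * (length + width))

print(circle)
print(elongated)
```

Under these assumed definitions, the elongated shape has a lower sphericity and a higher perimeter-to-surface ratio than the circle of similar area, matching the direction of the feature weights described above (negative for sphericity, positive for perimeter-to-surface ratio).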

To ensure a comprehensive evaluation, six representative machine learning algorithms spanning different modeling paradigms were included, covering linear and kernel-based models (LR, SVM), non-linear models (DT, MLP), and ensemble learning methods (RF, BAG). This design allowed us to systematically compare model behaviors under diverse assumptions and identify the most robust classifier for radiomics-based prediction. Ensemble models, particularly BAG, demonstrated superior performance in our experiments. This finding is consistent with prior radiomics studies, where ensemble learning has been shown to effectively reduce variance and improve robustness against high-dimensional noise in medical imaging data (28). By aggregating multiple weak learners, BAG was able to capture complementary information between calcification and lesion features while mitigating the overfitting commonly observed in single-estimator models. To account for potential class imbalance, class-weighted learning strategies were applied, and AUC was emphasized as the primary evaluation metric. Furthermore, hyperparameters for all models were optimized using grid search within a five-fold cross-validation framework, ensuring fair comparison and optimal model performance.
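
The aggregation idea behind BAG can be sketched in a few lines. The example below is a simplified illustration, not the model built in this study: it assumes decision stumps as the weak learners, simple majority voting, and a tiny hypothetical two-feature dataset, and all function names are made up for the example.

```python
import random


def fit_stump(X, y):
    """Return the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [1 if polarity * (row[j] - t) > 0 else 0 for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    j, t, polarity = stump
    return 1 if polarity * (row[j] - t) > 0 else 0


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble


def predict_bagging(ensemble, row):
    """Aggregate the individual stump predictions by majority vote."""
    votes = sum(predict_stump(stump, row) for stump in ensemble)
    return 1 if 2 * votes > len(ensemble) else 0


# Hypothetical data: column 0 stands for a lesion feature, column 1 for a
# calcification feature; label 1 = malignant, 0 = benign.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.4], [0.4, 0.3],
     [0.7, 0.8], [0.8, 0.6], [0.9, 0.9], [0.6, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = fit_bagging(X, y)
predictions = [predict_bagging(ensemble, row) for row in ([0.05, 0.05], [0.85, 0.85])]
print(predictions)
```

Because each stump sees a different bootstrap sample, individual thresholds vary from one estimator to the next; the vote averages out that variation, which is the variance-reduction effect referred to above.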

In the reader study, we included four experienced sonographers and compared their independent diagnostic performance with that of the proposed BAG fusion model to explore its potential for application in clinical practice. Consistent with other reader studies (29,30) involving ultrasonic AI, the representative BAG fusion model exhibited lower sensitivity but higher specificity than most of the experienced sonographers. The relatively low diagnostic specificity of most sonographers often results in more false-positive diagnoses, leading to unnecessary lesion biopsies (31). Therefore, our proposed BAG fusion model has the potential to improve diagnostic specificity and reduce unnecessary biopsies as a decision-support tool integrated into routine US interpretation, rather than as a stand-alone diagnostic system. In clinical practice, the model could be applied after initial image acquisition to provide an objective risk assessment that complements the sonographer’s judgment and supports more informed clinical decision-making. In addition, the model may serve as an educational resource for training junior sonographers by offering consistent feedback on lesion characterization. Nevertheless, the translation of this approach into routine clinical workflows and its impact on AI-human collaboration remain to be fully established, underscoring the need for prospective, multicenter validation to confirm real-world usability, generalizability, and clinical benefit across diverse institutions and patient populations.
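
The link between specificity and unnecessary biopsies can be made concrete with a small calculation. The counts below are purely hypothetical and are not results from this study; they only illustrate how sensitivity, specificity, and the number of false-positive diagnoses follow from a confusion matrix.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity and specificity from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # malignant lesions correctly flagged
        "specificity": tn / (tn + fp),  # benign lesions correctly cleared
        "false_positives": fp,          # benign lesions that may be biopsied
    }


# Hypothetical cohort: 40 malignant and 60 benign lesions
reader = diagnostic_metrics(tp=38, fp=24, tn=36, fn=2)
model = diagnostic_metrics(tp=34, fp=9, tn=51, fn=6)

print(reader)
print(model)
```

In this made-up scenario the model misses more malignant lesions than the reader (lower sensitivity) but flags far fewer benign lesions (higher specificity), which is why such a model is better positioned as a complement to the sonographer than as a replacement.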

Despite the promising clinical potential of the proposed BAG fusion model, this study has several limitations. First, the patient enrollment period (January–September 2021) was relatively short, which may have influenced the case mix. Although all eligible patients were consecutively enrolled to minimize investigator-driven selection, temporal factors and institution-specific practice patterns may still introduce residual selection bias. In addition, 23.0% (117/508) of the included lesions lacked pathological confirmation and were diagnosed as benign based on follow-up assessments, which may introduce diagnostic uncertainty. Second, the analysis was limited to static two-dimensional grayscale US images. Compared with three-dimensional or dynamic US imaging, static two-dimensional images lack volumetric and temporal information, such as spatial lesion continuity, depth-related morphology, and real-time echogenic behavior during probe movement. Although mammography confirmed the presence of calcifications, the absence of dynamic assessment may still lead to ambiguity between calcifications and other highly echogenic structures, such as fibrous tissue, during image annotation. Future studies incorporating three-dimensional or dynamic US data may better capture these diagnostic features and further improve model performance and clinical relevance. Third, while excellent inter-observer agreement was observed between the provided and re-annotated ROIs in the ICC analysis of the BUSI dataset, manual annotation inherently involves subjectivity. Even with high agreement, variability may arise without standardized annotation systems, including predefined delineation guidelines, annotator training, consensus review, and adjudication by experienced experts. Automated or semi-automated segmentation approaches may help reduce inter-observer variability, improve annotation consistency, and enhance reproducibility, particularly for large, multi-center datasets.
Finally, the models were developed using single-center, retrospective data, which may limit generalizability due to institution-specific imaging protocols and clinical practice patterns, as well as inherent biases related to retrospective data collection. External validation was further constrained by the small size and limited independence of the external test set derived from a public dataset (BUSI), which may diverge from real-world clinical scenarios in terms of case distribution and the lack of comprehensive multimodal confirmation (e.g., mammography or pathological correlation). These factors may restrict the representativeness of the external cohort and limit the robustness of generalizability assessment across diverse clinical settings. Prospective, fully independent, multicenter studies are therefore essential to validate real-world performance, clinical workflow integration, and the generalizability of the proposed model.

Conclusions

The integration of radiomics features extracted from both lesions and their associated calcifications can improve the predictive performance and robustness of radiomics models in differentiating benign from malignant calcified breast lesions on US. Given its comparable performance to that of most senior sonographers and the clinical interpretability of its key features, the proposed fusion model shows promise as a reliable complementary diagnostic tool for the accurate assessment of calcified breast lesions in clinical practice.

Acknowledgments

We would like to thank Dr. Dawei Wang for technical support with artificial intelligence modeling.

Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-517/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-517/dss

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-517/coif). J.L. is a current employee of Huafang Hanying Medical Technology Co., Ltd. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Peking University Cancer Hospital & Institute (No. 2024YJZ62), and individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Ferlay J, Ervik M, Lam F, Laversanne M, Colombet M, Mery L, Piñeros M, Znaor A, Soerjomataram I, Bray F. Global Cancer Observatory: Cancer Today. Lyon, France: International Agency for Research on Cancer; 2024. Available online: https://gco.iarc.who.int/media/globocan/factsheets/populations/591-panama-fact-sheet.pdf. Accessed 26 Feb 2024.
2. Bissell MCS, Kerlikowske K, Sprague BL, Tice JA, Gard CC, Tossas KY, Rauscher GH, Trentham-Dietz A, Henderson LM, Onega T, Keegan THM, Miglioretti DL; Breast Cancer Surveillance Consortium. Breast Cancer Population Attributable Risk Proportions Associated with Body Mass Index and Breast Density by Race/Ethnicity and Menopausal Status. Cancer Epidemiol Biomarkers Prev 2020;29:2048-56.
3. D'Orsi CJ, Sickles EA, Mendelson EB, Morris EA. ACR BI-RADS Atlas: Breast Imaging Reporting and Data System. Reston, VA: American College of Radiology; 2013.
4. Niu S, Huang J, Li J, Liu X, Wang D, Zhang R, Wang Y, Shen H, Qi M, Xiao Y, Guan M, Liu H, Li D, Liu F, Wang X, Xiong Y, Gao S, Wang X, Zhu J. Application of ultrasound artificial intelligence in the differential diagnosis between benign and malignant breast lesions of BI-RADS 4A. BMC Cancer 2020;20:959.
5. Moon WK, Im JG, Koh YH, Noh DY, Park IA. US of mammographically detected clustered microcalcifications. Radiology 2000;217:849-54.
6. Stöblen F, Landt S, Ishaq R, Stelkens-Gebhardt R, Rezai M, Skaane P, Blohmer JU, Sehouli J, Kümmel S. High-frequency breast ultrasound for the detection of microcalcifications and associated masses in BI-RADS 4a patients. Anticancer Res 2011;31:2575-81.
7. Lin M, Wu S. Ultrasound classification of non-mass breast lesions following BI-RADS presents high positive predictive value. PLoS One 2022;17:e0278299.
8. Pan J, Tong W, Luo J, Liang J, Pan F, Zheng Y, Xie X. Does contrast-enhanced ultrasound (CEUS) play a better role in diagnosis of breast lesions with calcification? A comparison with MRI. Br J Radiol 2020;93:20200195.
9. Chang RF, Hou YL, Huang CS, Chen JH, Chang JM, Moon WK. Automatic detection of microcalcifications in breast ultrasound. Med Phys 2013;40:102901.
10. Ren L, Liu Y, Tong Y, Cao X, Wu Y. Calcification segmentation based on a different scales superpixels saliency detection algorithm. Ultrasound Med Biol 2020;46:3404-12.
11. Karunia PD, Prajitno P, Soejoko DS. Automatic Detection of Breast Calcification in Ultrasound Imaging with Convolutional Neural Network. J Phys Conf Ser 2021;2019:012077.
12. Al-Dhabyani W, Gomaa M, Khaled H, Aly F. Deep learning approaches for data augmentation and classification of breast masses using ultrasound images. Int J Adv Comput Sci Appl 2019;10:1-11.
13. Haarburger C, Müller-Franzes G, Weninger L, Kuhl C, Truhn D, Merhof D. Radiomics feature reproducibility under inter-rater variability in segmentations of CT images. Sci Rep 2020;10:12688.
14. Bae MS, Cha JH, Chang JM, Cho N, Chu AJ, Han W, Jang M, Kim S-Y, Kim SM. Breast ultrasound diagnosis and report writing. In: Moon WK, editor. Breast Ultrasound. Xi’an: World Publishing Corporation; 2022:226.
15. Teh WL, Wilson AR, Evans AJ, Burrell H, Pinder SE, Ellis IO. Ultrasound guided core biopsy of suspicious mammographic calcifications using high frequency and power Doppler ultrasound. Clin Radiol 2000;55:390-4.
16. Nagashima T, Hashimoto H, Oshida K, Nakano S, Tanabe N, Nikaido T, Koda K, Miyazaki M. Ultrasound Demonstration of Mammographically Detected Microcalcifications in Patients with Ductal Carcinoma in situ of the Breast. Breast Cancer 2005;12:216-20.
17. Cho N, Moon WK, Cha JH, Kim SM, Jang M, Chang JM, Chung SY. Ultrasound-guided vacuum-assisted biopsy of microcalcifications detected at screening mammography. Acta Radiol 2009;50:602-9.
18. Chen JL, Cheng LH, Wang J, Hsu TW, Chen CY, Tseng LM, Guo SM. A YOLO-based AI system for classifying calcifications on spot magnification mammograms. Biomed Eng Online 2023;22:54.
19. Gerbasi A, Clementi G, Corsi F, Albasini S, Malovini A, Quaglini S, Bellazzi R. DeepMiCa: Automatic segmentation and classification of breast MIcroCAlcifications from mammograms. Comput Methods Programs Biomed 2023;235:107483.
20. Li H, Chen D, Nailon WH, Davies ME, Laurenson DI. Dual Convolutional Neural Networks for Breast Mass Segmentation and Diagnosis in Mammography. IEEE Trans Med Imaging 2022;41:3-13.
21. Tibshirani R. Regression shrinkage and selection via the lasso. J R Statist Soc B 1996;58:267-88.
22. Guyon IM, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res 2003;3:1157-82.
23. Li X, Zhang N, Hu C, Lin Y, Li J, Li Z, et al. CT-based radiomics signature of visceral adipose tissue for prediction of disease progression in patients with Crohn's disease: A multicentre cohort study. EClinicalMedicine 2023;56:101805.
24. Si Y, Abdollahi A, Ashrafi N, Placencia G, Pishgar E, Alaei K, Pishgar M. Optimized feature selection and advanced machine learning for stroke risk prediction in revascularized coronary artery disease patients. BMC Med Inform Decis Mak 2025;25:276.
25. Ya Y, Ji L, Jia Y, Zou N, Jiang Z, Yin H, Mao C, Luo W, Wang E, Fan G. Machine Learning Models for Diagnosis of Parkinson's Disease Using Multiple Structural Magnetic Resonance Imaging Features. Front Aging Neurosci 2022;14:808520.
26. van Leeuwen MM, Doyle S, van den Belt-Dusebout AW, van der Mierden S, Loo CE, Mann RM, Teuwen J, Wesseling J. Clinicopathological and prognostic value of calcification morphology descriptors in ductal carcinoma in situ of the breast: a systematic review and meta-analysis. Insights Imaging 2023;14:213.
27. Rizuana IH, Leong MH, Tan GC, Isa ZM. Association Between Microcalcification Patterns in Mammography and Breast Tumors in Comparison to Histopathological Examinations. Diagnostics (Basel) 2025;15:1687.
28. Parmar C, Grossmann P, Bussink J, Lambin P, Aerts HJWL. Machine Learning methods for Quantitative Radiomic Biomarkers. Sci Rep 2015;5:13087.
29. Zhao C, Xiao M, Jiang Y, Liu H, Wang M, Wang H, Sun Q, Zhu Q. Feasibility of computer-assisted diagnosis for breast ultrasound: the results of the diagnostic performance of S-detect from a single center in China. Cancer Manag Res 2019;11:921-30.
30. Wu JY, Zhao ZZ, Zhang WY, Liang M, Ou B, Yang HY, Luo BM. Computer-Aided Diagnosis of Solid Breast Lesions With Ultrasound: Factors Associated With False-negative and False-positive Results. J Ultrasound Med 2019;38:3193-202.
31. Wang XY, Cui LG, Feng J, Chen W. Artificial intelligence for breast ultrasound: An adjunct tool to reduce excessive lesion biopsy. Eur J Radiol 2021;138:109624.

Cite this article as: Wang X, Zhang N, Ma J, Qin W, Liao S, Chang H, Liu J, Huo L. Development and validation of radiomics models for the diagnoses of breast lesions with calcification on ultrasound. Quant Imaging Med Surg 2026;16(5):407. doi: 10.21037/qims-2025-1-517
