Multimodal combined model integrating 2.5D deep learning and habitat radiomics for malignancy discrimination in ≤2 cm BI-RADS 4 breast lesions

Shiyan Guo; Xiaohui Zhou; Jinguang Zhou; Liu Gong; Sijing Zhou; Liqing Jiang; Yan Zhang; Linyuan Jin; Ping Zhou

doi:10.21037/qims-2025-1-2812

Original Article

Shiyan Guo¹, Xiaohui Zhou², Jinguang Zhou¹, Liu Gong¹, Sijing Zhou¹, Liqing Jiang¹, Yan Zhang¹, Linyuan Jin², Ping Zhou¹

¹Department of Ultrasound, The Third Xiangya Hospital, Central South University, Changsha, China; ²Department of Ultrasound, Changsha Central Hospital, Changsha, China

Contributions: (I) Conception and design: S Guo, P Zhou; (II) Administrative support: P Zhou; (III) Provision of study materials or patients: P Zhou, Linyuan Jin; (IV) Collection and assembly of data: S Guo, X Zhou, S Zhou; (V) Data analysis and interpretation: S Guo, X Zhou, J Zhou, L Gong, Liqing Jiang, Y Zhang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Ping Zhou, PhD. Department of Ultrasound, The Third Xiangya Hospital, Central South University, 138 Tongzipo Road, Yuelu District, Changsha 410013, China. Email: zhouping1000@hotmail.com.

Background: Differentiating between benign and malignant ≤2 cm Breast Imaging Reporting and Data System (BI-RADS) category 4 breast lesions remains challenging and often leads to unnecessary biopsies. This study aimed to develop and externally validate a multimodal fusion model to improve malignancy discrimination in ≤2 cm BI-RADS 4 lesions.

Methods: A total of 686 women with ≤2 cm BI-RADS 4 breast lesions were included in this dual-center retrospective study: 526 from The Third Xiangya Hospital, Central South University (training cohort, n=368; internal validation cohort, n=158) and 160 from Changsha Central Hospital (external test cohort). Separate models based on automated breast volume scanner (ABVS) and strain elastography (SE) were developed, including ultrasound (US) models, radiomics-habitat (Rad-Habitat) models, and two-and-a-half dimensional (2.5D) and two-dimensional (2D) deep learning (DL) models. The predicted probabilities from the single-modality models were then integrated to construct two fusion models: the Combined (ABVS) model and the Combined (ABVS + SE) model. Model discrimination, calibration, and clinical utility were evaluated using the areas under the curve (AUCs), calibration curves, Brier scores, and decision curve analysis (DCA), with additional subgroup analyses of BI-RADS 4a lesions.

Results: In the internal validation and external test cohorts, the US models provided baseline discrimination (AUCs: 0.808–0.852). The Rad-Habitat models achieved AUCs of 0.845–0.881, while the DL models achieved AUCs of 0.874–0.887. The fusion models consistently outperformed all the single-modality models: the Combined (ABVS) model yielded AUCs of 0.948 and 0.942 in the internal and external cohorts, respectively, and the Combined (ABVS + SE) model further increased the internal AUC to 0.969 (DeLong test, P<0.05 for all pairwise comparisons vs. single-modality models). Both fusion models demonstrated good calibration with low Brier scores and achieved the highest net clinical benefit on DCA. In the ≤2 cm BI-RADS 4a subgroup, the combined models maintained high discriminatory performance (AUCs: 0.920–0.949) across the internal validation and external test cohorts.

Conclusions: The multimodal fusion model integrating clinical US, habitat radiomics, and 2.5D DL significantly improved malignancy discrimination for ≤2 cm BI-RADS 4 lesions with robust external validation, and may reduce unnecessary biopsies while supporting individualized patient management.

Keywords: Deep learning (DL); Breast Imaging Reporting and Data System (BI-RADS); breast; habitat; radiomics

Submitted Dec 27, 2025. Accepted for publication Mar 19, 2026. Published online Apr 14, 2026.

Introduction

With the widespread implementation of breast screening, the detection of breast nodules has increased substantially (1), making early and accurate characterization crucial for clinical decision-making. Malignancy risk stratification currently relies on the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS). Within this framework, BI-RADS category 4 lesions constitute a highly heterogeneous, suspicious group for which tissue biopsy is commonly recommended to minimize missed malignancy (2). However, large studies report positive predictive values (PPVs) of only about 7.6%, 22%, and 69.3% for BI-RADS 4a, 4b, and 4c lesions, respectively, indicating that many category 4 lesions are benign, resulting in a substantial number of negative biopsies (3). In small (≤2 cm) breast lesions, the limited size often obscures typical malignant features, leading to substantial overlap between benign and malignant morphologic and hemodynamic patterns, and weakening conventional qualitative assessment (4,5). Crucially, a small size does not indicate low biological risk: aggressive molecular subtypes are not uncommon in T1 breast cancers, and even T1a triple-negative tumors may be associated with unfavorable survival (6,7). Moreover, T1-stage breast cancer still shows about 26% axillary nodal positivity (8,9), and micrometastases are associated with reduced disease-free survival (10). Thus, accurate early staging and timely lesion characterization are essential for optimal treatment and axillary management.

Automated breast volume scanner (ABVS) provides standardized three-dimensional (3D), whole-breast acquisition with coronal reconstructions and multiplanar navigation, improving reproducibility and directly visualizing spatial signs such as the retraction phenomenon (11-13). The 3D volumetric datasets generated by ABVS are well-suited for artificial intelligence-based analysis and spatial heterogeneity characterization. Strain elastography (SE) complements ABVS by quantifying tissue stiffness and strain distribution, adding functional and biomechanical information that further refines the malignancy discrimination and characterization of small, early-stage nodules.

Radiomics transforms medical images into high-throughput quantitative features, linking macroscopic appearances with underlying microscopic phenotypes (14). Habitat radiomics further partitions tumors into subregions with similar texture patterns at the voxel or local-window level, explicitly depicting intratumoral heterogeneity and microenvironmental gradients, which is particularly useful for subtle early lesions (15,16). In parallel, deep learning (DL) models such as ResNet and DenseNet can automatically learn multiscale features via stacked convolutions and residual or dense connections (17,18), capturing both low-level texture and high-level semantics (19). In ABVS, two-and-a-half dimensional (2.5D) fusion of the coronal, sagittal, and transverse planes enables pseudo-3D encoding of spatial heterogeneity and directional anisotropy.

This study integrated clinical US features, habitat radiomics, and 2.5D DL into a multimodal fusion model to improve malignancy discrimination in ≤2 cm BI-RADS 4 breast nodules, with the goal of reducing unnecessary biopsies and better supporting individualized patient management. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2812/rc).

Methods

Patient enrollment

Female patients with ≤2 cm BI-RADS 4 breast lesions at The Third Xiangya Hospital, Central South University (Center 1; November 2019 to March 2025) and Changsha Central Hospital (Center 2; January 2021 to March 2025) were consecutively enrolled in this retrospective, dual-center study. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committees of The Third Xiangya Hospital, Central South University (No. Kuai 23642) and Changsha Central Hospital (No. 2025 Yi Shen Di 408 Hao). The requirement for written informed consent was waived by both committees due to the retrospective study design and the use of de-identified data.

The inclusion criteria were as follows: (I) histopathologic confirmation of the diagnosis obtained by surgical excision or core needle biopsy; (II) completion of the relevant ultrasound (US) examinations within 2 weeks prior to surgery or biopsy (ABVS and SE at Center 1; ABVS at Center 2); (III) a US report indicating a maximum nodule diameter ≤2.0 cm and a BI-RADS category 4 assessment based on conventional handheld B-mode US findings, with SE findings not incorporated into the BI-RADS assessment; and (IV) complete clinical and imaging data. The exclusion criteria were as follows: (I) insufficient image quality for quantitative analysis; and/or (II) any prior interventional procedure or treatment involving the nodule before imaging.

In total, 526 patients from Center 1 were included in the study and randomly assigned to a training cohort and an internal validation cohort at a ratio of 7:3, and 160 patients from Center 2 were included as an external test cohort (Figure 1).
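The 7:3 allocation described above can be illustrated with a minimal sketch. It assumes a stratified split by pathology label and an arbitrary random seed; the text states only that the assignment was random, so both choices are assumptions. The class counts (286 benign, 240 malignant at Center 1) are taken from Table 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Center 1 labels reconstructed from Table 1: 200 + 86 benign, 168 + 72 malignant.
y = np.array([0] * 286 + [1] * 240)
patient_ids = np.arange(y.size)  # hypothetical identifiers, for illustration only

# Assumption: stratified by label; the seed is arbitrary.
train_ids, valid_ids, y_train, y_valid = train_test_split(
    patient_ids, y, test_size=0.3, stratify=y, random_state=42
)
print(len(train_ids), len(valid_ids))
```

With 526 patients, a test fraction of 0.3 rounds to 158 validation and 368 training cases, which matches the reported cohort sizes.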

Figure 1 Study workflow. Overview of data collection, image preprocessing, development of clinical US, habitat radiomics and DL models, and external validation, culminating in the multimodal fusion models. Center 1: The Third Xiangya Hospital, Central South University; Center 2: Changsha Central Hospital. ABVS, automated breast volume scanner; BI-RADS, Breast Imaging Reporting and Data System; DCA, decision curve analysis; DL, deep learning; DL-2D, two-dimensional deep learning; DL-2.5D, two-and-a-half-dimensional deep learning; Grad-CAM, gradient-weighted class activation mapping; IDI, integrated discrimination improvement; LightGBM, light gradient boosting machine; LASSO, least absolute shrinkage and selection operator; Rad-Habitat, radiomics-habitat; SE, strain elastography; US, ultrasound.

Image acquisition and segmentation

US examinations at both centers were performed on the same system (ACUSON S2000, Siemens Healthineers, Germany) using an identical acquisition protocol. At each center, all examinations were conducted by one designated radiologist with over 10 years of experience in breast disease diagnosis, thereby minimizing operator-dependent variability. For ABVS, the patients were placed in the supine position with both arms raised, and a 14L5BV linear transducer was used. The imaging depth (4–6 cm) and gain (8–10 dB) were adjusted according to breast thickness to optimize image quality. Each breast was scanned to ensure complete whole-breast volumetric coverage, and for each lesion, one coronal, one transverse, and one sagittal slice through the tumor axis with the largest tumor extent were selected for analysis. For SE, the patients were positioned identically and examined using a 9L4 probe. The lesion was first localized on B-mode, and the plane showing the maximum long-axis diameter was selected. Color Doppler was then activated to assess lesion vascularity, after which the imaging mode was switched to SE. The sampling box was set to 1.5–2.0 times the maximum lesion diameter, and the probe was held perpendicular to the skin with minimal precompression to obtain stable, reproducible elastograms.

The regions of interest (ROIs) were manually delineated using 3D Slicer (version 5.6.2). Initial segmentations were performed by a US radiologist with 8 years of experience in US diagnosis, and subsequently reviewed and, when necessary, refined by a senior radiologist with more than 20 years of experience to ensure accuracy and consistency. The finalized ROIs were used for subsequent habitat-radiomics feature extraction and DL model construction. To assess interobserver agreement, 30 cases were randomly selected and independently re-segmented by another radiologist with 10 years of experience. All Dice similarity coefficients were greater than 0.8 (20), indicating good contouring consistency.
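For reference, the Dice similarity coefficient used for the interobserver check is 2|A∩B|/(|A|+|B|) for two binary masks A and B. The following is a minimal sketch on toy masks; the function name and the handling of two empty masks are illustrative choices, not taken from the study.

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / total)

# Toy example: two overlapping square "lesion" masks on a 10x10 grid.
reader_1 = np.zeros((10, 10), dtype=np.uint8)
reader_2 = np.zeros((10, 10), dtype=np.uint8)
reader_1[2:8, 2:8] = 1  # 36 pixels
reader_2[3:9, 3:9] = 1  # 36 pixels, 25 of them shared with reader_1
dice = dice_coefficient(reader_1, reader_2)  # 2 * 25 / 72
```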

US models

Two radiologists, with 8 and 10 years of breast US experience, respectively, independently reviewed each lesion. The readers were blinded to the histopathologic results and documented the primary US features (composition, internal echogenicity, echogenic homogeneity, shape, orientation, margin, boundary, posterior acoustic changes, and intralesional microcalcifications) and associated findings (convergence sign, peripheral ductal dilatation, hyperechoic halo, perilesional edema, and Cooper’s ligament involvement) from the ABVS images. Additionally, color Doppler vascularity was recorded by the scanning operator at the time of examination, extracted from the original reports, and qualitatively graded as absent, low, moderate, or abundant according to the Adler classification (21,22). The elasticity assessment was performed based on SE elastograms. The elasticity score was assessed using the Tsukuba elasticity score (Itoh 5-point scale) (23), with higher scores indicating greater stiffness. On elastograms in our system, red represented harder tissue (lower strain), and green represented softer tissue (higher strain). Any discrepancies between the two readers were adjudicated by a senior radiologist with more than 20 years of experience.

The US features significantly associated with malignancy in the training cohort [chi-squared (χ²) test or Fisher’s exact test, as appropriate; P<0.05] were used to construct logistic regression (LR) models based on ABVS features alone and on ABVS combined with SE features, yielding the US (ABVS) and US (ABVS + SE) models, respectively.
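The univariable screening step can be sketched as follows, using the training-cohort margin counts from Table 1. The rule for switching to Fisher's exact test (any expected cell count below 5) and the default continuity correction in SciPy are assumptions; the text says only that the tests were chosen "as appropriate".

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def association_p_value(table_2x2):
    """Return (p value, test name) for a 2x2 feature-by-pathology table."""
    table = np.asarray(table_2x2)
    _, p_chi2, _, expected = chi2_contingency(table)
    if expected.min() < 5:  # assumed switching rule (small expected counts)
        _, p_fisher = fisher_exact(table)
        return p_fisher, "fisher"
    return p_chi2, "chi2"

# Rows: circumscribed / non-circumscribed margin; columns: benign / malignant
# (training cohort, Table 1).
margin_training = [[108, 41], [92, 127]]
p_margin, test_used = association_p_value(margin_training)
keep_margin = p_margin < 0.05  # features passing this screen enter the LR model
```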

Radiomics-habitat (Rad-Habitat) models

Habitat subregion generation

Manual lesion masks were used to guide a lesion-centered crop, and the lesion ROI was defined by the mask. To characterize intratumoral heterogeneity while avoiding the instability that often arises from very high-dimensional pixel-wise feature maps, local texture mapping was performed using a small set of representative and reproducible second-order descriptors. Specifically, five gray-level co-occurrence matrix (GLCM) features—contrast, difference entropy, joint energy, joint entropy, and correlation—were selected to jointly capture local intensity variation. This choice was motivated by evidence that habitat partitioning becomes more stable when computed from a limited subset of robust GLCM features rather than an exhaustive feature pool, and clinically meaningful habitats can be derived even from a single GLCM-based local texture map (24,25). In each lesion ROI, a 5×5 neighborhood centered on each pixel was used to compute local GLCM-based texture features (PyRadiomics version 3.0.1). After minimum-maximum normalization of texture vectors to [0, 1], a Gaussian mixture model was fitted on pixel-wise texture vectors from the training cohort, and then fixed and applied to the internal validation and external test cohorts to ensure consistent habitat assignment across datasets. The number of clusters (k=2–9) was selected by maximizing the training-cohort silhouette coefficient, and pseudo-color habitat maps were generated to visualize the spatial distribution of intratumoral subregions. For each habitat subregion, 648 radiomics features were extracted from the original images and wavelet-transformed counterparts, including first-order statistics and texture features derived from GLCM, gray-level size zone matrix (GLSZM), and gray-level run-length matrix (GLRLM), following Image Biomarker Standardization Initiative (IBSI) conventions.
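The clustering step described above (min-max scaling, a Gaussian mixture fitted on training pixels, k chosen by the silhouette coefficient, and the fitted model reused unchanged on other cohorts) can be sketched as below. Synthetic five-dimensional vectors stand in for the pixel-wise GLCM texture vectors, and all variable names and mixture settings are illustrative assumptions rather than the study's configuration.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for pixel-wise texture vectors (5 GLCM features per pixel).
centers = np.array([
    [0.2, 0.2, 0.8, 0.8, 0.5],
    [0.8, 0.2, 0.2, 0.5, 0.8],
    [0.5, 0.8, 0.5, 0.2, 0.2],
])
train_raw = np.vstack([c + rng.normal(0, 0.03, size=(100, 5)) for c in centers])
new_raw = centers[1] + rng.normal(0, 0.03, size=(50, 5))  # "unseen cohort" pixels

# Min-max bounds come from the training data only and are reused afterwards,
# so unseen values may fall slightly outside [0, 1].
lo, hi = train_raw.min(axis=0), train_raw.max(axis=0)

def min_max(x):
    return (x - lo) / (hi - lo)

train = min_max(train_raw)
scores, models = {}, {}
for k in range(2, 10):  # candidate cluster numbers k = 2..9
    gmm = GaussianMixture(n_components=k, random_state=0).fit(train)
    labels = gmm.predict(train)
    if len(np.unique(labels)) < 2:
        continue  # silhouette is undefined for a single cluster
    scores[k] = silhouette_score(train, labels)
    models[k] = gmm

best_k = max(scores, key=scores.get)
habitat_model = models[best_k]  # frozen after training
new_labels = habitat_model.predict(min_max(new_raw))  # consistent habitat labels
```

Freezing both the scaling bounds and the mixture parameters is what keeps a given habitat label comparable across cohorts.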

Feature selection and model construction

In the training cohort, a structured, modality-specific feature selection pipeline was applied to the ABVS- and SE-derived radiomics feature sets. In each modality, all radiomics features were first standardized using Z-score normalization. An initial filtering step was then performed using the Mann-Whitney U test, followed by removal of highly collinear features (Pearson correlation coefficient >0.9). Based on this, the minimum redundancy maximum relevance (mRMR) algorithm was applied to prioritize features that were strongly associated with the target label while exhibiting minimal redundancy. Subsequently, the least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation was used to determine the optimal penalty parameter λ and derive a compact, discriminative feature subset for each modality.
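A simplified sketch of this selection pipeline follows. It covers Z-score normalization, the Mann-Whitney U filter, greedy removal of features with |Pearson r| above 0.9, and LASSO with 10-fold cross-validation; the mRMR ranking step is omitted, and the P<0.05 filter threshold, the greedy removal order, and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def select_features(X, y, p_thresh=0.05, corr_thresh=0.9, seed=0):
    """Return column indices that survive the filter, collinearity, and LASSO steps."""
    Xz = StandardScaler().fit_transform(X)  # Z-score normalization
    # Step 1: univariable Mann-Whitney U filter (threshold is an assumption).
    passed = [
        j for j in range(Xz.shape[1])
        if mannwhitneyu(Xz[y == 1, j], Xz[y == 0, j],
                        alternative="two-sided").pvalue < p_thresh
    ]
    if not passed:
        return []
    # Step 2: keep a feature only if |r| <= corr_thresh with every kept feature.
    corr = np.atleast_2d(np.abs(np.corrcoef(Xz[:, passed], rowvar=False)))
    kept_pos = []
    for pos in range(len(passed)):
        if all(corr[pos, q] <= corr_thresh for q in kept_pos):
            kept_pos.append(pos)
    kept = [passed[pos] for pos in kept_pos]
    # Step 3 (mRMR ranking) is omitted in this sketch.
    # Step 4: LASSO with 10-fold cross-validation chooses the penalty.
    lasso = LassoCV(cv=10, random_state=seed).fit(Xz[:, kept], y)
    return [j for j, w in zip(kept, lasso.coef_) if w != 0]

# Synthetic demonstration: feature 0 is informative, feature 1 nearly duplicates it.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 30))
X[:, 0] += 2.0 * y
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)
selected = select_features(X, y)
```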

Rad-Habitat (ABVS)

Using the habitat-based radiomics features derived from ABVS images, five supervised learning models were constructed: logistic regression (LR), support vector machine (SVM), random forest (RF), extra-trees (ET), and light gradient boosting machine (LightGBM). Model hyperparameters were optimized and fixed using stratified five-fold cross-validation within the training cohort, after which the model with the highest area under the curve (AUC) and the best cross-cohort stability in the internal validation cohort was selected as the final Rad-Habitat (ABVS) model and further evaluated in the external test cohort.
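The cross-validated comparison of candidate classifiers can be sketched as below, on synthetic data with 18 features. Default hyperparameters stand in for the tuned ones, LightGBM is left out so that the sketch depends on scikit-learn only, and the sketch covers only the within-training cross-validation step, not the subsequent selection on the internal validation cohort.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for 18 selected habitat-radiomics features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 18))
X[:, :3] += y[:, None]  # three weakly informative features

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),  # roc_auc scoring uses its decision_function
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "ET": ExtraTreesClassifier(n_estimators=200, random_state=0),
    # LightGBM (lightgbm.LGBMClassifier) would be added here if installed.
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
best_name = max(cv_auc, key=cv_auc.get)
```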

Rad-Habitat (SE)

For the SE-based habitat-radiomics features, the same modeling and selection strategy was applied (i.e., construction of the LR, SVM, RF, ET, and LightGBM models, hyperparameter tuning with stratified five-fold cross-validation in the training cohort, and performance comparison across the cohorts with SE data, namely the training and internal validation cohorts). The model achieving the highest and most stable AUC in the internal validation cohort was designated as the final Rad-Habitat (SE) model.

DL models

Data augmentation

During model training, a dynamic data augmentation strategy was employed to improve the network’s robustness to shape and orientation variations. Specifically, random cropping and horizontal and vertical flipping were applied to the input tumor subregions, increasing sample diversity and reducing the risk of overfitting. These geometric transformations were only applied during the training phase; no data augmentation was performed for the internal validation or external test sets to ensure objective and consistent model evaluation.

Data normalization

SE images were standardized using Z-score normalization across red-green-blue (RGB) channels, while ABVS slices were normalized to a [−1, 1] grayscale range to mitigate intensity scale differences. Each cropped tumor patch was resized to 224×224 pixels using interpolation. This intensity normalization and resampling pipeline was applied identically to the training, internal validation, and external test sets to ensure comparability and reproducibility of forward inference.
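The two intensity normalizations can be sketched as follows. The sketch assumes 8-bit inputs, a linear mapping of [0, 255] to [−1, 1] for ABVS, and per-channel statistics computed on the training images for SE; none of these details is specified in the text, and resizing to 224×224 is not shown.

```python
import numpy as np

def normalize_abvs(gray_uint8: np.ndarray) -> np.ndarray:
    """Map 8-bit grayscale values linearly from [0, 255] to [-1, 1]."""
    return gray_uint8.astype(np.float32) / 127.5 - 1.0

def normalize_se(rgb_uint8: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Z-score each RGB channel using statistics estimated on the training set."""
    scaled = rgb_uint8.astype(np.float32) / 255.0
    return (scaled - mean) / std

# Synthetic stand-in for a small batch of SE training images (N, H, W, 3).
rng = np.random.default_rng(0)
se_train = rng.integers(0, 256, size=(8, 224, 224, 3), dtype=np.uint8)
scaled_train = se_train.astype(np.float32) / 255.0
channel_mean = scaled_train.mean(axis=(0, 1, 2))  # one value per channel
channel_std = scaled_train.std(axis=(0, 1, 2))

se_norm = normalize_se(se_train[0], channel_mean, channel_std)
abvs_norm = normalize_abvs(np.array([[0, 255]], dtype=np.uint8))
```

Reusing the training-set statistics for the validation and test images is what keeps the preprocessing identical across cohorts.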

Training parameters

Given the specific characteristics of our imaging dataset, a cosine-annealing strategy was adopted to adjust the learning rate and enhance model generalization, with the learning rate dynamically updated according to the following equation:

$\eta_{t} = \eta_{\min}^{i} + \frac{1}{2}\left(\eta_{\max}^{i} - \eta_{\min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right)$ [1]

where $\eta_{\min}^{i}$ is set to 0, and $\eta_{\max}^{i}$ to 0.01. The parameter $T_{cur}$ denotes the current training progress measured in epochs (fractional epochs), and $T_{i}$ denotes the total number of epochs. Additionally, stochastic gradient descent (SGD) was used as the optimizer, and softmax cross-entropy as the loss function. The batch size was set to 20, and the total number of iterations was 570 (30 epochs).
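Equation [1] can be evaluated directly, as in the sketch below. The value of 19 iterations per epoch is inferred from the reported numbers (368 training cases at a batch size of 20 gives 19 batches, and 19 × 30 = 570) and is therefore an assumption, as are the function and variable names.

```python
import math

def cosine_annealed_lr(t_cur: float, t_total: float,
                       eta_max: float = 0.01, eta_min: float = 0.0) -> float:
    """Learning rate from Eq. [1]; t_cur and t_total are measured in epochs."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_total))

ITERS_PER_EPOCH = 19  # assumed: ceil(368 / 20)
TOTAL_EPOCHS = 30

# Per-iteration schedule using fractional epochs for t_cur.
schedule = [
    cosine_annealed_lr(step / ITERS_PER_EPOCH, TOTAL_EPOCHS)
    for step in range(ITERS_PER_EPOCH * TOTAL_EPOCHS)
]

print(cosine_annealed_lr(0, TOTAL_EPOCHS))   # starts at eta_max = 0.01
print(cosine_annealed_lr(15, TOTAL_EPOCHS))  # halfway point, about 0.005
```

The rate falls monotonically from 0.01 toward 0 over the 30 epochs, changing slowly at the start and end and fastest in the middle.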

2.5D DL model

To build the 2.5D model, coronal, sagittal, and transverse ABVS slices through each tumor were stacked as a three-channel input, preserving multiplanar spatial context. These images were fed into a transfer-learning framework with ImageNet-pretrained DenseNet-121, ResNet-101, and ResNet-50 backbones, which were fine-tuned on our cohort; the architecture with the highest and most stable AUC in the internal validation cohort was designated as the DL-2.5D model and subsequently evaluated in the external test cohort.
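The three-channel stacking can be sketched as below. The channel order (coronal, sagittal, transverse) and the channel-first layout are assumptions made for illustration; the slices are assumed to be already cropped, normalized, and resized to a common shape.

```python
import numpy as np

def stack_planes(coronal, sagittal, transverse) -> np.ndarray:
    """Stack three same-sized grayscale slices into one (3, H, W) array."""
    planes = [np.asarray(p, dtype=np.float32) for p in (coronal, sagittal, transverse)]
    if len({p.shape for p in planes}) != 1:
        raise ValueError("all three planes must share one shape")
    return np.stack(planes, axis=0)  # channel-first, as PyTorch CNNs expect

# Synthetic 224x224 slices standing in for preprocessed ABVS crops.
rng = np.random.default_rng(0)
cor, sag, tra = (rng.uniform(-1, 1, size=(224, 224)) for _ in range(3))
x_25d = stack_planes(cor, sag, tra)
```

Because ImageNet-pretrained backbones already expect three input channels, this layout lets the first convolution be reused without architectural changes.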

Two-dimensional (2D) DL model

For the 2D model, lesion ROIs were cropped from SE images while retaining their original RGB pseudo-color and formatted as three-channel inputs. The SE dataset was processed in the same transfer-learning pipeline using the same three backbones, fine-tuned on our data; the network with the highest and most stable AUC in the internal validation cohort was defined as the DL-2D model.

Combined models

Combined (ABVS) model

Following a late-fusion strategy, the predicted malignancy probabilities from the US (ABVS), Rad-Habitat (ABVS), and DL-2.5D models were used as input features for an LR model. The LR meta-model was fitted using the training cohort, and the fitted coefficients were then fixed and applied unchanged to the internal validation and external test cohorts.
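The late-fusion step can be sketched as a logistic regression stacked on three probability columns. The probabilities below are synthetic, and the default regularization of scikit-learn's LogisticRegression is an assumption; the study's actual meta-model settings are given in its Appendix 1, not here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fake_probs(y: np.ndarray, noise: float) -> np.ndarray:
    """Synthetic single-model probabilities loosely correlated with the label."""
    raw = 0.5 + (y - 0.5) * 0.6 + rng.normal(0, noise, size=y.size)
    return np.clip(raw, 0.01, 0.99)

y_train = rng.integers(0, 2, size=368)
y_valid = rng.integers(0, 2, size=158)

# Columns stand in for the US (ABVS), Rad-Habitat (ABVS), and DL-2.5D outputs.
P_train = np.column_stack([fake_probs(y_train, s) for s in (0.30, 0.25, 0.20)])
P_valid = np.column_stack([fake_probs(y_valid, s) for s in (0.30, 0.25, 0.20)])

meta = LogisticRegression().fit(P_train, y_train)  # fitted on training data only
fused_valid = meta.predict_proba(P_valid)[:, 1]    # coefficients applied unchanged
```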

Combined (ABVS + SE) model

To further incorporate SE-derived information, the predicted malignancy probabilities from US (ABVS + SE), Rad-Habitat (ABVS), Rad-Habitat (SE), DL-2.5D, and DL-2D were used as input features for an LR meta-model constructed in the same way (detailed model specifications are provided in Appendix 1).

To examine model applicability in more challenging scenarios, subgroup analyses were further performed in BI-RADS 4a lesions across the training, internal validation, and external test cohorts.

Statistical analysis

All statistical analyses were performed in Python 3.7.12. The machine-learning models (LR, SVM, RF, ET, and LightGBM) were implemented using scikit-learn and the LightGBM LGBMClassifier interface, while the DL models were developed in PyTorch using graphics processing unit (GPU) acceleration. Categorical variables were summarized as n (%), and compared between groups using the χ² test or Fisher’s exact test, as appropriate. All tests were two-sided with a significance threshold of P<0.05.

Model discrimination was assessed primarily using the AUC, with pairwise comparisons performed using the DeLong test. Accuracy, sensitivity, specificity, PPV, and negative predictive value (NPV) were also reported. Clinical utility was evaluated by decision curve analysis (DCA). Net reclassification improvement (NRI) and integrated discrimination improvement (IDI) were calculated to quantify gains in risk stratification and discrimination, and P values for IDI were reported. Calibration curves and the Brier score were used to assess agreement between the predicted and observed probabilities.
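Three of these metrics have simple closed forms, sketched below with their standard definitions: the Brier score (mean squared difference between predicted probability and outcome), the DCA net benefit at threshold p_t (TP/n − FP/n × p_t/(1 − p_t)), and the IDI (the difference in discrimination slopes between two models). The toy arrays and function names are illustrative; the DeLong test and NRI are not shown.

```python
import numpy as np

def brier_score(y: np.ndarray, p: np.ndarray) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((p - y) ** 2))

def net_benefit(y: np.ndarray, p: np.ndarray, threshold: float) -> float:
    """Decision-curve net benefit of intervening when p >= threshold."""
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return float(tp / y.size - fp / y.size * threshold / (1.0 - threshold))

def idi(y: np.ndarray, p_new: np.ndarray, p_old: np.ndarray) -> float:
    """Integrated discrimination improvement of the new model over the old one."""
    def slope(p):
        return p[y == 1].mean() - p[y == 0].mean()
    return float(slope(p_new) - slope(p_old))

# Toy data: two malignant and two benign lesions.
y = np.array([1, 1, 0, 0])
p_new = np.array([0.9, 0.6, 0.4, 0.1])
p_old = np.array([0.7, 0.5, 0.5, 0.3])

bs = brier_score(y, p_new)       # (0.01 + 0.16 + 0.16 + 0.01) / 4 = 0.085
nb = net_benefit(y, p_new, 0.5)  # 2 true positives, 0 false positives -> 0.5
gain = idi(y, p_new, p_old)      # (0.75 - 0.25) - (0.60 - 0.40) = 0.3
```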

Results

Cohort baseline characteristics

A total of 686 ≤2 cm BI-RADS 4 breast lesions (from 686 women) were included in the study, of which 308 were malignant (44.90%) and 378 were benign (55.10%). By BI-RADS subclass, 465 (67.78%), 105 (15.31%), and 116 (16.91%) lesions were categorized as 4a, 4b, and 4c, respectively. The patients were allocated to a training cohort (n=368), an internal validation cohort (n=158), and an external test cohort (n=160), with comparable baseline clinical and US characteristics across cohorts (all P>0.05; Table 1).

Table 1

Baseline characteristics of patients and lesions in the three cohorts

Features	Training cohort			Internal validation cohort			External test cohort			Overall P value
Features	Benign (n=200)	Malignant (n=168)	P value	Benign (n=86)	Malignant (n=72)	P value	Benign (n=92)	Malignant (n=68)	P value	Overall P value
Age (years)			<0.001**			0.016*			0.003*	0.692
≤45	122 (61.00)	54 (32.14)		50 (58.14)	28 (38.89)		57 (61.96)	26 (38.24)
>45	78 (39.00)	114 (67.86)		36 (41.86)	44 (61.11)		35 (38.04)	42 (61.76)
BI-RADS			<0.001**			<0.001**			<0.001**	0.682
4a	189 (94.50)	57 (33.93)		77 (89.53)	27 (37.50)		81 (88.05)	34 (50.00)
4b	9 (4.50)	46 (27.38)		6 (6.98)	21 (29.17)		9 (9.78)	14 (20.59)
4c	2 (1.00)	65 (38.69)		3 (3.49)	24 (33.33)		2 (2.17)	20 (29.41)
Morphology			0.151			0.135			0.091	0.804
Regular	48 (24.00)	30 (17.86)		20 (23.26)	10 (13.89)		22 (23.91)	9 (13.24)
Irregular	152 (76.00)	138 (82.14)		66 (76.74)	62 (86.11)		70 (76.09)	59 (86.76)
Boundary			0.336			0.098			0.393	0.696
Well-defined	80 (40.00)	59 (35.12)		36 (41.86)	21 (29.17)		40 (43.48)	25 (36.77)
Ill-defined	120 (60.00)	109 (64.88)		50 (58.14)	51 (70.83)		52 (56.52)	43 (63.23)
Margin			<0.001**			<0.001**			<0.001**	0.265
Circumscribed	108 (54.00)	41 (24.41)		52 (60.46)	16 (22.22)		61 (66.30)	16 (23.53)
Non-circumscribed	92 (46.00)	127 (75.59)		34 (39.54)	56 (77.78)		31 (33.70)	52 (76.47)
Composition			0.004*			0.072			0.468	0.969
Solid	185 (92.50)	166 (98.81)		79 (91.86)	71 (98.61)		86 (93.48)	66 (97.06)
Cystic-solid	15 (7.50)	2 (1.19)		7 (8.14)	1 (1.39)		6 (6.52)	2 (2.94)
Echogenicity			0.002*			0.098			0.001*	0.348
Strongly hypoechoic	4 (2.00)	7 (4.17)		2 (2.33)	5 (6.94)		1 (1.09)	5 (7.35)
Hypoechoic	177 (88.50)	159 (94.64)		78 (90.70)	66 (91.67)		77 (83.69)	62 (91.18)
Isoechoic	19 (9.50)	2 (1.19)		6 (6.97)	1 (1.39)		14 (15.22)	1 (1.47)
Echogenic homogeneity			0.107			0.119			0.065	0.299
Homogeneous	168 (84.00)	130 (77.38)		72 (83.72)	53 (73.61)		64 (69.57)	56 (82.35)
Heterogeneous	32 (16.00)	38 (22.62)		14 (16.28)	19 (26.39)		28 (30.43)	12 (17.65)
Cooper’s ligament involvement			0.014*			0.044*			0.055	0.863
Absent	191 (95.50)	149 (88.69)		84 (97.67)	64 (88.89)		89 (96.74)	60 (88.24)
Presence	9 (4.50)	19 (11.31)		2 (2.33)	8 (11.11)		3 (3.26)	8 (11.76)
Peripheral tissue edema			0.017*			0.044*			0.050	0.179
Absent	193 (96.50)	152 (90.48)		84 (97.67)	64 (88.89)		86 (93.48)	57 (83.82)
Presence	7 (3.50)	16 (9.52)		2 (2.33)	8 (11.11)		6 (6.52)	11 (16.18)
Hyperechoic halo			<0.001**			0.012*			0.002*	0.712
Absent	199 (99.50)	152 (90.48)		85 (98.84)	64 (88.89)		91 (98.91)	59 (86.77)
Presence	1 (0.50)	16 (9.52)		1 (1.16)	8 (11.11)		1 (1.09)	9 (13.23)
Convergence sign			0.009*			0.044*			0.011*	0.601
Absent	191 (95.50)	148 (88.10)		84 (97.67)	64 (88.89)		88 (95.65)	57 (83.82)
Presence	9 (4.50)	20 (11.90)		2 (2.33)	8 (11.11)		4 (4.35)	11 (16.18)
Microcalcifications			<0.001**			0.002*			0.002*	0.459
Absent	170 (85.00)	103 (61.31)		76 (88.37)	49 (68.06)		77 (83.70)	42 (61.77)
Presence	30 (15.00)	65 (38.69)		10 (11.63)	23 (31.94)		15 (16.30)	26 (38.23)
Orientation			<0.001**			0.047*			0.101	0.622
Parallel	150 (75.00)	87 (51.79)		62 (72.09)	41 (56.94)		68 (73.91)	42 (61.77)
Non-parallel	50 (25.00)	81 (48.21)		24 (27.91)	31 (43.06)		24 (26.09)	26 (38.23)
Posterior acoustic change			0.136			0.076			0.121	0.595
Enhancement	27 (13.50)	19 (11.31)		7 (8.14)	6 (8.33)		7 (7.61)	8 (11.76)
No change	165 (82.50)	134 (79.76)		77 (89.53)	58 (80.56)		81 (88.04)	52 (76.47)
Attenuation	8 (4.00)	15 (8.93)		2 (2.33)	8 (11.11)		4 (4.35)	8 (11.76)
Peripheral ductal dilatation			0.433			0.653			0.595	0.83
Absent	178 (89.00)	145 (86.31)		75 (87.21)	61 (84.72)		80 (86.96)	61 (89.71)
Presence	22 (11.00)	23 (13.69)		11 (12.79)	11 (15.28)		12 (13.04)	7 (10.29)
Color Doppler vascularity			0.03*			0.174			0.009*	0.872
Absent	89 (44.50)	52 (30.95)		39 (45.35)	21 (29.17)		45 (48.91)	16 (23.53)
Low	84 (42.00)	80 (47.62)		33 (38.37)	32 (44.44)		33 (35.87)	33 (48.53)
Moderate	19 (9.50)	22 (13.10)		9 (10.47)	12 (16.67)		11 (11.96)	13 (19.12)
Abundant	8 (4.00)	14 (8.33)		5 (5.81)	7 (9.72)		3 (3.26)	6 (8.82)
Elasticity score^†			<0.001**			<0.001**			NA	0.745
1	16 (8.00)	0		3 (3.49)	0		NA	NA
2	50 (25.00)	11 (6.55)		20 (23.26)	6 (8.33)		NA	NA
3	91 (45.50)	31 (18.45)		44 (51.16)	11 (15.28)		NA	NA
4	40 (20.00)	68 (40.48)		16 (18.60)	31 (43.06)		NA	NA
5	3 (1.50)	58 (34.52)		3 (3.49)	24 (33.33)		NA	NA

Data are presented as n (%); percentages may not total 100% due to rounding. ^†, elasticity score was unavailable in the external test cohort; therefore, for this variable, the overall P value compares the training and internal validation cohorts only. *, P<0.05; **, P<0.001. BI-RADS, Breast Imaging Reporting and Data System; NA, not available.

US models

In the training cohort, the patients with malignant lesions were significantly older than those with benign lesions. Compared with the benign nodules, the malignant lesions more frequently appeared solid and markedly hypoechoic, with non-circumscribed margins and non-parallel orientation, and were more likely to show internal microcalcifications, a retraction phenomenon on coronal ABVS, a peripheral echogenic halo, perilesional edema, Cooper’s ligament involvement, richer Doppler flow, and higher elastography scores (all P<0.05). Overall, similar patterns were observed in the internal validation and external test cohorts (Table 1).

The US variables significantly associated with malignancy in the training cohort were used to construct the US (ABVS) model, which yielded AUCs of 0.858, 0.820, and 0.808 in the training, internal validation, and external test cohorts, respectively (Figure 2). After incorporating SE-derived elasticity information in addition to ABVS features, the US (ABVS + SE) model achieved AUCs of 0.894 and 0.852 in the training and internal validation cohorts (Figure 2), respectively, corresponding to absolute increases of 0.036 and 0.032 over the US (ABVS) model.

Figure 2 Receiver operating characteristic curves for the clinical US models in the training, internal validation, and external test cohorts. ABVS, automated breast volume scanner; AUC, area under the curve; CI, confidence interval; SE, strain elastography; US, ultrasound.

Rad-Habitat models

Rad-Habitat (ABVS)

Using the silhouette coefficient, the optimal number of clusters was set to k=3, partitioning each tumor into three intratumoral habitats (Figure 3A). From the coronal, transverse, and sagittal ABVS planes, 648 radiomics features per habitat were extracted, yielding 5,832 features per lesion. After multi-step feature selection, 18 key features (6 coronal, 5 transverse, and 7 sagittal) were retained (Figure 3B), indicating that habitat-level texture information from all three planes contributed to benign–malignant discrimination. Among candidate classifiers, RF achieved the best internal validation performance (AUC 0.881) with consistent generalization to the external test set (AUC 0.870), compared with LightGBM (AUC 0.865/0.835), ET (AUC 0.844/0.836), SVM (AUC 0.826/0.818), and LR (AUC 0.823/0.817) (Figure 3C). Accordingly, RF was selected as the final Rad-Habitat (ABVS) model based on its superior AUC in the internal validation cohort, and its stability was further supported by the smallest AUC decrease from the training to internal/external cohorts (ΔAUC =0.025/0.036).

Figure 3 Habitat-based radiomics analysis and performance of the Rad-Habitat models. (A) Representative pseudo-color habitat maps generated by pixel-wise radiomics feature clustering using a Gaussian mixture model, partitioning each lesion into three intratumoral habitats (Habitat 1–3) on SE and on the coronal, transverse, and sagittal planes of ABVS. (B) Coefficients of the selected habitat-radiomics features with non-zero weights at the optimal penalty parameter (λ) in the LASSO model (upper, ABVS-based model, 18 features; lower, SE-based model, 14 features). (C) ROC curves and corresponding AUCs of five machine-learning classifiers—LR, SVM, RF, ET, and LightGBM—for Rad-Habitat (ABVS) in the training, internal validation, and external test cohorts, and for Rad-Habitat (SE) in the training and internal validation cohorts (SE available at The Third Xiangya Hospital, Central South University only). ABVS, automated breast volume scanner; AUC, area under the curve; ET, extra-trees; GLCM, gray-level co-occurrence matrix; GLRLM, gray-level run-length matrix; GLSZM, gray-level size zone matrix; LASSO, least absolute shrinkage and selection operator; LightGBM, light gradient boosting machine; LR, logistic regression; Rad-Habitat, radiomics-habitat; RF, random forest; ROC, receiver operating characteristic; SE, strain elastography; SVM, support vector machine.

Rad-Habitat (SE)

Using the silhouette coefficient, k=3 was similarly selected for SE-based habitat partitioning (Figure 3A). A total of 1,944 features per lesion were extracted and reduced to 14 selected features (Figure 3B). As SE was available only at Center 1, the model was evaluated in the training and internal validation cohorts. Among candidate classifiers, RF achieved the highest internal validation performance (AUC 0.845) compared with LightGBM (AUC 0.823), ET (AUC 0.821), SVM (AUC 0.816), and LR (AUC 0.814) (Figure 3C), with acceptable stability (ΔAUC =0.049). Accordingly, RF was selected as the final Rad-Habitat (SE) model for subsequent fusion.

DL models

DL-2.5D model

Among the backbones, ResNet-50 achieved the best performance in the training/internal validation/external test cohorts (AUC 0.923/0.887/0.879), outperforming ResNet-101 (AUC 0.916/0.876/0.858) and DenseNet-121 (AUC 0.900/0.858/0.861), with the smallest training-to-internal-validation AUC drop (ΔAUC 0.036 vs. 0.040 and 0.042, respectively) (Figure 4A). Based on internal validation performance and cross-cohort stability, ResNet-50 was selected as the backbone for the final DL-2.5D model, and its performance was subsequently confirmed in the external test cohort. Gradient-weighted class activation mapping (Grad-CAM) indicated that the ABVS-based DL-2.5D network mainly attended to the central hypoechoic solid portion of the lesion across coronal, sagittal, and transverse planes, implying effective use of near-3D spatial context from multiplanar inputs (Figure 4B).
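The backbone comparison can be re-derived from the AUCs quoted above. The following is a minimal sketch; the ranking rule (highest internal validation AUC, with AUC drops reported alongside) is an assumption about how the two criteria were combined, and the function and variable names are illustrative.

```python
# AUCs (training, internal validation, external test) as reported in the text.
reported_auc = {
    "ResNet-50": (0.923, 0.887, 0.879),
    "ResNet-101": (0.916, 0.876, 0.858),
    "DenseNet-121": (0.900, 0.858, 0.861),
}


def auc_drops(train, internal, external):
    """Summarize internal validation AUC and the drops relative to training."""
    return {
        "internal_auc": internal,
        "drop_to_internal": round(train - internal, 3),
        "drop_to_external": round(train - external, 3),
    }


summary = {name: auc_drops(*aucs) for name, aucs in reported_auc.items()}
# Assumed rule: pick the backbone with the highest internal validation AUC.
selected = max(summary, key=lambda name: summary[name]["internal_auc"])
```

On these figures ResNet-50 has the highest internal validation AUC and the smallest training-to-internal drop, while the training-to-external drops of ResNet-50 and DenseNet-121 are close (0.044 and 0.039).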

Figure 4 DL model performance and Grad-CAM visualization. (A) Receiver operating characteristic curves and corresponding AUCs comparing three backbone architectures (ResNet-50, ResNet-101, and DenseNet-121). Upper row: ABVS DL-2.5D model performance in the training, internal validation, and external test cohorts. Lower row: SE DL-2D model performance in the training and internal validation cohorts (SE available at The Third Xiangya Hospital, Central South University only). (B) Representative inputs and Grad-CAM heatmaps for the ABVS-based DL-2.5D model (three-plane ABVS slices) and the SE-based DL-2D model (SE image). Warmer colors indicate higher network attention. ABVS, automated breast volume scanner; AUC, area under the curve; CI, confidence interval; DL, deep learning; DL-2D, deep learning two-dimensional; DL-2.5D, deep learning two-and-a-half dimensional; Grad-CAM, gradient-weighted class activation mapping; SE, strain elastography.

DL-2D model

ResNet-50 also achieved the best and most stable internal validation performance (AUC 0.874) compared with ResNet-101 (AUC 0.853) and DenseNet-121 (AUC 0.856), and was therefore selected as the final backbone (Figure 4A). For the SE-based DL-2D model, Grad-CAM showed focal, patch-like high-activation areas within stiffer, texturally heterogeneous lesion cores (Figure 4B).

Combined models

Among the single-modality models, US (ABVS) provided an interpretable baseline (training/internal/external AUC 0.858/0.820/0.808), Rad-Habitat (ABVS) added a stable heterogeneity-based gain (AUC 0.906/0.881/0.870), and DL-2.5D further improved discrimination (AUC 0.923/0.887/0.879). The fusion of these three outputs increased the AUCs to 0.970, 0.948 and 0.942 in the training, internal validation, and external test cohorts, respectively (Table 2; Figure 5A), with all pairwise comparisons to the single-modality models remaining statistically significant in all three cohorts (P<0.05; Figure 5B). Calibration curves indicated close agreement between the predicted and observed probabilities (Figure 5C), with low Brier scores in the training, internal validation, and external test cohorts (0.062, 0.101, and 0.088, respectively). NRI and IDI were consistently positive, supporting improved risk reclassification (Figure 6A,6B). Moreover, compared with the US models and the Rad-Habitat models, the IDI gains of the Combined (ABVS) model were statistically significant across all cohorts (P<0.05; Figure 6C). DCA further demonstrated the highest net clinical benefit for the Combined (ABVS) model (Figure 6D). In the external test cohort, the accuracy, sensitivity, and specificity were 0.894, 0.853, and 0.924, respectively, with a PPV/NPV of 0.892/0.895, indicating a favorable balance between minimizing missed cancers and avoiding unnecessary interventions.
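As a worked check of how the threshold-based metrics relate to one another, the sketch below computes them from a confusion matrix. The counts are an assumption: they were back-calculated as one set of integers consistent with the reported external test rates (n=160) and are not given in the source.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, PPV and NPV from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Assumed counts (back-calculated, not reported): 68 malignant, 92 benign.
metrics = diagnostic_metrics(tp=58, fn=10, tn=85, fp=7)
```

Under these assumed counts, the five values agree with the reported 0.894, 0.853, 0.924, 0.892, and 0.895 to three decimal places.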

Table 2

Diagnostic performance of each model across the three cohorts

Cohort	Model	AUC (95% CI)	Accuracy	Sensitivity	Specificity	PPV	NPV
Training	US (ABVS)	0.858 (0.820–0.897)	0.783	0.643	0.900	0.844	0.750
	US (ABVS + SE)	0.894 (0.861–0.928)	0.815	0.732	0.885	0.842	0.797
	Rad-Habitat (SE)	0.894 (0.863–0.925)	0.772	0.542	0.965	0.929	0.715
	Rad-Habitat (ABVS)	0.906 (0.878–0.935)	0.802	0.702	0.885	0.837	0.780
	DL-2D	0.892 (0.860–0.924)	0.802	0.756	0.840	0.799	0.804
	DL-2.5D	0.923 (0.896–0.950)	0.867	0.780	0.940	0.916	0.836
	Combined (ABVS)	0.970 (0.954–0.986)	0.929	0.899	0.955	0.944	0.918
	Combined (ABVS + SE)	0.985 (0.974–0.995)	0.948	0.917	0.975	0.969	0.933
Internal validation	US (ABVS)	0.820 (0.752–0.888)	0.747	0.583	0.884	0.808	0.717
	US (ABVS + SE)	0.852 (0.789–0.914)	0.772	0.722	0.814	0.765	0.778
	Rad-Habitat (SE)	0.845 (0.784–0.906)	0.728	0.431	0.977	0.939	0.672
	Rad-Habitat (ABVS)	0.881 (0.828–0.934)	0.778	0.639	0.895	0.836	0.748
	DL-2D	0.874 (0.816–0.931)	0.797	0.819	0.779	0.756	0.837
	DL-2.5D	0.887 (0.835–0.939)	0.791	0.889	0.709	0.719	0.884
	Combined (ABVS)	0.948 (0.916–0.979)	0.861	0.917	0.814	0.805	0.921
	Combined (ABVS + SE)	0.969 (0.945–0.994)	0.911	0.944	0.884	0.872	0.950
External test	US (ABVS)	0.808 (0.739–0.878)	0.738	0.632	0.815	0.717	0.750
	Rad-Habitat (ABVS)	0.870 (0.814–0.926)	0.775	0.603	0.902	0.820	0.755
	DL-2.5D	0.879 (0.821–0.936)	0.844	0.765	0.902	0.852	0.838
	Combined (ABVS)	0.942 (0.905–0.978)	0.894	0.853	0.924	0.892	0.895

SE-based models were not evaluated in the external test cohort because SE images were unavailable at Changsha Central Hospital. ABVS, automated breast volume scanner; AUC, area under the receiver operating characteristic curve; CI, confidence interval; DL-2D, two-dimensional deep learning; DL-2.5D, two-and-a-half dimensional deep learning; NPV, negative predictive value; PPV, positive predictive value; Rad-Habitat, radiomics-habitat; SE, strain elastography; US, ultrasound.

Figure 5 Discrimination and calibration performance of the models in the training, internal validation, and external test cohorts. (A) ROC curves and corresponding AUCs for all models, including single-modality models: US (ABVS), US (ABVS + SE), Rad-Habitat (ABVS), Rad-Habitat (SE), and DL (DL-2D and DL-2.5D) models, and the combined models, Combined (ABVS) and Combined (ABVS + SE), showing their discriminatory ability across different cohorts. (B) Pairwise comparisons of AUCs using the DeLong test, demonstrating that the Combined (ABVS) and Combined (ABVS + SE) models significantly outperformed all single-modality models (all P<0.05). (C) Calibration curves for Combined (ABVS) and Combined (ABVS + SE). ABVS, automated breast volume scanner; AUC, area under the curve; DL, deep learning; DL-2D, deep learning two-dimensional; DL-2.5D, deep learning two-and-a-half dimensional; Rad-Habitat, radiomics-habitat; ROC, receiver operating characteristic; SE, strain elastography; US, ultrasound.

Figure 6 Risk reclassification and decision-analytic performance of the models. (A) NRI heatmaps for pairwise comparisons among all models [US (ABVS), US (ABVS + SE), Rad-Habitat (ABVS), Rad-Habitat (SE), DL-2.5D, DL-2D, Combined (ABVS), and Combined (ABVS + SE)] in each cohort. (B) IDI heatmaps for the corresponding pairwise comparisons in each cohort. (C) P values for IDI comparisons between models in each cohort. (D) DCA showing the net benefit of each model across a range of threshold probabilities in the training, internal validation, and external test cohorts. Overall, the combined models showed consistently favorable reclassification gains (NRI/IDI) and higher net benefit. ABVS, automated breast volume scanner; DCA, decision curve analysis; DL, deep learning; DL-2D, deep learning two-dimensional; DL-2.5D, deep learning two-and-a-half dimensional; IDI, integrated discrimination improvement; NRI, net reclassification improvement; Rad-Habitat, radiomics-habitat; SE, strain elastography; US, ultrasound.

Building on this ABVS-based framework, the Combined (ABVS + SE) model further incorporated SE-derived information, increasing the AUCs to 0.985 and 0.969 in the training and internal validation cohorts, respectively (Table 2; Figure 5A). Although the SE-based single-modality models showed slightly lower training and internal validation AUCs than their ABVS counterparts [Rad-Habitat (SE), 0.894/0.845; DL-2D, 0.892/0.874], they provided complementary information. When integrated with ABVS-derived morphologic, habitat-texture, and multiplanar deep features, SE yielded an additional performance gain: Combined (ABVS + SE) achieved a significantly higher AUC than Combined (ABVS) in the training cohort (P<0.05; Figure 5B), whereas no statistically significant difference was observed in the internal validation cohort (P>0.05; Figure 5B). It nevertheless significantly outperformed all single-modality models in terms of the AUC (P<0.05; Figure 5B) and showed consistently higher secondary performance metrics. Calibration curves indicated adequate calibration (Figure 5C), with lower Brier scores than those of the Combined (ABVS) model (0.042/0.069 vs. 0.062/0.101 for training/internal validation) and higher NRI/IDI. Further, the IDI gains of the Combined (ABVS + SE) model over each single-modality model were statistically significant in both cohorts (P<0.05; Figure 6A-6C). DCA also demonstrated superior net benefit across a broad range of threshold probabilities, indicating enhanced discrimination and greater clinical utility (Figure 6D).
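For readers less familiar with the secondary metrics, the Brier score and IDI can be sketched as follows. This is a minimal illustration using the standard definitions; the labels and predicted probabilities are hypothetical toy values, not study data, and the function names are illustrative.

```python
from statistics import mean


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return mean([(p - y) ** 2 for p, y in zip(probs, labels)])


def discrimination_slope(probs, labels):
    """Mean predicted probability in events minus that in non-events."""
    events = [p for p, y in zip(probs, labels) if y == 1]
    non_events = [p for p, y in zip(probs, labels) if y == 0]
    return mean(events) - mean(non_events)


def idi(new_probs, old_probs, labels):
    """Integrated discrimination improvement of a new model over an old one."""
    return (discrimination_slope(new_probs, labels)
            - discrimination_slope(old_probs, labels))


# Hypothetical outcomes (1 = malignant) and predictions from two models.
labels = [1, 1, 1, 0, 0, 0]
old_probs = [0.6, 0.5, 0.7, 0.4, 0.5, 0.3]
new_probs = [0.8, 0.7, 0.9, 0.2, 0.3, 0.1]

brier_old = brier_score(old_probs, labels)
brier_new = brier_score(new_probs, labels)
idi_value = idi(new_probs, old_probs, labels)
```

A lower Brier score and a positive IDI both indicate that the new model assigns higher probabilities to malignant lesions and lower probabilities to benign lesions than the comparator.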

BI-RADS 4a subgroup

In the BI-RADS 4a subgroup, where malignant phenotypes are more subtle and overlap substantially with benign lesions, the Combined (ABVS) model achieved AUCs of 0.934, 0.925, and 0.920 in the training, internal validation, and external test cohorts, respectively (Table 3; Figure 7), with corresponding accuracies of 0.927, 0.885, and 0.887, and NPVs consistently >0.92. In the same subgroup, the Combined (ABVS + SE) model further improved performance, with AUCs of 0.964 and 0.949, and accuracies of 0.935 and 0.923 in the training and internal validation cohorts, respectively; the NPVs remained high (0.944 and 0.960, respectively) (Table 3; Figure 7). Collectively, these results suggest that the combined models retain reliable discrimination in this diagnostically challenging BI-RADS 4a setting, supporting their potential utility for second-level risk stratification.

Table 3

Diagnostic performance of combined models in BI-RADS 4a lesions across the three cohorts

Cohort	Model	AUC (95% CI)	Accuracy	Sensitivity	Specificity	PPV	NPV
Training	Combined (ABVS)	0.934 (0.892–0.976)	0.927	0.825	0.958	0.855	0.948
	Combined (ABVS + SE)	0.964 (0.936–0.991)	0.935	0.807	0.974	0.902	0.944
Internal validation	Combined (ABVS)	0.925 (0.874–0.976)	0.885	0.815	0.909	0.759	0.933
	Combined (ABVS + SE)	0.949 (0.897–1.000)	0.923	0.889	0.935	0.828	0.960
External test	Combined (ABVS)	0.920 (0.864–0.977)	0.887	0.824	0.914	0.800	0.925

ABVS, automated breast volume scanner; AUC, area under the curve; CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value; SE, strain elastography.

Figure 7 Receiver operating characteristic curves of the combined models in the ≤2 cm BI-RADS 4a subgroup. ABVS, automated breast volume scanner; AUC, area under the curve; BI-RADS, Breast Imaging Reporting and Data System; CI, confidence interval; SE, strain elastography.

Discussion

This dual-center study addressed the challenging imaging-based differentiation of ≤2 cm BI-RADS 4 breast lesions by systematically evaluating US, Rad-Habitat, and DL models. The multimodal combined models achieved excellent, consistent discrimination across the three cohorts (AUC 0.942–0.985) and outperformed all single-modality models, supporting improved robustness and cross-center generalizability.

BI-RADS 4 lesions display marked intratumoral heterogeneity, and routine histologic verification of all such lesions often leads to unnecessary biopsies and increased patient burden in the setting of small-volume tumors. Our clinical US model, derived from visually assessable morphologic and functional features, captured a canonical malignant phenotype characterized by solid, markedly hypoechoic masses with non-circumscribed margins, non-parallel orientation, and architectural distortion, together with microcalcifications, peritumoral stromal changes, hypervascularity, and increased stiffness. However, in small BI-RADS 4 nodules, these signs are often subtle, blurred, and difficult to quantify, which reduces information density and interobserver consistency; consequently, cross-cohort discrimination was modest (internal/external AUC 0.820/0.808). Adding SE-derived elasticity scores increased the internal AUC to 0.852 but did not fully capture complex intratumoral heterogeneity. These findings highlight the need for complementary, multimodal, multiscale approaches to achieve more robust discrimination and clinically meaningful risk stratification.

Motivated by the limited information density of subtle morphologic cues in small BI-RADS 4 lesions, we adopted a habitat-based strategy to quantify intratumoral heterogeneity that is difficult to assess visually. In such small lesions, macroscopic differences are often limited, whereas subtle tumor-stroma interface alterations can leave quantifiable “fingerprints” in texture and stiffness patterns (26). In our analysis, the tumors were partitioned into three habitats, enabling local heterogeneity to be summarized as stable radiomic signatures (27,28). Across ABVS and SE, the malignant lesions tended to show higher texture/stiffness complexity and non-uniformity (e.g., entropy- and non-uniformity-related metrics), whereas the benign lesions more often exhibited homogeneity-related patterns, consistent with prior SE observations of more uniform stiffness in benign disease (29). Collectively, multiplanar and multimodal habitat features provide objective heterogeneity signatures that complement conventional US assessment in small BI-RADS 4 lesions.

Recent studies suggest that DL can refine risk stratification within BI-RADS 4 lesions, with reported AUCs ranging from 0.86 to 0.95, and may help reduce unnecessary biopsies (30-32). However, most previous research has relied on a single modality and predominantly evaluated heterogeneous BI-RADS 4 cohorts, with limited focus on the more challenging subgroup of small (≤2 cm) lesions. To better leverage the inherently multiplanar nature of ABVS, we stacked coronal, sagittal, and transverse slices to form 2.5D inputs, introducing cross-plane spatial context while retaining the training efficiency and stability of 2D backbones. In parallel, SE-based DL-2D provides complementary mechanical information, enabling the network to learn stiffness gradients and tumor-stroma transition patterns (e.g., peripheral stiffening with patchy intralesional heterogeneity), which are difficult to quantify on grayscale imaging alone.
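The 2.5D input construction described above can be sketched in a few lines. This is a minimal illustration with nested lists; the channel order, the requirement that the three slices already share one size, and the function name are assumptions, and any resizing or normalization is omitted.

```python
def stack_planes(coronal, sagittal, transverse):
    """Stack three same-sized 2-D slices into a channels-first 3 x H x W input."""
    planes = [coronal, sagittal, transverse]
    height, width = len(coronal), len(coronal[0])
    for plane in planes:
        if len(plane) != height or any(len(row) != width for row in plane):
            raise ValueError("all three planes must share one height x width")
    # Copy the rows so the stacked input does not alias the original slices.
    return [[list(row) for row in plane] for plane in planes]


# Hypothetical 2 x 2 grayscale patches standing in for cropped lesion slices.
coronal = [[0.1, 0.2], [0.3, 0.4]]
sagittal = [[0.5, 0.6], [0.7, 0.8]]
transverse = [[0.9, 1.0], [0.0, 0.1]]

stacked = stack_planes(coronal, sagittal, transverse)
```

Because the result has three channels, it has the same shape as an RGB image, which is why a standard 2D backbone can consume it without architectural changes.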

US features provide interpretable macroscopic morphologic and functional cues, habitat radiomics quantifies intratumoral heterogeneity, and DL captures complex texture and morphologic patterns. As these signatures are only partially redundant and exhibit partly distinct error patterns, probability-level fusion can improve discrimination via complementary information integration. In our study, the combined models consistently improved discrimination and risk stratification for ≤2 cm BI-RADS 4 lesions across cohorts, with good calibration and a clinically favorable net benefit. We further examined the BI-RADS 4a subgroup, where the low-specificity burden is greatest and many biopsied lesions prove benign. Prior BI-RADS 4a-focused multimodal US-based models reported AUCs of 0.861–0.911 in internal validation and external test cohorts (33,34). In our ≤2 cm BI-RADS 4a subgroup, the combined models achieved AUCs of 0.920–0.949 (internal/external), with accuracies ≥0.88 and NPVs ≥0.92. Collectively, these findings indicate that the proposed fusion strategy remains robust in this diagnostically challenging subgroup and, pending prospective and broader external validation, may support second-level risk stratification and shared decision-making within BI-RADS 4 assessments, potentially reducing unnecessary biopsies in carefully selected low-risk cases.
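Probability-level fusion can be sketched as a small stacking model over the single-model outputs. The example below is a minimal illustration under stated assumptions: logistic regression fitted by batch gradient descent is one plausible fusion estimator and is not confirmed as the authors' method, and the per-lesion probabilities (ordered US, Rad-Habitat, DL) are hypothetical toy values in which each single model errs on a different lesion.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fused_probability(weights, bias, probs):
    """Fused malignancy probability from one lesion's single-model outputs."""
    return sigmoid(bias + sum(w * p for w, p in zip(weights, probs)))


def log_loss(weights, bias, rows, labels):
    total = 0.0
    for probs, y in zip(rows, labels):
        p = fused_probability(weights, bias, probs)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def fit_fusion(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression fusion weights by batch gradient descent."""
    n, m = len(rows), len(rows[0])
    weights, bias = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * m, 0.0
        for probs, y in zip(rows, labels):
            err = fused_probability(weights, bias, probs) - y
            grad_b += err
            for k in range(m):
                grad_w[k] += err * probs[k]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical outputs (US, Rad-Habitat, DL) for 4 malignant and 4 benign lesions.
rows = [(0.8, 0.7, 0.9), (0.3, 0.8, 0.7), (0.7, 0.4, 0.8), (0.6, 0.7, 0.4),
        (0.2, 0.3, 0.1), (0.6, 0.2, 0.3), (0.3, 0.6, 0.2), (0.2, 0.3, 0.6)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_fusion(rows, labels)
fused = [fused_probability(weights, bias, r) for r in rows]
```

The fitted weights let a confident output from two models offset an error from the third, which is the sense in which partly distinct error patterns can be exploited at the probability level.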

Nevertheless, in the internal validation and external test cohorts, the combined models achieved sensitivities of 0.853–0.944 for BI-RADS 4 lesions and 0.815–0.889 for the BI-RADS 4a subgroup, indicating that a small proportion of malignancies may still be missed despite overall improved performance. False-negative predictions may occur, particularly when mammography-detected microcalcifications—an important driver of biopsy decisions—are less well visualized on US (35). Therefore, when clinical, mammographic, or other imaging suspicions persist, tissue diagnosis or further diagnostic work-up remains warranted irrespective of model output.

This study had a number of limitations. First, its retrospective design may introduce selection bias; prospective multicenter studies are needed for further validation and improved generalizability. Second, assessment of US features and lesion segmentation still partly depends on expert judgment. Although multi-reader review and Dice-based consistency checks were employed, subjective variability cannot be fully eliminated. Incorporating automatic or semi-automatic segmentation in future work will be important to enhance reproducibility and facilitate clinical translation. Third, SE was available only at Center 1, which limited the external validation of SE-derived predictors and the Combined (ABVS + SE) model; additional multicenter studies with standardized SE acquisition protocols are warranted to confirm model robustness and clinical applicability. Fourth, our models were developed using US-based inputs and did not systematically integrate other information streams that often guide management, including mammography and magnetic resonance imaging findings, as well as broader clinical risk factors. Future studies that fuse multimodality imaging with multi-source clinical data may further improve model performance and better reflect real-world decision-making.

Conclusions

We developed a multimodal fusion framework that integrates 2.5D DL, habitat radiomics, and clinical US features for ≤2 cm BI-RADS 4 breast lesions. Across all cohorts, the combined models consistently improved discriminatory ability, risk reclassification, and clinical net benefit. This strategy has the potential to reduce unnecessary biopsies and optimize individualized patient management in diagnostic scenarios involving small, highly heterogeneous lesions.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2812/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2812/dss

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2812/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Ethics Committees of The Third Xiangya Hospital, Central South University (No. Kuai 23642) and Changsha Central Hospital (No. 2025 Yi Shen Di 408 Hao). The requirement for written informed consent was waived by both committees due to the retrospective design and the use of de-identified data.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Harada-Shoji N, Suzuki A, Ishida T, Zheng YF, Narikawa-Shiono Y, Sato-Tadano A, Ohta R, Ohuchi N. Evaluation of Adjunctive Ultrasonography for Breast Cancer Detection Among Women Aged 40-49 Years With Varying Breast Density Undergoing Screening Mammography: A Secondary Analysis of a Randomized Clinical Trial. JAMA Netw Open 2021;4:e2121505. [Crossref] [PubMed]
National Comprehensive Cancer Network. NCCN Guidelines for Patients®: Breast Cancer Screening and Diagnosis. Version 2.2025. 2025 Mar 28 [cited 2025 Dec 4]. Available online: https://www.nccn.org/patients/guidelines/content/PDF/breastcancerscreening-patient.pdf
Elezaby M, Li G, Bhargavan-Chatfield M, Burnside ES, DeMartini WB. ACR BI-RADS Assessment Category 4 Subdivisions in Diagnostic Mammography: Utilization and Outcomes in the National Mammography Database. Radiology 2018;287:416-22. [Crossref] [PubMed]
Guo Q, Dong Z, Jiang L, Zhang L, Li Z, Wang D. Tumor size impacts the performance of ultrasound BI-RADS classification in breast cancer patients. Int J Radiat Res 2022;20:341-346.
Choi HY, Seo M, Sohn YM, Hwang JH, Song EJ, Min SY, Kang HJ, Han DY. Shear wave elastography for the diagnosis of small (≤2 cm) breast lesions: added value and factors associated with false results. Br J Radiol 2019;92:20180341. [Crossref] [PubMed]
Kim RG, Kim EK, Kim HA, Koh JS, Kim MS, Kim KI, Lee JI, Moon NM, Ko E, Noh WC. Prognostic significance of molecular subtype in T1N0M0 breast cancer: Korean experience. Eur J Surg Oncol 2011;37:629-34. [Crossref] [PubMed]
Wang C, Chen Z, Zhou Y, Huang W, Zhu H, Mao F, Lin Y, Zhang Y, Guan J, Cao X, Sun Q. T1a triple negative breast cancer has the worst prognosis among all the small tumor (<1 cm) of TNBC and HER2-rich subtypes. Gland Surg 2021;10:943-952. [Crossref] [PubMed]
Luo M, Lin X, Hao D, Shen KW, Wu W, Wang L, Ruan S, Zhou J. Incidence and risk factors of lymph node metastasis in breast cancer patients without preoperative chemoradiotherapy and neoadjuvant therapy: analysis of SEER data. Gland Surg 2023;12:1508-24. [Crossref] [PubMed]
Chen M, Palleschi S, Khoynezhad A, Gecelter G, Marini CP, Simms HH. Role of primary breast cancer characteristics in predicting positive sentinel lymph node biopsy results: a multivariate analysis. Arch Surg 2002;137:606-609; discussion 609-610. [Crossref] [PubMed]
Antolini L, Biganzoli E, Querzoli P, Piantelli M, Alberti S. Lymph Node Micrometastases Do Influence Breast Cancer Outcome. J Clin Oncol 2015;33:3977-8. [Crossref] [PubMed]
Nicosia L, Ferrari F, Bozzini AC, Latronico A, Trentin C, Meneghetti L, Pesapane F, Pizzamiglio M, Balesetreri N, Cassano E. Automatic breast ultrasound: state of the art and future perspectives. Ecancermedicalscience 2020;14:1062. [Crossref] [PubMed]
Tang G, An X, Xiang H, Liu L, Li A, Lin X. Automated breast ultrasound: interobserver agreement, diagnostic value, and associated clinical factors of coronal-plane image features. Korean J Radiol 2020;21:550-560. [Crossref] [PubMed]
Klein Wolterink F, Ab Mumin N, Appelman L, Derks-Rekers M, Imhof-Tas M, Lardenoije S, van der Leest M, Mann RM. Diagnostic performance of 3D automated breast ultrasound (3D-ABUS) in a clinical screening setting-a retrospective study. Eur Radiol 2024;34:5451-60. [Crossref] [PubMed]
Huisman M, Akinci D'Antonoli T. What a Radiologist Needs to Know About Radiomics, Standardization, and Reproducibility. Radiology 2024;310:e232459. [Crossref] [PubMed]
Shang Y, Zeng Y, Luo S, Wang Y, Yao J, Li M, Li X, Kui X, Wu H, Fan K, Li ZC, Zheng H, Li G, Liu J, Zhao W. Habitat Imaging With Tumoral and Peritumoral Radiomics for Prediction of Lung Adenocarcinoma Invasiveness on Preoperative Chest CT: A Multicenter Study. AJR Am J Roentgenol 2024;223:e2431675. [Crossref] [PubMed]
Li S, Dai Y, Chen J, Yan F, Yang Y. MRI-based habitat imaging in cancer treatment: current technology, applications, and challenges. Cancer Imaging 2024;24:107. [Crossref] [PubMed]
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27-30; Las Vegas, NV, USA. Piscataway (NJ): IEEE; 2016:770-8.
Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, HI, USA. Piscataway (NJ): IEEE; 2017:4700-8.
Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. A review of deep learning based methods for medical image multi-organ segmentation. Phys Med 2021;85:107-22. [Crossref] [PubMed]
van Kempen EJ, Post M, Mannil M, Witkam RL, Ter Laan M, Patel A, Meijer FJA, Henssen D. Performance of machine learning algorithms for glioma segmentation of brain MRI: a systematic literature review and meta-analysis. Eur Radiol 2021;31:9638-53. [Crossref] [PubMed]
Adler DD, Carson PL, Rubin JM, Quinn-Reid D. Doppler ultrasound color flow imaging in the study of breast cancer: preliminary findings. Ultrasound Med Biol 1990;16:553-9. [Crossref] [PubMed]
Ma Y, Li G, Li J, Ren WD. The Diagnostic Value of Superb Microvascular Imaging (SMI) in Detecting Blood Flow Signals of Breast Lesions: A Preliminary Study Comparing SMI to Color Doppler Flow Imaging. Medicine (Baltimore) 2015;94:e1502. [Crossref] [PubMed]
Itoh A, Ueno E, Tohno E, Kamma H, Takahashi H, Shiina T, Yamakawa M, Matsumura T. Breast disease: clinical application of US elastography for diagnosis. Radiology 2006;239:341-50. [Crossref] [PubMed]
Bernatowicz K, Grussu F, Ligero M, Garcia A, Delgado E, Perez-Lopez R. Robust imaging habitat computation using voxel-wise radiomics features. Sci Rep 2021;11:20133. [Crossref] [PubMed]
Allignet B, Leporq B, Bouhamama A, Pilleul F, Meurgey A, Gualter V, Sunyach MP, Waissi W, Beuf O. Habitat imaging based on voxel-wise GLCM joint energy clustering: prediction of disease-free survival in localized soft-tissue sarcoma. In: 20th International Conference on the Use of Computers in Radiation Therapy (ICCR); 2024 Jul 8-11; Lyon, France. Conference abstract [cited 2025 Dec 4]. Available online: https://www.iccr2024.org/papers/522722.pdf
Qi YJ, Su GH, You C, Zhang X, Xiao Y, Jiang YZ, Shao ZM. Radiomics in breast cancer: Current advances and future directions. Cell Rep Med 2024;5:101719. [Crossref] [PubMed]
Sagreiya H. Finding the Pieces to Treat the Whole: Using Radiomics to Identify Tumor Habitats. Radiol Artif Intell 2024;6:e230547. [Crossref] [PubMed]
Chen C, Gao K, Li Z, Ding Y, Zhao W. DBT-based habitat imaging for differentiating benign and malignant breast architectural distortion: a two-center study. BMC Med Imaging 2025;25:471. [Crossref] [PubMed]
Youk JH, Gweon HM, Son EJ. Shear-wave elastography in breast ultrasonography: the state of the art. Ultrasonography 2017;36:300-9. [Crossref] [PubMed]
Meng M, Li H, Zhang M, He G, Wang L, Shen D. Reducing the number of unnecessary biopsies for mammographic BI-RADS 4 lesions through a deep transfer learning method. BMC Med Imaging 2023;23:82. [Crossref] [PubMed]
Ezeana CF, He T, Patel TA, Kaklamani V, Elmi M, Brigmon E, Otto PM, Kist KA, Speck H, Wang L, Ensor J, Shih YT, Kim B, Pan IW, Cohen AL, Kelley K, Spak D, Yang WT, Chang JC, Wong STC. A Deep Learning Decision Support Tool to Improve Risk Stratification and Reduce Unnecessary Biopsies in BI-RADS 4 Mammograms. Radiol Artif Intell 2023;5:e220259. [Crossref] [PubMed]
Li Y, Li C, Yang T, Chen L, Huang M, Yang L, Zhou S, Liu H, Xia J, Wang S. Multiview deep learning networks based on automated breast volume scanner images for identifying breast cancer in BI-RADS 4. Front Oncol 2024;14:1399296. [Crossref] [PubMed]
Ye J, Xiong Y, Chen Y, Pan J, Qiu Y, Chen Y, Luo Z, Li Y, Huang W. Development and validation of an integrated model combining deep learning, radiomics, and clinical and breast ultrasound features for Breast Imaging Reporting and Data System 4A lesion malignancy classification. Quant Imaging Med Surg 2025;15:11907-21. [Crossref] [PubMed]
Ma Q, Wang J, Xu D, Zhu C, Qin J, Wu Y, Gao Y, Zhang C. Automatic Breast Volume Scanner and B-Ultrasound-Based Radiomics Nomogram for Clinician Management of BI-RADS 4A Lesions. Acad Radiol 2023;30:1628-37. [Crossref] [PubMed]
Soo MS, Baker JA, Rosen EL. Sonographic detection and sonographically guided biopsy of breast microcalcifications. AJR Am J Roentgenol 2003;180:941-8. [Crossref] [PubMed]

Cite this article as: Guo S, Zhou X, Zhou J, Gong L, Zhou S, Jiang L, Zhang Y, Jin L, Zhou P. Multimodal combined model integrating 2.5D deep learning and habitat radiomics for malignancy discrimination in ≤2 cm BI-RADS 4 breast lesions. Quant Imaging Med Surg 2026;16(5):377. doi: 10.21037/qims-2025-1-2812
