Benchmarking YOLOs in breast ultrasound lesion segmentation

Shaode Yu; Ming Huang; Enqi Chen; Bing Zhu; Xiaokun Liang; Yaoqin Xie

doi:10.21037/qims-2026-1-0333

Original Article

Shaode Yu¹ , Ming Huang¹ , Enqi Chen¹ , Bing Zhu¹ , Xiaokun Liang² , Yaoqin Xie²

¹School of Information and Communication Engineering, Communication University of China, Beijing, China; ²Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Contributions: (I) Conception and design: Y Xie, S Yu; (II) Administrative support: Y Xie; (III) Provision of study materials or patients: S Yu, X Liang; (IV) Collection and assembly of data: S Yu, M Huang, E Chen, B Zhu; (V) Data analysis and interpretation: S Yu, M Huang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Prof. Yaoqin Xie, PhD. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Xili, Nanshan District, Shenzhen 518000, China. Email: yq.xie@siat.ac.cn.

Background: Breast ultrasound (BUS) is widely used for breast cancer (BC) screening and diagnosis, yet accurate breast lesion segmentation remains challenging. Although You Only Look Once (YOLO) and its variants have shown strong performance in object segmentation, their effectiveness on BUS lesion segmentation has not been systematically explored. This study aims to benchmark twelve YOLO variants from four families (YOLOv5, YOLOv8, YOLOv9, and YOLO11) for BUS lesion segmentation under same-database and cross-database settings.

Methods: Twelve YOLO variants spanning nano to extra-large scales were fine-tuned and evaluated on two public BUS datasets: the Breast Ultrasound Images (BUSI) dataset (n=647) and the breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha (BUS-UCLM; n=264). Each dataset was split into training (80%), validation (10%), and testing (10%) subsets with stratified random partitioning, and experiments were repeated across eight random seeds. Six evaluation metrics were used: Dice coefficient, intersection over union (IoU), precision, recall, F1 score (F1S), and mean average precision at an IoU threshold of 0.5 (mAP@0.5). In addition, U-Net and DeepLabV3+ were compared under the same protocol.

Results: Under same-database evaluation, all variants achieved strong performance, with mean Dice scores ≥0.81 on BUSI and ≥0.87 on UCLM. On BUSI, yolov5s achieved the highest mean Dice (0.93±0.012) and IoU (0.88±0.014). On UCLM, yolov8m attained the highest mean Dice (0.98±0.006) and IoU (0.96±0.007). In cross-database evaluation, however, performance degraded substantially, with mean Dice decreasing by 0.13 to 0.28 for BUSI→UCLM and by 0.35 to 0.43 for UCLM→BUSI. For BUSI→UCLM, yolov5s achieved the highest mean Dice (0.71±0.032), and for UCLM→BUSI, yolov9c achieved the highest mean Dice (0.60±0.038). Among all variants, yolo11n demonstrated competitive performance across both same- and cross-database evaluations (BUSI Dice 0.91±0.014 and BUSI→UCLM Dice 0.69±0.034; UCLM Dice 0.96±0.009 and UCLM→BUSI Dice 0.59±0.039) while maintaining low computational cost (training time <290 s; inference latency ~64 ms). Notably, yolo11n substantially outperformed U-Net and DeepLabV3+ under both same- and cross-database settings.

Conclusions: Fine-tuned YOLO variants achieve strong same-database performance for BUS lesion segmentation, with yolo11n offering the most favorable balance among segmentation accuracy, cross-database competitiveness, and computational efficiency. However, the substantial performance degradation in cross-database settings highlights the critical need for improving domain generalization performance in future work.

Keywords: Breast ultrasound (BUS); lesion segmentation; breast cancer (BC); deep learning; You Only Look Once (YOLO)

Submitted Feb 11, 2026. Accepted for publication May 21, 2026. Published online Jun 08, 2026.

Introduction

Breast cancer (BC) remains one of the most serious threats to women’s health. It is the leading cause of cancer-related morbidity and mortality among females worldwide. A study covering 185 countries shows that ≈2.3 million new BC cases and 0.67 million deaths were recorded in 2022, highlighting the persistent and widespread burden of this disease (1). Despite advances in medical imaging, screening programs, and treatment strategies, significant disparities exist across regions and populations. A meta-analysis of around 2.4 million cases from 81 countries found that women of older age and those with lower socioeconomic status were more likely to be diagnosed at an advanced, metastatic stage (2). This finding suggests that unequal access to early screening and diagnostic services remains a challenge, particularly in low- and middle-income countries. Therefore, enhancing the accuracy of breast lesion detection at an early stage via cost-effective imaging modalities and artificial intelligence (AI)-based tools is crucial to improving patient outcomes, reducing BC-related mortality, and alleviating the overall healthcare burden (3).

Breast ultrasound (BUS) plays an important role in BC screening, diagnosis, staging, and treatment monitoring. Based on high-frequency sound waves, BUS enables the real-time visualization of internal breast structures to assess tissue composition and detect suspicious lesions. Unlike mammography, it involves no ionizing radiation and can be performed repeatedly without health risks. Moreover, BUS is especially valuable for women with dense breast tissue, in whom the sensitivity of mammography is often limited (4,5). Compared with advanced modalities such as magnetic resonance imaging, BUS is relatively affordable, portable, and more suitable for routine screening and for deployment in community hospitals or remote healthcare facilities (6). Owing to these advantages, BUS has become an indispensable component of disease management and continues to play an expanding role in global cancer care, which is especially meaningful for patients in low-income or resource-constrained regions (7).

Automated BUS image segmentation supports localization of suspicious lesions, quantitative tumor assessment, and clinical decision-making. Beyond traditional algorithms (5,8,9), numerous deep learning models have been developed over the past decade (10). Convolutional neural network (CNN)-based models were designed first (11). Wang et al. present a coarse-to-fine fusion model consisting of encoding, decoding, and feature fusion components that respectively capture context information, predict localization, and generate aggregated feature representations (12). Chen et al. introduce hybrid adaptive attention to capture sufficient features under different receptive fields and to learn rich, robust representations in the channel and spatial dimensions (13). Yang et al. present pyramid squeeze attention to enhance the receptive field, together with a boundary branch network that uses contextual local attention and integrates local spatial and high-level semantic contextual information (14). Generative adversarial network (GAN)-based approaches (15) were then introduced. Han et al. leverage a GAN for BUS lesion segmentation that exploits unannotated data through semi-supervised learning and enhances pixel-wise discrimination through dual attentive fusion (16). Jie et al. propose a semi-pixel-wise cycle GAN that uses prior knowledge to relieve the annotation burden (17). Singh et al. conduct contextual information-aware conditional generative adversarial learning, with several enhancements exploited to capture both texture features and contextual dependencies (18). After that, the Transformer (19) was explored. He et al. develop Transformer encoder blocks to learn global contextual information and a spatial-wise cross attention decoder module to reduce semantic discrepancy (20). Yang et al. integrate a CNN and the residual Swin Transformer to enhance contextual information awareness and to capture both the spatial and boundary context of lesion morphological information (21). Zhang et al. combine the long-range dependency modeling of Transformers with the local detail representation of CNNs, and add a cross attention block module to allow different layers to interact (22).

You Only Look Once (YOLO) has undergone continuous improvements in object localization and segmentation, achieving top-tier performance on large-scale benchmarks such as ImageNet (23). A comprehensive overview of the evolution of YOLO variants from YOLOv1 to YOLO11 emphasizes key architectural advances, performance benchmarks, representative applications, and their practical strengths and limitations (24). However, their application to BUS lesion segmentation remains limited. Cao et al. conducted an experimental comparison in which both YOLO and YOLOv3 were evaluated (25). Samanta et al. compared the performance of YOLOv6, YOLOv7, and YOLOv8 on the Breast Ultrasound Images (BUSI) dataset (26). Li et al. integrated global-local multi-scale selection into YOLOv5 for mass isolation (27). Wang et al. enhanced YOLOv8s by incorporating oriented bounding boxes, deformable convolution, and multi-scale feature fusion to improve the detection of breast tumors with irregular shapes and orientations across multiple scales (28). Du et al. improved YOLOv7 by mitigating background interference, reducing redundant computations, and enhancing positional awareness (29). Mostafa et al. adapted YOLOv8 and compared its performance with several deep learning-based segmentation models (30). Recently, Ariyametkul and Paing evaluated multiple YOLO variants for BC detection in mammograms and explored model explainability using heat map visualizations (31). Ma et al. developed a YOLO-based AI model using an automated breast volume scanner for the detection and classification of benign and malignant breast lesions (32).

In this study, we present a comprehensive experimental evaluation of four YOLO families comprising 12 variants to investigate their intra- and cross-database performance for BUS lesion segmentation. Experiments are conducted on 2 public datasets [BUSI (33) and breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha (UCLM) (34)], and performance is assessed using 6 evaluation metrics. Results show that all variants achieve strong performance under intra-database evaluation, whereas notable degradation occurs in cross-database settings. Among the evaluated models, no single variant consistently outperforms others across evaluation settings, while yolo11n achieves competitive performance in both intra- and cross-database evaluations and maintains favorable computational efficiency. We present this article in accordance with the CLAIM reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0333/rc).

Methods

Data collection

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Two datasets, BUSI (33) and UCLM (34), are used. Both datasets contain BUS images with corresponding pixel-wise annotations. Table 1 summarizes the number of BUS images in each category. For BUS lesion segmentation, only images labeled as benign and malignant are retained, and thereby, a total of 647 images from the BUSI dataset and 264 images from the UCLM dataset are analyzed.

Table 1

The number of images in BUSI and UCLM databases

Dataset	Normal	Benign	Malignant	Used in this study
BUSI (33)	133	437	210	647
UCLM (34)	419	174	90	264

BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

Figure 1 presents representative examples of BUS images and corresponding masks. The first two columns show images from the BUSI dataset along with their binary masks, whereas the remaining columns display images from the UCLM dataset with colored masks. In all mask images, the dark regions represent the background.

Figure 1 Representative examples. The columns present the BUSI raw images, BUSI masks, UCLM raw images, and UCLM masks, respectively. All images are resized to 512×512 using aspect-ratio preservation with center padding. BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.
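The resizing described in the caption (aspect-ratio preservation with center padding onto a 512×512 canvas) can be sketched as a small geometry calculation. This is a minimal illustration only: the function name `letterbox_geometry`, the rounding rule, and the way an odd padding remainder is split are assumptions, and the source does not state the interpolation method or the padding value.

```python
def letterbox_geometry(width, height, target=512):
    """Return the resized size and centered padding for a square target canvas.

    Hypothetical helper: the longer image side is scaled to `target`, and the
    shorter side is padded equally on both ends (any odd pixel goes to the
    right/bottom edge).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = target - new_w
    pad_y = target - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        # Padding order: left, top, right, bottom.
        "pad": (left, top, pad_x - left, pad_y - top),
    }


# Example: an 800x600 image is scaled to 512x384 and padded by 64 rows
# above and below to reach 512x512.
geom = letterbox_geometry(800, 600)
```

The same scale and offsets would have to be applied to the mask so that image and annotation stay aligned.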

Involved YOLO variants

Twelve YOLO variants are evaluated, covering a broad spectrum of model capacities ranging from nano and small to medium, large, and extra-large architectures. These variants span four representative families, YOLOv5 (35), YOLOv8 (36), YOLOv9 (37), and YOLO11 (38), and include yolov5n, yolov5s, yolov8n, yolov8s, yolov8m, yolov9c, yolov9e, yolo11n, yolo11s, yolo11m, yolo11l, and yolo11x. For a fair comparison, the official training stack of each model family is used for these variants, and full implementation details are available through Ultralytics.

Table 2 reports the feasible batch sizes (BS) used in training and in the default testing pipeline. An adaptive BS strategy was employed in both stages through a graphics processing unit (GPU) memory-aware batch schedule determined by model scale. Because smaller models can process more images per iteration, they were assigned larger BS values, and for several of the larger variants the BS in the testing stage could be set higher than that in the training stage. Notably, for a fair inference-time comparison across model variants, an additional controlled latency measurement was performed using BS =1 for all models on the same hardware platform.

Table 2

Feasible batch sizes during the training and testing stages

Variant	Scale	Parameters	Capacity level	Training	Testing
yolov5n	Nano	≈2M	Ultra-light	32	32
yolov5s	Small	≈7M	Lightweight	24	24
yolov8n	Nano	≈3M	Very light	32	32
yolov8s	Small	≈11M	Light	24	24
yolov8m	Medium	≈26M	Medium	12	16
yolov9c	Core	≈26M	Medium	16	20
yolov9e	Extra-large	≈59M	Large	4	8
yolo11n	Nano	≈5M	Lightweight	32	32
yolo11s	Small	≈13M	Light	24	24
yolo11m	Medium	≈37M	Medium-large	12	16
yolo11l	Large	≈58M	Large	8	12
yolo11x	Extra-large	≈97M	Very large	4	8

M, million.
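The controlled latency measurement with BS =1 can be illustrated with a generic timing harness. This is a sketch under stated assumptions rather than the procedure used in the study: the helper name `measure_latency_ms`, the warm-up count, the number of timed runs, and the use of the median are all hypothetical choices, and `predict_one` stands in for a single-image model call.

```python
import statistics
import time


def measure_latency_ms(predict_one, sample, n_warmup=5, n_runs=20):
    """Median per-image latency in milliseconds for single-image calls.

    `predict_one` is any callable that processes exactly one input, which
    corresponds to a batch size of 1.
    """
    for _ in range(n_warmup):
        predict_one(sample)  # warm-up calls are executed but not timed
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_one(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload so the sketch runs without a model or a GPU.
latency = measure_latency_ms(lambda x: sum(x), list(range(1000)))
```

When timing a GPU model, the device would additionally need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches can make the measured time too short.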

To provide a broader reference beyond the YOLO family, two representative image segmentation models, U-Net (39) and DeepLabV3+ (40), were additionally implemented under the same input preprocessing, repeated image-level split protocol, and mask-based evaluation pipeline. The baseline comparison therefore follows the same data partition strategy and repeated-run design as the YOLO benchmark, enabling a direct comparison of same-database and cross-database behavior.

Implementation details

The implementation of the evaluated YOLO variants is based on the Ultralytics framework, which emphasizes practical, efficient deployment. These models build on extensive prior research and are continuously refined to enhance both performance and adaptability. They have demonstrated strong effectiveness across a range of computer vision tasks, including object detection and instance segmentation. All models are pre-trained on the Microsoft Common Objects in Context (COCO) dataset (41). Experiments are conducted on a single NVIDIA L40 GPU (NVIDIA, Santa Clara, California, United States), and the official training stacks were used without manual modification of the major optimization settings.

Parameter settings

Table 3 summarizes the major training, evaluation, and inference settings used in this study. For YOLOv8, YOLOv9, and YOLO11 variants, the official Ultralytics training stack was used with its default hyperparameter configuration unless otherwise stated. For YOLOv5 variants, the official YOLOv5-seg training stack and its default hyperparameter configuration were used unless otherwise stated. Only a small number of experiment-specific settings were explicitly specified in this study, including the number of epochs, image size, BS, workers, mixed precision, and inference confidence threshold.

Table 3

Major training, evaluation, and inference settings

Items	Settings
Input resolution	512×512 (training and evaluation)
GPU	A single NVIDIA L40 (48 GB)
Epochs	50 (no early stopping)
Batch size (training/default evaluation)	Per-model schedule in Table 2
Latency measurement	Batch size =1 for all models
Workers	8 (data loader)
Cache	Disabled
Mixed precision	Enabled
Pretrained weights	Official COCO-pretrained checkpoints
Optimizer (YOLOv8/v9/11)	Auto
Deterministic mode (YOLOv8/v9/11)	True
Initial learning rate	0.01
Final LRF	0.01
Momentum	0.937
Weight decay	0.0005
Warmup epochs	3.0
Warmup momentum	0.8
Warmup bias learning rate	0.1
Cosine learning-rate scheduler (YOLOv8/v9/11)	False
Close mosaic (YOLOv8/v9/11)	10
Checkpoint used for evaluation	Best saved checkpoint when available
Foreground definition	Annotated benign and malignant regions
Inference confidence threshold	0.25
Inference IoU threshold	0.7

COCO, Common Objects In Context; GPU, graphics processing unit; IoU, intersection over union; LRF, learning rate factor; YOLO, You Only Look Once.

All other key hyperparameters followed the official defaults of the corresponding training stack used at the time of the experiments. For YOLOv8, YOLOv9, and YOLO11, these included the official Ultralytics default configuration (e.g., optimizer = auto, lr0 = 0.01, lrf = 0.01, momentum = 0.937, weight_decay = 0.0005, cos_lr = False, close_mosaic = 10, and amp = True). For YOLOv5, the official YOLOv5 default hyperparameter configuration was used, including lr0 = 0.01, lrf = 0.01, momentum = 0.937, weight_decay = 0.0005, and the default augmentation settings.
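For the YOLOv8, YOLOv9, and YOLO11 variants, the experiment-specific settings in Table 3 could be passed to the Ultralytics Python API roughly as follows. This is a hedged sketch, not the authors' script: the helper names, the dataset YAML path, the seed value, and the weight file name `yolo11n-seg.pt` are assumptions, and every argument not listed is left at the framework default. The YOLOv5 variants use a separate training stack and are not covered here.

```python
def build_train_kwargs(data_yaml, batch, seed):
    """Collect the explicitly specified settings (hypothetical helper)."""
    return {
        "data": data_yaml,       # assumed path to a dataset description file
        "epochs": 50,
        "imgsz": 512,
        "batch": batch,          # per-model value from Table 2
        "workers": 8,
        "amp": True,             # mixed precision enabled
        "cache": False,
        "deterministic": True,
        "seed": seed,            # assumed: one seed per repeated run
    }


def train_variant(weights, data_yaml, batch, seed):
    """Fine-tune one COCO-pretrained segmentation checkpoint."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**build_train_kwargs(data_yaml, batch, seed))


# Building the arguments does not require a GPU or the dataset.
kwargs = build_train_kwargs("busi.yaml", batch=32, seed=0)
# A full run would then be, for example:
# train_variant("yolo11n-seg.pt", "busi.yaml", batch=32, seed=0)
```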

Experiment design

Intra- and inter-database experiments are conducted. For each intra-database experiment, the dataset was split at the image level into training, validation, and testing subsets with a ratio of 80%, 10%, and 10%, respectively. Stratified random partitioning based on lesion category (benign/malignant) was repeated using eight different random seeds, and the reported same-database results are presented as mean ± standard deviation across the eight runs. No patient-level partition was applied in the present benchmark. Because the BUSI dataset does not provide patient identifiers required for reliable subject-level grouping, a unified image-level protocol was adopted for both BUSI and UCLM to maintain benchmarking consistency.
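The repeated image-level stratified split can be sketched with the standard library alone. The function name `stratified_split`, the seed values 0 to 7, and the rounding of per-class subset sizes are assumptions made for illustration; the class counts are the BUSI figures from Table 1.

```python
import random
from collections import defaultdict


def stratified_split(labels, seed, ratios=(0.8, 0.1, 0.1)):
    """Split item indices into train/val/test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        split["train"] += indices[:n_train]
        split["val"] += indices[n_train:n_train + n_val]
        # The test subset receives whatever remains after rounding.
        split["test"] += indices[n_train + n_val:]
    return split


# BUSI lesion images: 437 benign and 210 malignant (Table 1).
labels = ["benign"] * 437 + ["malignant"] * 210
splits = [stratified_split(labels, seed) for seed in range(8)]
```

Because the split operates on image indices, it reproduces the image-level protocol described above and does not prevent images of the same patient from landing in different subsets.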

For the inter-database experiment, a model trained on the seed-specific training subset of one database was evaluated on all eligible cases from the other database without domain adaptation. The reported cross-database results are likewise summarized as mean ± standard deviation across eight runs. Note that, because the benchmark remains image-level rather than patient-level, images from the same patient may still appear in different subsets when multiple images per patient exist in the source dataset, and this potential source of optimistic bias should be considered when interpreting same-database performance.

Performance metrics

Six metrics are used, including Dice coefficient, intersection over union (IoU), Precision, Recall, F1 score (F1S), and mean average precision at an IoU threshold of 0.5 (mAP@0.5) on binary masks. Polygon labels are extracted from the masks (contour tracing) and stored in the YOLO-seg format [normalized (x, y) vertices]. During evaluation, polygons are rasterized into binary masks, and only predictions with confidence ≥0.25 are considered. The confidence threshold of 0.25 was applied uniformly to all evaluated variants, matching the default prediction setting of the Ultralytics framework, and no additional threshold tuning was performed in order to maintain a consistent comparison protocol across models. Unless otherwise stated, the main quantitative results are summarized as mean ± standard deviation over eight repeated runs. For U-Net and DeepLabV3+, Dice, IoU, Precision, Recall, and F1S were computed from the predicted binary masks using the same mask-level evaluation routine; mAP@0.5 was not reported for these baselines because it depends on the YOLO-specific validation pipeline.
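The overlap-based metrics can be written directly in terms of pixel counts. The sketch below assumes binary masks supplied as nested rows of 0/1 values and returns 0.0 when a denominator is empty; both the function name `mask_metrics` and that empty-mask convention are assumptions, and the source does not state whether metrics were computed per image or pooled over pixels.

```python
def mask_metrics(pred, target):
    """Pixel-level overlap metrics for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # lesion pixel predicted as lesion
            elif p:
                fp += 1      # background pixel predicted as lesion
            elif t:
                fn += 1      # lesion pixel missed by the prediction

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
metrics = mask_metrics(pred, target)
```

At the pixel level, Dice and F1 reduce to the same expression, 2TP/(2TP+FP+FN), so the two values coincide whenever they are computed from the same counts.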

Because the present preprocessing pipeline operated on image files rather than patient identifiers, the benchmark was implemented using image-level partitioning for both datasets. For same-database evaluation, the image-level stratified split was repeated over 8 random seeds, and the reported results are summarized as mean ± standard deviation across runs. For each training run, the best saved checkpoint was used for evaluation when available; otherwise, the last checkpoint was used.
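Two bookkeeping steps from this protocol, summarizing repeated runs and choosing the checkpoint, can be sketched as follows. The helper names are hypothetical, the per-run Dice values are placeholders that do not come from the study, the file names `best.pt` and `last.pt` are assumed, and the use of the sample standard deviation (n−1 denominator) is an assumption because the source does not specify which estimator was used.

```python
import statistics
from pathlib import Path


def summarize(values):
    """Format repeated-run results as 'mean±std' in the style of the tables."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumed)
    return f"{mean:.2f}±{std:.3f}"


def pick_checkpoint(weights_dir):
    """Prefer the best saved checkpoint and fall back to the last one."""
    best = Path(weights_dir) / "best.pt"
    last = Path(weights_dir) / "last.pt"
    return best if best.exists() else last


# Placeholder Dice values for eight runs, for illustration only.
dice_runs = [0.90, 0.92, 0.91, 0.93, 0.89, 0.91, 0.92, 0.90]
summary = summarize(dice_runs)
```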

Results

BUSI-based intra- and inter-database performance

Figure 2 illustrates the Dice values of the variants on the BUSI database, in which blue bars show intra-database results and brown bars represent inter-database performance. All variants obtain higher Dice scores in the intra-database setting than in the inter-database scenario. Specifically, most variants (9 of 12) show a Dice drop of ≥0.20, indicating severe performance degradation under cross-database evaluation.

Figure 2 Per-model Dice scores on the BUSI database, where same-domain performance consistently exceeds cross-domain performance. BUSI, Breast Ultrasound Images.

Table 4 shows intra- and inter-database segmentation performance on the BUSI dataset. In general, all models achieve substantially better results in the intra-database setting than in the inter-database scenario, indicating a clear performance degradation under domain shift. In the intra-database evaluation, yolov5s attains the highest Dice, IoU, F1S, and mAP0.5 values, together with a favorable balance between Precision and Recall. Meanwhile, yolo11n and yolov8s also show competitive results, with Dice scores of 0.90 or higher, suggesting that lightweight architectures are effective for BUS segmentation in the present benchmark. In contrast, increasing model capacity (e.g., yolo11m/l/x) does not lead to further improvements in the present evaluation and sometimes coincides with lower Dice and IoU values, which may reflect over-fitting on this relatively limited dataset.

Table 4

BUSI-based intra- and inter-database segmentation

Variant	Dice	IoU	Precision	Recall	F1S	mAP0.5
yolov5n (intra)	0.82±0.021	0.76±0.023	0.81±0.020	0.85±0.019	0.82±0.021	0.91±0.015
yolov5n (inter)	0.67±0.035	0.65±0.033	0.68±0.034	0.67±0.036	0.67±0.035	0.49±0.028
yolov5s (intra)	0.93±0.012	0.88±0.014	0.93±0.011	0.94±0.010	0.93±0.012	0.98±0.008
yolov5s (inter)	0.71±0.032	0.69±0.030	0.71±0.031	0.73±0.029	0.71±0.032	0.51±0.026
yolov8n (intra)	0.81±0.022	0.75±0.024	0.79±0.023	0.86±0.020	0.81±0.022	0.93±0.014
yolov8n (inter)	0.62±0.038	0.59±0.036	0.61±0.037	0.66±0.035	0.62±0.038	0.48±0.029
yolov8s (intra)	0.90±0.015	0.85±0.017	0.89±0.016	0.94±0.013	0.90±0.015	0.95±0.011
yolov8s (inter)	0.65±0.036	0.63±0.034	0.65±0.035	0.66±0.033	0.65±0.036	0.53±0.025
yolov8m (intra)	0.83±0.020	0.77±0.022	0.81±0.021	0.85±0.019	0.83±0.020	0.89±0.016
yolov8m (inter)	0.70±0.033	0.68±0.031	0.71±0.032	0.70±0.034	0.70±0.033	0.47±0.027
yolov9c (intra)	0.89±0.016	0.84±0.018	0.88±0.017	0.92±0.014	0.89±0.016	0.97±0.009
yolov9c (inter)	0.62±0.039	0.59±0.037	0.61±0.038	0.64±0.036	0.62±0.039	0.43±0.030
yolov9e (intra)	0.87±0.018	0.81±0.020	0.85±0.019	0.92±0.015	0.87±0.018	0.93±0.012
yolov9e (inter)	0.60±0.040	0.58±0.038	0.60±0.039	0.62±0.037	0.60±0.040	0.45±0.029
yolo11n (intra)	0.91±0.014	0.86±0.016	0.90±0.015	0.92±0.013	0.91±0.014	0.96±0.010
yolo11n (inter)	0.69±0.034	0.67±0.032	0.70±0.033	0.70±0.035	0.69±0.034	0.51±0.026
yolo11s (intra)	0.84±0.019	0.78±0.021	0.82±0.020	0.91±0.016	0.84±0.019	0.95±0.011
yolo11s (inter)	0.61±0.039	0.58±0.037	0.60±0.038	0.64±0.036	0.61±0.039	0.51±0.026
yolo11m (intra)	0.85±0.018	0.79±0.020	0.83±0.019	0.88±0.017	0.85±0.018	0.90±0.015
yolo11m (inter)	0.57±0.041	0.55±0.039	0.57±0.040	0.60±0.038	0.57±0.041	0.43±0.030
yolo11l (intra)	0.83±0.020	0.77±0.022	0.80±0.021	0.90±0.016	0.83±0.020	0.87±0.016
yolo11l (inter)	0.60±0.040	0.58±0.038	0.60±0.039	0.63±0.037	0.60±0.040	0.47±0.027
yolo11x (intra)	0.84±0.019	0.78±0.021	0.82±0.020	0.88±0.017	0.84±0.019	0.84±0.018
yolo11x (inter)	0.56±0.042	0.54±0.040	0.56±0.041	0.59±0.039	0.56±0.042	0.46±0.028

Data are presented as mean ± standard deviation. BUSI, Breast Ultrasound Images; F1S, F1 score; IoU, intersection over union; mAP0.5, mean average precision at an IoU threshold of 0.5.

In the inter-database setting, performance drops substantially across all metrics, highlighting the challenge of cross-dataset generalization in this BUS lesion segmentation task. Among the variants, yolov5s and yolo11n achieve relatively favorable overlap-based results, and yolov8s achieves the highest mAP0.5. These lightweight YOLO models (e.g., yolov5s, yolov8s, and yolo11n) demonstrate relatively favorable performance compared with other evaluated variants, although the absolute cross-database Dice values remain clearly lower than the corresponding same-database performance.
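The size of the same- to cross-database gap can be recomputed from the mean Dice values in Table 4. The snippet below is only an arithmetic check on the reported means; the variable names are illustrative, and differences are rounded to two decimals to match the precision of the table.

```python
# Mean Dice (intra, inter) per variant, copied from Table 4.
busi_dice = {
    "yolov5n": (0.82, 0.67),
    "yolov5s": (0.93, 0.71),
    "yolov8n": (0.81, 0.62),
    "yolov8s": (0.90, 0.65),
    "yolov8m": (0.83, 0.70),
    "yolov9c": (0.89, 0.62),
    "yolov9e": (0.87, 0.60),
    "yolo11n": (0.91, 0.69),
    "yolo11s": (0.84, 0.61),
    "yolo11m": (0.85, 0.57),
    "yolo11l": (0.83, 0.60),
    "yolo11x": (0.84, 0.56),
}

# Dice drop when moving from BUSI test images to UCLM images.
drops = {
    name: round(intra - inter, 2)
    for name, (intra, inter) in busi_dice.items()
}

# Variants whose mean Dice falls by at least 0.20.
large_drops = sorted(name for name, drop in drops.items() if drop >= 0.20)

smallest = min(drops, key=drops.get)
largest_value = max(drops.values())
```

Under this calculation the drops span 0.13 (yolov8m) to 0.28 (yolo11m and yolo11x), and nine of the twelve variants lose 0.20 or more.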

UCLM-based intra- and inter-database performance

Figure 3 shows the Dice values of YOLO variants on the UCLM database, in which blue bars denote intra-database results and brown bars represent inter-database performance. All variants achieve Dice scores at least 0.35 higher in the intra-database setting than in the inter-database setting. Notably, yolov8m drops from 0.98 in intra-database testing to 0.55 in inter-database testing, and yolo11s drops from 0.96 to 0.53. Both correspond to a decline of 0.43 in Dice, the largest among the variants, when moving from same- to cross-domain evaluation.

Figure 3 Per-model Dice scores on the UCLM database. Same-domain evaluation yields consistently higher Dice scores than cross-domain evaluation. UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

Table 5 presents the intra- and inter-database segmentation performance of YOLO variants on the UCLM dataset. In the intra-database setting, all models achieve consistently high segmentation performance, with mean Dice scores of at least 0.87 across all configurations. Notably, yolov8m attains the highest mean intra-database Dice (0.98±0.006), IoU (0.96±0.007), and F1S (0.98±0.006) values. Several other variants, including yolov9c and yolo11x, also demonstrate competitive intra-database results, supporting the effectiveness of YOLO-based segmentation under same-domain conditions.

Table 5

UCLM-based intra- and inter-database segmentation

Variant	Dice	IoU	Precision	Recall	F1S	mAP0.5
yolov5n (intra)	0.91±0.013	0.89±0.014	0.91±0.013	0.91±0.013	0.91±0.013	0.98±0.007
yolov5n (inter)	0.53±0.043	0.48±0.041	0.53±0.042	0.56±0.040	0.53±0.043	0.54±0.027
yolov5s (intra)	0.87±0.017	0.84±0.019	0.86±0.018	0.87±0.017	0.87±0.017	0.97±0.009
yolov5s (inter)	0.51±0.044	0.47±0.042	0.51±0.043	0.53±0.041	0.51±0.044	0.55±0.026
yolov8n (intra)	0.93±0.012	0.91±0.013	0.92±0.012	0.93±0.012	0.93±0.012	0.99±0.006
yolov8n (inter)	0.52±0.043	0.48±0.041	0.52±0.042	0.55±0.040	0.52±0.043	0.52±0.028
yolov8s (intra)	0.92±0.013	0.90±0.014	0.91±0.013	0.92±0.013	0.92±0.013	0.98±0.007
yolov8s (inter)	0.52±0.043	0.48±0.041	0.51±0.042	0.54±0.040	0.52±0.043	0.54±0.027
yolov8m (intra)	0.98±0.006	0.96±0.007	0.97±0.006	0.98±0.006	0.98±0.006	0.99±0.005
yolov8m (inter)	0.55±0.041	0.51±0.039	0.55±0.040	0.57±0.038	0.55±0.041	0.56±0.025
yolov9c (intra)	0.96±0.009	0.94±0.010	0.96±0.009	0.97±0.008	0.96±0.009	0.98±0.007
yolov9c (inter)	0.60±0.038	0.54±0.036	0.58±0.037	0.66±0.034	0.60±0.038	0.52±0.028
yolov9e (intra)	0.92±0.013	0.90±0.014	0.91±0.013	0.92±0.013	0.92±0.013	0.89±0.015
yolov9e (inter)	0.52±0.043	0.48±0.041	0.52±0.042	0.54±0.040	0.52±0.043	0.42±0.032
yolo11n (intra)	0.96±0.009	0.94±0.010	0.96±0.009	0.96±0.009	0.96±0.009	0.99±0.005
yolo11n (inter)	0.59±0.039	0.54±0.037	0.59±0.038	0.61±0.036	0.59±0.039	0.51±0.029
yolo11s (intra)	0.96±0.009	0.94±0.010	0.96±0.009	0.97±0.008	0.96±0.009	0.99±0.005
yolo11s (inter)	0.53±0.043	0.49±0.041	0.52±0.042	0.57±0.040	0.53±0.043	0.50±0.030
yolo11m (intra)	0.90±0.015	0.88±0.016	0.89±0.015	0.91±0.014	0.90±0.015	0.98±0.007
yolo11m (inter)	0.55±0.041	0.50±0.039	0.53±0.040	0.63±0.037	0.55±0.041	0.50±0.030
yolo11l (intra)	0.95±0.010	0.93±0.011	0.95±0.010	0.95±0.010	0.95±0.010	0.96±0.009
yolo11l (inter)	0.57±0.040	0.52±0.038	0.56±0.039	0.60±0.037	0.57±0.040	0.52±0.028
yolo11x (intra)	0.97±0.008	0.94±0.009	0.97±0.008	0.97±0.008	0.97±0.008	0.98±0.007
yolo11x (inter)	0.58±0.039	0.53±0.037	0.57±0.038	0.62±0.036	0.58±0.039	0.45±0.031

Data are presented as mean ± standard deviation. F1S, F1 score; IoU, intersection over union; mAP0.5, mean average precision at an IoU threshold of 0.5; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

In contrast, inter-database performance drops markedly for all models, revealing a substantial generalization gap between UCLM and unseen datasets. Mean Dice scores decrease to the range of 0.51 to 0.60, with corresponding reductions in IoU and F1S values. Notably, yolov9c achieves the highest mean Dice (0.60±0.038) and Recall (0.66±0.034) values, indicating comparatively favorable cross-database performance. Both yolo11n and yolo11x also show competitive cross-database results, with mean Dice scores of 0.59 and 0.58, respectively.

Overall, larger or higher-capacity models yield excellent same-domain performance on UCLM but do not consistently achieve the strongest cross-database results. Compact and medium-capacity architectures, such as yolo11n and yolov9c, appear to provide a favorable balance between representation power and cross-database competitiveness in the present benchmark.

Analysis of top-tier YOLO variants on BUS lesion segmentation

On intra-database evaluation performance

Table 6 summarizes the intra-database segmentation performance of top-tier YOLO variants on the BUSI and UCLM datasets. The variant names yolov9c and yolo11n are highlighted in bold to indicate their consistently strong performance across both datasets. This emphasis is illustrative and does not correspond to the maximum value in each metric.

Table 6

Top-tier intra-database segmentation performance

Dataset	Variant	Dice	IoU	Precision	Recall	F1S	mAP0.5
BUSI	yolov5s	0.93±0.012	0.88±0.014	0.93±0.011	0.94±0.010	0.93±0.012	0.98±0.008
BUSI	yolo11n	0.91±0.014	0.86±0.016	0.90±0.015	0.92±0.013	0.91±0.014	0.96±0.010
BUSI	yolov8s	0.90±0.015	0.85±0.017	0.89±0.016	0.94±0.013	0.90±0.015	0.95±0.011
BUSI	yolov9c	0.89±0.016	0.84±0.018	0.88±0.017	0.92±0.014	0.89±0.016	0.97±0.009
BUSI	yolov9e	0.87±0.018	0.81±0.020	0.85±0.019	0.92±0.015	0.87±0.018	0.93±0.012
UCLM	yolov8m	0.98±0.006	0.96±0.007	0.97±0.006	0.98±0.006	0.98±0.006	0.99±0.005
UCLM	yolo11x	0.97±0.008	0.94±0.009	0.97±0.008	0.97±0.008	0.97±0.008	0.98±0.007
UCLM	yolov9c	0.96±0.009	0.94±0.010	0.96±0.009	0.97±0.008	0.96±0.009	0.98±0.007
UCLM	yolo11n	0.96±0.009	0.94±0.010	0.96±0.009	0.96±0.009	0.96±0.009	0.99±0.005
UCLM	yolo11s	0.96±0.009	0.94±0.010	0.96±0.009	0.97±0.008	0.96±0.009	0.99±0.005

Data are presented as mean ± standard deviation. BUSI, Breast Ultrasound Images; F1S, F1 score; IoU, intersection over union; mAP0.5, mean average precision at an IoU threshold of 0.5; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

On the BUSI dataset, yolov5s achieves the highest mean Dice (0.93±0.012) and IoU (0.88±0.014) values, indicating the numerically strongest overlap performance among the variants. The yolo11n and yolov8s variants deliver competitive performance, with mean Dice values of 0.90 or higher and balanced Precision-Recall values, while yolov9c attains a relatively high mAP0.5 (0.97±0.009), suggesting robust localization capability.

On the UCLM dataset, yolov8m attains the highest mean Dice (0.98±0.006), IoU (0.96±0.007), and F1S (0.98±0.006) values, indicating that increased model capacity can be beneficial in this same-domain setting. The yolo11x, yolo11n, and yolov9c variants also demonstrate consistently strong performance, with mean Dice values of 0.96 to 0.97, indicating that these architectures can effectively model lesion appearance.

On inter-database evaluation performance

Table 7 summarizes the cross-database segmentation results of the top-tier YOLO variants, which are used to evaluate model generalization. The model names in bold (yolov8m and yolo11n) indicate only that these variants achieve comparatively competitive performance in both cross-database experiments.

Table 7

Inter-database segmentation performance

Dataset	Variant	Dice	IoU	Precision	Recall	F1S	mAP0.5
BUSI→UCLM	yolov5s	0.71±0.032	0.69±0.030	0.71±0.031	0.73±0.029	0.71±0.032	0.51±0.026
BUSI→UCLM	yolov8m	0.70±0.033	0.68±0.031	0.71±0.032	0.70±0.034	0.70±0.033	0.47±0.027
BUSI→UCLM	yolo11n	0.69±0.034	0.67±0.032	0.70±0.033	0.70±0.035	0.69±0.034	0.51±0.026
BUSI→UCLM	yolov5n	0.67±0.035	0.65±0.033	0.68±0.034	0.67±0.036	0.67±0.035	0.49±0.028
BUSI→UCLM	yolov8s	0.65±0.036	0.63±0.034	0.65±0.035	0.66±0.033	0.65±0.036	0.53±0.025
UCLM→BUSI	yolov9c	0.60±0.038	0.54±0.036	0.58±0.037	0.66±0.034	0.60±0.038	0.52±0.028
UCLM→BUSI	yolo11n	0.59±0.039	0.54±0.037	0.59±0.038	0.61±0.036	0.59±0.039	0.51±0.029
UCLM→BUSI	yolo11x	0.58±0.039	0.53±0.037	0.57±0.038	0.62±0.036	0.58±0.039	0.45±0.031
UCLM→BUSI	yolo11l	0.57±0.040	0.52±0.038	0.56±0.039	0.60±0.037	0.57±0.040	0.52±0.028
UCLM→BUSI	yolov8m	0.55±0.041	0.51±0.039	0.55±0.040	0.57±0.038	0.55±0.041	0.56±0.025

Data are presented as mean ± standard deviation. BUSI, Breast Ultrasound Images; F1S, F1 score; IoU, intersection over union; mAP0.5, mean average precision at an IoU threshold of 0.5; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

For the BUSI→UCLM setting, the listed variants exhibit mean Dice scores ranging from 0.65 to 0.71. Among them, yolov5s, yolov8m, and yolo11n achieve relatively higher Dice values, indicating comparatively favorable cross-database performance from BUSI to UCLM. All five listed variants also maintain balanced Precision and Recall, suggesting that their predictions are not biased toward over- or under-segmentation in cross-domain evaluation.

In the UCLM→BUSI scenario, overall performance further declines, with mean Dice scores dropping to the range of 0.55 to 0.60, while yolov9c and yolo11n yield the highest Dice values (0.60±0.038 and 0.59±0.039). This asymmetric generalization gap implies that models trained on UCLM data struggle more when applied to BUSI images.

Overall, these results indicate a substantial generalization gap in BUS lesion segmentation. Some compact or moderately sized models (e.g., yolov8m and yolo11n) show relatively competitive cross-database performance compared with other variants, although the absolute Dice values remain limited under cross-database evaluation.
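
The size and asymmetry of this gap can be made concrete with a short worked calculation. The sketch below takes the mean Dice values of yolo11n from Tables 6 and 7 and computes the absolute and relative drop per training database; the helper name `dice_drop` is illustrative. The comparison is only indicative, because the same-database and cross-database test sets differ, so the drop mixes domain shift with any difference in dataset difficulty.

```python
# Mean Dice of yolo11n, copied from Tables 6 and 7 and keyed by the TRAINING database
intra = {"BUSI": 0.91, "UCLM": 0.96}  # tested on the same database
inter = {"BUSI": 0.69, "UCLM": 0.59}  # tested on the other database

def dice_drop(train_db):
    """Return (absolute drop, relative drop in percent) from same- to cross-database testing."""
    absolute = intra[train_db] - inter[train_db]
    relative = absolute / intra[train_db]
    return round(absolute, 2), round(relative * 100, 1)

for db in ("BUSI", "UCLM"):
    abs_drop, rel_drop = dice_drop(db)
    print(f"trained on {db}: Dice drop {abs_drop:.2f} ({rel_drop:.1f}% of same-database Dice)")
```

For yolo11n, training on BUSI loses 0.22 Dice (about 24%) when moving to UCLM, whereas training on UCLM loses 0.37 Dice (about 39%) when moving to BUSI.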

Perceived intra- and inter-database segmentation degradation

Figure 4 presents a BUSI lesion segmentation case obtained by the yolo11n model. It shows the original image, the ground-truth mask, the intra-database result (BUSI→BUSI), and the inter-database result (UCLM→BUSI). In this example, the intra-database prediction achieved a Dice score of 0.94, whereas the inter-database prediction decreased to 0.73. Compared with the intra-database result, the inter-database mask misses part of the lesion extent and delineates the lesion boundary less completely, which lowers the overlap accuracy.

Figure 4 Comparison of intra- and inter-database segmentation for a BUSI case obtained by the yolo11n model. From left to right: the original BUS image, the ground-truth mask, the same-database prediction (Dice =0.94), and the cross-database prediction (Dice =0.73). Green denotes the ground-truth mask, and blue denotes the predicted segmentation mask. BUS, breast ultrasound; BUSI, Breast Ultrasound Images.
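
To illustrate why a mask that misses part of the lesion extent scores lower, the following sketch computes Dice and IoU for two small binary masks. The masks are toy examples rather than data from Figure 4, the function name `dice_and_iou` is illustrative, and returning 1.0 when both masks are empty is one common convention assumed here.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for two binary masks given as equally sized nested lists of 0/1."""
    pred_px = {(r, c) for r, row in enumerate(pred) for c, v in enumerate(row) if v}
    true_px = {(r, c) for r, row in enumerate(truth) for c, v in enumerate(row) if v}
    overlap = len(pred_px & true_px)
    total = len(pred_px) + len(true_px)
    union = len(pred_px | true_px)
    # Convention (assumption): two empty masks count as perfect agreement
    dice = 2 * overlap / total if total else 1.0
    iou = overlap / union if union else 1.0
    return dice, iou

# Toy ground truth: a 3x2 lesion; toy prediction: only the upper part is recovered
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pred = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
dice, iou = dice_and_iou(pred, truth)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")
```

In this toy case every predicted pixel is correct, yet Dice is only about 0.67 because half of the lesion is missed, which mirrors the under-segmentation pattern described for the cross-database prediction.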

False-positive analysis on normal cases

We evaluated model behavior on normal BUS cases to estimate the risk of false-positive detection. When tested on normal cases in the intra-database settings, yolo11n achieves false-positive rates of 1.37% and 1.65% on the BUSI and UCLM databases, respectively.
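
A minimal sketch of one way such a rate could be computed is given below, assuming an image-level definition in which a normal image counts as a false positive when at least one lesion mask is predicted on it. This definition, the function name `false_positive_rate`, and the counts are assumptions for illustration; the counts are hypothetical and do not come from this study.

```python
def false_positive_rate(detections_per_image):
    """Fraction of normal images on which at least one lesion was predicted."""
    if not detections_per_image:
        raise ValueError("need at least one normal image")
    flagged = sum(1 for n in detections_per_image if n > 0)
    return flagged / len(detections_per_image)

# Hypothetical example: 200 normal images, 3 of which receive one or more predictions
counts = [0] * 197 + [1, 2, 1]
rate = false_positive_rate(counts)
print(f"false-positive rate = {rate:.2%}")
```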

Figure 5 shows two representative examples. In both examples, the highlighted regions correspond to ductal dilatation and radially arranged ductal structures with grape-like terminal ducts rather than true nodules. These regions are not mass lesions, but they may partially resemble focal abnormalities on grayscale ultrasound and may therefore trigger false-positive responses from the models.

Figure 5 False-positive examples on normal cases. From left to right: the original image from BUSI and its false-positive detection, followed by the original image from UCLM and its false-positive detection. Blue denotes the predicted segmentation mask. BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

Training efficiency

Table 8 lists the training time in seconds (s) of the YOLO variants on both datasets. Lightweight models, such as yolov5n, yolov8n, and yolo11n, consistently require the shortest training time, remaining under 300 s on BUSI and at or below roughly 250 s on UCLM. As model capacity increases, training time grows substantially. Medium-scale models, such as yolov8m and yolo11m, take roughly two to three times as long to train as their nano or small counterparts, while large and extra-large variants show a much steeper increase. For instance, yolov9e incurs the highest computational overhead, exceeding 2,200 s on BUSI and 1,900 s on UCLM, making it the most computationally expensive configuration. In addition, training on UCLM is consistently faster than on BUSI across the variants, suggesting dataset-dependent training efficiency, potentially attributable to the difference in dataset size.

Table 8

The training time in seconds of YOLO variants

Variant	BUSI	UCLM
yolov5n	260.68	211.19
yolov5s	381.70	307.32
yolov8n	279.87	242.44
yolov8s	384.61	338.30
yolov8m	694.92	601.91
yolov9c	1,022.99	900.83
yolov9e	2,258.84	1,969.95
yolo11n	288.76	252.25
yolo11s	424.31	373.55
yolo11m	821.81	716.04
yolo11l	984.51	859.18
yolo11x	1,776.02	1,553.23

BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha; YOLO, You Only Look Once.
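
The scaling statements above can be checked directly against Table 8. The sketch below copies the tabulated times and computes the ratio between a larger and a smaller variant; the helper name `slowdown` and the variable names are illustrative.

```python
# Training time in seconds as (BUSI, UCLM), copied from Table 8
train_time_s = {
    "yolov5n": (260.68, 211.19),
    "yolov5s": (381.70, 307.32),
    "yolov8n": (279.87, 242.44),
    "yolov8s": (384.61, 338.30),
    "yolov8m": (694.92, 601.91),
    "yolov9c": (1022.99, 900.83),
    "yolov9e": (2258.84, 1969.95),
    "yolo11n": (288.76, 252.25),
    "yolo11s": (424.31, 373.55),
    "yolo11m": (821.81, 716.04),
    "yolo11l": (984.51, 859.18),
    "yolo11x": (1776.02, 1553.23),
}
DATASET_INDEX = {"BUSI": 0, "UCLM": 1}

def slowdown(larger, smaller, dataset):
    """Ratio of training times between two variants on one dataset."""
    i = DATASET_INDEX[dataset]
    return train_time_s[larger][i] / train_time_s[smaller][i]

for larger, smaller in [("yolov8m", "yolov8n"), ("yolo11m", "yolo11n"), ("yolov9e", "yolov5n")]:
    print(f"{larger} vs {smaller} on BUSI: {slowdown(larger, smaller, 'BUSI'):.2f}x")

# Is UCLM training faster than BUSI training for every variant?
uclm_always_faster = all(uclm < busi for busi, uclm in train_time_s.values())
slowest = max(train_time_s, key=lambda name: train_time_s[name][0])
print(uclm_always_faster, slowest)
```

The medium-to-nano ratios fall between two and three, while yolov9e takes more than eight times as long as yolov5n on BUSI.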

Latency-Dice analysis under controlled inference conditions

Figures 6,7 present the inference latency versus Dice scores of YOLO variants on different databases. For fair comparison, inference latency was re-measured under a controlled condition with a batch size (BS) of 1. In each figure, the left panel illustrates the performance distribution under intra-database (blue dots) and inter-database (orange dots) testing, while the right panel provides a zoomed-in view of the highlighted region in the left panel for observing variants with similar performance.

Figure 6 Inference latency (batch-size-1) versus Dice scores of YOLO variants under intra-database testing on the BUSI dataset and inter-database testing on the UCLM dataset. BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha; YOLO, You Only Look Once.

Figure 7 Inference latency (batch-size-1) versus Dice scores of YOLO variants under intra-database testing on the UCLM dataset and inter-database testing on the BUSI dataset. BUSI, Breast Ultrasound Images; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha; YOLO, You Only Look Once.
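
For readers who wish to reproduce a batch-size-1 latency measurement, a minimal timing harness is sketched below. It is not the measurement code used in this study: the function name `measure_latency_ms`, the warm-up count, and the stand-in `dummy_predict` callable are assumptions for illustration. With a real model on a GPU, the device would also need to be synchronized before the timer is stopped (for example with `torch.cuda.synchronize()` in PyTorch), otherwise asynchronous execution can make the latency appear shorter than it is.

```python
import time
from statistics import mean

def measure_latency_ms(predict, images, warmup=3):
    """Mean per-image latency in milliseconds, one image per call (batch size 1)."""
    for img in images[:warmup]:
        predict(img)  # warm-up calls are excluded from timing
    timings = []
    for img in images:
        start = time.perf_counter()
        predict(img)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)

def dummy_predict(image):
    """Stand-in for a model forward pass (hypothetical, sleeps for about 2 ms)."""
    time.sleep(0.002)
    return image

latency = measure_latency_ms(dummy_predict, list(range(5)))
print(f"mean latency: {latency:.2f} ms per image")
```

Timing each image separately, rather than dividing the total batch time by the batch size, keeps the measurement closer to an interactive scanning scenario in which frames arrive one at a time.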

Time cost and Dice values on the BUSI database

Figure 6 presents the inference latency for intra-database testing on BUSI and inter-database testing on UCLM. A notable gap is observed between intra- and inter-database segmentation performance. In terms of latency, several compact variants remain clustered within a relatively close range.

Under intra-database evaluation, yolo11s (61.69 ms), yolov8s (63.40 ms), yolo11m (63.89 ms), yolo11n (63.99 ms), yolov8m (64.33 ms), and yolov9c (64.36 ms) form a compact low-latency group in the controlled setting. In the inter-database setting (BUSI→UCLM), yolov8n (149.19 ms), yolov8s (151.75 ms), yolo11n (155.84 ms), and yolo11s (157.12 ms) achieve the lowest controlled latency values.

Time cost and Dice values on the UCLM database

Figure 7 presents the inference latency for intra-database testing on UCLM and inter-database testing on BUSI. Generally, a pronounced gap is found between intra- and inter-database segmentation performance. In terms of latency, several YOLOv8 and YOLO11 variants remain within a narrow low-latency range, whereas YOLOv5 and some large-capacity variants show substantially higher inference costs.

For same-database evaluation on UCLM, yolov8s (65.64 ms), yolov8n (66.56 ms), yolo11n (66.77 ms), yolov8m (67.97 ms), yolo11m (69.09 ms), and yolo11s (69.98 ms) achieve the lowest inference latency. When evaluated in the cross-database scenario (UCLM→BUSI), yolov8s (144.63 ms), yolo11s (153.46 ms), yolov8n (154.41 ms), and yolo11n (155.28 ms) remain the most efficient variants under the controlled setting.

Comparison with representative segmentation models

To broaden the benchmark beyond the YOLO family, U-Net (39) and DeepLabV3+ (40) were evaluated under the same image processing, image-level split protocol, and mask-based evaluation pipeline. Table 9 shows the results; the comparison focuses on the Dice, IoU, Precision, Recall, and F1S metrics.

Table 9

Comparison with representative segmentation models

Setting	Model	Dice	IoU	Precision	Recall	F1S
BUSI (same)	U-Net	0.60±0.050	0.52±0.053	0.69±0.071	0.63±0.051	0.60±0.050
	DeepLabV3+	0.69±0.034	0.62±0.032	0.75±0.035	0.69±0.051	0.69±0.034
	yolo11n	0.91±0.014	0.86±0.016	0.90±0.015	0.92±0.013	0.91±0.014
BUSI→UCLM	U-Net	0.24±0.080	0.20±0.086	0.28±0.113	0.25±0.062	0.24±0.080
	DeepLabV3+	0.39±0.091	0.35±0.092	0.45±0.104	0.38±0.087	0.39±0.091
	yolo11n	0.69±0.034	0.67±0.032	0.70±0.033	0.70±0.035	0.69±0.034
UCLM (same)	U-Net	0.74±0.066	0.70±0.066	0.81±0.075	0.74±0.057	0.74±0.066
	DeepLabV3+	0.74±0.061	0.70±0.062	0.79±0.062	0.73±0.060	0.74±0.061
	yolo11n	0.96±0.009	0.94±0.010	0.96±0.009	0.96±0.009	0.96±0.009
UCLM→BUSI	U-Net	0.34±0.019	0.29±0.028	0.46±0.039	0.35±0.022	0.34±0.019
	DeepLabV3+	0.37±0.029	0.32±0.027	0.53±0.037	0.35±0.027	0.37±0.029
	yolo11n	0.59±0.039	0.54±0.037	0.59±0.038	0.61±0.036	0.59±0.039

Data are presented as mean ± standard deviation. BUSI, Breast Ultrasound Images; F1S, F1 score; IoU, intersection over union; UCLM, breast ultrasound lesion segmentation dataset from the University of Castilla-La Mancha.

The yolo11n achieves superior performance in both intra- and inter-database evaluation. Specifically, under intra-database evaluation, DeepLabV3+ outperforms U-Net on the BUSI dataset and the two models achieve close Dice values on UCLM, while their Dice scores are more than 0.20 lower than those of yolo11n. Under inter-database evaluation, DeepLabV3+ yields higher Dice scores than U-Net in both transfer directions (BUSI→UCLM, 0.39±0.091 vs. 0.24±0.080; UCLM→BUSI, 0.37±0.029 vs. 0.34±0.019), while both models remain far inferior to yolo11n.
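
The margins quoted above follow from the mean Dice values in Table 9. As a small worked check, the sketch below computes the margin of yolo11n over the better of the two baselines in each setting; the dictionary layout and the helper name `margin_over_best_baseline` are illustrative, and the standard deviations are ignored.

```python
# Mean Dice per setting, copied from Table 9
dice_table = {
    "BUSI (same)": {"U-Net": 0.60, "DeepLabV3+": 0.69, "yolo11n": 0.91},
    "BUSI->UCLM": {"U-Net": 0.24, "DeepLabV3+": 0.39, "yolo11n": 0.69},
    "UCLM (same)": {"U-Net": 0.74, "DeepLabV3+": 0.74, "yolo11n": 0.96},
    "UCLM->BUSI": {"U-Net": 0.34, "DeepLabV3+": 0.37, "yolo11n": 0.59},
}

def margin_over_best_baseline(setting):
    """Mean Dice of yolo11n minus the higher of the two baseline means."""
    row = dice_table[setting]
    best_baseline = max(v for name, v in row.items() if name != "yolo11n")
    return round(row["yolo11n"] - best_baseline, 2)

margins = {setting: margin_over_best_baseline(setting) for setting in dice_table}
for setting, margin in margins.items():
    print(f"{setting}: +{margin:.2f} Dice")
```

The margin is at least 0.22 in all four settings and largest (0.30) for BUSI→UCLM.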

Discussion

This study systematically evaluated 12 YOLO variants for BUS lesion segmentation under same- and cross-database settings. It is observed that fine-tuned YOLO variants achieve strong same-database performance on both the BUSI and UCLM datasets, whereas segmentation accuracy decreases substantially in cross-database experiments.

Under same-database evaluation, all variants perform well overall. On BUSI, all models achieve Dice values ≥0.81, whereas on UCLM, they all exceed 0.87 (Tables 4,5). The strongest model is not identical across datasets, and the same-database summary in Table 6 indicates that compact-to-medium variants such as yolov9c and yolo11n remain among the most competitive models. This pattern suggests that BUS lesion segmentation may favor architectures whose capacity is well matched to moderate dataset size and to the texture-dominated, low-contrast nature of ultrasound imaging. Compact or medium-scale models may therefore offer a useful balance between representational power and training stability, whereas simply increasing model scale does not necessarily yield higher Dice. However, the reasons why one variant may outperform another could be further explored from the perspectives of architectural differences, representation capabilities, and implementation details.

Cross-database evaluation reveals the main challenge: limited model generalization capacity. The top-tier variants remain at Dice ≤0.71 for BUSI→UCLM and Dice ≤0.60 for UCLM→BUSI (Tables 4,5,7), indicating that absolute cross-database performance is unsatisfactory. The observed generalization gap may be related to multiple factors, including differences in scanners and acquisition protocols, annotation style differences between BUSI and UCLM (33,34), population and lesion heterogeneity, and the low-signal, weak-boundary nature of BUS images. These factors can alter intensity distributions, texture statistics, and lesion delineation criteria, which in turn may reduce overlap-based metrics on unseen data. This phenomenon is common in medical image segmentation. Potential solutions include domain adaptation, domain generalization, image harmonization, and test-time adaptation strategies (10,42,43).

No single model dominates every evaluation setting. When same-database accuracy, cross-database competitiveness, training time (Table 8), and controlled batch-size-1 latency (Figures 6,7) are considered together, yolo11n could be regarded as one of the most balanced variants in the present benchmark. It remains competitive across both same- and cross-database evaluation while also belonging to the faster compact models. This favorable trade-off may be related to its compact capacity and updated feature aggregation design, which may help it avoid some of the instability seen in larger models, such as yolo11l, while retaining sufficient representation for lesion morphology. Encouragingly, yolo11n has been applied across a range of tasks due to its favorable balance between segmentation accuracy and computational efficiency, including thyroid nodule detection and malignancy classification (44), as well as head and neck lesion segmentation in ultrasound images (45).
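
One way to make such a multi-criteria trade-off explicit is a Pareto check, sketched below for two criteria only. This is an illustrative analysis and not part of the study's method: it uses the mean BUSI Dice of the five variants listed in Table 6 and their BUSI training times from Table 8, ignores standard deviations, cross-database accuracy, and latency, and the function names are hypothetical.

```python
# (mean Dice on BUSI from Table 6, BUSI training time in seconds from Table 8)
candidates = {
    "yolov5s": (0.93, 381.70),
    "yolo11n": (0.91, 288.76),
    "yolov8s": (0.90, 384.61),
    "yolov9c": (0.89, 1022.99),
    "yolov9e": (0.87, 2258.84),
}

def dominates(a, b):
    """True if a is at least as good as b on both criteria and strictly better on one."""
    (dice_a, time_a), (dice_b, time_b) = a, b
    return dice_a >= dice_b and time_a <= time_b and (dice_a > dice_b or time_a < time_b)

def pareto_front(models):
    """Names of models that no other model dominates (higher Dice, lower time)."""
    return sorted(
        name
        for name, point in models.items()
        if not any(dominates(other, point) for o, other in models.items() if o != name)
    )

front = pareto_front(candidates)
print(front)
```

On these two criteria alone, only yolov5s and yolo11n are non-dominated: yolov5s offers the highest Dice and yolo11n the shortest training time among the five, which is consistent with describing yolo11n as a balanced choice rather than the single best model.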

The qualitative and specificity-oriented analyses help interpret the numerical findings from a clinical-assistance perspective. The yolo11n example (Figure 4) shows Dice decreasing from 0.94 under same-database evaluation to 0.73 under cross-database evaluation, with partial loss of lesion extent and a less complete boundary. This visual example is consistent with the quantitative domain-shift gap. The false-positive analysis on normal cases (Figure 5) revealed low but nonzero false-positive rates (1.37% on BUSI and 1.65% on UCLM), and the highlighted regions still corresponded to structures that could warrant secondary inspection. In a screening-assistance scenario, such outputs may be clinically useful as candidate-region prompts for further review, but they should not be interpreted as standalone diagnostic decisions and the final judgment should remain with the clinician.

From a broader positioning perspective, the YOLO family showed practical appeal relative to representative segmentation baselines. Under the same image-level protocol, U-Net (39) and DeepLabV3+ (40) remained below the stronger YOLO variants, particularly in cross-database testing. Nevertheless, several limitations remain. Firstly, the current benchmark used two relatively small datasets and image-level rather than patient-level partitioning, because no subject identifiers are provided. Meanwhile, the comparison focused on selected YOLO generations rather than all available releases, such as the latest version YOLO26 (46). In addition, larger multi-institutional datasets with standardized annotation, broader normal-case evaluation, patient-level and external validation, and methods designed to improve cross-database robustness are still needed before BUS segmentation models can be considered more reliable for clinical deployment. Future work should also continue comparing YOLO variants with broader BUS-specific and foundation-model-style segmentation approaches, such as MedSAM (47).

Conclusions

This study benchmarked 12 YOLO variants for BUS lesion segmentation under same- and cross-database settings and showed that strong same-database accuracy does not translate into strong cross-database performance. No single model was best in every evaluation setting; however, yolo11n emerged as one of the more balanced variants when segmentation accuracy, cross-database competitiveness, and computational efficiency were considered together. Future BUS segmentation studies should prioritize larger multi-institutional datasets, patient-level and external validation, and methods designed to improve robustness, generalization capacity, and clinical reliability.

Acknowledgments

We sincerely acknowledge the support provided by the Public Computing Cloud of the Communication University of China (CUC).

Footnote

Reporting Checklist: The authors have completed the CLAIM reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0333/rc

Funding: This study was supported by the Shenzhen Science and Technology Program (Nos. KQTD20180411185028798 and JCYJ20241202124902004).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0333/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Kim J, Harper A, McCormack V, Sung H, Houssami N, Morgan E, Mutebi M, Garvey G, Soerjomataram I, Fidler-Benaoudia MM. Global patterns and trends in breast cancer incidence and mortality across 185 countries. Nat Med 2025;31:1154-62.
2. Benitez Fuentes JD, Morgan E, de Luna Aguilar A, Mafra A, Shah R, Giusti F, Vignat J, Znaor A, Musetti C, Yip CH, Van Eycken L, Jedy-Agba E, Piñeros M, Soerjomataram I. Global Stage Distribution of Breast Cancer at Diagnosis: A Systematic Review and Meta-Analysis. JAMA Oncol 2024;10:71-8.
3. Díaz O, Rodríguez-Ruíz A, Sechopoulos I. Artificial Intelligence for breast cancer detection: Technology, challenges, and prospects. Eur J Radiol 2024;175:111457.
4. Jiang Y, Mao J. Comparative study of mammography and breast ultrasound in the diagnosis of breast cancer in young women: a meta-analysis. J Radiat Res Appl Sci 2025;18:101767.
5. Mashekova A, Zhao MY, Zarikas V, Mukhmetov O, Aidossov N, Ng EYK, Wei D, Shapatova M. Review of Artificial Intelligence Techniques for Breast Cancer Detection with Different Modalities: Mammography, Ultrasound, and Thermography Images. Bioengineering (Basel) 2025;12:1110.
6. Adam R, Dell'Aquila K, Hodges L, Maldjian T, Duong TQ. Deep learning applications to breast cancer detection by magnetic resonance imaging: a literature review. Breast Cancer Res 2023;25:87.
7. Iacob R, Iacob ER, Stoicescu ER, Ghenciu DM, Cocolea DM, Constantinescu A, Ghenciu LA, Manolescu DL. Evaluating the Role of Breast Ultrasound in Early Detection of Breast Cancer in Low- and Middle-Income Countries: A Comprehensive Narrative Review. Bioengineering (Basel) 2024;11:262.
8. Yu S, Wu S, Zhuang L, Wei X, Sak M, Neb D, Hu J, Xie Y. Efficient Segmentation of a Breast in B-Mode Ultrasound Tomography Using Three-Dimensional GrabCut (GC3D). Sensors (Basel) 2017;17:1827.
9. Wu S, Yu S, Zhuang L, Wei X, Sak M, Duric N, Hu J, Xie Y. Automatic Segmentation of Ultrasound Tomography Image. Biomed Res Int 2017;2017:2059036.
10. Xiao X, Zhang J, Shao Y, Liu J, Shi K, He C, Kong D. Deep Learning-Based Medical Ultrasound Image and Video Segmentation Methods: Overview, Frontiers, and Challenges. Sensors (Basel) 2025;25:2361.
11. Sultana F, Sufian A, Dutta P. Evolution of image segmentation using deep convolutional neural network: a survey. Knowl Based Syst 2020;201:106062.
12. Wang K, Liang S, Zhong S, Feng Q, Ning Z, Zhang Y. Breast ultrasound image segmentation: A coarse-to-fine fusion convolutional neural network. Med Phys 2021;48:4262-78.
13. Chen G, Li L, Dai Y, Zhang J, Yap MH. AAU-Net: An Adaptive Attention U-Net for Breast Lesions Segmentation in Ultrasound Images. IEEE Trans Med Imaging 2023;42:1289-300.
14. Yang J, Fan L, Dong B, Chen H, Liu X. Pyramid boundary attention network for breast lesion segmentation in ultrasound images. Biomed Signal Process Control 2025;101:107241.
15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial networks. Commun ACM 2020;63:139-44.
16. Han L, Huang Y, Dou H, Wang S, Ahamad S, Luo H, Liu Q, Fan J, Zhang J. Semi-supervised segmentation of lesion from breast ultrasound images with attentional generative adversarial network. Comput Methods Programs Biomed 2020;189:105275.
17. Xing J, Li Z, Wang B, Qi Y, Yu B, Zanjani FG, Zheng A, Duits R, Tan T. Lesion Segmentation in Ultrasound Using Semi-Pixel-Wise Cycle Generative Adversarial Nets. IEEE/ACM Trans Comput Biol Bioinform 2021;18:2555-65.
18. Singh VK, Abdel-Nasser M, Akram F, Rashwan HA, Sarker MMK, Pandey N, Romani S, Puig D. Breast tumor segmentation in ultrasound images using contextual-information-aware deep adversarial learning framework. Expert Syst Appl 2020;162:113870.
19. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst 2017;30:5998-6008.
20. He Q, Yang Q, Xie M. HCTNet: a hybrid CNN-transformer network for breast ultrasound image segmentation. Comput Biol Med 2023;155:106629.
21. Yang H, Yang D. CSwin-PNet: a CNN-Swin Transformer combined pyramid network for breast lesion segmentation in ultrasound images. Expert Syst Appl 2023;213:119024.
22. Zhang H, Lian J, Yi Z, Wu R, Lu X, Ma P, Ma Y. HAU-Net: hybrid CNN-transformer for breast ultrasound image segmentation. Biomed Signal Process Control 2024;87:105427.
23. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52.
24. Ali ML, Zhang Z. The YOLO framework: a comprehensive review of evolution, applications, and benchmarks in object detection. Computers 2024;13:336.
25. Cao Z, Duan L, Yang G, Yue T, Chen Q. An experimental study on breast lesion detection and classification from ultrasound images using deep learning architectures. BMC Med Imaging 2019;19:51.
26. Samanta PK, Basuli A, Rout NK, Panda G. Improved breast cancer detection from ultrasound images using YOLOv8 model. 2023 IEEE 3rd International Conference on Applied Electromagnetics, Signal Processing, & Communication (AESPC); 24-26 November 2023; Bhubaneswar, India. IEEE; 2023;1-6.
27. Li W, Ye X, Chen X, Jiang X, Yang Y. A deep learning-based method for the detection and segmentation of breast masses in ultrasound images. Phys Med Biol 2024;69:155027.
28. Wang H, Li C, Li Z, Du Y, Zhou Z, Wu J. Breast ultrasound tumor detection based on improved YOLOv8s-OBB algorithm. 2024 5th International Conference on Intelligent Computing and Human-Computer Interaction (ICHCI); 24-26 May 2024; Nanchang, China. IEEE; 2024;120-5.
29. Du Y, Liu W, Wang Y, Li R, Xie L. YOLO-CPC: a breast tumor detection and identification algorithm based on improved YOLOv7. Signal Image Video Process 2025;19:260.
30. Mostafa AM, Alaerjan AS, Aldughayfiq B, Allahem H, Mahmoud AA, Said W, Shabana H, Ezz M. Optimized YOLOv8 for enhanced breast tumor segmentation in ultrasound imaging. Discov Oncol 2025;16:1152.
31. Ariyametkul A, Paing MP. Analyzing explainability of YOLO-based breast cancer detection using heat map visualizations. Quant Imaging Med Surg 2025;15:6252-71.
32. Ma Q, Wang J, Dong B, Yang J, Zhou W, Zhang D, Cheng D, Qin X, Zhang H, Jiang F, Zhang C. YOLO AI model based on an automated breast volume scanner for the detection of benign and malignant breast lesions. Quant Imaging Med Surg 2025;15:10156-67.
33. Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data Brief 2020;28:104863.
34. Vallez N, Bueno G, Deniz O, Rienda MA, Pastor C. BUS-UCLM: Breast ultrasound lesion segmentation dataset. Sci Data 2025;12:242.
35. Jocher G. Ultralytics YOLOv5. Accessed on 30 March 2026. Available online: https://github.com/ultralytics/yolov5
36. Jocher G, Chaurasia A, Qiu J. Ultralytics YOLOv8. Accessed on 30 March 2026. Available online: https://github.com/ultralytics/ultralytics
37. Wang CY, Yeh IH, Liao HYM. YOLOv9: learning what you want to learn using programmable gradient information. Computer Vision – ECCV 2024; 29 September–4 October, 2024; Milan, Italy. Springer; 2024;1-21.
38. Ultralytics. YOLO11. Accessed on 30 March 2026. Available online: https://github.com/ultralytics/ultralytics
39. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015; 5-9 October, 2015; Munich, Germany. Springer; 2015;234-41.
40. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Computer Vision – ECCV 2018; 8-14 September, 2018; Munich, Germany. Springer; 2018;833-51.
41. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft COCO: common objects in context. Computer Vision – ECCV 2014; 6-12 September, 2014; Zurich, Switzerland. Springer; 2014;740-55.
42. Guan H, Liu M. Domain Adaptation for Medical Image Analysis: A Survey. IEEE Trans Biomed Eng 2022;69:1173-85.
43. Yoon JS, Oh K, Shin Y, Mazurowski MA, Suk HI. Domain generalization for medical image analysis: a review. Proc IEEE 2024;112:1583-609.
44. Yang J, Luo Z, Wen Y, Zhang J. Artificial intelligence-enhanced ultrasound imaging for thyroid nodule detection and malignancy classification: a study on YOLOv11. Quant Imaging Med Surg 2025;15:7964-76.
45. Tsumura R, Tomioka T, Koseki Y, Yoshinaka K. Personalized scan path planning for robotic ultrasound in head and neck lesions. Int J Comput Assist Radiol Surg 2025;
46. Jocher G, Qiu J. Ultralytics YOLO26. Accessed on 15 April 2026. Available online: https://docs.ultralytics.com/models/yolo26/
47. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun 2024;15:654.

Cite this article as: Yu S, Huang M, Chen E, Zhu B, Liang X, Xie Y. Benchmarking YOLOs in breast ultrasound lesion segmentation. Quant Imaging Med Surg 2026;16(7):574. doi: 10.21037/qims-2026-1-0333
