MFA-Net: multi-scale feature aggregation network with background-aware module for ultrasound segmentation of thyroid nodules
Introduction
Thyroid nodules are a common disorder of the endocrine system and have become an important global health concern. These nodules appear as abnormal growths within the thyroid and arise from a variety of causes, including benign hyperplasia, cyst formation, benign tumors, or malignant transformation (1). With the continuous advancement of imaging technologies, doctors can utilize them for early screening and diagnosis of thyroid nodules, which is critical to determining their nature and developing treatment plans. Among the various techniques, ultrasound imaging has become the frontline choice for thyroid nodule assessment due to its low cost, absence of ionizing radiation, and ability to scan continuously. However, because of blurred nodule boundaries, low contrast, and the complex structure of thyroid tissue, the identification and segmentation of nodules depend heavily on a doctor’s personal experience, which can easily lead to misdiagnosis or missed diagnosis (2). Therefore, using artificial intelligence to develop efficient thyroid nodule segmentation algorithms can help doctors provide patients with more reliable treatment plans and improve diagnostic efficiency.
Thyroid nodules in ultrasound images usually vary in size, shape, and appearance and often resemble the surrounding thyroid tissue. Additionally, ultrasound imaging is inherently susceptible to various types of noise, stemming from both environmental interference and instrumental limitations during tissue propagation. Given these complexities, traditional segmentation techniques that rely on handcrafted features or threshold-based methods frequently struggle to provide accurate and reliable results. With the development of deep learning, however, new opportunities have emerged, such as U-Net (3), DESENet (4), BMANet (5) and NLIE-UNet (6). These encoder-decoder models utilize hierarchical feature extraction and multi-scale information processing to improve segmentation accuracy. Despite their effectiveness, they are not without limitations. One key issue is improper feature fusion, which can introduce inconsistencies across different network layers. Additionally, during the encoding process, lesion boundary details may be gradually lost through repeated down-sampling operations. Consequently, further optimization of feature aggregation and boundary refinement strategies is essential to improve the robustness and generalization ability of thyroid nodule segmentation.
The segmentation algorithms for thyroid nodules generally fall into traditional models and deep learning-based models. Traditional methods primarily rely on gray-level intensity, texture features, or geometric information for feature extraction and region-based segmentation, offering high computational efficiency and mature theory. However, the inherent characteristics of ultrasound images, such as speckle noise and low contrast, often lead to inconsistent and inaccurate segmentation results. In contrast, deep learning-based algorithms leverage powerful convolutional and transformer architectures to automatically extract high-level features, which can effectively address the challenges posed by ultrasonic noise and varied nodule shapes. Among them, Wu et al. (7) incorporated deep convolutional layers within the encoder-decoder framework of Swin Transformer, effectively strengthening the representation of both global and local features. Furthermore, they introduced a multi-scale feature fusion module that enables more effective integration and exchange of features across different hierarchical levels. Li et al. (8) proposed a global structure-enhanced decoder that can effectively enhance feature representation and ensure more accurate boundary depiction. Hu et al. (9) designed a dual-decoder branch architecture by integrating Mamba and ResNet-34 to enhance feature extraction and segmentation performance. Ozcan et al. (10) proposed a hybrid segmentation model that enhances remote dependencies and the ability to capture global context information. Ali et al. (11) presented an encoder architecture with dense connections to enhance feature propagation and reuse across different network layers. Zheng et al. (12) adopted a cascade convolution strategy, which can effectively capture multi-scale context information while maintaining a larger receptive field.
The attention mechanism (13,14) plays a crucial role in enhancing feature representation by selectively focusing on important regions while minimizing the impact of background noise and irrelevant information. By dynamically adjusting the weights of different spatial locations, channels, or positional elements, the attention mechanism can improve model performance and generalization ability. Due to its effectiveness, the attention mechanism has been widely integrated into semantic segmentation, object detection, and image classification. Among them, Gu et al. (15) presented a multi-scale coordinate attention algorithm to enhance feature representation by effectively capturing spatial and contextual dependencies across different scales. Ni et al. (16) combined channel attention and positional attention to enhance feature representation by jointly capturing global dependencies and spatial relationships within an image. Shang et al. (17) introduced a cascaded attention fusion module, which enhanced feature representation by progressively integrating multi-level attention mechanisms. Qi et al. (18) combined graph-based convolution operations with attention mechanisms, which allowed for more focused and efficient encoding of relevant information. Apart from the utilization of attention mechanisms, several emerging technologies have demonstrated promising performance. Among them, Wu et al. (19) combined dynamic condition encoding with a feature frequency parser and developed the first general medical image segmentation framework that utilizes the diffusion probabilistic model. Liu et al. (20) introduced a scale-aware pyramidal feature learning strategy, which explicitly exploits multi-scale contextual information to strengthen feature representation.
In this paper, we present a novel multi-scale feature aggregation network (MFA-Net) with a background-aware module (BAM) for thyroid nodule segmentation. Unlike existing segmentation approaches, our MFA-Net leverages an encoder-decoder architecture combined with multi-scale feature extraction to enhance segmentation accuracy and robustness. By simultaneously modeling global dependencies across both spatial and channel dimensions, the network effectively captures long-range contextual information while preserving fine-grained details. The key contributions of our research are outlined as follows:
- The multi-scale feature aggregation module (MFAM) is designed to effectively capture contextual information at multiple levels, ensuring a comprehensive understanding of both local and global image structures. This design significantly enhances the model’s adaptability to different shapes, sizes and appearances of thyroid nodules, and improves the overall performance under complex imaging conditions.
- The background-aware mechanism is integrated to suppress irrelevant or non-informative regions, guiding the focus of the network to more prominent foreground areas. By refining feature selection, this mechanism enhances the distinction between nodules and surrounding tissue and reduces interference from background noise.
- The residual decoder module (RDM) is augmented with spatial and channel attention mechanisms, which work synergistically to highlight critical features while diminishing the influence of less relevant information. This dual-attention strategy improves segmentation precision by preserving structural integrity and enhancing boundary clarity, ultimately leading to more accurate and robust segmentation results.
We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1364/rc).
Methods
This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Overview
Figure 1 illustrates the overall architecture of MFA-Net, which is a deep learning framework following the encoder-decoder structure. The encoder adopts Encoder-blocks while the decoder uses Decoder-blocks, both of which are designed for the encoding and decoding of thyroid nodule images. In the encoding stage, each Encoder-block employs a double-convolution block inspired by the U-Net structure, which enhances feature extraction while preserving fine-grained local details. To enhance the segmentation performance, MFA-Net incorporates MFAM and BAM. These modules are responsible for capturing multi-scale contextual information and suppressing interference information, respectively. In each layer, the MFAM processes the output features from the corresponding Encoder-block and performs multi-scale contextual information acquisition. Then, the up-sampling (UP) module (a transposed convolution implemented with the ConvTranspose2d function) enhances feature resolution by up-sampling the output of the next-layer Decoder-block, making these refined features available for both the current Decoder-block and the prediction output. Meanwhile, the BAM strengthens model focus by applying attention mechanisms to distinguish foreground from background, leveraging both the Encoder-block’s feature representations and the predictive output from the UP module. Throughout the decoding process, the Decoder-block accepts the output features from the BAM, MFAM, and UP modules. Finally, the highest-level Decoder-block performs final refinements using a 1×1 convolution followed by a Sigmoid activation, ultimately generating the model’s segmentation prediction.
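As a rough illustration of this data flow, the wiring of one decoder level can be sketched in NumPy. This is a minimal sketch, not the implementation: MFAM is reduced to an identity placeholder, the Decoder-block to channel concatenation, and the ConvTranspose2d-based UP module to nearest-neighbour repetition; all channel counts and spatial sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample2x(x):
    # Nearest-neighbour stand-in for the ConvTranspose2d-based UP module.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def mfam(feat):
    # Identity placeholder for the learned multi-scale aggregation module.
    return feat

def bam(feat, pred):
    # Background-aware weighting plus skip connection (see the BAM section):
    # (1 - sigmoid(pred)) assigns higher weights to background positions.
    return feat * (1.0 - sigmoid(pred)) + feat

def decoder_block(mfam_feat, bam_feat, up_feat):
    # Placeholder Decoder-block: concatenate the three inputs channel-wise.
    return np.concatenate([mfam_feat, bam_feat, up_feat], axis=0)

# One decoder level: an encoder feature map (C=4) at 32x32 and a coarser
# 1-channel prediction logit at 16x16 from the next-layer Decoder-block.
enc_feat = np.random.randn(4, 32, 32)
coarse_pred = np.random.randn(1, 16, 16)

up_pred = upsample2x(coarse_pred)                       # (1, 32, 32)
fused = decoder_block(mfam(enc_feat),
                      bam(enc_feat, up_pred),
                      np.broadcast_to(up_pred, (4, 32, 32)))
```

The fused tensor carries channels from all three inputs, which is exactly what the RDM inside each Decoder-block consumes.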
MFAM
As illustrated in Figure 2, the MFAM begins by receiving the feature maps from the Encoder-block. To enhance feature representation across different spatial scales, the input features are simultaneously passed through multiple convolutional branches, each utilizing distinct kernel sizes to extract diverse patterns and structural details. Firstly, the 1×5 and 5×1 convolutions are responsible for capturing fine-grained horizontal and vertical edge information. To further extend spatial awareness, the 1×7 and 7×1 convolutions contribute to capturing long-range dependencies. Finally, the 1×11 and 11×1 convolutions provide the largest receptive field, capturing both fine details and broader spatial relationships. Once features are extracted at multiple scales, their outputs are concatenated along the channel dimension to construct a comprehensive multi-scale feature representation. This aggregated feature map is subsequently processed through a 1×1 convolution, which reduces channel dimensions while refining feature interactions. Moreover, the processed features are passed through a Sigmoid function, which normalizes values between 0 and 1, thereby learning attention-based weights for different feature regions. Following this, the Sigmoid-activated feature undergoes element-wise multiplication with the original encoder feature. This operation ensures that salient features are emphasized, while less relevant information is suppressed. The MFAM module plays a vital role in MFA-Net, as it effectively captures multi-scale contextual dependencies, leading to enhanced thyroid nodule segmentation with improved accuracy and robustness.
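The branch-aggregate-gate logic of MFAM can be sketched in NumPy as follows. This is a sketch under stated assumptions: randomly generated arrays stand in for the outputs of the learned 1×5/5×1, 1×7/7×1 and 1×11/11×1 convolution branches, a channel-mixing matrix stands in for the 1×1 convolution, and all shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
enc_feat = rng.standard_normal((C, H, W))

# Stand-ins for the three multi-scale branch outputs; in the real module
# each is a learned separable convolution applied to enc_feat.
branches = [rng.standard_normal((C, H, W)) for _ in range(3)]

# Concatenate along the channel dimension -> (3C, H, W).
multi_scale = np.concatenate(branches, axis=0)

# 1x1 convolution as a learned channel-mixing matrix (3C -> C).
w_1x1 = rng.standard_normal((C, 3 * C)) / np.sqrt(3 * C)
mixed = np.einsum('oc,chw->ohw', w_1x1, multi_scale)

# Sigmoid normalises values to (0, 1), yielding per-position attention.
attn = 1.0 / (1.0 + np.exp(-mixed))

# Element-wise multiplication re-weights the original encoder feature.
out = attn * enc_feat
```

The final multiplication is what lets salient multi-scale responses amplify the encoder feature while suppressing less relevant regions.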
BAM
As depicted in Figure 3, the BAM plays a crucial role in enhancing the model’s ability to discriminate between foreground (i.e., thyroid nodules) and background (surrounding non-nodule regions). Specifically, BAM takes two inputs: the feature map output from the Encoder-block and the prediction map generated by the UP module. To build an attention mechanism, BAM first computes the background-aware map by applying 1-pred to the prediction maps, where pred represents the probability map obtained by applying the Sigmoid activation to the predicted result. This transformation effectively inverts the predictions, ensuring that higher weights are assigned to background areas while foreground features are suppressed. The computed attention map is then applied to the encoder feature map channel-wise through element-wise multiplication. The attention-refined features are subsequently processed through a 3×3 convolution layer and a squeeze-and-excitation block (21,22), which enhances channel-wise attention by dynamically adjusting feature importance, as shown in Figure 4. To preserve rich spatial information and prevent excessive feature loss, a skip connection is introduced to add the original encoder features back to the refined attention-enhanced features. By integrating BAM into the network, MFA-Net effectively suppresses background noise and improves prediction accuracy.
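A minimal NumPy sketch of the background-aware weighting follows. It is illustrative only: the 3×3 convolution is omitted, a parameter-free global-average-pool gate stands in for the full squeeze-and-excitation block (whose learned MLP is not reproduced), and the shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
C, H, W = 8, 16, 16
enc_feat = rng.standard_normal((C, H, W))
pred_logit = rng.standard_normal((1, H, W))   # up-sampled prediction logit

# Background-aware map: invert the foreground probability so background
# positions receive the higher weights.
pred = sigmoid(pred_logit)
bg_map = 1.0 - pred                           # (1, H, W)

# Apply the map channel-wise via broadcasting.
refined = enc_feat * bg_map                   # (C, H, W)

# Squeeze-and-excitation stand-in: global average pool -> channel gates
# (the learned two-layer MLP of the real SE block is omitted here).
excite = sigmoid(refined.mean(axis=(1, 2)))   # (C,)

# Skip connection adds the original encoder features back.
out = refined * excite[:, None, None] + enc_feat
```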
RDM
As illustrated in Figure 5, the Decoder-block incorporates an RDM, which processes inputs from the BAM, MFAM, and UP. To ensure comprehensive feature representation, these inputs are first concatenated together. Once concatenated, the features are refined through two successive 3×3 convolutional layers. To further optimize learning, a 1×1 convolution-based residual connection is introduced in parallel. Following feature transformation, the output is refined through an attention mechanism comprising channel attention and spatial attention. Ultimately, this structured decoding approach results in a well-defined feature representation at this stage, contributing to a more precise and context-aware final segmentation output.
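The concatenation-plus-residual structure of the RDM can be sketched as follows. As an assumption for brevity, a single 1×1 channel-mixing step stands in for the two learned 3×3 convolutions, the weights are random, and the subsequent channel/spatial attention refinement is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 16, 16
bam_feat = rng.standard_normal((C, H, W))
mfam_feat = rng.standard_normal((C, H, W))
up_feat = rng.standard_normal((C, H, W))

# 1. Concatenate the BAM, MFAM, and UP inputs along the channel axis.
x = np.concatenate([bam_feat, mfam_feat, up_feat], axis=0)   # (3C, H, W)

def conv1x1(x, w):
    # 1x1 convolution expressed as channel mixing; stands in for the
    # learned convolutional layers of the real module.
    return np.einsum('oc,chw->ohw', w, x)

w_main = rng.standard_normal((C, 3 * C)) / np.sqrt(3 * C)
w_res = rng.standard_normal((C, 3 * C)) / np.sqrt(3 * C)

# 2. Main path (two 3x3 convs in the paper; one channel mix here) summed
#    with the parallel 1x1 residual projection, as in Figure 5.
out = conv1x1(x, w_main) + conv1x1(x, w_res)
```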
As shown in Figure 6, the channel attention module (23,24) enriches feature representation by selectively emphasizing important channels, and its input feature map has dimensions of C × H × W. To extract inter-channel dependencies, we employ two parallel processing branches: one branch reshapes the input tensor into a matrix of size C × (H × W) to facilitate channel-wise interaction, and the second branch applies both reshaping and a transpose operation, transforming the feature map into a (H × W) × C matrix. The two transformed feature representations are then multiplied using a dot product to compute a channel affinity matrix of size C × C. This matrix effectively captures the relationships between different channels. To adjust the attention distribution, the calculated channel association matrix is normalized by Softmax to ensure that the sum of all weights equals one.
Subsequently, the resulting attention matrix is applied to the reshaped feature matrix through matrix multiplication. To preserve essential original information while integrating learned attention-based improvements, the refined output is added back to the original input through a residual connection. This approach enhances feature learning stability and ensures that the output retains the same spatial dimensions C×H×W as the initial input, while benefiting from refined channel-wise feature weighting.
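The reshape, affinity, Softmax, and residual steps above can be sketched directly in NumPy; the tensor sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 6, 8, 8
x = rng.standard_normal((C, H, W))

# Branch 1: reshape to C x (H*W); branch 2 is its transpose, (H*W) x C.
flat = x.reshape(C, H * W)

# Channel affinity matrix (C x C) via dot product, then row-wise Softmax
# so each row of attention weights sums to one.
affinity = flat @ flat.T
affinity -= affinity.max(axis=1, keepdims=True)     # numerical stability
attn = np.exp(affinity) / np.exp(affinity).sum(axis=1, keepdims=True)

# Apply the attention matrix to the reshaped features, restore the
# C x H x W shape, and add the residual connection.
out = (attn @ flat).reshape(C, H, W) + x
```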
As illustrated in Figure 7, the spatial attention module (25,26) functions similarly to the channel attention module but differs in its primary focus. Rather than capturing global dependencies between channels, spatial attention assigns higher weights to key areas such as object boundaries and fine structural details by analyzing spatial dependencies. By dynamically adjusting attention weights across spatial positions, the module ensures that essential features are more prominent, which is especially advantageous in segmentation tasks where precise object delineation is required.
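A matching NumPy sketch of spatial attention, where the affinity is computed between the H×W positions rather than between channels; shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
C, H, W = 6, 8, 8
x = rng.standard_normal((C, H, W))

# The spatial affinity matrix relates positions: (H*W) x (H*W).
flat = x.reshape(C, H * W)                      # C x (H*W)
affinity = flat.T @ flat                        # (H*W) x (H*W)
affinity -= affinity.max(axis=1, keepdims=True) # numerical stability
attn = np.exp(affinity) / np.exp(affinity).sum(axis=1, keepdims=True)

# Each position becomes a weighted sum over all positions, emphasising
# boundaries and structural details; the residual connection is kept.
out = (flat @ attn.T).reshape(C, H, W) + x
```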
Loss function
Unlike conventional loss functions that only operate on a pixel-by-pixel basis, we employ a combined loss function consisting of Dice loss (27) and binary cross-entropy (BCE) loss (28) to optimize the segmentation network. Among them, Dice loss is designed to evaluate the similarity between predicted segmentation maps and ground truth labels by directly computing their overlap. Additionally, BCE loss provides stable pixel-level supervision by penalizing misclassified pixels, which is particularly effective in handling class imbalance. By integrating these two components, the joint loss leverages the strengths of both global region-level optimization and local pixel-wise accuracy. For numerical stability, the joint loss is computed on the probability maps obtained after applying the Sigmoid activation to the network output. The mathematical formulation of the combined loss is as follows:
L = α·L_Dice + β·L_BCE, with
L_Dice = 1 − 2Σ(y_i·p_i) / (Σy_i + Σp_i + ε),
L_BCE = −(1/N)·Σ[y_i·log(p_i) + (1 − y_i)·log(1 − p_i)],
where N is the number of pixels, y_i and p_i denote the labeled value and predicted probability of pixel i, respectively, and ε is a small smoothing constant for numerical stability. α and β are weighting coefficients that balance the contributions of the two loss terms.
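A minimal NumPy sketch of this combined Dice + BCE loss; the equal weights α = β = 0.5 and the ε value are assumptions, since the paper does not state the coefficients here.

```python
import numpy as np

def combined_loss(logits, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Dice + BCE loss on Sigmoid probabilities (alpha/beta are assumed)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, eps, 1.0 - eps)                 # numerical stability
    # Dice loss: one minus the soft overlap of prediction and label.
    dice = 1.0 - (2.0 * (p * target).sum() + eps) / (p.sum() + target.sum() + eps)
    # BCE loss: pixel-wise log penalty on misclassified pixels.
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    return alpha * dice + beta * bce

# A confident, nearly correct prediction yields a small combined loss.
logits = np.array([[4.0, -4.0], [6.0, -6.0]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
loss = combined_loss(logits, target)
```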
Dataset description
To comprehensively assess the generalizability of MFA-Net, we conducted extensive experiments on four benchmark datasets: thyroid nodule 3493 (TN3K), thyroid gland 3583 (TG3K), digital database thyroid image (DDTI) (29) and BrainTumor (30). The detailed characteristics of these datasets are summarized in Table 1.
Table 1
| Dataset | Number | Train | Validate | Test |
|---|---|---|---|---|
| TN3K | 3,493 | 2,160 | 719 | 614 |
| TG3K | 3,583 | 2,152 | 717 | 716 |
| DDTI | 637 | 383 | 127 | 127 |
| BrainTumor | 3,064 | 1,839 | 613 | 612 |
DDTI, digital database thyroid image; TG3K, thyroid gland 3583 dataset; TN3K, thyroid nodule 3493 dataset.
The TN3K dataset comprises 3,493 ultrasound images, each containing at least one thyroid nodule region. The images in TN3K are carefully curated to include diverse nodule shapes, sizes, and locations, providing a robust foundation for assessing segmentation performance across varying anatomical structures. The TG3K dataset consists of 3,583 ultrasound images, where the thyroid gland region has been precisely segmented from ultrasound video sequences. To ensure data quality and relevance, only images where the thyroid gland occupies at least 6% of the total image area are considered. The DDTI dataset contains 637 ultrasound images of the thyroid nodule, each annotated with pixel-level segmentation masks obtained from a single ultrasound imaging device. The BrainTumor dataset provides a well-structured resource tailored for advancing brain tumor segmentation research. It includes a total of 3,064 magnetic resonance imaging (MRI) brain scans, each precisely aligned with a manually annotated binary mask that delineates the tumor regions.
Experimental scheme
All experiments were performed utilizing the PyTorch framework and executed on a high-performance NVIDIA RTX A6000 GPU. To optimize the model, we employed the Adam optimizer, initializing its learning rate at 1×10−3 to facilitate stable and efficient convergence. During the training phase, we configured the batch size to 32 and the number of iterations to 200. To ensure consistency across the dataset, all input ultrasound images were resized to 256×256 pixels before being fed into the network.
To thoroughly evaluate the effectiveness of MFA-Net and ensure a fair comparison with some well-established segmentation methods, we employ Dice (31,32), intersection over union (IoU) (33,34), accuracy (35,36) and Matthews correlation coefficient (Mcc) (37,38) as the performance metrics. With TP, TN, FP and FN denoting the numbers of true-positive, true-negative, false-positive and false-negative pixels, respectively, these metrics are defined as:
Dice = 2TP / (2TP + FP + FN),
IoU = TP / (TP + FP + FN),
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Mcc = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
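The four metrics can be computed from binary masks via TP/TN/FP/FN counts, as in the following NumPy sketch; the ε smoothing constant is an assumption added to avoid division by zero on empty masks.

```python
import numpy as np

def metrics(pred, gt, eps=1e-6):
    """Dice, IoU, accuracy and Mcc for binary masks (eps is an assumption)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    mcc = (tp * tn - fp * fn) / (np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps)
    return dice, iou, acc, mcc

# Tiny 2x3 example: 2 TP, 2 TN, 1 FP, 1 FN.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
d, i, a, m = metrics(pred, gt)
```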
Results
Experimental results on TN3K dataset
To evaluate the performance of MFA-Net, a series of comparative experiments was conducted on the TN3K dataset, with the results summarized in Table 2. These methods include U-Net (3), BSNet (39), MDA-Net (40), HFENet (41), MSFCN (42), ERDUnet (43), LANet (44), AMSUnet (45), DESENet (4), BMANet (5) and NLIE-UNet (6). Among the compared methods, MSFCN and U-Net exhibit the lowest performance, with Dice of 0.7113 and 0.7151, IoU of 0.5560 and 0.5649, accuracy of 0.9358 and 0.9399, and Mcc of 0.6762 and 0.6823, respectively. Despite their foundational role in segmentation tasks, both models struggle with complex nodule structures and often produce coarse predictions. ERDUnet and HFENet show moderately improved results, achieving Dice of 0.8014 and 0.8041, IoU of 0.6719 and 0.6758, accuracy of 0.9558 and 0.9550, and Mcc of 0.7765 and 0.7802. However, they still face challenges in accurately delineating nodular boundaries, especially in low-contrast or noisy regions. LANet, AMSUnet, BMANet and NLIE-UNet achieve Dice scores between 0.805 and 0.82, reflecting their better balance between detail preservation and structural consistency, though some fine-grained errors persist. Further enhancements are observed in BSNet, MDA-Net, and DESENet, which obtain Dice scores of 0.8377, 0.8409, and 0.8365, and Mcc values above 0.81, demonstrating stronger feature extraction ability and improved robustness in boundary prediction. However, there are still minor errors in complex regions, which slightly affect the overall consistency. Among all evaluated models, the proposed MFA-Net achieves the highest performance, with Dice of 0.8618, IoU of 0.7586, accuracy of 0.9698 and Mcc of 0.8457. By capturing both global and local dependencies, MFA-Net excels in delineating fine structures and maintaining accurate contours, particularly in challenging cases involving small or irregular nodules. Figure 8 provides qualitative segmentation results produced by various approaches on the TN3K dataset.
These qualitative observations strongly support the quantitative findings, confirming that MFA-Net offers superior robustness and precision in clinical ultrasound-based thyroid segmentation.
Table 2
| Model | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|
| U-Net | 0.7151 | 0.5649 | 0.9399 | 0.6823 |
| BSNet | 0.8377 | 0.7226 | 0.9648 | 0.8196 |
| MDA-Net | 0.8409 | 0.7275 | 0.9649 | 0.8216 |
| HFENet | 0.8041 | 0.6758 | 0.9550 | 0.7802 |
| MSFCN | 0.7113 | 0.5560 | 0.9358 | 0.6762 |
| ERDUnet | 0.8014 | 0.6719 | 0.9558 | 0.7765 |
| LANet | 0.8163 | 0.6919 | 0.9601 | 0.7953 |
| AMSUnet | 0.8053 | 0.6787 | 0.9589 | 0.7828 |
| DESENet | 0.8365 | 0.7219 | 0.9642 | 0.8175 |
| BMANet | 0.8091 | 0.6833 | 0.9587 | 0.7879 |
| NLIE-UNet | 0.8054 | 0.6774 | 0.9563 | 0.7817 |
| MFA-Net | 0.8618 | 0.7586 | 0.9698 | 0.8457 |
IoU, intersection over union; Mcc, Matthews correlation coefficient; MFA-Net, multi-scale feature aggregation network; TN3K, thyroid nodule 3493 dataset.
Experimental results on TG3K dataset
Table 3 displays the quantitative evaluation of several models on the TG3K dataset, which features thyroid glands with well-defined regions and relatively consistent shapes and anatomical positions. These characteristics contribute to the high segmentation accuracy achieved by most models. Specifically, with the exception of U-Net, all methods reported Dice coefficients exceeding 97%, IoU scores above 94%, and Mcc values above 96%, reflecting the dataset’s lower complexity compared to more heterogeneous collections. Despite this overall strong performance, U-Net significantly underperforms, yielding a Dice of 78.63% and IoU of 65.23%, suggesting limited capability in handling even moderately variable presentations. The visual results in Figure 9 further illustrate this point, where U-Net displays a large number of erroneous segmentations, failing to capture the complete structure of the gland and introducing false positives in the surrounding area. In contrast, the differences between the other models are very small. However, among all the approaches, the proposed MFA-Net achieves the highest performance, with Dice of 98.57%, IoU of 97.18%, accuracy of 99.77% and Mcc of 98.44%. This superior outcome highlights MFA-Net’s exceptional ability to capture fine-grained boundaries and preserve spatial consistency, even in datasets where lesions may appear visually subtle or homogeneous.
Table 3
| Model | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|
| U-Net | 0.7863 | 0.6523 | 0.9602 | 0.7711 |
| BSNet | 0.9820 | 0.9647 | 0.9972 | 0.9805 |
| MDA-Net | 0.9821 | 0.9650 | 0.9972 | 0.9806 |
| HFENet | 0.9792 | 0.9593 | 0.9967 | 0.9774 |
| MSFCN | 0.9785 | 0.9581 | 0.9966 | 0.9767 |
| ERDUnet | 0.9714 | 0.9445 | 0.9955 | 0.9691 |
| LANet | 0.9823 | 0.9653 | 0.9973 | 0.9808 |
| AMSUnet | 0.9778 | 0.9567 | 0.9966 | 0.9759 |
| DESENet | 0.9845 | 0.9695 | 0.9976 | 0.9832 |
| BMANet | 0.9836 | 0.9677 | 0.9975 | 0.9822 |
| NLIE-UNet | 0.9736 | 0.9487 | 0.9959 | 0.9713 |
| MFA-Net | 0.9857 | 0.9718 | 0.9977 | 0.9844 |
IoU, intersection over union; Mcc, Matthews correlation coefficient; MFA-Net, multi-scale feature aggregation network; TG3K, thyroid gland 3583 dataset.
Experimental results on DDTI dataset
Drawing upon the quantitative metrics detailed in Table 4 and the visual outcomes illustrated in Figure 10, we can comprehensively evaluate and contrast the performance of various models on the DDTI dataset. Due to the higher variability in image quality and nodule appearance of the DDTI dataset, all models have lower Dice and IoU scores compared to TG3K or TN3K. Specifically, MSFCN demonstrates the lowest performance among all approaches, with Dice of 61.65%, IoU of 44.68%, accuracy of 88.36% and Mcc of 55.08%. These results highlight MSFCN’s limitations in capturing fine boundary details in more complex scenarios. U-Net also underperforms, with Dice of 62.49%, IoU of 45.47%, accuracy of 87.02% and Mcc of 55.70%, suggesting insufficient capacity for modeling the intricate and irregular structures commonly found in DDTI. HFENet, AMSUnet and NLIE-UNet show moderate results, but still struggle to achieve satisfactory segmentation precision, particularly in edge delineation. MDA-Net, BSNet, LANet, DESENet and BMANet perform comparably, with Dice scores in the 70–73% range, and IoU values between 55% and 58%, indicating relatively stable but not optimal performance. ERDUnet stands out slightly from the above group, achieving Dice of 74.10%, IoU of 58.94%, accuracy of 92.56% and Mcc of 69.81%, suggesting better generalization to the heterogeneous features present in the dataset. However, the proposed method achieves the best quantitative performance, with Dice of 74.83%, IoU of 59.81%, accuracy of 92.83% and Mcc of 70.78%. The combined quantitative and qualitative analyses affirm that MFA-Net outperforms other state-of-the-art models in handling the challenging characteristics of the DDTI dataset. This indicates its superior ability to handle low-contrast and irregularly shaped nodules, thanks to the multi-scale and attention-guided mechanisms embedded in the architecture.
Table 4
| Model | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|
| U-Net | 0.6249 | 0.4547 | 0.8702 | 0.5570 |
| BSNet | 0.7346 | 0.5810 | 0.9169 | 0.6874 |
| MDA-Net | 0.7255 | 0.5696 | 0.9135 | 0.6765 |
| HFENet | 0.6586 | 0.4917 | 0.8929 | 0.5972 |
| MSFCN | 0.6165 | 0.4468 | 0.8836 | 0.5508 |
| ERDUnet | 0.7410 | 0.5894 | 0.9256 | 0.6981 |
| LANet | 0.7097 | 0.5507 | 0.9085 | 0.6576 |
| AMSUnet | 0.6844 | 0.5210 | 0.8971 | 0.6275 |
| DESENet | 0.7113 | 0.5527 | 0.9160 | 0.6624 |
| BMANet | 0.7282 | 0.5736 | 0.9175 | 0.6801 |
| NLIE-UNet | 0.6756 | 0.5115 | 0.9025 | 0.6193 |
| MFA-Net | 0.7483 | 0.5981 | 0.9283 | 0.7078 |
DDTI, digital database thyroid image; IoU, intersection over union; Mcc, Matthews correlation coefficient; MFA-Net, multi-scale feature aggregation network.
Experimental results on BrainTumor dataset
Drawing upon the quantitative metrics reported in Table 5 and the qualitative visualizations presented in Figure 11, the performance of different segmentation models on the BrainTumor dataset can be thoroughly evaluated. Among them, U-Net provides the baseline with relatively modest Dice (77.55%) and IoU (64.85%), highlighting its limited ability to capture complex tumor structures. AMSUnet, MSFCN, and ERDUnet also demonstrate lower accuracy in boundary delineation, with Dice scores below 80%, indicating difficulties in segmenting irregular or small tumor regions. HFENet, LANet, and NLIE-UNet achieve slightly better results, with Dice around 80% and Mcc values close to 0.80, reflecting moderate improvements in segmentation consistency but still failing to handle subtle tumor boundaries effectively. In contrast, BSNet, MDA-Net, DESENet, and BMANet deliver more competitive outcomes, with Dice ranging from 81% to 82% and IoU values above 69%. The visual outcomes also suggest that these approaches yield more coherent tumor contours compared to the baseline U-Net. Notably, MFA-Net surpasses all competing methods, achieving the best performance with Dice of 84.85%, IoU of 74.21%, accuracy of 99.49%, and Mcc of 84.69%. Both the quantitative and qualitative evaluations indicate that MFA-Net consistently produces more accurate and complete tumor boundaries, outperforming state-of-the-art methods.
Table 5
| Model | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|
| U-Net | 0.7755 | 0.6485 | 0.9934 | 0.7873 |
| BSNet | 0.8243 | 0.7071 | 0.9943 | 0.8233 |
| MDA-Net | 0.8245 | 0.7078 | 0.9938 | 0.8226 |
| HFENet | 0.8023 | 0.6756 | 0.9933 | 0.7997 |
| MSFCN | 0.7922 | 0.6633 | 0.9932 | 0.7912 |
| ERDUnet | 0.7934 | 0.6662 | 0.9933 | 0.7934 |
| LANet | 0.8009 | 0.6759 | 0.9936 | 0.8001 |
| AMSUnet | 0.7843 | 0.6543 | 0.9928 | 0.7836 |
| DESENet | 0.8154 | 0.6944 | 0.9939 | 0.8141 |
| BMANet | 0.8093 | 0.6847 | 0.9935 | 0.8064 |
| NLIE-UNet | 0.8062 | 0.6804 | 0.9934 | 0.8038 |
| MFA-Net | 0.8485 | 0.7421 | 0.9949 | 0.8469 |
IoU, intersection over union; Mcc, Matthews correlation coefficient; MFA-Net, multi-scale feature aggregation network.
Discussion
Ablative studies
To thoroughly evaluate the performance benefits and individual contributions of each core module in the proposed MFA-Net architecture, we carried out detailed ablation studies using the TN3K dataset. In this evaluation, the MFAM, background-aware mechanism and RDM were independently incorporated into the baseline architecture. Each component was evaluated in isolation to quantify its specific effects on segmentation accuracy and robustness. Finally, all modules were combined to form the complete MFA-Net. The corresponding quantitative results and qualitative visualizations are presented in Table 6 and Figure 12, respectively.
Table 6
| Model | Dice | IoU | Accuracy | Mcc | Param (M) | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|
| Baseline | 0.7151 | 0.5649 | 0.9399 | 0.6823 | 1.94 | 227.29 | 3.48 |
| Baseline + MFAM | 0.8431 | 0.7312 | 0.9651 | 0.8237 | 2.80 | 138.70 | 4.84 |
| Baseline + BAM | 0.8410 | 0.7273 | 0.9655 | 0.8225 | 3.71 | 139.47 | 7.64 |
| Baseline + RDM | 0.8476 | 0.7380 | 0.9672 | 0.8301 | 2.55 | 101.23 | 4.07 |
| Baseline + MFAM + BAM | 0.8321 | 0.7158 | 0.9646 | 0.8137 | 4.00 | 103.33 | 8.56 |
| Baseline + MFAM + RDM | 0.8584 | 0.7536 | 0.9683 | 0.8413 | 2.85 | 86.50 | 4.99 |
| Baseline + BAM + RDM | 0.8536 | 0.7459 | 0.9671 | 0.8356 | 3.77 | 84.43 | 7.86 |
| MFA-Net | 0.8618 | 0.7586 | 0.9698 | 0.8457 | 4.07 | 67.46 | 8.78 |
BAM, background-aware module; IoU, intersection over union; FPS, frame-per-second; GFLOPs, giga floating-point operations per second; Mcc, Matthews correlation coefficient; MFA-Net, multi-scale feature aggregation network; MFAM, multi-scale feature aggregation module; RDM, residual decoder module; TN3K, thyroid nodule 3493 dataset.
To validate the impact of the MFAM within the proposed MFA-Net architecture, we conducted focused ablation experiments by integrating MFAM independently into the baseline. As presented in Table 6 and illustrated in Figure 12, the addition of MFAM resulted in a marked increase in segmentation accuracy when compared to the baseline configuration. Specifically, the Dice increased from 71.51% to 84.31%, the IoU rose from 56.49% to 73.12%, the accuracy improved from 93.99% to 96.51%, and the Mcc increased from 68.23% to 82.37%. These enhancements clearly indicate MFAM’s effectiveness in enriching the feature representation by leveraging multi-scale contextual cues. By combining detailed local textures with broader structural information, MFAM strengthens the model’s ability to delineate thyroid nodule boundaries with higher accuracy, especially in cases where lesion contours are subtle or irregular. From a computational perspective, the inclusion of MFAM leads to a moderate increase in model complexity: the number of parameters rises from 1.94 to 2.80 M, and the computational cost grows from 3.48 to 4.84 giga floating-point operations per second (GFLOPs). Meanwhile, the frame-per-second (FPS) rate decreases from 227.29 to 138.70, which remains acceptable for real-time or near-real-time clinical applications. Qualitatively, visualizations in Figure 12 demonstrate that the MFAM produces more complete and accurate nodule masks, especially in complex scenarios with low contrast or fragmented edges. The segmentation predictions produced after incorporating the MFAM exhibit a markedly higher consistency with the ground truth masks than those generated by the baseline, confirming that MFAM enhances the model’s capacity to identify nodules with varying scales and textures.
In terms of quantitative performance, the integration of BAM into the baseline network leads to a significant improvement in segmentation accuracy. Specifically, the Dice increases from 71.51% to 84.10%, while the IoU rises from 56.49% to 72.73%. In addition, the accuracy improves from 93.99% to 96.55%, and the Mcc increases from 68.23% to 82.25%. The visualization results demonstrate BAM’s capacity to guide the model’s attention more effectively toward meaningful foreground features while diminishing the influence of irrelevant background content. From a computational perspective, the addition of BAM increases the model’s parameter count from 1.94 to 3.71 M, with the frame rate dropping from 227.29 to 139.47 FPS, while the computational complexity rises from 3.48 to 7.64 GFLOPs. Although slower than the baseline, its speed still falls within the range suitable for practical clinical applications.
Integrating RDM into the baseline network architecture can significantly enhance the segmentation performance. Quantitatively, the Dice improves from 71.51% to 84.76%, while the IoU increases from 56.49% to 73.80%. Moreover, the accuracy rises from 93.99% to 96.72%, and the Mcc improves from 68.23% to 83.01%. From a qualitative perspective, the generated predictions demonstrate improved alignment with ground truth boundaries, especially in regions characterized by subtle textures or complex structural variations. In terms of computational implications, the introduction of RDM leads to a moderate increase in model complexity, expanding the parameter count from 1.94 to 2.55 M and increasing the GFLOPs from 3.48 to 4.07. The additional cost is reflected in a decrease in processing speed, with the inference frame rate dropping from 227.29 to 101.23 FPS.
When multiple modules are integrated, the performance of the network improves further through their complementary strengths. As shown in Table 6, the combination of MFAM and BAM leads to clear improvements over the baseline, with the Dice increasing from 71.51% to 83.21%, IoU from 56.49% to 71.58%, accuracy from 93.99% to 96.46%, and Mcc from 68.23% to 81.37%. Although this dual-module configuration raises the parameter count to 4.00 M and increases GFLOPs to 8.56, the achieved segmentation improvements demonstrate that MFAM and BAM cooperate effectively by capturing multi-scale contextual cues and refining spatial attention. Similarly, the integration of MFAM and RDM produces the most notable gains among the dual-module settings. In this case, the Dice reaches 85.84%, IoU rises to 75.36%, accuracy improves to 96.83%, and Mcc increases to 84.13%, all of which represent significant advances compared to the baseline. The parameter count moderately grows to 2.85 M, GFLOPs to 4.99, and FPS drops to 86.50. The combination of BAM and RDM also yields strong results, with Dice improving to 85.36%, IoU to 74.59%, accuracy to 96.71%, and Mcc to 83.56%. Although this configuration has a higher parameter load of 3.77M and computational cost of 7.86 GFLOPs, the improvements in feature refinement and boundary delineation make it particularly effective for complex cases with heterogeneous nodule appearances. Finally, when all three modules are jointly integrated into the baseline, the proposed MFA-Net exhibits the most outstanding overall performance in both quantitative metrics and qualitative visual outcomes. This comprehensive configuration leverages the individual strengths of each component, leading to a synergistic enhancement in segmentation capability. In terms of model complexity, the parameter count increases to 4.07 M, GFLOPs to 8.78, and FPS decreases from 227.29 to 67.46, reflecting the added architectural complexity. 
Despite these increases, the performance gains achieved justify the trade-off. The network remains lightweight enough for practical use, especially in scenarios where segmentation accuracy is critical and computational resources are moderately available.
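For reference, the four evaluation metrics reported throughout these experiments (Dice, IoU, accuracy, Mcc) can all be derived from a binary confusion matrix. The following is an illustrative NumPy sketch of the standard definitions; the function names are ours, not from the MFA-Net codebase:

```python
import numpy as np

def confusion(pred, gt):
    # Binary confusion-matrix counts from 0/1 masks.
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return tp, tn, fp, fn

def dice(pred, gt):
    tp, tn, fp, fn = confusion(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)

def iou(pred, gt):
    tp, tn, fp, fn = confusion(pred, gt)
    return tp / (tp + fp + fn)

def accuracy(pred, gt):
    tp, tn, fp, fn = confusion(pred, gt)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(pred, gt):
    # Matthews correlation coefficient; robust under class imbalance.
    tp, tn, fp, fn = confusion(pred, gt)
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / den if den > 0 else 0.0
```

Note that Dice weights true positives twice relative to IoU, which is why Dice values in the tables are consistently higher than the corresponding IoU values.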
Selection experiment of loss function
To identify the most effective loss function for enhancing segmentation performance in the context of thyroid nodule detection, we conducted a comparative evaluation of six widely used loss functions, and the results are shown in Table 7. Each loss function was integrated into MFA-Net under identical training settings, ensuring a fair comparison based solely on their individual optimization capabilities. Specifically, BCE loss achieved a Dice of 83.39%, an IoU of 71.73%, an accuracy of 96.31%, and a Mcc of 81.37%. While it provides a solid baseline, its performance is limited in capturing fine-grained regions due to the lack of direct overlap optimization. Dice loss improved all metrics, achieving a Dice of 85.88%, an IoU of 75.36%, an accuracy of 96.84%, and a Mcc of 84.14%. This demonstrates its strong ability to directly maximize the overlap between predicted masks and the ground truth, making it particularly effective for handling class imbalance. IoU loss optimizes the intersection-over-union metric directly, but its overall performance was lower (Dice 77.59%, IoU 63.76%, accuracy 94.07%, Mcc 75.74%). This indicates that although optimizing IoU is useful, it may be less stable than the Dice loss during training. The Tversky loss (with its two hyperparameters set to 0.3 and 0.7) aims to balance false positives against false negatives, and its results are reasonable (Dice 79.82%, IoU 66.73%, accuracy 94.79%, Mcc 78.02%). This indicates that although it can address class imbalance, its overall segmentation performance remains below that of the Dice loss. Focal (weighted 0.1) + Tversky (weighted 0.9) loss, which emphasizes difficult-to-segment regions, resulted in a Dice of 78.22%, an IoU of 64.51%, an accuracy of 94.12%, and a Mcc of 76.48%.
Although this combination is theoretically advantageous for hard examples, it underperformed compared to simpler Dice-based losses in our experiments. BCE (weighted 0.5) + Dice (weighted 0.5) loss achieved the highest overall performance (Dice 86.18%, IoU 75.86%, accuracy 96.98%, Mcc 84.57%), slightly outperforming Dice loss alone. This combination benefits from both pixel-wise probability optimization (BCE) and overlap maximization (Dice), resulting in more balanced segmentation performance across all metrics.
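The best-performing BCE (0.5) + Dice (0.5) combination can be sketched as follows. This is an illustrative NumPy version of the standard formulations applied to soft (post-sigmoid) predictions, not the authors’ training code:

```python
import numpy as np

def bce_loss(p, g, eps=1e-7):
    # Pixel-wise binary cross-entropy; p are predicted probabilities, g is the 0/1 mask.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(g * np.log(p) + (1 - g) * np.log(1 - p)))

def dice_loss(p, g, eps=1e-7):
    # Soft Dice loss: 1 minus the overlap term, directly optimizing region agreement.
    inter = np.sum(p * g)
    return float(1 - (2 * inter + eps) / (np.sum(p) + np.sum(g) + eps))

def bce_dice_loss(p, g, w_bce=0.5, w_dice=0.5):
    # Equal-weight combination, as in the best-performing configuration of Table 7.
    return w_bce * bce_loss(p, g) + w_dice * dice_loss(p, g)
```

The BCE term supplies dense per-pixel gradients early in training, while the Dice term counteracts the foreground/background imbalance typical of small nodules.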
Table 7
| Loss function | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|
| BCE | 0.8339 | 0.7173 | 0.9631 | 0.8137 |
| Dice | 0.8588 | 0.7536 | 0.9684 | 0.8414 |
| IoU | 0.7759 | 0.6376 | 0.9407 | 0.7574 |
| Tversky | 0.7982 | 0.6673 | 0.9479 | 0.7802 |
| Focal (0.1) + Tversky (0.9) | 0.7822 | 0.6451 | 0.9412 | 0.7648 |
| BCE (0.5) + Dice (0.5) | 0.8618 | 0.7586 | 0.9698 | 0.8457 |
BCE, binary cross-entropy; IoU, intersection over union; Mcc, Matthews correlation coefficient; TN3K, thyroid nodule 3493 dataset.
The effects of kernel sizes on the MFAM
Table 8 presents the results of our experiments on the MFAM in the MFA-Net architecture, where we evaluate the performance of various kernel size configurations. Specifically, we compared horizontal and vertical convolutions against standard convolutions across different kernel combinations: 3+5+7, 3+5+9, 3+5+11, 5+7+9, 5+7+11, and 7+9+11. Our analysis shows that the horizontal and vertical convolution configurations outperformed their standard counterparts across all evaluated metrics in most cases, with 7+9+11 being the exception. Notably, the combination of 5+7+11 kernels yielded the best performance of all configurations, achieving a Dice of 0.8618, an IoU of 0.7586, an accuracy of 0.9698, and a Mcc of 0.8457. This configuration showed a marked improvement over the standard convolution configurations, whose best result (from the 7+9+11 kernel sizes) was a Dice of 0.8581, an IoU of 0.7530, an accuracy of 0.9678, and a Mcc of 0.8405. In conclusion, the horizontal and vertical convolution kernel configurations, particularly with kernel sizes of 5+7+11, are more effective at improving the segmentation performance of the MFA-Net model than traditional standard convolution configurations.
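One plausible reason for the advantage of the horizontal-and-vertical scheme is parameter efficiency: factorizing a k×k kernel into 1×k and k×1 strips reduces the weight count from k² to 2k per channel pair. A minimal accounting sketch (the channel width of 64 is a hypothetical value for illustration, not taken from the paper):

```python
def conv_params(c_in, c_out, kh, kw):
    # Weight count of a single 2D convolution (bias omitted).
    return c_in * c_out * kh * kw

def standard_branch(channels, k):
    # One k x k convolution.
    return conv_params(channels, channels, k, k)

def strip_branch(channels, k):
    # A 1 x k horizontal convolution plus a k x 1 vertical one:
    # same k x k receptive field, far fewer weights for large k.
    return conv_params(channels, channels, 1, k) + conv_params(channels, channels, k, 1)

for k in (5, 7, 11):
    saving = 1 - strip_branch(64, k) / standard_branch(64, k)
    print(f"k={k}: standard={standard_branch(64, k)}, strip={strip_branch(64, k)}, saving={saving:.0%}")
```

The saving grows with k, which is consistent with the larger 5+7+11 combination benefiting most from the strip formulation.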
Table 8
| Convolution type | Kernel size | Dice | IoU | Accuracy | Mcc |
|---|---|---|---|---|---|
| Standard convolution | 3+5+7 | 0.8511 | 0.7427 | 0.9664 | 0.8327 |
| Standard convolution | 3+5+9 | 0.8525 | 0.7445 | 0.9672 | 0.8352 |
| Standard convolution | 3+5+11 | 0.8541 | 0.7472 | 0.9676 | 0.8367 |
| Standard convolution | 5+7+9 | 0.8502 | 0.7415 | 0.9662 | 0.8316 |
| Standard convolution | 5+7+11 | 0.8497 | 0.7409 | 0.9667 | 0.8320 |
| Standard convolution | 7+9+11 | 0.8581 | 0.7530 | 0.9678 | 0.8405 |
| Horizontal and vertical | 3+5+7 | 0.8574 | 0.7521 | 0.9676 | 0.8397 |
| Horizontal and vertical | 3+5+9 | 0.8553 | 0.7488 | 0.9681 | 0.8382 |
| Horizontal and vertical | 3+5+11 | 0.8579 | 0.7528 | 0.9685 | 0.8408 |
| Horizontal and vertical | 5+7+9 | 0.8552 | 0.7484 | 0.9682 | 0.8387 |
| Horizontal and vertical | 5+7+11 | 0.8618 | 0.7586 | 0.9698 | 0.8457 |
| Horizontal and vertical | 7+9+11 | 0.8543 | 0.7476 | 0.9677 | 0.8367 |
IoU, intersection over union; Mcc, Matthews correlation coefficient; TN3K, thyroid nodule 3493 dataset.
Computational efficiency
Table 9 presents a comprehensive comparison of model complexity (parameter count), inference speed (FPS), and computational load (GFLOPs) across various thyroid nodule segmentation methods on the TN3K dataset. Among the models evaluated, HFENet stands out with the lowest parameter count (only 0.15 million), the highest FPS of 233.49, and a minimal computational load of 1.47 GFLOPs, highlighting its exceptional efficiency for real-time applications. U-Net is one of the lightest architectures in this comparison, with 1.94 million parameters, 227.29 FPS, and a computational load of 3.48 GFLOPs, but it often lags behind newer models in segmentation accuracy. AMSUnet is another efficient model with a modest parameter size of 2.61 million and 84.18 FPS, offering a slower inference speed than U-Net; its 6.12 GFLOPs reflect a moderate computational burden. MSFCN has 14.17 million parameters and an FPS of 129.21; its 55.80 GFLOPs represent the heaviest computational load in the comparison, although its inference speed remains competitive for a model of this size. LANet features 23.79 million parameters, positioning it on the higher end in terms of model size. Despite this, it maintains a relatively high FPS of 121.72 and a moderate 8.30 GFLOPs, demonstrating architectural optimization that allows competitive processing speeds even with increased complexity. ERDUnet carries 10.21 million parameters, but its FPS drops to 39.59, and its 10.29 GFLOPs indicate slower processing efficiency, making it more computationally intensive despite its moderate parameter count. MDA-Net has a significantly larger footprint with 29.84 million parameters and a moderate FPS of 57.95, accompanied by 45.80 GFLOPs, indicating a high computational load. BSNet possesses the largest parameter count at 43.98 million, making it the heaviest model among those compared.
Its FPS is among the lowest, at only 25.39, with 45.80 GFLOPs, highlighting its computational intensity. MFA-Net has 4.07 million parameters and runs at 67.46 FPS, a moderate inference speed, and its computational load of 8.78 GFLOPs keeps it well balanced between accuracy and efficiency.
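FPS figures such as those in Table 9 depend on the hardware and batch settings used; a generic measurement loop of the kind typically used for such benchmarks (the trivial callable below is a stand-in for a model forward pass) can be sketched as:

```python
import time

def measure_fps(infer, n_warmup=3, n_runs=50):
    """Time repeated calls to `infer` and return frames per second."""
    for _ in range(n_warmup):
        infer()  # warm-up runs are excluded from timing
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - t0
    return n_runs / elapsed

# Hypothetical stand-in workload; replace with the model's inference call.
fps = measure_fps(lambda: sum(range(10_000)))
```

For GPU models, the timing loop would additionally need device synchronization before reading the clock, since kernel launches are asynchronous.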
Table 9
| Model | Param (M) | FPS | GFLOPs |
|---|---|---|---|
| HFENet | 0.15 | 233.49 | 1.47 |
| DESENet | 1.15 | 45.35 | 3.17 |
| U-Net | 1.94 | 227.29 | 3.48 |
| AMSUnet | 2.61 | 84.18 | 6.12 |
| NLIE-UNet | 2.71 | 15.83 | 6.02 |
| MFA-Net | 4.07 | 67.46 | 8.78 |
| ERDUnet | 10.21 | 39.59 | 10.29 |
| MSFCN | 14.17 | 129.21 | 55.80 |
| LANet | 23.79 | 121.72 | 8.30 |
| BSNet | 43.98 | 25.39 | 45.80 |
| BMANet | 28.40 | 33.43 | 9.19 |
FPS, frame-per-second; GFLOPs, giga floating-point operations per second; MFA-Net, multi-scale feature aggregation network; TN3K, thyroid nodule 3493 dataset.
Limitations
Although the proposed MFA-Net achieves promising segmentation performance, several limitations remain. Firstly, our experiments were conducted on publicly available datasets, which may not fully capture the diversity of real-world clinical data. Future work will involve collaborating with hospitals to validate our framework on larger and more diverse patient cohorts and exploring LLM-based annotation. Secondly, our current design is based on task-specific deep learning architectures, whereas recent advances in foundation models have demonstrated powerful generalization capabilities across tasks and modalities. Integrating MFA-Net with such pre-trained large-scale models may further enhance segmentation accuracy and reduce dependence on extensive labeled medical data. Thirdly, this study focuses primarily on single-modality imaging. In practice, integrating multimodal information, such as histology slides, genomic profiles, or clinical notes, could provide complementary insights that improve both segmentation precision and downstream clinical decision-making (46).
Conclusions
In this research, we introduce MFA-Net, an advanced deep learning framework meticulously designed for accurate and efficient segmentation of thyroid nodules. MFA-Net integrates three key components: the MFAM, the background-aware module (BAM), and the RDM. The MFAM enhances the network’s capacity to capture fine details and contextual information across varying receptive fields, while the BAM effectively suppresses background noise, guiding the network’s attention toward critical nodule regions. The RDM further strengthens decoding by preserving semantic continuity and refining boundary precision. Extensive experiments conducted on three datasets demonstrate that MFA-Net consistently surpasses a wide range of leading methods in terms of Dice coefficient and IoU. Ablation studies further validate the individual contributions of each component, confirming their synergistic impact on performance. Additionally, MFA-Net achieves a favorable trade-off between accuracy and computational efficiency, making it a robust and practical solution for clinical applications. These findings affirm that MFA-Net delivers precise, stable, and scalable segmentation results, with strong generalization across diverse thyroid nodule patterns and imaging conditions.
Acknowledgments
We would like to sincerely thank our research collaborators Simon Fong and Yaoyang Wu from the University of Macau for their assistance in data collection and the language editing of this manuscript.
Footnote
Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1364/rc
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1364/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Lu W, Zhang D, Zhou W, Wei W, Wu X, Ding W, Zhang C. Diagnosis of thyroid nodules using ultrasound images based on deep learning features: online dynamic nomogram and gradient-weighted class activation mapping. Quant Imaging Med Surg 2025;15:5689-702. [Crossref] [PubMed]
- Wang QG, Li M, Deng GX, Huang HQ, Qiu Q, Lin JJ. Development and validation of a nomogram based on conventional and contrast-enhanced ultrasound for differentiating malignant from benign thyroid nodules. Quant Imaging Med Surg 2025;15:4641-54. [Crossref] [PubMed]
- Ronneberger O, Fischer P, Brox T, editors. U-Net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Lecture Notes in Computer Science. Cham: Springer; 2015.
- Tang Q, Min S, Shi X, Zhang Q, Liu Y. DESENet: A bilateral network with detail-enhanced semantic encoder for real-time semantic segmentation. Meas Sci Technol 2025;36:015425.
- Wu Z, Chen H, Xiong X, Wu S, Li H, Zhou X. BMANet: Boundary-guided multi-level attention network for polyp segmentation in colonoscopy images. Biomed Signal Process Control 2025;105:107524.
- Wan L, Song L, Zhou Y, Kang C, Zheng S, Chen G. Dynamic neighbourhood-enhanced UNet with interwoven fusion for medical image segmentation. Vis Comput 2025;41:7703-21.
- Wu Y, Huang L, Yang T. Thyroid nodule ultrasound image segmentation based on improved Swin Transformer. IEEE Access 2025;13:19788-95.
- Li X, Fu C, Wang Q, Zhang W, Ye C, Ma T. GSE-Nets: Global structure enhancement decoder for thyroid nodule segmentation. Biomed Signal Process Control 2025;102:107340.
- Hu M, Zhang Y, Xue H, Lv H, Han S. Mamba- and ResNet-Based Dual-Branch Network for Ultrasound Thyroid Nodule Segmentation. Bioengineering (Basel) 2024;11:1047. [Crossref] [PubMed]
- Ozcan A, Tosun Ö, Donmez E, Sanwal M. Enhanced-TransUNet for ultrasound segmentation of thyroid nodules. Biomed Signal Process Control 2024;95:106472.
- Ali H, Wang M, Xie J. Cil-net: Densely connected context information learning network for boosting thyroid nodule segmentation using ultrasound images. Cogn Comput 2024;16:1176-97.
- Zheng Z, Liang E, Zhang Y, Weng Z, Chai J, Bu W, Xu J, Su T. A segmentation-based algorithm for classification of benign and malignancy Thyroid nodules with multi-feature information. Biomed Eng Lett 2024;14:785-800. [Crossref] [PubMed]
- Li W, Tang YM, Wang Z, Yu KM, To S. Atrous residual interconnected encoder to attention decoder framework for vertebrae segmentation via 3D volumetric CT images. Eng Appl Artif Intell 2022;114:105102.
- Ji Z, Ge Y, Chukwudi C, U K, Zhang SM, Peng Y, Zhu J, Zaki H, Zhang X, Yang S, Wang X, Chen Y, Zhao J. Counterfactual Bidirectional Co-Attention Transformer for Integrative Histology-Genomic Cancer Risk Stratification. IEEE J Biomed Health Inform 2025;29:5862-74. [Crossref] [PubMed]
- Gu R, Liu L. An agricultural leaf disease segmentation model applying multi-scale coordinate attention mechanism. Appl Soft Comput 2025;172:112904.
- Ni JC, Lee SH, Shen YC, Yang CS. Improved U-Net based on ResNet and SE-Net with dual attention mechanism for glottis semantic segmentation. Med Eng Phys 2025;136:104298. [Crossref] [PubMed]
- Shang X, Wu S, Liu Y, Zhao Z, Wang S. PVT-MA: Pyramid vision transformers with multi-attention fusion mechanism for polyp segmentation. Appl Intell 2025;55:17.
- Qi K, Yan C, Niu D, Zhang B, Liang D, Long X. MG-Net: A fetal brain tissue segmentation method based on multiscale feature fusion and graph convolution attention mechanisms. Comput Methods Programs Biomed 2024;257:108451. [Crossref] [PubMed]
- Wu J, Fu R, Fang H, Zhang Y, Yang Y, Xiong H, Liu H, Xu Y. MedSegDiff: Medical image segmentation with diffusion probabilistic model. arXiv:2211.00611 [Preprint]. Available online: https://arxiv.org/abs/2211.00611
- Liu X, Liang J, Zhang J, Qian Z, Xing P, Chen T, Yang S, Chukwudi C, Qiu L, Liu D, Zhao J. Advancing hierarchical neural networks with scale-aware pyramidal feature learning for medical image dense prediction. Comput Methods Programs Biomed 2025;265:108705. [Crossref] [PubMed]
- Zhang L, Xu C, Li Y, Liu T, Sun J. MCSE-U-Net: multi-convolution blocks and squeeze and excitation blocks for vessel segmentation. Quant Imaging Med Surg 2024;14:2426-40. [Crossref] [PubMed]
- Jiang S, Chen X, Yi C. SSA-UNet: Whole brain segmentation by U-Net with squeeze-and-excitation block and self-attention block from the 2.5D slice image. IET Image Process 2024;18:1598-612.
- Klomp SR, Wijnhoven RG, de With PH. Performance-efficiency comparisons of channel attention modules for resnets. Neural Process Lett 2023;55:6797-813.
- Shan X, Shen Y, Cai H, Wen Y. Convolutional neural network optimization via channel reassessment attention module. Digital Signal Processing 2022;123:103408.
- Lin C, Hu X, Zhan Y, Hao X. MobileNetV2 with Spatial Attention module for traffic congestion recognition in surveillance images. Expert Syst Appl 2024;255:124701.
- Hu D, Fang Y, Cao J, Jiang T, Gao F. An end-to-end vision-based seizure detection with a guided spatial attention module for patient detection. IEEE Internet Things J 2024;11:18869-79.
- Jiang L, Hui Y, Fei Y, Ji Y, Zeng T. Improving polyp segmentation with boundary-assisted guidance and cross-scale interaction fusion transformer network. Processes 2024;12:1030.
- Luo H, Zhou D, Cheng Y, Wang S. MPEDA-Net: A lightweight brain tumor segmentation network using multi-perspective extraction and dense attention. Biomed Signal Process Control 2024;91:106054.
- Gong H, Chen J, Chen G, Li H, Li G, Chen F. Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Comput Biol Med 2023;155:106389. [Crossref] [PubMed]
- BrainTumor dataset. Available online: https://figshare.com/articles/dataset/brain_tumor_dataset/1512427
- Selvaraj A, Nithiyaraj E. CEDRNN: A convolutional encoder-decoder residual neural network for liver tumour segmentation. Neural Process Lett 2023;55:1605-24.
- Li Y, Zhang Y, Liu JY, Wang K, Zhang K, Zhang GS, Liao XF, Yang G. Global Transformer and Dual Local Attention Network via Deep-Shallow Hierarchical Feature Fusion for Retinal Vessel Segmentation. IEEE Trans Cybern 2023;53:5826-39. [Crossref] [PubMed]
- Sun S, Fu C, Xu S, Wen Y, Ma T. GLFNet: Global-local fusion network for the segmentation in ultrasound images. Comput Biol Med 2024;171:108103. [Crossref] [PubMed]
- Yuan Y, Yang L, Chang K, Huang Y, Yang H, Wang J. DSCA-PSPNet: Dynamic spatial-channel attention pyramid scene parsing network for sugarcane field segmentation in satellite imagery. Front Plant Sci 2023;14:1324491. [Crossref] [PubMed]
- Yang L, Dong Q, Lin D, Tian C, Lü X. MUNet: a novel framework for accurate brain tumor segmentation combining UNet and mamba networks. Front Comput Neurosci 2025;19:1513059. [Crossref] [PubMed]
- Hu M, Dong Y, Li J, Jiang L, Zhang P, Ping Y. LAMFFNet: Lightweight adaptive multi-layer feature fusion network for medical image segmentation. Biomed Signal Process Control 2025;103:107456.
- Rainio O, Teuho J, Klén R. Evaluation metrics and statistical tests for machine learning. Sci Rep 2024;14:6086. [Crossref] [PubMed]
- Zhu Q. On the performance of matthews correlation coefficient (Mcc) for imbalanced dataset. Pattern Recognit Lett 2020;136:71-80.
- Cong R, Zhang Y, Yang N, Li H, Zhang X, Li R, Chen Z, Zhao Y, Kwong S. Boundary guided semantic learning for real-time COVID-19 lung infection segmentation system. IEEE Transactions on Consumer Electronics 2022;68:376-86.
- Iqbal A, Sharif M. MDA-Net: Multiscale dual attention-based network for breast lesion segmentation using ultrasound images. J King Saud Univ Comput Inf Sci 2022;34:7283-99.
- Lu F, Zhang Z, Guo L, Chen J, Zhu Y, Yan K, Zhou X. HFENet: A lightweight hand‐crafted feature enhanced CNN for ceramic tile surface defect detection. Int J Intell Syst 2022;37:10670-93.
- Li R, Zheng S, Duan C, Wang L, Zhang C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo Spat Inf Sci 2022;25:278-94.
- Li H, Zhai DH, Xia Y. ERDUnet: An efficient residual double-coding unet for medical image segmentation. IEEE Trans Circuits Syst Video Technol 2024;34:2083-96.
- Ding L, Tang H, Bruzzone L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans Geosci Remote Sens 2021;59:426-35.
- Yin Y, Han Z, Jian M, Wang GG, Chen L, Wang R. AMSUnet: A neural network using atrous multi-scale convolution for medical image segmentation. Comput Biol Med 2023;162:107120. [Crossref] [PubMed]
- Ong HT, Karatas E, Poquillon T, Grenci G, Furlan A, Dilasser F, Mohamad Raffi SB, Blanc D, Drimaracci E, Mikec D, Galisot G, Johnson BA, Liu AZ, Thiel C, Ullrich O. OrgaRES Consortium; Racine V, Beghin A. Digitalized organoids: integrated pipeline for high-speed 3D analysis of organoid structures using multilevel segmentation and cellular topology. Nat Methods 2025;22:1343-54. [Crossref] [PubMed]