An improved multi-scale feature extraction network for medical image segmentation
Introduction
Medical image segmentation plays a critical role in medical image processing and requires both speed and accuracy. In recent years, rapid advances in deep learning-based image segmentation algorithms have led to notable improvements in precision and efficiency compared to conventional segmentation methods that depend on image thresholding (1). Convolutional neural networks (CNNs) have been widely adopted in image processing due to their exceptional feature extraction capabilities. However, earlier CNNs, such as GoogLeNet (2) and ResNet (3), were primarily used for image classification and recognition, and there were few specialized networks for image segmentation. In 2017, Shelhamer et al. (4) introduced fully convolutional networks to leverage the full capacity of CNNs for segmentation, which marked a significant milestone in the field of image segmentation. Inspired by the encoder-decoder architecture, Ronneberger et al. (5) proposed the U-Net network, which has achieved significant success and has since served as a backbone for numerous improved networks. Zhou et al. (6) re-designed the skip connections of U-Net to enhance the ability of the network to learn features at various depths. Zunair and Ben Hamza (7) incorporated a sharpening spatial filter before merging the encoder feature maps with the decoder feature maps. This emphasized the fine details of the early-level features generated by the encoder, helping to smooth out artifacts caused by untrained parameters throughout the network. In their subsequent work, they proposed masked supervised learning (8), a novel single-stage learning paradigm for semantic segmentation that can better establish short- and long-term contextual relationships. Additionally, in the MoNuSAC2020 challenge (9), numerous researchers proposed various segmentation networks for nuclei segmentation and achieved impressive results.
However, CNNs are limited to local regions and cannot capture global context. To address this limitation, Dosovitskiy et al. (10) proposed the Vision Transformer (ViT), which applied the self-attention mechanism to establish global features in computer vision. Since then, many researchers have incorporated the ViT into the CNN to enhance the segmentation accuracy. However, one of the main limitations of the Transformer architecture is its dependence on a significant amount of training data to produce optimal results. To address this issue, researchers have developed innovative structures such as the fully convolutional transformer (11), convolutional ViT (12), and CNN-style transformer (ConvFormer) (13), to enhance the performance and effectiveness of the Transformer model. These architectures include CNN layers in the Transformer model to compensate for the need for a large amount of training data in the original structure.
Medical image segmentation aims to assign each pixel in an image to a category; however, the extraction of multi-scale information for medical image segmentation is a significant challenge (14). To address this issue, a deeper understanding of the image content is required. Nevertheless, current models face significant challenges in capturing richer multi-scale features while excluding irrelevant structures and background information due to the significant size and shape variations among different tissues and organs in medical images. These models have primarily focused on extracting multi-scale information from the medical images of specific organs. Currently, there is no single universal feature extraction technique that is capable of effectively addressing the scale differences among medical images containing different tissues and organs. Gao et al. (15) proposed Res2Net, a new backbone network that builds on ResNet and aims to improve the capacity of the network to extract multi-scale image features. A hierarchical residual connection was constructed within a single residual block, resembling a residual connection with varying levels. Despite this advancement, the residual block still employs convolutional kernels of fixed size, which does not address the limitation of a fixed receptive field.
To address the above-mentioned issues, we propose Res2Net-ConvFormer-Dilation-UNet (Res2-CD-UNet), a multi-scale feature extraction network for medical image segmentation. The Res2Net (15) network is used as the backbone of the present network, with one layer employing dilated convolutions to reduce the number of training parameters and increase the receptive field, thereby enhancing the multiscale feature extraction capability of the entire backbone network. Additionally, the ConvFormer (13) is introduced to the bottom of the encoder to perform global feature extraction on the deep image features. Since the direct concatenation of low-level features with high-level features in the skip connections introduces excessive irrelevant background information, we added a novel channel feature fusion block (CFFB) into the skip connection that effectively uses the spatial information of low-level features, reducing the effect of background noise and enhancing the model’s feature learning ability. The proposed model has been evaluated on multi-organ segmentation and aorta segmentation datasets. The results of our experiment indicate that our proposed model outperforms existing networks in terms of accuracy.
Related research
CNN-Transformer hybrid segmentation networks
Dosovitskiy et al. (10) first introduced transformers to computer vision, followed by TransUNet (16), which applied transformers to medical image segmentation. TransUNet (16) uses the original ViT to serialize the downsampled image sequence during training, fully leveraging the ability of transformers to capture long-range dependencies in low-resolution images, and aims to improve segmentation performance by combining this with a symmetric encoder-decoder structure. Cao et al. (17) developed Swin-Unet, which employs the shifted window-based self-attention of the Swin Transformer to combine local receptive fields, improving computational efficiency and accuracy. Inspired by the Axial Transformer (18), Valanarasu et al. (19) introduced the MedT network, which incorporates a gated axial-attention mechanism. This network combines self-attention and axial attention to capture key features more effectively in medical images. Xie et al. (20) proposed CoTr, which combines convolution with a transformer architecture and employs a deformable self-attention mechanism (DeTrans) that allows the model to focus on critical regions. Jin et al. (21) also added transformers to the encoder to enhance the global feature extraction capabilities of the network. Yan et al. (22) improved the multi-scale feature extraction capability of the network by using features of different resolutions.
Multi-scale information extraction
Szegedy et al. (2) introduced the Inception model, which uses convolutional layers of different sizes concatenated together to capture features at different scales. Zhao et al. (23) proposed the pyramid pooling model, which effectively captures context information at different scales in the image. Rahman and Marculescu (24) introduced a multi-scale hierarchical ViT to enhance the generalizability of the model. Ji et al. (25) designed the multi-scale normalized channel attention model based on dilated convolution, channel attention mechanism standardization, and depth-wise separable convolution. Sinha et al. (26) created attention feature maps containing different semantic information at various resolutions using a multi-scale strategy for encoding to capture relevant context feature information. Xu et al. (27) proposed ViTAE, which added additional convolutional branches inside and outside the self-attention block to incorporate multi-scale information. Srivastava et al. (28) proposed the Multi-Scale Residual Fusion Network to extract multi-scale features from specific organ medical images. Similarly, Sun et al. (29) used spatial pyramid pooling models for the segmentation of gastric cancer regions.
Methods
Network architecture
Due to variations in edges, grayscale values, and shapes, medical image segmentation is a challenging task. This study proposes a network for multi-scale medical image segmentation, named the Res2-CD-UNet. The overall structure of the proposed Res2-CD-UNet is shown in Figure 1. The network structure retains the U-shaped architecture of U-Net. However, the dual-layer convolution blocks are replaced by three Res2Net blocks and one Res2Net block with dilated convolutions. This improvement enhances the extraction of multi-scale features by expanding the receptive field without a significant increase in the network parameters. The feature maps extracted by the convolutional network are then fed into the ConvFormer for global relationship modeling.
In the encoder stage, the input image first undergoes feature extraction via Res2Net blocks using 3×3 convolutional kernels. Following each convolution block, a maximum-pooling layer is applied to extract prominent boundary and texture features from the feature maps. One of the Res2Net blocks uses dilated convolutions with a dilation rate of 2 to extract finer multi-scale features. The feature maps are reduced to 1/16 of their original size with 1,024 channels before being input to the ConvFormer for global relationship modeling; the size of the feature maps remains unchanged after passing through the ConvFormer. In the decoder stage, deconvolution is used to upsample the feature maps to the original image size. In addition, a novel CFFB is included in the skip connections to reduce interference from irrelevant background information.
To ensure the preservation of the original image details while reducing the impact of extraneous background information, a channel attention mechanism is employed in each convolution layer to merge the features with the upsampled feature maps. Further details of the model are provided in the following sections.
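To make the data flow concrete, the following PyTorch sketch outlines the encoder-decoder structure described above. The module stand-ins (simple convolution blocks in place of the Res2Net blocks, ConvFormer, and CFFB), the channel progression (64 to 1,024), the number of output classes, and the position of the dilated stage are assumptions made only for illustration; this is not the authors' reference implementation.

```python
import torch
import torch.nn as nn

# Stand-in for the Res2Net / dilated Res2Net blocks described below; it keeps the
# declared output channels so the shape bookkeeping in the sketch is valid.
def conv_block(cin, cout, dilation=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class Res2CDUNetSketch(nn.Module):
    """Shape-level sketch: four encoder stages (one dilated), a bottleneck stand-in
    for the ConvFormer, deconvolution upsampling, and skip fusion."""
    def __init__(self, in_ch=1, n_classes=9, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.enc = nn.ModuleList([
            conv_block(in_ch, widths[0]),
            conv_block(widths[0], widths[1]),
            conv_block(widths[1], widths[2], dilation=2),  # dilated stage (position assumed)
            conv_block(widths[2], widths[3]),
        ])
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(widths[3], widths[4])  # placeholder for ConvFormer
        self.up = nn.ModuleList([nn.ConvTranspose2d(widths[i + 1], widths[i], 2, stride=2)
                                 for i in range(3, -1, -1)])
        self.dec = nn.ModuleList([conv_block(2 * widths[i], widths[i])
                                  for i in range(3, -1, -1)])
        self.head = nn.Conv2d(widths[0], n_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for stage in self.enc:
            x = stage(x)
            skips.append(x)        # feature handed to the skip connection (CFFB in the paper)
            x = self.pool(x)       # halves the resolution per stage -> 1/16 overall
        x = self.bottleneck(x)     # global relationship modeling happens here in the paper
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))  # the CFFB replaces this plain concatenation
        return self.head(x)

# Quick shape check with a 224 x 224 input
print(Res2CDUNetSketch()(torch.randn(1, 1, 224, 224)).shape)  # torch.Size([1, 9, 224, 224])
```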
Backbone network
The ability of a backbone network to accurately segment images relies heavily on its capability to extract image features. Traditional backbone networks such as ResNet (3) use dual-layer convolutional blocks with residual connections to extract features at different scales. However, their local receptive fields limit their ability to extract multi-scale information, which can reduce segmentation accuracy. Conversely, the Res2Net block uses a hierarchical residual-style approach to connect different groups of convolutional kernels. This approach expands the receptive field and enables the network to learn feature representations at various scales.
As Figure 2 shows, the Res2Net block redesigns the convolution module (n = s × w) by replacing the 3×3×n convolution kernel in ResNet with s sets of 3×3×w convolution kernels. Here, $x_i$ represents each group of the feature maps, where $i \in \{1, 2, \ldots, s\}$, and each subset has the same spatial size as the input feature map but only 1/s of its channels. Except for $x_1$, each $x_i$ has a corresponding 3×3 convolution, denoted by $K_i(\cdot)$, and $y_i$ denotes the output of $K_i(\cdot)$. The feature subset $x_i$ is added to the output of $K_{i-1}(\cdot)$, and then fed into $K_i(\cdot)$. Thus, $y_i$ can be given as (15):

$$
y_i=\begin{cases}
x_i, & i=1 \\
K_i\left(x_i\right), & i=2 \\
K_i\left(x_i+y_{i-1}\right), & 2<i\le s
\end{cases} \quad [1]
$$
The Res2Net block has been shown to enhance model performance by expanding the receptive field. However, its use of fixed-size convolutional kernels means that the receptive field of each block remains fixed. To address this limitation, dilated convolutions are introduced. Unlike conventional convolutions, dilated convolutions incorporate a dilation rate hyperparameter, which determines the spacing between the values within the convolutional kernel. By expanding the receptive field, dilated convolutions can effectively reduce information loss when extracting multi-scale features. Further, dilated convolutions do not increase the number of parameters, as they do not add new convolution operations. To enhance the multi-scale feature extraction ability of the backbone network and overcome the limitation of fixed-size convolutional kernels, one Res2Net layer is replaced with a variant whose convolutional kernels use a dilation rate of 2. This approach captures more multi-scale features without a commensurate increase in computational complexity.
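As an illustration of Eq. [1], below is a minimal PyTorch sketch of a Res2Net-style block with scale s=4, in which the per-group 3×3 convolutions can optionally use a dilation rate of 2 as described above. It follows the spirit of the published Res2Net design (15), but the batch normalization, activation, and outer residual connection are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Res2Block(nn.Module):
    """Hierarchical residual-style block (Eq. [1]): the input channels are split into
    `scale` groups; each group except the first passes through its own 3x3 convolution
    and receives the previous group's output."""
    def __init__(self, channels, scale=4, dilation=1):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        self.width = channels // scale
        # K_2 ... K_s: one 3x3 convolution per group (the first group is passed through)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(self.width, self.width, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(self.width),
                nn.ReLU(inplace=True))
            for _ in range(scale - 1)])

    def forward(self, x):
        xs = torch.split(x, self.width, dim=1)        # x_1 ... x_s
        ys = [xs[0]]                                  # y_1 = x_1
        for i in range(1, self.scale):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]  # x_i (plus y_{i-1} for i > 2)
            ys.append(self.convs[i - 1](inp))          # y_i = K_i(...)
        return torch.cat(ys, dim=1) + x                # residual connection (assumed)

# Example: the dilated variant used in one encoder stage
block = Res2Block(channels=256, scale=4, dilation=2)
print(block(torch.randn(2, 256, 56, 56)).shape)        # torch.Size([2, 256, 56, 56])
```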
ConvFormers
Transformers lack some of the inherent inductive biases of CNNs (10). As a result, they usually require a significant amount of training data to produce satisfactory outcomes. However, some current challenges in medical imaging include a shortage of datasets and insufficient labels. To address this issue, Lin et al. (13) proposed the ConvFormer, which combines convolutional operations with the transformer structure to overcome the need for large amounts of training data. Thus, in this study, we incorporate ConvFormer into the encoder stage by feeding the features after convolutional operations into ConvFormer to establish global feature relationships.
Figure 3 presents a comparison between ViT and ConvFormer. The ConvFormer model has an advantage over the ViT in that it eliminates the need to reshape the input feature map into a one-dimensional sequence. Instead, it directly reduces the feature map resolution using convolution and maximum-pooling. The ConvFormer model also uses the CNN-style self-attention module to establish long-range dependencies by constructing self-attention matrices. Additionally, the convolutional feed-forward network module is used to optimize the features for each pixel.
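To illustrate the kind of global modeling described above, the sketch below combines convolutional query/key/value projections on the 2D feature map with a convolutional feed-forward network. It is a generic schematic, not the ConvFormer implementation of Lin et al. (13); in particular, ConvFormer first reduces the feature map resolution with convolution and maximum-pooling instead of patch flattening, which is omitted here, and the channel width and feed-forward ratio are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class ConvAttentionSketch(nn.Module):
    """Schematic CNN-style transformer block: queries, keys, and values come from 1x1
    convolutions on the 2D feature map, followed by a convolutional feed-forward network."""
    def __init__(self, channels, ffn_ratio=4):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.ffn = nn.Sequential(                      # convolutional feed-forward network
            nn.Conv2d(channels, channels * ffn_ratio, 1),
            nn.GELU(),
            nn.Conv2d(channels * ffn_ratio, channels, 1))
        self.norm1 = nn.BatchNorm2d(channels)
        self.norm2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                       # B x C x HW
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)  # HW x HW attention
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        x = self.norm1(x + out)                        # long-range relationship modeling
        return self.norm2(x + self.ffn(x))             # per-pixel feature refinement

# In the paper the bottleneck features (1/16 resolution, 1,024 channels) are processed
# this way; a smaller feature map is used here purely for illustration.
print(ConvAttentionSketch(64)(torch.randn(1, 64, 14, 14)).shape)  # torch.Size([1, 64, 14, 14])
```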
CFFB
The encoder generates a low-level feature map that contains spatial information but lacks a global understanding of the target objects. Conversely, the decoder generates a high-level feature map that provides rich semantic information but lacks spatial resolution. To restore detailed information more accurately from the original image during upsampling, one approach is to directly concatenate the low-level feature map and the high-level feature map, which is commonly used by the conventional U-Net and some of its variations. This is expressed as:

$$
f_{out}^{t}=\mathrm{Concat}\left[f_{l}^{t},\,f_{h}^{t}\right] \quad [2]
$$

where t represents the level, $f_{l}^{t}$ and $f_{h}^{t}$ represent the low-level feature map and high-level feature map, respectively, $\mathrm{Concat}[\cdot,\cdot]$ denotes channel concatenation, and $f_{out}^{t}$ represents the output after feature fusion.
The conventional skip connection often introduces an excessive amount of irrelevant information into the feature maps. To address this issue, we propose a new CFFB and add it to the skip connection. The structure of the block is shown in Figure 4. Our approach focuses on the critical regions of the skip connections, reducing the interference of irrelevant information. The feature maps are first passed through global average pooling and global maximum pooling, and the pooled descriptors are concatenated at the channel level. This provides a comprehensive summary of the feature maps from which the most pertinent information can be extracted. Subsequently, a squeeze-and-excitation (SE) module is employed to extract channel weights from the concatenated descriptor. The weights are then multiplied element-wise with the low- and high-level feature maps, and the resulting products are summed to yield the final output feature. The equations are expressed as follows:

$$
f=\mathrm{Concat}\left[\mathrm{GAP}\left(f_{l}\right),\,\mathrm{GMP}\left(f_{l}\right),\,\mathrm{GAP}\left(f_{h}\right),\,\mathrm{GMP}\left(f_{h}\right)\right] \quad [3]
$$
$$
w=\mathrm{SE}\left(f\right),\qquad f_{out}=w\odot f_{l}+w\odot f_{h} \quad [4]
$$

where $f_{l}$ and $f_{h}$ represent the low- and high-level feature maps, respectively, $\mathrm{GAP}(\cdot)$ and $\mathrm{GMP}(\cdot)$ denote global average pooling and global maximum pooling, $f$ is the input feature of the SE module, $w$ represents the channel weight, and $f_{out}$ is the fused output feature.
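The following PyTorch sketch shows one plausible implementation consistent with Eqs. [3,4] and the description of Figure 4. The exact pooling arrangement, the SE reduction ratio, and the choice of applying the same channel weights to both feature maps are assumptions; this is not the authors' code.

```python
import torch
import torch.nn as nn

class CFFB(nn.Module):
    """Channel feature fusion block (Eqs. [3,4]): pooled descriptors of the low- and
    high-level feature maps are concatenated, an SE-style excitation produces channel
    weights w, and the weighted feature maps are summed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.se = nn.Sequential(
            nn.Linear(4 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, f_low, f_high):
        b, c, _, _ = f_low.shape
        descriptor = torch.cat([f_low.mean(dim=(2, 3)),    # GAP of the low-level map
                                f_low.amax(dim=(2, 3)),    # GMP of the low-level map
                                f_high.mean(dim=(2, 3)),   # GAP of the high-level map
                                f_high.amax(dim=(2, 3))],  # GMP of the high-level map
                               dim=1)                      # Eq. [3]
        w = self.se(descriptor).view(b, c, 1, 1)           # channel weights
        return w * f_low + w * f_high                      # Eq. [4]

# Usage in a skip connection: both inputs share the same channel count and spatial size.
f_l = torch.randn(2, 128, 56, 56)   # low-level encoder feature
f_h = torch.randn(2, 128, 56, 56)   # upsampled high-level decoder feature
print(CFFB(128)(f_l, f_h).shape)    # torch.Size([2, 128, 56, 56])
```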
Experimental environment
All the experiments were implemented on the PyTorch framework and accelerated using an NVIDIA GeForce RTX 4090 GPU. To increase the data diversity for all the training cases, data augmentations such as flipping and rotation were employed. For the pure Transformer encoder, we used a ViT with 12 Transformer layers, all of which were pre-trained on ImageNet (30). Unless otherwise specified, the input resolution and patch size P were set to 224 px × 224 px and 16, respectively. In all the experiments, the scale of the Res2Net blocks was set to 4. The Adam optimizer was used for the network training. After conducting a comparative study of the learning rates (0.1, 0.01, and 0.001), we ultimately selected 0.01. Similarly, we set the batch size to 24 and the training epochs to 150.
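For reference, the training configuration described above corresponds roughly to the following PyTorch setup. The model and dataset objects are placeholders; only the optimizer, learning rate, batch size, epoch count, and input resolution come from the text.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholders: a stand-in model and a toy dataset of augmented 224 x 224 slices.
model = nn.Conv2d(1, 9, kernel_size=1).to(device)
train_set = [(torch.randn(1, 224, 224), torch.zeros(224, 224, dtype=torch.long))] * 48

loader = DataLoader(train_set, batch_size=24, shuffle=True)       # batch size 24
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)         # Adam, lr = 0.01

for epoch in range(150):                                          # 150 training epochs
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # the paper combines CE and Dice loss (see the Loss function section);
        # plain cross-entropy is used here only to keep the sketch short
        loss = nn.functional.cross_entropy(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```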
Datasets and quantitative evaluations
To evaluate the effectiveness of the proposed medical image segmentation network, experiments were conducted using two public datasets. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Synapse multi-organ segmentation dataset1
Synapse comprises a collection of 30 abdominal computed tomography (CT) scans, containing a total of 3,779 axial view contrast-enhanced clinical abdominal CT images. Annotations are provided for 13 organs, and the images have a resolution of 512 px × 512 px. After preprocessing, the dataset was divided into a training set comprising 18 scans, and a test set comprising 12 scans. We report the dice similarity coefficient (DSC) and Hausdorff distance (HD) values for eight abdominal organs (i.e., the aorta, gallbladder, spleen, left kidney, right kidney, liver, pancreas, and stomach).
Seg.A 2023
The Seg.A.2023 (31) dataset contains 56 CT scans that have annotations for the aortic vascular tree. The original dataset was obtained from scans conducted at three different medical centers. However, in this study, the data from Dongyang Hospital was used, including a total of 18 CT scans with a resolution of 512 px × 666 px. After preprocessing, the dataset was divided into a training set (comprising 12 scans) and a test set (comprising 6 scans). The segmentation accuracy of the aortic tree is reported using the DSC and HD values.
These two metrics are widely used and accepted for image segmentation evaluation. The DSC measures the similarity between the prediction and the ground truth; a higher score suggests a closer match to the ground truth. Conversely, the HD measures the maximum distance between the prediction and the ground truth, considering bidirectional maximum errors; a smaller distance indicates a closer match between the two. In addition, we used evaluation metrics such as pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (MIoU) to evaluate the performance of the model. The equations for these metrics are expressed as follows:

$$
\mathrm{DSC}=\frac{2\left|A\cap B\right|}{\left|A\right|+\left|B\right|} \quad [5]
$$
$$
\mathrm{HD}\left(A,B\right)=\max\left[h\left(A,B\right),\,h\left(B,A\right)\right] \quad [6]
$$
$$
h\left(A,B\right)=\max_{a\in A}\min_{b\in B}\left\|a-b\right\| \quad [7]
$$
$$
h\left(B,A\right)=\max_{b\in B}\min_{a\in A}\left\|b-a\right\| \quad [8]
$$
$$
\mathrm{PA}=\frac{\sum_{i=1}^{n}TP_{i}}{\sum_{i=1}^{n}\left(TP_{i}+FN_{i}\right)} \quad [9]
$$
$$
\mathrm{MPA}=\frac{1}{n}\sum_{i=1}^{n}\frac{TP_{i}}{TP_{i}+FN_{i}} \quad [10]
$$
$$
\mathrm{MIoU}=\frac{1}{n}\sum_{i=1}^{n}\frac{TP_{i}}{TP_{i}+FP_{i}+FN_{i}} \quad [11]
$$

In Eq. [5], A and B represent the prediction and the ground truth, respectively; $A\cap B$ represents the intersection of the prediction and the ground truth; and $\left|A\right|$ and $\left|B\right|$ denote the respective numbers of elements. Eq. [6] is the bidirectional HD, and Eqs. [7,8] are the unidirectional HDs from A to B and from B to A, respectively. In Eqs. [9-11], n represents the total number of classes, $TP_{i}$ represents the true positives for the ith class, $FP_{i}$ represents the false positives for the ith class, and $FN_{i}$ represents the false negatives for the ith class.
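As a concrete reference for Eqs. [5-11], the NumPy sketch below computes DSC, HD, PA, MPA, and MIoU for a single predicted label map. It is a plain implementation of the standard definitions, not the evaluation code used in the paper; in particular, the HD here is computed on pixel coordinate sets in pixel units, whereas the reported HD values are in millimetres.

```python
import numpy as np

def dsc(pred, gt):
    """Eq. [5]: 2|A∩B| / (|A| + |B|) for one foreground class (binary masks)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff(pred, gt):
    """Eqs. [6-8]: maximum of the two directed Hausdorff distances between the
    foreground point sets."""
    a, b = np.argwhere(pred), np.argwhere(gt)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def pa_mpa_miou(pred, gt, n_classes):
    """Eqs. [9-11] from per-class TP/FP/FN counts."""
    tp = np.array([np.sum((pred == i) & (gt == i)) for i in range(n_classes)])
    fp = np.array([np.sum((pred == i) & (gt != i)) for i in range(n_classes)])
    fn = np.array([np.sum((pred != i) & (gt == i)) for i in range(n_classes)])
    pa = tp.sum() / (tp + fn).sum()
    mpa = np.mean(tp / (tp + fn + 1e-8))
    miou = np.mean(tp / (tp + fp + fn + 1e-8))
    return pa, mpa, miou

# Toy example with three classes (0 = background)
gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1; gt[6:8, 6:8] = 2
pred = np.zeros_like(gt); pred[2:6, 3:7] = 1; pred[6:8, 6:8] = 2
print(dsc(pred == 1, gt == 1), hausdorff(pred == 1, gt == 1))
print(pa_mpa_miou(pred, gt, n_classes=3))
```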
Loss function
CrossEntropy (CE) loss and Dice loss were employed in the present study. The loss function is defined as:

$$
L_{CE}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}y_{ic}\log\left(p_{ic}\right) \quad [12]
$$
$$
L_{Dice}=1-\frac{2\left|A\cap B\right|}{\left|A\right|+\left|B\right|} \quad [13]
$$
$$
L=\lambda L_{CE}+\left(1-\lambda\right)L_{Dice} \quad [14]
$$

where N denotes the total number of samples, C represents the total number of categories, and $y_{ic}$ and $p_{ic}$ signify the true label and the prediction, respectively: $y_{ic}$ is the indicator that the ith sample belongs to the cth class, and $p_{ic}$ is the predicted probability that the model associates the ith input with class c (7). The meanings of the symbols in Eq. [13] are consistent with those in Eq. [5]. $\lambda$ refers to the weight of the loss function, ranging from 0 to 1.
CE loss is a commonly used loss function in image segmentation. However, it is easily affected by an imbalance between the target regions and the background regions, which affects the accuracy of model training. To address this issue, it is often used in conjunction with the Dice loss function. Combining the two offers an additional benefit: in multi-class segmentation, there is usually a significant difference in the sizes of the target regions, which is a typical characteristic of multi-scale features; in such cases, the Dice loss focuses on learning the larger regions, while the CE loss function still learns from the smaller samples. Regarding the choice of the parameter λ, we also conducted a series of experiments. As Figure 5 shows, the best results were obtained when λ was set to 0.5.
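A minimal PyTorch sketch of the combined loss of Eqs. [12-14] is shown below, assuming logits of shape (N, C, H, W) and integer labels, with λ = 0.5 as selected above. The soft Dice formulation (one-hot targets, per-class averaging, smoothing term) is a common choice but may differ in detail from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, lam=0.5, eps=1e-6):
    """L = lam * L_CE + (1 - lam) * L_Dice (Eq. [14]); lam = 0.5 as in the paper."""
    ce = F.cross_entropy(logits, target)                     # Eq. [12]

    n_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    card = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (card + eps)          # Eq. [13], soft per-class Dice
    return lam * ce + (1.0 - lam) * dice.mean()

# Usage with random logits and labels for nine classes
logits = torch.randn(2, 9, 224, 224, requires_grad=True)
target = torch.randint(0, 9, (2, 224, 224))
print(combined_loss(logits, target))
```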
Results
To evaluate the performance of the proposed Res2-CD-UNet, we compared it against 12 state-of-the-art networks using the Synapse dataset. The average DSC and HD results are presented in Table 1, in which the optimal results are marked with "*". The average DSC score for Res2-CD-UNet was 83.92%, higher than that of all the other methods, with improvements ranging from 1.96% to 14.15%. Further, the average HD for Res2-CD-UNet was 14.51 mm, which was lower than that of any other model. We also compared our network with several recent networks (35,37-39) using additional objective metrics. As Table 2 shows (optimal results marked with "*"), our network achieved superior performance, albeit with a higher number of network parameters.
Table 1
Method | Average DSC% | Average HD (mm) | Aorta | Gall | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
---|---|---|---|---|---|---|---|---|---|---
DARR (32) | 69.77 | – | 74.74 | 53.77 | 72.31 | 73.24 | 94.08 | 54.18 | 89.90 | 45.96 |
UNet (5) | 76.85 | 39.70 | 89.07 | 69.72* | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 |
Att-UNet (33) | 77.77 | 36.02 | 89.55 | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
Swin-Unet (17) | 79.13 | 21.55 | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 |
ViT (10) | 71.29 | 32.87 | 73.73 | 55.13 | 75.80 | 72.20 | 91.51 | 45.99 | 81.99 | 73.95 |
TransUNet (16) | 77.48 | 31.69 | 87.23 | 63.13 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 |
LeViT-UNet (34) | 78.53 | 16.84 | 78.53 | 62.23 | 84.61 | 80.25 | 93.11 | 59.07 | 88.86 | 72.76 |
MISSFormer (35) | 81.96 | 18.20 | 86.99 | 68.65 | 85.21 | 82.00* | 94.41 | 65.67 | 91.92* | 80.81 |
MT-UNet (36) | 78.59 | 26.59 | 87.92 | 64.99 | 81.47 | 77.29 | 93.06 | 59.46 | 87.75 | 76.81 |
HiFormer (37) | 80.39 | 14.70 | 86.21 | 65.69 | 85.23 | 79.77 | 94.61 | 59.52 | 90.99 | 81.08 |
GCtx-UNet (38) | 81.95 | 16.80 | 86.96 | 66.26 | 87.75 | 83.68 | 94.53 | 61.06 | 91.42 | 83.74* |
MEW-UNet (39) | 78.92 | 16.44 | 86.68 | 65.32 | 82.87 | 80.02 | 93.63 | 58.36 | 90.19 | 74.26 |
Res2-CD-UNet | 83.92* | 14.51* | 92.24* | 68.59 | 89.50* | 80.53 | 94.68* | 72.04* | 88.77 | 79.05 |
*, the optimal results. CT, computed tomography; DSC, dice similarity coefficient; HD, Hausdorff distance; L, left; R, right; DARR, domain adaptive relational reasoning; ViT, Vision Transformer; Res2-CD-UNet, Res2Net-ConvFormer-Dilation-UNet.
Table 2
Method | DSC% | HD (mm) | PA% | MPA% | MIoU% | Parameters |
---|---|---|---|---|---|---|
MISSFormer (35) | 81.96 | 18.20 | 99.13 | 75.28 | 73.84 | 42,462,537 |
HiFormer (37) | 80.39 | 14.70 | 98.76 | 73.52 | 72.19 | 33,830,067 |
GCtx-UNet (38) | 81.95 | 16.80 | 98.89 | 75.93 | 74.21 | 12,342,168* |
MEW-UNet (39) | 78.92 | 16.44 | 98.52 | 70.76 | 68.38 | 140,268,525 |
Res2-CD-UNet | 83.92* | 14.51* | 99.26* | 76.44* | 74.50* | 111,554,993 |
*, the optimal results. CT, computed tomography; DSC, dice similarity coefficient; HD, Hausdorff distance; PA, pixel accuracy; MPA, mean pixel accuracy; MIoU, mean intersection over union; Res2-CD-UNet, Res2Net-ConvFormer-Dilation-UNet.
A qualitative comparison study was conducted using the Synapse dataset, the results of which are presented in Figure 6. The study showed that networks based solely on CNNs tend to over-segment organs. Conversely, the transformer-based models, such as Res2-CD-UNet and MissFormer, exhibited stronger image feature extraction and semantic differentiation capabilities.
Table 3 presents a comprehensive comparison of the segmentation performance of various models on the SEG.A 2023 dataset. The results revealed that the network proposed in this study achieved the highest DSC score (93.27%) and the lowest HD score (1.53 mm).
Table 3
Method | DSC% | HD (mm) | PA% | MPA% | MIoU% | Parameters |
---|---|---|---|---|---|---|
UNet (5) | 90.12 | 4.22 | 99.43 | 92.27 | 92.19 | 31,036,546 |
TransUNet (16) | 91.80 | 3.26 | 99.75 | 92.35 | 92.23 | 105,276,066 |
HiFormer (37) | 92.17 | 2.34 | 99.87 | 93.12 | 93.04 | 33,830,067 |
GCtx-UNet (38) | 90.23 | 3.42 | 99.59 | 92.84 | 92.78 | 12,342,168* |
MEW-UNet (39) | 90.43 | 1.86 | 99.89 | 93.54 | 92.13 | 140,268,525 |
Res2-CD-UNet | 93.27* | 1.53* | 99.92* | 93.80* | 93.57* | 111,554,993 |
*, the optimal results. DSC, dice similarity coefficient; HD, Hausdorff distance; PA, pixel accuracy; MPA, mean pixel accuracy; MIoU, mean intersection over union; Res2-CD-UNet, Res2Net-ConvFormer-Dilation-UNet.
Discussion
Comparison with state-of-the-art networks
In this study, we proposed a multi-scale feature extraction medical image segmentation model. To validate its effectiveness, we compared it against 12 state-of-the-art networks [i.e., DARR (32), U-Net (5), Att-UNet (33), ViT (10), TransUNet (16), Swin-Unet (17), LeViT-UNet (34), MISSFormer (35), MT-UNet (36) and HiFormer (37), GCtx-UNet (38), and MEW-UNet (39)] using the Synapse dataset. Among these, DARR (32), U-Net (5), and Att-UNet (33) are CNN-based methods; Swin-Unet (17), ViT (10), and MISSFormer (35) are pure Transformer-based methods; TransUNet (16), LeViT-UNet (34), MT-UNet (36), HiFormer (37), GCtx-UNet (38), and MEW-UNet (39) are CNN-Transformer hybrid methods; and Att-UNet (33) and MISSFormer (35) use an improved skip connection to enhance their feature extraction capability.
In the Synapse dataset, the average DSC score for Res2-CD-UNet was 83.92%, higher than that of all the other methods; the improvement ranged from 1.96% to 14.15%. Further, the average HD for Res2-CD-UNet was 14.51 mm, which was lower than that of any other model. The proposed network also showed exceptional performance in relation to the DSC values for individual organs. Among the eight organs, four achieved optimal results, with the aorta scoring 92.24%, the left kidney scoring 89.50%, the liver scoring 94.68%, and the pancreas scoring 72.04%. However, one organ showed suboptimal results (i.e., the right kidney scored only 80.53%).
We also observed an increase of 4.79% in the average DSC and a decrease of 7.04 mm in the average HD compared to Swin-Unet (17), which is a pure Transformer-based model. In comparison to HiFormer (37), which is a CNN-Transformer hybrid method, Res2-CD-UNet improved the average DSC by 3.53% and reduced the average HD by 0.19 mm. Further, compared to MISSFormer (35), the average DSC improved by 1.96% and the average HD was reduced by 3.69 mm. These results show that Res2-CD-UNet had superior multi-scale feature extraction capability and segmentation accuracy compared to the other networks.
A qualitative comparison study was conducted using the Synapse dataset, the results of which are presented in Figure 6. The comparison of the segmentation results in the first and second rows showed that Res2-CD-UNet predicted fewer false positives than the other methods. This suggests that our model was more effective in suppressing background noise and reducing irrelevant information interference than the other networks. The results demonstrated that Res2-CD-UNet performed finer segmentation while retaining detailed shape information. This is attributed to Res2-CD-UNet’s advantages in multi-scale feature extraction, global context information extraction, and low-level detail extraction. Therefore, our study confirmed the accuracy of Res2-CD-UNet and highlighted its potential in medical imaging applications.
We also selected several models for comparison using the Seg.A.2023 dataset. Notably, in the Seg.A.2023 dataset, the proposed network showed a marked improvement (i.e., an increase of 1.1% in the DSC score and a decrease of 0.81 mm in the HD score compared to the suboptimal model) (37). These findings show the remarkable generalizability and robustness of the proposed network.
Ablation study
We also undertook numerous experiments using the Synapse dataset to assess the effectiveness of our network. In this section, we discuss the effects of various design choices on the results.
Network components
To evaluate the efficacy of the Res2Net block in segmentation, we replaced the original convolution blocks in U-Net (5) and Att-UNet (33) with Res2Net blocks, creating the Res2-UNet and Res2-AttUNet models. The results showed improvements of 0.48% and 0.71% in the average DSC, respectively (Table 4), suggesting that Res2Net blocks were more effective in extracting image features. We also compared Res2-CD-UNet with Res2-C-UNet (without dilated convolution) to evaluate the efficacy of dilated convolutions, and observed significant improvements for various organs. Finally, we compared Res2-C-UNet, which uses the ConvFormer and the new skip connection, with Res2-AttUNet; the results showed an increase of 0.96% in the average DSC and a decrease of 6.17 mm in the average HD. These results suggest that the Res2Net block and dilated convolutions are effective techniques for enhancing segmentation performance, and that the use of ConvFormer and the novel skip connection leads to further improvements in model performance.
Table 4
Method | Average DSC% | Average HD (mm) | Aorta | Gall | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
---|---|---|---|---|---|---|---|---|---|---
UNet (5) | 76.85 | 39.70 | 89.07 | 69.72 | 77.77 | 68.6 | 93.43 | 53.98 | 86.67 | 75.58 |
Res2-UNet | 77.33 | 36.10 | 86.42 | 71.04 | 76.03 | 78.09 | 92.49 | 68.01 | 86.48 | 78.02 |
Att-UNet (33) | 77.77 | 36.02 | 89.55 | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
Res2-AttUNet | 78.48 | 31.06 | 87.41 | 76.23 | 73.93 | 72.05 | 94.34 | 53.34 | 87.96 | 72.62 |
Res2-C-UNet | 79.44 | 24.89 | 89.96 | 65.38 | 77.97 | 79.46 | 94.51 | 64.78 | 86.50 | 76.93 |
Res2-CD-UNet | 83.92 | 14.51 | 92.24 | 68.59 | 89.50 | 80.53 | 94.68 | 72.04 | 88.77 | 79.05 |
CT, computed tomography; DSC, dice similarity coefficient; HD, Hausdorff distance; L, left; R, right; Res2-CD-UNet, Res2Net-ConvFormer-Dilation-UNet.
Skip connection configurations
To evaluate the effects of different skip connection configurations on segmentation performance, we compared three configurations: the CFFB skip connection proposed in this study; the conventional skip connection, which uses direct concatenation; and no skip connection. As Table 5 shows, the inclusion of skip connections between the encoder and decoder enhanced segmentation performance, and the CFFB configuration resulted in a further notable improvement. We also employed gradient-weighted class activation mapping to derive saliency maps from the network's output layer. This analysis was used to assess the extent to which the CFFB module enhanced the network's ability to concentrate on the target region while mitigating the influence of the background. As Figure 7 shows, the network without the CFFB module was adversely affected by background elements, resulting in diminished focus on the target region and a subsequent decline in precision.
Table 5
Skip connections | Average DSC% | Aorta | Gall | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
---|---|---|---|---|---|---|---|---|---
Non | 67.03 | 82.85 | 70.08 | 74.61 | 45.93 | 90.93 | 61.06 | 71.60 | 38.44 |
Conventional | 79.79 | 89.54 | 80.16 | 84.11 | 62.64 | 94.11 | 75.45 | 88.12 | 64.15 |
CFFB | 83.92 | 92.24 | 68.59 | 89.50 | 80.53 | 94.68 | 72.04 | 88.77 | 79.05 |
DSC, dice similarity coefficient; L, left; R, right; CFFB, channel feature fusion block.
Attention module selection
This study employed an enhanced SE module in the feature fusion module. To evaluate its effectiveness, it was compared to other commonly used attention modules (40-42). As Table 6 shows, the enhanced SE module yielded the most favorable results, exhibiting a 2.96% improvement over the suboptimal alternative (40). Our findings suggest that the enhanced SE module in the feature fusion module presents a promising avenue for improving the overall system’s performance.
Table 6
Attention module | Average DSC% | Aorta | Gall | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
---|---|---|---|---|---|---|---|---|---
SE (40) | 80.96 | 89.48 | 81.93 | 85.41 | 62.48 | 94.86 | 77.46 | 89.40 | 66.66 |
CBAM (41) | 76.71 | 85.51 | 75.51 | 80.55 | 58.32 | 93.21 | 71.85 | 87.99 | 60.69 |
CA (42) | 80.33 | 92.46 | 72.44 | 79.45 | 69.35 | 94.60 | 82.18 | 88.53 | 63.62 |
Enhanced SE (ours) | 83.92 | 92.24 | 68.59 | 89.50 | 80.53 | 94.68 | 72.04 | 88.77 | 79.05 |
DSC, dice similarity coefficient; L, left; R, right; SE, squeeze-and-excitation; CBAM, convolutional block attention module; CA, coordinate attention.
Res2Net scale size
As Table 7 shows, we observed an improvement of 5.28% in the average DSC when comparing a scale of 4 to a scale of 2. Nonetheless, it should be noted that an excessive increase in scale may lead to a decline in performance. We attribute this phenomenon to the small image resolution used in the dataset, which may not provide enough spatial detail to support larger scales. Our findings suggest that careful consideration of scale is paramount to achieving optimal segmentation performance.
Table 7
Scale | 2 | 4 | 6 | 8
---|---|---|---|---
Average DSC% | 78.64 | 83.92 | 80.43 | 79.32
DSC, dice similarity coefficient.
Choice of the upsampling method
Finally, the study compared the effectiveness of the deconvolution upsampling strategy in the decoder stage with bilinear interpolation. As Table 8 shows, deconvolution was more effective in restoring the low-resolution feature map to its original image resolution. Based on this finding, we recommend that deconvolution be used as the optimal approach for image resolution restoration.
Table 8
Upsampling | Average DSC% | Aorta | Gall | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
---|---|---|---|---|---|---|---|---|---
Bilinear | 77.93 | 86.11 | 75.83 | 77.66 | 60.55 | 93.75 | 74.95 | 87.91 | 66.66 |
Deconvolution | 83.92 | 92.24 | 68.59 | 89.50 | 80.53 | 94.68 | 72.04 | 88.77 | 79.05 |
DSC, dice similarity coefficient; L, left; R, right.
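For reference, the two upsampling strategies compared in Table 8 correspond to the following PyTorch operations; the kernel size, stride, and channel reduction used here are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 256, 28, 28)                       # low-resolution decoder feature

# Learnable deconvolution (transposed convolution): doubles the resolution and halves
# the channels; its weights are trained together with the rest of the network.
deconv = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
print(deconv(x).shape)                                # torch.Size([1, 128, 56, 56])

# Parameter-free bilinear interpolation, typically followed by a 1x1 convolution
# to reduce the channel count.
up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
print(nn.Conv2d(256, 128, kernel_size=1)(up).shape)   # torch.Size([1, 128, 56, 56])
```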
Conclusions
The present study proposed a novel network named Res2-CD-UNet, which was designed for multi-scale medical image segmentation. The model leverages Res2Net, supplemented by dilated convolutions, as the network's backbone. The feature maps generated by this backbone are passed through ConvFormer, a convolutional-style transformer that establishes global feature relationships. Moreover, we introduced a novel CFFB in the skip-connection stage to minimize the effect of irrelevant background information. The segmentation results of the proposed model were evaluated based on both subjective visual assessment and objective evaluation metrics. The comparison of the proposed model with existing medical image segmentation networks showed that the proposed network effectively improves segmentation accuracy. The network proposed in this study has higher accuracy than previous networks; however, it has a large number of parameters, which incurs high computational costs. Moreover, the network can only process two-dimensional data. Therefore, in the future, we intend to focus our research on network lightweighting and the voxel segmentation of medical images.
Acknowledgments
Funding: This study was supported by
Footnote
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1022/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work, including ensuring that any questions related to the accuracy or integrity of any part of the work have been appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Khan RF, Lee BD, Lee MS. Transformers in medical image segmentation: a narrative review. Quant Imaging Med Surg 2023;13:8747-67. [Crossref] [PubMed]
- Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015:1-9.
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016:770-8.
- Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640-51. [Crossref] [PubMed]
- Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells W, Frangi A. editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, Springer, 2015;9351:234-41.
- Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans Med Imaging 2020;39:1856-67. [Crossref] [PubMed]
- Zunair H, Ben Hamza A. Sharp U-Net: Depthwise convolutional network for biomedical image segmentation. Comput Biol Med 2021;136:104699. [Crossref] [PubMed]
- Zunair H, Ben Hamza A. Masked Supervised Learning for Semantic Segmentation. arXiv:2210.00923, 2022.
- Verma R, Kumar N, Patil A, Kurian NC, Rane S, Graham S, et al. MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge. IEEE Trans Med Imaging 2021;40:3413-23. [Crossref] [PubMed]
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020.
- Tragakis A, Kaul C, Murray-Smith R, Husmeier D. The Fully Convolutional Transformer for Medical Image Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2023:3649-58.
- Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L. CvT: Introducing Convolutions to Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021:22-31.
- Lin X, Yan Z, Deng X, Zheng C, Yu L. ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation. In: Greenspan H, Madabhushi A, Mousavi P, Salcudean S, Duncan J, Syeda-Mahmood T, Taylor R. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. Lecture Notes in Computer Science, Springer, 2023;14223:642-51.
- Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y. editors. Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science, Springer, 2018;11211:833-51.
- Gao SH, Cheng MM, Zhao K, Zhang XY, Yang MH, Torr P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans Pattern Anal Mach Intell 2021;43:652-62. [Crossref] [PubMed]
- Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv:2102.04306, 2021.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In: Karlinsky L, Michaeli T, Nishino K. editors. Computer Vision – ECCV 2022 Workshops. Lecture Notes in Computer Science, Springer, 2023;13803:205-18.
- Ho J, Kalchbrenner N, Weissenborn D, Salimans T. Axial Attention in Multidimensional Transformers. arXiv:1912.12180, 2019.
- Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical Transformer: Gated Axial-Attention for Medical Image Segmentation. In: De Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, Springer, 2021;12901:36-46.
- Xie Y, Zhang J, Shen C, Xia Y. CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation. In: De Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, Springer, 2021;12903:171-80.
- Jin Y, Han D, Ko H. TrSeg: Transformer for semantic segmentation. Pattern Recognition Letters 2021;148:29-35. [Crossref]
- Yan X, Jiang W, Shi Y, Zhuo C. MS-NAS: Multi-scale Neural Architecture Search for Medical Image Segmentation. In: Martel AL, Abolmaesumi P, Stoyanov D, Mateus D, Zuluaga MA, Zhou SK, Racoceanu D, Joskowicz L. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Lecture Notes in Computer Science, Springer, 2020;12261:388-97.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid Scene Parsing Network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017:6230-9.
- Rahman M, Marculescu R. Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation. Medical Imaging with Deep Learning. PMLR, 2024:1526-44.
- Ji Q, Wang J, Ding C, Wang Y, Zhou W, Liu Z, Yang C. DMAGNet: Dual‐path multi‐scale attention guided network for medical image segmentation. IET Image Processing 2023;17:3631-44. [Crossref]
- Sinha A, Dolz J. Multi-Scale Self-Guided Attention for Medical Image Segmentation. IEEE J Biomed Health Inform 2021;25:121-30. [Crossref] [PubMed]
- Xu Y, Zhang Q, Zhang J, Tao D. ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias. Advances in neural information processing systems 2021;34:28522-35.
- Srivastava A, Jha D, Chanda S, Pal U, Johansen H, Johansen D, Riegler M, Ali S, Halvorsen P. MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation. IEEE J Biomed Health Inform 2022;26:2252-63. [Crossref] [PubMed]
- Sun M, Zhang G, Dang H, Qi X, Zhou X, Chang Q. Accurate Gastric Cancer Segmentation in Digital Pathology Images Using Deformable Convolution and Multi-Scale Embedding Networks. IEEE Access 2019;7:75530-41.
- Deng J, Dong W, Socher R, Li LJ, Kai Li, Li FF. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009:248-55.
- Radl L, Jin Y, Pepe A, Li J, Gsaxner C, Zhao FH, Egger J. AVT: Multicenter aortic vessel tree CTA dataset collection with ground truth segmentation masks. Data Brief 2022;40:107801. [Crossref] [PubMed]
- Fu S, Lu Y, Wang Y, Zhou Y, Shen W, Fishman E, Yuille A. Domain Adaptive Relational Reasoning for 3D Multi-organ Segmentation. In: Martel AL, Abolmaesumi P, Stoyanov D, Mateus D, Zuluaga MA, Zhou SK, Racoceanu D, Joskowicz L. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. Lecture Notes in Computer Science. Springer, 2020;12261:656-66.
- Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning Where to Look for the Pancreas. arXiv:1804.03999, 2018.
- Xu G, Zhang X, He X, Wu X. LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation. In: Liu Q, Wang H, Ma Z, Zheng W, Zha H, Chen X, Wang L, Ji R. editors. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, Springer, Singapore, 2024;14432:42-53.
- Huang X, Deng Z, Li D, Yuan X. MISSFormer: An Effective Medical Image Segmentation Transformer. arXiv:2109.07162, 2021.
- Wang H, Xie S, Lin L, Iwamoto Y, Han XH, Chen YW, Tong R. Mixed Transformer U-Net For Medical Image Segmentation. arXiv:2111.04734, 2021.
- Heidari M, Kazerouni A, Soltany M, Azad R, Aghdam EK, Cohen-Adad J, et al. HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2023:6191-201.
- Alrfou K, Zhao T. GCtx-UNet: Efficient Network for Medical Image Segmentation. arXiv:2406.05891, 2024.
- Ruan J, Xie M, Xiang S, Liu T, Fu Y. MEW-UNet: Multi-axis representation learning in frequency domain for medical image segmentation. arXiv:2210.14007, 2022.
- Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018:7132-41.
- Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional Block Attention Module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y. editors. Computer Vision – ECCV 2018. Lecture Notes in Computer Science, Springer, 2018;11211:3-19.
- Hou Q, Zhou D, Feng J. Coordinate Attention for Efficient Mobile Network Design. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021:13708-17.