TAC-UNet: transformer-assisted convolutional neural network for medical image segmentation
Introduction
Medical image segmentation is an important field in medical image analysis, as it provides crucial guidance and assistance in precise disease identification, rational diagnosis, prediction, and prevention. In recent years, there have been rapid advancements in deep learning, especially in convolutional neural networks (CNNs), which have been widely applied in the field of medical imaging (1). Due to the local perception ability of the convolutional layers, CNNs can effectively capture details and local features in images. In particular, since 2015, the U-Net (2) architecture based on CNNs has dominated this field. U-Net is a CNN-based encoder-decoder network architecture consisting of symmetric contracting and expanding paths that form a U-shaped structure. Simple skip connections pass to the decoder the spatial information that would otherwise be lost during the encoder’s downsampling, enhancing the preservation of details. Inspired by the U-Net architecture, researchers have proposed a series of U-Net variants, such as the UNet++ (3), UNet3+ (4), AttUNet (5), Res-UNet (6), and MultiResUNet (7). These algorithms often employ more complex encoder-decoder frameworks or improve on the simple skip connection scheme to enhance the performance of medical image segmentation. Despite achieving good performance in various medical image segmentation tasks, the U-Net and its variants are limited in how much further they can improve segmentation performance because of the inherent inductive bias of CNNs, which hinders their ability to model the global context of an image.
Inspired by the success of Transformer (8) models in language tasks, recent research (9-16) has applied Transformers to computer vision tasks. The vision transformer (ViT) (17) was the first image recognition model to fully adopt the Transformer architecture. It divides the input image into a sequence of patches with positional embeddings and models the relationships between these patches to capture the global context of the image. ViT has achieved state-of-the-art performance on various image recognition benchmarks. Owing to this strong performance, many researchers have built on the ViT model, leading to a series of ViT-based architectures. The SEgmentation TRansformer (SETR) (18) was the first model to introduce ViT into the field of image segmentation and has shown promising results. Swin-Unet (19) replaces the convolutional blocks in U-Net with Swin-Transformer blocks; the Swin-Transformer employs a window attention mechanism that performs attention calculations only within local windows, reducing the computational complexity of the Transformer to linear complexity with respect to the input size. TransUNet (20) combines a CNN and Transformer in a single architecture, effectively capturing both local details and global representations. Medical Transformer (MedT) (21) extends existing architectures by introducing additional control mechanisms into the self-attention mechanism, enabling effective training on small-scale datasets. These Transformer-based and hybrid methods have achieved some success; however, they still fail to produce satisfactory results across various medical segmentation tasks. One reason is the inconsistency in feature dimensions between CNNs and Transformers; simply combining the two does not eliminate the semantic gap between them. Additionally, Transformers lack the inherent inductive bias of CNNs, resulting in an insufficient ability to capture low-level features. Finally, compared to other computer vision tasks, the number of data samples in medical imaging is usually smaller, making it difficult for Transformers to be trained effectively on small-scale datasets, thereby affecting segmentation performance.
To leverage the strengths of both the CNN and ViT architectures while avoiding their respective limitations, we established a network structure called the Transformer-assisted convolutional neural network (TAC-UNet) for effective medical image segmentation. The TAC-UNet is primarily composed of a composite structure of a CNN and ViT (referred to as a TAC) stacked together. This composite module features a dual-path parallel structure, consisting of a CNN backbone and a Transformer branch. During training, the Transformer branch aids the CNN backbone in image segmentation by passing the learned global contextual information to the CNN backbone, which in turn enhances the global representation based on the local features it captures. The final segmentation output is produced by the CNN backbone. Additionally, considering the inconsistency in feature dimensions between the CNN and Transformer, a feature combination unit (FCU) (22) is used as a bridge between the two paths. This composite structure was adopted for the following reasons: (I) it maximizes the enhancement of global contextual information on the basis of the local features already obtained by the CNN; (II) by using a CNN backbone with a Transformer auxiliary branch, the model can effectively address the issue of requiring large-scale data for effective training, which is common in ViT-based architectures; (III) the FCU serves as a bridge, effectively eliminating the semantic gap between the CNN and Transformer features. Additionally, a channel cross-attention (CCA) module is used to better integrate the semantically inconsistent features between the encoder and decoder of the TAC-UNet, thereby further improving the segmentation performance of the network.
In summary, this study:
- Established the TAC-UNet network architecture, built from stacked TAC modules, to address the challenge of effectively training Transformers on small-scale datasets;
- Developed a composite structure, the TAC module, which inherits the advantages of both the Transformer and CNN in medical image segmentation. By leveraging the Transformer to assist the CNN, the CNN gains the ability to capture both global contextual information and local features; and
- Showed that our model achieved improved segmentation performance on three different datasets compared to the CNN-based, Transformer-based, and hybrid methods.
Related works
CNNs have become one of the primary methods for medical image segmentation due to their outstanding performance in image processing. An early milestone was the U-Net, proposed by Ronneberger et al. (2) in 2015, which introduced an encoder-decoder network architecture with skip connections that preserve high-resolution features, effectively addressing the issue of blurry segmentation boundaries. Inspired by this, Zhou et al. (3) proposed the UNet++ architecture, which innovatively introduced nested and dense skip connections along with a deep supervision mechanism. These designs enabled the network to better integrate features at different levels, improving segmentation accuracy and enhancing training efficiency and model performance. Huang et al. (4) proposed UNet3+, which uses full-scale skip connections and deep supervision to address the insufficient multi-scale information extraction of UNet++. Ibtehaz and Rahman (7) introduced a new network structure named the MultiResUNet, inspired by the concept of the residual network (ResNet), which further refines and optimizes the convolutional blocks and skip connections in the U-Net by incorporating residual structures.
Transformers initially gained prominence in the field of natural language processing due to their powerful global feature modeling capabilities and were later introduced to image recognition. In 2020, Dosovitskiy et al. (17) proposed the ViT, which splits an image into patches and models the global relationships among these patches using self-attention layers to capture the global context of the image. Inspired by this, Liu et al. (23) proposed the Swin Transformer, which achieves linear computational complexity by introducing a shifted-window mechanism. Cao et al. (19) introduced Swin-Unet, a pure Transformer-based U-shaped network, which achieved significant improvements in medical image segmentation. However, pure Transformer-based methods typically require large datasets for effective training.
In the field of medical image segmentation, hybrid structures combining CNNs and Transformers have garnered extensive attention. These hybrid structures leverage the advantages of CNNs in capturing local features, and the strengths of Transformers in handling global contextual information. For instance, Chen et al. (20) proposed the TransUNet, a model that combines a CNN and Transformer. The TransUNet employs a U-shaped structure consisting of an encoder and a decoder. In the TransUNet, the Transformer encodes tokenized image patches from CNN feature maps to extract global context, while the decoder upsamples the encoded features and combines them with high-resolution CNN feature maps for precise localization. The TransUNet has achieved excellent performance on multiple medical image segmentation datasets. Wang et al. (24) proposed the UCTransNet, which introduced multi-scale channel cross-fusion Transformer and CCA mechanisms to improve the U-Net’s skip connections and multi-scale modeling. Wang et al. (25) proposed the SMESwin Unet, which fuses multi-scale features by designing a composite structure of a CNN and ViT [referred to as Multi-scale Channel-wise Cross fusion Transformer (MCCT)] and introduces super-pixel technology to segment features at the regional level, thereby reducing the interference of irrelevant parts in the image. Heidari et al. (26) introduced HiFormer, which constructs multi-scale features through a dual-branch structure of a CNN and Transformer, and finely integrates local features and global representations through a dual-level fusion module. The success of these models demonstrates the significant potential of hybrid CNN-Transformer structures in medical image segmentation tasks.
Currently, the application of deep-learning models in medical image analysis remains an active research area with potential for more innovations and improvements in the future. Issues such as the effective combination of CNNs and Transformers in hybrid structures and the challenge of training Transformers effectively on small-scale datasets require further research and innovation. Therefore, addressing the challenges faced by current medical image segmentation algorithms and further enhancing their performance remain key focus areas of ongoing research.
Methods
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The overall architecture of the proposed TAC-UNet is illustrated in Figure 1. First, the TAC-UNet adopts a U-shaped network structure. Second, the TAC-UNet is primarily composed of stacked composite CNN-ViT (TAC) modules. Traditional CNN architectures are mainly composed of stacked convolutional blocks, while ViT architectures are mainly composed of stacked Transformers; using a CNN or Transformer alone can result in the neglect of either local or global features. Therefore, the proposed TAC structure maximizes the preservation of both local features and global representations, with the TAC-Stem serving as the initial module of the network for extracting initial local features. Additionally, a CCA module is employed as a bridge between the TAC encoder and decoder to better integrate their semantically inconsistent features. In the following sections, we discuss the TAC-Stem, TAC, and CCA modules in detail.
TAC-Stem
As Figure 1A shows, the TAC-Stem module serves as the initial module of the entire network. It first uses the CNN backbone to extract initial local feature maps, which are then fed into the Transformer to extract global representations, thereby deriving the Transformer branch. Specifically, the TAC-Stem module first applies a 3 × 3 convolution with a stride of 1 to the given input image to extract an initial local feature map X. The CNN backbone then feeds X directly into the subsequent layers, while the derived Transformer branch partitions X into non-overlapping patches of fixed resolution. These non-overlapping patches are flattened and fed into a linear embedding layer to obtain the original embedding sequence. Next, the embedding sequence is concatenated with a token obtained by global average pooling (GAP) of that sequence and fed into the Transformer Block (Trans-block) module. The concatenated embedding sequence, processed through the self-attention layer, learns the importance of the sequence itself, aligning the spatial dimensions for the subsequent Transformer branch input. Because the Transformer branch in the TAC-Stem module receives its input from the CNN backbone’s output, many fine image features are preserved rather than lost. The TAC-Stem module is represented by the following formulas:
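A plausible form of these formulas, in which $I$ denotes the input image, $X$ the initial local feature map, $E_0$ the original embedding sequence, and $Z_0$ the output token sequence (symbols introduced here for illustration rather than taken from the original), is:

$$X = \mathrm{Conv}_{3\times 3}(I)$$
$$E_0 = \mathrm{Linear}\big(\mathrm{Flatten}(\mathrm{Patch}(X))\big)$$
$$Z_0 = \mathrm{Trans}\big(\,[\mathrm{GAP}(E_0);\ E_0]\,\big)$$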
where Trans represents the Trans-block module. In our implementation, the Trans-block contains only one Transformer block, which consists of a multi-head attention module and a multilayer perceptron (MLP) block. Layer normalization (LayerNorm) is applied before the residual connections in both the self-attention layer and the MLP block.
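As a concrete illustration, a minimal pre-norm Transformer block of the kind described above can be sketched in PyTorch as follows; the embedding width, number of heads, and MLP expansion ratio are illustrative assumptions, not the settings used in the paper.

```python
import torch
import torch.nn as nn

class TransBlock(nn.Module):
    """Pre-norm Transformer block: LayerNorm -> multi-head attention -> residual,
    then LayerNorm -> MLP -> residual (a sketch; sizes are illustrative)."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                   # x: (B, N, dim) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x

# usage: tokens = torch.randn(2, 197, 384); out = TransBlock()(tokens)
```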
TAC
The main challenge in integrating the strengths of both the CNN and Transformer lies in enhancing global representations based on existing local features. Therefore, we established a novel TAC module. As Figure 2 shows, this module adopts a dual-path structure comprising a CNN backbone and a Transformer branch. First, the feature maps are passed from the CNN backbone to the Transformer branch to maximize the enhancement of global representations based on the local features already obtained. The enhanced features are then returned to the CNN backbone, thus forming a scheme in which the Transformer assists the CNN in segmentation. Throughout the training process, the Transformer branch continuously feeds global context into the feature maps, thereby enhancing the CNN backbone’s global perception capability. This design also alleviates the requirement of Transformer-based network architectures for large-scale training data. Moreover, to address the inconsistency in feature dimensions between the feature maps in the CNN backbone and the token sequences in the Transformer branch, inspired by the Conformer (22) work, we introduced the FCU as a bridging module. It not only aligns the spatial and channel dimensions of the feature maps and token sequences, but also eliminates the semantic discrepancies between them, allowing the CNN backbone to better capture the global context information from the Transformer branch.
The TAC module consists of a CNN backbone, a Transformer branch, and a bridging module called the FCU. First, the CNN backbone employs a typical U-Net design, which consists of stacked convolutional blocks. Notably, the TAC module functions as both an encoder and a decoder, depending on the changes in the channel dimensions of the feature maps in the CNN backbone. Second, in the FCU module, 1 × 1 convolutions are used to align channel dimensions, downsampling/upsampling is employed to align feature resolutions, and LayerNorm and batch normalization (BatchNorm) are used to align feature values. The FCU module performs the following two processes: (I) it feeds feature maps from the CNN backbone to the Transformer branch; and (II) it feeds the enhanced features from the Transformer branch back to the CNN backbone. Finally, in the Transformer branch, GAP is first applied to the tokens from the previous layer’s Transformer branch to aggregate the channel features. When the feature maps are input into the Transformer branch through the FCU module, they are concatenated with the aggregated token and fed into the Trans-block module for self-attention operations. This not only enhances the global perception capability of the CNN backbone’s features but also gradually bridges the semantic gap between the upper and lower layers of the encoder. The Trans-block module contains a single Transformer, composed of a multi-head attention module and an MLP block, with LayerNorm applied before the residual connections in the self-attention layer and the MLP block. The computations in the TAC module are summarized by the following formulas:
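A plausible form of these formulas for the $i$-th TAC module, with $X_{i-1}$ and $X_i$ the input and output feature maps of the CNN backbone, $Z_{i-1}$ and $Z_i$ the corresponding token sequences, and $\mathrm{FCU}_{\downarrow}$ / $\mathrm{FCU}_{\uparrow}$ the two directions of the bridging module (symbols introduced here for illustration rather than taken from the original), is:

$$Y_i = \big[\,R\big(B(\mathrm{Conv}(\cdot))\big)\,\big]^{N}(X_{i-1})$$
$$Z_i = \mathrm{Trans}\big(\,\big[\mathrm{GAP}(Z_{i-1});\ Z_{i-1} + \mathrm{FCU}_{\downarrow}(Y_i)\big]\,\big)$$
$$X_i = Y_i + \mathrm{FCU}_{\uparrow}(Z_i)$$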
where R, B, and N denote the rectified linear unit activation function, BatchNorm, and the number of 2D convolutions, respectively. The feature map at the i-th layer (i = 2, 3, 4, 5) is denoted by $X_i$, and $Z_i$ is the patch embedding sequence at the i-th layer, where d is the length of the patch embedding sequence and each embedding has a fixed channel dimension. Trans denotes the Trans-block module, which includes a Transformer module. In our implementation, the TAC-UNet architecture has a total of five layers.
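To make the bridging concrete, the two directions of the FCU can be sketched in PyTorch as follows, loosely following the Conformer-style FCU described above; the pooling stride, channel sizes, and interpolation mode are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCUDown(nn.Module):
    """Feature map -> token sequence: a 1x1 conv aligns channels, average pooling
    aligns spatial resolution, and LayerNorm aligns feature statistics (sketch)."""
    def __init__(self, in_ch, embed_dim, pool_stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=1)
        self.pool = nn.AvgPool2d(kernel_size=pool_stride, stride=pool_stride)
        self.norm = nn.LayerNorm(embed_dim)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.pool(self.proj(x))              # (B, D, H', W')
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) token sequence
        return self.act(self.norm(x))

class FCUUp(nn.Module):
    """Token sequence -> feature map: a 1x1 conv aligns channels, BatchNorm + ReLU
    align feature statistics, and interpolation restores the spatial resolution."""
    def __init__(self, embed_dim, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, tokens, out_size):         # tokens: (B, N, D), N = h * w
        B, N, D = tokens.shape
        h = w = int(N ** 0.5)
        x = tokens.transpose(1, 2).reshape(B, D, h, w)
        x = F.relu(self.bn(self.proj(x)))
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)

# usage: t = FCUDown(64, 384, 4)(torch.randn(2, 64, 56, 56))   # (2, 196, 384)
#        y = FCUUp(384, 64)(t, out_size=(56, 56))              # (2, 64, 56, 56)
```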
CCA
In the decoder part, inspired by the work on the UCTransNet (24), a CCA module is employed to address the inconsistency between the semantic features of the encoder and the decoder. As Figure 3 shows, the CCA module assigns a weight to each channel based on the channel information of the decoder features. This weight is then multiplied by the corresponding channel information from the same level of the encoder output, resulting in an enhanced feature vector. This enhanced feature vector can better integrate with the decoder features, thereby improving segmentation performance. Specifically, the i-th layer encoder feature and the decoder feature are each passed through a GAP module and a fully connected layer, and the results are added together to obtain the attention mask. The sigmoid function is then applied to construct the channel attention map, which is finally multiplied by the encoder feature to obtain the enhanced feature map. Mathematically, the CCA module can be summarized by the following formulas:
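A plausible form of these formulas, with $E_i$ and $D_i$ denoting the i-th layer encoder and decoder features, $\mathcal{L}_1$ and $\mathcal{L}_2$ the two linear layers, $M_i$ the attention mask, and $\hat{O}_i$ the output (symbols introduced here for illustration rather than taken from the original), is:

$$M_i = \mathcal{L}_1\big(\mathrm{GAP}(E_i)\big) + \mathcal{L}_2\big(\mathrm{GAP}(D_i)\big)$$
$$\hat{O}_i = \sigma(M_i)\,\odot\, E_i$$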
where $\mathcal{L}_1$ and $\mathcal{L}_2$ represent the two linear layers, σ represents the sigmoid function, and $\hat{O}_i$ is the output feature map.
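A minimal PyTorch sketch of such a channel cross-attention block, in the spirit of UCTransNet’s CCA, is shown below; the channel sizes and the exact projection layout are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class CCA(nn.Module):
    """Channel cross-attention (sketch): channel-wise statistics of the encoder
    and decoder features are projected by two linear layers, summed, passed
    through a sigmoid, and used to re-weight the encoder feature channels."""
    def __init__(self, enc_ch, dec_ch):
        super().__init__()
        self.fc_enc = nn.Linear(enc_ch, enc_ch)
        self.fc_dec = nn.Linear(dec_ch, enc_ch)

    def forward(self, enc, dec):                  # enc: (B, Ce, H, W), dec: (B, Cd, H, W)
        g_enc = enc.mean(dim=(2, 3))              # GAP over spatial dims -> (B, Ce)
        g_dec = dec.mean(dim=(2, 3))              # GAP over spatial dims -> (B, Cd)
        mask = torch.sigmoid(self.fc_enc(g_enc) + self.fc_dec(g_dec))  # (B, Ce)
        return enc * mask[:, :, None, None]       # channel-wise re-weighting

# usage: out = CCA(128, 256)(torch.randn(2, 128, 28, 28), torch.randn(2, 256, 28, 28))
```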
Experimental framework
Datasets
We evaluated the TAC-UNet network architecture using the Multi-organ Nucleus Segmentation (MoNuSeg) (27), Gland Segmentation (GlaS) (28), and Computer Vision Center Colorectal Cancer Clinic Database (CVC-ClinicDB) (29) datasets. MoNuSeg is a multi-organ nuclear dataset from the 2018 Medical Image Segmentation Challenge that consists of hematoxylin and eosin (H&E)-stained tissue images captured at 40× magnification downloaded from The Cancer Genome Atlas archives. GlaS is a dataset from the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) 2015 Gland Segmentation Challenge that contains images derived from 16 H&E-stained histological slides of colorectal cancer in stages T3 or T4. The CVC-ClinicDB dataset is the official training dataset from the MICCAI 2015 Endoscopic Vision Sub-challenge for Automatic Polyp Detection that consists of 612 static images extracted from colonoscopy videos. Table 1 provides further details about the three public datasets.
Table 1
Dataset | Classes | Size | Anatomy | Train images | Test images |
---|---|---|---|---|---|
MoNuSeg | Backgrounds, nuclei | 1,000×1,000 | Cell | 30 | 14 |
GlaS | Backgrounds, glands | 522×775 | Colorectal | 85 | 80 |
CVC-ClinicDB | Backgrounds, polyp | 288×384 | Colorectal | 551 | 61 |
The CVC-ClinicDB dataset is a medical imaging dataset created by the CVC at the Autonomous University of Barcelona. TAC, Transformer-assisted convolutional neural network; CVC-ClinicDB, Computer Vision Center Colorectal Cancer-Clinic Database.
Evaluation metrics
To accurately assess the performance of different models, appropriate evaluation metrics must be selected. Common evaluation metrics for medical image segmentation include the Dice coefficient, Intersection over Union (IoU), precision, recall, and accuracy. The Dice coefficient is the most commonly used metric in medical image segmentation tasks, as it effectively addresses the class imbalance often found in medical images. IoU, also known as the Jaccard index, measures the ratio of the intersection to the union of the predicted and ground-truth segmentation results. Precision represents the proportion of correctly segmented pixels among all pixels predicted as positive, focusing on the accuracy of the positive predictions. Recall, also known as sensitivity, indicates the proportion of correctly segmented pixels among all actual positive pixels. Accuracy represents the proportion of correctly classified pixels among the total number of pixels; it is suitable for balanced datasets but can be skewed by the number of negative examples in imbalanced datasets. The formulas for these metrics are expressed as follows:
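Expressed in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), the standard definitions of these metrics are:

$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$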
where TP denotes true positives, FP denotes false positives, FN denotes false negatives, and TN denotes true negatives. These evaluation metrics are widely used in image segmentation tasks and help comprehensively assess the performance of models. In practical applications, appropriate evaluation metrics can be selected based on the specific segmentation task, or a combination of their results can be considered.
Implementation details
Our experiments were conducted using the PyTorch framework (developed by Facebook’s AI Research lab), with training performed on an NVIDIA GeForce RTX (Ray Tracing Texel eXtreme) 3090 Graphics Processing Unit (GPU) with 24 GB of memory. Data augmentation techniques, such as vertical flipping, random rotation, and horizontal flipping, were applied to the datasets. The proposed TAC-UNet was trained from scratch without any pre-trained weights. The batch size was set to 4 for the MoNuSeg and GlaS datasets and to 16 for the CVC-ClinicDB dataset. For each dataset, the input image size was set to 224 × 224, and the Adam optimizer was used with an initial learning rate of 0.001. The loss function for training the network was a combination of binary cross-entropy (BCE) loss and Dice loss. For the predicted value ŷ and the target value y, the loss function L is expressed as:
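A common form of such a combined loss, written here with an equal weighting as an illustrative assumption (the exact weighting used in the paper is not reproduced above), is:

$$L(\hat{y}, y) = \tfrac{1}{2}\,L_{\mathrm{BCE}}(\hat{y}, y) + \tfrac{1}{2}\,L_{\mathrm{Dice}}(\hat{y}, y), \qquad L_{\mathrm{Dice}}(\hat{y}, y) = 1 - \frac{2\sum_{k}\hat{y}_k\, y_k + \epsilon}{\sum_{k}\hat{y}_k + \sum_{k} y_k + \epsilon}$$

where $L_{\mathrm{BCE}}$ is the standard binary cross-entropy, the sums run over pixels, and $\epsilon$ is a small smoothing constant.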
All baselines were trained with the same settings and loss function. To evaluate our method, we used the Dice coefficient, IoU, precision, recall, and accuracy as the evaluation metrics for the medical image segmentation tasks.
Results
Comparisons with state-of-the-art methods
To demonstrate the effectiveness of the TAC-UNet, we evaluated and compared its segmentation performance with CNN-based methods, Transformer-based methods, and hybrid methods. The CNN-based methods included the U-Net (2), UNet++ (3), Attention U-Net (5), R2UNet (30), HoVerNet (31), PraNet (32), and DoubleUnet-DCA (33). The Transformer-based method was primarily the Swin-Unet (19). The hybrid models included the TransUNet (20), UCTransNet (24), and HiFormer (26). The comparison models in the experiments were configured with the same settings as the TAC-UNet.
For the CVC-ClinicDB dataset, the experimental results are shown in Table 2, with the better results highlighted in bold. As Table 2 shows, the TAC-UNet outperformed most CNN models, Transformer models, and hybrid models, achieving an excellent Dice score of 91.81%. From this analysis, we can conclude that Transformer models have lower segmentation accuracy due to their limited ability to extract low-level features. Conversely, while CNN models can capture fine image details, they are unable to fully exploit the global contextual information, limiting their segmentation capability. Hybrid models may be constrained by how the two components are combined, making them difficult to train effectively on small-scale datasets. Our method extends the CNN model architecture with a Transformer branch for auxiliary segmentation. The advantage of this approach is that the CNN backbone continuously interacts with the Transformer branch during training, enhancing its global perception. Additionally, when we adjusted the TAC modules in the 4th and 5th layers of the encoder and decoder of the five-layer TAC-UNet by increasing the number of Transformers in the Trans-block module from one to two, the Dice score on the CVC-ClinicDB dataset improved by a further 0.5%, reaching 92.31%. In Table 2, this model is referred to as the TAC-UNet-L.
Table 2
Methods | Params (M) | Dice (%) | IoU (%) | Precision (%) | Recall (%) | Accuracy (%) |
---|---|---|---|---|---|---|
U-Net | 14.8 | 86.53 | 78.78 | 85.80 | 89.66 | 97.64 |
UNet++ | 47.2 | 88.07 | 80.51 | 87.43 | 91.32 | 97.77 |
Attention U-Net | 34.9 | 90.31 | 83.98 | 89.27 | 93.00 | 98.18 |
R2UNet | 39.1 | 72.44 | 62.86 | 84.53 | 70.76 | 95.10 |
PraNet | 32.5 | 90.11 | 83.87 | 89.19 | 93.61 | 97.96 |
TransUNet | 105.3 | 90.26 | 84.02 | 89.51 | 93.17 | 98.26 |
Swin-Unet | 41.4 | 72.85 | 61.78 | 76.12 | 75.83 | 95.94 |
HiFormer | 25.5 | 91.26 | 85.13 | 88.84 | 94.79 | 98.52 |
UCTransNet | 65.6 | 91.37 | 85.52 | 89.93 | 94.59 | 98.61 |
DoubleUnet-DCA | 30.6 | 91.46 | 85.07 | 90.35 | 93.25 | 98.52 |
TAC-UNet (ours) | 101.8 | 91.81 | 85.91 | 90.35 | 94.03 | 98.68 |
TAC-UNet-L (ours) | 123.1 | 92.31 | 86.92 | 90.89 | 94.61 | 98.73 |
CVC-ClinicDB, Computer Vision Center Colorectal Cancer-Clinic Database; DCA, dual cross-attention; TAC, transformer-assisted convolutional neural network; IoU, Intersection over Union.
For the MoNuSeg and GlaS datasets, which contain fewer medical images, the experimental comparison results are shown in Table 3. Again, the better results are highlighted in bold. As Table 3 shows, the Transformer-based model, the Swin-Unet, achieved a Dice score of 77.74% on the MoNuSeg dataset and 85.84% on the GlaS dataset. In contrast, the UNet++ and Attention U-Net models outperformed the Swin-Unet by 0.33% and 0.49% on the MoNuSeg dataset, and by 2.23% and 4.13% on the GlaS dataset, respectively. However, the TAC-UNet achieved excellent results with Dice scores of 80.36% on the MoNuSeg dataset and 90.7% on the GlaS dataset, outperforming all the other baseline models. Additionally, our TAC-UNet-L model achieved a Dice score of 81.13% on the MoNuSeg dataset, representing a 0.77% improvement on the Dice score.
Table 3
Methods | MoNuSeg Dice (%) | MoNuSeg IoU (%) | MoNuSeg Precision (%) | MoNuSeg Recall (%) | MoNuSeg Accuracy (%) | GlaS Dice (%) | GlaS IoU (%) | GlaS Precision (%) | GlaS Recall (%) | GlaS Accuracy (%) |
---|---|---|---|---|---|---|---|---|---|---|
U-Net | 76.89 | 63.49 | 70.84 | 86.89 | 88.59 | 86.28 | 76.84 | 84.96 | 89.90 | 86.07 |
UNet++ | 78.07 | 65.46 | 73.78 | 86.41 | 88.89 | 88.07 | 79.66 | 86.63 | 91.27 | 88.35 |
Attention U-Net | 78.23 | 65.36 | 72.63 | 87.48 | 89.38 | 89.97 | 82.83 | 90.43 | 90.88 | 90.30 |
R2UNet | 78.61 | 64.84 | 75.07 | 83.38 | 89.93 | 73.04 | 60.71 | 75.24 | 76.26 | 74.90 |
HoVerNet | 79.91 | 66.66 | 72.93 | 89.00 | 90.08 | 86.66 | 77.63 | 85.69 | 89.71 | 87.04 |
TransUNet | 78.44 | 64.64 | 72.67 | 86.09 | 89.74 | 87.45 | 79.22 | 87.11 | 90.18 | 88.18 |
Swin-Unet | 77.74 | 63.70 | 70.30 | 87.73 | 89.03 | 85.84 | 76.19 | 84.10 | 89.25 | 85.94 |
HiFormer | 73.27 | 58.44 | 65.81 | 84.87 | 86.73 | 89.37 | 81.60 | 88.18 | 91.66 | 89.30 |
UCTransNet | 78.91 | 65.73 | 72.36 | 88.28 | 89.96 | 89.05 | 81.36 | 89.50 | 90.25 | 89.84 |
DoubleUnet-DCA | 79.30 | 65.86 | 72.84 | 88.09 | 90.04 | 88.06 | 79.92 | 87.01 | 91.34 | 88.77 |
TAC-UNet (ours) | 80.36 | 67.35 | 81.02 | 81.38 | 91.35 | 90.70 | 83.72 | 89.06 | 93.55 | 90.89 |
TAC-UNet-L (ours) | 81.13 | 68.38 | 79.90 | 83.22 | 91.48 | 90.18 | 82.97 | 89.29 | 92.35 | 90.54 |
MoNuSeg, Multi-organ Nucleus Segmentation; GlaS, Gland Segmentation; DCA, dual cross-attention; TAC, Transformer-assisted convolutional neural network; IoU, Intersection over Union.
Additionally, we plotted precision-recall (P-R) curves and receiver operating characteristic (ROC) curves for the three datasets to evaluate the segmentation performance of the models (see Figure 4). The P-R curve focuses on the trade-off between precision and recall, and the TAC-UNet demonstrated competitive precision values across all three datasets, maintaining a high recall at high precision. The areas under the P-R curve for TAC-UNet on the three datasets were 0.8892, 0.9698, and 0.9834, respectively, outperforming the other compared models. This shows the superior performance of our model in medical image segmentation. The ROC curve showed the trade-off between the true positive rate (TPR) and false positive rate (FPR) at different classification thresholds. In our experiments, the TAC-UNet consistently maintained a high TPR. The areas under the curve (AUCs) of the ROC curves for the TAC-UNet on the three datasets were 0.9635, 0.9679, and 0.9973, respectively, all surpassing the AUCs of the other compared models. This demonstrates that our model effectively captures TPs while maintaining a low FPR, proving its superiority in medical image segmentation tasks.
We also visualized the segmentation results of the different models, as displayed in Figure 5. From top to bottom, the datasets are the MoNuSeg, GlaS, and CVC-ClinicDB datasets, with our TAC-UNet highlighted in bold. As Figure 5 shows, for the MoNuSeg dataset, which is a small-scale dataset, hybrid models such as the TransUNet may be limited by the dataset size, resulting in less effective training and coarser segmentation edges compared to the ground truth. The U-Net, which lacks the ability to model global contextual relationships, tends to over-segment. In contrast, the TAC-UNet, which effectively combines local features and global contextual information, produces clearer edge details and overall contours. For the GlaS and CVC-ClinicDB datasets, our model also showed superior performance in terms of edge details and contours compared to the other CNN-based, Transformer-based, and hybrid methods.
In Table 4, we present the training and inference times of the TAC-UNet compared to the current benchmark models. All models were tested on an Advanced Micro Devices (AMD) Ryzen 7 Central Processing Unit (CPU) and an RTX 3090 GPU with 24 GB of memory. As Table 4 shows, the TAC-UNet struck an impressive balance between performance and efficiency during both the training and inference stages. The total training time for the TAC-UNet was only 18 minutes, which, while slightly longer than that of the U-Net (10 minutes), was significantly shorter than that of heavier models such as the Swin-Unet (45 minutes). Notably, the TAC-UNet achieved a Dice score of 80.36 while maintaining a reasonable training time, demonstrating its excellent segmentation performance. During inference, the TAC-UNet operated at approximately 55 frames per second. While it is not the fastest model in terms of inference speed, its performance remains competitive, especially considering that it achieved the highest Dice score. Therefore, the TAC-UNet is well suited to applications that demand both computational efficiency and high segmentation quality.
Table 4
Methods | Epochs | Training (min) | Inference (fps) | Dice (%) |
---|---|---|---|---|
U-Net | 112 | ∼10 | ∼219 | 76.89 |
UNet++ | 182 | ∼19 | ∼69 | 78.07 |
Attention U-Net | 137 | ∼21 | ∼130 | 78.23 |
R2UNet | 116 | ∼15 | ∼58 | 78.61 |
HoVerNet | 178 | ∼30 | ∼39 | 79.91 |
Swin-Unet | 284 | ∼45 | ∼65 | 77.74 |
TransUNet | 48 | ∼17 | ∼49 | 78.44 |
UCTransNet | 221 | ∼27 | ∼44 | 78.91 |
DoubleUnet-DCA | 76 | ∼17 | ∼39 | 79.30 |
HiFormer | 288 | ∼35 | ∼62 | 73.27 |
TAC-UNet (ours) | 132 | ∼18 | ∼55 | 80.36 |
The epoch at which each model converged was recorded. MoNuSeg, Multi-organ Nucleus Segmentation; DCA, dual cross-attention; TAC, Transformer-assisted convolutional neural network.
Ablation studies
Ablation studies of different modules in terms of segmentation performance
We further conducted ablation studies on the GlaS, MoNuSeg, and CVC-ClinicDB datasets to evaluate the effect of the different modules on segmentation performance (see Table 5). First, using the U-Net as the baseline, we observed a significant improvement in segmentation performance when the convolutional blocks were replaced with the TAC module, confirming the effectiveness of the TAC module’s integration strategy. Second, to validate the effectiveness of the GAP module within the TAC module, we removed it and observed a decrease in segmentation performance, which underscores the importance of including the GAP module. Finally, the addition of the CCA module led to further improvements in the segmentation results, supporting the overall layout of the TAC-UNet model. In addition, we visualized the segmentation results of the different models (see Figure 6).
Table 5
Model | GlaS Dice (%) | GlaS IoU (%) | MoNuSeg Dice (%) | MoNuSeg IoU (%) | CVC-ClinicDB Dice (%) | CVC-ClinicDB IoU (%) |
---|---|---|---|---|---|---|
Baseline (U-Net) | 86.28 | 76.84 | 76.89 | 63.49 | 86.53 | 78.78 |
Baseline + CCA | 87.84 | 79.23 | 77.93 | 64.31 | 89.60 | 83.19 |
Baseline + TAC (w/o GAP) | 88.97 | 81.53 | 78.12 | 64.54 | 90.96 | 84.90 |
Baseline + TAC | 90.33 | 83.24 | 80.12 | 66.97 | 91.80 | 85.88 |
Baseline + TAC + CCA | 90.70 | 83.72 | 80.36 | 67.35 | 91.81 | 85.91 |
CCA, channel cross-attention; TAC, Transformer-assisted convolutional neural network; GAP, global average pooling; GlaS, Gland Segmentation; MoNuSeg, Multi-organ Nucleus Segmentation; CVC-ClinicDB, Computer Vision Center Colorectal Cancer-Clinic Database; IoU, Intersection over Union.
Ablation studies of different TAC configurations in terms of segmentation performance
We also explored the effect of the number of Transformers used in the TAC module on the model’s segmentation performance (see Figure 7). We adjusted the five-layer TAC-UNet by configuring the number of Transformers in the Trans-block module of the TAC modules in the encoder and decoder from the 2nd to the 5th layers according to the array [n1, n2, n3, n4]. For example, [1, 1, 2, 2] indicates that the numbers of Transformers used in the TAC modules from the 2nd to the 5th layers are 1, 1, 2, and 2, respectively. We tested four configurations on the MoNuSeg, GlaS, and CVC-ClinicDB datasets. A comparison of the three line charts revealed that the first configuration performed best on the GlaS dataset, while the third configuration, despite having a higher parameter count than the first, showed the best segmentation performance on the MoNuSeg and CVC-ClinicDB datasets.
Discussion
The accuracy of medical image segmentation directly affects the quality of medical image analysis and the reliability of clinical decision making (1). Capturing both local detail features and global semantic information is crucial for improving segmentation accuracy (34). In this study, we designed the TAC module to organically combine a CNN and Transformer in a dual-path structure. The advantage of this design is that the Transformer branch continuously transmits global semantic information to the CNN backbone, thereby enhancing the global representation of a backbone that is already adept at capturing local features. This design not only compensates for the CNN’s limitations in processing global information but also mitigates the challenge of training Transformers on small-scale datasets. Our experiments demonstrated that the TAC-UNet achieved significantly better segmentation performance than the CNN-based, Transformer-based, and hybrid methods. Additionally, the TAC-UNet struck an impressive balance between effectiveness and efficiency during both the training and inference phases, remaining competitive with the other models in terms of training and inference times.
However, the TAC-UNet also has certain limitations. First, the introduction of the Transformer branch increases the model’s computational complexity and memory consumption, which may pose challenges for lightweight and fast training (8). Second, while this study validated the effectiveness of the TAC-UNet on several small-scale datasets, the wide variety of medical image types and the high-quality annotation requirements mean that the model’s applicability and generalization capability for different types of medical image segmentation need to be explored further. Future work will focus on the lightweight design of the model and seek to explore its potential applications in more medical imaging tasks.
Conclusions
In this study, we addressed the difficulty of training ViT models effectively on small-scale datasets and the limited ability of CNN models to fully capture global contextual information by designing the TAC composite module. This module features a dual-path parallel structure composed of a CNN backbone and a Transformer branch, in which the Transformer branch assists the CNN backbone in image segmentation. Specifically, during training, the Transformer branch transmits the learned global contextual information to the CNN backbone, enhancing its global perception capabilities based on the existing local features. The final segmentation output is also produced by the CNN backbone. This approach effectively mitigates the difficulty of training Transformers on small datasets and enhances the CNN’s global perception based on local features. We conducted extensive comparative experiments on three public datasets: the MoNuSeg dataset, which comprises only 30 training images; the GlaS dataset, which comprises 85 training images; and the CVC-ClinicDB dataset, which comprises 490 training images. On all three, our method outperformed the CNN-based, Transformer-based, and hybrid baseline models, demonstrating the effectiveness of the proposed method.
Acknowledgments
Funding: This study was supported by
Footnote
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1229/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Wang R, Lei T, Cui R, Zhang B, Meng H, Nandi AK. Medical image segmentation using deep learning: A survey. IET Image Processing 2022;16:1243-67.
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18; 2015: Springer.
- Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
- Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen YW, Wu J. Unet 3+: A full-scale connected unet for medical image segmentation. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP); Barcelona, 4-8 May 2020, 1055-1059.
- Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:180403999 2018.
- Xiao X, Lian S, Luo Z, Li S. Weighted res-unet for high-quality retina vessel segmentation. 2018 9th international conference on information technology in medicine and education (ITME); 2018: Hangzhou, China.
- Ibtehaz N, Rahman MS. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw 2020;121:74-87. [Crossref] [PubMed]
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, 4-9 December 2017, 5998-6008.
- Shi Z, Ma Y, Ding S, Yan Z, Zhu Q, Xiong H, Li C, Xu Y, Tan Z, Yin F, Chen S, Li Y. Radiomics derived from T2-FLAIR: the value of 2- and 3-classification tasks for different lesions in multiple sclerosis. Quant Imaging Med Surg 2024;14:2049-59. [Crossref] [PubMed]
- Lin A, Chen B, Xu J, Zhang Z, Lu G, Zhang D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Transactions on Instrumentation and Measurement 2022;71:1-15.
- Li Z, Li Y, Li Q, Wang P, Guo D, Lu L, Jin D, Zhang Y, Hong Q. LViT: Language Meets Vision Transformer in Medical Image Segmentation. IEEE Trans Med Imaging 2024;43:96-107. [Crossref] [PubMed]
- Huang X, Deng Z, Li D, Yuan X. Missformer: An effective medical image segmentation transformer. arXiv preprint arXiv:210907162 2021.
- Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth HR, Xu D. Unetr: Transformers for 3d medical image segmentation. Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2022.
- Gao Y, Zhou M, Metaxas DN. UTNet: a hybrid transformer architecture for medical image segmentation. Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24; 2021: Springer.
- Chen H, Li C, Wang G, Li X, Rahaman MM, Sun H, Hu W, Li Y, Liu W, Sun C. GasHis-Transformer: A multi-scale visual transformer approach for gastric histopathological image detection. Pattern Recognition 2022;130:108827.
- Liang J, Yang C, Zeng M, Wang X. TransConver: transformer and convolution parallel network for developing automatic brain tumor segmentation in MRI images. Quant Imaging Med Surg 2022;12:2397-415. [Crossref] [PubMed]
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:201011929 2020.
- Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Fu Y, Feng J, Xiang T, Torr PH. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-unet: Unet-like pure transformer for medical image segmentation. In: Karlinsky L, Michaeli T, Nishino K, editors. European Conference on Computer Vision. Switzerland: Springer; 2022.
- Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv: 210204306 2021.
- Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: Gated axial-attention for medical image segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24; 2021: Springer.
- Peng Z, Huang W, Gu S, Xie L, Wang Y, Jiao J, Ye Q. Conformer: Local features coupling global representations for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision; 2021.
- Wang H, Cao P, Wang J, Zaiane OR. UCTransNet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. Proceedings of the AAAI Conference on Artificial Intelligence 2022;36:2441-9.
- Wang Z, Min X, Shi F, Jin R, Nawrin SS, Yu I, Nagatomi R. SMESwin Unet: Merging CNN and transformer for medical image segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V; 2022: Springer.
- Heidari M, Kazerouni A, Soltany M, Azad R, Aghdam EK, Cohen-Adad J, Merhof D. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023; 6202-6212.
- Kumar N, Verma R, Sharma S, Bhargava S, Vahadane A, Sethi A. A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology. IEEE Trans Med Imaging 2017;36:1550-60. [Crossref] [PubMed]
- Sirinukunwattana K, Pluim JPW, Chen H, Qi X, Heng PA, Guo YB, Wang LY, Matuszewski BJ, Bruni E, Sanchez U, Böhm A, Ronneberger O, Cheikh BB, Racoceanu D, Kainz P, Pfeiffer M, Urschler M, Snead DRJ, Rajpoot NM. Gland segmentation in colon histology images: The glas challenge contest. Med Image Anal 2017;35:489-502. [Crossref] [PubMed]
- Silva J, Histace A, Romain O, Dray X, Granado B. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int J Comput Assist Radiol Surg 2014;9:283-93. [Crossref] [PubMed]
- Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK. Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv: 180206955 2018.
- Graham S, Vu QD, Raza SEA, Azam A, Tsang YW, Kwak JT, Rajpoot N. Hover-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med Image Anal 2019;58:101563. [Crossref] [PubMed]
- Fan DP, Ji GP, Zhou T, Chen G, Fu H, Shen J, Shao L. Pranet: Parallel reverse attention network for polyp segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2020: Springer.
- Ates GC, Mohan P, Celik E. Dual cross-attention for medical image segmentation. Engineering Applications of Artificial Intelligence 2023;126:107139.
- Khan RF, Lee BD, Lee MS. Transformers in medical image segmentation: a narrative review. Quant Imaging Med Surg 2023;13:8747-67. [Crossref] [PubMed]