Not Another Dual Attention UNet Transformer (NNDA-UNETR): a plug-and-play parallel dual attention block in U-Net with enhanced residual blocks for medical image segmentation

Lei Cao; Qikai Zhang; Chunjiang Fan; Yongnian Cao

doi:10.21037/qims-24-833

Original Article

Not Another Dual Attention UNet Transformer (NNDA-UNETR): a plug-and-play parallel dual attention block in U-Net with enhanced residual blocks for medical image segmentation

Lei Cao¹, Qikai Zhang¹, Chunjiang Fan², Yongnian Cao³

¹Department of Artificial Intelligence, Shanghai Maritime University, Shanghai, China; ²Department of Rehabilitation Medicine, Wuxi Rehabilitation Hospital, Wuxi, China; ³Tiktok Inc., San Jose, CA, USA

Contributions: (I) Conception and design: L Cao, Q Zhang; (II) Administrative support: C Fan, Y Cao; (III) Provision of study materials or patients: Q Zhang; (IV) Collection and assembly of data: L Cao, Q Zhang, Y Cao; (V) Data analysis and interpretation: L Cao, Q Zhang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Chunjiang Fan, PhD. Department of Rehabilitation Medicine, Wuxi Rehabilitation Hospital, No. 88 Qiangao Road, Wuxi 214002, China. Email: fangchunjiang1980@163.com.

Background: Medical image segmentation is crucial for clinical diagnostics and treatment planning. Recently, hybrid models often neglect the local modeling capabilities of Transformers for medical image segmentation, despite the complementary nature of local information from both convolutional neural networks (CNNs) and transformers. This limitation is particularly problematic in multi-organ segmentation, where organs are closely adhered, and accurate delineation is essential. This study aims to develop a novel method that leverages the strengths of both CNNs and transformers to address the challenges of segmenting multiple closely adhered organs, improving accuracy and robustness in multi-organ segmentation.

Methods: In this study, we present the Not Another Dual Attention block (NNDA-block), a versatile, plug-and-play module that seamlessly integrates channel and spatial attention mechanisms. This block can be incorporated at any stage within the Not Another Dual Attention UNet Transformer (NNDA-UNETR) framework. Our novel approach to spatial attention uniquely combines local modeling, significantly reducing detail and texture loss during the up- and down-sampling processes typical in U-shaped architectures.

Results: Evaluations on the Beyond the Cranial Vault (BTCV) and Automatic Cardiac Diagnosis Challenge benchmark datasets demonstrate our method’s effectiveness in balancing model complexity with accuracy. On the BTCV dataset, our model sets new state-of-the-art benchmarks, achieving a Dice similarity coefficient of 84.13%, normalized surface dice of 86.96%, mean average surface distance of 3.46 mm, and an Hausdorff distance at 95% of 17.76 mm. Moreover, it reduces the parameter count by 34.2% compared to leading contemporary methods, all while maintaining low computational costs [measured in floating point operations per second (FLOPs)].

Conclusions: NNDA-UNETR offers a robust solution for accurate segmentation in multi-organ tasks, particularly where organ adhesion poses challenges. Its lightweight design also makes it well-suited for deployment in real-world medical environments with limited computational resources.

Keywords: Dual attention mechanism; abdominal segmentation; cardiac segmentation; magnetic resonance imaging (MRI); computed tomography imaging (CT imaging)

Submitted Apr 26, 2024. Accepted for publication Oct 22, 2024. Published online Nov 29, 2024.

doi: 10.21037/qims-24-833

Introduction

Biomedical image segmentation is essential in a wide range of applications in medical research and clinical practice (1-6). These tasks encompass quantitative volume measurement, precise identification of object boundaries and lesion localization, disease diagnosis, and the analysis of anatomical structures. Moreover, they play a critical role in computer-aided diagnosis, image-guided surgeries, and tumor radiotherapy. A typical example would be the multi-organ segmentation in abdominal computed tomography (CT) and the segmentation of left ventricle, right ventricle, and myocardium in cardiac magnetic resonance imaging (MRI). However, the majority of automated segmentation methods have not yet tackled the challenges associated with segmenting multiple organs that adhere to each other at present. Wong et al. (7) highlighted that deep learning shows great potential in cardiovascular image diagnosis by automating feature extraction and improving accuracy. However, challenges such as model investigation and data quality limitations require further research and optimization. Additionally, the accurate identification and segmentation of subtle or severely deformed organs remain a significant hurdle. Zhou et al. (8) sought to improve model segmentation performance by indiscriminately incorporating a large number of parameters and high-computational Transformer layers in their not another transformer (nnFormer) model. This practice impedes the practical application of the model in real-world scenarios. A multitude of advanced networks [e.g., attention U-Net 3D-UXNET (9,10), Sharp U-Net (11) or MaskSup (12)], have emerged, consistently pushing the boundaries of performance in various segmentation tasks by leveraging the exceptional representation learning abilities of deep learning. Verma et al. (13) advanced deep learning in cancer research with the MoNuSAC2020 challenge, offering a diverse dataset of 46,000 annotated nuclei and evaluating algorithms to enhance automated nucleus detection and segmentation. Several studies (14-16) have reported significant advancements in the segmentation of retinal blood vessels, pancreas and pancreatic tumors, as well as liver vessels, using deep learning techniques, particularly convolutional neural networks (CNNs) and U-Net architectures.

While CNN-based methods have shown strong performance, they struggle with long-range dependencies due to inherent biases like locality and translational variance. To tackle these challenges, researchers have employed strategies such as enlarging convolutional kernels, utilizing atrous convolutions, and incorporating attention mechanisms (17-19). For example, TransUNet combines with CNNs as a backbone with transformers to model long-term dependencies (20-23). Transformers have demonstrated superior performance over CNNs in capturing both local and global interactions, owing to their capacity to model long-range dependencies (24-27). As a result, various methods have emerged, incorporating transformers into biomedical image segmentation. For instance, Karimi et al. proposed a deconvolutional segmentation model that converts three-dimensional (3D) image blocks into sequential data (28). Drawing on this concept, Cao et al. employed transformer-based hierarchical blocks to design encoders and decoders, integrating a Swin transformer with linear complexity and a sliding window mechanism customized for medical applications (29). There remains a critical need to explore more effective approaches for harmoniously integrating the strong localized feature extraction capabilities of convolutional layers with the superior context modeling power of transformers to design an optimized network for medical image segmentation. Unlike natural images, medical images are not widely available and often demand meticulous and labor-intensive annotations. Moreover, the substantial computational overhead of self-attention mechanisms in transformers may result in suboptimal performance when relying solely on transformer-based models for biomedical image segmentation tasks.

In the realm of biomedical image segmentation, an array of hybrid architectures integrating CNNs and transformers has surfaced. This amalgamation capitalizes on the inherent strengths of each component: CNN excels in local feature and texture detail extraction, while self-attention within Transformers adeptly captures long-range context (the model’s ability to capture dependencies and relationships between distant elements in the input data). The synergy between these components enhances the model’s full performance potential. Notably, several well-regarded hybrid architectures have emerged from the integration of transformers and CNNs, where transformers are either incorporated into models with CNNs as the foundational structure or replace specific modules within the network. For instance, MISSFormer and CoTr improve UNet’s skip connections by integrating transformer blocks, capturing multi-scale global dependencies, and bridging the encoder-decoder semantic gap (30,31). Xu et al. contributed by analyzing the trade-offs between transformers and CNNs, proposing the LeViT-UNet encoder, which optimizes efficiency and performance (32). Li et al. focused on the decoder, introducing an up-sampling method that embeds contextual information into encoder layers (33). Additionally, nnFormer integrates both transformer and convolutional blocks in the encoder and decoder stages. Liu et al. introduced PHtrans, a parallel architecture that combines CNN and transformer elements, enhancing model performance by independently processing local and global features (34).

While integrating CNN and Transformer components improves segmentation accuracy, it also substantially increases the number of trainable parameters and computational demands. Moreover, these approaches often struggle with segmenting closely adherent organs and tend to overlook channel-specific features, placing greater emphasis on spatial relationships. The implicit multi-head self-attention used for inter-channel dependencies lacks interpretability and adds computational overhead. To address these issues, we propose a versatile block that aligns spatial and channel dependencies in parallel, enhancing training efficiency. This block can be easily integrated at various scales within U-shaped architectures, improving both performance and computational efficiency. We have summarized the advantages and disadvantages of several representative methods, as shown in Table 1. Our main contributions of this work are as follows: (I) local-global interaction: in data analysis, current methods using vision and Swin transformers face quadratic complexity and data scarcity challenges. To address this, we introduce Max SA, a linear complexity model with sparse attention mechanisms, ex-tended to 3D and integrated into the U-shaped architecture. (II) Parallel combination of spatial attention and channel attention: this research introduces a fusion method that combines local, global, and channel attention mechanisms for dynamic integration at any stage within the U-shaped architecture. (III) Combination with U-shaped architecture: we have developed the Not Another Dual Attention UNet Transformer (NNDA-UNETR) model by integrating the Not Another Dual Attention block (NNDA-block) with the U-shaped architecture. The NNDA-block incorporates spatial and channel attention mechanisms to ensure that feature information is preserved during up-sampling and concatenation, thereby improving decoder performance.

Table 1

Pros and cons of representative biomedical image segmentation approaches

Method	Specific models	Pros	Cons
CNN-based	Retinal vessel CNNs (3); Karimi et al.’s model (28)	Strong performance in feature extraction; efficient for localized patterns	Struggles with long-range dependencies; locality and translational variance issues
UNet variants	Attention U-Net (9), Sharp U-Net (11), MaskSup (12)	High accuracy in segmentation tasks; effective encoder-decoder structure	Computationally intensive; requires large datasets
Transformer-based	TransFuse (21), SwinUNETR (35), PHtrans (34)	Excellent in capturing long-range dependencies; flexible and adaptable	High memory requirements; needs extensive data for training
Hybrid models	MISSFormer (30), CoTr (31), nnFormer (8), 3D-UXNET (10)	Combines local feature extraction and global context modeling; improved encoder-decoder synergy	Higher trainable parameters and computational cost; complexity in model design

CNN, convolutional neural network; MaskSup, mask supervision; TransFuse, transformer-based fusion network; SwinUNETR, swin transformer-based UNet; PHtrans, parallel hybrid transformer; MISSFormer, medical image segmentation with swin transformer; CoTr, coordinate transformer; nnFormer, not another transformer.

Methods

Inspired by the sparse attention-based Max SA of Tu et al. (36) and the improved transformer for high-resolution generative adversarial networks (GANs) by Zhao et al. (37), we introduce an improved version of 3D-Max SA for medical images. This approach enables both local modeling and the capture of long-term dependencies by decomposing traditional dense attention into two components: local attention within non-overlapping windows and sparse global attention across a uniform grid. It preserves the non-local nature of attention while reducing the quadratic complexity of standard attention to a linear scale. Thus, our proposed spatial attention module is constructed by sequentially designing convolutional blocks with 3D-Max SA. This module is then parallelized with the channel attention module to form our innovative “plug-and-play” NNDA-block. Moreover, the NNDA-block derives advantages from the integration of low parameter counts and computational complexity, as well as integrating local and global context information. This results in a model that demonstrates competitive performance in terms of both model capacity and generalization ability.

Overall architecture

Similar to the majority of U-shaped architectures, our model is also a hierarchical architecture, as depicted in Figure 1. The encoder receives 3D tensors (patches) as input, and its outputs are linked to the decoder part via skip connections. Following this, 3D Conv-blocks are applied to generate the ultimate output. Our model consists of two main components: the backbone and the meticulously crafted plug-and-play NNDA-block. The core focus of our design is the introduction of the NNDA-block.

Figure 1 Our NNDA-UNETR approach demonstrates the hierarchical architecture. Initially, the encoder receives 3D tensors (patches), and their outputs establish connections with the decoder using standard skip connections. Subsequently, Conv-Blocks generate the final segmentation output. Our design innovation lies in the incorporation of the NNDA-block, which efficiently learns enhanced spatial-channel feature representations. This is achieved by employing parallel attention modules to perform two tasks simultaneously. The right-side NNDA-block diagram depicts this mechanism. The spatial attention module integrates local and global attention mechanisms along with a lightweight block called Light-Conv3d. Additionally, the second attention module highlights channel dependencies to produce channel attention maps. By merging the outputs from both attention modules and enhancing feature representation through Conv-Blocks, it leads to improved segmentation masks. NNDA-UNETR, Not Another Dual Attention UNet Transformer; 3D, three-dimension; Conv, convolution.

Backbone

In our hierarchical backbone, down-sampling factors of 2 are employed for all levels, with the exception of the first level that utilizes a down-sampling factor three times greater than the original input. Our NNDA-UNETR framework is built upon the UNETR (38). We selected the UNETR architecture as the backbone for our model because it was the first to integrate the transformer into the UNet framework. At the time of its publication, UNETR achieved state-of-the-art performance across multiple datasets, further validating its effectiveness. Additionally, several subsequent models, including swin transformer-based UNet (SwinUNETR) and U-Net with transformer plus plus (UNETR++), have built upon the UNETR architecture, showcasing its robustness and suitability as a strong foundation for further advancements. We replaced the original transformer modules and the depth-wise convolution modules in the skip connections of the baseline (UNETR) with our lightweight NNDA-block, specifically designed to reduce model complexity. This modification led to a significant reduction in the overall parameter count.

NNDA-block

Each NNDA-block efficiently learns enhanced spatial-channel feature representations by performing two tasks using parallel attention modules, as shown on the right side of Figure 1. The first attention module is composed of a serial combination of local and global attention mechanisms, as well as a lightweight block called Light-Conv3d. Besides, the second attention module focuses on emphasizing channel dependencies and computes channel attention maps. Finally, we employ concat fusion to combine the outputs from the two attention modules. These combined outputs are then processed through Conv-Blocks to enhance feature representations. The NNDA-block’s output is denoted as $\hat{X}$ , which encapsulates the enriched feature representation.

$\hat{X} = {Conv}_{1} ({Conv}_{3} ({Conv}_{1} (Concat ({\hat{X}}_{s}, {\hat{X}}_{c}))))$ [1]

where ${\hat{X}}_{s}$ and ${\hat{X}}_{c}$ respectively represent channel attention maps and spatial attention maps, while $Concat (\cdot)$ , ${Conv}_{1} (\cdot)$ and ${Conv}_{3} (\cdot)$ denote concatenation, 3D 1×1×1 and 3D 3×3×3 convolutional operations, respectively.

3D-multi-aixs attention

Relative self-attention has been proposed as an enhancement to the conventional attention mechanism (39-42). It introduces a learned bias that is added to the attention weights, resulting in improved performance compared to the original attention mechanism across various computer vision tasks. In this work, aligned with the methodology employed in the Max-vit framework, our approach predominantly relies that demonstrates competitive performance in terms of both model capacity and generalization ability. On the utilization of pre-normalized relative self-attention, as outlined in the existing task by Dai et al., serving as the fundamental operator in CoAtNet (39). To streamline the presentation, we illustrate our model using a single head of the multi-head self-attention. In practical implementation, we consistently employ multi-head attention with identical head dimensions. The relative attention can be defined as:

$RelAttention (Q_{s}, K_{s}, V_{s}) = σ (Q_{s}, K_{s}^{T} / \sqrt{d} + B) V_{s}$ [2]

where $Q_{s}, K_{s}, V_{s} \in R^{(H \times W \times D) \times C}$ represent the query, key, and value matrices, while $d$ and $σ$ respectively denotes the hidden dimension and softmax. Similar to the work of two-dimensional (2D) multi-axis attention (43), the 3D attention weights are also determined through a combination of a static location-aware matrix $B$ and the scaled input-adaptive attention $Q_{s} K_{s}^{T} / \sqrt{d}$ . Moreover, the relative position bias $B$ , which accounts for the disparities in 3D coordinates, is parameterized by a matrix $\hat{B} \in R^{(2 H - 1) \times (2 W - 1) \times (2 D - 1)}$ . Within our spatial Attention module, all attention operators utilize this default relative attention as defined in Eq. [2].

Spatial attention

To simplify the high complexity of dense attention on 3D medical images into a linear form, the 3D-Max-SA method decomposes this dense attention into two sparse forms, namely local and global, as shown in Figure 2.

Figure 2 This image showcases our spatial attention module. The method based on local and global attention is primarily composed of three components: the attention block that considers local texture information, the attention grid that models long-range dependencies, and the lightweight Light-Conv3d part that aims to progressively process input features and ultimately provide more informative and representative output features. These components are serially combined to form our approach. Layer norm, layer normalization; GELU, Gaussian Error Linear Unit; Conv, convolution.

Block attention

Given a standard input feature map $X \in R^{(H \times W \times D) \times C}$ , instead of employing attention mechanisms on the flattened spatial dimension $H W D$ , we strategically partition the features into a tensor with the shape of $(\frac{H}{P} \times \frac{W}{P} \times \frac{D}{P}, P \times P \times P, C)$ symbolizing non-overlapping windows of size $P \times P \times P$ . By leveraging this methodology, we effectively mitigate the computational complexity, thereby enhancing computational efficiency. Applying self-attention on the local spatial dimension, specifically $P \times P \times P$ , is equivalent to focusing on a small window within the given medical image.

To begin with, we define the $B P (\cdot)$ operator as partitioning the input feature maps $x \in R^{(H \times W \times D) \times C}$ into blocks of size $\frac{H}{P} \times \frac{W}{P} \times \frac{D}{P}$ , where each block has dimensions of $P \times P \times P$ .

$B P : (H, W, D, C) \to (\frac{H}{P} \times P, \frac{W}{P} \times P, \frac{D}{P} \times P, C) \to (\frac{H W D}{P^{3}}, P^{3}, C)$ [3]

where $B P (\cdot)$ stands for block partition, we also denote $B R (\cdot)$ operation as the inverse of the aforementioned $B P (\cdot)$ procedure.

Grid attention

While local attention models offer the advantage of bypassing computationally intensive full self-attention methods, on large-scale datasets, they have been observed to exhibit underfitting issues (41,44). Consistent with the approach adopted by Max-vit, we employ a straightforward and effective method to obtain sparse attention, facilitating global characterization. This method is called grid attention. Unlike the local attention mechanism, which partitions the feature map with a fixed window size, our approach employs a fixed $G \times G \times G$ uniform grid to divide the tensor into a $(G \times G \times G, \frac{H}{G} \times \frac{W}{G} \times \frac{D}{G}, C)$ shape, which leads to windows with an adaptive size of $\frac{H}{G} \times \frac{W}{G} \times \frac{D}{G}$ . Similarly, the $G P (\cdot)$ operation with parameter $G$ is defined as dividing the input feature into a uniform $G \times G \times G$ grid, where each grid has an adaptive size of $\frac{H}{G} \times \frac{W}{G} \times \frac{D}{G}$ :

$G P : (H, W, D, C) \to (G \times \frac{H}{G}, G \times \frac{W}{G}, G \times \frac{D}{G}, C) \to (G^{3}, \frac{H W D}{G^{3}}, C) \to (\frac{H W D}{G^{3}}, G^{3}, C)$ [4]

where $G P (\cdot)$ stands for grid partition, we also denote $G R (\cdot)$ operation as the inverse of the aforementioned $G P (\cdot)$ procedure to reverse the gridded input back to the normal 3D feature space.

To this end, we are prepared to offer an elucidation for this spatial module. Given a standard input tensor $x \in R^{(H \times W \times D) \times C}$ , the above local feature-symbolizing block attention can be formulated as:

$x \leftarrow x + RelAttention (LN (B P (x)))$ [5]

$x \leftarrow B R (x + Light-Conv3d (LN (x)))$ [6]

while the above dilated, grid-attention module can be represented as:

$x \leftarrow x + RelAttention (LN (G P (x)))$ [7]

$x \leftarrow G R (x + Light-Conv3d (LN (x)))$ [8]

To simplify the explanation, we have omitted the $Q_{s}, K_{s}, V_{s}$ input format in the RelAttention operation. LN denotes the layer normalization, we substitute the conventional multi-layer perceptron (MLP) layer with a Light-Conv3d comprising two 1×1×1 convolutions separated by a Gaussian Error Linear Unit (GELU) activation function. This strategic modification achieves dual objectives: reduction of model parameter count and elimination of the necessity to search for the optimal dropout rate in MLP layers. One significant reason for substituting the MLP layer with Light-Conv3d is to avoid the inherent complexity associated with MLPs, which typically include fully connected layers and dropout operations. While dropout is not the only regularization technique used in fully connected layers, it is the most common. Searching for the optimal dropout rate can be an additional burden, requiring considerable effort to fine-tune this hyper-parameter. By replacing the fully connected layers with standard convolutions, we eliminate the need for dropout rate optimization. Additionally, incorporating the GELU activation function enhances the model’s non-linear modeling capability and provides some regularization.

Channel attention

In order to enhance our model’s computational efficiency and effectively address channel dependencies, we have incorporated the channel attention module sourced from UNETR++ into our framework. We adopted the channel attention calculation method from their work. In their methodology, the same set of Q and K matrices is shared for both spatial and channel attention mechanisms. However, we believe this shared approach lacks interpretability. Therefore, in our NNDA-block, we use separate Q and K matrices for spatial and channel attention. This distinction allows us to more effectively capture inter-channel dependencies while maintaining clarity in the attention mechanisms. This decision was made by carefully considering the trade-off between interpretability and efficiency associated with shared operations. Simultaneously, we have managed to maintain a low parameter count while ensuring high computational efficiency.

Given a standard input tensor $x$ of shape $H W D \times C$ , $Q_{c}, K_{c}, V_{c}$ can be computed using three linear layers, yielding $Q_{c} = W^{Q} x$ , $K_{c} = W^{K} x$ and $V_{c} = W^{V} x$ , where the weights $W^{Q}$ , $W^{K}$ and $W^{V}$ are used for projection corresponding to $Q_{c}$ , $K_{c}$ and $V_{c}$ respectively. The channel attention is defined as follows:

${\hat{x}}_{c} = V_{c} \cdot σ (\frac{Q_{c}^{T} K_{c}}{\sqrt{d}})$ [9]

where $σ$ and d respectively represent the softmax operation and the size of each vector, and $Q_{c}$ , $K_{c}$ and $V_{c}$ denote keys, queries and channel value layer, respectively.

Loss function

In order to harness the advantages of both the dice loss (45) and the cross-entropy loss, our loss function is formulated as the summation of these two complementary loss functions. This combined loss function is designed to simultaneously leverage the benefits provided by each individual loss. The specific definition of the loss function is as follows:

$L (G, P) = 1 - \sum_{i = 1}^{I} (\frac{2 * \sum_{v = 1}^{V} G_{v, i} \cdot P_{v, i}}{\sum_{v = 1}^{V} G_{v, i}^{2} + \sum_{v = 1}^{V} P_{v, i}^{2}} + \sum_{v = 1}^{V} G_{v, i} \log P_{v, i})$ [10]

where $V$ represents the voxel count; $I$ indicates the class count; $G_{v, i}$ and $P_{v, i}$ represent the ground truth (GT) and the predicted probabilities for class $i$ at each voxel $v$ , respectively.

Results

Experimental setup

We conducted experiments on two datasets, namely Beyond the Cranial Vault (BTCV) (46) and Automatic Cardiac Diagnosis Challenge (ACDC) (47). Datasets: BTCV dataset includes abdominal CT scans from 30 subjects, with annotations of 13 organs performed by interpreters under the supervision of clinical radiologists at Vanderbilt University Medical Center, highlighting abdominal CT scans (see Table 2 for details). The CT scans are performed in the portal venous contrast-enhanced phase, typically consisting of 80 to 225 slices with 512×512 pixels and slice thickness ranging from 1 to 6 mm. Our evaluation focuses on reporting Dice similarity coefficient (DSC). ACDC dataset includes MRI images from 100 patients, annotated for segmentation of the right ventricle (RV) cavity, left ventricle (LV) cavity, and myocardium (Myo). The data partitioning strategy we adopt is based on UNETR++ and nnFormer. Our assessment involves reporting the DSC and Hausdorff distance at 95% (HD95). In Table 2, we present the experimental network configurations for the BTCV dataset and the ACDC dataset. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Table 2

Pre-input data preprocessing configuration according to the characteristics of the datasets

Data processing mode	BTCV	ACDC
Target spacing (mm)	2.0×1.5×1.5	1.52×1.52×6.35
Median image size (voxel)	127×512×512	246×21×13
Crop size (voxel)	96×96×96	16×160×160
Batch size	1	4
Down-sample stride	[3,3,3], [2,2,2], [2,2,2], [2,2,2]	[1,4,4], [2,2,2], [2,2,2], [2,2,2]

Target spacing: the desired voxel spacing in millimeters for the resampled images, ensuring uniform spatial resolution across the dataset. Median image size: the median dimensions of the images in the dataset, typically given in pixels, representing the central tendency of image sizes. Crop size: the fixed dimensions used to crop or resize images, ensuring they fit a standard size for consistent processing. Batch size: the number of images processed together in a single forward and backward pass during training, affecting memory usage and training stability. Down-sampling stride: the factor by which the image resolution is reduced, controlling the level of detail retained in the down-sampled images. BTCV, Beyond the Cranial Vault; ACDC, Automatic Cardiac Diagnosis Challenge.

Evaluation metrics

We assess the effectiveness by DSC and HD95. DSC measures the level of agreement between actual voxel data and volumetric segmentation predictions. It is calculated in the following way:

$DSC (G, P) = 2 \times \frac{| G \cap P |}{| G | \cup | P |} = 2 \times \frac{G \cdot P}{G^{2} + P^{2}}$ [11]

where G represents the GT and P represents the predictions. HD95 is frequently utilized as a metric based on boundaries. It is determined as follows:

$HD 95 (G, P) = Max {d_{G P}, d_{P G}}$ [12]

where $d_{P G}$ denotes the maximum 95th percentile distance between the actual voxel data (GT) and volumetric segmentation predictions, and $d_{G P}$ represents the greatest distance at the 95th percentile between volumetric segmentation predictions and GT. To capture a comprehensive performance of all segmentation methods, we further evaluate the performance of specific models based on normalized surface dice (NSD) and mean average surface distance (MASD). NSD measures the accuracy of the segmentation surface by normalizing the distance deviations relative to the true surface. It is defined as follows:

$NSD (G, P) = \frac{| S_{G} \cap B_{P} | + | S_{P} \cap B_{G} |}{| S_{G} | + | S_{P} |}$ [13]

where $S_{G}$ and $S_{P}$ denote the boundaries of the GTs and output probabilities for all voxels, and $B_{G}$ and $B_{P}$ denote the border regions for the GTs and output probabilities for all voxels respectively. MASD calculates the average absolute distance between the segmentation results and the true surface. It is defined as follows:

$MASD (G, P) = \frac{1}{2} (\frac{\sum_{g \in G} d (g, P)}{| G |} + \frac{\sum_{p \in P} d (G, p)}{| P |})$ [14]

where d (g, P) denotes the distance between a point g of the GT and the predicted voxels P, and d (G, p) denotes the distance between the GT voxels G and a predicted point p.

Implementation details

We initialized the weights of the convolutional and linear layers using a truncated normal distribution with a standard deviation of 0.02. Bias terms, where applicable, were initialized to zero. For the Layer Norm layers, the bias and weight terms were initialized to zero and one, respectively. Additionally, we did not use any pretrained encoders, such as models pretrained on ImageNet.

BTCV

Our approach was implemented using PyTorch version 1.13.0. To ensure a fair comparison with other models, we adopted the same preprocessing strategy and input size, without utilizing any additional data for training. Data augmentation of random flip, rotation and intensities shifting are used for training, with probabilities of 0.1, 0.1, and 0.5, respectively. The models were trained on a single GeForce RTX 2080-Ti 11GB GPU, with an input size of 96×96×96. The training process consisted of 5,000 epochs, with an initial learning rate of 1e−4, utilizing the Lion optimizer as described in Chen et al. (1). To enable a fair comparison with the benchmark methods, we selected 5,000 epochs as the training duration for all approaches. Furthermore, we did not employ an early stopping mechanism, but rather compared the performance of all methods at the checkpoint that exhibited the best results during the training process. Similarly, the ACDC dataset follows the same rules.

ACDC

All experiments were conducted on Ubuntu 22.04, utilizing Python 3.6 and PyTorch 1.8.1. The fair comparisons were performed on a GeForce RTX 2080-Ti 11GB GPU, with an input size of 16×160×160. The initial learning rate was initialized to 1e−2, following which a “poly” decay strategy presented in Eq. [15] was employed. The default optimizer utilized in this study was stochastic gradient descent (SGD), incorporating a momentum of 0.99. Furthermore, a weight decay of 1e−4 was applied to regularize the model during training. Augmentations such as rotation, scaling, gaussian noise, gaussian blur, brightness and contrast adjust, simulation of low resolution, gamma augmentation and mirroring are applied in the given order during the training process. The training regimen was designed to span 1,000 epochs, with each epoch comprising 250 iterations.

$l r = i n i t i a l_l r \times 1 - {(\frac{e p o c h_i d}{\max_e p o c h})}^{0.9}$ [15]

Visualization tool

ITK-SNAP is an open-source software application designed for medical image navigation and segmentation. It offers robust tools for 3D visualization, allowing precise rendering and analysis of medical images. Its user-friendly interface and advanced features make it an essential tool for researchers and clinicians working with complex anatomical structures. ITK-SNAP is commonly used for quantitative analysis in the field of medical image segmentation, aiding in the observation and assessment of segmentation outcomes. We used ITK-SNAP as a qualitative visualization tool.

Comparison of model performance and efficiency with baseline

As shown in Table 3, incorporating the proposed enhancements into our baseline model demonstrates their impact on the BTCV dataset. In addition to reporting the DSC, we also provide details on the model’s parameters and floating point operations per second (FLOPs). The DSC performance is reported for a single model across all cases. Similar to most U-shaped architectures, our model follows a hierarchical structure. In our hierarchical structure, down-sampling factors of 2 are employed for all levels, with the exception of the first level that utilizes a down-sampling factor three times greater than the original input. In addition, Introducing the NNDA-block within our NNDA-UNETR encoders and decoders yields a substantial enhancement in performance, exemplified by an absolute gain of 4.04% and 4.53% in DSC compared to the baseline. The integration of the NNDA-block in encoder part and decoder part further enhances the performance. Our final NNDA-UNETR design, which incorporates a hierarchical structure and introduces the innovative NNDA-block in both encoder parts and decoder parts, achieves a remarkable improvement of 8.13% in DSC. Additionally, a substantial reduction in model complexity was achieved with a decrease of 34.27% in FLOPs and 69.54% in parameters in comparison to UNETR. As illustrated in Figure 3, a qualitative comparison was presented between UNETR and our model on BTCV. In all rows, the baseline model encounters difficulties in accurately delineating the inferior vena cava (IVC) during the segmentation process. In row 1, it becomes unclear when distinguishing adjacent instances of the same two organs, whereas it over-segments the esophagus between the spleen, stomach, and aorta. In rows 2 and 4, it under-segments stomach. In row 3, the baseline mistakenly subdivided a portion of the spleen as the liver. Besides, NNDA-UNETR achieves competitive segmentation performance while maintaining minimal model complexity, as shown in Figure 4.

Table 3

Baseline comparison on BTCV dataset

Methods	Params (M)	FLOPs (G)	DSC (%)
Baseline	92.79	73.68	76.00
+ NNDA in encoder	20.32	22.3	80.04
+ NNDA in decoder	8.67	33.77	80.53
+ NNDA in encoder & decoder (NNDA-UNETR)	28.26	48.43	84.13

The results are displayed based on segmentation performance, measured by DSC, parameters and FLOPs. By incorporating the NNDA-block into both the encoder parts and decoders parts of the hierarchical architecture, we observed an improvement in segmentation performance, achieving a DSC of 80.04% and 80.53%. To further enhance the results, we also introduced the NNDA-block in encoders and decoders, leading to a DSC of 84.13%. Our enhanced NNDA-UNETR architecture, featuring the NNDA-block in both the encoder parts and decoder parts, yields an impressive absolute gain of 8.13% in DSC. Moreover, this improvement is achieved while effectively reducing the model complexity. BTCV, Beyond the Cranial Vault; FLOPs, floating point operations per second; DSC, Dice similarity coefficient; NNDA-UNETR, Not Another Dual Attention UNet Transformer.

Figure 3 Our main goal is to visually compare NNDA-UNETR and UNETR in the context of the BTCV. In addition to displaying segmentation results, ground truth masks will also be included to enrich the comparison and there are some representative cases demonstrating that the proposed method outperforms other methods in segmentation performance. The white boxes represent under-segmented regions, while the yellow box denotes over-segmented region. NNDA-UNETR, Not Another Dual Attention UNet Transformer; Spl, spleen; RKid, right kidney; LKid, left kidney; Gal, gallbladder; Eso, esophagus; Liv, liver; Sto, stomach; Aor, aorta; IVC, inferior vena cava; PSV, portal and splenic veins; Pan, pancreas; RAG, right adrenal gland; LAG, left adrenal gland; BTCV, Beyond the Cranial Vault.

Figure 4 Qualitatively and quantitatively evaluate a multi-organ segmentation task. Left: we will now provide a comparative analysis between the SOTA UNETR++ (20) and the NNDA-UNETR on BTCV, using two cases that involve 13 organs. The white boxes represent under-segmented regions, while the yellow box denotes over-segmented region. In the first row, we can observe that UNETR++ have difficulty in accurately segmenting the Pan and Sto and it performed excessive segmentation on the ROI between the IVC and Liv in the RAG. In the second row, we can observe that UNETR++ have difficulty in accurately segmenting the small Eso that tightly adheres to the Aor, and also exhibit under-segmentation of the Gal that is fused with the Liv. In comparison, our NNDA-UNETR effectively captures and encodes enhanced local-global based spatial and channel features through the proposed NNDA-block, resulting in accurate segmentation of all organs in these examples. It is recommended to view the examples in a zoomed-in mode for better visualization. Right: comparison of model complexity to DSC. NNDA-UNETR improves segmentation accuracy with fewer parameters and lower computational requirements. UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer; SwinUNETR, swin transformer-based UNet; nnFormer, not another transformer; RKid, right kidney; LKid, left kidney; Gal, gallbladder; Eso, esophagus; Liv, liver; Sto, stomach; Aor, aorta; IVC, inferior vena cava; PSV, portal and splenic veins; Pan, pancreas; RAG, right adrenal gland; LAG, left adrenal gland; FLOPs, floating point operations per second; SOTA, state of the art; BTCV, Beyond the Cranial Vault; ROI, region of interest; IVC, inferior vena cava; DSC, Dice similarity coefficient.

Comparison of model performance and efficiency with state-of-the-art methods

BTCV dataset

Table 4 presents the results obtained for BTCV. Segmentation performance of all 13 abdominal organs is evaluated using the DSC. Without utilizing additional training data, model-ensemble, or pretraining, we emphasize that the results on the BTCV dataset are reported based solely on the accuracy of a single model. Among the existing hybrid models, the baseline model (38) and its upgraded counterpart, SwinUNETR (35), achieve DSC accuracies of 76.00% and 80.44%, respectively. Additionally, other recent models such as segmenting images efficiently with transformers using naive upsampling (SETR NUP) (48) with a DSC of 79.6%, segmenting images efficiently with transformers using progressive upsampling (SETR PUP) (48) with a DSC of 79.7%, segmenting images efficiently with transformers using multi-level feature aggregation (SETR MLA) (48) with a DSC of 79.6%, atrous spatial pyramid pooling (ASPP) (49) with a DSC of 81.1%, and transformer-based brain tumor segmentation (TransBTS) (50) with a DSC of 81.31% have also demonstrated competitive performance. Notably, no-new-UNet (nnUNet) and UNETR++ achieves superior performance compared to other existing works on this dataset. However, our proposed NNDA-UNETR surpasses them, achieving a DSC of 84.13%.

Table 4

Comprehensive comparison of SOTA methods for 13 organs segmentation on BTCV

Methods	Spl (%)	RKid (%)	LKid (%)	Gal (%)	Eso (%)	Liv (%)	Sto (%)	Aor (%)	IVC (%)	PSV (%)	Pan (%)	RAG (%)	LAG (%)	Mean (%)↑
SETR NUP (48)	93.1	89.0	89.7	65.2	76.0	95.2	80.9	86.7	74.5	71.7	71.9	–	–	79.6
SETR PUP (48)	92.9	89.3	89.2	64.9	76.4	95.4	82.2	86.9	74.2	71.5	71.4	–	–	79.7
SETR MLA (48)	93.0	88.9	89.4	65.0	76.2	95.3	81.9	87.2	73.9	72.0	71.6	–	–	79.6
ASPP (49)	93.5	89.2	91.4	68.9	76.0	95.3	81.2	91.8	80.7	69.5	72.0	–	–	81.1
TransBTS (50)	94.55	89.20	90.97	68.38	75.61	96.44	83.52	88.55	82.48	74.21	76.02	67.23	67.03	81.31
UNETR (38)	90.48	82.51	86.05	58.23	71.21	94.64	72.06	86.57	76.51	70.37	66.06	66.25	63.04	76.0
SwinUNETR (35)	94.59	88.97	92.39	65.37	75.43	95.61	75.57	88.28	81.61	76.30	74.52	68.23	66.02	80.44
nnFormer (8)	94.58	88.62	93.68	65.29	76.22	96.17	83.59	89.09	80.80	75.97	77.87	70.20	66.05	81.62
nnUNet (18)	95.95	88.35	93.02	70.13	76.72	96.51	86.79	88.93	82.89	78.51	79.60	73.26	68.35	83.16
UNETR++ (20)	94.94	91.90	93.62	70.75	77.18	95.95	85.15	89.28	83.14	76.91	77.42	72.56	68.17	83.28
NNDA-UNETR	95.91	94.44	94.34	66.64	78.46	96.85	90.37	88.52	85.57	76.56	84.29	71.93	69.80	84.13

We present a comprehensive comparison of SOTA methods for 13 organs segmentation on BTCV, which includes large organs such as Liv, Spl, LKid, RKid, and Sto; Eso, Aor, IVC, PSV; Gal, LAG, RAG, and Pan. Without utilizing additional training data, model-ensemble, or pretraining, we emphasize that these results are reported based solely on the accuracy of a single model. Our NNDA-UNETR demonstrates exceptional segmentation accuracy, outperforming existing 3D image segmentation techniques. The table includes various organs. The arrow “↑” indicates that a higher value is better for the corresponding metric. SOTA, state of the art; BTCV, Beyond the Cranial Vault; Spl, spleen; RKid, right kidney; LKid, left kidney; Gal, gallbladder; Eso, esophagus; Liv, liver; Sto, stomach; Aor, aorta; IVC, inferior vena cava; PSV, portal and splenic veins; Pan, pancreas; RAG, right adrenal gland; LAG, left adrenal gland; SETR NUP, segmenting images efficiently with transformers using naive upsampling; SETR PUP, segmenting images efficiently with transformers using progressive upsampling; SETR MLA, segmenting images efficiently with transformers using multi-level feature aggregation; ASPP, atrous spatial pyramid pooling; TransBTS, transformer-based brain tumor segmentation; SwinUNETR, swin transformer-based UNet; nnFormer, not another transformer; nnUNet, no-new-UNet; UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer.

Critically, Table 5 demonstrates that NNDA-UNETR achieves improved segmentation performance while significantly reducing the number of parameters and maintaining relatively low FLOPs. Furthermore, our average DSC is not attained through a substantial lead in one or two organs, but rather by securing the first position among all eight abdominal organs, including left kidney, right kidney, liver, esophagus, stomach, IVC, pancreas, and left adrenal gland, and achieving the second position in one organ, spleen. Besides, we conducted paired t-tests to compare the performance of our method with several existing approaches on the BTCV dataset. The results indicate that our method significantly outperforms the state of the art (SOTA) methods. Moreover, our method achieves the lowest HD95 value of 17.76 mm with a standard deviation of 11.92, significantly outperforming the other compared methods. A lower HD95 value indicates that our model’s predictions are closer to the actual organ boundaries, demonstrating superior precision in capturing complex organ shapes. The low standard deviation further reinforces this, as our method not only achieves the smallest average HD95 but also maintains relatively low variance across different test cases, highlighting its consistency. This consistency underscores the robustness of our approach in handling edge cases where accurate boundary detection is critical. To quantitatively substantiate the superiority of our method, we also conducted paired t-tests to compare the HD95 between our method and others, with significance levels indicated by *: P<0.05, **: P<0.01, ***: P<0.001. These results, combined with the HD95 analysis, clearly demonstrate that NNDA-UNETR outperforms other methods in terms of sensitivity to organ edges, providing a strong validation of our approach. Building on the previously discussed DSC and HD95 metrics, which have already highlighted the superior segmentation accuracy and edge sensitivity of NNDA-UNETR, the additional analysis of MASD and NSD further underscores the model’s strengths. Specifically, our model achieves the lowest MASD value of 3.46 mm, demonstrating its exceptional precision in delineating organ boundaries compared to other methods.

Table 5

Model complexity comparison on BTCV dataset

Methods	Params (M)↓	FLOPs (G)↓	DSC↑ (%), mean ± SD	NSD↑ (%), mean ± SD	MASD↓ (mm), mean ± SD	HD95 (mm), mean ± SD
UNETR	92.79	73.68	76.00±0.02***	84.71±0.04***	3.60±0.89***	27.08±10.83***
SwinUNETR	62.83	394.84	80.44±0.03**	85.08±0.04**	3.48±0.92	19.90±10.50**
UNETR++	42.95	44.34	83.28±0.02***	85.87±0.04**	3.53±0.93**	22.46±14.40**
nnFormer	149.32	178.49	81.62±0.03**	84.55±0.04***	3.51±1.13***	21.35±10.52**
NNDA-UNETR	28.26	48.43	84.13±0.02	86.96±0.03	3.46±0.92	17.76±11.92

We conducted a paired t-test to investigate the disparities in DSC between our method and other methods (**, P<0.01; ***: P<0.001). Our proposed NNDA-UNETR demonstrates positive segmentation performance compared to current models, while significantly decreasing the quantity of parameters and maintaining a relatively low FLOPs. “↑”, a higher value is better; “↓”, a lower value is better. BTCV, Beyond the Cranial Vault; FLOPs, floating point operations per second; DSC, Dice similarity coefficient; SD, standard deviation; NSD, normalized surface dice; MASD, mean average surface distance; HD95, Hausdorff distance at 95%; SwinUNETR, swin transformer-based UNet; UNETR++, UNet with transformer plus plus; nnFormer, not another transformer; NNDA-UNETR, Not Another Dual Attention UNet Transformer.

Moreover, with an NSD value of 86.96%, NNDA-UNETR exhibits the highest capability in accurately capturing the detailed surfaces of segmented regions. These findings align with the earlier metrics, indicating that NNDA-UNETR not only excels in overall segmentation quality but also in maintaining precise boundary details, thanks to the innovative NNDA-block. This consistent performance across multiple metrics establishes NNDA-UNETR as a more robust and reliable model for medical image segmentation (see Table 5 for details).

Figure 5 presents a qualitative comparison between the existing models and our method on the BTCV. Here, the white boxes represent under-segmented regions, while the yellow boxes denote over-segmented regions. At the top of the image, it is evident that current methods encounter challenges in accurately segmenting the small esophagus tightly adhering to the aorta. They also demonstrate under-segmentation of the gallbladder fused with the liver. In comparison, our method accurately segments the esophagus and gallbladder. Furthermore, even advanced methods like UNETR++ still show shortcomings in segmenting the stomach region in the second row. We have achieved substantial performance improvements in stomach segmentation. Particularly concerning the stomach, our NNDA-UNETR has shown a superiority of 3.58% in terms of the DSC compared to the second-ranked nnUNet. This may be attributed to the strong adaptability of our model architecture towards tube-like organs with significant texture and contrast variations. Moreover, in the third row, current approaches exhibit a lack of sensitivity to tubular structures, resulting in erroneous segmentation of the esophagus. Additionally, they excessively segment the right adrenal gland (RAG), mistakenly identifying certain background regions between the IVC and the liver as part of the RAG. In the fourth row, our NNDA-UNETR method achieves relatively accurate segmentation of the portal vein and splenic veins (PSV), addressing the issue of existing methods failing to accurately segment these structures situated between the IVC and the aorta. This improvement may be attributed to our NNDA-block, which integrates local and global attention into spatial attention and further incorporates channel attention.

Figure 5 A qualitative assessment of the multi-organ segmentation task is conducted. Our proposed NNDA-UNETR is compared with current approaches, including UNETR, SwinUNETR, nnFormer, and UNETR++. It is observed that the existing methods encounter difficulties in accurately segmenting various organs, as indicated by the organs highlighted within the white and yellow dashed boxes. The white boxes represent under-segmented regions, while the yellow boxes denote over-segmented regions. In contrast, our NNDA-UNETR demonstrates promising segmentation performance by precisely delineating the organs. We recommend viewing the image in zoom for a more detailed analysis. Derived from CT imaging, showing a 3D visualization of the segmentation results. Representative case demonstrating that the proposed method outperforms other methods in segmentation performance. SwinUNETR, swin transformer-based UNet; nnFormer, not another transformer; UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer; RKid, right kidney; LKid, left kidney; Gal, gallbladder; Eso, esophagus; Liv, liver; Sto, stomach; Aor, aorta; IVC, inferior vena cava; PSV, portal and splenic veins; Pan, pancreas; RAG, right adrenal gland; LAG, left adrenal gland; CT, computed tomography; 3D, three-dimension.

ACDC dataset

Table 6 illustrates the comparison on ACDC, with UNETR, nnFormer, and UNETR++ attaining mean DSC values of 86.61%, 92.06%, and 92.83%. Our method has achieved competitive results, achieving a DSC of 92.73%. Similarly, in Table 7, we assess our approach with the two top-performing approaches on ACDC in terms of model complexity and the HD95 metric. We can observe that the reduction in FLOPs and parameter count compared to UNETR++ is significant, while the higher parameter count compared to nnFormer is largely due to our method’s utilization of an input patch size of 16×160×160, whereas nnFormer uses 14×160×160. However, our model outperforms nnFormer significantly, especially in terms of DSC and HD95 metrics. Besides, when evaluating boundary segmentation accuracy using the HD95 metric, our method demonstrates a clear absolute advantage in average HD95 on the test set. Besides, we performed paired t-tests comparing the performance of our method with two SOTA approaches on the ACDC dataset. Regarding the DSC, our method exhibited significant superiority over nnformer, while there was no significant difference compared to the best-performing UNETR++. Additionally, in terms of the HD95, our method significantly outperformed both top-performing methods. These findings suggest that our approach not only focuses on segmenting the overall organ regions but also prioritizes delineating organ boundaries (see Table 7 for details).

Table 6

The results for Myo, RV cavity, and LV cavity on ACDC dataset, including mean DSC

Methods	RV (%)	Myo (%)	LV (%)	Mean DSC (%)↑
Swin-UNet	88.55	85.62	95.83	90.00
UNETR	85.29	86.52	94.02	86.61
MISSFomer	86.36	85.75	91.59	87.90
nnFormer	90.94	89.58	95.65	92.06
UNETR++	91.89	90.61	96.00	92.83
NNDA-UNETR	91.78	90.41	96.02	92.74

“↑”, a higher value is better. A larger value for this metric represents superior performance. Myo, myocardium; RV, right ventricle; LV, left ventricle; ACDC, Automatic Cardiac Diagnosis Challenge; DSC, Dice similarity coefficient; MISSFormer, medical image segmentation with swin transformer; nnFormer, not another transformer; UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer.

Table 7

The segmentation results and model complexity comparison on ACDC dataset

Methods	Params (M)	FLOPs (G)	DSC (%), mean ± SD	HD95↓ (mm), mean ± SD
nnFormer	37.16	47.73	92.06±0.03**	1.12±2.83*
UNETR++	66.8	43.74	92.83±0.03	1.18±0.39***
NNDA-UNETR	44.03	26.77	92.73±0.02	1.07±0.17

“↓”, a lower value is better. A smaller value for this metric represents superior performance. *, P<0.05; **, P<0.01; ***, P<0.001. ACDC, Automatic Cardiac Diagnosis Challenge; FLOPs, floating point operations per second; DSC, Dice similarity coefficient; SD, standard deviation; HD95, Hausdorff distance at 95%; nnFormer, not another transformer; UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer.

Figure 6 provides a qualitative contrast between the current methodologies and NNDA-UNETR on ACDC. In the first row, the current methods have over-segmented the RV. Moreover, in the second and third rows, both UNETR++ and nnFormer encounter scenarios where accurate segmentation of the right ventricular cavity is unattainable. Conversely, our approach yields segmentation outcomes that closely align with GT when evaluated visually. The fourth row presents a particularly challenging segmentation example. Although there are still some discrepancies between our method and the GT in the segmentation of the right ventricular cavity, our method achieves better segmentation performance compared to the widespread under-segmentation issues observed in nnFormer and UNETR++.

Figure 6 We aim to visually compare the capability of NNDA-UNETR, UNETR++, and nnFormer in the context of ACDC for automatic cardiac diagnosis. Along with showcasing segmentation results, we will provide ground truth masks for an improved comparative analysis. This image, derived from MRI, shows the axial plane and there are some representative cases demonstrating that the proposed method outperforms other methods in segmentation performance. The white boxes represent under-segmented regions. nnFormer, not another transformer; UNETR++, UNet with transformer plus plus; NNDA-UNETR, Not Another Dual Attention UNet Transformer; ACDC, Automatic Cardiac Diagnosis Challenge; MRI, magnetic resonance imaging.

Discussion

Ablation sudy

Table 8 illustrates the outcomes of our meticulous ablation study on various modules within the NNDA-UNETR framework that align with the domain of medical image segmentation. In the interest of clarity, experiments were carried out on the BTCV dataset, with the DSC serving as the default assessment metric. The methodologies outlined in Table 8 correspond to the individual dissections of specific modules integrated into our NNDA-UNETR architecture. Initially, we replaced the Light-Conv3d module, which was originally integrated into the spatial attention mechanism of NNDA-UNETR, with a conventional MLP layer, resulting in Method 1. The findings exhibited a marked decline in DSC scores, coupled with a noticeable escalation in FLOP. This observation can be attributed to dual underlying factors. Primarily, the conventional MLP layer often mandates an optimal dropout rate for suitability. Secondly, our efficient convolutional module is well-suited for our spatial attention mechanism, which processes input features in a layered manner. This approach ensures that each layer’s features are optimized to provide the most informative and representative outputs, ultimately enhancing the model’s segmentation performance. Remarkably, employing either global attention or local attention independently (Method 2 or Method 3) led to superior performance compared to UNETR++ (83.28%). Furthermore, we scrutinized the significance of the NNDA-block within both the encoder and decoder (Method 4 and Method 5). Despite a notable decrease in model parameters and complexity, there was a discernible reduction in the DSC metric. Nevertheless, a substantial enhancement in performance was achieved when juxtaposed with the base UNETR (76.00%). Furthermore, excluding the spatial attention module throughout the entire network (Method 6) resulted in a significant deterioration in performance. Additionally, mirroring the architecture of nnFormer, we deployed local attention in shallower layers and global attention in deeper stages where semantic information predominates (Method 7).

Table 8

Exploration of the influence of various modules employed in NNDA-UNETR

No.	Methods	Params (M)↓	FLOPs (G)↓	DSC (%)↑
1	w/o Light-Conv3d	28.26	52.80	81.27
2	w/o global attention	28.26	45.14	83.86
3	w/o local attention	28.26	45.14	83.78
4	w/o NNDA-block in decoder	20.32	22.3	80.04
5	w/o NNDA-block in encoder	8.67	33.77	80.53
6	w/o spatial attention in encoder & decoder	27.94	41.49	82.52
7	w/local attention in stage 1&2 w/global attention in stage 3&4	28.26	45.14	82.90
8	w/NNDA-block in stage 4 decoder part	37.23	68.48	84.14
9	NNDA-UNETR	28.26	48.43	84.13

“↑”, a higher value is better; “↓”, a lower value is better. NNDA-UNETR, Not Another Dual Attention UNet Transformer; FLOPs, floating point operations per second; DSC, Dice similarity coefficient; w/o, without.

From a macro perspective, the NNDA-block comprises several key components: global attention, local attention, Light-Conv3d, and channel attention. We have performed ablation studies on each of these components, analyzing their individual contributions to the overall performance of the model. These components work together to enhance the model’s segmentation performance. This strategy yielded a performance accuracy of 82.9%, surpassing that of nnFormer at 81.62%. The results of Method 8 show that adding the NNDA-block at this stage improves performance by only 0.1%, while significantly increasing the model’s parameter count by 31.74% and the computational complexity [giga floating-point operations per second (GFLOPs)] by 41.40%. The enhancement in performance of our approach from a lateral standpoint has been substantiated to result not solely from the modification of architecture, but also from the benefits derived from spatial attention.

Lastly, to delve into the dimension head predicament at each stage of multi-head spatial attention, supplementary experiments were conducted, as delineated in Table 9. The results from Table 9 reveal that among the five experiments conducted, the configuration with dimension values [4,8,16,32] achieved the highest DSC of 84.13%, indicating that this setup optimizes the model’s performance. This suggests that a strategic and balanced distribution of dimension heads across different stages of the multi-head spatial attention mechanism is crucial for capturing spatial dependencies at multiple scales. Configurations with more uniform or imbalanced dimensions, such as [16,16,16,16] or [32,32,32,32], underperform, likely due to their inability to specialize in feature extraction at various stages. Therefore, the findings highlight the importance of carefully selecting dimension head configurations to enhance the model’s capability to process features at different levels of detail, which is essential for achieving accurate segmentation in complex medical images.

Table 9

Investigating the impact of the number of attention heads and the dimension per head on the performance of the NNDA-UNETR, based on the BTCV dataset

No.	The number of heads	Dimension head	DSC (%)↑
1	[8,8,16,32]	[4,8,8,8]	83.22
2	[1,2,4,8]	[32,32,32,32]	83.18
3	[4,4,4,4]	[8,16,32,64]	83.37
4	[2,4,8,16]	[16,16,16,16]	83.02
5	[4,8,16,32]	[8,8,8,8]	84.13

“↑”, a higher value is better. A larger value for this metric represents superior performance. NNDA-UNETR, Not Another Dual Attention UNet Transformer; BTCV, Beyond the Cranial Vault; DSC, Dice similarity coefficient.

In this study, we present NNDA-UNETR, a novel approach for multi-organ segmentation in abdominal CT and cardiac MRI images. Our segmentation experiments demonstrate the model’s high accuracy, surpassing existing methods and achieving SOTA results on the BTCV dataset. Moreover, on the ACDC dataset, NNDA-UNETR exhibits improved sensitivity to organ edges, enhancing segmentation outcomes. Additionally, our aim is to address the lack of lightweight design in current segmentation tasks, crucial for real-world medical applications with limited computational resources.

Analysis of hyper-parameter

According to Figure 7, it is evident that the Lion optimizer consistently yielded superior results across most metrics, particularly at lower learning rates on BTCV dataset. Subsequently, the optimizer was set to Lion, with variations in the learning rate and weight decay. Figure 8 illustrates these results, highlighting that a learning rate of 1e−4 with a weight decay of 1e−4.

Figure 7 Selection of optimizer. The x-axis represents the types of optimizers, while the y-axis displays the values of the four metrics. The bars are color-coded to indicate different learning rates. LR, learning rate; DSC, Dice similarity coefficient; SGD, stochastic gradient descent; HD95, Hausdorff distance at 95%; NSD, normalized surface dice; MASD, mean average surface distance.

Figure 8 Impact of learning rate and weight decay. The x-axis represents the learning rate sizes, and the y-axis displays the values of the four metrics. The bars are color-coded to indicate different weight decay values. WD, weight decay; DSC, Dice similarity coefficient; HD95, Hausdorff distance at 95%; MASD, mean average surface distance; NSD, normalized surface dice.

Figure 8 provided the optimal balance across all metrics on BTCV dataset. Therefore, we set the learning rate to 1e−4 and the weight decay to 1e−4 for all our experiments.

Sensitivity analysis of learning rates and optimizers

According to Figure 9, our sensitivity analysis of learning rates and optimizers reveals key insights:

Learning rate sensitivity: performance metrics, including DSC, HD95, MASD, and NSD, show varying sensitivities to learning rates. Lower learning rates (e.g., 1e−4) generally improve edge detection and surface distance accuracy, with Lion optimizer consistently performing well at this rate. This suggests a stable performance around this optimal value for Lion.
Optimizer sensitivity: the Lion optimizer demonstrates superior robustness and less sensitivity to learning rate changes compared to others. It consistently achieves high performance metrics across all tested learning rates. AdamW also performs well but exhibits more sensitivity to learning rate variations, whereas SGD shows significant performance variability. Adam offers stable performance but is less effective than Lion.

Figure 9 Sensitivity to learning rate and optimizer. DSC, Dice similarity coefficient; SGD, stochastic gradient descent; HD95, Hausdorff distance at 95%; MASD, mean average surface distance; NSD, normalized surface dice.

Overall, Lion provides the best balance and stability in performance, showing reduced sensitivity to learning rate changes, while other optimizers vary more in their sensitivity.

Limitations

Our research focuses on integrating the NNDA-block into our design, providing scalability and compatibility with diverse backbones. However, our investigation does not explore the potential performance enhancement resulting from pretraining medical images using our model, a topic for future research. Furthermore, while our validation was limited to CT and MRI modalities, consideration of other modalities for future studies could broaden the scope and applicability of our approach.

Conclusions

This paper introduces a novel hybrid architecture, NNDA-UNETR, designed specifically for medical image segmentation tasks. Notably, a versatile NNDA-block has been devised, facilitating seamless integration into U-shaped architectures of varying scales with minimal adjustments. By adeptly extracting local features and contextual dependencies from medical images at different stages, while intricately considering the interplay between channels and spatial data, the proposed NNDA-block significantly enhances performance. The effectiveness of NNDA-UNETR has been demonstrated in 3D segmentation assignments involving MRI and CT modalities.

Our methodology demonstrates competitive performance on the BTCV dataset, all the while upholding a parsimonious model structure. In essence, NNDA-UNETR showcases its proficiency in capturing crucial anatomical relationships depicted within biomedical images, specifically those from MRI and CT scans. Furthermore, the adoption of a modular design proves instrumental in streamlining the intricacies of model architecture. This strategy effectively mitigates design complexities by obviating the need for disparate operations at various scales, thereby simplifying the holistic design process within the domain of biomedical image segmentation.

Future work

Pretraining with medical images: one significant area for future work is investigating the potential performance improvements that could be achieved by pretraining our model on large-scale medical image datasets. Pretraining on domain-specific data could enable our model to capture more relevant features and improve its generalizability and accuracy across different medical imaging tasks.
Expansion to other imaging modalities: while our current validation is limited to CT and MRI modalities, extending our model to other imaging modalities such as ultrasound, positron emission tomography (PET), and X-ray could significantly broaden its applicability. Future work could involve collecting diverse datasets encompassing these modalities and evaluating the model’s performance across them. This would not only demonstrate the versatility of our approach but also identify any modality-specific challenges and potential adaptations needed.
Real-world clinical application and robustness: to address the practical constraints of real-world clinical settings, it is crucial to test our model’s robustness and efficiency in environments with limited computing resources. Future experiments could focus on deploying our lightweight model on various hardware configurations, such as edge devices and mobile platforms, to evaluate its performance in terms of speed, memory usage, and accuracy.
Optimization techniques for model efficiency: further optimization techniques could be explored to enhance the efficiency of our model. Techniques such as quantization, pruning, and knowledge distillation could be investigated to reduce the model’s size and computational requirements without sacrificing performance. Comparative studies could be conducted to assess the trade-offs between model complexity and accuracy, providing a comprehensive understanding of the most effective methods for further pursuing lightweight yet powerful medical image segmentation models.
Expanding the NNDA-block’s applications and versatility: future work will extend the NNDA-block beyond U-Nets, exploring its integration into models like V-Net, Dense-Net, and Res-Net. We will also develop a two dimensional (2D) version for natural images, potentially inspiring its application in non-medical 3D domains, such as satellite imagery and autonomous driving, to validate its versatility across various fields.

By addressing these future directions, we can further refine our approach and extend its impact on the field of medical image segmentation. The ultimate goal is to develop a robust, efficient, and versatile model that can be seamlessly integrated into various clinical workflows, enhancing diagnostic accuracy and patient outcomes.

Acknowledgments

Funding: The work was supported by the National Natural Science Young Foundation of China (Nos. 62102242 and 62103258), the Shanghai Education Research Program (No. C2022152), Soft Science Project of Wuxi Science and Technology Association (No. KX-24-077), and Project of Wuxi Health Commission (Nos. Z202212 and M202342).

Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-833/coif). All authors report fundings from the National Natural Science Young Foundation of China (Nos. 62102242 and 62103258), the Shanghai Education Research Program (No. C2022152), Soft Science Project of Wuxi Science and Technology Association (No. KX-24-077), and Project of Wuxi Health Commission (Nos. Z202212 and M202342). Y.C. is from Tiktok Inc. The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Chen X, Liang C, Huang D, Real E, Wang K, Liu Y, Pham H, Dong X, Luong T, Hsieh CJ, Lu Y, Le QV. Symbolic discovery of optimization algorithms. In Proceedings of the 37th International Conference on Neural Information Processing Systems. December, 2023:49205-33. Available online: https://arxiv.org/pdf/2302.06675
Lim SH, Kim YJ, Park YH, Kim D, Kim KG, Lee DH. Automated pancreas segmentation and volumetry using deep neural network on computed tomography. Sci Rep 2022;12:4075. [Crossref] [PubMed]
Shi Z, Wang T, Huang Z, Xie F, Liu Z, Wang B, Xu J. MD-Net: A multi-scale dense network for retinal vessel segmentation. Biomedical Signal Processing and Control 2021;70:102977.
Li W, Zeng G, Li F, Zhao Y, Zhang H. FRBNet: Feedback reﬁnement boundary network for semantic segmentation in breast ultrasound images. Biomedical Signal Processing and Control 2023;86:105194.
Conze PH, Kavur AE, Cornec-Le Gall E, Gezer NS, Le Meur Y, Selver MA, Rousseau F. Abdominal multi-organ segmentation with cascaded convolutional and adversarial deep networks. Artif Intell Med 2021;117:102109. [Crossref] [PubMed]
Fujita S, Mori S, Onda K, Hanaoka S, Nomura Y, Nakao T, Yoshikawa T, Takao H, Hayashi N, Abe O. Characterization of Brain Volume Changes in Aging Individuals With Normal Cognition Using Serial Magnetic Resonance Imaging. JAMA Netw Open 2023;6:e2318153. [Crossref] [PubMed]
Wong KKL, Fortino G, Abbott D. Deep learning-based cardiovascular image diagnosis: A promising challenge. Future Generation Computer Systems 2020;110:802-11.
Zhou HY, Guo J, Zhang Y, Han X, Yu L, Wang L, Yu Y. nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer. IEEE Trans Image Process 2023;32:4036-45. [Crossref] [PubMed]
Oktay O, Schlemper J, Le Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning Where to Look for the Pancreas. In Proceedings of the 1st Conference on Medical Imaging with Deep Learning (MIDL 2018). Amsterdam: 2018. Available online: https://arxiv.org/pdf/1804.03999
LeeHHBaoSHuoYLandmanBA. 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. arXiv:2209.15076.
Zunair H, Ben Hamza A. Sharp U-Net: Depthwise convolutional network for biomedical image segmentation. Comput Biol Med 2021;136:104699. [Crossref] [PubMed]
ZunairHHamzaAB. Masked Supervised Learning for Semantic Segmentation. arXiv:2210.00923.
Verma R, Kumar N, Patil A, Kurian NC, Rane S, Graham S, et al. MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge. IEEE Trans Med Imaging 2021;40:3413-23. [Crossref] [PubMed]
Ghorpade H, Jagtap J, Patil S, Kotecha K, Abraham A, Horvat N, Chakraborty J. Automatic Segmentation of Pancreas and Pancreatic Tumor: A Review of a Decade of Research. IEEE Access 2023;11:108727-45.
Ciecholewski M, Kassjański M. Computational Methods for Liver Vessel Segmentation in Medical Imaging: A Review. Sensors (Basel) 2021;21:2027. [Crossref] [PubMed]
Mookiah MRK, Hogg S, MacGillivray TJ, Prathiba V, Pradeepa R, Mohan V, Anjana RM, Doney AS, Palmer CNA, Trucco E. A review of machine learning methods for retinal blood vessel segmentation and artery/vein classification. Med Image Anal 2021;68:101905. [Crossref] [PubMed]
Peng C, Zhang X, Yu G, Luo G, Sun J. Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE; 2017.
Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
YuFKoltunV.Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122.
Shaker A, Maaz M, Rasheed H, Khan S, Yang MH, Shahbaz Khan F. UNETR++: Delving Into Efficient and Accurate 3D Medical Image Segmentation. IEEE Transaction on Medical Imaging 2024;43:3377-90. [Crossref] [PubMed]
Zhang Y, Liu H, Hu Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. September 27 to October 1, 2021. Strasbourg, France: Springer; 2021:14-24.
Yao C, Hu M, Li Q, Zhai G, Zhang XP. TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation. 2022 5th International Conference on Information Communication and Signal Processing (ICICSP). Shenzhen, China: IEEE; 2022:280-4.
Chen B, Liu Y, Zhang Z, Lu G, Kong AWK. TransAttUnet: Multi-level Attention-guided U- Net with Transformer for Medical Image Segmentation. IEEE Transactions on Emerging Topics in Computational Intelligence 2024;8:55-68.
Lecun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998;86:2278-324.
Zhou HY, Lu C, Yang S, Yu Y. ConvNets vs. Transformers: Whose Visual Representations are More Transferable? 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal, BC, Canada: IEEE; 2021:2230-8.
Qu T, Wang X, Fang C, Mao L, Li J, Li P, Qu J, Li X, Xue H, Yu Y, Jin Z. M3Net: A multi-scale multi-view framework for multi-phase pancreas segmentation based on cross-phase non-local attention. Medical Image Analysis 2022;75:102232. [Crossref] [PubMed]
Zhang D, Huang G, Zhang Q, Han J, Han J, Yu Y. Cross-modality deep feature learning for brain tumor segmentation. Pattern Recognition 2021;110:107562.
Karimi D, Vasylechko SD, Gholipour A. Convolution- Free Medical Image Segmentation using Transformers. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. September 27 to October 1, 2021. Strasbourg, France: Springer; 2021:78-88.
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. Computer Vision – ECCV 2022 Workshops. Springer; 2022:205-18.
Huang X, Deng Z, Li D, Yuan X, Fu Y. MISSFormer: An Effective Transformer for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2023;42:1484-94. [Crossref] [PubMed]
Xie Y, Zhang J, Shen C, Xia Y. CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. September 27 to October 1, 2021. Strasbourg, France: Springer; 2021:171-80.
Xu G, Zhang X, He X, Wu X. LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation. Pattern Recognition and Computer Vision. PRCV 2023. Springer; 2024:42-53.
Li Y, Cai W, Gao Y, Li C, Hu X. More than Encoder: Introducing Transformer Decoder to Upsample. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Las Vegas, NV, USA: IEEE; 2022:1597-602.
Liu W, Tian T, Xu W, Yang H, Pan X, Yan S, Wang L. PHTrans: Parallelly Aggregating Global and Local Representations for Medical Image Segmentation. Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. Springer; 2022:235-44.
Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2021. Springer: 2022:272-84.
Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, Li Y. MAXIM: Multi-Axis MLP for Image Processing. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE; 2022:5759-70.
Zhao L, Zhang Z, Chen T, Metaxas DN, Zhang H. Improved Transformer for High-Resolution GANs. Advances in Neural Information Processing Systems 34. NeurIPS 2021:18367-80.
Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth HR, Xu D. UNETR: Transformers for 3D Medical Image Segmentation. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, HI, USA: IEEE; 2022.
Dai Z, Liu H, Le QV, Tan M. CoAtNet: Marrying Convolution and Attention for All Data Sizes. Advances in Neural Information Processing Systems 34. NeurIPS 2021:3965-77.
Jiang Y, Chang S, Wang Z. TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up. Advances in Neural Information Processing Systems 34. NeurIPS 2021:14745-58.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE; 2021:9992-10002.
Shaw P, Uszkoreit J, Vaswani A. Self-Attention with Relative Position Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New Orleans, LA, USA: Association for Computational Linguistics; 2018:464-8.
Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, Li Y. MaxViT: Multi-axis Vision Transformer. Computer Vision – ECCV 2022. Springer: 2022:459-79.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. The International Conference on Learning Representations (ICLR). 2021.
Milletari F, Navab N, Ahmadi SA. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016 Fourth International Conference on 3D Vision (3DV). Stanford, CA, USA: IEEE; 2016:565-71.
Landman B, Xu Z, Igelsias J, Styner M, Langerak T, Klein A. MICCAI Multi-Atlas Labeling Beyond Cranial Vault-Workshop and Challenge. 2015:12.
Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng PA, et al. Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans Med Imaging 2018;37:2514-25. [Crossref] [PubMed]
Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Fu Y, Feng J, Xiang T, Torr PHS, Zhang L. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE; 2021.
Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Computer Vision – ECCV 2018. Springer; 2018.
Wang W, Chen C, Ding M, Yu H, Zha S, Li J. TransBTS: Multimodal Brain Tumor Segmentation Using Transformer. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Springer: 2021:109-19.

Cite this article as: Cao L, Zhang Q, Fan C, Cao Y. Not Another Dual Attention UNet Transformer (NNDA-UNETR): a plug-and-play parallel dual attention block in U-Net with enhanced residual blocks for medical image segmentation. Quant Imaging Med Surg 2024;14(12):9169-9192. doi: 10.21037/qims-24-833