EMCAH-Net: an effective multi-scale context aggregation hybrid network for medical image segmentation
Introduction
Precise segmentation of organs and lesions is often a crucial step in medical image analysis and is of great significance for disease diagnosis and the formulation of treatment plans. However, manual segmentation is time-consuming, tedious, and requires extensive clinical experience. Therefore, automated and accurate segmentation methods for medical images are clinically desirable.
Over the past decade, convolutional neural network (CNN)-based methods for medical image segmentation (1-6), particularly U-Net (7) and its variants (8-11), have demonstrated remarkable clinical performance. The U-shaped architecture with skip connections effectively bridges the semantic gap between the encoder and decoder, whereas convolution operations enable the extraction of robust local features. However, the use of small, fixed kernels limits these methods to capturing single-scale local features and hinders their ability to model rich multi-scale contexts. Although dilated convolutions (12-16) mitigate this limitation by introducing gaps between kernel elements, suboptimal design can result in grid artifacts or the omission of smaller targets (16). Additionally, inadequate global feature representation can lead to localization errors and over-segmentation (17).
Meanwhile, the advent of Vision Transformer (ViT) (18) has overcome CNNs’ inability to capture global contextual information. The multi-head self-attention (MHSA) mechanism inherent in Transformers excels at establishing long-range dependencies among token sequences, which proves beneficial for pixel-level image segmentation tasks. However, when applied to medical image segmentation, Transformer-based methods (17,19,20) often encounter the following constraints: (I) ViT’s performance diminishes without pretraining on a substantial number of samples (21), a challenge in medicine because large amounts of labeled medical images are difficult to obtain (22). (II) Spatial information is disregarded as images are serialized into one-dimensional (1D) tokens, leading to weak local feature learning (23). Additionally, this process averages out channel-specific information, reducing ViT’s effectiveness in modeling channel interactions.
Hybrid CNN-Transformer networks (24-28) effectively combine the strengths of both architectures and have achieved significant advancements in medical image segmentation. However, existing approaches often fall short of fully realizing their potential. For instance, TransUNet (24) and TransFuse (25) incorporate Transformers within the encoder to extract global features, but this design can lead to the collapse of the attention mechanism (29,30). Similarly, UNETR (26) and Swin UNETR (28) employ Transformer blocks in the encoder while relying on convolutional blocks in the decoder. This separation prevents the encoder from fully leveraging the convolutional capability to capture local representations. Furthermore, a pervasive limitation of these methods is their insufficient consideration of the unique characteristics of medical images, such as varying target scales and complex geometries, when employing convolutions. Although Transformers are often used solely for capturing global information, their limitations in perceiving spatial and channel-specific details are frequently overlooked. Consequently, traditional rigid hybrid strategies constrain the synergistic potential of CNNs and Transformers.
Based on the insights gleaned from the preceding analysis, we propose the effective multi-scale context aggregation hybrid network (EMCAH-Net) for medical image segmentation, designed to fully leverage the strengths of both CNNs and Transformers. This network adopts a U-shaped encoder-decoder structure and incorporates two core components: the effective multi-scale context aggregation (EMCA) block and the dual-attention augmented self-attention (DASA) block. The EMCA block serves as the network’s backbone, adopting the idea of gradually expanding the receptive field. It employs standard convolutions (with a kernel size of 3×3) and a spatial pyramid of dilated convolutions to extract fine-grained local multi-scale contextual features. To address the limitations of the self-attention mechanism in adaptively capturing spatial and channel relationships, the DASA block integrates both channel attention and spatial attention (31). By embedding the DASA block into skip connections and bottleneck layers, EMCAH-Net achieves global modeling of convolutional feature maps while effectively capturing spatial and channel relationships. This design emphasizes target area segmentation and enhances boundary localization accuracy. Additionally, since the DASA block avoids the need for stacking multiple layers, it prevents attention collapse, improves skip connection performance, and facilitates consistent feature representations. Compared to previous models, EMCAH-Net demonstrates superior performance, achieving Dice similarity coefficient (DSC) scores of 84.73% (+2.85), 92.33% (+0.27), and 82.47% (+0.76) on the Synapse, automated cardiac diagnosis challenge (ACDC), and digital retinal images for vessel extraction (DRIVE) datasets, respectively. The specific contributions are outlined as follows:
- This paper introduces a novel hybrid model, EMCAH-Net, designed specifically for medical image segmentation. The model is engineered to improve the accuracy of segmenting targets with varying scales, blurred boundaries, and elongated shapes, particularly small targets, by leveraging the EMCA and DASA blocks.
- Our proposed EMCA block effectively captures local multi-scale contextual features, enhancing the model’s ability to discern intricate details. Meanwhile, the integration of the DASA block into the skip connections at each stage enables the adaptive perception of spatial and channel relationships while modeling global contextual dependencies. This design helps the model to emphasize segmentation targets and accurately delineate boundaries, thereby boosting overall performance.
- Our EMCAH-Net demonstrates outstanding performance across various medical image datasets, surpassing prior approaches including fully convolutional, pure Transformer, and hybrid CNN-Transformer models, which confirms the effectiveness of our method and its contribution to advancing medical image segmentation.
Related work
CNN-based methods
In recent years, deep learning has achieved remarkable progress across various domains (32-35). Notably, CNNs represented by U-Net (7) and its variants (8-11) have demonstrated exceptional performance in medical image segmentation tasks, thanks to their powerful ability to learn local features. However, the fixed receptive field inherent in standard convolution limits CNNs’ capability to effectively capture features of multi-scale segmentation targets. Dilated convolutions (36), with their variable receptive fields, partially mitigate the limitations of standard convolution and have been validated as highly successful in computer vision (CV) (12-16). In medical image segmentation, approaches such as DiSegNet (37) have introduced multi-stage atrous spatial pyramid pooling (MS-ASPP), leveraging dilated convolutions to extract multi-scale information and enhance segmentation performance. Other studies, such as that by Kaur et al. (38), have utilized progressively increasing dilation rates in dilated building blocks to segment skin cancer. Furthermore, techniques such as TransResU-Net (39) integrate a hybrid of dilated convolutions and Transformers in the bottleneck layer to extract low-resolution feature information from the output of ResNet-50 (40). These advancements illustrate the effectiveness of combining dilated convolutions with other techniques to address the challenges posed by multi-scale segmentation targets in medical image analysis.
Transformer-based methods
Transformer-based methods (21,41-43) have showcased remarkable performance in CV and are considered viable alternatives to CNN-based approaches. ViT (18) pioneered the application of pure Transformers to image classification, followed by subsequent enhancements such as DeiT (44), Swin-Transformer (42), and MaxViT (21), which have further solidified the influence of ViT in the CV domain. In the realm of medical image segmentation, Swin-UNet (17) and DS-TransUNet (19) draw inspiration from Swin-Transformer and propose pure transformer architectures, achieving performance levels comparable to CNNs. However, a notable limitation of these methods lies in their encoders being pre-trained with large-scale datasets.
Hybrid model methods
In comparison to employing a single approach (either pure Transformer or fully convolutional), the hybrid CNN-Transformer architecture is typically crafted to more effectively capture both local and global features within images. TransUNet (24) marked the initial endeavor to introduce a hybrid architecture into the field of medical image segmentation. This approach utilizes a CNN for extracting local features and a Transformer for encoding global context. Additionally, TransFuse (25) introduced a fusion module aimed at combining feature information from both the CNN and Transformer branches, thereby enhancing the model’s feature expression capabilities. UNETR (26), on the other hand, is a three-dimensional (3D) medical image segmentation method that employs a Transformer as the encoder and a CNN as the decoder. Despite the potential benefits of hybrid approaches, research on extracting rich multi-scale and global features while mitigating their respective shortcomings remains relatively limited. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1983/rc).
Methods
Architecture overview
As shown in Figure 1A, our proposed EMCAH-Net follows an encoder-decoder architecture. The backbone of the network is composed of stacked EMCA blocks, whereas the DASA block is embedded in skip connections and bottleneck layers. The EMCA block includes four standard convolutions and a spatial pyramid of dilated convolutions, combined with residual and addition operations, to effectively aggregate local multi-scale contextual features in medical images. The DASA block enhances the self-attention mechanism by incorporating channel and spatial attention, enabling it to model the global representation of convolutional feature maps while perceiving channel and spatial relationships. This design not only helps the model to emphasize segmentation targets and accurately delineate boundaries but also addresses challenges such as significant scale variations, complex geometric shapes, and low contrast in medical image segmentation. Moreover, it improves skip connection performance and mitigates attention collapse, thereby boosting the overall segmentation performance.

Specifically, given an input image I, let H and W denote its height and width, and let C denote the base channel number; we denote the output feature of Stage i as Fi. In the encoder, the input I is fed into the first EMCA block for representation learning, where the feature dimension becomes C while the resolution is maintained. The max pooling layer following the EMCA block performs 2× down-sampling, reducing the resolution while increasing the receptive field. These operations are repeated five times, from Stage 0 to Stage 4. In the decoder, the EMCA block plays the opposite role, with bilinear interpolation used for up-sampling. Next, the feature maps output from each stage of the encoder are flattened into 1D tokens, and the global representation is constructed through the DASA block in the skip connections and bottleneck layer, while spatial and channel information is adaptively perceived. Then, these tokens are reshaped into spatial feature maps and fed to the corresponding stages in the decoder for feature fusion. Finally, EMCAH-Net recovers the input resolution through a simple convolutional segmentation head, outputting a prediction map of size H×W×K, where K represents the number of segmentation classes. Next, we elaborate on the proposed architectural design.
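To make the stage layout concrete, the following is a minimal PyTorch sketch of the U-shaped path described above: five encoder stages with max-pool down-sampling, bilinear up-sampling with skip fusion in the decoder, and a 1×1 convolution segmentation head. StageBlock is a simplified stand-in for the EMCA block, the DASA processing of the skip features is omitted, and the stage widths follow the “Base” configuration reported later in Table 9; the class names are illustrative, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageBlock(nn.Module):
    """Stand-in for the EMCA block: a single conv-BN-GELU layer used only to show shapes."""
    def __init__(self, ci, co):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.BatchNorm2d(co), nn.GELU())

    def forward(self, x):
        return self.body(x)

class UShapedSkeleton(nn.Module):
    """Structural skeleton of the encoder-decoder path in Figure 1A (illustrative only)."""
    def __init__(self, in_ch=3, num_classes=9, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        self.enc = nn.ModuleList()
        prev = in_ch
        for w in widths:                        # Stage 0 ... Stage 4
            self.enc.append(StageBlock(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(2)             # 2x down-sampling between stages
        self.dec = nn.ModuleList([
            StageBlock(deep + skip, skip)       # fuse up-sampled deep features with the skip
            for deep, skip in zip(widths[::-1][:-1], widths[::-1][1:])
        ])
        self.head = nn.Conv2d(widths[0], num_classes, 1)  # simple segmentation head

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)                 # in the full model these pass through DASA blocks
                x = self.pool(x)
        for block, skip in zip(self.dec, reversed(skips)):
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
            x = block(torch.cat([x, skip], dim=1))
        return self.head(x)                     # (B, K, H, W)

print(UShapedSkeleton()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 9, 224, 224])
```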
EMCA block
The EMCA block (see Figure 1B) is designed to address the challenges posed by complex geometric shapes and significant scale variations among segmentation targets in medical images. To achieve this, the block progressively expands the receptive field to preserve fine-grained local details while employing a spatial pyramid of dilated convolutions to efficiently extract multi-scale contextual features. In the following sections, we provide a detailed comparison of the EMCA block with other methods and elaborate on its specific design elements.
Dilated convolution expands the convolutional kernel by inserting spaces (zeros) between its elements, thereby enlarging the receptive field while maintaining the same resolution. In previous research, DeepLab (12), ESPNet (13), Inception (45,46), and ResNeXt (47) commonly utilize a split-(reduce)-transform-(expand)-merge scheme. For example, ESPNet (13) uses point-wise convolutions to reduce computation (reduce and split), while the spatial pyramid of dilated convolutions resamples the feature maps (transform) to learn multi-scale image representations with a large effective receptive field. Finally, hierarchical feature fusion is performed (merge).
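As a brief, generic illustration (not taken from the authors’ code), the snippet below shows how a 3×3 convolution with dilation rate d enlarges the effective kernel to 3 + 2(d − 1) while padding = d keeps the spatial resolution unchanged:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # (batch, channels, height, width)

# Three parallel 3x3 convolutions with dilation rates 1, 2, and 3.
# Setting padding=d preserves the 32x32 resolution while the effective
# kernel size grows to 3, 5, and 7, i.e., 3 + 2*(d - 1).
branches = [nn.Conv2d(8, 8, kernel_size=3, padding=d, dilation=d) for d in (1, 2, 3)]
print([branch(x).shape for branch in branches])  # all torch.Size([1, 8, 32, 32])
```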
Unlike natural images, medical images often display complex geometric shapes and variable target scales within organ tissues or lesions. Previous approaches have typically relied on dilated convolutions without employing a progressive strategy, which can lead to the neglect of local features, particularly for smaller segmentation targets. Such limitations make these methods less suitable for medical image segmentation tasks. To address this, the EMCA block adopts an expand-split-transform-merge-extend strategy, gradually expanding the receptive field while preserving local features and aggregating multi-scale contextual information. For a detailed configuration of the EMCA block, please refer to Table 1.
- Expand (step 1): the standard convolution projects low-dimensional feature maps into high-dimensional space (e.g., the number of channels grows from C to 2C), extracting local features while initially increasing the receptive field.
- Split (step 2): the standard convolution further expands the receptive field and splits the feature maps into multiple parallel branches.
- Transform (step 3): the spatial pyramid of dilated convolutions applies multiple dilated convolutions concurrently to resample the feature maps, with dilation rates of 1, 2, and 3 (Table 1).
- Merge (step 4): the outputs of the previous step are added and then integrated by standard convolutions, enabling the EMCA block to learn multi-scale feature representations.
- Extend (step 5): finally, a standard convolution continues to gradually enlarge the receptive field and, together with the residual connection, aggregates the feature representation.
Table 1
EMCA block | Hyper-parameter |
---|---|
Step 1 (expand) | Conv2d(C,2C,k=(3,3),s=(1,1)),BatchNorm2d,GELU |
Step 2 (split) | Conv2d(2C,2C,k=(3,3),s=(1,1)),BatchNorm2d,GELU |
Step 3 (transform) | Conv2d(2C,2C,k=(3,3),s=(1,1),r=1),GELU |
 | Conv2d(2C,2C,k=(3,3),s=(1,1),r=2),GELU |
 | Conv2d(2C,2C,k=(3,3),s=(1,1),r=3),GELU |
Step 4 (merge) | Addition |
 | Conv2d(6C,2C,k=(3,3),s=(1,1)),BatchNorm2d,GELU |
Step 5 (extend) | Conv2d(2C,2C,k=(3,3),s=(1,1)),BatchNorm2d,GELU |
C represents the number of channels, and ‘r’ represents the dilation rate. A residual connection is employed between step 2 and step 4. EMCA, effective multi-scale context aggregation block; GELU, Gaussian error linear unit non-linear activation function.
It is noteworthy that each standard convolution and each dilated convolution is followed by a Gaussian error linear unit (GELU) non-linear activation function, which imparts vital nonlinear representation capability to the EMCA block.
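For concreteness, below is a minimal PyTorch sketch of an EMCA-style block assembled from the hyper-parameters in Table 1. The hierarchical addition of the three dilated branches before concatenation (which yields the 6C input to the merge convolution) and the exact placement of the step 2-to-step 4 residual connection are our assumptions, so the class should be read as an illustrative interpretation of Table 1 rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

def conv_bn_gelu(ci, co):
    return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.BatchNorm2d(co), nn.GELU())

class EMCABlock(nn.Module):
    """Expand-split-transform-merge-extend sketch following Table 1 (C -> 2C channels)."""
    def __init__(self, in_ch):
        super().__init__()
        c2 = 2 * in_ch
        self.expand = conv_bn_gelu(in_ch, c2)                 # step 1: C -> 2C
        self.split = conv_bn_gelu(c2, c2)                     # step 2: further enlarge receptive field
        self.branches = nn.ModuleList([                       # step 3: dilation rates 1, 2, 3
            nn.Sequential(nn.Conv2d(c2, c2, 3, padding=r, dilation=r), nn.GELU())
            for r in (1, 2, 3)
        ])
        self.merge = conv_bn_gelu(3 * c2, c2)                 # step 4: Conv2d(6C, 2C)
        self.extend = conv_bn_gelu(c2, c2)                    # step 5

    def forward(self, x):
        x = self.expand(x)
        s = self.split(x)
        b1, b2, b3 = (branch(s) for branch in self.branches)
        # Assumed ESPNet-style hierarchical addition before concatenation (gives 6C channels).
        m = self.merge(torch.cat([b1, b1 + b2, b1 + b2 + b3], dim=1))
        m = m + s                                             # assumed residual between step 2 and step 4
        return self.extend(m)

x = torch.randn(1, 32, 56, 56)
print(EMCABlock(32)(x).shape)  # torch.Size([1, 64, 56, 56])
```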
DASA block
Global context features are critical for pixel-level segmentation tasks (18,42). However, prior methods have often failed to address the limitations of self-attention in capturing spatial and channel information. Figure 1C illustrates how the DASA block introduces channel attention and spatial attention, and its integration into the skip connection and bottleneck layers of EMCAH-Net. This design not only models the global representation of convolutional feature maps but also enhances their spatial and channel perception. By focusing on target segmentation, sharpening boundary detection, and distinguishing between target and background regions, the DASA block effectively mitigates issues such as over-segmentation and localization errors commonly observed in pure convolutional architectures.
Specifically, the DASA block enhances the self-attention mechanism, as shown in Figure 1D,1E, and consists of a Channel Attention Block (CAB) that emphasizes relevant channels and a Spatial Attention Block (SAB) (31) that captures spatial contextual information:
Xi = Flatten(Fi)
where Fi denotes the output of the i-th encoder stage, Xi denotes the flattened convolutional feature map at the i-th stage, and Hi, Wi, and Ci are the height, width, and channel number of the i-th encoder stage output. Then, the query (Q) and key (K) matrices are generated through learnable linear transformation layers and embedded with positional encoding, following the approach of ViT, to capture global contextual information more efficiently. The value matrix (V) is generated by processing the input features through the CABs and SABs:
V = SAB(CAB(Xi))
We compute the matrix of outputs as:
Attention(Q, K, V) = softmax(QK^T/√dk)V
where dk is the dimension of the key vectors. Finally, the output is combined with the original input through an addition operation and reshaped into a 2D spatial feature map to align with the decoder’s reconstruction process and match the output resolution:
Zi = Reshape(Xi + Attention(Q, K, V))
Notably, by utilizing shallower DASA blocks {from stage 1 to stage 5 with layer configurations [1, 1, 1, 1, 4]}, we not only circumvent the issue of attention collapse but also enhance the performance of skip connections.
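A minimal PyTorch sketch of this mechanism is given below. It assumes CBAM-style channel and spatial attention (31) for the CAB and SAB, a learned positional embedding added to the query and key paths, and standard multi-head scaled dot-product attention with a residual connection; it is an illustrative reading of Figure 1C-1E, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention (assumed form of the CAB)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(), nn.Linear(ch // reduction, ch))

    def forward(self, x):                         # x: (B, C, H, W)
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (assumed form of the SAB)."""
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class DASABlock(nn.Module):
    """Q/K from linear projections of the flattened map (plus positional embedding),
    V from CAB + SAB, then multi-head scaled dot-product attention and a residual."""
    def __init__(self, ch, num_tokens, heads=4):
        super().__init__()
        self.cab, self.sab = ChannelAttention(ch), SpatialAttention()
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, ch))
        self.q, self.k, self.v = nn.Linear(ch, ch), nn.Linear(ch, ch), nn.Linear(ch, ch)
        self.heads = heads

    def forward(self, f):                         # f: (B, C, H, W) convolutional feature map
        b, c, h, w = f.shape
        x = f.flatten(2).transpose(1, 2)          # (B, HW, C) tokens
        q = self.q(x + self.pos)
        k = self.k(x + self.pos)
        v = self.v(self.sab(self.cab(f)).flatten(2).transpose(1, 2))
        d = c // self.heads

        def split(t):                             # (B, HW, C) -> (B, heads, HW, d)
            return t.view(b, -1, self.heads, d).transpose(1, 2)

        attn = torch.softmax(split(q) @ split(k).transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ split(v)).transpose(1, 2).reshape(b, -1, c)
        out = out + x                             # residual with the original tokens
        return out.transpose(1, 2).reshape(b, c, h, w)

f = torch.randn(2, 64, 28, 28)
print(DASABlock(ch=64, num_tokens=28 * 28)(f).shape)  # torch.Size([2, 64, 28, 28])
```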
Loss function
For training, the loss function is defined as:
L = Lce(g, p) + λLdice(g, p)
where Lce is the cross-entropy loss, Ldice is the Dice loss, and g and p are the ground truth and prediction results, respectively. The default setting of λ is 1.
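The sketch below shows one way to implement such a combined objective in PyTorch, assuming the total loss is the cross-entropy term plus λ times a multi-class soft Dice term; the authors’ exact formulation and weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Assumed combined objective: L = L_ce + lambda * L_dice (soft, multi-class Dice)."""
    def __init__(self, lam=1.0, smooth=1e-5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.lam, self.smooth = lam, smooth

    def forward(self, logits, target):            # logits: (B, K, H, W), target: (B, H, W) int64
        ce = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice_loss = 1.0 - ((2 * inter + self.smooth) / (denom + self.smooth)).mean()
        return ce + self.lam * dice_loss

logits = torch.randn(2, 9, 224, 224)              # e.g., 8 organs + background on Synapse
target = torch.randint(0, 9, (2, 224, 224))
print(DiceCELoss()(logits, target).item())
```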
Implementation details and evaluation indicators
EMCAH-Net is a hybrid model that combines the advantages of both CNNs and Transformers in an effective and novel manner. It is implemented using Python 3.10 (https://www.python.org/downloads/release/python-3100/) and PyTorch 2.0.0 (https://pytorch.org/get-started/pytorch-2.0/). To improve the model’s generalization ability, data augmentation techniques such as flips and rotations were applied to all training sets to increase data diversity. We trained our model on an Nvidia RTX 3090 GPU (Nvidia, Santa Clara, CA, USA) with 24 GB of memory. During training, the batch size was set to 8 (2 for the DRIVE dataset), and the popular SGD optimizer with momentum 0.9 and weight decay 1e−4 was used to optimize our model through backpropagation.
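A sketch of the corresponding optimizer setup and a single training step is shown below; the learning rate is not reported in the text, so the value used here is a placeholder, and the model and loss are simplified stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 9, kernel_size=1)            # stand-in for EMCAH-Net
criterion = nn.CrossEntropyLoss()                 # stand-in for the combined CE + Dice objective

# SGD with momentum 0.9 and weight decay 1e-4, as reported; lr=0.01 is a placeholder.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

images = torch.randn(8, 3, 224, 224)              # one mini-batch (batch size 8; 2 for DRIVE)
labels = torch.randint(0, 9, (8, 224, 224))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```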
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). During the training phase on the three datasets, all experiments were conducted using the same experimental settings, and the model did not require any pretraining strategy. To evaluate the model’s performance, we employed five metrics. In the following formulas, TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively.
DSC quantifies the similarity between the predicted segmentation and the ground truth, providing a measure of the overlap between the predicted and true regions.
Sensitivity (Sen) represents the proportion of true positives that the model can correctly identify. A higher Sen value indicates a better ability of the model to recognize positive samples. However, evaluating solely with Sen may lead to an increase in false positives (i.e., misclassifying negative samples as positive). Therefore, it is typically combined with other metrics such as specificity to comprehensively evaluate the model’s performance.
Specificity (Spe) represents the proportion of true negative samples that are correctly predicted as negative by the model, measuring the accuracy of the model when predicting negative samples. It is often used together with Sen.
Pixel accuracy (Acc) represents the proportion of pixels that the model predicts correctly out of the total number of pixels, indicating the accuracy of the model at the pixel level.
Hausdorff distance (HD95). HD95 calculates the 95th percentile of the surface distances between the point sets of the ground truth segmentation G and the predicted segmentation P, which is valuable for evaluating the shape similarity of two segmentations. Here, the surface distance is measured with the Euclidean distance function d(·,·) between individual pixels g from G and p from P.
For the DSC, Sen, Spe, and Acc, larger values are better. We report the four metrics as percentages (i.e., %). For the HD95, smaller values are better.
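For reference, the sketch below computes the four pixel-level metrics from the confusion-matrix counts using their standard definitions (DSC = 2TP/(2TP + FP + FN), Sen = TP/(TP + FN), Spe = TN/(TN + FP), Acc = (TP + TN)/(TP + TN + FP + FN)), reported as percentages; HD95 additionally requires a surface-distance computation, which is typically delegated to a dedicated library and is omitted here. The helper names are illustrative.

```python
import numpy as np

def confusion_counts(pred, gt):
    """TP/FP/TN/FN pixel counts for one class; pred and gt are boolean masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp, fp, tn, fn

def segmentation_metrics(pred, gt, eps=1e-8):
    tp, fp, tn, fn = confusion_counts(pred, gt)
    return {
        "DSC": 100 * 2 * tp / (2 * tp + fp + fn + eps),
        "Sen": 100 * tp / (tp + fn + eps),
        "Spe": 100 * tn / (tn + fp + eps),
        "Acc": 100 * (tp + tn) / (tp + tn + fp + fn + eps),
    }

pred = np.random.rand(584, 565) > 0.5   # dummy binary prediction (DRIVE-sized)
gt = np.random.rand(584, 565) > 0.5     # dummy binary ground truth
print(segmentation_metrics(pred, gt))
```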
Results
We carefully selected three challenging datasets to evaluate our model, all of which are publicly available. These three datasets are the Synapse computed tomography (CT) image dataset (48), the ACDC magnetic resonance imaging (MRI) dataset (49), and the DRIVE color fundus image dataset (50).
Firstly, we verified the segmentation capabilities of our EMCA block for multi-scale targets and the advantages of the DASA block in emphasizing and locating areas of segmentation on the Synapse dataset. This dataset requires the segmentation of eight abdominal organs, which exhibit significant scale differences and face challenges such as low contrast and complex geometric shapes. Particularly, the pancreas undergoes substantial morphological variations across different individuals, and the absence of a membrane leads to very fuzzy boundaries. Then, in each image of the ACDC cardiac segmentation dataset, the proportion of cardiac tissue is very small, and most boundaries, especially those of the myocardium (Myo), are quite blurry, posing a significant challenge to segmentation models. Finally, we meticulously selected the DRIVE retinal vessel segmentation dataset. In this dataset, the vasculature exhibits complex interweaving, significant scale variations, and relatively low contrast with the image background, presenting a formidable challenge for the model’s multi-scale target capture, localization, and boundary refinement capabilities.
Dataset
Synapse multi-organ segmentation (Synapse) (48): the dataset includes 30 cases, with a total of 3,779 axial abdominal clinical CT images. Each CT volume contains 85–198 slices of 512×512 pixels, with a voxel spatial resolution of [(0.54–0.54) × (0.98–0.98) × (2.5–5.0)] mm3. To fairly evaluate the performance of EMCAH-Net on the Synapse dataset, following the settings of TransUNet (24), 18 samples were allocated as the training set and 12 samples were designated as the testing set. The average DSC and average HD95 are used as evaluation metrics on the test results for eight abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach).
ACDC (49): the dataset comprises cardiac MRI scans collected from different patients and includes a total of 100 samples. For each patient, the left ventricle (LV), right ventricle (RV), and myocardium (Myo) are annotated in the MR images. Following the same settings as TransUNet (24), the dataset is split into 70 samples for training, 10 for validation, and 20 for testing. Only the average DSC is used as the evaluation metric.
DRIVE (50): this dataset is used for retinal vessel segmentation and consists of 40 JPEG-format color fundus images, including 7 cases with abnormal pathology. These images were acquired as part of a diabetic retinopathy screening program conducted in the Netherlands, captured using a Canon CR5 non-mydriatic 3CCD camera with a 45-degree field of view (FOV). Each image has a resolution of 584×565 pixels, with each color channel having a depth of eight bits. The 40 images are evenly split into 20 for the training set and 20 for the test set. For convenient comparison with other state-of-the-art (SOTA) methods, we use DSC, Sen, Spe, and Acc as evaluation metrics.
To ensure a fair comparison with SOTA methods, we standardized the two-dimensional (2D) input image dimensions: 224×224 on the Synapse and ACDC datasets, and 576×576 on the DRIVE dataset.
In the subsequent comparisons, the results of the other methods on the aforementioned datasets are quoted, directly or indirectly, from the findings published in their respective papers.
Experiment results and analysis on the Synapse dataset
Table 2 presents the comparison results of the proposed EMCAH-Net with previous SOTA methods on the Synapse multi-organ CT dataset. The experiment involves segmenting eight organs with significant differences in scale, including small target organs such as the gallbladder and pancreas, which represent the most challenging cases. EMCAH-Net achieved notable segmentation accuracy (84.73% DSC and 16.56 HD), surpassing methods based on fully convolutional networks, pure Transformers, hybrid CNN-Transformers, and SAMed (53), a fine-tuned variant of SAM (55).
Table 2
Methods | DSC (%) | HD (mm) | Aorta (%) | Gallbladder (%) | Kidney (L) (%) | Kidney (R) (%) | Liver (%) | Pancreas (%) | Spleen (%) | Stomach (%) |
---|---|---|---|---|---|---|---|---|---|---|
R50 U-Net (24) | 74.68 | 36.87 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 |
U-Net (7) | 76.85 | 39.70 | 89.07 | 69.72 | 77.77 | 68.60 | 93.43 | 53.98 | 86.67 | 75.58 |
R50 Att-UNet (24) | 75.57 | 36.97 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
Att-UNet (51) | 77.77 | 36.02 | 89.55† | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
TransUNet (24) | 77.48 | 31.69 | 87.23 | 63.13 | 81.87 | 77.02 | 94.08 | 55.86 | 85.08 | 75.62 |
SwinUNet (17) | 79.13 | 21.55 | 85.47 | 66.53 | 83.28 | 79.61 | 94.29 | 56.58 | 90.66 | 76.60 |
HiFormer-L (27) | 80.69 | 19.14 | 87.03 | 68.61 | 84.23 | 78.37 | 94.07 | 60.77 | 90.44 | 82.03 |
DA-TransUNet (52) | 79.80 | 23.48 | 86.54 | 65.27 | 81.70 | 80.45 | 94.57 | 61.62 | 88.53 | 79.73 |
SAMed (53) | 81.88 | 20.64 | 87.77 | 69.11 | 80.45 | 79.95 | 94.80 | 72.17† | 88.72 | 82.06† |
VM-UNet (54) | 81.08 | 19.21 | 86.40 | 69.41 | 86.16 | 82.76 | 94.17 | 58.80 | 89.51 | 81.40 |
EMCAH-Net (ours) | 84.73† | 16.56† | 88.99 | 74.74† | 89.50† | 84.70† | 95.50† | 71.42 | 92.02† | 80.96 |
†, the best. CT, computed tomography; DSC, Dice similarity coefficient; EMCAH-Net, effective multi-scale context aggregation hybrid network; HD, Hausdorff distance; L, left; R, right.
Specifically, our results demonstrate superior segmentation accuracy for each organ compared to fully convolutional methods such as Att-UNet (51) and U-Net (7). This indicates the limitation of relying solely on local image features for pixel-level medical image segmentation tasks. Subsequently, the pure Transformer method, SwinUNet (17), surpasses traditional CNNs, emphasizing the importance of global contextual information in segmentation tasks. Furthermore, alongside TransUNet (24), hybrid methods such as DA-TransUNet (52) and HiFormer (27) outperform SwinUNet (17), demonstrating the advantages of hybrid approaches. Our method consistently outperforms previous hybrid methods, particularly excelling in the segmentation of the pancreas, gallbladder, and spleen with significant scale variations and blurred boundaries. This is a result of the collaborative effect of the fine-grained local multi-scale features extracted by the EMCA block at each stage of the encoder and the global dependencies established by the DASA block based on this foundation. It cleverly leverages the strengths of both CNN and Transformer.
Finally, Figure 2 provides a characteristic qualitative example of the results for better illustration. We have observed that the proposed method can accurately segment targets with significant scale differences and output more accurate segmentation results, which are more robust to small targets and complex backgrounds.

Experiment results and analysis on the ACDC dataset
Similar to the Synapse dataset, the proposed EMCAH-Net was trained on the ACDC dataset to perform medical image segmentation. The experimental results are shown in Table 3: EMCAH-Net achieves an excellent DSC of 92.33%. Our method outperformed the pure CNN model U-Net (7), the CNN-Transformer hybrid architecture LeViT-UNet (58), and the pure Transformer model SwinUnet (17) in DSC, with accuracy improvements of 2.92%, 2.01%, and 2.33%, respectively. This demonstrates that, when MRI data are used as input, EMCAH-Net still achieves excellent performance with good generalization capability and robustness.
Table 3
Methods | DSC (%) | RV (%) | Myo (%) | LV (%) |
---|---|---|---|---|
U-Net (7) | 89.41 | 87.77 | 85.88 | 94.67 |
UNet++ (10) | 89.58 | 87.23 | 87.13 | 94.37 |
Att-UNet (51) | 89.01 | 87.30 | 85.07 | 94.66 |
PSPNet (56) | 88.75 | 85.99 | 86.39 | 93.87 |
DeepLabv3+ (12) | 88.25 | 85.41 | 85.44 | 93.90 |
R50 U-Net (24) | 87.55 | 87.10 | 80.63 | 94.92 |
R50 Att-UNet (24) | 86.75 | 87.58 | 79.20 | 93.47 |
TransUnet (24) | 89.71 | 88.86 | 84.53 | 95.73 |
SwinUnet (17) | 90.00 | 88.55 | 85.62 | 95.83 |
MISSFormer (20) | 84.53 | 81.07 | 81.21 | 91.29 |
nnFormer (57) | 92.06 | 90.94† | 89.58 | 95.65 |
LeViT-UNet (58) | 90.32 | 89.55 | 87.64 | 93.76 |
CvT (59) | 89.01 | 87.30 | 85.07 | 94.66 |
EMCAH-Net (ours) | 92.33† | 90.21 | 90.48† | 96.30† |
†, the best. ACDC, automated cardiac diagnosis challenge; DSC, Dice similarity coefficient; EMCAH-Net, effective multi-scale context aggregation hybrid network; LV, left ventricle; MRI, magnetic resonance image; Myo, myocardium; RV, right ventricle.
In Figure 3, we present visual results comparing our model with other representative models. Compared to them, our segmentation results for the RV closely match the Ground Truth. UNet (7) exhibits the common phenomenon of over-segmentation seen in CNNs, whereas TransUNet (24) shows under-segmentation, possibly due to attention collapse issues. SwinUNet (17) divides the RV into two parts, which is due to SwinTransformer (42) lacking detailed local features. In terms of Myo segmentation, our method excels in boundary delineation, benefiting from the model’s effective use of DASA block for global context modeling. Finally, for LV segmentation, the area and shape obtained by our method are the closest to the Ground Truth.

Experiment results and analysis on the DRIVE dataset
The comparison with other methods is shown in Table 4, where we significantly outperform previous work in Sen (+2.87%) and Acc (+1.1%), and slightly surpass prior work in DSC (+0.76%) and Spe (+0.02%). It is worth noting that SOTA models for retinal vessel segmentation are almost all CNNs specialized in segmenting such elongated and tiny structures, which may be attributed to CNNs’ advantage in capturing local detailed information. Additionally, we conducted a qualitative analysis of this dataset, as shown in Figure 4. Through the observation of enlarged regions, firstly, our method avoids over-segmentation, whereas other methods tend to introduce an additional vessel. Secondly, in our results, the intersections between vessels can be clearly distinguished, whereas other methods mistakenly merge two vessels into one. Furthermore, our method excels in segmenting the terminals of each vessel. This is attributed to the EMCA block’s strong capabilities in extracting local details and multi-scale contextual features, along with the significant role of the DASA block in locating segmentation targets and delineating clear boundaries.
Table 4
Methods | DSC (%) | Sen (%) | Spe (%) | Acc (%) |
---|---|---|---|---|
UNet (7) | 80.55 | 77.17 | 97.97 | 95.29 |
Bridge-Net (60) | – | 78.53 | 98.18 | 95.65
NFN+ (61) | – | 79.91 | 98.13 | 95.82
CSU-Net (62) | – | 80.71 | 97.82 | 95.65
Zou et al. (63) | 81.29 | 77.61 | 97.92 | 95.19 |
Residual UNet (64) | 81.49 | 77.26 | 98.20 | 95.53 |
CE-Net (65) | 80.99 | 76.67 | 98.15 | 95.42 |
SCS-Net (66) | 81.53 | 77.79 | 98.10 | 95.51 |
LIOT UNet (67) | 81.39 | 78.74 | 97.85 | 95.42 |
CS-Net (68) | 81.71 | 78.13 | 98.09 | 95.55 |
AttMSFCU-Net (69) | – | 79.84 | 98.07 | 95.75 |
EMCAH-Net (ours) | 82.47† | 83.58† | 98.22† | 96.92† |
The “–” indicates that the results for the DSC metric on this dataset were not reported in the referenced study. †, the best. Acc, accuracy; DRIVE, digital retinal images for vessel extraction; DSC, Dice similarity coefficient; EMCAH-Net, effective multi-scale context aggregation hybrid network; Sen, sensitivity; Spe, specificity.

Ablation study
A series of ablation studies was performed to validate the efficacy of our proposed method. Unless otherwise specified, all experimental results were obtained using the ACDC dataset.
Effect of EMCA block
As shown in Table 5, by comparing E.5 with E.1, we can observe that when the EMCA block and only standard convolutions [e.g., U-Net (7)] are used as the encoder-decoder backbone, respectively, E.5 performs significantly better than E.1, with an improvement of 1.46% in DSC. This also demonstrates the performance enhancement of the U-Net model with the inclusion of the DASA block. By comparing E.1 with E.2 and E.3, EMCA enhances the DSC baseline by 0.67% when used as the decoder (E.2) and by 1.14% when used as the encoder (E.3). The results clearly demonstrate the efficacy of the EMCA block in capturing local multi-scale context information. Finally, we streamlined the EMCA block by reducing the standard convolutions before and after the spatial pyramid of dilated convolutions. As a result, we observed a decrease (E.5 vs. E.4) in DSC accuracy of 0.91%, which confirms the effectiveness of gradually expanding the receptive field compared to the previous dilated convolution approach.
Table 5
Setting | Encoder | Decoder | DSC (%) | RV (%) | Myo (%) | LV (%) |
---|---|---|---|---|---|---|
E.1 | like U-Net | like U-Net | 90.87 | 88.09 | 88.94 | 95.58 |
E.2 | like U-Net | EMCA | 91.54 | 89.48 | 89.16 | 95.98 |
E.3 | EMCA | like U-Net | 92.01 | 89.87 | 90.07 | 96.11 |
E.4 | reduced EMCA | reduced EMCA | 91.42 | 89.09 | 89.27 | 95.89 |
E.5 | EMCA | EMCA | 92.33† | 90.21† | 90.48† | 96.30† |
Across all experiments, the DASA block was incorporated into the skip connections. †, the best. DASA, dual-attention augmented self-attention; DSC, Dice similarity coefficient; EMCA, effective multi-scale context aggregation; LV, left ventricle; Myo, myocardium; RV, right ventricle.
Effect of the DASA block
The objective of this ablation study is to examine the contribution of the DASA block within EMCAH-Net. First, we evaluate the rationale behind the design of the DASA block. Next, we investigate the impact of different numbers of skip connections integrated with DASA blocks, compared with normal skip connections (i.e., those without integrated DASA blocks). Furthermore, we verify the influence of DASA blocks with different configurations on the model and examine the results produced by applying various methods in the network bottleneck layer.
The experimental results reported in Table 6 reveal the impact of different attention mechanism combinations on segmentation performance. When only self-attention is used to capture global contextual representation, the DSC is 91.87%. Next, channel attention is introduced to adaptively perceive the channel relationships of input features while capturing global information, resulting in a DSC of 92.09%. For comparison, to compensate for the insufficient attention of self-attention to the spatial relationships of input features, spatial attention is introduced, resulting in a DSC of 92.10%. These results highlight the limitations of self-attention in perceiving the channel and spatial relationships of input features. Finally, by combining self-attention, channel attention, and spatial attention, the best segmentation performance is achieved, with the DSC reaching 92.33%. This experiment validates that the DASA block can adaptively perceive the channel and spatial relationships of input features while learning global contextual information, thereby enhancing segmentation performance.
Table 6
Self-Attention | +Channel-attention | +Spatial-attention | DSC (%) |
---|---|---|---|
✓ | ✗ | ✗ | 91.87 |
✓ | ✓ | ✗ | 92.09 |
✓ | ✗ | ✓ | 92.10 |
✓ | ✓ | ✓ | 92.33† |
†, the best. DSC, Dice similarity coefficient.
Figure 5 shows the average DSC on the ACDC dataset when the number of skip connections is set to 1, 2, 3, or 4. It can be observed that adding more skip connections improves the model’s segmentation performance. Compared with normal skip connections, the proposed DASA block demonstrated an accuracy improvement of 1.11% in DSC. Notably, the enhancement in model performance due to “4-skip” over “3-skip” is limited. For computational efficiency, it is suggested to incorporate DASA blocks solely within the skip connections of stages 2 through 4.

We evaluate the impact of DASA blocks with varying configurations on model performance, as shown in Table 7. Experimental findings indicate that despite an increase in the number of heads of multi-head attention and the number of DASA block layers, there is no corresponding improvement in model performance. This phenomenon may be due to the attention collapse. Therefore, we decided to stack fewer DASA blocks in each stage and the bottleneck layer, that is, to set the hyperparameters to be the same as in C.1, which not only circumvents attention collapse but also effectively improves the performance of skip connections.
Table 7
Subsection | Output size | Dim | C.1 | C.2 | C.3 |
---|---|---|---|---|---|
Stage 1 | 112×112 | 32 | w/o DASA block | w/o DASA block | w/o DASA block |
Stage 2 | 56×56 | 64 | Head 4, depth 1 | Head 4, depth 2 | Head 8, depth 2 |
Stage 3 | 28×28 | 128 | Head 8, depth 1 | Head 8, depth 2 | Head 8, depth 2 |
Stage 4 | 14×14 | 256 | Head 8, depth 1 | Head 8, depth 6 | Head 16, depth 12 |
Bottleneck | 7×7 | 512 | Head 16, depth 4 | Head 16, depth 4 | Head 32, depth 6 |
DSC (%) | – | – | 92.65 | 92.49 | 92.40 |
C, configuration; DASA, dual-attention augmented self-attention; DSC, Dice similarity coefficient.
Moreover, to validate the effectiveness of the DASA block in the bottleneck layer, we replaced it with standard convolution (i.e., U-Net) or EMCA blocks, as shown in Table 8.
Table 8
Methods | DSC (%) | RV (%) | Myo (%) | LV (%) |
---|---|---|---|---|
Like U-Net | 91.90 | 89.50 | 90.18 | 96.01 |
EMCA | 92.00 | 89.92 | 90.07 | 96.01 |
DASA | 92.33† | 90.21† | 90.48† | 96.30† |
†, the best. DASA, dual-attention augmented self-attention; DSC, Dice similarity coefficient; EMCA, effective multi-scale context aggregation; LV, left ventricle; Myo, myocardium; RV, right ventricle.
Replacing the standard convolutional block in the bottleneck layer with an EMCA block improved the performance by 0.1%, demonstrating that the EMCA block is superior to the standard convolutional block. Moreover, the DASA block yields an improvement of 0.43% in DSC compared to the standard convolutional block and 0.33% compared to the EMCA block. This experiment further verifies the effectiveness of the DASA block.
In conclusion, the above four experiments indicate that embedding the DASA block in the bottleneck layer and skip connections is beneficial for constructing the global representation of convolutional feature maps while adaptively perceiving their channel and spatial relationships. This helps the model to accurately identify and segment the target, refine the edges, and distinguish the background. The weighted activation heatmap (70) (i.e., Figure 6) further supports our conclusion: (I) as shown in column 2, the introduction of DASA blocks enhances the discriminability between the background and the segmentation target while preserving the image texture; (II) columns 3 through 5 demonstrate more highlighted segmentation targets and clear boundaries, resulting in a more accurate delivery of features to the decoder.

Effect of channel number
The channel number is a key hyperparameter of EMCAH-Net that influences the model’s parameter count, complexity, and performance, as shown in Table 9. The values listed for Stage 0 through Stage 4 represent the numbers of channels in the convolutional feature maps output at each stage. For the “Large” model, the results demonstrate that increasing the number of channels further improves performance but also undoubtedly increases the computational cost. Conversely, for the “Tiny” model, reducing the number of channels decreases the computational cost, but the DSC also drops by 0.23%. Therefore, to balance computational cost and model performance, we adopt the “Base” model for all the experiments.
Table 9
Models | DSC (%) | Stage 0 | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
---|---|---|---|---|---|---|
Tiny | 92.29 | 16 | 32 | 64 | 128 | 256 |
Base | 92.33 | 32 | 64 | 128 | 256 | 512 |
Large | 92.52† | 64 | 128 | 256 | 512 | 1,024 |
†, the best. DSC, Dice similarity coefficient.
Model complexity
In this section, we compare the complexity of the proposed model with other methods in terms of parameters and floating point operations (FLOPs). Lower values of parameters and FLOPs indicate better efficiency, whereas a higher DSC value reflects better segmentation accuracy. As shown in Table 10, U-Net has 31.04 M parameters and 36.92 G FLOPs, which are 1.18 and 2.1 times those of our method, respectively. This is because its standard convolutional structure requires a higher-dimensional latent space to learn image features. In contrast, our method has 26.21 M parameters and 17.53 G FLOPs, only 25% and 71% of those of TransUNet, respectively, while its segmentation performance is clearly better. Although Swin-UNet has fewer FLOPs, its use of Transformer structures in both the encoder and decoder weakens its ability to learn local features, resulting in inferior performance compared to our method. Therefore, our hybrid architecture maintains reasonable complexity while ensuring strong performance, making it a more effective structure. From the results in Table 10, our model achieves the best trade-off between performance and model complexity in terms of parameter count and FLOPs.
Table 10
Type | Method | #Params (M) | FLOPs (G) | DSC (%) |
---|---|---|---|---|
CNNs | U-Net (7) | 31.04 | 36.92 | 76.85
 | AttnUNet (71) | 34.80 | 51.04 | 77.77
Trans | Swin-UNet (17) | 27.17 | 5.92 | 79.13
 | MissFormer (20) | 42.46 | 9.89 | 81.96
Hybrid | Trans-UNet (24) | 105.13 | 24.66 | 77.48
 | LeViT-UNet (58) | 52.17 | 25.55 | 78.53
 | EMCAH-Net (ours) | 26.21† | 17.53 | 84.73†
†, the best. CNN, convolutional neural network; CT, computed tomography; DSC, Dice similarity coefficient; EMCAH-Net, effective multi-scale context aggregation hybrid network; FLOPs, floating point operations.
Discussion
Analyzing the limitations of the model is equally critical. Models employing Transformers often face the challenge of quadratic growth in computational complexity with the number of tokens due to the self-attention mechanism. Although the proposed DASA block addresses the limitations of self-attention in effectively capturing the spatial and channel relationships of input features, it does not fully overcome the challenges posed by high computational complexity. A straightforward approach is to down-sample input images to a lower resolution, but this can lead to performance degradation. For instance, in fundus images, fine vessel endpoints are easily lost during down-sampling, increasing the difficulty of segmentation. Therefore, we plan to explore more efficient attention mechanisms and optimized down-sampling methods in future work to enhance efficiency while maintaining segmentation performance.
Conclusions
In this paper, we propose a novel hybrid architecture, EMCAH-Net, which combines the strengths of CNNs and the self-attention mechanism from Transformers, specifically tailored to the unique characteristics of medical images. The EMCA block in EMCAH-Net excels at encoding fine-grained local multi-scale features, whereas the DASA block enhances global representation learning by adaptively modeling spatial and channel relationships on convolutional feature maps. This design effectively highlights segmentation targets and distinguishes them from the background. EMCAH-Net has demonstrated its superiority in segmenting multi-scale structures, tiny features, and blurred boundaries in medical images. Extensive experiments on abdominal multi-organ, cardiac, and retinal vessel medical segmentation tasks demonstrate that EMCAH-Net outperforms previous methods, including pure CNN, pure Transformer, and hybrid architectures.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-24-1983/rc
Funding: This work was supported in part by the State Key Laboratory of Tibetan Intelligent Information Processing and Application, Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (grant No. 2023-Z-001), Shandong Provincial Natural Science Foundation (grant Nos. ZR2023MF110 and ZR2023MF037), and Shandong Women’s University High Level Scientific Research Project Cultivation Fund (grant No. 2020GSPGJ08).
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1983/coif). All authors report that this work was supported in part by the State Key Laboratory of Tibetan Intelligent Information Processing and Application, Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (grant No. 2023-Z-001), Shandong Provincial Natural Science Foundation (grant Nos. ZR2023MF110, and ZR2023MF037), and Shandong Women’s University High Level Scientific Research Project Cultivation Fund (grant No. 2020GSPGJ08). The authors have no other conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Guo H, Shi L, Liu J. An improved multi-scale feature extraction network for medical image segmentation. Quant Imaging Med Surg 2024;14:8331-46. [Crossref] [PubMed]
- Tran PV. A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv 2016;arXiv:1604.00494.
- Liu Y, Zhang L, Jiang Z. Multi-stream and multi-scale fusion rib fracture segmentation network based on UXNet. Quant Imaging Med Surg 2025;15:230-48. [Crossref] [PubMed]
- Xie A, Lin Q, He Y, Zeng X, Cao Y, Man Z, Liu C, Hao Y, Huang X. Metastasis lesion segmentation from bone scintigrams using encoder-decoder architecture model with multi-attention and multi-scale learning. Quant Imaging Med Surg 2025;15:689-708. [Crossref] [PubMed]
- Nan H, Gao Z, Song L, Zheng Q. Seg-SkiNet: adaptive deformable fusion convolutional network for skin lesion segmentation. Quant Imaging Med Surg 2025;15:867-81. [Crossref] [PubMed]
- Wang C, Wang S, Shao S, Zhai J. DeepNeXt: a lightweight polyp segmentation algorithm based on multi-scale attention. Quant Imaging Med Surg 2024;14:8551-67. [Crossref] [PubMed]
- Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Munich: Springer; 2015:234-41.
- Çiçek Ö, Abdulkadir A, Lienkamp SS, et al. 3D U-net: learning dense volumetric segmentation from sparse annotation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II. Athens: Springer; 2016:424-32.
- Xiao X, Lian S, Luo Z, Li S. Weighted Res-UNet for High-Quality Retina Vessel Segmentation. In 2018 9th international conference on information technology in medicine and education (ITME). IEEE; 2018:327-31.
- Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
- Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, et al. Unet 3+: A full-scale connected Unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020:1055-9.
- Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834-48. [Crossref] [PubMed]
- Mehta S, Rastegari M, Caspi A, Shapiro L, Hajishirzi H. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018:552-68.
- Wang P, Chen P, Yuan Y, Liu D, Huang Z, Hou X, et al. Understanding convolution for semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2018:1451-60.
- Yu F, Koltun V, Funkhouser T. Dilated residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:472-80.
- Wang Z, Ji S. Smoothed dilated convolutions for improved dense prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018:2486-95.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision. Tel Aviv: Springer; 2022:205-18.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of the International Conference on Learning Representations; 2020.
- Lin A, Chen B, Xu J, Zhang Z, Lu G, Zhang D. DS-TransUNet: Dual Swin Transformer U-Net for medical image segmentation. IEEE Trans Instrum Meas 2022;71:1-15.
- Huang X, Deng Z, Li D, Yuan X, Fu Y. MISSFormer: An effective transformer for 2D medical image segmentation. IEEE Trans Med Imaging 2023;42:1484-94. [Crossref] [PubMed]
- Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, et al. MaxViT: Multi-axis vision transformer. In: European Conference on Computer Vision. Tel Aviv: Springer; 2022:459-79.
- Wang R, Lei T, Cui R, Zhang B, Meng H, Nandi AK. Medical image segmentation using deep learning: A survey. IET Image Process 2022;16:1243-67.
- He A, Wang K, Li T, Du C, Xia S, Fu H. H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation. IEEE Trans Med Imaging 2023;42:2763-75. [Crossref] [PubMed]
- Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306; 2021.
- Zhang Y, Liu H, Hu Q. TransFuse: Fusing transformers and CNNs for medical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I. Strasbourg: Springer; 2021:14-24.
- Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, et al. UNETR: Transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2022:574-84.
- Heidari M, Kazerouni A, Soltany M, Azad R, Aghdam EK, Cohen-Adad J, et al. HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2023:6202-12.
- Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop. Strasbourg: Springer; 2021:272-84.
- Zhou D, Kang B, Jin X, Yang L, Lian X, Jiang Z, et al. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886; 2021.
- Lin X, Yan Z, Deng X, Zheng C, Yu L. ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Vancouver: Springer; 2023:642-51.
- Woo S, Park J, Lee JY, Kweon IS. Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018:3-19.
- Roy AM, Bhaduri J, Kumar T, Raj K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol Inform 2023;75:101919.
- Singh A, Raj K, Kumar T, Verma S, Roy AM. Deep learning-based cost-effective and responsive robot for autism treatment. Drones 2023;7:81.
- Roy AM, Bose R, Bhaduri J. A fast accurate fine-grain object detection model based on YOLOv4 deep neural network. Neural Comput Appl 2022;34:3895-921.
- Jamil S, Roy AM. An efficient and robust Phonocardiography (PCG)-based Valvular Heart Diseases (VHD) detection framework using Vision Transformer (ViT). Comput Biol Med 2023;158:106734. [Crossref] [PubMed]
- Holschneider M, Kronland-Martinet R, Morlet J, Tchamitchian P. A real-time algorithm for signal analysis with the help of the wavelet transform. In: Wavelets: Time-Frequency Methods and Phase Space Proceedings of the International Conference, Marseille, France, December 14–18, 1987. Springer Berlin Heidelberg; 1990:286-97.
- Xu G, Cao H, Udupa JK, Tong Y, Torigian DA. DiSegNet: A deep dilated convolutional encoder-decoder architecture for lymph node segmentation on PET/CT images. Comput Med Imaging Graph 2021;88:101851. [Crossref] [PubMed]
- Kaur R, GholamHosseini H, Sinha R, Lindén M. Automatic lesion segmentation using atrous convolutional deep neural networks in dermoscopic skin cancer images. BMC Med Imaging 2022;22:103. [Crossref] [PubMed]
- Tomar NK, Shergill A, Rieders B, Bagci U, Jha D. TransResU-Net: Transformer based ResU-Net for real-time colonoscopy polyp segmentation. arXiv preprint arXiv:2206.08985; 2022.
- He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:770-8.
- Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021:6881-6890.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021:10012-22.
- Zhang Q, Yang YB. ResT: An efficient transformer for visual recognition. Adv Neural Inf Process Syst 2021;34:15475-85.
- Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. PMLR; 2021:10347-57.
- Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015:1-9.
- Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2017. Available online: https://doi.org/10.1609/aaai.v31i1.11231
- Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:1492-500.
- Landman B, Xu Z, Igelsias J, Styner M, Langerak T, Klein A. MICCAI multi-atlas labeling beyond the cranial vault–workshop and challenge. In: Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge; 2015;5:12.
- Bernard O, Lalande A, Zotti C, Cervenansky F, Yang X, Heng PA, et al. Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans Med Imaging 2018;37:2514-25. [Crossref] [PubMed]
- Staal J, Abràmoff MD, Niemeijer M, Viergever MA, van Ginneken B. Ridge-based vessel segmentation in color images of the retina. IEEE Trans Med Imaging 2004;23:501-9. [Crossref] [PubMed]
- Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
- Sun G, Pan Y, Kong W, Xu Z, Ma J, Racharak T, Nguyen LM, Xin J. DA-TransUNet: integrating spatial and channel dual attention with transformer U-net for medical image segmentation. Front Bioeng Biotechnol 2024;12:1398237. [Crossref] [PubMed]
- Zhang K, Liu D. Customized segment anything model for medical image segmentation. arXiv preprint. 2023;arXiv:2304.13785.
- Ruan J, Xiang S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint. 2024;arXiv:2402.02491.
- Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2023:4015-26.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:2881-90.
- Zhou HY, Guo J, Zhang Y, Han X, Yu L, Wang L, Yu Y. nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer. IEEE Trans Image Process 2023;32:4036-45. [Crossref] [PubMed]
- Xu G, Zhang X, He X, Wu X. LeViT-UNet: Make faster encoders with transformer for medical image segmentation. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Singapore: Springer Nature Singapore; 2023:42-53.
- Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L. CvT: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021:22-31.
- Zhang Y, He M, Chen Z, Hu K, Li X, Gao X. Bridge-Net: Context-involved U-net with patch-based loss weight mapping for retinal blood vessel segmentation. Expert Syst Appl 2022;195:116526.
- Wu Y, Xia Y, Song Y, Zhang Y, Cai W. NFN+: A novel network followed network for retinal vessel segmentation. Neural Networks 2020;126:153-62. [Crossref] [PubMed]
- Wang B, Wang S, Qiu S, Wei W, Wang H, He H. CSU-Net: A Context Spatial U-Net for Accurate Blood Vessel Segmentation in Fundus Images. IEEE J Biomed Health Inform 2021;25:1128-38. [Crossref] [PubMed]
- Zou B, Dai Y, He Q, Zhu C, Liu G, Su Y, Tang R. Multi-Label Classification Scheme Based on Local Regression for Retinal Vessel Segmentation. IEEE/ACM Trans Comput Biol Bioinform 2021;18:2586-97. [Crossref] [PubMed]
- Alom MZ, Yakopcic C, Hasan M, Taha TM, Asari VK. Recurrent residual U-Net for medical image segmentation. J Med Imaging (Bellingham) 2019;6:014006. [Crossref] [PubMed]
- Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2019;38:2281-92. [Crossref] [PubMed]
- Wu H, Wang W, Zhong J, Lei B, Wen Z, Qin J. SCS-Net: A Scale and Context Sensitive Network for Retinal Vessel Segmentation. Med Image Anal 2021;70:102025. [Crossref] [PubMed]
- Shi T, Boutry N, Xu Y, Geraud T. Local Intensity Order Transformation for Robust Curvilinear Object Segmentation. IEEE Trans Image Process 2022;31:2557-69. [Crossref] [PubMed]
- Mou L, Zhao Y, Chen L, Cheng J, Gu Z, Hao H, et al. CS-Net: Channel and spatial attention network for curvilinear structure segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I. Shenzhen: Springer International Publishing; 2019:721-30.
- Li C, Li Z, Yu F, Liu W. An improved method for retinal vessel segmentation in U-Net. Multimedia Tools and Applications 2024;1-19.
- Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision; 2017:618-26.
- Schlemper J, Oktay O, Schaap M, Heinrich M, Kainz B, Glocker B, Rueckert D. Attention gated networks: Learning to leverage salient regions in medical images. Med Image Anal 2019;53:197-207. [Crossref] [PubMed]