A dual-branch network for lesion segmentation in medical images using state space models

Hao Chen; Byung-Won Min; Haifei Zhang

doi:10.21037/qims-2025-576

Original Article

A dual-branch network for lesion segmentation in medical images using state space models

Hao Chen^1,2 , Byung-Won Min², Haifei Zhang¹

¹School of Information Engineering, Nantong Institute of Technology, Nantong, China; ²Division of Information and Communication Convergence Engineering, Mokwon University, Daejeon, Republic of Korea

Contributions: (I) Conception and design: H Chen; (II) Administrative support: BW Min; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: H Chen, H Zhang; (V) Data analysis and interpretation: H Chen; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Hao Chen, PhD. School of Information Engineering, Nantong Institute of Technology, 211 Yongxing Road, Chongchuan District, Nantong 226002, China; Division of Information and Communication Convergence Engineering, Mokwon University, Daejeon, Republic of Korea. Email: chenhao@ntit.edu.cn.

Background: Lesion segmentation in medical images is crucial for clinical diagnosis and treatment planning. However, existing methods often struggle to effectively extract both local and global features, limiting segmentation accuracy. To address this challenge, we propose a dual-branch network that integrates state space models (SSMs) with deep convolutional networks to enhance the extraction of both local and global features, thus improving lesion segmentation performance.

Methods: The proposed model employs a dual-branch encoder: one branch incorporates the visual state space encoder to efficiently model long-range contextual dependencies, while the other branch, based on the residual network, extracts hierarchical local features. To refine feature representation, we introduce a lightweight multi-scale depth-wise separable convolution block, ensuring adaptability to varying lesion sizes while maintaining computational efficiency. The fused features are processed by the decoder for high-precision segmentation.

Results: Extensive experiments on the Kaggle_3M and Kvasir-SEG datasets demonstrated that the proposed model outperformed existing state-of-the-art models. Specifically, it achieved a dice similarity coefficient (Dice) of 0.9140 and a false negative rate (FNR) of 0.0800 on Kaggle_3M dataset, and a Dice of 0.9173 and an FNR of 0.0788 on Kvasir-SEG dataset. Compared to other models, our model delivered superior quantitative results and visual segmentation performance. In addition, when trained on Kvasir-SEG and tested on two external datasets, our model demonstrated superior cross-dataset generalization.

Conclusions: The proposed model integrates SSMs and deep convolutional networks to improve lesion segmentation by effectively capturing both local and global features. It offers new insights for medical image segmentation with potential clinical applications.

Keywords: Lesion segmentation; state space models (SSMs); visual state space (VSS); long-range dependencies

Submitted Mar 07, 2025. Accepted for publication Sep 25, 2025. Published online Nov 14, 2025.

doi: 10.21037/qims-2025-576

Introduction

Medical image segmentation plays a crucial role in disease diagnosis and treatment. For example, deep learning has been applied to brain tumor magnetic resonance imaging for feature extraction and segmentation, showing its effectiveness in clinical decision-making (1). In recent years, deep learning, particularly convolutional neural network (CNN), has become the mainstream method in this field (2,3). The classic U-Net, with its efficient encoder-decoder structure and skip connections, has been widely applied to medical image segmentation (4,5). Recent studies have proposed various U-Net-based approaches to further improve medical image segmentation performance (6,7). However, CNNs still face limitations in multi-scale feature fusion and long-range dependency modeling, making it challenging to fully capture global contextual information.

To address this issue, Transformer-based models utilize self-attention mechanisms to effectively capture long-range dependencies, which enhances global context modeling (8). However, the computational complexity of Transformer models increases significantly with the input image size, limiting their application in large-scale medical image segmentation tasks (9).

To address the limitations of CNNs and Transformer, state space model (SSM)-based methods (10) have gained increasing attention. With its exceptional global modeling capabilities and efficient computation, SSM has emerged as a promising alternative for modeling long-range dependencies and capturing spatial context. SSM can effectively handle long-range dependencies while outperforming traditional CNNs and Transformer in computational efficiency. In recent years, SSM-based image segmentation methods have achieved significant success in medical and remote sensing images, demonstrating their potential for precise segmentation and efficient computation (11-13).

Even though existing methods have partially addressed the limitations of CNNs and Transformer, effectively capturing and integrating both global and local information, while balancing computational complexity, remains a major challenge in improving segmentation performance. To tackle this challenge, this paper proposes a novel dual-branch network-based medical image lesion segmentation model. The specific contributions are as follows:

A dual-branch encoder is designed in which residual network (ResNet) extracts hierarchical semantic features and visual state space (VSS) encoder captures long-range dependencies, enabling complementary local-global feature integration for lesion segmentation.
The VSS encoder is introduced into medical image segmentation, achieving efficient global context modeling with lower complexity compared to Transformer-based approaches.
A lightweight multi-scale depth-wise separable convolution block (LM-DSCB) is proposed. It combines depth-wise separable convolution (DSConv) with multiple dilation rates and multi-scale pooling to capture lesions of varying sizes and morphologies, while maintaining computational efficiency.

The rest of the paper is organized as follows: Related work section introduces the related work, Methods section describes the proposed method in detail, Results section presents the experiments and results, Discussion section provides a discussion, and Conclusions section concludes the paper.

Related work

U-Net (4) is a classic encoder-decoder architecture that is widely used in medical image segmentation. However, U-Net has limitations when handling multi-scale feature fusion. To address this, U-Net 3+ (14) effectively combines low-level details and high-level semantic information from different scales through full-scale skip connections, further improving segmentation accuracy. To tackle the issue of spatial information loss, CE-Net (15) introduces a dense atrous convolution block and a residual multi-kernel pooling block, enhancing the ability to extract high-level features. Researchers have found that incorporating attention mechanisms into U-Net helps the model focus on key features, which improves segmentation accuracy (16-18). By using self-attention mechanisms and multi-scale feature fusion, the traditional fusion methods have been improved, successfully capturing rich contextual information. To enhance the adaptability of the attention mechanism, channel prior convolutional attention (19) dynamically allocates attention weights in both the channel and spatial dimensions, further improving the extraction of spatial relationships. To overcome the limitations of the dual-branch network architecture in real-time semantic segmentation, PIDNet (20) proposes a three-branch network architecture and introduces the concept of proportional-integral-derivative control. It effectively guides the fusion of detail and contextual information through a boundary attention mechanism. However, traditional CNNs, which primarily learn local features, struggle to capture long-range dependencies and incorporate global context.

To overcome these limitations, Transformer-based models have shown superior performance in medical image segmentation. Transunet is one of the first models to introduce Transformer into medical image segmentation. By incorporating the self-attention mechanism of Transformer into the U-Net architecture, it improves the extraction of global features (8). Swin-unet (21) further enhances the application of Transformer in medical image segmentation. It combines the learning of both local and global features, overcoming the limitations of CNNs in capturing global semantic information. Scribformer (22) introduces a three-branch structure, combining CNNs, Transformer, and an attention-guided class activation map branch, which further strengthens the synergy between local and global features. LM-Net (23) addresses the issue of blurry segmentation boundaries in medical image segmentation by combining local and global feature transformers. AFC-Unet (24), built on the fusion of CNNs and Transformer, employs full-scale feature block fusion and pyramid sampling modules. However, Transformer-based models face the challenge of increased computational complexity as the input size grows, leading to a heavier computational burden.

Long-range dependency modeling and computational complexity have always been challenges in medical image segmentation. Traditional CNNs have limitations in modeling long-range dependencies, and although Transformer have global modeling capabilities, their computational complexity grows quadratically. To overcome these issues, SSM has gradually shown their advantages. Mamba (10) has advanced the development of SSM by introducing a data-dependent SSM layer. Vision Mamba (25) and Vmambair (26) further applied SSM to the visual domain, achieving significant results. Vm-unet (11) is the first medical image segmentation network based on SSM. It captures long-range dependency information through the VSS block and uses an asymmetric encoder-decoder structure to reduce the number of convolutional layers, thus lowering computational complexity. VM-UNetV2 (27) combines the VSS block with the semantics and detail infusion method to further enhance segmentation performance. RS³Mamba (12) provides global information to the convolution-based main branch using VSS block, improving remote sensing image segmentation accuracy. Segmamba (28) addresses computational issues in three-dimensional medical image segmentation by efficiently capturing long-range dependencies and modeling them at different scales using VSS block. H-vmunet (29) designs the high-order two-dimensional-selective-scan (SS2D) and local-SS2D modules to enhance the modeling of both global and local features. Based on these works, this paper proposes a dual-branch feature enhancement extraction network based on SSM, which effectively captures extensive contextual information for medical image segmentation.

Methods

Dual-branch network architecture

This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The proposed segmentation model adopts a dual-branch encoder architecture, as illustrated in Figure 1, and consists of four stages. Stage 1 employs a ResNet stem and encoder to capture local structural features in a hierarchical manner. Stage 2 introduces a parallel VSS stem and encoder to model global contextual dependencies through long-range information propagation. Stage 3 fuses the complementary local and global features from both encoders and further refines them through a LM-DSCB for feature enhancement. Finally, Stage 4 uses a decoder to progressively restore spatial resolution, and the segmentation head produces the final pixel-wise predictions.

Figure 1 Dual-branch network architecture. BN, batch normalization; Conv, convolution; ReLU, rectified linear unit; ResNet, residual network; VSS, visual state space.

SS2D

SSM has been introduced into visual tasks, where the input $x (t) \in ℝ$ is projected to the output $y (t) \in ℝ$ through an intermediate hidden state $h (t) \in ℝ^{N}$ . SSM can be represented using a system of linear ordinary differential equations:

$h^{'} (t) = A h (t) + B x (t)$ [1]

$y (t) = C h (t)$ [2]

where N is the hidden dimension, $h^{'} (t)$ denotes its derivative with respect to time, $A \in ℝ^{N \times N}$ is the state matrix, and $B \in ℝ^{N \times 1}$ and $C \in ℝ^{1 \times N}$ are projection parameters.

To apply SSM in discrete settings, the ordinary differential equations must be discretized. This process involves introducing a timescale parameter $Δ$ and transforming $A$ and $B$ into discrete counterparts $\bar{A}$ and $\bar{B}$ using a fixed discretization rule. The zero-order hold method is adopted for this transformation, defined as:

$\bar{A} = e^{Δ A}$ [3]

$\bar{B} = {(Δ A)}^{- 1} (e^{Δ A} - 1) \cdot Δ B$ [4]

Figure 2 illustrates the SS2D process, which serves as the core computational unit of the VSS encoder. It comprises three key components: scan expansion, an S6 block, and scan merging. First, the scan expansion operation unfolds the input image into sequences along four different directions. Then, these sequences are processed through the S6 block, which extracts features while ensuring comprehensive scanning in each direction to capture diverse information. Finally, the scan merging operation aggregates and combines the processed sequences, restoring the output image to the same spatial dimensions as the input. The overall process can be mathematically formulated as follows:

$x_{v} = E x p a n s i o n (x, v)$ [5]

$\bar{x_{v}} = S 6 (x_{v})$ [6]

$\bar{x} = Merging ({\bar{x}}_{1}, {\bar{x}}_{2}, {\bar{x}}_{3}, {\bar{x}}_{4})$ [7]

Figure 2 SS2D process. SS2D, two-dimensional-selective-scan.

Here, $x$ denotes the input feature map, and $Expansion (\cdot)$ unfolds $x$ into directional sequences $x_{v}$ , where $v \in {1, 2, 3, 4}$ indicates the four scanning directions. The operator $S6 (\cdot)$ processes each sequence to produce direction-specific representations $\bar{x_{v}}$ . Finally, $Merging (\cdot)$ aggregates the four directional outputs and reconstructs the feature map $\bar{x}$ .

Dual-branch encoder architecture

The detailed workflow of the dual-branch encoder is illustrated in Figure 3. The input image is first processed by two parallel stems: the ResNet stem produces the feature map $X_{0} \in ℝ^{H / 4 \times W / 4 \times 64}$ , and the VSS stem generates the feature map $X_{1} \in ℝ^{H / 2 \times W / 2 \times 48}$ . The ResNet encoder takes $X_{0}$ as input and progressively extracts local structural features through four convolutional stages, producing ${r_{1}, r_{2}, r_{3}, r_{4}}$ . In parallel, the VSS encoder processes $X_{1}$ , yielding five feature maps ${v_{0} {,v}_{1}, v_{2}, v_{3}, v_{4}}$ . At each level, the ResNet and VSS features are fused to obtain ${{rv}_{1}, {rv}_{2}, {rv}_{3}, {rv}_{4}}$ , which integrate the local features captured by the ResNet encoder and the global dependencies modeled by the VSS encoder. This fusion process is mathematically expressed as follows:

${rv}_{1} = [r_{1}, v_{1}]$ [8]

${rv}_{2} = [r_{2}, v_{2}]$ [9]

${rv}_{3} = [r_{3}, v_{3}]$ [10]

${rv}_{4} = [r_{4}, v_{4}]$ [11]

Figure 3 Encoder architecture. ResNet, residual network; VSS, visual state space.

These fused multi-scale representations provide complementary information for subsequent feature enhancement.

LM-DSCB

To effectively capture multi-scale features and reduce computational cost, this paper proposes LM-DSCB. Figure 4 illustrates the structure of LM-DSCB. This block integrates DSConv and multi-scale pooling to efficiently extract features, enhance the model’s ability to perceive information at different scales, and simultaneously reduce computational complexity.

Figure 4 LM-DSCB architecture. Conv, convolution; DSConv, depthwise separable convolution; DWConv, depthwise convolution; LM-DSCB, lightweight multi-scale depth-wise separable convolution block; PWConv, pointwise convolution.

The input feature map $rv 4$ of LM-DSCB originates from the final fused features of the ResNet and VSS encoder. $rv 4$ integrates deep global and local semantic information, providing rich feature representation for subsequent convolution operations. The LM-DSCB consists of three sets of DSConv units and two pooling operations. Each convolution unit includes a depthwise convolution (DWConv) and a pointwise convolution (PWConv). The first set adopts a 3×3 DWConv and a 1×1 PWConv, while the second and third sets use dilated 3×3 DWConv along with 1×1 PWConv. This hierarchical convolution design captures features at different receptive fields, enhancing the network’s multi-scale perception capability.

DWConv reduces computation by performing convolution independently on each input channel, while PWConv fuses the output of each channel using a 1×1 convolution operation. Let $m \in ℝ^{H \times W \times C_{in}}$ denotes the input feature map with height $H$ , width $W$ , and $C_{in}$ channels. DWConv applies a separate $K \times K$ kernel $ω_{d} \in ℝ^{K \times K \times C_{in}}$ to each channel $c \in {1, ..., C_{in}}$ , producing:

$m^{'} = m_{c} * ω_{d, c}, f o r c \in {1, \dots, C_{i n}}$ [12]

where $*$ represents the convolution operation, and $ω_{d,c}$ is the $K \times K$ convolution kernel applied to the input channel $c$ . The output $m^{'} \in ℝ^{H \times W \times C_{i n}}$ is processed using a 1×1 convolution kernel $ω_{f} \in ℝ^{1 \times 1 \times C_{in} \times C_{out}}$ to obtain the final output $p \in ℝ^{H \times W \times C_{o u t}}$ , which is expressed as:

$p = m^{'} * ω_{f}$ [13]

where $ω_{f}$ is the convolution kernel for the PWConv, and $C_{out}$ is the number of output channels.

Figure 4 shows the detailed configuration of the three sets of DSConv units. Each set of convolution operations processes the input features with different receptive fields to extract multi-scale information. The formula for calculating the receptive field of dilated convolution is as follows:

$K_{r f} = K_{s i z e} + (K_{s i z e} - 1) \cdot (d - 1)$ [14]

where $K_{size}$ represents the size of the dilated convolution kernel, d denotes the dilation rate, and $K_{rf}$ represents the kernel size of the standard convolution that corresponds to the same receptive field as the dilated convolution. Additionally, to further enhance the model’s ability to extract multi-scale features, two pooling operations are introduced: the first is 2×2 max pooling, and the second is 3×3 max pooling.

After the features are processed by DSConv and pooling, the multiple feature maps ${g_{1}, g_{2}, ..., g_{n}}$ are concatenated along the channel dimension by the operator $concat (\cdot)$ to obtain a fused representation. The resulting feature map is denoted by $p^{'}$ . The mathematical expression for the concatenation operation is as follows:

$p^{'} = c o n c a t (g_{1}, g_{2}, \dots, g_{n})$ [15]

The concatenation operation is performed along the channel dimension, and the concatenated feature maps undergo a channel-wise feature transformation using a 1×1 convolution to enhance information interaction and adjust the number of channels.

Results

Datasets

Kaggle_3M (30): this dataset contains preoperative brain magnetic resonance images from 110 patients with lower-grade gliomas, sourced from The Cancer Genome Atlas and hosted by the cancer imaging archive. The dataset includes fluid-attenuated inversion recovery sequences, which are preferred for the assessment of lower-grade gliomas due to their sensitivity in delineating tumor infiltration and edema in non-enhancing tumors. Unlike high-grade gliomas, lower-grade gliomas rarely show contrast enhancement, making fluid-attenuated inversion recovery a more suitable modality for preoperative imaging. We manually annotated fluid-attenuated inversion recovery hyperintensity regions to create segmentation masks. The images were divided into 1,095 training, 137 validation, and 137 test images, each with a resolution of 256×256 pixels (px), with over 20 slices per patient.

Kvasir-SEG (31): this dataset contains 1,000 polyp images with resolutions ranging from 332×487 to 1,920×1,072 px. It was divided into 800 training images, 100 validation images, and 100 test images. The dataset is designed for research on polyp detection, segmentation, and classification, and focuses specifically on polyps.

These datasets were chosen to assess the model’s performance across varied clinical scenarios. The Kaggle_3M dataset, focused on brain tumor segmentation from magnetic resonance images, presented challenges due to complex lesion structures. The Kvasir-SEG dataset, featuring colon polyps from colonoscopy, offered a distinct lesion type and imaging modality. Together, they provided a comprehensive evaluation of the model’s generalizability and robustness. Furthermore, we followed common practice in polyp segmentation literature by training on Kvasir-SEG and testing the trained models on two external datasets, CVC-ColonDB (32) and ETIS (33), to evaluate cross-dataset generalization.

Experimental details

This paper compared U-Net (4), CE-Net (15), Ma-net (18), Transunet (8), PIDNet (20), Vm-unet (11), RS³Mamba (12), and the proposed model. The experiments were conducted on an RTX 4090D (24 GB) with Ubuntu 22.04, Python 3.10, Cuda 11.8, and PyTorch 2.1.2.

The models were optimized using the Adam optimizer with an initial learning rate of 0.0001, and the learning rate was adjusted with cosine annealing. The training process lasted 400 epochs, with an input size of 256×256 for training, validation, and testing. In this work, ResNet-34 is chosen as one of the encoders because it offers a favorable trade-off between accuracy and efficiency and has been widely validated in medical image segmentation tasks.

For the experiments on Kaggle_3M and Kvasir-SEG, segmentation performance was assessed using false negative rate (FNR), dice similarity coefficient (Dice), and mean surface distance (MSD). FNR reflects missed lesion px, Dice measures region overlaps, and MSD evaluates boundary accuracy. In addition, frames per second (FPS) (34) was reported to indicate inference speed. For the cross-dataset validation on CVC-ColonDB and ETIS, we followed common practice in polyp segmentation research (35,36) and reported mean FNR and mean Dice to enable consistent benchmarking across datasets.

Comparative experiments

On the Kaggle_3M dataset, this paper provided a comprehensive evaluation of the proposed model and compared it with several mainstream segmentation models. Table 1 presented the quantitative metrics of different models on this dataset. The experimental results showed that the proposed model outperformed all other models in all evaluation metrics. In contrast, Ma-net and Transunet exhibited more stable segmentation performance, but their Dice values were lower than those of ours. Moreover, compared to U-Net, it showed a 1.82% reduction in FNR and a 0.98% improvement in Dice.

Table 1

Comparative experimental results on the Kaggle_3M dataset

Method	FNR	Dice
U-Net	0.0982	0.9042
CE-Net	0.0945	0.9081
Ma-net	0.0924	0.9133
Transunet	0.0885	0.9127
PIDNet	0.1171	0.8903
Vm-unet	0.1011	0.9019
RS³Mamba	0.1004	0.9043
Ours	0.0800	0.9140

Dice, dice similarity coefficient; FNR, false negative rate.

As shown in Figure 5, our proposed method achieved the lowest MSD of 2.12 px, outperforming all compared baselines including U-Net (2.73 px), CE-Net (2.32 px), and Transunet (2.65 px). This indicated a more accurate boundary alignment between the predicted and ground truth lesion contours. In contrast, methods such as PIDNet and Vm-unet exhibited higher MSD, suggesting less precise segmentation boundaries.

Figure 5 Comparison of MSD among different methods. MSD, mean surface distance; px, pixels.

Figure 6 illustrated the visual segmentation results of different models on the Kaggle_3M dataset. Due to the diverse morphological characteristics of lesion regions in this dataset, some models exhibited over-segmentation or under-segmentation. In contrast, our model accurately identified lesion regions and produced complete segmentation results, demonstrating a high degree of alignment with the ground truth. CE-Net showed instances of lesion omission in certain samples, whereas Ma-net and Transunet occasionally suffered from over-segmentation, leading to the misclassification of non-lesion areas. Overall, the proposed model achieved clearer boundaries and more comprehensive lesion segmentation, further confirming its effectiveness in medical image segmentation tasks.

Figure 6 Visual comparisons on the Kaggle_3M dataset. (A) Origin image. (B) Ground truth. (C) Ours. (D) U-Net. (E) CE-Net. (F) Ma-net. (G) Transunet. (H) PIDNet. (I) Vm-unet. (J) RS³Mamba.

The quantitative evaluation results on the Kvasir-SEG dataset were shown in Table 2. The proposed model achieved the best performance across all evaluation metrics, with a Dice of 0.9173 and an FNR of only 0.0788. In comparison, CE-Net, the best-performing baseline, achieved a Dice of 0.8879, which is still 2.94% lower than the proposed model. Additionally, U-Net achieved a Dice of 0.8112, which is significantly lower than that of the proposed model. Overall, the proposed model demonstrated the best balance among all models, not only improving Dice but also significantly reducing FNR, enhancing segmentation accuracy.

Table 2

Comparative experimental results on the Kvasir-SEG dataset

Method	FNR	Dice
U-Net	0.1837	0.8112
CE-Net	0.1120	0.8879
Ma-net	0.1364	0.8763
Transunet	0.2531	0.7873
PIDNet	0.3093	0.7448
Vm-unet	0.2657	0.8009
RS³Mamba	0.3372	0.7201
Ours	0.0788	0.9173

Dice, dice similarity coefficient; FNR, false negative rate.

Figure 7 presented the visual segmentation results of different models on the Kvasir-SEG dataset. The segmentation results of our model exhibited the closest resemblance to the ground truth, accurately capturing lesion boundaries while effectively reducing mis segmentation. In contrast, U-Net, Ma-net, Transunet, PIDNet, Vm-unet, and RS³Mamba demonstrated varying degrees of boundary blurring and incomplete lesion segmentation, particularly in cases involving small or complex-shaped lesions. CE-Net performed relatively well in recovering lesion regions but still exhibited certain mis segmentation artifacts. Overall, the proposed model achieved more precise boundary delineation and more complete lesion segmentation, further validating its effectiveness in medical image segmentation tasks.

Figure 7 Visual comparisons on the Kvasir-SEG dataset. (A) Origin image. (B) Ground truth. (C) Ours. (D) U-Net. (E) CE-Net. (F) Ma-net. (G) Transunet. (H) PIDNet. (I) Vm-unet. (J) RS³Mamba.

Figure 8 showed the inference speed of the comparison models on two datasets. On both the Kaggle_3M and Kvasir-SEG datasets, the proposed model achieved an inference speed of 78 FPS, outperforming models such as CE-Net, Transunet, and RS³Mamba. Although U-Net has the highest inference speed, it maintained a relatively high speed while still demonstrating strong segmentation accuracy.

Figure 8 Comparative experiments: inference speed of models across different datasets. FPS, frames per second.

To further examine the robustness of the proposed method, we extended the evaluation beyond the training domain. Specifically, the models were trained on the Kvasir-SEG dataset and subsequently applied to two additional public datasets, CVC-ColonDB and ETIS. The outcomes are summarized in Table 3. Across both benchmarks, our approach produced the lowest mean FNR together with the highest mean Dice. Compared with widely used segmentation frameworks such as U-Net, Ma-net, Transunet, and Vm-unet, the proposed method maintained more stable performance when transferred to unseen data. The results showed that our model worked well on the training data and still performed reliably on new datasets.

Table 3

Comparative experimental results on the two datasets

Dataset	Method	Mean FNR	Mean Dice
CVC-ColonD	U-Net	0.3000	0.7337
	Ma-net	0.2700	0.7926
	Transunet	0.2519	0.6955
	Vm-unet	0.2621	0.7565
	Ours	0.2403	0.8175
ETIS	U-Net	0.2094	0.7462
	Ma-net	0.1650	0.8451
	Transunet	0.2081	0.7494
	Vm-unet	0.2671	0.7775
	Ours	0.1446	0.8909

Dice, dice similarity coefficient; FNR, false negative rate.

Ablation experiments

The results of the ablation experiments on the Kaggle_3M dataset were presented in Table 4, with the corresponding segmentation effects shown in Figure 9. The experiments indicated that when using ResNet alone as the encoder, the Dice reached 0.9107. In contrast, using VSS encoder alone resulted in lower Dice of 0.8911, suggesting that VSS encoder alone is less effective than ResNet encoder in recognizing target regions. Building on this, the basic dual-branch fusion method enhanced segmentation performance. It increased Dice to 0.9127, confirming the effectiveness of multi-branch feature extraction. Furthermore, incorporating LM-DSCB for feature enhancement further improved the Dice to 0.9140 and reduced the FNR to 0.0800, demonstrating more accurate and complete lesion delineation. The actual segmentation results in Figure 9 further supported this conclusion, as it generated finer segmentation boundaries and more complete target regions.

Table 4

Results of ablation experiments on Kaggle_3M dataset

Method	FNR	Dice
ResNet encoder	0.0794	0.9107
VSS encoder	0.0986	0.8911
Basic fusion	0.0861	0.9127
Ours	0.0800	0.9140

Dice, dice similarity coefficient; FNR, false negative rate; ResNet, residual network; VSS, visual state space.

Figure 9 Visual comparisons on the Kaggle_3M dataset. (A) Origin image. (B) Ground truth. (C) Ours. (D) ResNet encoder. (E) VSS encoder. (F) Basic fusion. ResNet, residual network; VSS, visual state space.

The results of the ablation experiments on the Kvasir-SEG dataset were shown in Table 5, with the corresponding segmentation effects presented in Figure 10. The results indicated a significant performance gap between using ResNet encoder and VSS encoder alone. When VSS was used as the encoder, its performance was poor, with a Dice of 0.7647 and an FNR as high as 0.3127. After applying the basic dual-branch fusion method, Dice improved to 0.8882, demonstrating the effectiveness of multi-branch feature fusion. Furthermore, incorporating LM-DSCB enhanced the proposed model’s performance, increasing Dice to 0.9173 and reducing FNR to 0.0788, achieving the best results. The segmentation comparison in Figure 10 further supports this conclusion, showing that the proposed model achieves more precise lesion segmentation with higher completeness of target regions.

Table 5

Results of ablation experiments on Kvasir-SEG dataset

Method	FNR	Dice
ResNet encoder	0.1440	0.8763
VSS encoder	0.3127	0.7647
Basic fusion	0.0947	0.8882
Ours	0.0788	0.9173

Dice, dice similarity coefficient; FNR, false negative rate; ResNet, residual network; VSS, visual state space.

Figure 10 Visual comparisons on the Kvasir-SEG dataset. (A) Origin image. (B) Ground truth. (C) Ours. (D) ResNet encoder. (E) VSS encoder. (F) Basic fusion. ResNet, residual network; VSS, visual state space.

To further validate the effectiveness of the proposed LM-DSCB module, we replaced it with atrous spatial pyramid pooling (37), a widely used multi-scale feature extraction module. The results were shown in Table 6. LM-DSCB achieved a higher Dice and a lower FNR, demonstrating its superior capability in capturing lesions of varying sizes and morphologies compared with conventional multi-scale approaches.

Table 6

Comparison of LM-DSCB and atrous spatial pyramid pooling

Module	FNR	Dice
Atrous spatial pyramid pooling	0.0902	0.8746
LM-DSCB (ours)	0.0788	0.9173

Dice, dice similarity coefficient; FNR, false negative rate; LM-DSCB, lightweight multi-scale depth-wise separable convolution block.

Discussion

This paper verifies the effectiveness of the proposed model in medical image lesion segmentation through both quantitative and qualitative analyses. It achieves the best performance in terms of MSD, Dice, and FNR. The ablation experiments demonstrate that the dual-branch encoder, feature fusion strategy, and LM-DSCB all play a critical role in improving the final segmentation results. Moreover, the visual segmentation results show a high degree of consistency between its outputs and the ground truth. It accurately identifies lesion regions while effectively preventing over-segmentation and under-segmentation. This capability is crucial in clinical applications, as it reduces the risk of missing subtle lesions where diagnostic sensitivity matters most.

As a classic segmentation network, U-Net restores spatial image information through an encoder-decoder structure. However, due to the locality of convolution operations, it struggles to effectively capture global context and long-range dependencies. Experimental results indicate that U-Net exhibits certain limitations in segmenting specific lesion regions. CE-Net enhances contextual modeling by incorporating dense atrous convolution and a residual multi-kernel pooling block. Ma-net improves local and global feature integration through self-attention mechanisms and multi-scale feature fusion. However, their ability to model long-range dependencies remains limited. Transunet combines the global modeling capability of Transformer with the precise localization of U-Net. On the Kaggle_3M dataset, Transunet demonstrates superior performance. However, its computational complexity increases significantly when handling high-resolution images. In contrast, Vm-unet and RS³Mamba enhance long-range modeling capabilities through VSS model while reducing computational costs. Nevertheless, their segmentation performance on our datasets does not surpass that of Transunet. Additionally, PIDNet mitigates the loss of high-resolution information during feature fusion through a three-branch network architecture. However, it still faces challenges in recognizing lesions with complex morphologies. Compared to these models, our method integrates long-range dependency modeling with lightweight computation and enhanced feature fusion, which enhances its ability to segment lesions with irregular shapes, blurred boundaries, and varying scales. Nonetheless, the performance varies across datasets. The model achieves better results on the Kvasir-SEG dataset than on the Kaggle_3M dataset, due to the more distinct boundaries and consistent visual patterns in colonoscopy images. In contrast, magnetic resonance images often contain fragmented or low-contrast lesions, which remain more challenging for accurate segmentation.

In addition to accuracy, our model also demonstrates competitive efficiency. The proposed model achieves an inference speed of 78 FPS on a single NVIDIA GeForce RTX 4090D GPU. The computational complexity is evaluated with one 256×256 input images, yielding 19.3 giga floating-point operations and 34.4M parameters. When the LM-DSCB is removed, the computational cost decreases slightly by 0.3 giga floating-point operations, while the number of parameters remains nearly unchanged. This observation indicates that LM-DSCB is lightweight, introducing only negligible overhead. Compared with the baseline U-Net, our method introduces only a moderate increase in parameters while significantly improving computational efficiency.

Beyond the within-dataset experiments, we also carried out cross-dataset validation, where the model was trained on Kvasir-SEG and evaluated on CVC-ColonDB and ETIS. As reported in Table 3, the proposed method achieved higher Dice while reducing FNR in comparison with typical baseline networks. These results suggest that the architecture performs reliably not only on the training data but also under domain variations, which is important for practical clinical use.

To assess whether the observed performance improvements are statistically significant, we conducted Wilcoxon signed-rank tests on the Kvasir-SEG dataset using the Dice values obtained from 9-fold cross-validation. Our method was compared with U-Net and CE-Net. The resulting probability values are 0.0039 and 0.0273, respectively, both indicating statistically significant improvements. We also reported the mean and standard deviation of the fold-wise Dice differences to further illustrate the stability of the improvements. The detailed results were summarized in Table 7.

Table 7

Wilcoxon signed-rank test results for Dice values

Comparison	Mean ± standard deviation	Probability value
Ours and U-Net	0.095±0.015	0.0039
Ours and CE-Net	0.019±0.019	0.0273

Dice, dice similarity coefficient.

Our approach achieves promising results overall. However, its performance can still be influenced by data complexity (38), and concerns regarding privacy preservation remain central to clinical deployment. The model shows good capability in handling small lesions and those with moderate irregularity, yet it encounters difficulties when boundaries are highly fragmented or when lesion shapes become extremely irregular. In such cases, under-segmentation along lesion edges or missed detection of very small targets may occur. While the results obtained on the Kaggle datasets are encouraging, clinical evidence is still insufficient. Broader validation across different imaging modalities and disease types is necessary to establish robustness, with particular emphasis on lesion heterogeneity in future evaluations.

Conclusions

This paper proposes an innovative dual-encoder network architecture for medical image lesion segmentation. The architecture combines the advantages of ResNet encoder and VSS encoder, effectively capturing both local semantic details and global contextual dependencies. The ResNet encoder extracts hierarchical local features, while the VSS encoder models long-range dependencies for global context. After fusing the features from both encoders, complementary information is fully leveraged to obtain richer feature representations. To further enhance feature processing, a LM-DSCB is introduced, combining DSConv with multi-scale pooling operations to improve lesion detection at different scales. Experimental results on the Kaggle_3M and Kvasir-SEG datasets demonstrate that the proposed model achieves outstanding segmentation performance and competitive efficiency. Moreover, when trained on Kvasir-SEG and tested on two external datasets (CVC-ColonDB and ETIS), the model shows superior cross-dataset generalization, further confirming its robustness.

Future work will emphasize few-shot learning to strengthen performance in limited-data settings, supported by transfer learning and data augmentation. To address irregular boundaries and small structures, edge-aware and multi-scale feature fusion approaches will be investigated. Multi-center studies across varied imaging modalities and patient cohorts will be carried out to validate clinical applicability, while efforts toward workflow integration and regulatory compliance will further promote real-world deployment.

Acknowledgments

The authors would like to thank all the personnel who provided technical support or assisted with data collection. We also acknowledge the reviewers for their valuable comments and suggestions.

Footnote

Funding: This research was funded by the Research Topic on Educational Informatization in Jiangsu Higher Education Institutions (grant No. 2025JSETKT172) and Nantong Social Livelihood Science and Technology Plan (Directive Project) (grant No. MS2024019).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-576/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Rastogi D, Johri P, Donelli M, Kadry S, Khan AA, Espa G, Feraco P, Kim J. Deep learning-integrated MRI brain tumor analysis: feature extraction, segmentation, and Survival Prediction using Replicator and volumetric networks. Sci Rep 2025;15:1437. [Crossref] [PubMed]
Gao C, Wu L, Wu W, Huang Y, Wang X, Sun Z, Xu M, Gao C. Deep learning in pulmonary nodule detection and segmentation: a systematic review. Eur Radiol 2025;35:255-66. [Crossref] [PubMed]
Han K, Sheng VS, Song Y, Liu Y, Qiu C, Ma S, Liu Z. Deep semi-supervised learning for medical image segmentation: a review. Expert Syst Appl 2024;245:123052.
Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A. editors. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Cham: Springer; 2015:234-41.
Azad R, Aghdam EK, Rauland A, Jia Y, Avval AH, Bozorgpour A, Karimijafarbigloo S, Cohen JP, Adeli E, Merhof D. Medical Image Segmentation Review: The Success of U-Net. IEEE Trans Pattern Anal Mach Intell 2024;46:10076-95. [Crossref] [PubMed]
Thapar P, Rakhra M, Prashar D, Mrsic L, Khan AA, Kadry S. Skin cancer segmentation and classification by implementing a hybrid FrCN-(U-NeT) technique with machine learning. PLoS One 2025;20:e0322659. [Crossref] [PubMed]
Alqarafi A, Khan AA, Mahendran RK, Al-Sarem M, Albalwy F. Multi-scale GC-T2: Automated region of interest assisted skin cancer detection using multi-scale graph convolution and tri-movement based attention mechanism. Biomed Signal Process Control 2024;95:106313.
Chen J, Mei J, Li X, Lu Y, Yu Q, Wei Q, Luo X, Xie Y, Adeli E, Wang Y, Lungren MP, Zhang S, Xing L, Lu L, Yuille A, Zhou Y. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med Image Anal 2024;97:103280. [Crossref] [PubMed]
Kim JW, Khan AU, Banerjee I. Systematic Review of Hybrid Vision Transformer Architectures for Radiological Image Analysis. J Imaging Inform Med 2025;38:3248-62. [Crossref] [PubMed]
Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752 (Accessed September 17, 2025). Available online: https://arxiv.org/abs/2312.00752
Ruan J, Li J, Xiang S. VM-Unet: Vision mamba unet for medical image segmentation. arXiv:2402.02491 (Accessed September 17, 2025). Available online: https://arxiv.org/abs/2402.02491
Ma X, Zhang X, Pun MO. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci Remote Sens Lett 2024;21:6011405.
Li G, Huang Q, Wang W, Liu L. Selective and multi-scale fusion Mamba for medical image segmentation. Expert Syst Appl 2025;261:125518.
Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen YW, Wu J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Barcelona: IEEE; 2020:1055-9.
Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2019;38:2281-92. [Crossref] [PubMed]
Oktay O, Schlemper J, Le Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning where to look for the pancreas. arXiv:1804.03999 (Accessed September 17, 2025). Available online: https://arxiv.org/abs/1804.03999
Das N, Das S. Attention-UNet architectures with pretrained backbones for multi-class cardiac MR image segmentation. Curr Probl Cardiol 2024;49:102129. [Crossref] [PubMed]
Fan T, Wang G, Li Y, Wang H. MA-Net: A multi-scale attention network for liver and tumor segmentation. IEEE Access 2020;8:179656-65.
Huang H, Chen Z, Zou Y, Lu M, Chen C, Song Y, Zhang H, Yan F. Channel prior convolutional attention for medical image segmentation. Comput Biol Med 2024;178:108784. [Crossref] [PubMed]
Xu J, Xiong Z, Bhattacharyya SP. PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE; 2023:19529-39.
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: Karlinsky L, Michaeli T, Nishino K. editors. Computer Vision – ECCV 2022 Workshops. Cham: Springer; 2023:205-18.
Li Z, Zheng Y, Shan D, Yang S, Li Q, Wang B, Zhang Y, Hong Q, Shen D. ScribFormer: Transformer Makes CNN Work Better for Scribble-Based Medical Image Segmentation. IEEE Trans Med Imaging 2024;43:2254-65. [Crossref] [PubMed]
Lu Z, She C, Wang W, Huang Q. LM-Net: A light-weight and multi-scale network for medical image segmentation. Comput Biol Med 2024;168:107717. [Crossref] [PubMed]
Meng W, Liu S, Wang H. AFC-Unet: Attention-fused full-scale CNN-transformer unet for medical image segmentation. Biomed Signal Process Control 2025;99:106839.
Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv:2401.09417 (Accessed September 17, 2025). Available online: https://arxiv.org/abs/2401.09417
Shi Y, Xia B, Jin X, Wang X, Zhao T, Xia X. Vmambair: Visual state space model for image restoration. IEEE Trans Circuits Syst Video Technol 2025;35:5560-74.
Zhang M, Yu Y, Jin S, Gu L, Ling T, Tao X. VM-UNET-V2: Rethinking vision mamba UNet for medical image segmentation. In: Peng W, Cai Z, Skums P. editors. Bioinformatics Research and Applications. ISBRA 2024. Singapore: Springer; 2024:335-46.
Xing Z, Ye T, Yang Y, Liu G, Zhu L. Segmamba: Long-range sequential modeling mamba for 3D medical image segmentation. In: Linguraru MG, Dou Q, Feragen A, Giannarou S, Glocker B, Lekadir K, Schnabel JA. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Cham: Springer; 2024:578-88.
Wu R, Liu Y, Liang P, Chang Q. H-vmunet: High-order vision mamba unet for medical image segmentation. Neurocomputing 2025;624:129447.
Buda M, Saha A, Mazurowski MA. Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. Comput Biol Med 2019;109:218-25. [Crossref] [PubMed]
Jha D, Smedsrud PH, Riegler MA, Halvorsen P, De Lange T, Johansen D, Johansen HD. Kvasir-SEG: A Segmented Polyp Dataset. In: Ro Y, Cheng WH, Kim J, Chu WT, Cui P, Choi JW, Hu MC, De Neve W, editors. MultiMedia Modeling. MMM 2020. Cham: Springer; 2020:451-62.
Bernal J, Sánchez FJ, Fernández-Esparrach G, Gil D, Rodríguez C, Vilariño F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput Med Imaging Graph 2015;43:99-111. [Crossref] [PubMed]
Silva J, Histace A, Romain O, Dray X, Granado B. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int J Comput Assist Radiol Surg 2014;9:283-93. [Crossref] [PubMed]
Zhu Y, Peng M, Wang X, Huang X, Xia M, Shen X, Jiang W. LGCE-Net: A local and global contextual encoding network for effective and efficient medical image segmentation. Appl Intell 2025; [Crossref]
Liu F, Hua Z, Li J, Fan L. DBMF: Dual Branch Multiscale Feature Fusion Network for polyp segmentation. Comput Biol Med 2022;151:106304. [Crossref] [PubMed]
Nanni L, Lumini A, Fantozzi C. Exploring the potential of ensembles of deep learning networks for image segmentation. Information 2023;14:657.
Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834-48. [Crossref] [PubMed]
Kujur A, Raza Z, Khan AA, Wechtaisong C. Data complexity based evaluation of the model dependence of brain MRI images for classification of brain tumor and Alzheimer’s disease. IEEE Access 2022;10:112117-33.

Cite this article as: Chen H, Min BW, Zhang H. A dual-branch network for lesion segmentation in medical images using state space models. Quant Imaging Med Surg 2025;15(12):11977-11991. doi: 10.21037/qims-2025-576

A dual-branch network for lesion segmentation in medical images using state space models

Introduction

Related work

Methods

Dual-branch network architecture

SS2D

Dual-branch encoder architecture

LM-DSCB

Results

Datasets

Experimental details

Comparative experiments

Table 1

Table 2

Table 3

Ablation experiments

Table 4

Table 5

Table 6

Discussion

Table 7

Conclusions

Acknowledgments

Footnote

References

Article Options

Download Citation

Share