Original Article

Enhancing vision Mamba with two-dimensional position embedding and multiscale fusion for medical image segmentation

Xusen Zhang1, Ruixian Li2, Jing Rao1, Mingju Wang1, Jing Zhang1, Liang Zhao3

1Department of Information and Resource, Taihe Hospital, Hubei University of Medicine, Shiyan, China; 2Department of Clinical Nutrition, Taihe Hospital, Hubei University of Medicine, Shiyan, China; 3Center of Precision Medicine, Taihe Hospital, Hubei University of Medicine, Shiyan, China

Contributions: (I) Conception and design: L Zhao, J Zhang; (II) Administrative support: J Rao, M Wang; (III) Provision of study materials or patients: None; (IV) Collection and assembly of data: X Zhang, R Li; (V) Data analysis and interpretation: X Zhang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Liang Zhao, PhD. Center of Precision Medicine, Taihe Hospital, Hubei University of Medicine, South Renmin Road, Shiyan 442000, China. Email: s080011@e.ntu.edu.sg; Jing Zhang, MS. Department of Information and Resource, Taihe Hospital, Hubei University of Medicine, South Renmin Road, Shiyan 442000, China. Email: zhjinwhu@taihehospital.com.

Background: Medical image segmentation plays a critical role in diagnosis and scientific research, enabling the delineation of regions of interest and the identification of fine-grained lesions at the pixel level. Recent advances in deep learning, particularly in Vision Transformers (ViTs), have significantly accelerated this field. However, existing ViT models often struggle to capture long-range dependencies while maintaining computational efficiency. Therefore, we aimed to design a model that can not only accurately identify target lesions but also remain computationally efficient.

Methods: To achieve this goal, we built on vision Mamba, an efficient visual state space model (SSM), and enhanced it by integrating a two-dimensional (2D) position embedding and a multiscale feature fusion block (MB). Our 2D position embedding method effectively incorporates spatial information into the patch embedding, while the MB supplements detailed information lost during Mamba’s sequential processing.

Results: Experiments on three public datasets showed the superiority of our approach. Compared to the second-best performing model, our proposed method achieved average performance improvements of 1.29%, 1.18%, 5.52%, 1.18%, and 1.03% across the three datasets in terms of the dice similarity coefficient, volumetric overlap error, average surface distance between objects, Jaccard coefficient, and recall, respectively.

Conclusions: Our method not only enhances segmentation accuracy but also retains computational efficiency, making it a promising candidate for practical medical image segmentation applications.

Keywords: Vision Mamba; two-dimensional position embedding (2D position embedding); multiscale feature fusion; state space model (SSM)


Submitted Oct 15, 2025. Accepted for publication Jan 29, 2026. Published online Mar 18, 2026.

doi: 10.21037/qims-2025-aw-2178


Introduction

Medical image segmentation plays a critical role in disease diagnosis and scientific research. It enables the precise extraction of pixel-level lesions or regions of interest from complex medical images (1,2), which is crucial for disease diagnosis, surgical planning, and treatment planning. Currently, the manual segmentation of lesion objects by experienced radiologists or pathologists remains the gold standard for medical image diagnosis. However, this process is time-consuming and labor-intensive, and the segmentation results may vary depending on the experience and skill level of different physicians. Deep learning-based models have improved the efficiency of medical image segmentation while reducing costs and ensuring accuracy and stability.

Over the past decade, convolutional neural network models, including fully convolutional network models (3) and their variants [e.g., UNet (4), nnUNet (1), and SegResNet (5)], have achieved remarkable results in various image segmentation tasks. These networks typically adopt a U-shaped encoder-decoder framework, and use skip connections to fuse features from the encoder and decoder, thereby enhancing feature detail information. However, due to the limitations of convolutional operations, these U-shaped convolutional networks excel in extracting local features but struggle to establish long-term dependencies between features, limiting the further improvement of model performance. To address the limitations of convolutional operations, researchers have attempted to establish long-term dependencies between local features through attention mechanisms.

Attention mechanisms mimic the process of how humans allocate attention when processing competing information. By dynamically assigning weights to input features, models can quantify the degree of correlation between features (i.e., attention coefficients), while suppressing irrelevant features.

Vision Transformer (ViT) (6) introduced the self-attention mechanism of the Transformer into the image domain for the first time. Through the self-attention mechanism, the Transformer can focus on different regions in an image and learn their correlations, effectively capturing global contextual information and establishing long-term dependencies between local features. Although various ViT-based model variants have performed well in medical image segmentation tasks, their quadratic computational complexity poses challenges when applied to downstream tasks.

Recently, the state space model (SSM) has garnered attention from researchers due to its linear computational complexity and ability to improve model efficiency through parallel training. Based on studies of classical SSMs, researchers introduced Mamba (7), which integrates time-varying parameters into the SSM framework, enabling the dynamic filtering of past information. Compared to the Transformer, Mamba can achieve similar performance with fewer parameters (8,9).

Liu et al. proposed a vision SSM (VMamba) (9) based on Mamba, which exhibits linear computational complexity while achieving accuracy comparable to ViT. VMamba has facilitated the development of multiple Mamba-based models in the field of medical image segmentation. These models introduce global information for each image patch by scanning surrounding patches in multiple directions, and then extract and synthesize the features of image patches containing global information. However, Mamba was specifically designed for natural language processing (NLP) tasks (7), and two issues arise when it is directly applied to image data. First, it treats images as patch embeddings similar to token sequences in NLP but lacks two-dimensional (2D) position information. Second, Mamba processes flattened one-dimensional (1D) image patch sequences recursively, which can cause spatially adjacent pixels to be too far apart in the flattened sequence, leading to the loss of local feature pixels. Therefore, effectively incorporating 2D position information into patch embedding vectors to enable Mamba to extract 2D position relationships between image patches, and enhancing local information for image patches are crucial for medical image segmentation. Thus, specialized position embedding methods for vision Mamba models, such as the relative position bias used in the Swin Transformer (10) to enhance its performance in visual tasks, need to be developed.

Position embedding methods designed for Transformers have been incorporated into Mamba-based medical image segmentation models, but they have shown limited or inconsistent performance improvements (8). Thus, we developed a model called vision Mamba with 2D Position Embedding and Multiscale Feature Fusion Block (VMPM), which adopts a U-shaped architecture and integrates a 2D Image Position Embedding (2DIPE) method and a multiscale feature fusion block (MB), both specifically designed for Mamba-based vision models. 2DIPE not only incorporates 2D position information for each image patch into the patch embedding, but also records the relative position coordinates between different patches, while the MB enhances the output of Mamba by complementing local detail information.

The main contributions of this study can be summarized as follows:

  • Introduction of an efficient position encoding method: an innovative positional encoding method was developed specifically for the Mamba model architecture. This method enables the precise capture and representation of the 2D spatial relationships between image patches during the sequential scanning process in Mamba.
  • Development of an effective multiscale feature fusion structure: a multiscale feature fusion module was designed to compensate for the loss of detailed feature information during Mamba-based feature extraction. Through systematic comparisons of different module architectures, the most effective structure was identified to enhance segmentation performance.
  • Thorough experimental validation: comparative experiments conducted on three benchmark datasets showed that the proposed method significantly improved medical image segmentation performance while maintaining the same parameter count and computation efficiency. Ablation studies further confirmed the effectiveness of the introduced 2D position encoding method and feature fusion structure.

In the following section, we first summarize the classical methods for medical image segmentation, then review the development of SSMs, and finally discuss the application of positional encoding methods and multiscale detail feature fusion techniques in vision Mamba models.

Medical image segmentation

Medical image segmentation is a crucial step in artificial intelligence-assisted disease diagnosis, enabling the accurate extraction of specific organs, pathological tissues, and tumors from medical images. UNet (4), which employs a U-shaped network structure, uses skip connections to fuse detailed information from the decoder with abstract information from the corresponding encoder, yielding refined image features. This approach and its variants [e.g., nnUNet (1) and SegResNet (5)] have been widely adopted in medical image segmentation.

Deep learning models based on Transformers have also made significant contributions to medical image processing tasks. ViT was the first to apply Transformers to visual tasks, achieving remarkable performance (6). The Swin Transformer (10) employs a hierarchical structure and shifted window attention mechanism to reduce computational complexity while improving prediction accuracy. UNETR (11) integrates UNet and Transformers to establish a network structure dedicated specifically to medical image segmentation. Swin-UNETR (12) introduces shifted window attention into UNETR, further reducing computational complexity and enhancing prediction performance.

SSMs

In recent years, there has been a growing interest in developing models similar to Transformers that can capture long-range dependencies while maintaining linear reasoning complexity. This interest led to the development of Mamba. A novel architecture called HiPPO was proposed (13), which stores historical information through functional approximation methods. The structured state space sequence model (S4) (14), a successor to HiPPO, enhances computational efficiency by converting the parameter matrix A (referred to as the HiPPO matrix) into a sum of regular and low-rank matrices. Mamba, an SSM based on the S4 architecture, integrates the concept of selective memory from long short-term memory (15) into the SSM, enabling selective memory and historical information updating. However, due to its inherent 1D scanning mechanism (7), Mamba faces significant challenges in learning 2D spatial features. Consequently, a series of Mamba-based vision models have been proposed to address this issue.

For instance, ViM (8), the first vision Mamba model, flattens the 2D image into a sequence of 1D patch embedding vectors and scans these patches in both forward and backward directions to enhance the global perspective. However, merely adding forward and backward perspectives to each image patch does not enable the model to fully comprehend the position of each patch in 2D space. To overcome this limitation, VMamba (9) scans these image patches in four directions (forward, backward, left, and right), providing a more comprehensive global perspective. Subsequently, U-Mamba pioneered the integration of Mamba models with the UNet architecture for medical image segmentation tasks. VM-Unet adopted a similar architecture, applying the visual Mamba model to the segmentation of dermatological images and human organ images (16).

Inspired by these developments, LKM-UNet incorporated bidirectionally scanned Mamba modules (forward and backward) into the UNet framework for medical image segmentation (17). LightMUNet further advanced this approach by combining quad-directional scanned Mamba modules (forward, backward, leftward, and rightward) with the UNet architecture to develop a lightweight vision Mamba model. SwinUMamba (18) enhanced this paradigm by integrating quad-directional scanned Mamba modules with UNet while incorporating pre-trained weights from VMamba, thereby improving performance in medical image segmentation tasks.

In addition to the aforementioned methods, LoG-VMamba employs a local token extractor to sequentially scan image patches via a visual Mamba module to extract local features. Subsequently, a global token extractor sequentially scans the global features of the entire image across each channel through the visual Mamba module. Finally, the local and global features are concatenated to help the Mamba model learn the 2D spatial relationships among medical image features (19).

Position embedding in vision Mamba

To better address the difficulty the Mamba model faces in learning the 2D spatial positional relationships among image features, Zhu et al. (8) added a learnable position embedding method to vision Mamba for image classification, segmentation, and object detection. Zhang et al. (20) employed a position mapping function that maps the coordinates of points to position embeddings, after which the position encoding vector is added to the feature vector. Lin et al. (21) introduced a learnable position embedding method in the vision Mamba model for image enhancement tasks.

Learnable positional encoding learns the 2D relative positional relationships between image patches through a parameter matrix. However, it consumes training resources and cannot reliably learn the 2D relative positional relationships between image patches in every training run. Conversely, the 2DIPE method proposed in this study records the 2D relative positional coordinates between image patches during each Mamba scan operation. Moreover, the positional encoding vectors of each image patch can be determined before training, without occupying additional computational resources. The positional encodings specifically designed for Transformers (22) are restricted to 1D sequences and are thus inadequate for helping the model understand the 2D spatial positional relationships among image features. Consequently, they are not appropriate for vision Mamba.

Multiscale feature fusion in Mamba

In addition to the aforementioned issues, the Mamba scanning method also results in the loss of detailed information (23). Du et al. (24) proposed a Multiscale Anisotropic Convolution Module (MACM), which consists of convolutional layers with four different kernels. The MACM module is placed in parallel with the Mamba module to compensate for the multiscale detailed information lost during the Mamba scanning process. Xing et al. (25) introduced a Gated Spatial Convolution module, which is positioned before the Mamba module to extract local detail features from images and subsequently fuse them with the global features captured by the Mamba module for complementary enhancement. Chen et al. (26) developed the CaVMamba model, which places a fixed-size convolutional layer in parallel with a Mamba module to supplement the detailed information in the output features of the Mamba module. However, the aforementioned studies did not thoroughly investigate the positional relationship between the detail feature extraction module and the Mamba module.

In the present study, the segmentation performance of the model was evaluated under three configurations: with the detail feature extraction module placed before the Mamba module, after the Mamba module, and in parallel with the Mamba module. The experimental results demonstrated that the parallel placement of the detail feature extraction module and the Mamba module resulted in the optimal segmentation performance.


Methods

Overall architecture

Currently, vision Mamba models face challenges in acquiring 2D spatial features and processing detailed information. To address these issues, we introduced VMPM, a novel network model that enhances the spatial position information of image patches by incorporating 2D spatial position encoding and compensates for missing detail information through a multiscale feature fusion module.

As illustrated in Figure 1, the VMPM network comprises a patch embedding and position embedding (PPE) module, an MB module, and an upsampling (UB) module. The PPE module first performs patch embedding on the image patches, mapping them into vectors, which are then summed with the corresponding 2D spatial position encoding vectors to enhance the spatial position information. The MB module supplements the Mamba branch output with multiscale detail features. The UB module fuses the output of the previous stage with the corresponding MB output using skip connections and performs upsampling to restore detailed information. The visual state space block (VSSB), derived from Mamba (9), is an extension of the Mamba block designed to capture extensive contextual information from visual data.

Figure 1 Architecture of the proposed VMPM model, which establishes the relative position relationships between image patches in 2D space using a position encoding method, and integrates multiscale local features to supply the image detail information. 2D, two-dimensional; CONV, convolution operation; DW, depth-wise; GT, ground truth; Linear, linear transformation function; MB, multiscale feature fusion block; Norm, normalization operation; Relu, ReLU operation; SS2D, 2D Selective Scan; UB module, upsampling module; VMPM, vision Mamba with 2D position embedding and multiscale feature fusion block; VSSB, visual state space block.

Our model employs a symmetric structure. The encoder comprises seven stages. In the first stage, the patch embedding layer divides the input image into overlapping patches, mapping each patch into a 1 × 1 × C vector, resulting in an output with dimensions W/4, H/4, and C. The output of the first stage is then summed with the position encoding vectors. In each of the following five stages, the MB module downsamples its input. Similarly, the decoder is organized into six stages. In each stage, the UB module performs feature fusion and upsampling on the output from the previous stage and the corresponding MB output.
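For readers who prefer code, the data flow just described can be summarized as follows. This is a high-level sketch of our reading of Figure 1, not the released implementation; the module names (ppe, enc, dec, head) and the stage interfaces are placeholder assumptions.

```python
import torch.nn as nn

class VMPMSketch(nn.Module):
    """High-level VMPM wiring: PPE -> MB encoder stages -> UB decoder stages.

    All submodules are supplied by the caller; names are hypothetical.
    """
    def __init__(self, ppe, mb_stages, ub_stages, head):
        super().__init__()
        self.ppe = ppe                       # patch + 2D position embedding, output (C, H/4, W/4)
        self.enc = nn.ModuleList(mb_stages)  # MB stages, each one downsampling
        self.dec = nn.ModuleList(ub_stages)  # UB stages, each one upsampling
        self.head = head                     # final segmentation head

    def forward(self, img):
        x = self.ppe(img)
        skips = []
        for stage in self.enc:               # encoder: collect MB outputs for the skips
            x = stage(x)
            skips.append(x)
        for stage, skip in zip(self.dec, reversed(skips[:-1])):
            x = stage(x, skip)               # decoder: fuse upsampled features with MB outputs
        return self.head(x)
```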

Preliminaries

S4 and Mamba are inspired by the continuous system (7), which maps a 1D sequence $x(t) \in \mathbb{R}$ to $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. This system uses $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$ as projection parameters. The aforementioned process can be represented as a linear ordinary differential equation:

$h_t = A h_{t-1} + B x_t, \quad y_t = C h_t$ [1]

To match real-world problems and simplify computational complexity, zero-order hold (27) is used to discretize the SSM and its parameters as follows:

$h_t = \bar{A} h_{t-1} + \bar{B} x_t$ [2]

$y_t = C h_t$ [3]

$\bar{A} = \exp(\Delta A)$ [4]

$\bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\Delta B$ [5]

$B = \mathrm{Linear}_N(x_t)$ [6]

$C = \mathrm{Linear}_N(x_t)$ [7]

$\Delta = \mathrm{softplus}\big(\mathrm{Linear}_N(x_t)\big)$ [8]
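For intuition, the discretized recurrence of Eqs. [2] and [3] with the input-dependent parameters of Eqs. [6]-[8] can be written as a simple sequential scan. The sketch below is illustrative only, not the parallel Mamba kernel; following common Mamba implementations, it approximates Eq. [5] by the simpler $\bar{B} \approx \Delta B$, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def selective_scan(x, A, lin_B, lin_C, lin_delta):
    """Sequential sketch of Eqs. [2]-[8].

    x: (L, D) input sequence; A: (D, N) state matrix (negative entries keep
    the recurrence stable); lin_B, lin_C: nn.Linear(D, N); lin_delta: nn.Linear(D, D).
    """
    L, D = x.shape
    h = torch.zeros(D, A.shape[1])                   # hidden state h_0 = 0
    ys = []
    for t in range(L):
        xt = x[t]
        delta = F.softplus(lin_delta(xt))            # Eq. [8]
        B, C = lin_B(xt), lin_C(xt)                  # Eqs. [6], [7]
        A_bar = torch.exp(delta[:, None] * A)        # Eq. [4]
        B_bar = delta[:, None] * B[None, :]          # simplified stand-in for Eq. [5]
        h = A_bar * h + B_bar * xt[:, None]          # Eq. [2]
        ys.append((h * C[None, :]).sum(-1))          # Eq. [3]: y_t = C h_t
    return torch.stack(ys)                           # (L, D)
```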

2D position embedding method

By assuming the initial condition $h_0 = 0$ and unrolling Eqs. [2-8], we obtain:

$h_1 = \bar{B}_1 x_1, \quad y_1 = C_1 \bar{B}_1 x_1, \quad h_2 = \bar{A}_2 \bar{B}_1 x_1 + \bar{B}_2 x_2, \quad y_2 = C_2 \bar{A}_2 \bar{B}_1 x_1 + C_2 \bar{B}_2 x_2$ [9]

In general:

$h_t = \sum_{j=1}^{t}\left(\prod_{k=j+1}^{t}\bar{A}_k\right)\bar{B}_j x_j$ [10]

$y_t = C_t \sum_{j=1}^{t}\left(\prod_{k=j+1}^{t}\bar{A}_k\right)\bar{B}_j x_j$ [11]

By converting Eq. [11] into matrix form, we obtain:

$y = \bar{G} x$ [12]

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_L \end{bmatrix} = \begin{bmatrix} g_{1,1} & 0 & \cdots & 0 \\ g_{2,1} & g_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ g_{L,1} & g_{L,2} & \cdots & g_{L,L} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_L \end{bmatrix}$ [13]

$g_{i,j} = C_i \left(\prod_{k=j+1}^{i} \bar{A}_k\right) \bar{B}_j$ [14]

$\bar{G}$ can be viewed as an attention matrix that selects relevant information from the past according to $x_t$, and aggregates that information to produce $y_t$.

Substituting Eq. [4] and Eq. [5] into Eq. [14] yields the following:

$g_{i,j} = C_i \left[\prod_{k=j+1}^{i} \exp(\Delta_k A)\right] \left[(\Delta_j A)^{-1}\big(\exp(\Delta_j A) - I\big)\Delta_j B_j\right]$ [15]

where $S_B(x_t)$, $S_C(x_t)$, and $S_p(S_\Delta(x_t))$ denote $B$, $C$, and $\Delta$, respectively, and $S_p$ represents the softplus activation function. Substituting $S_B(x_t)$, $S_C(x_t)$, and $S_p(S_\Delta(x_t))$ into Eq. [15] yields the following:

$g_{i,j} = S_C(x_i)\left[\prod_{k=j+1}^{i}\exp\big(S_p(S_\Delta(x_k))A\big)\right]\left[\big(S_p(S_\Delta(x_j))A\big)^{-1}\Big(\exp\big(S_p(S_\Delta(x_j))A\big)-I\Big)S_p(S_\Delta(x_j))\right]S_B(x_j)$ [16]

$g_{i,j} = S_C(x_i)\, M\, S_B(x_j)$ [17]

$M = \left[\prod_{k=j+1}^{i}\exp\big(S_p(S_\Delta(x_k))A\big)\right]\left[\big(S_p(S_\Delta(x_j))A\big)^{-1}\Big(\exp\big(S_p(S_\Delta(x_j))A\big)-I\Big)S_p(S_\Delta(x_j))\right]$ [18]

where $A$ is a structured matrix that has no linear relationship with $x_t$; $\Delta$ also has no linear relationship with $x_t$; $S_C$ and $S_B$ are linear mappings of $x_t$; matrix $M$ represents the result of Eq. [18]; $N_{x_i}$ represents $S_C(x_i)$; and $W_{x_j}$ represents $S_B(x_j)$. Substituting $M$, $N_{x_i}$, and $W_{x_j}$ into Eq. [17] yields:

$g_{i,j} = N_{x_i} M W_{x_j} = N[x_i D x_j]$ [19]

Our designed position embedding method is expressed in Eq. [20] as follows:

$x_i = \begin{bmatrix} x_{i,0} \\ x_{i,1} \\ \vdots \\ x_{i,2n} \\ x_{i,2n+1} \end{bmatrix} + \begin{bmatrix} \cos(x_i) - \sin(x_i) \\ \cos(y_i) + \sin(y_i) \\ \vdots \\ \cos(x_i) - \sin(x_i) \\ \cos(y_i) + \sin(y_i) \end{bmatrix}$ [20]

where $x_i$ and $y_i$ are the position coordinates of patch $i$ in the original input image. Substituting Eq. [20] into Eq. [19] yields the following (the detailed derivation from Eqs. [19] and [20] to Eqs. [21] and [22] is provided in Appendix 1):

$g_{i,j} = N N'$ [21]

$N' = \begin{bmatrix} E_0 + Q_0\big(\cos(x_i - x_j) - \sin(x_i + x_j)\big) \\ E_1 + Q_1\big(\cos(y_i - y_j) + \sin(y_i + y_j)\big) \\ \vdots \\ E_{2n} + Q_{2n}\big(\cos(x_i - x_j) - \sin(x_i + x_j)\big) \\ E_{2n+1} + Q_{2n+1}\big(\cos(y_i - y_j) + \sin(y_i + y_j)\big) \end{bmatrix}$ [22]

where $\cos(x_i - x_j) - \sin(x_i + x_j)$ and $\cos(y_i - y_j) + \sin(y_i + y_j)$ expose the 2D relative position information of patch $i$ and patch $j$ to Mamba. The item $g_{i,j}$, obtained from the matrix multiplication of $N$ and $N'$, contains both $\cos(x_i - x_j) - \sin(x_i + x_j)$, which encodes the relative positional information of patch $i$ and patch $j$ along the x-axis, and $\cos(y_i - y_j) + \sin(y_i + y_j)$, which encodes their relative positional information along the y-axis. Therefore, $g_{i,j}$ incorporates the 2D relative positional coordinates of patch $i$ and patch $j$, which is equivalent to encoding the relative positional information between the two image patches.
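The derivation rests on the product-to-sum identities $(\cos a - \sin a)(\cos b - \sin b) = \cos(a-b) - \sin(a+b)$ and $(\cos a + \sin a)(\cos b + \sin b) = \cos(a-b) + \sin(a+b)$, which convert products of absolute-position terms into relative-position terms. As an illustration, the additive embedding of Eq. [20] can be sketched as below; the channel layout (x-axis terms on even dimensions, y-axis terms on odd dimensions) and the raw-coordinate scaling are our assumptions.

```python
import torch

def two_d_position_embedding(tokens: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Additive 2DIPE sketch following Eq. [20].

    tokens: (P, D) patch embedding vectors, D even.
    coords: (P, 2) float (x_i, y_i) coordinates of each patch in the input image.
    """
    x, y = coords[:, 0:1], coords[:, 1:2]           # (P, 1) each
    pe = torch.empty_like(tokens)
    pe[:, 0::2] = torch.cos(x) - torch.sin(x)       # x-axis terms on even dimensions
    pe[:, 1::2] = torch.cos(y) + torch.sin(y)       # y-axis terms on odd dimensions
    return tokens + pe                              # no learned parameters involved
```

Because the embedding depends only on the patch grid, it can be computed once before training, which is the source of the zero extra training cost noted above.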

MB

The MB module comprises the following three branches: a multiscale detail feature branch (MDFB), a shortcut connection branch, and a Mamba module branch. The outputs of the three branches are added together and subsequently undergo BatchNorm2d normalization followed by a rectified linear unit (ReLU) activation (Figure 1). The MDFB is formed by three 3×3 convolution kernels in series. Two successive 3×3 convolutions have an effective receptive field similar to that of a 5×5 convolution, while three successive 3×3 convolutions approximate a 7×7 convolution. Therefore, the outputs of the three convolution kernels are concatenated to form features with multiscale details. The above can be represented by the following formulas:

$A = C_{3\times3}(f_{PE}), \quad B = C_{3\times3}\big(C_{3\times3}(f_{PE})\big), \quad C = C_{3\times3}\Big(C_{3\times3}\big(C_{3\times3}(f_{PE})\big)\Big)$ [23]

$f_{MDFB} = \mathrm{Cat}(A, B, C)$ [24]

$f_{Shortcut} = C_{1\times1}(f_{PE})$ [25]

$f_{Mamba} = C_{1\times1}\big(\mathrm{VSSB}(\mathrm{VSSB}(f_{PE}))\big)$ [26]

$f_{output} = \mathrm{BR}(f_{MDFB} + f_{Shortcut} + f_{Mamba})$ [27]

where $f_{PE}$ represents the output of the 2DIPE module; $f_{Shortcut}$, $f_{Mamba}$, and $f_{MDFB}$ represent the outputs of the shortcut connection branch, the Mamba module branch, and the MDFB, respectively. $f_{Shortcut}$ propagates shallow features and enhances model robustness. So that $f_{Shortcut}$, $f_{Mamba}$, and $f_{MDFB}$ can be added together to enrich the Mamba branch output with detailed information, the outputs of Eqs. [24], [25], and [26] are mapped to the same dimension. $C_{3\times3}$ denotes a convolution with a 3×3 kernel; $\mathrm{Cat}(\cdot)$ indicates concatenation along the channel dimension; and $\mathrm{BR}(\cdot)$ represents the combination of normalization and ReLU operations.
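As one concrete reading of Eqs. [23]-[27], the sketch below wires the three branches in parallel. The channel split across the three concatenated convolution outputs, the BatchNorm2d placement, and the vssb placeholder (standing for the two stacked VSSBs of Eq. [26]) are assumptions; the downsampling step shown in Figure 1 is omitted for brevity.

```python
import torch
import torch.nn as nn

class MBBlock(nn.Module):
    """Sketch of the MB module: MDFB + shortcut + Mamba branch in parallel."""
    def __init__(self, c_in, c_out, vssb: nn.Module):
        super().__init__()
        k = c_out // 3
        self.conv1 = nn.Conv2d(c_in, k, 3, padding=1)           # A, ~3x3 receptive field
        self.conv2 = nn.Conv2d(k, k, 3, padding=1)              # B, ~5x5 receptive field
        self.conv3 = nn.Conv2d(k, c_out - 2 * k, 3, padding=1)  # C, ~7x7 receptive field
        self.shortcut = nn.Conv2d(c_in, c_out, 1)               # Eq. [25]
        self.vssb = vssb                                        # two stacked VSSBs
        self.proj = nn.Conv2d(c_in, c_out, 1)                   # the 1x1 conv of Eq. [26]
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, f_pe):
        a = self.conv1(f_pe)                  # Eq. [23]: serial 3x3 convolutions,
        b = self.conv2(a)                     # tapping the output after each one
        c = self.conv3(b)
        f_mdfb = torch.cat([a, b, c], 1)      # Eq. [24]
        f_shortcut = self.shortcut(f_pe)      # Eq. [25]
        f_mamba = self.proj(self.vssb(f_pe))  # Eq. [26]
        return torch.relu(self.bn(f_mdfb + f_shortcut + f_mamba))  # Eq. [27]
```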

The UB module upsamples the output features of the previous layer and concatenates them with the corresponding output features of the MB module. The concatenated result is then fed into convolution and normalization layers for feature fusion (Figure 1). The above can be represented by the following formulas:

$A = \mathrm{Cat}\big(\mathrm{UPS}(f_{pre}), f_{MB}\big), \quad B = \mathrm{CNBL}(A) + A, \quad f_{UB} = \mathrm{Lrelu}(B)$ [28]

where $\mathrm{UPS}$ represents an upsampling operation; $f_{pre}$ denotes the output features of the previous layer; $\mathrm{CNBL}(\cdot)$ represents a combination of convolution and normalization operations repeated three times; and $\mathrm{Lrelu}$ represents a LeakyReLU activation function.
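Continuing the same sketch, the UB module of Eq. [28] can be read as follows; the 1×1 projection that lets $A$ join the residual sum and the bilinear upsampling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UBBlock(nn.Module):
    """Sketch of the UB module: upsample, concatenate with the skip, fuse."""
    def __init__(self, c_prev, c_skip, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        layers, c = [], c_prev + c_skip
        for _ in range(3):                   # CNBL(): conv + norm, repeated three times
            layers += [nn.Conv2d(c, c_out, 3, padding=1), nn.BatchNorm2d(c_out)]
            c = c_out
        self.cnbl = nn.Sequential(*layers)
        self.proj = nn.Conv2d(c_prev + c_skip, c_out, 1)  # channel match for the residual

    def forward(self, f_prev, f_mb):
        a = torch.cat([self.up(f_prev), f_mb], 1)  # A = Cat(UPS(f_pre), f_MB)
        b = self.cnbl(a) + self.proj(a)            # B = CNBL(A) + A
        return F.leaky_relu(b)                     # f_UB = Lrelu(B)
```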

Datasets

Three publicly available datasets were used to assess the proposed model: AbdomenMR, ISIC2018, and Breast Ultrasound Images (BUSI), which correspond to magnetic resonance imaging (MRI) images, dermoscopic images, and ultrasound images, respectively. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

AbdomenMR is a publicly available 2D segmentation dataset comprising 110 MRI cases from the MICCAI 2022 AMOS Challenge (28), including 13 types of abdominal organs (liver, spleen, pancreas, right kidney, left kidney, stomach, gallbladder, esophagus, aorta, inferior vena cava, right adrenal gland, left adrenal gland, and duodenum). The size of each 2D MRI image is 320×320. In total, 60 annotated cases were used for training, and 50 cases were used for testing.

ISIC2018 (29) is a publicly available skin lesion segmentation dataset, containing 2,694 dermoscopy images with segmentation mask labels. Following previous work (17), the dataset was split into training and test sets at a 7:3 ratio. Specifically, the training set comprised 1,886 images, while the test set comprised 808 images.

BUSI (30) is a dataset collected from 600 female patients, comprising 780 images categorized into three groups: 210 images of malignant breast cancer, 437 images of benign breast lesions, and 133 images of normal breast tissue. Each image in the collection is saved in PNG format with an average size of 500×500 pixels. The dataset was split into training and testing sets at an 8:2 ratio.


Results

In this section, we demonstrate the superiority of our proposed VMPM model through a comparative analysis with other models. We first outline the datasets and evaluation metrics used in our experiments, then provide the implementation details, and finally present the experimental results. We also describe the ablation experiments that were conducted to evaluate the effectiveness of the main modules of the model.

Implementation details

The VMPM model was implemented in the nnU-Net framework (1) using PyTorch. The loss function was defined as the sum of dice loss and cross-entropy loss, and the AdamW optimizer was used with a weight decay of 0.05. A cosine learning rate decay was adopted with an initial learning rate of 0.0001. We used the pre-trained VMamba-tiny model (9) to initialize our VMPM model for all three datasets. During the training phase, we addressed overfitting by employing data augmentation methods such as random augmentation, horizontal flipping, and vertical flipping, each with equal probability. All experiments were carried out on a workstation equipped with two Intel(R) Xeon(R) Gold 6130 central processing units and four NVIDIA V100 graphics processing units. VMPM was trained for 100 epochs on each of the three datasets.
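In PyTorch terms, the optimizer and loss configuration above corresponds to a setup along these lines (a sketch under the stated hyperparameters; model and dice_loss are placeholders, and the epoch-level cosine schedule is an assumption):

```python
import torch
import torch.nn.functional as F

def build_optimizer(model, num_epochs=100):
    # AdamW with weight decay 0.05; cosine decay from an initial LR of 1e-4
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler

def segmentation_loss(logits, target, dice_loss):
    # Sum of dice loss and cross-entropy loss, as used for VMPM
    return dice_loss(logits, target) + F.cross_entropy(logits, target)
```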

Evaluation metrics

Eight commonly used semantic segmentation evaluation metrics were used to assess the performance of the proposed model: dice similarity coefficient (DSC), volumetric overlap error (VOE), average symmetric surface distance (ASSD), average surface distance between objects (OBJ_ASD), Jaccard coefficient (JAC), recall, floating point operations per second (FLOPs), and number of model parameters (Params). ASSD and OBJ_ASD are defined as follows:

$\mathrm{ASSD} = \dfrac{\sum_{p \in \hat{P}} d(p, \hat{G}) + \sum_{g \in \hat{G}} d(g, \hat{P})}{|\hat{P}| + |\hat{G}|}$ [29]

$\mathrm{OBJ\_ASD} = \dfrac{\sum_{O_p \in \hat{P}} \sum_{p \in O_p} d(p, O_G) + \sum_{O_g \in \hat{G}} \sum_{g \in O_g} d(g, O_P)}{|\hat{P}| + |\hat{G}|}$ [30]

where $\hat{P}$ and $\hat{G}$ represent the sets of boundary points of the predicted and ground-truth segmentation results, respectively; $d(p, \hat{G})$ represents the shortest Euclidean distance from point $p$ to the set $\hat{G}$; $O_P$ represents the set of points on the boundary of a predicted segmentation object; $O_G$ represents the set of points on the boundary of a ground-truth segmentation object; and $d(p, O_G)$ represents the shortest distance from point $p$ to the set $O_G$.
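As an illustration of Eq. [29], ASSD over two extracted boundary point sets can be computed directly from the definition (a sketch; boundary extraction from the masks is assumed to have been performed already):

```python
import numpy as np
from scipy.spatial.distance import cdist

def assd(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """ASSD between boundary point sets; pred_pts, gt_pts are (N, 2) pixel coordinates."""
    d = cdist(pred_pts, gt_pts)        # pairwise Euclidean distances
    d_p_to_g = d.min(axis=1)           # d(p, G_hat) for every p in P_hat
    d_g_to_p = d.min(axis=0)           # d(g, P_hat) for every g in G_hat
    return (d_p_to_g.sum() + d_g_to_p.sum()) / (len(pred_pts) + len(gt_pts))
```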

Performance evaluation

MRI image segmentation

The first segmentation task in our experiments was the segmentation of multiple organs on MRI images. The primary objective of multi-organ segmentation is to separate the contours of each organ from their adjacent organs in complex images, addressing the issue of boundary adhesion between different organs. This study has important implications for research on 3D organ reconstruction and radiological disease diagnosis.

On the AbdomenMR dataset, the VMPM model achieved average scores of 0.7747, 0.1163, 15.4405, 0.8837, and 0.9197 for the DSC, VOE, OBJ_ASD, JAC, and recall metrics, respectively (Table 1). Compared to the second-best performing model, the VMPM model achieved absolute improvements of 1.91%, 0.98%, 13.00%, 0.98%, and 0.54% on each corresponding metric. Notably, the second-best performing model had higher parameter counts and computational costs than our VMPM model.

Table 1

Quantitative comparison of medical image segmentation by representative models on the AbdomenMR 2D dataset

Structure Model FLOPs (G)↓ Params (M)↓ DSC↑ VOE↓ ASSD↓ OBJ_ASD↓ JAC↑ Recall↑
CNN SegResNet (5) 24.36 6.29 0.7433 0.1487 1.1287 22.6694 0.8513 0.8834
UNet (4) 69.43 42.21 0.6459 0.1969 1.5090 29.8643 0.8031 0.8302
nnUNet (1) 180.05 92.48 0.7391 0.1454 1.1120 18.8890 0.8546 0.8852
Transformer UNETR (11) 41.09 87.12 0.5346 0.2257 1.8620 37.0248 0.7743 0.8119
Swin-UNETR (12) 29.47 25.12 0.6465 0.1899 1.7319 37.7270 0.8101 0.8510
Mamba U-Mamba 230.97 76.40 0.7336 0.1503 1.1584 17.7469 0.8497 0.8814
LKM-UNet (17) 414.04 189.55 0.7556 0.1391 1.1071 21.1006 0.8609 0.8949
LightMUNet 16.46 7.28 0.6104 0.1973 1.6658 37.1760 0.8027 0.8447
SwinUMamba (18) 33.59 55.06 0.7407 0.1261 1.0268 21.0410 0.8739 0.9143
Ours 30.44 49.24 0.7747 0.1163 1.3358 15.4405 0.8837 0.9197

U-Mamba (arXiv:2401.04722) and LightMUNet (arXiv:2403.05246) are both preprint models proposed for biomedical image segmentation and have not yet been formally published. The top score for each indicator is shown in bold. ↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. 2D, two-dimensional; ASSD, average symmetric surface distance; DSC, dice similarity coefficient; FLOPs, floating point operations per second; JAC, Jaccard coefficient; OBJ_ASD, average surface distance between objects; Params, model params; VOE, volumetric overlap error.

Dermoscopic image segmentation

The second segmentation task in our experiments involved the segmentation of lesion areas on dermoscopic images. This task required addressing issues such as irregular lesion boundaries, and low contrast between lesion areas and surrounding normal skin tissue. The study findings have important implications for research on small lesion segmentation and skin disease diagnosis.

On the ISIC2018 dataset, the VMPM model achieved average scores of 0.9005, 0.1664, 5.6729, 5.7611, 0.8336, and 0.9257 for the DSC, VOE, ASSD, OBJ_ASD, JAC, and recall metrics, respectively (Table 2). Compared to the second-best performing model, VMPM demonstrated absolute improvements of 1.24%, 1.44%, 10.42%, 13.26%, 1.44%, and 0.84% on each metric, respectively.

Table 2

Quantitative comparison of medical image segmentation by representative models on the ISIC2018 dataset

Structure Model FLOPs (G)↓ Params (M)↓ DSC↑ VOE↓ ASSD↓ OBJ_ASD↓ JAC↑ Recall↑
CNN SegResNet (5) 15.60 6.29 0.8881 0.1824 6.4800 7.6917 0.8176 0.9056
UNet (4) 42.66 42.19 0.8823 0.1916 7.1346 7.1346 0.8084 0.9034
nnUNet (1) 115.79 92.47 0.8861 0.1888 6.5273 7.1814 0.8112 0.9173
Transformer UNETR (11) 26.41 87.51 0.8835 0.1908 6.9508 7.7931 0.8092 0.9118
Swin-UNETR (12) 19.16 25.12 0.8876 0.1834 6.5929 8.9991 0.8166 0.9026
Mamba U-Mamba 147.94 76.39 0.8880 0.1854 6.3330 6.6420 0.8146 0.9159
LKM-UNet (17) 265.54 189.53 0.8736 0.2041 7.4426 7.4169 0.7959 0.9082
LightMUNet 16.31 7.28 0.8782 0.1964 7.1769 9.8475 0.8036 0.8966
SwinUMamba (18) 33.64 55.06 0.8854 0.1808 6.7528 7.2172 0.8192 0.9101
Ours 30.49 49.23 0.9005 0.1664 5.6729 5.7611 0.8336 0.9257

U-Mamba (arXiv:2401.04722) and LightMUNet (arXiv:2403.05246) are both preprint models proposed for biomedical image segmentation and have not yet been formally published. The top score for each indicator is shown in bold. ↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. ASSD, average symmetric surface distance; DSC, dice similarity coefficient; FLOPs, floating point operations per second; JAC, Jaccard coefficient; OBJ_ASD, average surface distance between objects; Params, model params; VOE, volumetric overlap error.

Ultrasound image segmentation

The third segmentation task in our experiments focused on the segmentation of malignant tumor regions on BUSIs. This task required the models to distinguish between normal tissue, benign tumors, and malignant tumor regions, and perform accurate segmentation. The study findings have important implications for research on lesion segmentation under ultrasound and the early diagnosis of breast cancer.

On the BUSI dataset, the VMPM model achieved average scores of 0.7906, 0.2736, 24.7079, 11.8915, 0.7264, and 0.8564 for the DSC, VOE, ASSD, OBJ_ASD, JAC, and recall metrics, respectively (Table 3). Compared to the second-best performing model, VMPM showed absolute improvements of 0.71%, 1.11%, 5.09%, 1.11%, and 1.70% on the DSC, VOE, ASSD, JAC, and recall metrics, respectively. The second-best performing model also had higher parameter counts and computational costs than our VMPM model.

Table 3

Quantitative comparison of medical image segmentation by representative models on the BUSI dataset

Structure Model FLOPs (G)↓ Params (M)↓ DSC↑ VOE↓ ASSD↓ OBJ_ASD↓ JAC↑ Recall↑
CNN SegResNet (5) 62.41 6.29 0.7647 0.3141 27.5470 20.7096 0.6859 0.7976
UNet (4) 170.64 42.19 0.7046 0.3830 44.2565 60.6935 0.6170 0.5751
nnUNet (1) 466.25 125.56 0.7733 0.3519 33.1856 22.1834 0.6481 0.7753
Transformer UNETR (11) 105.63 87.51 0.7129 0.3688 40.1893 22.5552 0.6312 0.7654
Swin-UNETR (12) 75.77 25.12 0.7455 0.3301 32.9583 15.1986 0.6699 0.8210
Mamba U-Mamba 574.41 103.07 0.7623 0.3190 27.4389 15.8467 0.6810 0.8299
LKM-UNet (17) 1,066.62 255.64 0.7788 0.2847 26.8441 16.5064 0.7153 0.7935
LightMUNet 85.44 7.28 0.7378 0.3368 39.9766 17.6262 0.6632 0.8157
SwinUMamba (18) 177.42 55.06 0.7835 0.2876 26.0337 10.8414 0.7124 0.8394
Ours 175.76 49.23 0.7906 0.2736 24.7079 11.8915 0.7264 0.8564

U-Mamba (arXiv:2401.04722) and LightMUNet (arXiv:2403.05246) are both preprint models proposed for biomedical image segmentation and have not yet been formally published. The top score for each indicator is shown in bold. ↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. ASSD, average symmetric surface distance; BUSI, breast ultrasound images; DSC, dice similarity coefficient; FLOPs, floating point operations per second; JAC, Jaccard coefficient; OBJ_ASD, average surface distance between objects; Params, model params; VOE, volumetric overlap error.

Visualization experiments

The above analyses have quantitatively demonstrated the superiority of the proposed model compared to other methods. In this section, we provide a more intuitive description of the segmentation results using several examples from two datasets.

Qualitative evaluation

As Figure 2 shows, the proposed VMPM model focuses more on overall features than on locally similar features. For instance, in the feature visualization images from the BUSI dataset shown in Figure 2, all models except LKM-UNet (17), SwinUMamba (18), and our VMPM model incorrectly identified background regions as lesion areas, ignoring the overall shape features of the lesion regions themselves.

Figure 2 Visual comparison of medical image segmentation results. Green-colored regions represent correct predictions, orange-colored regions represent false positives, and blue-colored regions represent false negatives. 2D, two-dimensional; BUSI, breast ultrasound images; GT, ground truth.

In addition to emphasizing the overall features, the proposed VMPM model can also detect lesions with low contrast relative to their surroundings. For example, in the ISIC2018 dataset depicted in Figure 2, only our model accurately identified less prominent lesion areas; all the other models failed to do so.

These observations clearly indicate that the proposed model places greater emphasis on overall features when segmenting medical images.

Boundary sensitivity

Although medical image segmentation involves entire lesion regions, the main challenge lies in defining boundary areas, as these regions often exhibit very low contrast, making them difficult to visualize.

To address this issue, we zoomed into the segmentation details, particularly the boundary regions. As illustrated in Figure 3, the differences between our results and the ground truth were minimal, and these differences became more pronounced in complex scenarios. For instance, determining the boundaries of lesion regions in the AbdomenMR 2D images presented in Figure 3 was challenging due to the adhesion of multiple target regions and their low contrast with the surrounding background. Apart from the proposed VMPM model, other models often expanded the segmentation region to increase the hit rate (as illustrated by the orange areas in Figure 3).

Figure 3 Boundary sensitivity of medical image segmentation details. The segmentation results of the original images are displayed in the odd-numbered rows, while the even-numbered rows present magnified details of the marked regions from the corresponding odd-numbered rows. The green areas represent correct predictions, the orange areas represent false-positive results, and the blue areas represent false-negative results. 2D, two-dimensional; BUSI, breast ultrasound images; GT, ground truth.

Similar observations can be made from the visualized images of the ISIC2018 and BUSI datasets shown in Figure 3. Compared to all other models, our proposed VMPM model exhibited a superior ability to generate more precise boundaries for the lesion areas. This underscores the heightened sensitivity of our model to subtle variations in the boundary regions.

Ablation study

In this study, three ablation experiments were designed to thoroughly investigate the effectiveness of the two proposed modules and their contributions to model performance. First, starting with the baseline model, the 2DIPE and MB modules were incrementally incorporated, after which comparative experiments were performed to evaluate the impact of each module on model performance. Additionally, since the proposed 2DIPE is a generalizable method specifically designed for vision Mamba, it was integrated into five different vision Mamba models to compare segmentation performance before and after its incorporation. Further, the relationship between the performance improvement conferred by the 2DIPE module and the number of image patches was analyzed. Finally, comparative experiments were conducted to explore the impact of MB modules with different structures on the performance of the VMPM model. The experimental results from all three sets of experiments were visualized for comprehensive analysis.

Ablation analysis of 2D image positional embedding and multiscale fusion modules

To evaluate the efficacy of the newly conceived modules, a series of ablation experiments were performed on the ISIC2018 dataset with the baseline VMPM model. Detailed results are shown in Table 4.

Table 4

Ablation study of the positional embedding and multiscale fusion modules

Number Position embedding Multiscale feature fusion module DSC↑ VOE↓ JAC↑ Recall↑
1 × × 0.8951 0.1742 0.8258 0.9215
2 × √ 0.8983 0.1697 0.8303 0.9243
3 √ × 0.8999 0.1670 0.8330 0.9194
4 √ √ 0.9005 0.1664 0.8336 0.9257

↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. √ indicates that the module shown in the column header is used in this ablation study. × indicates that the module shown in the column header is not used in this ablation study. DSC, dice similarity coefficient; JAC, Jaccard coefficient; VOE, volumetric overlap error.

Impact of 2DIPE

The 2DIPE module was designed to enhance the relative spatial relationships among the 2D image features, thereby improving the model’s comprehension of global characteristics. As shown in Table 4, incorporating the 2DIPE module alone into the baseline model resulted in improvements of 0.48%, 0.72%, and 0.72% in the DSC, VOE, and JAC metrics, respectively, compared to the original model. Thus, the 2DIPE module enables the segmented targets to more closely approximate the reference targets in terms of overall morphology.

Impact of MB

The MB aims to supplement detailed information in 2D spatial image features. As presented in Table 4, adding the MB alone to the baseline model led to improvements of 0.32%, 0.45%, 0.45%, and 0.28% in the DSC, VOE, JAC, and recall metrics, respectively, relative to the original model. Thus, the MB enhances the segmentation accuracy of fine structural details compared to the 2DIPE.

Combined impact of 2DIPE and MB

The 2DIPE module strengthens the model’s attention to global features, while the MB module supplements fine-grained spatial details in 2D images. As illustrated in Table 4, integrating both modules simultaneously improved the DSC, VOE, JAC, and recall metrics by 0.54%, 0.78%, 0.78%, and 0.42%, respectively, compared to the original model.

As shown in Figure 4, the segmentation performance of the VMPM model was improved with the incorporation of the MB module or the 2DIPE module, demonstrating the efficacy of both modules in improving the performance of the model. Moreover, when both modules were simultaneously integrated into the VMPM model, the morphology of the segmentation heatmap was closest to the ground truth, further validating the synergistic effect between these two modules.

Figure 4 Comparative heatmaps illustrating the performance of 2DIPE and MB. 2DIPE, two-dimensional position embedding; GT, ground truth; MB, multiscale feature fusion block.

Effectiveness analysis of positional embedding module

To evaluate the effectiveness of the 2DIPE module in vision Mamba models, a comparative analysis of segmentation performance was conducted on the ISIC2018 dataset using baseline vision Mamba models [U-Mamba, LKM-UNet (17), LightMUNet, and SwinUMamba (18)] and the proposed VMPM before and after 2DIPE module integration. Detailed results are presented in Table 5.

Table 5

Effectiveness analysis of the 2D image positional embedding module evaluated on the ISIC2018 dataset

Model PE DSC↑ VOE↓ JAC↑ Recall↑ Patches
U-Mamba × 0.8880 0.1854 0.8146 0.9159 4×4
√ 0.8892 0.1807 0.8193 0.9090
SwinUMamba (18) × 0.8854 0.1869 0.8131 0.9101 8×8
√ 0.8950 0.1733 0.8267 0.9251
VMPM (ours) × 0.8951 0.1742 0.8258 0.9215 8×8
√ 0.8999 0.1670 0.8330 0.9194
LKM-UNet (17) × 0.8736 0.2041 0.7959 0.9082 64×64
√ 0.8925 0.1773 0.8227 0.9104
LightMUNet × 0.8782 0.1964 0.8036 0.8966 512×512
√ 0.8788 0.1961 0.8039 0.9000

U-Mamba (arXiv:2401.04722) and LightMUNet (arXiv:2403.05246) are both preprint models proposed for biomedical image segmentation and have not yet been formally published. ↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. √ indicates that the module shown in the column header is used in this ablation study. × indicates that the module shown in the column header is not used in this ablation study. 2D, two-dimensional; DSC, dice similarity coefficient; JAC, Jaccard coefficient; PE, position embedding; VMPM, vision Mamba with 2DIPE and MB model; VOE, volumetric overlap error.

As shown in Table 5, the U-Mamba model exhibited improvements of 0.12%, 0.47%, and 0.47% in the DSC, VOE, and JAC metrics, respectively, after 2DIPE module integration, while recall decreased by 0.69%. The SwinUMamba (18) model showed improvements of 0.96%, 1.36%, 1.36%, and 1.50% in the DSC, VOE, JAC, and recall metrics, respectively. The VMPM model demonstrated improvements of 0.48%, 0.72%, and 0.72% in the DSC, VOE, and JAC metrics, respectively, with a 0.21% reduction in recall. The LKM-UNet (17) model achieved increases of 1.89%, 2.68%, 2.68%, and 0.22% in the DSC, VOE, JAC, and recall metrics, respectively. The LightMUNet model exhibited marginal improvements of 0.06%, 0.03%, 0.03%, and 0.34% in the DSC, VOE, JAC, and recall metrics, respectively.

As shown in Figure 5, segmentation heatmaps of the U-Mamba, LKM-UNet (17), LightMUNet, SwinUMamba (18), and VMPM models were compared before and after 2DIPE module integration. A clear performance improvement was observed across the majority of the models after 2DIPE module integration.

Figure 5 Comparative heatmaps illustrating the performance of the 2D image positional embedding module. 2D, two-dimensional; 2DIPE, two-dimensional position embedding; GT, ground truth; MB, multiscale feature fusion block; VMPM, vision Mamba with 2DIPE and MB model.

Ablation analysis of multiscale fusion module

To investigate the impact of the shortcut connection branch and the placement of the MDFB on the performance of the vision Mamba model (Figure 6), four architectural configurations were evaluated based on the VMPM baseline model: (I) the use of a Mamba module branch only; (II) the placement of a shortcut connection branch and MDFB before the Mamba module branch; (III) the placement of a shortcut connection branch and MDFB after the Mamba module branch; and (IV) the arrangement of the shortcut connection branch and MDFB in parallel with the Mamba module branch. The performance of these four configurations was systematically assessed on the ISIC2018 dataset, with detailed results presented in Table 6.

Figure 6 Comparison diagram of VMPM network architectures with four different MB structures. “Only Mamba” refers to the VMPM baseline model with only the Mamba module branch. “Before scanning” refers to the placement of a shortcut connection branch and MDFB before the Mamba module branch. “After scanning” refers to the placement of a shortcut connection branch and MDFB after the Mamba module branch. “Parallel scanning” refers to the arrangement of a shortcut connection branch and MDFB in parallel with the Mamba module branch. 2DIPE, two-dimensional position embedding; CONV, convolution operation; MB, multiscale feature fusion block; MDFB, multiscale detail feature branch; VMPM, vision Mamba with 2DIPE and MB model; VSSB, visual state space block.

Table 6

Contrastive experiment of four different structures of MB on the ISIC2018 dataset

Position of shortcut connection branch and MDFB DSC↑ VOE↓ JAC↑ Recall↑
Only Mamba 0.8951 0.1742 0.8258 0.9215
Before scanning 0.8971 0.1713 0.8287 0.9235
After scanning 0.8974 0.1709 0.8291 0.9297
In parallel scanning 0.8983 0.1697 0.8303 0.9243

↑ indicates that a higher value of the index corresponds to better model performance. ↓ indicates that a lower value of the index corresponds to better model performance. DSC, dice similarity coefficient; JAC, Jaccard coefficient; MB, multiscale feature fusion block; MDFB, multiscale detail feature branch; VOE, volumetric overlap error.

As shown in Table 6, placing the shortcut connection branch and MDFB before the Mamba module branch resulted in increases of 0.20%, 0.29%, 0.29%, and 0.20% in the DSC, VOE, JAC, and recall metrics, respectively, compared to the baseline model with the Mamba module branch only. When the shortcut connection branch and MDFB were placed after the Mamba module branch, these metrics increased by 0.23%, 0.33%, 0.33%, and 0.82%, respectively. The parallel placement of the Mamba module branch with the shortcut connection branch and MDFB yielded the most significant improvements, with increases of 0.32%, 0.45%, 0.45%, and 0.28% in the DSC, VOE, JAC, and recall metrics, respectively.

As illustrated in Figure 7, this study generated comparative segmentation heatmaps of the VMPM model with four different structures. Notably, the segmentation performance of the model improved in all configurations following the incorporation of the shortcut connection branch and MDFB. However, the most pronounced enhancement in segmentation quality was achieved when the shortcut connection branch and MDFB were arranged parallel to the Mamba module branch, as evidenced by the well-defined boundaries and complete morphological preservation demonstrated in the rightmost column of the segmentation heatmaps (Figure 7).

Figure 7 Comparative heatmaps illustrating the effect of the structures of the MB. GT, ground truth; MB, multiscale feature fusion block.

Discussion

This experimental study showed that the combination of 2DIPE and MB yielded excellent results, achieving impressive performance on three publicly available medical image segmentation datasets. As shown in Table 1, the VMPM model achieved the optimal scores across almost all metrics (except the ASSD metric), but the OBJ_ASD metric is more informative for multi-organ segmentation tasks than the ASSD metric. This is because the ASSD metric measures the average nearest distance between all boundary pixels of the predicted result and all boundary pixels of the ground truth, whereas the OBJ_ASD metric calculates the average nearest distance between the boundary pixels of each predicted object in the segmented image and the corresponding boundary pixels of the ground-truth object. Notably, the VMPM exhibited a 13.00% improvement in the OBJ_ASD metric compared to the second-best model. Further, as Tables 2 and 3 and Figures 2 and 3 show, VMPM also achieved good segmentation results on the dermoscopy images and BUSIs.

As shown in Table 4 and Figure 4, incorporating the 2DIPE module alone into the baseline model resulted in improvements across almost all metrics except recall, which decreased by 0.21%. Thus, while the 2DIPE module enhances the model’s focus on global features, it may overlook certain fine-grained pixel-level details compared to the original model. As illustrated in the third row of Figure 4, the baseline VMPM model generated a heatmap with a larger segmented area but less distinct boundaries. After incorporating the 2DIPE method, the segmented area became smaller but exhibited sharper boundaries. Although some detailed pixels may be lost due to the reduction in segmentation area, overall segmentation performance was improved. Additionally, the third row of Figure 4 demonstrates that the MB module enhances the segmentation accuracy of fine structural details compared to the 2DIPE module. Finally, the last row of Figure 4 shows that the two modules exhibited synergistic effects, collectively enhancing model performance.

As shown in Table 5 and Figure 5, by introducing 2D spatial position encoding into the vision Mamba model, 2DIPE incorporates spatial position information into the image patch embedding vectors without increasing computational complexity, thereby enhancing the model’s overall understanding of image features and improving its performance. Notably, certain models experienced slight declines in recall after 2DIPE module integration, which may be because the 2DIPE module enhances the model’s focus on global features, but certain fine-grained pixel-level details are overlooked compared to the original model. However, overall, the 2DIPE module enhanced the performance of the vision Mamba models.

Additionally, it was observed during the experiments that each image needs to be partitioned into patches before being input into the Mamba module, with the 2DIPE assigning a 2D positional encoding vector to each patch. Consequently, the performance enhancement conferred by the 2DIPE module may be correlated with the number of image patches. As illustrated in Table 5, when a 512×512 image was divided into 16 patches, the DSC improvement was only 0.12%. Conversely, when the image was partitioned into 64 patches, the DSC improvements for SwinUMamba (18) and our VMPM model increased to 0.96% and 0.48%, respectively. Further partitioning the image into 4,096 patches resulted in a 1.89% DSC improvement. However, when positional encoding was applied to every individual pixel, the DSC improvement was only 0.06%. These findings suggest that when the number of patches is limited, the model can infer the relative spatial positions of features in patches without explicit positional encoding. However, as the number of patches increases, 2D positional encoding becomes essential for the model to accurately localize features. Since the model learns features from patches, pixel-level positional encoding provides negligible benefit for understanding feature localization in 2D space for pixel-wise scanning vision Mamba models.

As shown in Table 6 and Figure 7, the MB module, with its shortcut connection branch and MDFB, supplements the scanning results of the Mamba module branch with multiscale detail information, thus improving the model’s segmentation performance. However, the position of the shortcut connection branch and MDFB influences the extent of this performance improvement. The parallel placement of the shortcut connection branch and MDFB with the Mamba module branch maximally enhanced the performance of the VMPM model. This is attributed to the ability of the shortcut connection branch and MDFB to provide multiscale detailed information for the Mamba module branch’s output. When placed before the Mamba module branch, some multiscale details are lost during the Mamba module branch’s scanning process. Conversely, if positioned after the Mamba module branch, the shortcut connection branch and MDFB extract details from features already degraded by Mamba processing, leading to further information loss. By arranging the shortcut connection branch and MDFB in parallel with the Mamba module branch, the Mamba module branch captures global features from the previous layer’s output, while the shortcut connection branch and MDFB extract multiscale local details. The subsequent fusion of their outputs enables the simultaneous extraction of both global features and multiscale local details, thereby optimizing model performance.

In summary, integrating the 2DIPE and MB into the vision Mamba model represents an innovative and feasible approach. It addresses the issue that the patch embedding vectors scanned by Mamba lack 2D spatial position information and mitigates the loss of local feature pixels. However, it is important to note that the experimental process has certain limitations. Although the VMPM model demonstrated superior segmentation performance on the ultrasound image lesion segmentation dataset compared to similar models (achieving a DSC of 0.7906), significant room for improvement remains. The model fails to precisely delineate the boundaries of some lesion tissues and may overlook certain lesion regions. Future research efforts will focus on optimizing the VMPM model specifically for the task of ultrasound image lesion segmentation.


Conclusions

This study developed a novel medical image segmentation model, called VMPM, which integrates the 2DIPE and MB modules. The 2DIPE module significantly enhances the model’s comprehensive understanding of spatial features by introducing a 2D position encoding approach to the vision Mamba architecture. Concurrently, the MB incorporates multiscale detailed information into the vision Mamba framework, thereby further improving model performance. The experimental results demonstrated that the synergistic combination of 2DIPE and MB effectively enhanced segmentation accuracy, with outstanding performance observed in both overall contour delineation and edge refinement. The proposed VMPM model exhibits considerable generalizability and could be readily extended to other 2D image analysis tasks, such as remote sensing image recognition (31), strawberry image reconstruction and segmentation in agricultural production (32), rose image object recognition (33), and lawn image segmentation (34). Further, while the VMPM model performs well in 2D medical image segmentation tasks, three-dimensional (3D) images can provide greater assistance for precise clinical diagnosis. Consequently, subsequent research plans include extending the VMPM model to 3D medical image segmentation tasks and other scenarios to further enhance its versatility.


Acknowledgments

None.


Footnote

Funding: This work was supported by the Research Project on the Application of Medical Artificial Intelligence of the Chinese National Health Commission (grant No. YLXX24AIA040).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2178/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
  2. Ruan J, Li J, Xiang S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. ACM Transactions on Multimedia Computing, Communications and Applications 2025. doi: 10.1145/3767748. [Crossref]
  3. Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640-51. [Crossref] [PubMed]
  4. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. Cham: Springer; 2015:234-41.
  5. Myronenko A. 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In: Crimi A, Bakas S, Kuijf H, Keyvan F, Reyes M, van Walsum T, editors. Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2018. Lecture Notes in Computer Science. Cham: Springer; 2018:311-20.
  6. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929 [Preprint]. 2021. Available online: https://arxiv.org/abs/2010.11929
  7. Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces. Proceedings of the First Conference on Language Modeling; October 7-9, 2024; Philadelphia, PA, USA. Available online: https://openreview.net/forum?id=tEYskw1VY2
  8. Zhu L, Liao B, Zhang Q, Wang X, Liu W, Wang X. Vision mamba: efficient visual representation learning with bidirectional state space model. Proceedings of the 41st International Conference on Machine Learning (ICML'24). 2024;235:62429-42.
  9. Liu Y, Tian Y, Zhao Y, Yu H, Xie L, Wang Y, Ye Q, Jiao J, Liu Y. VMamba: Visual state space model. Advances in Neural Information Processing Systems; December 10-15, 2024; Vancouver, Canada. 2024:103031-63.
  10. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision. 2021:10012-22.
  11. Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth H, Xu D. UNETR: Transformers for 3D medical image segmentation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022:574-84.
  12. Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI brain lesion workshop. Cham: Springer International Publishing; 2021:272-84.
  13. Gu A, Dao T, Ermon S, Rudra A, Ré C. HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems; Vancouver, Canada. 2020;33:1474-87.
  14. Gu A, Goel K, Ré C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv:2111.00396 [Preprint]. 2022. Available online: https://arxiv.org/abs/2111.00396
  15. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation 1997;9:1735-80. [Crossref] [PubMed]
  16. Ruan J, Li J, Xiang S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. ACM Trans Multimedia Comput Appl 2025. doi: 10.1145/3767748
  17. Wang J, Chen J, Chen DZ, Wu J. LKM-UNet: Large kernel vision Mamba UNet for medical image segmentation. In: Linguraru MG, Dou Q, Feragen A, Giannarou S, Glocker B, Lekadir K, Schnabel JA, editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science. Cham: Springer; 2024:360-70.
  18. Liu J, Yang H, Zhou HY, Xi Y, Yu L, Yu Y, Liang Y, Shi G, Zhang S, Zheng H, Wang S. Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer; 2024:615-25.
  19. Dang TDQ, Nguyen HH, Tiulpin A. LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation. Proceedings of the Asian Conference on Computer Vision (ACCV) 2024. Available online: https://openaccess.thecvf.com/content/ACCV2024/papers/Dang_LoG-VMamba_Local-Global_Vision_Mamba_for_Medical_Image_Segmentation_ACCV_2024_paper.pdf
  20. Zhang T, Yuan H, Qi L, Zhang J, Zhou Q, Ji S, Yan S, Li X. Point Cloud Mamba: Point cloud learning via state space model. 2024. Available online: https://dl.acm.org/doi/abs/10.1609/aaai.v39i10.33098
  21. Lin WT, Lin YX, Chen JW, Hua KL. PixMamba: Leveraging state space models in a dual-level architecture for underwater image enhancement. Proceedings of the Asian Conference on Computer Vision 2024:176-91.
  22. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 2017. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  23. Guo H, Li J, Dai T, Ouyang Z, Ren X, Xia ST. Mambair: A simple baseline for image restoration with state space model. European conference on computer vision. Cham: Springer; 2024:222-41.
  24. Du Q, Wang L, Chen H. A mixed Mamba U-net for prostate segmentation in MR images. Sci Rep 2024;14:19976. [Crossref] [PubMed]
  25. Xing Z, Ye T, Yang Y, Liu G, Zhu L. SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. International Conference on Medical Image Computing and Computer Assisted Intervention. Cham: Springer; 2024:578-88.
  26. Chen Q, Xu Z, Fang X. CaVMamba: convolution-augmented VMamba for medical image segmentation. The Visual Computer 2025;41:5855-72.
  27. Tustin A. A method of analyzing the behavior of linear systems in terms of time series. Journal of the Institution of Electrical Engineers - Part IIA: Automatic Regulators and Servo Mechanisms 1947;94:130-42.
  28. Ma J, Zhang Y, Gu S, Ge C, Ma S, Young A, et al. Unleashing the strengths of unlabelled data in deep learning-assisted pan-cancer abdominal organ quantification: the FLARE22 challenge. Lancet Digit Health 2024;6:e815-26. [Crossref] [PubMed]
  29. Codella NCF, Gutman D, Emre Celebi M, Helba B, Marchetti MA, Dusza SW, Kalloo A, Liopyris K, Mishra N, Kittler H, Halpern A. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC). ISBI 2018; Washington, DC, USA. IEEE; 2018:168-72.
  30. Al-Dhabyani W, Gomaa M, Khaled H, Fahmy A. Dataset of breast ultrasound images. Data Brief 2020;28:104863. [Crossref] [PubMed]
  31. Chen H, Song J, Han C, Xia J, Yokoya N. ChangeMamba: Remote Sensing Change Detection With Spatiotemporal State Space Model. IEEE Transactions on Geoscience and Remote Sensing 2024;62:1-20.
  32. Zhao F, He Y, Song J, Wang J, Xi D, Shao X, Wu Q, Liu Y, Chen Y, Zhang G, Zhang C, Chen Y, Chen J, Mizuno K. Smart UAV-assisted blueberry maturity monitoring with Mamba-based computer vision. Precision Agric 2025;26:56.
  33. You S, Li B, Chen Y, Ren Z, Liu Y, Wu Q, Tao J, Zhang Z, Zhang C, Xue F, Chen Y, Zhang G, Chen J, Wang J, Zhao F. Rose-Mamba-YOLO: an enhanced framework for efficient and accurate greenhouse rose monitoring. Front Plant Sci 2025;16:1607582. [Crossref] [PubMed]
  34. Tao J, Qiao Q, Song J, Sun S, Chen Y, Wu Q, Liu Y, Xue F, Wu H, Zhao F. Deep Learning-Driven Automatic Segmentation of Weeds and Crops in UAV Imagery. Sensors (Basel) 2025;25:6576. [Crossref] [PubMed]
Cite this article as: Zhang X, Li R, Rao J, Wang M, Zhang J, Zhao L. Enhancing vision Mamba with two-dimensional position embedding and multiscale fusion for medical image segmentation. Quant Imaging Med Surg 2026;16(4):275. doi: 10.21037/qims-2025-aw-2178
