Application of transformer models in medical image segmentation: a narrative review

Yuan Xu; Yang Peng; Chi Zhang; Kai Jiang; Xiaoming She; Lin Feng

doi:10.21037/qims-2025-aw-2381

Review Article

Application of transformer models in medical image segmentation: a narrative review

Yuan Xu^1# , Yang Peng^1# , Chi Zhang² , Kai Jiang² , Xiaoming She³, Lin Feng^1,2

¹School of Computer Science, Sichuan Normal University, Chengdu, China; ²Sichuan Internet College, Sichuan Normal University, Chengdu, China; ³Sichuan Bank Co., Ltd., Chengdu, China

Contributions: (I) Conception and design: Y Xu; (II) Administrative support: L Feng, X She; (III) Provision of study materials or patients: All authors; (IV) Collection and assembly of data: Y Xu, Y Peng; (V) Data analysis and interpretation: Y Xu, Y Peng; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work.

Correspondence to: Xiaoming She, Master’s Degree. Sichuan Bank Co., Ltd., No. 715, North Section of Hupan Road, Chengdu 610000, China. Email: 839949655@qq.com; Lin Feng, Doctoral Degree. School of Computer Science, Sichuan Normal University, No. 1819, Section 2, Chenglong Avenue, Longquanyi District, Chengdu 610068, China; Sichuan Internet College, Sichuan Normal University, Chengdu, China. Email: fenglin@sicnu.edu.cn.

Background and Objective: Transformer-based medical image segmentation plays a crucial role in healthcare applications, facilitating precise diagnosis, treatment planning, and disease monitoring. Traditionally, convolutional neural networks (CNNs), which excel at local feature extraction, have dominated this field. However, they have a limited ability to capture the long-range dependencies within images and thus face difficulty in handling the complex, interconnected structures present in medical data. Transformer modeling, as an advanced tool in natural language processing (NLP), has also demonstrated its value in computer vision tasks. With its increasing popularity, research on its application in medical imaging has grown significantly. Nowadays, several models have demonstrated that combining CNNs and transformers can effectively capture both local and global information, thereby enhancing segmentation performance. A review was conducted to characterize the research on transformer models applied in medical image segmentation.

Methods: Databases including Google Scholar, arXiv, ResearchGate, Microsoft Academic, PubMed, and Semantic Scholar, as well as large language models including ChatGPT and DeepSeek, were used to search for the latest developments in this field. Specifically, English language literature published from 2021 to 2025 was included in the review.

Key Content and Findings: In this investigation, we explored a variety of methods for integrating Transformer with traditional U-shaped architectures, as well as the efficiency disparities exhibited by different approaches. Through this investigation, we conducted research centered on the U-shaped architectures of pure Transformer and hybrid Transformer, and analyzed models that have been proven to possess outstanding performance. Segmentation models can be divided into pure transformer and hybrid architectures. In this review, the applications of transformer models in medical segmentation were examined, the performance of these models on different datasets was summarized, and advanced strategies were quantitatively analyzed. Multimodal large language models have been reported in the recent literature, signifying that these no longer constitute a speculative technology but rather an emerging development in the development field.

Conclusions: The number of pure transformer models is relatively small compared with that of and hybrid models, with the latter generally demonstrating superior performance. Among hybrid models, those integrating the transformer into the decoder typically outperform other hybrid variants; however, the number of such models is also limited. Public datasets in medical image segmentation are highly diverse, with corresponding datasets available for a variety of data modalities. Further research will focus on lightweight structural design, prior knowledge integration, data utilization efficiency enhancement, and foundational model scale expansion to improve model practicality and generalization in low-compute, low-data scenarios. Finally, the future developments for improving the effectiveness of the transformer models in the medical field were summarized.

Keywords: Transformer; medical imaging; hybrid architectures; image segmentation; multimodal large language models

Submitted Nov 10, 2025. Accepted for publication Mar 17, 2026. Published online Apr 14, 2026.

doi: 10.21037/qims-2025-aw-2381

Introduction

Medical image segmentation is a core component of medical image processing, and its primary purpose is to precisely delineate specific anatomical structures, lesion areas, or targets of interest from medical images and extract quantitative information such as location, shape, and texture (1). The key elements of this process include the objects of segmentation (e.g., organs and lesions) and the technical task (e.g., achieving target-background separation via the classification of pixels through algorithms). Generally, medical image processing. Generally, medical image segmentation encompasses preprocessing, feature extraction, segmentation algorithms, and postprocessing. This technology, medical image segmentation, can assist in diagnosis, support treatment planning, and provide quantitative data for scientific research and teaching.

Medical image segmentation technology first emerged in the 1970s and 1980s and has evolved alongside the widespread adoption of medical imaging techniques, undergoing three distinct phases of development. In its early phase, segmentation was primarily conducted via traditional algorithms such as threshold segmentation, depending on manually designed features, which limited its ability to handle complex structures. In the intermediate phase, machine learning methods such as active contour models were introduced, thus enhancing robustness. In the modern era (2010s to the present), deep learning has emerged as the cornerstone approach, with models such as fully convolutional networks (FCNs) and U-Net enabling end-to-end pixel-level prediction (2). By integrating multimodal data fusion and attention mechanisms, accuracy has been further improved (3). The current trends in this field include the development of lightweight models, semisupervised learning, and interpretability research (4).

Before the advent of the transformer model, the field of natural language processing (NLP) primarily relied on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for processing sequential data.

However, both of these networks hold inherent limitations: RNNs have low training efficiency, are prone to gradient vanishing or explosion in the handling of long sequences, and face difficulty capturing long-range dependencies. CNNs, on the other hand, are constrained by the local receptive fields of their convolutional kernels, which limit their ability to grasp the global context and model sequential positional information effectively (5). Although the introduction of the attention mechanism in 2014 improved the handling of long-range dependencies, integration with RNNs was still required and failed to fully address the issue of parallelization (6). In 2017, a landmark paper titled “Attention Is All You Need” was published by Ashish Vaswani’s Google Brain Team in collaboration with the University of Toronto. This work introduced the first sequence model built solely on self-attention mechanisms, enabling efficient capture of global dependencies through parallel computation. The model achieved breakthrough performance in the 2014 Workshop on Statistical Machine Translation 4 English-to-German translation task, demonstrating over a 10-fold improvement in training speed compared to the best existing models at the time, achieving a Bilingual Evaluation Understudy (BLEU) score of 28.4 and thereby confirming the superiority of the self-attention mechanism (6).

Transformers, which have been widely recognized as state-of-the-art (SOA) tools in NLP, have also been recognized for their value in computer vision tasks. With their growing popularity, they have also been extensively examined in terms of their application in the relatively complex domain of medical imaging (7). They substantially improve the capability to identify lesions in early diagnosis, providing strong support for timely tumor intervention (8). For instance, in skin cancer diagnosis, the Gaussian Splatting-Transformer UNet model integrates two-dimensional (2D) Gaussian splatting technology with the Transformer UNet architecture, effectively addressing the challenge of missed detections of lesions with blurred boundaries and irregular shapes. This model achieved a Dice coefficient of 92.1% on the International Skin Imaging Collaboration dataset, providing a 7.3% improvement over the traditional U-Net and a 15% reduction in the misdiagnosis rate (9). For brain tumor boundary identification, the UNETR model uses the 3D Swin Transformer to capture the spatial dependencies between tumors and normal tissues. Combining this capacity with multimodal magnetic resonance imaging (MRI) features, it achieved a Dice coefficient of 88.5% for tumor core segmentation on the Brain Tumor Segmentation (BraTS) dataset, successfully assisting doctors in detecting tiny nodules overlooked by traditional methods (10).

However, several challenges remain in the application of transformer models. The quadratic complexity of the self-attention mechanism leads to a surge in memory consumption in the processing of high-resolution 3D medical images, limiting deployment on portable devices (11). Meanwhile, the high cost of medical annotation and the limited size of datasets render transformer models prone to overfitting in small-sample scenarios and to unstable performance during cross-domain transfer. Additionally, inadequate local feature modeling results in blurred segmentation boundaries (12). Measures to address these issues include adopting window attention to reduce complexity (4), integrating self-supervised pretraining to alleviate data scarcity [see Chen et al. (2021) in Appendix 1], employing CNN-Transformer hybrid architectures to enhance local feature extraction (13), and developing multimodal fusion and temporal modeling mechanisms. We conducted a systematic review to comprehensively characterize the developmental trajectory of transformer models in the field of medical image segmentation, analyzed the performance of its various variants on different datasets from the perspective of architectural innovation, summarized the core evaluation metrics in medical image segmentation tasks, and reviewed the medical image segmentation datasets and their typical data formats. We present this article in accordance with the Narrative Review reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2381/rc).

Methods

First introduced in 2021, Vision Transformer (ViT) (14) was the first ViT model and formed the foundation for subsequent research on ViTs. TransUNet [see Chen et al. (2021) in Appendix 1] pioneered the systematic application of the transformer architecture to medical image segmentation tasks, fully demonstrating its exceptional performance as an encoder. Researchers have successively proposed a variety of related models for image segmentation. To characterize the status of transformers in this field and to assess their potential in the field of medical image segmentation, we conducted a systematic review of relevant research papers published between 2021 and 2025. This review involved a range of academic resources and tools, including search engines such as Google Scholar, arXiv, ResearchGate, Microsoft Academic, PubMed, and Semantic Scholar, as well as large language models such as ChatGPT and DeepSeek. Searches were conducted with keywords such as “medicine”, “transformer”, “CNN-Transformer hybrid architecture for medical image segmentation”, “segmentation”, “imaging”, and “vision”. Table 1 provides the specific details regarding the literature search and analysis.

Table 1

Literature search strategy

Items	Specification
Date of search	August 10, 2025
Databases and other sources searched	Google Scholar, arXiv, ResearchGate, Microsoft Academic, PubMed, Semantic Scholar, ChatGPT, and DeepSeek
Search terms used	“Medical” and “transformers” and “CNN-Transformer Hybrid for Medical Image Segmentation” and “segmentation” and “imaging” or “vision”
Timeframe	2021 to 2025
Inclusion criteria	English-language literature, including published manuscripts and preprint articles, on medical image segmentation with transformer-based models
Exclusion criteria	Unpublished manuscript, conference abstracts, and literature on classification networks, non–medical image segmentation, and models without transformers or attention
Selection process	Literature was selected by the author Y.X.

Transformers

Before the emergence of the transformer model, the field of NLP primarily employed RNNs and CNNs to process sequential data. However, there is difficulty in using RNNs to capture long-range dependencies due to issues with parallel computing, gradient vanishing or explosion, and their low training efficiency. Meanwhile, CNNs, being constrained by their local perception characteristics, possess limited ability to grasp the global context and model sequential positional information. Although additive attention was introduced in 2014, it still required integration with RNNs and could not fully resolve the parallelization issue (15). It was not until sequence models solely based on self-attention mechanisms were proposed in “Attention Is All You Need” that a breakthrough was achieved in machine translation tasks (6).

Overall architecture

The encoder-decoder architecture is a key component of deep learning and consists primarily of the compression of input information into an intermediate representation and the corresponding generation of target sequences. The general architecture of transformer models is as follows: The encoder transforms the input sequence in symbolic form $(x_{1}, \dots, x_{n})$ into a sequence of continuous representations $z = (z_{1}, \dots, z_{n})$ . Given z, the decoder generates the output sequence $(y_{1}, \dots, y_{m})$ element by element. Each step of generation in this model involves an autoregressive approach, in which previously generated symbols are used as additional input in the generation of new symbols. Transformer uses stacked self-attention mechanisms in the encoder and point-wise fully connected layers in the decoder; these architectures are shown in the left and right panels of Figure 1, respectively.

Figure 1 Transformer model architecture. This figure was adapted from an open access article (7) under the terms of the Creative Commons Attribution License 4.0 (CCBY).

The encoder-decoder stack

Encoder

Following the standard transformer architecture proposed by Vaswani et al. (6), the encoder is constructed by stacking 6 identical layers, as illustrated in the left panel of Figure 1. Each encoder layer consists of two sublayers: a self-attention sublayer and a position-wise feed-forward sublayer. Residual connections and layer normalization are applied around each sublayer to facilitate stable and efficient information propagation.

The primary function of the transformer encoder is to convert an input sequence into a sequence of continuous vector representations. For example, in machine translation, the input tokens are first embedded and augmented with positional encoding to incorporate sequential order information. The self-attention mechanism then models contextual dependencies among tokens, enabling the encoder to capture both semantic and syntactic relationships within the sequence.

To enhance training stability and representation learning, residual connections and layer normalization are employed, which help mitigate optimization difficulties in deep networks, such as gradient degradation. The feed-forward network (FFN) further applies position-wise nonlinear transformations to each token representation, thereby increasing the model’s expressive capacity. Finally, through successive layers of attention, feed-forward transformations, residual connections, and normalization, the encoder produces high-level feature representations that are subsequently passed to the decoder.

Decoder

The decoder is also composed of a stack of 6 identical layers (6), as illustrated in the right panel in Figure 1. In contrast to the encoder, the decoder is composed of layers, each consisting of three sublayers: a masked self-attention sublayer, an encoder-decoder (cross)-attention sublayer, and a position-wise feed-forward sublayer. Residual connections and layer normalization are applied around each sublayer to ensure stable information flow during training.

The primary function of the decoder is to generate output sequences in an autoregressive manner based on the representations produced by the encoder. To this end, the masked self-attention sublayer enforces the autoregressive property by restricting each position to attend only to previously generated tokens. This masking mechanism prevents access to future information during both training and inference, thereby ensuring that the model learns valid conditional dependencies.

The encoder-decoder attention sublayer allows the decoder to attend to the encoder outputs, effectively aligning the generated tokens with the relevant source representations. Through this cross-attention mechanism, the decoder integrates contextual information from the input sequence while maintaining causal generation behavior. Finally, the feed-forward sublayer applies position-wise nonlinear transformations to enhance the expressive capacity of token representations. Together, these components enable probabilistic sequence modeling via chain-rule factorization, providing a principled foundation for autoregressive sequence generation.

Attention

When observing objects, people tend to focus their attention on key elements and give priority to relatively important information. Essentially, the attention mechanism in deep learning shares a high degree of similarity with humans’ selective visual attention mechanism. Both mechanisms involve sifting out a small amount of key information from a vast amount of data and focusing on it while filtering out and ignoring irrelevant information.

In the field of machine learning, achieving this effect primarily relies on attention functions, which map a query and a set of key-value pairs to an output. The core idea is to generate the output through a weighted sum of values, with the weights dynamically calculated by a compatibility function between the query and the key.

Scaled dot-product attention

Scaled dot-product attention is the central attention mechanism in transformer models, and it computes dynamic weights through dot-product similarity to achieve global associations among sequence elements (6). Essentially, it matches query vectors with key vectors and aggregates value vectors in a weighted manner. The entire process is implemented for efficient parallelization through pure matrix operations, and its mathematical formulation is expressed as follows:

$Attention (Q,K,V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V$ [1]

The “scaled dot-product” attention mechanism is computed synchronously for a set of projected queries, keys, and values. These queries, keys, and values respectively, form the query matrix Q, the key matrix K, and the value matrix V, with their corresponding vector dimensions being dq, dk, and dv. Among them, Q, K, and V serve as input parameters for the scaled dot-product attention. In this mechanism, the query matrix Q undergoes a dot-product operation with the key matrix K, generating an original similarity matrix. This matrix reflects the association strength between each query vector (from Q) and all key vectors (from K). To prevent the softmax function from entering the gradient saturation region and causing the gradient vanishing problem, the similarity matrix needs to be divided by the scaling factor $\sqrt{d_{k}}$ (where dk is the dimension of matrix K) to adjust the similarity. Subsequently, a softmax calculation is performed on the scaled matrix to transform the similarity into probability distribution weights. Once these probability distribution weights are obtained, a weighted sum can be performed on the value matrix V, and thus the final attention output is generated. The entire process is illustrated in Figure 2A.

Figure 2 Attention mechanism of a transformer. (A) Scaled dot-product attention. (B) Multihead self-attention. This figure was adapted from an open access article (7) under the terms of the Creative Commons Attribution License 4.0 (CCBY). K, key; MatMul, matrix multiplication; Q, query; V, value.

Multihead self-attention (MHSA)

MHSA is the key innovation of the transformer architecture, enabling the model to simultaneously capture semantic information across different dimensions by parallelizing multiple sets of attention mechanisms (6). Its design addresses the representational limitations of single-head attention, significantly enhancing the model’s ability to capture complex dependencies. The entire process of MHSA can be expressed by the following formula:

$MultiHead (Q,K,V) = Concat ({head}_{1}, ..., {head}_{h}) W^{O}$ [2]

where

${head}_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})$ [3]

and where $W_{i}^{Q}$ , $W_{i}^{K}$ , and $W_{i}^{V}$ respectively represent the linear projection matrices for the query, key, and value corresponding to the i-th attention head, with W^O representing the linear projection matrix for the output of the multihead attention.

The input vectors of MHSA are mapped to the low-dimensional subspaces of query Q, key K, and value V through h sets of independent linear projections. This process is illustrated in Figure 2B. Subsequently, each set of projections conducts scaled dot-product attention operations in parallel, independently calculates attention weights, and aggregates value vectors, enabling each attention head to focus on different semantic subspaces (e.g., grammatical structures, semantic associations, or anaphoric relations). Finally, the outputs of all attention heads are concatenated and then fused through a linear transformation W^O to be restored to the original dimension.

Feedforward network

The FFN is another critical element of the transformer module and follows immediately after the multihead attention sublayer (6). Essentially, the FFN is a simple two-layer fully connected network, with the core process being as follows: dimension elevation → activation processing → dimension reduction.

In the original transformer architecture, the FFN is composed of two fully connected (i.e., linear) layers, with a nonlinear activation function embedded between them. Specifically, the first linear layer is responsible for mapping the input features into a higher-dimensional space, which is followed by processing with the rectified linear unit activation function. Subsequently, the second linear layer maps the features back to their original dimensions. Formally, the FFN can be expressed as follows:

$FFN (x) = \max (0, x W + b_{1}) W_{2} + b_{2}$ [4]

where W₁ and W₂ are learnable weight matrices, while b₁ and b₂ are bias terms. The FFN operates in a “position-aware” manner, meaning it processes each token vector in the sequence independently and uniformly. This indicates that within the FFN, tokens at different positions in the sequence do not directly interact with each other. Given that the self-attention mechanism is essentially a weighted summation with strong linear characteristics, the activation function within the FFN endows the model with powerful nonlinear expressive capabilities, enabling it to learn and fit more complex functions. This is also the key to the transformer’s exceptional learning ability. Meanwhile, the FFN can also conduct in-depth “contemplation” and “digestion” of each token.

Position encoding

Before the advent of transformer models, NLP tasks primarily consisted of recurrent processing methods exemplified by RNNs and long short-term memory, in which tokens are input into the model one by one. These models adopted a sequential structure, inherently incorporating the positional information of tokens within the sequence. However, this entailed numerous inherent flaws, such as difficulty in the handling of long-sequence data and the disproportionate impact of tokens positioned later in the sentence on the final results. In contrast, transformer inputs the entire token sequence into the model at once, foregoing the recurrent structure. Although this approach resolves the aforementioned issues, the model consequently loses the ability to acquire the relative and absolute positional information of each token within the sentence. To address this new challenge, integrating the sequential signals of tokens into word embeddings was proposed, which can aid the model in learning positional information in a process known as “positional encoding” (6).

Positional encoding in the transformer model serves as a crucial component to address the self-attention mechanism’s lack of sequence order awareness. Its implementation involves generating positional encodings via fixed trigonometric functions, as shown in the following formula:

$PE (pos, 2i) = \sin \frac{p o s}{10000^{\frac{2 i}{d_{m o d e l}}}}$ [5]

or

$PE (pos, 2i+1) = \cos \frac{p o s}{10000^{\frac{2 i}{d_{m o d e l}}}}$ [6]

where PE is positional encoding, denoting the position of a token within a sentence, and the positional encoding for each token is a vector (also known as an “embedding”), with i indicating the dimensional index of the positional encoding vector and $d_{model}$ the dimensionality of the positional encoding vector. Eqs. [5,6], despite appearing to be highly complex as a single expression, can be presented more intuitively as follows:

${PE}_{pos} = [\begin{array}{l} \sin (W_{1} \cdot p o s) \\ \cos (W_{1} \cdot p o s) \\ \sin (W_{2} \cdot p o s) \\ \cos (W_{2} \cdot p o s) \\ \dots \dots \dots \dots \dots \\ \sin (W_{d / 2} \cdot p o s) \\ \cos (W_{d / 2} \cdot p o s) \end{array}]$ [7]

It can be observed from $w_{i} = \frac{1}{10000^{\frac{2 i}{d_{m o d e l}}}}$ that each positional encoding vector alternates between cosine and sine functions. When the model processes high-dimensional features and as the _{$d_{model}$} (model dimension) increases, the growth of $10000^{\frac{2 i}{d_{m o d e l}}}$ becomes slower. This is because the exponent becomes smaller, leading to a decrease in the growth rate of the entire expression. The frequency is related to the rate of change of $\frac{1}{10000^{\frac{2 i}{d_{_{m o d e l}}}}}$ . When $10000^{\frac{2 i}{d_{m o d e l}}}$ grows slowly relative to changes in the position, this ratio changes more slowly, resulting in a sluggish change in frequency. The constant 10,000 is used to adjust the decay rate of the frequency. A larger value leads to more low frequency dimensions, which is suitable for long sequences; if the value is too small, there will be an excessive number of high frequencies, reducing the ability to distinguish positions.

In the transformer model, each positional encoding is unique, with a value range of [−1, 1], which effectively prevents numerical overflow in long sequences. Moreover, the positional encoding for position $p o s + k$ can be expressed as a linear transformation of the encoding for position as follows:

${PE}_{pos+k} = T_{K} * {PE}_{pos}$ [8]

where T_k is a transformation matrix determined by k. Additionally, the inner product of the positional encodings of two tokens decreases as the distance between them increases. This characteristic aligns with the linguistic pattern in which nearby tokens exhibit higher correlation than do distant ones and further reflects the property of long-range decay.

Introduction to ViT

In 2020, the Google Brain Team revolutionized the field of computer vision by introducing ViT, which successfully supplanted CNNs—exemplified by AlexNet (introduced in 2012)—in completing visual tasks. ViT is a computer vision model built upon the transformer architecture. It directly applies the transformer to image classification tasks, overcoming the framework constraints of traditional CNNs. The functional principle of ViT is the conversion of images into 1D sequence data, which are then fed into the transformer module for processing and ultimately classified.

Foundation of ViT

ViT processes image data via the transformer architecture. Its basic architecture is illustrated in the left panel of Figure 3. For medical imaging tasks, the input image is divided into nonoverlapping patches of fixed size. After each patch is flattened into a 1D vector, it is projected into a high-dimensional space via a linear layer (a fully connected layer or 1×1 convolution) and then fed into the transformer model. Since the transformer lacks built-in positional awareness, positional encoding needs to be added to each patch to preserve spatial location information. ViT achieves this by adding a learnable classification token at the beginning of the sequence. This token not only serves as a special marker but also extracts global features of the entire image during training. The transformer encoder (as shown in the right panel of Figure 3) primarily consists of an MHSA mechanism and an FFN. After positional encoding processing, the image patch vectors enter the standard transformer encoder architecture, where multilayer stacked self-attention layers capture long-range dependencies between patches. After being processed by several layers of the transformer encoder, the resulting class embedding vector integrates information from the entire input image and is ultimately sent to the classification head (typically a fully connected layer) to complete the image classification task.

Figure 3 Diagram of the ViT architecture. The image is segmented into block sequences, which are then converted into classification results through a Transformer encoder. Norm, layer normalization; MLP, multi-layer perceptron; ViT, Vision Transformer.

ViT for medical image segmentation

CNNs are frequently applied in a variety of medical image analysis scenarios, such as tumor detection (16,17), coronavirus disease 2019 (COVID-19) detection (18), skin lesion detection (19), and segmentation (20). However, constrained by their limited receptive fields, CNNs may struggle to effectively learn explicit long-range dependencies (21). In contrast, medical diagnostic systems based on the ViT can capture a broader receptive field and demonstrate exceptional performance across a diverse array of medical imaging tasks (22,23). Numerous ViT-based systems have been developed for the full spectrum of medical image modalities and include (I) classification systems (24); (II) detection systems (25); and (III) segmentation systems (26-28).

In the field of medical image segmentation, ViTs are gradually surpassing traditional CNNs and becoming a key technology for handling complex anatomical structures and high-precision boundary segmentation due to their powerful global modeling capabilities. According to the manner in which ViTs are integrated with CNNs and their positions within the segmentation network, the architectures of ViTs in medical image segmentation can be divided into two main classes: pure ViT architectures (pure ViT) and hybrid architectures (hybrid ViT).

In pure ViT architectures, for encoder applications such as UNETR (10), ViTs are used to replace traditional U-Net encoders to extract multiscale global features, which are suitable for multi-organ segmentation but entail high computational complexity. For decoder applications such as Convolution-Transformer Network for Medical Image Segmentation (ConvTransSeg) (see Gong et al. in Appendix 1), ViT decoders are employed to refine boundaries, which enhances boundary accuracy but insufficiently utilizes global information. End-to-end ViTs, such as Swin-UNet (29), utilize ViT modules in both the encoder and decoder to comprehensively model long-range dependencies, yet they have a large number of parameters and demanding hardware requirements.

Hybrid architectures, on the other hand, combine the local feature extraction capability of CNNs with the global modeling ability of ViTs. For instance, it was reported that TransUNet with an encoder hybrid achieved a Dice similarity coefficient (DSC) of 94% in computed tomography (CT) liver segmentation. Moreover, UNetFormer (30) with a decoder hybrid was found to improve the boundary recognition of irregular tumors in ultrasound images. Multi-axis Vision Transformer-U-Net (MaxViT-UNet) (see Khan et al. in Appendix 1), an end-to-end hybrid model, achieved a DSC greater than 88% in cardiac MRI segmentation, balancing efficiency and accuracy.

Architectures dedicated to addressing the challenges of scarce annotations and blurred boundaries have also been developed. In the realm of weakly supervised learning, for example, query-based MaxViT-Unet employs sketch annotations for training. It reduces computational burdens through multi-axis ViT modules and compensates for boundary information with an edge enhancement module, significantly minimizing errors in breast ultrasound segmentation. Meanwhile, a query-based decoder that dynamically refines features to enhance segmentation accuracy has also been developed. In terms of self-supervised pretraining, models such as TransPath (31) conduct pretraining on pathological images, effectively reducing reliance on annotated data and offering a novel approach for medical image segmentation.

Application of transformer-based models for medical image segmentation

A large number of segment-based transformer models have emerged, and these can be classified according to their architecture and training strategy. In the field of medical image segmentation, the majority of transformer-based methods include an encoder-decoder architecture similar to that of U-Net, while pure transformer architectures contain no CNN components. These models use tokenization and self-attention mechanisms to learn features and capture long-range dependencies, establishing connections between different embedding spaces by stacking modules in a series or through encoder-decoder configurations. The hybrid architecture comprises a CNN and transformer components, with a CNN encoder and transformer module typically being included to extract local features and model global relationships, respectively, or with the two being fused in parallel.

Pure transformer architecture

In pure transformer architecture, traditional convolutional operations are completely absent, and thus, it captures long-range dependencies between any positions in an image through the self-attention mechanism. This addresses a key problem of traditional CNNs, which, due to their local receptive fields, struggle to model the global context; therefore, pure transformers are particularly well-suited for segmenting scattered or morphologically variable lesions in medical images.

Modular architecture is commonly employed in pure transformer models, in which image modules replace pixels as the fundamental units of information, and self-attention mechanisms are implemented to determine the correlations across sets of embedded information. These models learn both local and global dependencies through feature representation and downsampling, which is followed by upsampling for pixel-wise prediction. Swin-Unet, as the first U-shaped network based solely on transformers, inherited the classic encoder-bottleneck-decoder architecture of U-Net, but all its modules are constructed with Swin Transformer blocks. By leveraging sliding window attention and symmetric transformer encoding-decoding, it has, for the first time, demonstrated the superiority of pure transformers in medical image segmentation; its architecture is illustrated in Figure 4. In the encoder, the input image is partitioned into fixed nonoverlapping blocks (patch partition) and mapped to high-dimensional vectors through linear embedding. Subsequently, three hierarchical feature extraction stages are applied, each comprising two Swin Transformer blocks that alternately apply window multihead self-attention (W-MSA) and shifted W-MSA (SW-MSA). This design confines attention computation to local windows, reducing computational complexity from O(N²) to O(N × k), where k is the window size. During feature extraction, downsampling is achieved through patch merging, which merges adjacent blocks to halve the image resolution while doubling the feature dimensionality. The bottleneck layer employs only two Swin Transformer blocks to process the lowest-resolution features, thereby avoiding convergence issues during deep network training. The decoder implements upsampling through patch expanding layers, reversing the downsampling process. Additionally, skip connections are used to concatenate encoder features with decoder features at the same scale, effectively preserving spatial details.

Figure 4 The architecture of Swin-Unet, which is composed of an encoder, bottleneck, decoder, and skip connections. Encoder, bottleneck and decoder are all constructed based on the Swin Transformer block. This figure was adapted from an open access article (32) under the terms of the Creative Commons Attribution License 4.0 (CCBY).

Jain et al. (33) proposed the OneFormer model, which is driven by textual instructions that allow for a single model to adapt to three major segmentation tasks without requiring architectural adjustments. Additionally, it only requires panoramic annotation data for joint training, thereby achieving fully unified segmentation. The Cross-Shaped Window Transformer UNet model proposed by Liu et al. (34) significantly expands the interaction range of the receptive field while reducing computational load by dividing multihead attention into two groups and computing them in parallel. This model achieved a mean DSC of 91.46% on the Automated Cardiac Diagnosis Challenge (ACDC) dataset, a mean DSC of 81.12%, and a mean Hausdorff distance (HD) of 18.86 on the Synapse dataset. The Medical Image Segmentation Transformer (MISSFormer) network architecture proposed by Huang et al. (35) effectively addresses the limitations of insufficient long-range dependency modeling and lack of local context. It achieved state-of-the-art (SOTA) performance in both Synapse (abdominal organs) and ACDC (heart) segmentation tasks, with an average Dice score of 81.96% on Synapse, surpassing that of Swin-Unet by 2.83%.

The Segmentation Transformer 3D (SegFormer3D) model proposed by Perera et al. (36) employs a pure transformer architecture, hierarchical multiscale processing, and efficient attention compression techniques, successfully addressing the issue of model bloat in 3D medical image segmentation. This model achieved SOTA-level competitiveness with extremely low resource consumption (4.5 million parameters, 17 GFLOPS). The SegFormer3D model yielded an average Dice score of 90.96% on the ACDC dataset, just 1% lower than that of the SOTA, verifying its robustness in cardiac structure segmentation. The MMAformer proposed by Ding et al. (37) is a multiscale modality-aware transformer specifically designed for multimodal medical image segmentation. Its primary objective is to enhance segmentation accuracy through efficient fusion of multimodal information, and it particularly excels in boundary recognition and small target segmentation. On the BraTS2021 dataset, this model achieved a Dice score of 88.78% for enhancing tumor (small target) segmentation and an average Dice score of 91.53%.

The TransDeepLab model proposed by Azad et al. (38) represents the first attempt to completely replace the entire architecture of DeepLabv3+ with a Transformer. This approach foregoes traditional convolutional operations to address the limitations of CNNs’ local receptive fields and the lack of low-level features in transformers. This model effectively resolves the challenge of multiscale feature fusion through the use of Swin-Transformer blocks and cross-context attention mechanisms. On the Synapse dataset, it achieved an average Dice score of 80.16, with particularly outstanding performance on the left kidney, pancreas, and stomach organs, with Dice scores of 84.08%, 61.19%, and 78.40%, respectively. On the International Skin Imaging Collaboration 2017 (skin lesion segmentation) dataset, it yielded a Dice score of 92.39%. Wu et al. (39) developed the D-Former network model, with both its encoder and decoder incorporating four D-Former blocks. It progressively reduces resolution through three downsampling layers and then restores it through three upsampling layers. This model achieved an average Dice score of 88.83% on the Synapse dataset and 92.29% on the ACDC dataset.

The pure transformer architecture demonstrates strong global modeling capabilities in medical image segmentation, yet it is associated with a number of issues: it involves high computational complexity and large memory consumption, which restricts model depth and resolution and renders it unsuitable for volumetric data. It has insufficient local feature capture ability and a lack of inherent spatial inductive bias, making it less than adequate for modelling the fine structures of organ boundaries in medical images. There is a need for massive annotated data for training and a reliance on experts for medical image annotation, resulting in high costs and scarcity, thus hindering its widespread adoption. The architecture adapts poorly to multiscale features, with significantly lower segmentation accuracy for small organs than for large ones. Currently, the key directions for the optimization of pure transformer architectures include innovation of local-global hybrid architectures and efficient attention mechanisms.

Hybrid architecture

In the U-shaped network architecture for medical image segmentation, the transformer module can be flexibly embedded into different stages, including the encoder, decoder, skip connections, and encoder-decoder interaction layers, as shown specifically in Figure 5. There are performance tradeoffs across these stages due to their functional differences. When deployed in the encoder stage (10) [see Chen et al. (2021) in Appendix 1], it can enhance contextual modeling, making it suitable for segmenting large anatomical structures and multiple organs. However, this approach consumes a significant computational resource and is prone to losing fine spatial features during tokenization. When introduced into the decoder stage (40), it aims to maintain global semantic consistency and improve the regional coherence of segmentation results, but its ability to recover details lost during the encoding phase is limited. Integrating the transformer into skip connection paths (41) can effectively bridge the semantic gap between encoder and decoder features, optimizing boundary delineation and the segmentation of small structures while keeping computational costs relatively low—although its global modeling capacity is comparatively weaker. Schemes that enable encoder-decoder interaction through cross-attention mechanisms or bottleneck structures (e.g., the convolutional transformer architecture) adopt explicit feature exchange strategies, balancing global perception and local precision. However, this design comes at the cost of increased architectural complexity. Overall, the performance of transformer-based medical image segmentation models depends not only on the inherent characteristics of the self-attention mechanism but also—more critically—on its specific functional positioning and implementation within the information flow of the U-shaped network.

Figure 5 Classification of transformer-based medical image segmentation methods. The blue layers are the transformer-based layers.

Transformers in encoder

Traditional CNNs are constrained by the local receptive field of convolutional kernels and thus are unable to adequately model dependencies between distant regions in an image. In contrast, transformers can directly compute the association weights between any two positions in the image through the self-attention mechanism, thereby capturing global semantic information. Specifically, CNNs extract low-level, high-resolution features, while transformers encode semantic-level global features, forming a “local-to-global” hierarchical representation. This enables the model to obtain richer and more discriminative semantic features during the encoding stage.

In 2021, Chen et al. proposed the TransUNet model, the first hybrid architecture that introduced transformer architecture into medical image segmentation, overcoming the limitations of traditional methods. Its structure is shown in Figure 6. In the encoder, the model uses CNN at the front end to extract high-resolution local features. After dividing the feature maps into image patches, these patches are input into the Transformer back end, where the self-attention mechanism is employed to model global dependencies. In the decoder, the model progressively upsamples the features output by the transformer and fuses CNN features from different levels through skip connections. Additionally, TransUNet optimizes the skip connections in the U-shaped network structure by adding them at one-half, one-quarter, and one-eighth resolution scales. Experiments have demonstrated that this approach significantly enhances the segmentation accuracy of small organs. The model achieved an average Dice score of 77.48% on the Synapse dataset and 89.71% on the ACDC dataset, with a notable 6.19% increase in the Dice coefficient for segmenting small organs such as the pancreas. Overall, TransUNet innovatively combines the global modeling capability of transformer with the local detail preservation advantage of U-Net. Through a hybrid encoder, a convolutional upsampling path decoder, and multiscale skip connections, it significantly improves the accuracy of medical image segmentation. Its performance surpasses that of contemporary CNN and self-attention methods, establishing it as the SOTA in medical image segmentation.

Figure 6 Overview of the framework proposed by [Chen J et al. (2021) in preprinted article reference]. (A) Transformer architecture; (B) TransUNet model architecture. This figure was reused under the terms of the Creative Commons Attribution License 4.0 (CCBY). CNN, convolutional neural network; MLP, multi-layer perceptron; MSA, multi-head self-attention.

The Swin Soft Mixture Transformer (SMT) model proposed by Płotka et al. (42) employs hierarchical Swin Transformer blocks in its encoder and replaces the traditional FFN in the second and fourth layers with soft mixture-of-experts modules, addressing the long-range dependency bottleneck of transformers. It achieved a DSC of 85.09% on the TotalSegmentator-V2 dataset. The Hierarchical Interleaved Transformer (HiFormer) model proposed by Heidari et al. (43) is a hybrid CNN-Transformer architecture specifically designed for medical image segmentation. It combines the local perception capability of CNNs with the global modeling capability of Swin Transformers and resolves feature inconsistency through double-level fusion cross-attention. It achieved SOTA performance on multiple medical segmentation tasks, with an average Dice of 80.39% and an HD of 14.70 on the Synapse dataset. The Dual Attention Transformer U-Net (DA-TransUNet) model proposed by Sun et al. (44) first extracts spatial and channel features of images through dual attention blocks and then performs global optimization via transformers. This sequential design avoids parameter redundancy in transformers and enhances feature representation. It achieved an average Dice of 79.80% and an HD of 23.48 on the Synapse dataset and an intersection over union (IoU) of 0.8251 and a Dice of 0.8947 on the Centro de Visión por Computador Clinic Database (CVC-ClinicDB) dataset. The Multiattention Transformer U-Net (MA-TransUnet) model proposed by Wang et al. (45) employs overlapping patch embedding and hierarchical transformer blocks (three layers), with each layer containing a U-shaped Transformer Main Network that combines patch merging and transformer blocks to model long-range dependencies, avoiding the complex coupling of traditional hybrid architectures. It achieved an average DSC of 81.96% and an HD of 18.20 on the Synapse dataset, along with an average DSC of 91.89% on the International Skin Imaging Collaboration (ISIC)-2018 dataset. The Dual U-shaped Cross-Modal Fusion Network (DUCFNet) model proposed by Liu et al. (46) uses U-ViT in its encoder to achieve the cross-modal fusion of text embeddings and image features, leveraging their complementary strengths. It further employs channel-aware cross fusion and contextual‑driven cross fusion modules to eliminate scale-related semantic differences and align the semantic hierarchies of the encoder and decoder.

In the Medical Transformer (MedT) model proposed by Valanarasu et al. (47), the transformer serves as the sole encoder, directly processing image patch sequences. Valanarasu et al. were the first to introduce learnable gating (gated axial attention mechanism) in medical image segmentation, employing dynamically adjusting positional encoding weights to address the challenge of training with small datasets. Simultaneously, the MedT model employs a dual-branch collaborative global–local training strategy to balance long-range dependencies and local details, without requiring pretraining. The ScribFormer model proposed by Li et al. (48) models long-range dependencies through the MHSA module, resolving the issue of missing global information in scribble annotations and breaking through the limitations of CNN’s local receptive field. On the ACDC dataset, it achieved a DSC of 88.8%. The Dual-Stream Transformer (DS-Former) model proposed by Zhang et al. (49) integrates the 3D Swin Transformer in the encoder to extract global features and convolutional layers and thus local features in parallel, achieving deep interaction through self-attention mechanisms. During feature fusion, SW-MSA is used again to integrate dual-stream features, reducing computational complexity from O(n²) to O(n). On the UK Biobank (UKBB) and Beyond the Cranial Vault (BTCV) datasets, it achieved average Dice scores of 89.87% and 81.41%, respectively. The MedFuseNet model proposed by Chen et al. (50) adopted a dual-branch encoder consisting of a CNN and Swin Transformer, fusing features through cross-attention. Adaptive cross-attention modules are embedded in skip connections, and the decoder incorporates squeeze-and-excitation (SE) attention. On the Synapse dataset, it achieved an HD of 18.44 and a DSC of 58.16% for pancreas segmentation. The Transformer Residual U-Net (TransResU-Net) model proposed by Tomar et al. (51) includes a pretrained Residual Net 50 (ResNet-50) as the backbone network, extracting multilevel features through residual blocks. These features are then input into transformer encoder blocks to model global contextual dependencies, addressing the issue of small-target feature loss. On the Kvasir Segmentation Dataset (Kvasir-SEG) dataset, it achieved a DSC of 88.84% and a mean IoU of 82.14%.

When transformer is applied to the encoder, its self-attention mechanism can globally compute the association weights among pixel points and capture long-range dependencies, making it particularly suitable for medical images with complex structures and blurred boundaries. However, transformer has a large number of parameters and requires a lightweight design to meet real-time clinical demands; moreover, for the processing of high-resolution images, its computational complexity is high, necessitating the optimization of the attention mechanism.

Transformers in decoder

The transformer decoders use the self-attention mechanism to integrate multiscale features from the encoder, dynamically adjusting regional feature weights to enhance boundary accuracy; thus, it can—for instance—precisely distinguish the blurred boundaries between tumors and normal tissues in brain tumor MRI scans. The local details captured by the CNN encoder are combined with the global context provided by the transformer decoder, improving the recognition capability for complex structures. The self-attention mechanism of the transformer can adaptively learn multiscale dependencies, enabling flexible handling of lesions with varying morphologies. During decoding, the transformer dynamically fuses features from different hierarchical levels of the encoder through the cross- attention mechanism, avoiding the feature redundancy issues associated with traditional U-Net skip connections.

The UNetFormer model proposed by Wang et al. (30) employs a lightweight ResNet18 as the encoder to extract multiscale local features from input images while balancing efficiency and performance. The decoder, serving as the core part of the model, is responsible for global context modeling and feature refinement. It consists of three global-local transformer blocks (GLTBs) and one feature refinement head (FRH). The decoder receives multiscale features from the encoder, captures global dependencies through GLTBs, and then optimizes the output via the FRH. The key advantage of this model lies in its adoption of an efficient global-local attention mechanism, which replaces the computationally intensive self-attention mechanism of traditional transformers. The model structure is illustrated in Figure 7.

Figure 7 UNetFormer model architecture.

Tayeb et al. (52) introduced the OmniBlock module in the UNestFormer model. This module is exclusively used in the decoder and, innovatively, integrates four types of attention mechanisms: global channel attention, global spatial attention, local channel attention, and local spatial attention. With the traditional convolutional blocks being replaced with OmniBlock, a densely nested Transformer skipping pathway is constructed, achieving multiscale feature aggregation through weighted summation. The model achieved a Dice coefficient of 93.42% on the ACDC dataset, with an average DSC of 85.74% and an HD of 13.25 on the Synapse dataset. The UCTNet (UNet + Convolution + Transformer Network) proposed by Guo et al. (53) redefines the complementary paradigm between CNN and transformer through an uncertainty-guided mechanism. The transformer is integrated into the Uncertainty-Guided CNN-Transformer Block (UCTBlock module) in the decoder, performing self-attention calculations only on high-uncertainty regions to establish global dependencies. This model achieved a DSC of 89.44% on the Synapse dataset and 92.91% on the ACDC dataset. ConvTransSeg is an asymmetric hybrid architecture proposed by Gong et al. (see Gong et al. in Appendix 1), featuring a pure transformer multilevel decoder in which each level contains three transformer blocks connected across different resolution levels via linear layers. It achieved an average Dice score of 86.7% on the ISIC skin dataset. The hybrid architecture combining UNet and Convolutional Neural Network Extreme-Tiny (ConvNeXt Tiny) was proposed by Kamsari et al. (54), and when it was evaluated on the public Breast Ultrasound Images (BUSI) dataset, it was confirmed that the attention mechanism could effectively focus on tumor regions and reduce false positives (FPs). Notably, when the SE block was integrated with attention gates, a significant performance improvement was observed in breast tumor segmentation, with precision increasing by approximately 15%.

Network models that place the transformer in the decoder struggle to adapt to tasks with limited data and scarce annotations. Their inference speed may also be constrained, making them unable to meet extremely high real-time requirements. Additionally, these models are mired by bottlenecks in computational efficiency and video memory usage. In the future, a more lightweight design could be pursued, or unlabeled medical data could be used to pretrain the transformer decoder, thereby alleviating the reliance on annotations.

Transformers in the encoder-decoder architecture

Applying transformers to both the encoder and decoder represents a design aimed at comprehensively fusing global context and optimizing dynamic features. The encoder captures long-range dependencies and global semantics from the input image, while the decoder leverages global context again during detail recovery, ensuring consistency between high-level semantics and low-level details throughout the entire process. Meanwhile, the transformers at both ends can process features from different modalities, enabling continuous information fusion and enhancing adaptability.

MaxViT-UNet network model proposed by Khan et al. (see Khan et al. in Appendix 1) adopts a classic encoder-bottleneck-decoder structure and achieves multiscale feature fusion through skip connections (see Figure 8 for a schematic of the structure). In the encoder, initial downsampling is first performed via a stem layer and then via four encoding stages, each containing two MaxViT blocks. The MaxViT blocks are primarily composed of mobile inverted bottleneck convolution, W-MSA, and grid MSA, which are designed to fuse local and global information while suppressing noise. In the decoder, each stage sequentially performs upsampling, concatenation, and MaxViT refinement. MAXViT-UNet is a medical image segmentation model with a hybrid CNN-Transformer architecture, in which both the encoder and decoder use MaxViT blocks to achieve end-to-end local-global feature fusion.

Figure 8 Architecture of the MaxViT-Unet. This figure was adapted from an open access article (32) under the terms of the Creative Commons Attribution License 4.0 (CCBY). CNN, convolutional neural network; MaxViT-Unet, Multi-axis Vision Transformer-U-Net.

The No New Network Transformer (nnFormer) model proposed by Zhou et al. (55) employs local voxel self-attention in the encoder to model three-dimensional (3D) local dependencies, which overcomes the limitations of traditional 2D approaches. In the decoder, it replaces conventional concatenation and summation operations with skip attention to enhance encoder-decoder feature fusion. This model achieved an average DSC of 86.83% and an HD of 10.63 on the Synapse dataset. The Boundary-Attentive Residual U-Net++ (BRAU-Net++) proposed by Lan et al. (56) employs a transformer mechanism based on Dynamic Sparse Transformer in the BiFormer blocks of both the encoder and decoder to model long-range dependencies. This model achieved a DSC of 82.47% and an HD of 19.07 on the Synapse dataset. The Global Context ViT (GC-ViT) block primarily harmonizes local and global attention and enables dynamic downsampling and upsampling during encoding and decoding. The Global Context U-Net (GCtx-UNet) achieved a Dice coefficient of 82.39% on the Synapse dataset. The Multiscale Transformer U-Net (MaS-TransUNet) model proposed by Upadhyay et al. (57) uses transformers to replace traditional convolutions in both the encoder and decoder. The encoder constructs multiscale features using hierarchical Swin Transformer blocks, and the decoder, also based on Swin Transformer blocks, recovers resolution through upsampling and incorporates skip connections to fuse high-resolution features. This model attained a Dice coefficient of 84.1% on the COVID-19 lung CT dataset.

The Edge-Guided Transformer U-Net (EG-TransUNet) proposed by Pan et al. (58) introduces transformers at multiple stages: In the encoder, the position-enhanced module addresses the issue of variable lesion sizes, while the channel-spatial attention module jointly optimizes the channel and spatial dimensions to enhance lesion localization accuracy. In the decoder, the semantic gap attention module reduces the semantic gap through task-aware channel selection. This model achieved a mean Dice score of 93.44% on the CVC-ClinicDB dataset. The Dual-Scale Transformer U-Net (DS-TransUNet) proposed by Lin et al. (59) includes a dual-branch Swin Transformer U-shaped architecture: In the encoder, Swin Transformer blocks of two different scales are used, while in the decoder, each upsampling stage integrates Swin Transformer blocks to replace traditional convolutions. This model achieved a mean IoU of 89.4% on the CVC-ClinicDB dataset. The 3D TransUNet was proposed by Chen et al. [see Chen et al. (2023) in Appendix 1]. This model integrates transformer encoders and decoders, with both global dependencies and small-target optimization being taken into account. The encoder applies ViT to process CNN features for extracting global dependencies, while the decoder optimizes segmentation through query vectors and cross-attention mechanisms. This model achieved a Dice coefficient of 88.11% on the BraTS2021 dataset. The GCtx-UNet proposed by Alrfou et al. (60) is a lightweight U-shaped architecture based on GC-ViT, with both its encoder and decoder consisting of GC-ViT modules. Dual Sparse Selection Attention U-Net (DSSAU-Net) (61) proposed by Xia et al. was specifically designed for fetal head and pubic symphysis segmentation tasks. Its encoder and decoder are both built on the dual sparse selection attention (DSSA) mechanism, with the DSSA blocks explicitly performing region-level and pixel-level sparse token selection to reduce computational complexity. This model achieved a DSC of 85.35% and an HD of 37.38 on the Medical Image Computing and Computer-Assisted Intervention International Ultrasound Grand Challenge 2024 (MICCAI IUGC 2024) test set.

When using transformer architecture in both the encoder and decoder simultaneously, they both need to compute self-attention, with the complexity growing quadratically with the input size. Moreover, the technical implementation of such models is complex, requiring the design of positional encoding, multiscale interaction, and fusion mechanisms at both ends, which increases development difficulty and tuning costs.

Transformer in between the encoder and decoder

Applying transformers to non-encoder-decoder structures such as skip connections, bottleneck layers, and feature fusion modules allows for the balancing of global and local information in the processing of intermediate layers, while introducing global dependencies without compromising the focused functions of the encoder and decoder. Through skip connections, transformers can align low-level textures with high-level semantics in advance, reducing the semantic gap. When placed in bottleneck layers, they can significantly reduce attention computation, saving video memory and computational resources. When integrated into feature fusion modules, transformers can adaptively assign weights to multi-scale features, enhancing fusion performance.

Dual cross-attention (DCA), first proposed by Ates et al. (62), is a straightforward but powerful attention module. It can optimize the skip connections in U-Net-based medical image segmentation architectures; its structure is illustrated in Figure 9. DCA bridges the semantic gap between the encoder and decoder by modeling the channel and spatial relationships of multiscale encoder features. The process is as follows: First, channel cross-attention and spatial cross-attention modules model channel and spatial relationships. Second, the encoder features are upsampled to align them with the corresponding layers of the decoder.

Figure 9 Architecture of the DCA net, with the ViT in between the encoder and decoder. This figure was adapted from an open access article (32) under the terms of the Creative Commons Attribution License 4.0 (CCBY). CNN, convolutional neural network; DCA, dual cross-attention; ViT, Vision Transformer.

The TransGUNet model proposed by Nam et al. (see Nam et al. in Appendix 1) employs a pure CNN in its encoder for feature extraction but lacks a transformer. However, the model introduces an Adaptive Computation Structure Graph Neural Network (ACS-GNN) module in the skip connections to convert cross-scale features into a graph structure and model anatomical relationships through node attention. This model achieved an average DSC of 80.9% on the Synapse dataset and 94.4% on the CVC-ClinicDB dataset. The Ultrasound Computed Tomography U-Net (USCT-UNet) model proposed by Xie et al. (63) applies a U-shaped Skip Connection (USC) module to process skip connections and perform feature embedding and multiscale global modeling on the encoder’s output, thereby alleviating the semantic gap issue in U-Net. On the ISIC-2018 dataset, this model attained a Dice score of 88.66%, a mean IoU of 81.62%, and an HD of 15.25. The Transformer-based Enhanced Fusion Network (TranSEFusionNet) model proposed by Zhang et al. (64) replaces the traditional concatenation with the Spatial Feature Module in the skip connections to reduce semantic discrepancies, achieving a balance between global and local features. It achieved a mean Dice score of 86.48% on key metrics in the CVC-ClinicDB dataset. The CIS-UNet model proposed by Imran et al. (65) inputs high-dimensional features into an improved Swin Transformer (CSW-SA module) at the bottleneck layer, balancing global dependency modeling and computational efficiency. The Hierarchical Attention-Guided U-Net (HAU-Net) model proposed by Zhang et al. (66) consists of a hybrid architecture, embedding a transformer in the skip connections of U-Net to integrate global contextual information. It retains the advantages of CNN in local detail extraction while incorporating global semantics, addressing key challenges in BUSI segmentation. This model achieved a Dice score of 83.11% on the publicly available BUSI dataset.

The Shifted Window Transformer U-Net (SWTR-Unet) model proposed by Hille et al. (67) embeds 12 Swin transformer modules in the bottleneck layer to model global dependencies. The transformer is solely used to connect the bottleneck Explained ConvNet in Table S1 between the encoder and decoder, collaborating with the CNN encoder and convolutional decoder. The Boundary-Aware Transformer (BATForme) model proposed by Lin et al. (26) addresses the issues of insufficient global dependencies and boundary distortion in traditional methods by embedding a Cross-Scale Global Transformer and a Boundary-Aware Local Transformer into high- and midlevel features. It achieved Dice scores of 92.84% and 90.76% on the ACDC and ISIC-2018 datasets, respectively. The ViT-V-Net model proposed by Chen et al. (68) was specifically designed for unsupervised volumetric medical image registration. The ViT, positioned after the encoder, learns long-distance spatial dependencies, overcoming the limitations of the ConvNet’s local receptive fields and enhancing registration accuracy. The Adaptive Token Transformer U-Net (ATTransUNet) model proposed by Li et al. (69) is based on the ViT architecture integrated into skip connections and uses adaptive tokens. The encoder employs an Adaptive Token Extraction Module to extract the most discriminative visual tokens, reducing complexity and enhancing performance. The decoder, on the other hand, adopts a Selective Feature Reinforcement Module to focus on the most contributive features. The Transformer Attention U-Net (TransAttUnet) model proposed by Chen et al. (70) uses a self-aware attention (SAA) module as a bridge between the encoder and decoder. It processes high-level feature maps through the SAA while capturing long-range dependencies via an MHSA mechanism and achieved a DSC of 90.74% on the ISIC-2018 dataset. The Swin Transformer Module U-Net (STM-UNet) model proposed by Shi et al. (71) employs Swin Transformer residual blocks in skip connections to connect corresponding levels of the encoder and decoder. The transformer is solely used in skip connections to introduce global modeling capabilities, combining the advantages of CNN and a multilayer perceptron.

Although keeping the transformer independent of the encoder or decoder has emerged as a promising research direction, this approach prevents the transformer from performing feature extraction at the encoder stage, leading to a delayed introduction of global information. Moreover, if global modeling is conducted solely in intermediate or cross-connection parts, an information “discontinuity” may arise between the encoder and decoder, thereby constraining the transformer’s potential.

According to a review of this aforementioned research, we believe that the root causes of the performance enhancement in hybrid models can be summarized into three primary aspects.

First, the skip connection in U-Net represents an early attempt to bridge the feature gap between the encoder and decoder. The new-generation hybrid architecture (72) further reveals that such feature gaps also exist between different levels within the encoder. Low-level features are rich in details but lack semantic information, whereas high-level features exhibit the opposite characteristics. By leveraging mechanisms such as DCA transformer, the model can actively fuse features from different levels to construct hybrid features that combine rich details with strong semantic information, which is crucial for performance improvement. Second, precise segmentation of medical images requires both a global understanding of organ structures and the delineation of local boundaries. CNNs excel at capturing local features but fall short in global comprehension, while standard transformers perform well in global modeling but entail high computational costs and are prone to overlooking local details. The GLTB in hybrid architecture (72) includes either parallel or sequential approaches, enabling the transformer branch to capture long-range dependencies and the CNN branch to focus on local features. The subsequent fusion of these features provides comprehensive contextual information, making the model’s decisions more reliable. Third, general-purpose architectures have limitations in handling specific tasks, whereas hybrid models can be tailored to address particular challenges. For instance, the “coarse-to-fine” two-stage design of Tailored Two-Stage Hybrid Network (TTH-Net) (73) is well-suited for vein segmentation. This tailored design enables hybrid models to outperform general-purpose models in their respective domains.

Evaluation metrics

In the field of medical image segmentation, evaluation metrics are used to quantify the similarity and error between the model’s segmentation results and the ground truth. Assessing algorithm performance requires a comprehensive consideration of multiple metrics, and appropriate metric combinations should be selected based on task requirements (such as the size of the target region and the complexity of its shape) to comprehensively reflect the regional overlap, boundary accuracy, and volumetric consistency of the segmentation results (74). This section briefly introduces and describes the different classes of mainstream evaluation metrics.

Regional overlap

In medical image segmentation, regional overlap-based metrics assess model performance by quantifying the degree of overlap between the predicted segmentation region and the ground truth. Among them, the DSC and IoU are the principal metrics. Both are based on the principle of set similarity, yet they differ in computational methods and application scenarios.

DSC

DSC is particularly common in segmentation tasks. It is used to measure the similarity between two sets, with a value range of 0 to 1. The larger the value is, the more similar the two sets are. It is commonly employed to calculate the similarity of closed regions. The formula is as follows:

$DSC = \frac{2 \times | A \cap B |}{| A | + | B |}$ [9]

where A is the predicted region; B is the true region; $| A \cap B |$ is the number of pixels in the intersection of the predicted and true regions; and $| A |$ and $| B |$ are the total number of pixels in the predicted and true regions, respectively. The metric takes values in the range of 0 to 1, with higher values indicating a greater degree of overlap. The DSC is highly sensitive to class imbalance. Even when the proportion of foreground pixels is low, DSC can still effectively reflect the quality of segmentation because the denominator is directly related to the size of the target region, effectively preventing the background from dominating the evaluation. Meanwhile, when the overlapping region decreases, DSC declines linearly. However, when DSC approaches 0 or 1, it encounters the issue of gradient saturation, which results in slow convergence during the later stages of training.

IOU

The IOU, also known as the “Jaccard Index”, is one of the most commonly used metrics in semantic segmentation. The IoU calculates the ratio of the intersection area between the predicted region and the true region to their union area as follows:

$IoU = \frac{| A \cap B |}{| A \cup B |}$ [10]

This is equivalent to the following:

$IoU = \frac{TP}{TP + FP + FN}$ [11]

where the definitions of A and B are the same as those in Eq. [9], while TP, FP, and FN are the numbers of true positive (TP), FP, and false negative (FN) pixels, respectively. The value range of IoU also lies within [0, 1], with values closer to 1 indicating a higher degree of overlap. IoU and DSC (can be mathematically interconverted as follows:

$DSC = \frac{2 \times IoU}{1 + IoU}$ [12]

$IoU = \frac{DSC}{2 - DSC}$ [13]

The IoU is more intuitively defined and easier to understand than is the DSC, and its indication of overlap discrepancies is also more pronounced. However, in small-object tasks, IoU is more susceptible to the influence of the union, resulting in a significant decline in its values. Similar to the DSC, the IoU cannot directly reflect minor boundary errors. A comparison between the IoU and DSC is provided in Table 2.

Table 2

Comparison between the IoU and DSC

Feature	DSC	IoU
Definition	$DSC = \frac{2 \times \| A \cap B \|}{\| A \| + \| B \|}$	$IoU = \frac{\| A \cap B \|}{\| A \cup B \|}$
Range of values	0–1	0–1
Small-target stability	More stable, less affected by the total number of pixels	Easy to oscillate, reliance on absolute overlap area
Sensitivity	Smooth gradient descent and balanced penalty equilibrium	Most influenced by highly overlapping regions and less by low-overlap areas
Clinical application scenarios	Tumor segmentation and small-organ recognition	Measurement of organ transplant volume and boundary-sensitive structures

A, predicted region; B, true region; DSC, Dice similarity coefficient; IoU, intersection over union.

As core metrics of regional overlap, DSC and IoU quantify segmentation quality through the calculation of the proportion of overlapping regions and the ratio to the union area, respectively. The DSC is more suitable for scenarios involving small objects and class imbalance, providing a stable evaluation. In contrast, the IoU is more sensitive to high-precision boundary optimization but requires caution regarding its instability with small objects. In practice, it is advisable to select metrics based on clinical objectives and refer to domain-specific standards.

Pixel-level classification index

In medical image segmentation, pixel-level classification metrics based on the confusion matrix are primarily employed to quantify the model’s accuracy in classifying each pixel. These metrics assess performance by analyzing the statistical relationships among TPs, true negatives (TNs), FPs, and FNs.

Precision

Precision is used to measure the proportion of TPs cases among all instances predicted as positive. Precision reflects the model’s false detection performance, with a high precision indicating fewer FPs and a lower risk of misdiagnosis. It is particularly suitable for tumor segmentation tasks, for which reducing FPs is crucial to preventing overtreatment. The formula for precision is as follows:

$Precision = \frac{TP}{TP + FP}$ [14]

Recall

Recall is the proportion of pixels that are actually positive and have been correctly identified, reflecting the model’s ability to control FNs. It is used to measure the capability of detecting positive cases. In medical practice, a high recall indicates a low miss rate and thus that nearly all lesions can be detected. The formula for recall is as follows:

$Recall = \frac{TP}{TP + FN}$ [15]

Specificity

Specificity is the proportion of all TNs that are correctly predicted as negatives. It measures the model’s ability to accurately identify backgrounds or nontarget regions. In medical practice, a high specificity indicates a low probability of falsely reporting background areas as lesions. In medical images where the background occupies a significantly large proportion, specificity can effectively reflect the model’s ability to recognize the background. It can also complement sensitivity to avoid an excessively high proportion of FPs. However, when used alone, it may mask issues with missed detections (FNs). The formula for specificity is as follows:

$Specificity = \frac{TN}{TN + FP}$ [16]

F1 score

The F1 score is the harmonic mean of precision and recall and is used to balance the two metrics. It primarily provides a comprehensive reflection of both detection capability and the accuracy of positive predictions. This parameter is particularly crucial in tumor or lesion segmentation tasks. The F1 score considers both precision and recall simultaneously, making it a more reasonable metric for class-imbalanced tasks. However, it does not account for the correctness of background classification and does not directly reflect boundary quality. The formula for the F1 score is as follows:

$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$ [17]

Boundary-based metrics

Boundary-based metrics assess model performance by quantifying the contour differences between segmentation results and ground truth annotations, with a particular focus on geometric consistency along the edges. The HD and average surface distance (ASD) are metrics that measure boundary matching accuracy from the perspectives of extreme error and average error, respectively. In the evaluation of shape matching for structures such as tumors and organs, these metrics demonstrate greater sensitivity than do IoU and DSC.

HD

The HD is a measure that describes the degree of similarity between two sets of points, serving as one defined form of the distance between them. Suppose there are two sets, $A = {a_{1}, \dots, a_{p}}$ and $B = {b_{1}, \dots, b_{p}}$ . The HD between these two-point sets is defined as follows:

$H (A,B) = max (h (A,B),h (B,A))$ [18]

$h (A,B) = {max}_{(a \in A)} {{min}_{(b \in B)} ‖ a - b ‖}$ [19]

$h (B,A) = {max}_{(b \in B)} {{min}_{(a \in A)} ‖ b - a ‖}$ [20]

Eq. [18] expresses the bidirectional HD, which represents the most fundamental form of the HD. In Eqs. [18,19], h(A, B) and h(B, A) are the one-way HD from set A to set B and from set B to set A, respectively. Specifically, h(A, B) is calculated by first determining the distance $‖ a_{i} - b_{j} ‖$ between each point a in set A and its nearest point b in set B, sorting these distances, and finally obtaining the maximum value among them as the value of h(A, B). The calculation of h(B, A) follows the same principle. From Eq. [18], it can be seen that the bidirectional HD H(A, B) is the larger of the two one-way distances, h(A, B) and h(B, A), and it measures the maximum degree of mismatch between the two point sets.

Average surface distance

The ASD measures the average distance between two segmentation boundaries, representing the overall mean level of boundary deviation. In obtaining the ASD, the average nearest distance from the predicted boundary to the ground-truth boundary is calculated, which is followed by the calculation of the average nearest distance from the ground-truth boundary to the predicted boundary; finally, these two values are averaged, and this provides an assessment of the overall matching degree. The advantage of the ASD lies in its sensitivity to overall boundary deviations, as it is not dominated by extreme outliers and can reflect the general consistency of the segmentation shape. However, its drawback is that it ignores the maximum deviation, potentially masking significant local errors. The formula for the ASD is as follows:

$ASD (S,G) = \frac{\sum_{s \in S} {min}_{g \in G} d (s,g) + \sum_{g \in G} {min}_{s \in S} d (g,s)}{| S | + | G |}$ [21]

where |S| and |G| are the number of points on the predicted boundary and the ground-truth boundary, respectively.

A comparison between the HD and ASD is provided in Table 3.

Table 3

Comparison between Hausdorff distance and average surface distance

Feature	Hausdorff distance	Average surface distance
Core meaning	Maximum local error	Average overall error
Noise sensitivity	High	High
Computational complexity	High	Moderate
Applicable scenarios	Segmentation of fine structures such as blood vessels/nerves	Organ volume measurement, chronic lesion tracking

Datasets

Medical imaging datasets are key drivers for the implementation of artificial intelligence in healthcare, being applied across medical research contexts, from basic research to clinical decision-making. Compared with that of ordinary image datasets, the annotation of medical images requires a significant amount of labor from clinicians with professional expertise. Early pathological image data often exhibited small-scale characteristics. Medical datasets are typically provided by hospitals, research institutions, or international competition platforms and comprise raw images (from MRI, CT, ultrasound, X-ray, etc.) and annotation files (for lesion areas, organ contours, segmentation masks, diagnostic labels, etc.). Researchers need to select data modalities and annotation quality based on task requirements and stay attuned to newly updated resources. Table 4 lists several common types of medical imaging modalities, while Table 5 presents the frequently used public datasets in multiple popular medical image segmentation tasks.

Table 4

Common medical imaging modalities

Modality	Characteristics	Common tasks
Computed tomography	High spatial resolution and gray value related to tissue density	Pulmonary nodule detection, organ segmentation, tumor segmentation
Magnetic resonance imaging	High soft tissue contrast and multiple sequences	Brain structure segmentation and tumor detection
Positron emission tomography	Reflects metabolic activity	Cancer diagnosis and functional analysis
X-ray	Low cost and fast acquisition but overlapping information	Fracture detection and diagnosis of lung diseases
Ultrasound	Real-time imaging and no radiation, but noisy	Obstetric monitoring and cardiac function analysis
Histopathology	Ultrahigh resolution microscopic imaging	Cancer cell detection and tissue classification

Table 5

Medical image dataset

Dataset	Year	Dimension	Modal	Task	Definition
Synapse [Chen et al. (2021) in preprinted article reference]	2015	3D	CT	Multiple organ segmentation	Synapse Multi-Organ Segmentation Dataset
GLAS (41)	2016	2D	Pathology	Gland	Gland Segmentation Challenge
BTCV (49)	2015	3D	CT	Abdominal segmentation	Beyond the Cranial Vault
BraTS21 (75)	2021	3D	MRI	Brain tumor segmentation	Brain Tumor Segmentation 2021
MURA (76)	2018	2D	X-ray	Bone abnormality detection	Musculoskeletal Radiographs
KiTS19 (77)	2019	3D	CT	Kidney and renal tumor segmentation	Kidney Tumor Segmentation 2019 Challenge
CheXpert (78)	2019	2D	X-ray	Multilabel classification of chest diseases	Chest X-Ray Expert-Labeled Dataset
SegTHOR (79)	2019	3D	CT	Thorax segmentation	Segmentation of Thoracic Organs at Risk in CT
IST-3 CT Head Scans (80)	2024	3D	CT	Automatic quantification and treatment of stroke risk	Third International Stroke Trial Computed Tomography Head Scans
LiTS (Liao et al. in preprinted article reference)	2019	3D	X-ray	Liver and tumor segmentation	Liver Tumor Segmentation Challenge
ISIC2020 (81)	2020	2D	Dermoscope	Skin lesion segmentation of melanoma	International Skin Imaging Collaboration 2020
BreakHis (82)	2016	2D	Pathology	Classification of benign or malignant breast tumors	Breast Cancer Histopathological Image Dataset
ACDC (83)	2018	3D	MRI	Cardiac structure segmentation	Automated Cardiac Diagnosis Challenge
LIDC-IDRI (84)	2018	3D	CT	Pulmonary nodule segmentation	Lung Image Database Consortium and Image Database Resource Initiative
MoNuSeg (85)	2017	2D	Pathology	Nuclear segmentation of multiple organs	Multi-Organ Nuclei Segmentation Dataset
DSB18 (86)	2018	2D	Microscope	Nuclear segmentation	Data Science Bowl 2018 (Nuclei Segmentation Challenge)
DDR (87)	2019	2D	Fundus image	Breast cancer cells segmentation	Diabetic Retinopathy Dataset
MSD (88)	2018	3D	mpMR	Glioma of the head segmentation	Medical Segmentation Decathlon
Mindboggle (89)	2016	3D	MRI	Brain segmentation	Mindboggle Brain Imaging Dataset
CVC-ClinicDB (90)	2015	2D	Endoscope	Abdominal colorectal segmentation	Colonoscopy Video Challenge – Clinic Database
Colorectal-Liver-Metastases (91)	2023	3D	CT	Liver tumor segmentation	Colorectal Liver Metastases Dataset
FUGC (92)	2024	2D	Ultrasound	Abdominal cervix uteri segmentation	Fundus Glaucoma Challenge Dataset
OASIS (93)	2007	3D	MRI	Diagnostic classification of Alzheimer disease	Open Access Series of Imaging Studies
SZ-CXR (94)	2018	2D	X-ray	Pulmonary tuberculosis segmentation	Shenzhen Chest X-ray Dataset

2D, two-dimensional; 3D, three-dimensional; CT, computed tomography; MRI, magnetic resonance imaging.

Quantitative analysis

To compare the performance of medical image segmentation algorithms in a scientific and objective manner, it is usually necessary to rely on a series of quantitative metrics. These metrics can demonstrate the algorithm’s ability to characterize the target morphology, boundary location, and volume size from different dimensions. This section describes quantitative analysis applied to widely recognized datasets from each of the common medical imaging modalities—CT and MRI. The Synapse Multi-Organ Segmentation Dataset is used in the CT field, provides a series of valuable annotations for multi-organ segmentation tasks, and can be used to assess an algorithm’s performance. Meanwhile, the ACDC is a popular representative dataset for MRI and is valuable for quantitative analysis.

We selected the Synapse dataset and the ACDC dataset because they jointly provide data with diverse shapes, sizes, and tissue characteristics, extensively covering a wide range of real-world scenarios. The dataset offers pixel-level segmentation annotations for eight abdominal organs: the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. These eight organs exhibit significant size variations. For example, the liver is large with clear boundaries, while the pancreas is small with vague boundaries, making its segmentation quite challenging. In experiments using the Synapse dataset, the DSC and HD metrics are commonly employed as evaluation criteria. The ACDC dataset is an authoritative dataset in the field of cardiac MRI segmentation and diagnosis, often used to validate the generalization performance of models in complex scenarios such as temporal variability, interpatient differences, and cardiac motion. The ACDC dataset includes cardiac MRI scans from 150 cases. Each case contains 10–15 2D short-axis slices, with each slice having 20–30 time frames. This dataset provides pixel-level segmentation annotations for three key cardiac anatomical structures: the left ventricular cavity, the right ventricular cavity, and the myocardium. Tables 6,7 present the experimental results of all the models mentioned in the text on the Synapse dataset and the ACDC dataset, with some additional model data included. As seen in Table 6, the UCTNet network model performed best in terms of the DSC metric, reaching as high as 89.44%. This model embeds the transformer into the decoder, replacing the standard convolutional decoding module, and introduces the Uncertainty-Guided ViT (UgViT) module, which entrusts the processing of uncertain regions to the transformer, thereby clarifying the role division between the CNN and transformer. Following closely was the D-Former network model with a DSC of 88.83%, a pure transformer model specifically designed for 3D medical images. It achieved breakthroughs in organ-level segmentation of the gallbladder, left kidney, and liver. Its gallbladder segmentation results were 9% higher than those of nnFormer, its left kidney segmentation was 7.39% higher than that of MISSFormer, and its liver segmentation reached an impressive 96.99%, approaching perfect segmentation. In terms of the HD metric, Parallelly Aggregating Global and Local Representations Transformer (PHTrans) stood out with an HD value of 8.68. According to the experimental results, these networks all demonstrated excellent performance, fully confirming their effectiveness in their respective tasks.

Table 6

Network performances on the Synapse dataset reported in Dice score (%)

Networks	DSC	HD	Aorta	Gallbladder	Left kidney	Right kidney	Liver	Pancreas	Spleen	Stomach
TransUNet [Chen et al. (2021) in Preprinted article reference]	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
Swin-Unet (29)	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
CSWin-UNet (34)	81.12	18.86	87.13	67.85	83.51	78.53	95.23	65.94	89.05	81.74
MISSFormer (35)	81.96	18.20	86.99	68.65	85.21	82.00	94.41	65.67	91.92	80.81
SegFormer3D (36)	82.15	–	90.43	55.26	86.53	86.13	95.68	73.06	89.02	81.12
TransDeepLab (38)	80.16	21.25	86.04	69.16	84.08	79.88	93.53	61.19	89.00	78.40
D-Former (39)	88.83	–	92.12	80.09	92.60	91.91	96.99	76.67	93.78	86.44
HiFormer (43)	80.69	19.14	87.03	68.61	84.23	78.37	94.07	60.77	90.44	82.03
DA-TransUNet (44)	79.80	23.48	86.54	65.27	81.70	80.45	94.57	61.62	88.53	79.73
MA-TransUnet (45)	85.26	10.86	88.44	72.38	88.54	85.38	95.65	71.87	93.48	86.35
MedFuseNet (50)	78.40	18.44	85.71	62.58	85.10	77.76	93.98	58.16	75.33	78.74
UNestFormer (52)	85.74	13.25	90.02	73.21	89.95	88.58	95.93	70.15	92.75	85.34
UCTNet (53)	89.44	–	92.86	83.74	85.95	90.97	97.06	84.51	93.28	87.12
nnFormer (55)	86.57	10.63	92.07	70.17	86.55	86.25	96.84	83.35	90.51	86.83
BRAU-Net++ (56)	82.47	19.07	87.95	69.10	87.13	81.53	94.71	65.17	91.89	82.26
3D TransUNet [Chen et al. (2023) in preprinted article reference]	88.11	–	92.67	81.66	85.29	87.76	97.34	82.69	91.90	85.59
GCtx-UNet (60)	82.39	15.94	86.30	69.32	86.11	81.89	94.64	64.88	91.81	84.15
TransGUNe (Nam et al. in preprinted article reference)	80.90	–	86.66	61.08	87.07	83.83	95.30	59.73	89.48	84.68
TransClaw-Unet (95)	78.09	26.38	85.87	61.38	84.83	79.36	94.28	57.65	87.74	73.55
PHTrans (96)	88.55	8.68	92.54	80.89	85.25	91.30	97.04	83.42	91.20	86.75

3D, three-dimensional; BRAU-Net++, Boundary-Attentive Residual U-Net++; DA-TransUNet, Dual Attention Transformer U-Net; DSC, Dice similarity coefficient; GCts-UNet, Global Context U-Net; HD, Hausdorff distance; HiFormer, Hierarchical Interleaved Transformer; MA-TransUnet, Multiattention Transformer; MISSFormer, Medical Image Segmentation Transformer; nnFormer, No New Network Transformer; PHTrans, Parallelly Aggregating Global and Local Representations Transformer; SegFormer3D, Segmentation Transformer 3D; Synapse, Synapse Multi-Organ Segmentation Dataset; UCTNet, UNet + Convolution + Transformer Network; UNestFormer, Enhancing Decoders and Skip Connections with Nested Transformers.

Table 7

Performance of different network models on ACDC dataset (%)

Network	DSC	RV	Myo	LV
UNETR (10)	87.15	84.52	84.36	92.57
TransUNet [Chen et al. (2021) in preprinted article reference]	89.71	88.86	84.53	95.73
BATFormer (26)	92.84	91.97	90.26	96.30
Swin-Unet (29)	90.00	88.55	85.62	95.83
CSWin-UNet (34)	91.46	89.68	88.94	95.76
MISSFormer (35)	90.86	89.55	88.04	94.99
SegFormer3D (36)	90.96	88.50	88.86	95.53
D-Former (49)	92.29	91.33	89.60	95.93
UTNet (40)	88.30	90.41	89.15	94.39
ScribFormer (48)	88.80	87.10	87.10	92.20
MedFuseNet (50)	89.73	88.36	86.29	94.54
UNestFormer (52)	93.42	93.12	91.06	96.08
UCTNet (53)	92.91	91.32	89.75	94.65
nnFormer (55)	92.06	90.94	89.58	95.65
GCtx-UNet (60)	91.23	89.88	87.25	96.57
PHTrans (96)	91.79	90.13	89.58	95.76
LeVit-Unet (97)	90.32	89.55	87.64	93.76
VT-UNet^† (98)	91.13	89.44	88.42	95.53
Mixed Transformer (99)	90.43	86.64	89.04	95.62
MedFormer (Gao et al. in preprinted article reference)	92.14	90.95	89.71	95.76

^†, model was pretrained on ImageNet. ACDC, Automated Cardiac Diagnosis Challenge; DSC, Dice similarity coefficient; LV, left ventricle; MISSFormer, Medical Image Segmentation Transformer; Myo, left ventricular myocardium; nnFormer, No New Network Transformer; PHTrans, Parallelly Aggregating Global and Local Representations Transformer; RV, right ventricle; SegFormer3D, Segmentation Transformer 3D; UCTNet, UNet + Convolution + Transformer Network; UNestFormer, Enhancing Decoders and Skip Connections with Nested Transformers.

Similarly, we evaluated each model listed in Table 7 using the DSC (expressed as a percentage) as the evaluation metric. The table provides an overview of each network’s performance in the four subcategories of the left ventricle, right ventricle, and left ventricular myocardium (Myo), as well as the average Dice score across all three categories. The UNestFormer model achieved the highest average DSC score. It is a hybrid architecture specifically designed for 2D medical image segmentation, with transformer modules embedded in both the decoder and skip connections. Following closely behind was UCTNet, whose average DSC was 0.51% lower than that of UNestFormer, demonstrating strong competitiveness. However, in the Myo sub-category, UNestFormer also outperformed other networks. Overall, we can tentatively consider to be the optimal network.

It should be clarified that although this review focuses on the research on CNN and transformer hybrid module architecture models, this does not imply that pure CNN model architectures hold no value in the field of medical image segmentation. For instance, the nnU-Net architecture reported by Wasserthal et al. (100) employs a self-adaptive framework based on U-Net. This framework can automatically configure hyperparameters according to the characteristics of the dataset, eliminating the need for manual parameter tuning. This architecture achieved an overall DSC of 94.3%, demonstrating highly competitive performance. Readers interested in researching the field of image segmentation, they should refer to the work by Wasserthal et al.

Conclusions

Transformer-based image segmentation approaches have demonstrated outstanding performance in numerous image-related applications, including in the field of medical imaging. This can be attributed to the self-attention mechanism within transformers, which enables the model to capture global relationships in images. Regarding the application of transformers in medical imaging, this review categorizes them into two major types according to network model architecture: pure transformer architectures and hybrid architectures. Moreover, we described several improved structures that have emerged in recent years, as well as some of the latest trends and training techniques aimed at enhancing the performance of ViT-based medical image segmentation methods.

In the research domain on pure transformer architectures, we selected the most representative Swin-Unet network architecture as an example. Using this model as a starting point, we progressively reviewed the development trajectory of pure transformer models in recent years and their performance on public datasets. In the field of hybrid model architectures, hybrid models predominantly adopt U-shaped network architectures as the mainstream design. In light of this, we conducted an in-depth analysis of this architecture and categorized it in detail for detailed discussion.

Studies on transformer-based models have primarily focused on hybrid architectures. Hybrid models often outperform those with pure transformer architectures. Furthermore, among hybrid models, such as the TransUNet, Swin SMT, and DA-TransUNet models, it is more common to incorporate transformer designs in the encoder part. However, designs that employ transformers in the decoder are relatively rare, yet such models have demonstrated outstanding performance on public datasets. For instance, the UCTNet model proposed by Yan et al. achieved an excellent average DSC of 89.44% on the Synapse dataset.

For the quantitative analysis, we selected the commonly used Synapse and ACDC datasets examples to conduct detailed data comparisons and summaries of recently developed network models. Additionally, we systematically summarized and categorized the commonly used evaluation metrics and datasets in the field of medical image segmentation. This revealed differences in the selection of evaluation metrics across different datasets. For example, analyses with the Synapse dataset typically include a dual-metric evaluation approach comprising DSC and HD; meanwhile, analyses with the ACDC dataset include DSC and Myo as the evaluation metrics. For readers uncertain about which evaluation metrics to choose in their experiments, we recommend prioritizing the DSC metric. Although this paper provides a comprehensive summary of public datasets and introduces common dataset formats, further elaboration upon the specific data formats used by the models is still needed.

Certain challenges remain regarding the practical clinical of transformer applications, but several novel solutions and future directions are emerging:

Researching more efficient architectures and algorithms has the primary objective of reducing computational overhead and enhancing efficiency. Researchers can explore lighter-weight self-attention mechanisms and the deeper integration of CNNs and transformers, enabling models to reduce parameters and computational load while maintaining performance.
The fusion of location-related prior information can help the model highlight the key features of the target task. The position coding of the transformer can be carefully designed, and the prior knowledge of image position can be incorporated so as to improve the generalization ability of the model.
The outstanding performance of deep learning models often relies on larger-scale datasets, but this is not the case in the field of medical imaging. This limits their practicality in real-time application scenarios. To prevent models from overfitting, future research may focus on developing novel data augmentation methods to increase the diversity of datasets. In supervised learning models, well-annotated datasets are indispensable; however, this process is both costly and time-consuming. Therefore, we suggest that future research should concentrate on unsupervised learning techniques, which can automatically generate labels for clinical image analysis.
Built upon the ViT architecture, the Segment Anything Model (SAM) has further advanced this field (101), extending the application of transformers to the realm of large-scale, prompt-guided image segmentation. SAM employs a ViT-based image encoder to extract high-level visual features (102) and couples this with a flexible prompt encoder and a lightweight mask decoder. This enables the model to generate corresponding segmentation masks based on various forms of user prompts, such as points, boxes, or masks. Trained on a dataset comprising billions of masks covering a diverse range of visual concepts, SAM has now become a highly representative multimodal foundational model in the field of segmentation, demonstrating strong cross-domain generalization capabilities.

Multimodal large language models have been examined across numerous studies (103), and they no longer represent a future research direction but rather a present reality. Our focus should shift to whether transformer-based architectures should accommodate such large language models. We believe that large language models should be taken into account because multimodal large language models have further achieved advanced semantic understanding, complementing pixel-level predictions. With their powerful language reasoning capabilities, such models can interpret ambiguous user queries, analyze semantic relationships between anatomical entities, and provide contextual priors that are difficult to capture through purely visual supervised encoding. This capability is particularly advantageous in the field of medical image segmentation, given the variability of anatomical structures, the scarcity of annotated data, and the complexity of clinical semantics, which have long been challenges in this domain. Meanwhile, research on multimodal large language models has made remarkable progress in recent years. Models such as GPT-4 Vision, Gemini, and Claude 3 have demonstrated the ability to jointly understand images and text, while models such as Large Language and Vision Assistant (LLaVA) and Bootstrapped Language-Image Pretraining, version 2 (BLIP-2) include mature visual-language architectures that integrate visual encoders with large language models.

Overall, the introduction of transformers heralds a new era in the field of medical image segmentation. As researchers continue to innovate and refine transformer-based methods, the possibilities of medical image analysis continue to expand and have the potential to profoundly improve patient care and healthcare outcomes.

Acknowledgments

The authors thank the anonymous referees for their valuable comments.

Footnote

Reporting Checklist: The authors have completed the Narrative Review reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2381/rc

Funding: This paper is in part supported by the National Natural Science Foundation of China (No. 62376231) and the Sichuan Science and Technology Program (Nos. 2024NSFSC1070 and 2026YFHZ0206).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2381/coif). X.S. is employed at Sichuan Bank Co., Ltd. She is the technical leader in this study while her institution has no financial interest in this paper. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Yao W, Bai J, Liao W, Chen Y, Liu M, Xie Y. From CNN to Transformer: A Review of Medical Image Segmentation Models. J Imaging Inform Med 2024;37:1529-47. [Crossref] [PubMed]
Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2015: Springer.
Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
Liao M, Yang R, Zhao Y, Liang W, Yuan J. FocalTransNet: A Hybrid Focal-Enhanced Transformer Network for Medical Image Segmentation. IEEE Trans Image Process 2025;34:5614-27. [Crossref] [PubMed]
Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M. Transformers in vision: A survey. ACM Comput Surv 2022;54:1-41.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
Khan RF, Lee BD, Lee MS. Transformers in medical image segmentation: a narrative review. Quant Imaging Med Surg 2023;13:8747-67. [Crossref] [PubMed]
Bougourzi F, Dornaika F, Taleb-Ahmed A, Truong Hoang V. Rethinking attention gated with hybrid dual pyramid transformer-cnn for generalized segmentation in medical imaging. In: International Conference on Pattern Recognition; 2024: Springer.
Kumar A, Kanthen KR, John J. GS-TransUNet: integrated 2D Gaussian splatting and transformer UNet for accurate skin lesion analysis. In: Medical Imaging 2025: Computer-Aided Diagnosis; 2025: SPIE.
Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth HR, Xu D. UNETR: Transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2022.
Zhang Y, Xi R, Wang W, Li H, Hu L, Lin H, Towey D, Bai R, Fu H, Higashita R. Low-contrast medical image segmentation via transformer and boundary perception. IEEE Trans Emerg Top Comput Intell 2024;8:2297-309.
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021.
Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S. A Noise-Robust Framework for Automatic Segmentation of COVID-19 Pneumonia Lesions From CT Images. IEEE Trans Med Imaging 2020;39:2653-63. [Crossref] [PubMed]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations; 2021.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016.
Rauf Z, Sohail A, Khan SH, Khan A, Gwak J, Maqbool M. Attention-guided multi-scale deep object detection framework for lymphocyte analysis in IHC histological images. Microscopy (Oxf) 2023;72:27-42. [Crossref] [PubMed]
Rauf Z, Khan AR, Sohail A, Alquhayz H, Gwak J, Khan A. Lymphocyte detection for cancer analysis using a novel fusion block based channel boosted CNN. Sci Rep 2023;13:14047. [Crossref] [PubMed]
Khan SH, Sohail A, Khan A, Hassan M, Lee YS, Alam J, Basit A, Zubair S. COVID-19 detection in chest X-ray images using deep boosted hybrid learning. Comput Biol Med 2021;137:104816. [Crossref] [PubMed]
Kaymak R, Kaymak C, Ucar A. Skin lesion segmentation using fully convolutional networks: A comparative experimental study. Expert Systems with Applications 2020;161:113742.
Zhou X, Li X, Hu K, Zhang Y, Chen Z, Gao X. ERV-Net: An efficient 3D residual neural network for brain tumor segmentation. Expert Systems with Applications 2021;170:114566.
Taye MM. Theoretical understanding of convolutional neural network: Concepts, architectures, applications, future directions. Computation 2023;11:52.
Zhang Y, Wang J, Gorriz JM, Wang S. Deep Learning and Vision Transformer for Medical Image Analysis. J Imaging 2023;9:147. [Crossref] [PubMed]
Wang C, Wang L, Wang N, Wei X, Feng T, Wu M, Yao Q, Zhang R. CFATransUnet: Channel-wise cross fusion attention and transformer for 2D medical image segmentation. Comput Biol Med 2024;168:107803. [Crossref] [PubMed]
Okolo GI, Katsigiannis S, Ramzan N. IEViT: An enhanced vision transformer architecture for chest X-ray image classification. Comput Methods Programs Biomed 2022;226:107141. [Crossref] [PubMed]
Li J, Chen J, Tang Y, Wang C, Landman BA, Zhou SK. Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives. Med Image Anal 2023;85:102762. [Crossref] [PubMed]
Lin X, Yu L, Cheng KT, Yan Z. BATFormer: Towards Boundary-Aware Lightweight Transformer for Efficient Medical Image Segmentation. IEEE J Biomed Health Inform 2023;27:3501-12. [Crossref] [PubMed]
Chaoyang Z, Shibao S, Wenmao H, Pengcheng Z. FDR-TransUNet: A novel encoder-decoder architecture with vision transformer for improved medical image segmentation. Comput Biol Med 2024;169:107858. [Crossref] [PubMed]
Xiao H, Li L, Liu Q, Zhang Q, Liu J, Liu Z. Context-aware and local-aware fusion with transformer for medical image segmentation. Phys Med Biol 2024;69: [Crossref] [PubMed]
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision (ECCV). Springer; 2022:425-44.
Wang L, Li R, Zhang C, Fang S, Duan C, Meng X, Atkinson PM. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing 2022;190:196-214.
Wang X, Yang S, Zhang J, Wang M, Zhang J, Huang J, Yang W, Han X. TransPath: Transformer-based self-supervised learning for histopathological image classification. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2021:166-76.
Khan A, Rauf Z, Khan AR, Rathore S, Khan SH, Shah N, Farooq U, Asif H, Asif A, Zahoora U. A recent survey of vision transformers for medical image segmentation.IEEE Access 2025;13:191824-49.
Jain J, Li J, Chiu MT, Hassani A, Orlov N, Shi H. OneFormer: One transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2023:11424-34.
Liu X, Gao P, Yu T, Wang F, Yuan RY. CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation. Information Fusion 2025;113:102634.
Huang X, Deng Z, Li D, Yuan X, Fu Y. MISSFormer: An Effective Transformer for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2023;42:1484-94. [Crossref] [PubMed]
Perera S, Navard P, Yilmaz A. SegFormer3D: an efficient transformer for 3d medical image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2024:11026-36.
Ding H, Zhang X, Lu W, Yuan F, Luo H. MMAFormer: multiscale modality-aware transformer for medical image segmentation. Electronics 2024;13:4636.
Azad R, Heidari M, Shariatnia M, Aghdam EK, Karimijafarbigloo S, Adeli E, Merhof D. TransDeepLab: Convolution-free transformer-based DeepLab V3+ for medical image segmentation. In: International Workshop on Predictive Intelligence in Medicine (PRIME). Springer; 2022:128-39.
Wu Y, Liao K, Chen J, Wang J, Chen DZ, Gao H, Wu J. D-Former: A u-shaped dilated transformer for 3d medical image segmentation. Neural Computing and Applications 2023;35:1931-44.
Gao Y, Zhou M, Metaxas DN. UTNet: A hybrid transformer architecture for medical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2021:610-20.
Wang H, Cao P, Wang J, Zaiane OR. UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). AAAI Press; 2022:15347-55.
Płotka S, Chrabaszcz M, Biecek P. Swin-SMT: Global sequential modeling for enhancing 3d medical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer; 2024:566-76.
Heidari M, Kazerouni A, Soltany M, Azad R, Aghdam EK, Cohen-Adad J, Merhof D. HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE; 2023:3254-64.
Sun G, Pan Y, Kong W, Xu Z, Ma J, Racharak T, Nguyen LM, Xin J. DA-TransUNet: integrating spatial and channel dual attention with transformer U-net for medical image segmentation. Front Bioeng Biotechnol 2024;12:1398237. [Crossref] [PubMed]
WangBZhaoZWeiZZhaiJTianXZhangX.Ma-Transunet: U-Shaped Transformer with Multi-Scale Cnn-Based Auxiliary Network for Medical Image Segmentation. Available at SSRN 4826331.
LiuSZhaoM.DUCFNet: Dual U-shaped Cross-modal Fusion Network for Lung Infection Region Segmentation.Authorea Preprints 2024.
Valanarasu JMJ, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: Gated axial-attention for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention; 2021. Springer.
Li Z, Zheng Y, Shan D, Yang S, Li Q, Wang B, Zhang Y, Hong Q, Shen D. ScribFormer: Transformer Makes CNN Work Better for Scribble-Based Medical Image Segmentation. IEEE Trans Med Imaging 2024;43:2254-65. [Crossref] [PubMed]
Zhang L, Zuo Y, Jia Y, Li D, Zeng R, Li D, Chen J, Wang W. DS-Former: A dual-stream encoding-based transformer for 3D medical image segmentation. Biomedical Signal Processing and Control 2024;89:105702.
Chen R, He S, Xie J, Wang T, Xu Y, Fang J, Zhao X, Zhang S, Wang G, Lu H, Yang Z. MedFuseNet: fusing local and global deep feature representations with hybrid attention mechanisms for medical image segmentation. Sci Rep 2025;15:5093. [Crossref] [PubMed]
Tomar NK, Shergill A, Rieders B, Bagci U, Jha D. TransResU-Net: A Transformer based ResU-Net for Real-Time Colon Polyp Segmentation. Annu Int Conf IEEE Eng Med Biol Soc 2023;2023:1-4. [Crossref] [PubMed]
Tayeb AM, Kim T-H. Unestformer: Enhancing decoders and skip connections with nested transformers for medical image segmentation. IEEE Access 2024;12:190996-1009.
Guo X, Lin X, Yang X, Yu L, Cheng K-T, Yan Z. UCTNet: Uncertainty-guided CNN-Transformer hybrid networks for medical image segmentation. Pattern Recognition 2024;152:110491.
Kamsari M, Sadeghi S, de Oliveira GG, Alves AM, Sarshar NT, Anari S, Ranjbarzadeh R. The role of data augmentation and attention mechanisms in UNet and ConvNeXt architectures for optimizing breast tumor segmentation. Sci Rep 2025;15:45268. [Crossref] [PubMed]
Zhou HY, Guo J, Zhang Y, Han X, Yu L, Wang L, Yu Y. nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer. IEEE Trans Image Process 2023;32:4036-45. [Crossref] [PubMed]
Lan L, Cai P, Jiang L, Liu X, Li Y, Zhang Y. Brau-net++: U-shaped hybrid cnn-transformer network for medical image segmentation. IEEE Transactions on Radiation and Plasma Medical Sciences 2026; [Crossref]
Upadhyay AK, Bhandari AK. MaS-TransUNet: A multiattention Swin Transformer U-Net for medical image segmentation. IEEE Transactions on Radiation and Plasma Medical Sciences 2024;9:613-26.
Pan S, Liu X, Xie N, Chong Y. EG-TransUNet: a transformer-based U-Net with enhanced and guided models for biomedical image segmentation. BMC Bioinformatics 2023;24:85. [Crossref] [PubMed]
Lin A, Chen B, Xu J, Zhang Z, Lu G, Zhang D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Transactions on Instrumentation and Measurement 2022;71:1-15.
Alrfou K, Zhao T. GC-UNet: Efficient network for medical image segmentation. Multimedia Tools and Applications 2026;85:137.
Xia Z, Li H, Lan L. DSSAU-Net: U-Shaped Hybrid Network for Pubic Symphysis and Fetal Head Segmentation. Intrapartum Ultrasound Grand Challenge. Springer; 2024:32-45.
Ates GC, Mohan P, Celik E. Dual cross-attention for medical image segmentation. Engineering Applications of Artificial Intelligence 2023;126:107139.
Xie X, Yang M. USCT-UNet: Rethinking the Semantic Gap in U-Net Network From U-Shaped Skip Connections With Multichannel Fusion Transformer. IEEE Trans Neural Syst Rehabil Eng 2024;32:3782-93. [Crossref] [PubMed]
Zhang Y, Liu L, Han Z, Meng F, Zhang Y, Zhao Y. TranSEFusionNet: Deep fusion network for colorectal polyp segmentation. Biomedical Signal Processing and Control 2023;86:105133.
Imran M, Krebs JR, Gopu VRR, Fazzone B, Sivaraman VB, Kumar A, Viscardi C, Heithaus RE, Shickel B, Zhou Y, Cooper MA, Shao W. CIS-UNet: Multi-class segmentation of the aorta in computed tomography angiography via context-aware shifted window self-attention. Comput Med Imaging Graph 2024;118:102470. [Crossref] [PubMed]
Zhang H, Lian J, Yi Z, Wu R, Lu X, Ma P, Ma Y. HAU-Net: Hybrid CNN-transformer for breast ultrasound image segmentation. Biomedical Signal Processing and Control 2024;87:105427.
Hille G, Agrawal S, Tummala P, Wybranski C, Pech M, Surov A, Saalfeld S. Joint liver and hepatic lesion segmentation in MRI using a hybrid CNN with transformer layers. Comput Methods Programs Biomed 2023;240:107647. [Crossref] [PubMed]
Chen J, He Y, Frey E, Li Y, Du Y. ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration. In: Medical Imaging with Deep Learning; 2021.
Li X, Pang S, Zhang R, Zhu J, Fu X, Tian Y, Gao J. ATTransUNet: An enhanced hybrid transformer architecture for ultrasound and histopathology image segmentation. Comput Biol Med 2023;152:106365. [Crossref] [PubMed]
Chen B, Liu Y, Zhang Z, Lu G, Kong AWK. Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation. IEEE Transactions on Emerging Topics in Computational Intelligence 2023;8:55-68.
Shi L, Gao T, Zhang Z, Zhang J. STM-UNet: An efficient U-shaped architecture based on Swin transformer and multiscale MLP for medical image segmentation. In: GLOBECOM 2023–2023 IEEE Global Communications Conference; 2023. IEEE.
Wu H, Min W, Gai D, Huang Z, Geng Y, Wang Q, Chen R. HD-Former: A hierarchical dependency Transformer for medical image segmentation. Comput Biol Med 2024;178:108671. [Crossref] [PubMed]
Song P, Yu Y, Zhang Y. Tth-net: Two-stage transformer–cnn hybrid network for leaf vein segmentation. Applied Sciences 2023;13:11019.
Wang R, Lei T, Cui R, Zhang B, Meng H, Nandi AK. Medical image segmentation using deep learning: A survey. IET Image Processing 2022;16:1243-67.
Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jaeger PF, Maier-Hein KH. Mednext: transformer-driven scaling of convnets for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2023.
Liao W, Xiong H, Wang Q, Mo Y, Li X, Liu Y, Chen Z, Huang S, Dou D. Muscle: Multi-task self-supervised continual learning to pre-train deep models for x-ray images of multiple body parts. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2022.
Sun P, Mo Z, Hu F, Liu F, Mo T, Zhang Y, Chen Z. Kidney Tumor Segmentation Based on FR2PAttU-Net Model. Front Oncol 2022;12:853281. [Crossref] [PubMed]
Gaggion N, Mosquera C, Mansilla L, Saidman JM, Aineseder M, Milone DH, Ferrante E. CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Sci Data 2024;11:511. [Crossref] [PubMed]
Hassan R, Mondal MRH, Ahamed SI. UDBRNet: A novel uncertainty driven boundary refined network for organ at risk segmentation. PLoS One 2024;19:e0304771. [Crossref] [PubMed]
Jin B, Hernández MdCV, Fontanella A, Li W, Platt E, Armitage P, Storkey A, Wardlaw JM, Mair G. Pre-processing and quality control of large clinical CT head datasets for intracranial arterial calcification segmentation. In: MICCAI Workshop on Data Engineering in Medical Imaging (DEMI). Springer; 2024.
Mateen M, Hayat S, Arshad F, Gu YH, Al-Antari MA. Hybrid Deep Learning Framework for Melanoma Diagnosis Using Dermoscopic Medical Images. Diagnostics (Basel) 2024;14:2242. [Crossref] [PubMed]
Xie L, Lin M, Xu C, Luan T, Zeng Z, Qian W, Li C, Fang Y, Shen Q, Wu Z. MH-PFLGB: Model Heterogeneous Personalized Federated Learning via Global Bypass for Medical Image Analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2024.
Zhang L, Yin X, Liu X, Liu Z. Medical image segmentation by combining feature enhancement Swin Transformer and UperNet. Sci Rep 2025;15:14565. [Crossref] [PubMed]
Armato SG 3rd, McLennan G, Bidaut L, McNitt-Gray MF, Meyer CR, Reeves AP, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys 2011;38:915-31. [Crossref] [PubMed]
Kumar N, Verma R, Sharma S, Bhargava S, Vahadane A, Sethi A. A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology. IEEE Trans Med Imaging 2017;36:1550-60. [Crossref] [PubMed]
Caicedo JC, Goodman A, Karhohs KW, Cimini BA, Ackerman J, Haghighi M, Heng C, Becker T, Doan M, McQuin C, Rohban M, Singh S, Carpenter AE. Nucleus segmentation across imaging experiments: the 2018 Data Science Bowl. Nat Methods 2019;16:1247-53. [Crossref] [PubMed]
Li T, Gao Y, Wang K, Guo S, Liu H, Kang H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 2019;501:511-22.
Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, et al. The Medical Segmentation Decathlon. Nat Commun 2022;13:4128. [Crossref] [PubMed]
Laiton-Bonadiez C, Sanchez-Torres G, Branch-Bedoya J. Deep 3D Neural Network for Brain Structures Segmentation Using Self-Attention Modules in MRI Images. Sensors (Basel) 2022;22:2559. [Crossref] [PubMed]
Liu J, Li K, Huang C, Dong H, Song Y, Li R. MixFormer: A mixed CNN–transformer backbone for medical image segmentation. IEEE Transactions on Instrumentation and Measurement 2024;74:1-20.
Simpson AL, Peoples J, Creasy JM, Fichtinger G, Gangai N, Keshavamurthy KN, Lasso A, Shia J, D'Angelica MI, Do RKG. Preoperative CT and survival data for patients undergoing resection of colorectal liver metastases. Sci Data 2024;11:172. [Crossref] [PubMed]
Wang F, Curran KM, Silvestre G. Semi-supervised cervical segmentation on ultrasound by a dual framework for neural networks. In: 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI). IEEE; 2025.
Wang J, Ma Y, Xu C, Chu M, Fan Z, Wu D. Cwin-Net: A channel window attention network for magnetic resonance image super-resolution. Biomedical Signal Processing and Control 2025;110:108119.
Stirenko S, Kochura Y, Alienin O, Rokovyi O, Gordienko Y, Gang P, Zeng W. Chest X-ray analysis of tuberculosis by deep learning with segmentation and augmentation. 2018 IEEE 38th International Conference on Electronics and Nanotechnology (ELNANO). IEEE; 2018.
Yao C, Hu M, Li Q, Zhai G, Zhang XP. Transclaw u-net: claw u-net with transformers for medical image segmentation. In: 2022 5th International Conference on Information Communication and Signal Processing (ICICSP). IEEE; 2022.
Liu W, Tian T, Xu W, Yang H, Pan X, Yan S, Wang L. PHTrans: Parallelly aggregating global and local representations for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2022.
Xu G, Zhang X, He X, Wu X. LeViT-UNet: Make faster encoders with transformer for medical image segmentation. Chinese conference on pattern recognition and computer vision (PRCV). Springer; 2023.
Peiris H, Hayat M, Chen Z, Egan G, Harandi M. A robust volumetric transformer for accurate 3D tumor segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2022.
Wang H, Xie S, Lin L, Iwamoto Y, Han XH, Chen YW, Tong R. Mixed transformer u-net for medical image segmentation. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2022.
Wasserthal J, Breit HC, Meyer MT, Pradella M, Hinck D, Sauter AW, Heye T, Boll DT, Cyriac J, Yang S, Bach M, Segeroth M. TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images. Radiol Artif Intell 2023;5:e230024. [Crossref] [PubMed]
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023.
Gu H, Colglazier R, Dong H, Zhang J, Chen Y, Yildiz Z, et al. SegmentAnyBone: A universal model that segments any bone at any location on MRI. Med Image Anal 2025;101:103469. [Crossref] [PubMed]
Wang Z, Wu Z, Agarwal D, Sun J. MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. Proc Conf Empir Methods Nat Lang Process 2022;2022:3876-87.

Cite this article as: Xu Y, Peng Y, Zhang C, Jiang K, She X, Feng L. Application of transformer models in medical image segmentation: a narrative review. Quant Imaging Med Surg 2026;16(5):421. doi: 10.21037/qims-2025-aw-2381