A novel dual-branch segmentation algorithm for overall spine segmentation
Introduction
Low back pain has emerged as a significant global public health challenge (1). When physical therapy or medication fails to effectively alleviate symptoms, surgery often becomes a necessary intervention. Accurate spine segmentation during surgery assists surgeons in identifying and localizing target vertebrae more clearly, thereby avoiding serious risks associated with misjudgment or improper procedures. However, traditional methods require the manual labeling of vertebrae individually, a labor-intensive and time-consuming process. Therefore, the development of a precise and efficient fully automated segmentation system for the identification and segmentation of spinal structures in computed tomography (CT) images has become a critical concern in the field of spine surgery.
Currently, medical image segmentation methods utilizing encoder-decoder architectures, exemplified by Unet (2), are widely employed and exhibit outstanding performance. For instance, Attunet (3) significantly enhances segmentation accuracy through the use of attention gates. Unet++ (4) employs nested and dense skip connections to mitigate the semantic gap between levels, thereby improving segmentation outcomes. Unet 3+ (5) achieves comprehensive feature extraction by fusing low-level details and high-level semantic features at multiple scales, which further enhances segmentation performance. Additionally, R2Unet (6) not only improves the accuracy of medical image segmentation but also optimizes processing efficiency by incorporating residual blocks and recurrent mechanisms. Moreover, Resunet (7) combines residual units with the U-shaped architecture and demonstrated excellent segmentation performance, originally for road extraction from remote-sensing images. While U-shaped networks have shown outstanding performance in numerous medical image segmentation tasks, they still face inherent limitations: the architecture excels at capturing local features but is deficient in handling long-range dependencies and integrating global contextual information, which limits its effectiveness in segmenting highly complex and variable medical images.
In recent years, the Transformer model has garnered significant attention due to its multi-head self-attention (MSA) mechanism, which effectively models long-range dependencies. Inspired by this, researchers have begun exploring the application of Transformers in medical image segmentation to compensate for the shortcomings of traditional methods, such as Unet, in processing global information. For example, Cao et al. proposed Swin-Unet (8), which successfully integrated the Swin Transformer (9) into a U-shaped segmentation network, significantly enhancing the accuracy of multi-organ segmentation. Chen et al. combined the strengths of Transformers and convolutional neural networks (CNNs) to propose TransUNet (10), aiming to further improve segmentation performance. Xie et al. designed a pure Transformer architecture, SegFormer (11), which demonstrated efficient and powerful semantic segmentation through the combination of a hierarchical encoder and a lightweight decoder. Huang et al. subsequently introduced MISSFormer (12), achieving more accurate cardiac segmentation by leveraging global information at different scales. Building on Maskformer (14), Mask2former (13) incorporates multi-scale high-resolution features and a masked attention mechanism, effectively reducing background interference and significantly improving segmentation performance. However, while traditional Transformers excel at global modeling, they often overlook the rich contextual information between neighboring keys when calculating the attention matrix. As a result, they learn primarily from isolated query-key pairs, limiting their ability to capture complex feature relationships within an image and potentially leading to suboptimal performance in medical image segmentation tasks.
To address the aforementioned challenges, this paper proposes a segmentation network, DBU-Net, specifically designed for spine CT images. The network integrates the contextual Transformer (CoT) (15) and multi-scale feature channel attention (MFCA) modules into the nnUnet (16) framework, achieving a seamless integration of global context and cross-scale feature interaction. This enables the network to capture and analyze complex features in spine CT images with greater detail and comprehensiveness, ultimately leading to accurate and efficient automatic segmentation of spine CT images. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/rc).
Methods
Datasets
This research utilized portions of the Vertebrae Segmentation (VerSe) 2019 and 2020 datasets, which are part of the vertebral labelling and segmentation benchmarking challenge (17-19). The dataset comprises data from 278 patients acquired across various centers and scanner vendors, including Siemens, General Electric, Philips, and Toshiba. The minimum patient age was 18 years, and the images included 7 fully visible vertebrae (excluding the sacrum and transitional vertebrae) with a minimum voxel spacing of 1.5 mm (craniocaudal), 1 mm (anterior-posterior), and 3 mm (left-right). Additionally, the CTSpine1K (20) dataset was used as an external test set to evaluate generalization. CTSpine1K is sourced from multiple medical centers and includes images acquired with different devices, making the data diverse and representative of various real-world clinical scenarios. Figure 1 shows example images from the VerSe 2019 dataset; as illustrated, the dataset includes CT images with different views and sizes. The datasets analyzed in this study are publicly available at https://github.com/anjany/verse and https://github.com/MIRACLE-Center/CTSpine1K.
This study utilized two publicly available datasets. The datasets were made accessible to the public in legal compliance, and the data were de-identified before release, containing no individually identifiable information. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of Shengjing Hospital, China Medical University (No. 2024PS1106K). The requirement of informed consent was waived due to the retrospective nature of the study.
Image preprocessing and data augmentation
In this study, we resampled the voxels to an isotropic 1.0×1.0×1.0 mm³ and resized the images to ensure homogeneity across all axes. Subsequently, we normalized the images and binarized the masks. To improve the generalizability of the system, we applied data augmentation during training, including random rotations of ±30 degrees, scaling, elastic transformations, additive noise, gamma correction, and flipping. We divided the 278 patients into two groups: 238 patients for training and validation (with 20% of the images randomly selected for validation) and 40 patients for testing, ensuring that the test set remained completely unseen during training.
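For reference, the following is a minimal sketch of the resampling, normalization, and a subset of the augmentations described above (rotation, flipping, and gamma correction); it is illustrative only, does not reproduce the exact nnUnet augmentation pipeline, and omits the scaling, elastic deformation, and additive-noise steps for brevity. Function names and parameter ranges here are assumptions, not the authors' code.

```python
import numpy as np
from scipy import ndimage


def resample_to_isotropic(volume: np.ndarray, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to 1.0x1.0x1.0 mm voxels with trilinear interpolation."""
    zoom_factors = [s / ns for s, ns in zip(spacing, new_spacing)]
    return ndimage.zoom(volume, zoom_factors, order=1)


def normalize(volume: np.ndarray):
    """Z-score intensity normalization."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


def augment(volume: np.ndarray, mask: np.ndarray, rng=None):
    """Random in-plane rotation within +/-30 degrees, random flip, and gamma correction."""
    rng = rng or np.random.default_rng()
    angle = rng.uniform(-30, 30)
    volume = ndimage.rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, axes=(1, 2), reshape=False, order=0)
    if rng.random() < 0.5:  # flip along one in-plane axis, keeping image and mask aligned
        volume, mask = np.flip(volume, axis=2), np.flip(mask, axis=2)
    gamma = rng.uniform(0.7, 1.5)  # gamma correction applied on a [0, 1]-rescaled copy
    v01 = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    return v01 ** gamma, mask
```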
nnUnet
In medical image segmentation, researchers frequently develop specialized algorithms for specific tasks and issues. However, this approach may lead to models with poor generalization and robustness. The authors of nnUnet argue that tuning the network structure can lead to overfitting on specific datasets, and that aspects unrelated to the network structure often have a greater impact on segmentation performance; they therefore emphasize the optimization of pre-processing, training, and post-processing procedures, primarily focusing on image processing. nnUnet systematizes these configuration choices into a set of fixed, rule-based, and empirically determined parameters (e.g., training batch size, patch size, and number of downsampling steps), demonstrating the applicability of such strategies across a wide range of medical image segmentation tasks.
The architecture of nnUnet mirrors that of Unet, adhering to the encoder-decoder design paradigm and comprising a series of convolutional blocks. The Unet network structure is illustrated in Figure 2: skip connections are employed between the encoder and decoder, and efficient feature mapping between internal blocks is achieved by concatenating the generated features as supplementary information. nnUnet replaces the ReLU activation function in Unet with LeakyReLU and, to circumvent the constraints of small batch sizes, employs instance normalization instead of the more commonly used batch normalization. To enhance training stability and accelerate convergence on the image foreground, patch sampling ensures that more than one-third of the samples in each batch contain foreground voxels. These enhancements significantly improve the stability and adaptability of the nnUnet training process, addressing training instability caused by variations in imaging methods, image sizes, and voxel spacing, and enabling its widespread use across various scenarios.
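As an illustration of the Conv → InstanceNorm → LeakyReLU ordering described above, a minimal PyTorch sketch of an nnUnet-style convolutional block follows; the channel widths and the LeakyReLU slope are example values, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """nnUnet-style double convolution block: Conv -> InstanceNorm -> LeakyReLU, twice."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch, affine=True),   # instance norm instead of batch norm
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch, affine=True),
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


x = torch.randn(2, 1, 384, 384)        # (batch, channel, H, W) CT slice patch
print(ConvBlock(1, 32)(x).shape)       # torch.Size([2, 32, 384, 384])
```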
DBU-Net structure
Given nnUnet’s outstanding performance in data processing efficiency, training stability, and broad adaptability, we chose it as the foundational backbone network and further enhanced it to improve inter-pixel information interaction efficiency and the precise characterization of contextual information. Figure 3A illustrates the overall structure of our proposed DBU-Net network.

First, to enhance the network’s capacity for integrating contextual information, we introduce an enhanced CoT module in the decoder branch. As shown in Figure 3B, this module first extracts static contextual information from keys and then performs a self-attention operation using two consecutive 1×1 convolutional layers to generate dynamic context. Ultimately, a more comprehensive output is achieved by seamlessly fusing static and dynamic contexts. By integrating these two types of contextual information, the model leverages both global and local features, thereby enhancing feature representation richness and improving segmentation accuracy.
Additionally, to achieve comprehensive capture and efficient integration of multi-scale feature information in medical images, we designed the MFCA module. This module automatically identifies and enhances important feature channels through cross-scale feature interaction and an attention mechanism, while reducing redundant interference information. This innovative design allows the network to efficiently extract semantically rich and detailed feature representations from complex medical images, thereby providing more comprehensive feature information for subsequent decoding tasks.
Contextual transformer block
To address the issue that traditional Transformers overlook the rich contextual information between neighboring keys when computing the attention matrix, the CoT module applies context encoding to the input keys to fully exploit this information, thereby enhancing the model's ability to capture and represent image features. Specifically, for a given two-dimensional (2D) feature map $X \in \mathbb{R}^{H \times W \times C}$ (where H denotes height, W denotes width, and C denotes the number of channels), the keys, queries, and values are defined as $K = X$, $Q = X$, and $V = XW_v$, respectively. The static contextual information between locally neighboring keys is first represented spatially by applying a grouped convolution over all neighboring keys within a 3×3 grid, yielding the static context key $K_1$. Subsequently, $K_1$ is concatenated with the query $Q$, and the attention matrix $A$ is obtained through two consecutive 1×1 convolutions (with learnable weights $W_\theta$ and $W_\delta$) followed by normalization:

$$A = \mathrm{Softmax}\big([K_1, Q]\, W_\theta W_\delta\big)$$

At this stage, each local attention matrix no longer depends solely on isolated query-key pairs but is instead based on a richer, context-informed representation. Next, the values $V$ are aggregated with the attention matrix, as in a typical self-attention mechanism, to obtain the dynamic context $K_2$:

$$K_2 = V \circledast A$$

Here, $\circledast$ denotes a local matrix multiplication operation. Since each element of $K_2$ is derived from the values $V$ and their dynamic association with the query $Q$ and the static context $K_1$, $K_2$ is used as the dynamic context representation of the input. Finally, the static context $K_1$ and the dynamic context $K_2$ are fused to produce the final output. Additionally, a 3×3 convolutional layer is applied to the fused feature map for local feature extraction and integration; its weighted summation over the receptive field enhances the spatial dependencies between features, enabling the model to learn more complex and abstract feature representations.
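For illustration, the following is a minimal PyTorch sketch of a CoT-style block consistent with the description above. It is a simplified reading rather than the exact DBU-Net implementation: the local matrix multiplication $\circledast$ is approximated here by channel-wise softmax weighting of the values, and the group count and channel-reduction ratio are assumed values.

```python
import torch
import torch.nn as nn


class CoTBlock(nn.Module):
    """Simplified contextual Transformer (CoT) block sketch."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Static context K1: grouped 3x3 convolution over the keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Values V: 1x1 embedding of the input.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Attention matrix A: two consecutive 1x1 convolutions applied to [K1, Q].
        self.attention = nn.Sequential(
            nn.Conv2d(2 * dim, dim // 2, 1, bias=False),
            nn.BatchNorm2d(dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                          # static context K1
        v = self.value_embed(x)                         # values V
        a = self.attention(torch.cat([k1, x], dim=1))   # attention from [K1, Q], with Q = x
        k2 = torch.softmax(a, dim=1) * v                # dynamic context K2 (simplified aggregation)
        return k1 + k2                                  # fuse static and dynamic contexts


print(CoTBlock(64)(torch.randn(1, 64, 48, 48)).shape)   # torch.Size([1, 64, 48, 48])
```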
MFCA block
In the broad field of medical image segmentation, handling multi-scale features is highly complex. To comprehensively capture and integrate cross-scale feature information while minimizing redundancy, we designed the MFCA module. This module progressively fuses feature maps between the current and next scales to enhance the network’s multi-scale learning capability. The MFCA module offers two main benefits: first, it directly extracts feature information from neighboring scales and generates channel attention maps to precisely focus on key features. Second, at the output stage, the core features of the current scale are combined with the detailed features of the next scale through weighted feature mapping, ensuring that the MFCA module’s output is rich and efficient, with fully integrated multi-scale features.
The structure of the MFCA block is illustrated in Figure 3C. Each MFCA block receives two inputs: the feature map $F_i$ from the encoder at the current stage and the feature map $F_{i+1}$ from the next stage. First, $F_i$ and $F_{i+1}$ undergo global average pooling (GAP) for dimensionality reduction, enabling the model to learn more representative features, and are then concatenated along the channel dimension. A 1×1 convolutional layer is applied to extract feature information from the fused descriptor, generating a multi-scale channel attention map $A$:

$$A = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\mathrm{GAP}(F_i), \mathrm{GAP}(F_{i+1}))\big)$$

Here, $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes the 1×1 convolutional layer. At this stage, $A$ fuses the feature information from the two neighboring stages and completes the cross-scale information interaction. Next, $A$ is divided into two parts, $A_i$ and $A_{i+1}$, whose channel dimensions match those of $F_i$ and $F_{i+1}$, respectively. $F_i$ and $F_{i+1}$ are then multiplied by the corresponding $A_i$ and $A_{i+1}$ to obtain the weighted feature maps $\hat{F}_i$ and $\hat{F}_{i+1}$:

$$\hat{F}_i = F_i \otimes A_i, \qquad \hat{F}_{i+1} = F_{i+1} \otimes A_{i+1}$$

Here, $\otimes$ denotes element-wise multiplication. We then upsample $\hat{F}_{i+1}$ to the same spatial resolution as $\hat{F}_i$ using a transposed convolutional layer and merge the two through concatenation. The concatenated feature map contains high-level semantic information as well as low-level detail information. To further fuse these features and adjust their dimensions, we apply a 1×1 convolutional layer to the concatenated map and add the result to the input feature map $F_i$, yielding the output $F_i^{out}$ of the MFCA block:

$$F_i^{out} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\hat{F}_i, \mathrm{Deconv}(\hat{F}_{i+1}))\big) \oplus F_i$$

Here, $\oplus$ denotes element-wise addition and $\mathrm{Deconv}(\cdot)$ denotes the transposed convolutional layer.
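For illustration, a minimal PyTorch sketch of an MFCA-style block following the equations above is given below. The sigmoid gate on the attention map, the channel widths, and the transposed-convolution settings are assumptions for this sketch and may differ from the DBU-Net implementation.

```python
import torch
import torch.nn as nn


class MFCABlock(nn.Module):
    """Sketch of a multi-scale feature channel attention (MFCA) block."""

    def __init__(self, ch_i: int, ch_next: int):
        super().__init__()
        self.ch_i, self.ch_next = ch_i, ch_next
        self.gap = nn.AdaptiveAvgPool2d(1)                           # global average pooling
        self.attn = nn.Sequential(                                   # 1x1 conv on the fused descriptor
            nn.Conv2d(ch_i + ch_next, ch_i + ch_next, 1),
            nn.Sigmoid(),
        )
        self.up = nn.ConvTranspose2d(ch_next, ch_next, 2, stride=2)  # upsample next-stage features
        self.fuse = nn.Conv2d(ch_i + ch_next, ch_i, 1)               # fuse and adjust dimensions

    def forward(self, f_i: torch.Tensor, f_next: torch.Tensor) -> torch.Tensor:
        a = self.attn(torch.cat([self.gap(f_i), self.gap(f_next)], dim=1))  # attention map A
        a_i, a_next = torch.split(a, [self.ch_i, self.ch_next], dim=1)      # split into A_i, A_{i+1}
        f_i_w = f_i * a_i                         # weighted current-stage features
        f_next_w = self.up(f_next * a_next)       # weighted and upsampled next-stage features
        return self.fuse(torch.cat([f_i_w, f_next_w], dim=1)) + f_i         # residual output F_i^out


f_i, f_next = torch.randn(1, 32, 96, 96), torch.randn(1, 64, 48, 48)
print(MFCABlock(32, 64)(f_i, f_next).shape)       # torch.Size([1, 32, 96, 96])
```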
Results
Evaluation metrics
In this experiment, to fully assess the model’s performance in the segmentation task, we selected Dice coefficient, accuracy (Acc), Intersection over Union (IoU), Precision (Pre), and Recall (Rec) as the evaluation metrics. The Dice coefficient quantifies the similarity between the predicted and actual annotations, with values ranging from 0 to 1, where a higher value indicates greater similarity. Acc reflects the correctness of the model’s classification across all samples, specifically the proportion of correctly predicted pixels to the total number of pixels, providing an intuitive measure of overall performance. IoU measures the overlap between the predicted results and the actual labeled regions, reflecting segmentation accuracy. Pre describes the proportion of true positive samples among all positively predicted samples, while Rec measures the proportion of true positive samples that are correctly identified. Collectively, these metrics offer a comprehensive evaluation of the model’s segmentation performance.
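As a reference for how these metrics are computed from binary masks, a short sketch follows (a small epsilon guards against empty masks); this is illustrative and not the exact evaluation code used in the experiments.

```python
import numpy as np


def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Compute Dice, Acc, IoU, Pre, and Rec for binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # true positives
    fp = np.logical_and(pred, ~gt).sum()       # false positives
    fn = np.logical_and(~pred, gt).sum()       # false negatives
    tn = np.logical_and(~pred, ~gt).sum()      # true negatives
    return {
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Pre": tp / (tp + fp + eps),
        "Rec": tp / (tp + fn + eps),
    }
```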
Hyperparameters
In this study, we performed experiments on a server equipped with four 32 GB Tesla V100 graphics cards. Network training used 5-fold cross-validation with a stochastic gradient descent (SGD) optimizer, an initial learning rate of 0.02, and a linear learning rate decay strategy. To balance classification and segmentation accuracy during training, the loss function was a combination of cross-entropy loss and Dice loss. Additionally, the input image size was set to 384×384 pixels, and the batch size was 22.
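For illustration, a hedged sketch of a combined cross-entropy and soft-Dice loss is shown below; the equal weighting and smoothing term are assumptions, and the exact formulation used by nnUnet may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Sum of cross-entropy loss and (1 - mean soft Dice) over classes."""

    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) integer labels
        ce_loss = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        denom = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = (2 * inter + self.smooth) / (denom + self.smooth)
        return ce_loss + (1 - dice.mean())


loss = DiceCELoss()(torch.randn(2, 2, 384, 384), torch.randint(0, 2, (2, 384, 384)))
```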
Comparative analysis
To fully validate the effectiveness of the proposed DBU-Net, we compared its segmentation performance against several state-of-the-art segmentation networks on the test dataset. As shown in Table 1, DBU-Net achieves a Dice coefficient of 94.59%, which is 9.65% higher than Mask2former, 8.79% higher than PSPNet, 2.71% higher than Unet++, and 11.42% higher than Swin-Unet. Compared to nnUnet, DBU-Net shows improvements of 0.82%, 0.09%, 1.48%, and 1.73% in the core quantitative metrics Dice, Acc, IoU, and Pre, respectively. These results indicate that DBU-Net meets high-precision segmentation requirements and confirm its effectiveness as an improved model. The prediction results of these segmentation networks are shown in Figure 4. Compared to the other networks, DBU-Net accurately identifies and segments the skeletal structure, including subtle bone branches and complex joint connections, providing clear and precise segmentation results. Additionally, DBU-Net's superior edge processing enhances bone segmentation accuracy and ensures smooth, continuous edges, significantly reducing mis-segmentation and missed segmentation. The detailed data in Table 1 show that DBU-Net achieves the best results in most key metrics and outperforms the other methods on many test images, demonstrating its robust segmentation capability.
Table 1
Method | Year | Dice (%) | Accuracy (%) | Intersection over Union (%) | Precision (%) | Recall (%) |
---|---|---|---|---|---|---|
PSPNet (21) | 2017 | 85.80 | 96.89 | 75.30 | 99.32 | 75.74 |
Unet++ (4) | 2018 | 91.88 | 98.19 | 85.18 | 97.56 | 87.32 |
Resunet (7) | 2018 | 90.39 | 97.91 | 82.48 | 98.61 | 83.46 |
Deeplabv3+ (22) | 2018 | 87.08 | 97.24 | 77.31 | 99.11 | 77.89 |
DANet (23) | 2019 | 86.80 | 97.16 | 76.87 | 99.34 | 77.30 |
TransUNet (10) | 2021 | 79.86 | 96.40 | 66.58 | 71.55 | 90.90 |
SegFormer (11) | 2021 | 85.35 | 96.73 | 74.59 | 99.62 | 74.81 |
Swin-Unet (8) | 2022 | 83.17 | 96.85 | 71.26 | 77.62 | 89.83 |
Mask2former (13) | 2022 | 84.94 | 96.61 | 73.93 | 99.68† | 74.11 |
nnUnet (16) | 2018 | 93.77 | 98.79 | 88.38 | 93.27 | 94.40† |
Our | 2024 | 94.59† | 98.88† | 89.86† | 95.00 | 94.26 |
†, the highest performance value in the corresponding metric column.
Ablation experiments
We also conducted an ablation study on the DBU-Net model to evaluate the effectiveness of both the CoT module and the MFCA block. As shown in the second row of data in Table 2, integrating the CoT module into nnUnet results in significant improvements in both Dice and IoU metrics, with increases of 0.65% and 1.31%, respectively. This demonstrates the effectiveness of the CoT module in capturing both global and local feature information. As indicated in the third row of Table 2, the introduction of the MFCA module leads to further improvements in Dice and IoU metrics, with increases of 0.17% and 0.17%, respectively. This underscores the enhanced multi-scale learning capability of the MFCA, which facilitates more accurate segmentation of complex and detailed features. Figure 5 shows that the segmentation of bone edges is more precise and complete compared to nnUnet. The experimental results indicate that the enhanced nnUnet model achieves improvements of 0.82% and 1.48% in Dice and IoU, respectively, thereby validating the effectiveness of the proposed method.
Table 2
Backbone | CoT | MFCA | Dice (%) | Accuracy (%) | Intersection over Union (%) |
---|---|---|---|---|---|
nnUnet (16) | × | × | 93.77 | 98.79 | 88.38 |
nnUnet (16) | √ | × | 94.42 | 98.81 | 89.69 |
nnUnet (16) | √ | √ | 94.59† | 98.88† | 89.86† |
†, the highest performance value in the corresponding metric column. CoT, contextual Transformer; MFCA, multi-scale feature channel attention.

Experiment of generalization ability
In medical image segmentation tasks, evaluating the model’s generalization ability is crucial. In this study, we used additional CT images from CTSpine1K as an external test set to comprehensively assess the model’s adaptability and segmentation performance on diverse image data. To ensure consistency across model inputs, uniform preprocessing steps were applied to all test data. Figure 6 and Table 3 show that our method exhibits higher accuracy than nnUnet, effectively capturing the core of the target region while processing edge details with precision, resulting in clear, continuous, and complete edge segmentation. This outcome not only validates the model’s robust learning and generalization capabilities but also provides a solid foundation for its application in broader medical image analysis tasks.

Table 3
Method | Dice (%) | Accuracy (%) | Intersection over Union (%) | Precision (%) | Recall (%) |
---|---|---|---|---|---|
nnUnet (16) | 98.00 | 99.76† | 96.08 | 98.00† | 98.01 |
Our | 98.02† | 99.76† | 96.12† | 97.87 | 98.17† |
†, the highest performance value in the corresponding metric column.
Notably, during the experiment, we observed that the performance metrics of the external test set were significantly higher than those of the VerSe test set. Upon further analysis, we found that this phenomenon is primarily due to differences in the proportion of spine structures in the images and the completeness of label annotations. Specifically, the VerSe test set contains a relatively larger proportion of spinal structures, but its labeling is incomplete (e.g., some vertebrae are segmented but not labeled), resulting in lower Dice and other metrics. In contrast, the external test set contains a smaller proportion of spinal structures but has more complete labeling, leading to relatively higher performance metrics.
Discussion
Accurate segmentation of bones from CT images is a challenging task due to the diversity of skeletal structures. As shown in the comparative results in Figure 4, PSPNet and Deeplabv3+ employ pyramid pooling modules to obtain multi-scale contextual information, thereby enhancing the completeness of the segmented structures. However, detailed boundary information may be lost in this process, resulting in less clear and accurate boundaries in the segmentation results (see the 2nd and 3rd panels of Figure 4). Unet++ and Resunet segment the target structures more accurately by introducing new feature fusion schemes into Unet, which are more effective at extracting key features from images with complex textures, shapes, and boundaries. However, because boundary features are often weakened or over-smoothed during feature fusion, the boundaries between bones may lack clarity. nnUnet partially mitigates these issues, and avoids overfitting to specific datasets, by optimizing the pre-processing, training, and post-processing procedures, achieving good segmentation results; however, owing to Unet’s limited ability to integrate contextual information, structural omissions occur in some images. Swin-Unet, TransUNet, SegFormer, and Mask2former excel at capturing global features and contextual information, owing to their Transformer-based architectures; for instance, Swin-Unet adopts a Unet-like structure built entirely from Swin Transformer blocks. Nevertheless, these networks primarily focus on global feature learning, making it challenging for their attention mechanisms to fully capture small targets and leading to issues such as blurred and unclear skeletal boundaries. Additionally, several studies based on nnUnet have extended its capabilities by incorporating Transformer architectures: Guo et al. (24) for brain tumor segmentation, Qian et al. (25) for kidney tumor segmentation, Xie et al. (26) for brain tumor segmentation, and Zhang et al. (27) for corneal cell segmentation, all aiming to improve the segmentation accuracy of specific tissues or organs in medical images. However, because of the spine’s unique anatomy (its elongated shape, multiple segments, and complex connections with surrounding tissues) and because our study focuses on accurately delineating the boundaries of individual vertebrae and intervertebral discs, these methods may not be directly applicable. Notably, Transformer-based architectures typically require substantial computational resources, which may limit their clinical applicability. The method proposed in this study combines nnUnet with a convolutional module that emulates the self-attention mechanism of Transformers, balancing global and local feature extraction while adding only a small number of parameters. This approach not only preserves the overall structure effectively but also significantly improves the accuracy of bone boundary segmentation.
Conclusions
In this study, we propose DBU-Net, a deep learning segmentation network built upon nnUnet and the CoT module. The network integrates the nnUnet encoder with an optimized CoT module, significantly enhancing its feature representation capability. We also designed the MFCA module to comprehensively capture and efficiently integrate multi-scale feature information in medical images. We conducted a comprehensive evaluation on the VerSe 2019 and 2020 benchmark datasets from the Medical Image Computing and Computer Assisted Intervention (MICCAI) challenges, using metrics including the Dice coefficient, Acc, IoU, Pre, and Rec. The results demonstrate that DBU-Net achieves superior performance, validating its ability to segment spinal structures efficiently and accurately, and it exhibited greater accuracy in segmenting textures and boundaries than the other networks. Additionally, we performed ablation experiments to provide quantitative results and validate the effectiveness of the proposed modules (28). In future work, we will continue to explore more advanced attention mechanisms to further enhance the model’s performance in handling complex backgrounds and fine features, thereby promoting its clinical application.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/rc
Funding: This work was partially supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study utilized two publicly available datasets. The datasets were made accessible to the public in legal compliance, and the data were de-identified before release, containing no individually identifiable information. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of Shengjing Hospital, China Medical University (No. 2024PS1106K). The requirement of informed consent was waived due to the retrospective nature of the study.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018;392:1789-858. [Crossref] [PubMed]
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing; 2015:234-41.
- Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning where to look for the pancreas. arXiv:1804.03999, 2018.
- Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: A nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing; 2018:3-11.
- Huang H, Lin L, Tong R, Hu H, Zhang Q, Lwamoto Y, Han X, Chen Y, Wu J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2020:1055-59.
- Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK. Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation. arXiv:1802.06955, 2018.
- Zhang Z, Liu Q, Wang Y. Road Extraction by Deep Residual U-Net. IEEE Geosci Remote Sens Lett 2018;15:749-53.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. European conference on computer vision. Cham: Springer; 2022:205-18.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision. 2021:10012-22.
- Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv:2102.04306, 2021.
- Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 2021;34:12077-90.
- Huang X, Deng Z, Li D, Yuan X. MISSFormer: An effective medical image segmentation transformer. arXiv:2109.07162, 2021.
- Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:1280-9.
- Cheng B, Schwing AG, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021;34:17864-75.
- Li Y, Yao T, Pan Y, Mei T. Contextual Transformer Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell 2023;45:1489-500. [Crossref] [PubMed]
- Isensee F, Petersen J, Klein A, Zimmerer D, Jaeger PF, Kohl S, Wasserthal J, Koehler G, Norajitra T, Wirkert S, Maier-Hein KH. nnU-Net: self-adapting framework for U-Net-based medical image segmentation. arXiv:1809.10486, 2018.
- Sekuboyina A, Husseini ME, Bayat A, Löffler M, Liebl H, Li H, et al. VerSe: A Vertebrae labelling and segmentation benchmark for multi-detector CT images. Med Image Anal 2021;73:102166. [Crossref] [PubMed]
- Löffler MT, Sekuboyina A, Jacob A, Grau AL, Scharr A, El Husseini M, Kallweit M, Zimmer C, Baum T, Kirschke JS. A Vertebral Segmentation Dataset with Fracture Grading. Radiol Artif Intell 2020;2:e190138. [Crossref] [PubMed]
- Kawathekar ID, Areeckal AS, Aparna V. A Novel Deep Learning Pipeline for Vertebra Labeling and Segmentation of Spinal Computed Tomography Images. IEEE Access 2024;12:15330-47.
- Deng Y, Wang C, Hui Y, Li Q, Li J, Luo S, Sun M, Quan Q, Yang S, Hao Y, Liu P, Xiao H, Zhao C, Wu X, Zhou SK. CTSpine1K: a large-scale dataset for spinal vertebrae segmentation in computed tomography. arXiv:2105.14711, 2021.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA. 2017:6230-9.
- Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European conference on computer vision (ECCV). 2018:801-18.
- Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019:3146-54.
- Guo S, Chen Q, Wang L, Wang L, Zhu Y. nnUnetFormer: an automatic method based on nnUnet and transformer for brain tumor segmentation with multimodal MR images. Phys Med Biol 2023;
- Qian L, Luo L, Zhong Y, Zhong D. A Hybrid Network Based on nnU-Net and Swin Transformer for Kidney Tumor Segmentation. International Challenge on Kidney and Kidney Tumor Segmentation. Cham: Springer Nature Switzerland; 2024:30-9.
- Xie Y, Zhou C, Mei J, Li X, Xie C, Zhou Y. Brain Tumor Segmentation Through Supervoxel Transformer. 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece; 2024:1-5.
- Zhang D, Zhang J, Li S, Dong Z, Zheng Q, Zhang J. U-NTCA: nnUNet and nested transformer with channel attention for corneal cell segmentation. Front Neurosci 2024;18:1363288. [Crossref] [PubMed]
- Zhou Z, He A, Wu Y, Yao R, Xie X, Li T. Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation. arXiv:2406.07952, 2024.