A novel dual-branch segmentation algorithm for overall spine segmentation
Introduction
Low back pain has emerged as a significant global public health challenge (1). When physical therapy or medication fails to effectively alleviate symptoms, surgery often becomes a necessary intervention. Accurate spine segmentation during surgery assists surgeons in identifying and localizing target vertebrae more clearly, thereby avoiding serious risks associated with misjudgment or improper procedures. However, traditional methods require the manual labeling of vertebrae individually, a labor-intensive and time-consuming process. Therefore, the development of a precise and efficient fully automated segmentation system for the identification and segmentation of spinal structures in computed tomography (CT) images has become a critical concern in the field of spine surgery.
Currently, medical image segmentation methods utilizing encoder-decoder architectures, exemplified by Unet (2), are widely employed and exhibit outstanding performance. For instance, Attunet (3) significantly enhances segmentation accuracy through the use of attention gates. Unet++ (4) employs nested and dense skip connections to mitigate the semantic gap between levels, thereby improving segmentation outcomes. Unet 3+ (5) achieves comprehensive feature extraction by fusing low-level details and high-level semantic features at multiple scales, which further enhances segmentation performance. Additionally, R2Unet (6) not only improves the accuracy of medical image segmentation but also optimizes processing efficiency by incorporating residual blocks and recurrent mechanisms. Moreover, Resunet (7) combines residual units with the U-shaped architecture and demonstrated excellent segmentation performance, originally for road extraction from remote-sensing images. While U-shaped networks have shown outstanding performance in numerous medical image segmentation tasks, they still face inherent limitations: the architecture excels at capturing local features but is deficient in handling long-range dependencies and integrating global contextual information, which limits its effectiveness in segmenting highly complex and variable medical images.
In recent years, the Transformer model has garnered significant attention due to its multi-head self-attention (MSA) mechanism, which effectively models long-range dependencies. Inspired by this, researchers have begun exploring the application of Transformers in medical image segmentation to compensate for the shortcomings of traditional methods, such as Unet, in processing global information. For example, Cao et al. proposed Swin-Unet (8), which successfully integrated the Swin Transformer (9) into a U-shaped segmentation network, significantly enhancing the accuracy of multi-organ segmentation. Chen et al. combined the strengths of Transformers and convolutional neural networks (CNNs) to propose TransUNet (10), aiming to further improve segmentation performance. Xie et al. designed a pure Transformer architecture, SegFormer (11), which demonstrated efficient and powerful semantic segmentation through the combination of a hierarchical encoder and a lightweight decoder. Huang et al. subsequently introduced MISSFormer (12), achieving more accurate cardiac segmentation by leveraging global information at different scales. Building on Maskformer (14), Mask2former (13) incorporates multi-scale high-resolution features and a masked attention mechanism, effectively reducing background interference and significantly improving segmentation performance. However, while traditional Transformers excel at global modeling, they often overlook the rich contextual information between neighboring keys when calculating the attention matrix. As a result, they learn primarily from isolated query-key pairs, limiting their ability to capture complex feature relationships within an image and potentially leading to suboptimal performance in medical image segmentation tasks.
To address the aforementioned challenges, this paper proposes a segmentation network, DBU-Net, specifically designed for spine CT images. The network integrates the contextual Transformer (CoT) (15) and multi-scale feature channel attention (MFCA) modules into the nnUnet (16) framework, achieving a seamless integration of global context and cross-scale feature interaction. This enables the network to capture and analyze complex features in spine CT images with greater detail and comprehensiveness, ultimately leading to accurate and efficient automatic segmentation of spine CT images. We present this article in accordance with the TRIPOD + AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/rc).
Methods
Datasets
This research utilized portions of the Vertebrae Segmentation (VerSe) 2019 and 2020 datasets, which are part of the vertebral labelling and segmentation benchmarking challenge (17-19). The dataset comprises data from 278 patients acquired across various centers and scanner vendors, including Siemens, General Electric, Philips, and Toshiba. The minimum patient age was 18 years, and the images included 7 fully visible vertebrae (excluding the sacrum and transitional vertebrae) with a minimum voxel spacing of 1.5 mm (craniocaudal), 1 mm (anterior-posterior), and 3 mm (left-right). Additionally, the CTSpine1K (20) dataset was used as an external test set to evaluate generalization. CTSpine1K is sourced from multiple medical centers and includes images acquired with different devices, making the data diverse and representative of various real-world clinical scenarios. Figure 1 shows example images from the VerSe 2019 dataset; as illustrated, the dataset includes CT images with different views and sizes. The datasets analyzed in this study are publicly available at https://github.com/anjany/verse and https://github.com/MIRACLE-Center/CTSpine1K.
This study utilized two publicly available datasets. The datasets were made accessible to the public in legal compliance, and the data were de-identified before release, containing no individually identifiable information. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of Shengjing Hospital, China Medical University (No. 2024PS1106K). The requirement of informed consent was waived due to the retrospective nature of the study.
Image preprocessing and data augmentation
In this study, we resampled the voxels to an isotropic 1.0×1.0×1.0 mm³ and resized the images to ensure homogeneity across all axes. Subsequently, we normalized the images and binarized the masks. To improve the generalizability of the system, we applied data augmentation during training, including random rotations of ±30 degrees, scaling, elastic transformations, additive noise, gamma correction, and flipping. We divided the 278 patients into two groups: 238 patients for training and validation (with 20% of the images randomly selected for validation) and 40 patients for testing, ensuring that the test set remained completely unseen during training.
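For reference, the following is a minimal sketch of the resampling, normalization, and a subset of the augmentations described above (rotation, flipping, and gamma correction); it is illustrative only, does not reproduce the exact nnUnet augmentation pipeline, and omits the scaling, elastic deformation, and additive-noise steps for brevity. Function names and parameter ranges here are assumptions, not the authors' code.

```python
import numpy as np
from scipy import ndimage


def resample_to_isotropic(volume: np.ndarray, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a CT volume to 1.0x1.0x1.0 mm voxels with trilinear interpolation."""
    zoom_factors = [s / ns for s, ns in zip(spacing, new_spacing)]
    return ndimage.zoom(volume, zoom_factors, order=1)


def normalize(volume: np.ndarray):
    """Z-score intensity normalization."""
    return (volume - volume.mean()) / (volume.std() + 1e-8)


def augment(volume: np.ndarray, mask: np.ndarray, rng=None):
    """Random in-plane rotation within +/-30 degrees, random flip, and gamma correction."""
    rng = rng or np.random.default_rng()
    angle = rng.uniform(-30, 30)
    volume = ndimage.rotate(volume, angle, axes=(1, 2), reshape=False, order=1)
    mask = ndimage.rotate(mask, angle, axes=(1, 2), reshape=False, order=0)
    if rng.random() < 0.5:  # flip along one in-plane axis, keeping image and mask aligned
        volume, mask = np.flip(volume, axis=2), np.flip(mask, axis=2)
    gamma = rng.uniform(0.7, 1.5)  # gamma correction applied on a [0, 1]-rescaled copy
    v01 = (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)
    return v01 ** gamma, mask
```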
nnUnet
In medical image segmentation, researchers frequently develop specialized algorithms for specific tasks and issues. However, this approach may lead to models with poor generalization and robustness. The authors of nnUnet argue that tuning the network structure can lead to overfitting on specific datasets, and that aspects unrelated to the network structure often have a greater impact on segmentation performance; they therefore emphasize the optimization of pre-processing, training, and post-processing procedures, primarily focusing on image processing. nnUnet systematizes these configuration choices into a set of fixed, rule-based, and empirically determined parameters (e.g., training batch size, patch size, and number of downsampling steps), demonstrating the applicability of such strategies across a wide range of medical image segmentation tasks.
The architecture of nnUnet mirrors that of Unet, adhering to the encoder-decoder design paradigm and comprising a series of convolutional blocks. The Unet network structure is illustrated in Figure 2: skip connections are employed between the encoder and decoder, and efficient feature mapping between internal blocks is achieved by concatenating the generated features as supplementary information. nnUnet replaces the ReLU activation function in Unet with LeakyReLU and, to circumvent the constraints of small batch sizes, employs instance normalization instead of the more commonly used batch normalization. To enhance training stability and accelerate convergence on the image foreground, patch sampling ensures that more than one-third of the samples in each batch contain foreground voxels. These enhancements significantly improve the stability and adaptability of the nnUnet training process, addressing training instability caused by variations in imaging methods, image sizes, and voxel spacing, and enabling its widespread use across various scenarios.
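As an illustration of the Conv → InstanceNorm → LeakyReLU ordering described above, a minimal PyTorch sketch of an nnUnet-style convolutional block follows; the channel widths and the LeakyReLU slope are example values, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """nnUnet-style double convolution block: Conv -> InstanceNorm -> LeakyReLU, twice."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch, affine=True),   # instance norm instead of batch norm
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch, affine=True),
            nn.LeakyReLU(negative_slope=0.01, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


x = torch.randn(2, 1, 384, 384)        # (batch, channel, H, W) CT slice patch
print(ConvBlock(1, 32)(x).shape)       # torch.Size([2, 32, 384, 384])
```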
DBU-Net structure
Given nnUnet’s outstanding performance in data processing efficiency, training stability, and broad adaptability, we chose it as the foundational backbone network and further enhanced it to improve inter-pixel information interaction efficiency and the precise characterization of contextual information. Figure 3A illustrates the overall structure of our proposed DBU-Net network.

First, to enhance the network’s capacity for integrating contextual information, we introduce an enhanced CoT module in the decoder branch. As shown in Figure 3B, this module first extracts static contextual information from keys and then performs a self-attention operation using two consecutive 1×1 convolutional layers to generate dynamic context. Ultimately, a more comprehensive output is achieved by seamlessly fusing static and dynamic contexts. By integrating these two types of contextual information, the model leverages both global and local features, thereby enhancing feature representation richness and improving segmentation accuracy.
Additionally, to achieve comprehensive capture and efficient integration of multi-scale feature information in medical images, we designed the MFCA module. This module automatically identifies and enhances important feature channels through cross-scale feature interaction and an attention mechanism, while reducing redundant interference information. This innovative design allows the network to efficiently extract semantically rich and detailed feature representations from complex medical images, thereby providing more comprehensive feature information for subsequent decoding tasks.
Contextual transformer block
To address the issue that traditional Transformers overlook the rich contextual information between neighboring keys when computing the attention matrix, the CoT module applies context encoding to the input keys to fully exploit this information, thereby enhancing the model's ability to capture and represent image features. Specifically, for a given two-dimensional (2D) feature map $X \in \mathbb{R}^{H \times W \times C}$ (where H denotes height, W denotes width, and C denotes the number of channels), the keys, queries, and values are defined as $K = X$, $Q = X$, and $V = XW_v$, respectively. The static contextual information between locally neighboring keys is first represented spatially by applying a grouped convolution over all neighboring keys within a 3×3 grid, yielding the static context key $K_1$. Subsequently, $K_1$ is concatenated with the query $Q$, and the attention matrix $A$ is obtained through two consecutive 1×1 convolutions (with learnable weights $W_\theta$ and $W_\delta$) followed by normalization:

$$A = \mathrm{Softmax}\big([K_1, Q]\, W_\theta W_\delta\big)$$

At this stage, each local attention matrix no longer depends solely on isolated query-key pairs but is instead based on a richer, context-informed representation. Next, the values $V$ are aggregated with the attention matrix, as in a typical self-attention mechanism, to obtain the dynamic context $K_2$:

$$K_2 = V \circledast A$$

Here, $\circledast$ denotes a local matrix multiplication operation. Since each element of $K_2$ is derived from the values $V$ and their dynamic association with the query $Q$ and the static context $K_1$, $K_2$ is used as the dynamic context representation of the input. Finally, the static context $K_1$ and the dynamic context $K_2$ are fused to produce the final output. Additionally, a 3×3 convolutional layer is applied to the fused feature map for local feature extraction and integration; its weighted summation over the receptive field enhances the spatial dependencies between features, enabling the model to learn more complex and abstract feature representations.
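For illustration, the following is a minimal PyTorch sketch of a CoT-style block consistent with the description above. It is a simplified reading rather than the exact DBU-Net implementation: the local matrix multiplication $\circledast$ is approximated here by channel-wise softmax weighting of the values, and the group count and channel-reduction ratio are assumed values.

```python
import torch
import torch.nn as nn


class CoTBlock(nn.Module):
    """Simplified contextual Transformer (CoT) block sketch."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Static context K1: grouped 3x3 convolution over the keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=4, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Values V: 1x1 embedding of the input.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        # Attention matrix A: two consecutive 1x1 convolutions applied to [K1, Q].
        self.attention = nn.Sequential(
            nn.Conv2d(2 * dim, dim // 2, 1, bias=False),
            nn.BatchNorm2d(dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                          # static context K1
        v = self.value_embed(x)                         # values V
        a = self.attention(torch.cat([k1, x], dim=1))   # attention from [K1, Q], with Q = x
        k2 = torch.softmax(a, dim=1) * v                # dynamic context K2 (simplified aggregation)
        return k1 + k2                                  # fuse static and dynamic contexts


print(CoTBlock(64)(torch.randn(1, 64, 48, 48)).shape)   # torch.Size([1, 64, 48, 48])
```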
MFCA block
In the broad field of medical image segmentation, handling multi-scale features is highly complex. To comprehensively capture and integrate cross-scale feature information while minimizing redundancy, we designed the MFCA module. This module progressively fuses feature maps between the current and next scales to enhance the network’s multi-scale learning capability. The MFCA module offers two main benefits: first, it directly extracts feature information from neighboring scales and generates channel attention maps to precisely focus on key features. Second, at the output stage, the core features of the current scale are combined with the detailed features of the next scale through weighted feature mapping, ensuring that the MFCA module’s output is rich and efficient, with fully integrated multi-scale features.
The structure of the MFCA block is illustrated in Figure 3C. Each MFCA block receives two inputs: the feature map $F_i$ from the encoder at the current stage and the feature map $F_{i+1}$ from the next stage. First, $F_i$ and $F_{i+1}$ undergo global average pooling (GAP) for dimensionality reduction, enabling the model to learn more representative features, and are then concatenated along the channel dimension. A 1×1 convolutional layer is applied to extract feature information from the fused descriptor, generating a multi-scale channel attention map $A$:

$$A = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\mathrm{GAP}(F_i), \mathrm{GAP}(F_{i+1}))\big)$$

Here, $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes the 1×1 convolutional layer. At this stage, $A$ fuses the feature information from the two neighboring stages and completes the cross-scale information interaction. Next, $A$ is divided into two parts, $A_i$ and $A_{i+1}$, whose channel dimensions match those of $F_i$ and $F_{i+1}$, respectively. $F_i$ and $F_{i+1}$ are then multiplied by the corresponding $A_i$ and $A_{i+1}$ to obtain the weighted feature maps $\hat{F}_i$ and $\hat{F}_{i+1}$:

$$\hat{F}_i = F_i \otimes A_i, \qquad \hat{F}_{i+1} = F_{i+1} \otimes A_{i+1}$$

Here, $\otimes$ denotes element-wise multiplication. We then upsample $\hat{F}_{i+1}$ to the same spatial resolution as $\hat{F}_i$ using a transposed convolutional layer and merge the two through concatenation. The concatenated feature map contains high-level semantic information as well as low-level detail information. To further fuse these features and adjust their dimensions, we apply a 1×1 convolutional layer to the concatenated map and add the result to the input feature map $F_i$, yielding the output $F_i^{out}$ of the MFCA block:

$$F_i^{out} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(\hat{F}_i, \mathrm{Deconv}(\hat{F}_{i+1}))\big) \oplus F_i$$

Here, $\oplus$ denotes element-wise addition and $\mathrm{Deconv}(\cdot)$ denotes the transposed convolutional layer.
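For illustration, a minimal PyTorch sketch of an MFCA-style block following the equations above is given below. The sigmoid gate on the attention map, the channel widths, and the transposed-convolution settings are assumptions for this sketch and may differ from the DBU-Net implementation.

```python
import torch
import torch.nn as nn


class MFCABlock(nn.Module):
    """Sketch of a multi-scale feature channel attention (MFCA) block."""

    def __init__(self, ch_i: int, ch_next: int):
        super().__init__()
        self.ch_i, self.ch_next = ch_i, ch_next
        self.gap = nn.AdaptiveAvgPool2d(1)                           # global average pooling
        self.attn = nn.Sequential(                                   # 1x1 conv on the fused descriptor
            nn.Conv2d(ch_i + ch_next, ch_i + ch_next, 1),
            nn.Sigmoid(),
        )
        self.up = nn.ConvTranspose2d(ch_next, ch_next, 2, stride=2)  # upsample next-stage features
        self.fuse = nn.Conv2d(ch_i + ch_next, ch_i, 1)               # fuse and adjust dimensions

    def forward(self, f_i: torch.Tensor, f_next: torch.Tensor) -> torch.Tensor:
        a = self.attn(torch.cat([self.gap(f_i), self.gap(f_next)], dim=1))  # attention map A
        a_i, a_next = torch.split(a, [self.ch_i, self.ch_next], dim=1)      # split into A_i, A_{i+1}
        f_i_w = f_i * a_i                         # weighted current-stage features
        f_next_w = self.up(f_next * a_next)       # weighted and upsampled next-stage features
        return self.fuse(torch.cat([f_i_w, f_next_w], dim=1)) + f_i         # residual output F_i^out


f_i, f_next = torch.randn(1, 32, 96, 96), torch.randn(1, 64, 48, 48)
print(MFCABlock(32, 64)(f_i, f_next).shape)       # torch.Size([1, 32, 96, 96])
```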
Results
Evaluation metrics
In this experiment, to fully assess the model’s performance in the segmentation task, we selected Dice coefficient, accuracy (Acc), Intersection over Union (IoU), Precision (Pre), and Recall (Rec) as the evaluation metrics. The Dice coefficient quantifies the similarity between the predicted and actual annotations, with values ranging from 0 to 1, where a higher value indicates greater similarity. Acc reflects the correctness of the model’s classification across all samples, specifically the proportion of correctly predicted pixels to the total number of pixels, providing an intuitive measure of overall performance. IoU measures the overlap between the predicted results and the actual labeled regions, reflecting segmentation accuracy. Pre describes the proportion of true positive samples among all positively predicted samples, while Rec measures the proportion of true positive samples that are correctly identified. Collectively, these metrics offer a comprehensive evaluation of the model’s segmentation performance.
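As a reference for how these metrics are computed from binary masks, a short sketch follows (a small epsilon guards against empty masks); this is illustrative and not the exact evaluation code used in the experiments.

```python
import numpy as np


def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Compute Dice, Acc, IoU, Pre, and Rec for binary prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # true positives
    fp = np.logical_and(pred, ~gt).sum()       # false positives
    fn = np.logical_and(~pred, gt).sum()       # false negatives
    tn = np.logical_and(~pred, ~gt).sum()      # true negatives
    return {
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Pre": tp / (tp + fp + eps),
        "Rec": tp / (tp + fn + eps),
    }
```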
Hyperparameters
In this study, we performed experiments on a server equipped with four 32 GB Tesla V100 graphics cards. Network training used 5-fold cross-validation with a stochastic gradient descent (SGD) optimizer, an initial learning rate of 0.02, and a linear learning rate decay strategy. To balance classification and segmentation accuracy during training, the loss function was a combination of cross-entropy loss and Dice loss. Additionally, the input image size was set to 384×384 pixels, and the batch size was 22.
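For illustration, a hedged sketch of a combined cross-entropy and soft-Dice loss is shown below; the equal weighting and smoothing term are assumptions, and the exact formulation used by nnUnet may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiceCELoss(nn.Module):
    """Sum of cross-entropy loss and (1 - mean soft Dice) over classes."""

    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) integer labels
        ce_loss = self.ce(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        denom = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = (2 * inter + self.smooth) / (denom + self.smooth)
        return ce_loss + (1 - dice.mean())


loss = DiceCELoss()(torch.randn(2, 2, 384, 384), torch.randint(0, 2, (2, 384, 384)))
```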
Comparative analysis
To fully validate the effectiveness of the proposed DBU-Net, we compared its segmentation performance against several state-of-the-art segmentation networks on the test dataset. As shown in Table 1, DBU-Net achieves a Dice coefficient of 94.59%, which is 9.65% higher than Mask2former, 8.79% higher than PSPNet, 2.71% higher than Unet++, and 11.42% higher than Swin-Unet. Compared to nnUnet, DBU-Net shows improvements of 0.82%, 0.09%, 1.48%, and 1.73% in the core quantitative metrics Dice, Acc, IoU, and Pre, respectively. These results indicate that DBU-Net meets high-precision segmentation requirements and confirm its effectiveness as an improved model. The prediction results of these segmentation networks are shown in Figure 4. Compared to the other networks, DBU-Net accurately identifies and segments the skeletal structure, including subtle bone branches and complex joint connections, providing clear and precise segmentation results. Additionally, DBU-Net's superior edge processing enhances bone segmentation accuracy and ensures smooth, continuous edges, significantly reducing mis-segmentation and missed segmentation. The detailed data in Table 1 show that DBU-Net achieves the best results in most key metrics and outperforms the other methods on many test images, demonstrating its robust segmentation capability.
Table 1
Method | Year | Dice (%) | Accuracy (%) | Intersection over Union (%) | Precision (%) | Recall (%) |
---|---|---|---|---|---|---|
PSPNet (21) | 2017 | 85.80 | 96.89 | 75.30 | 99.32 | 75.74 |
Unet++ (4) | 2018 | 91.88 | 98.19 | 85.18 | 97.56 | 87.32 |
Resunet (7) | 2018 | 90.39 | 97.91 | 82.48 | 98.61 | 83.46 |
Deeplabv3+ (22) | 2018 | 87.08 | 97.24 | 77.31 | 99.11 | 77.89 |
DANet (23) | 2019 | 86.80 | 97.16 | 76.87 | 99.34 | 77.30 |
TransUNet (10) | 2021 | 79.86 | 96.40 | 66.58 | 71.55 | 90.90 |
SegFormer (11) | 2021 | 85.35 | 96.73 | 74.59 | 99.62 | 74.81 |
Swin-Unet (8) | 2022 | 83.17 | 96.85 | 71.26 | 77.62 | 89.83 |
Mask2former (13) | 2022 | 84.94 | 96.61 | 73.93 | 99.68† | 74.11 |
nnUnet (16) | 2018 | 93.77 | 98.79 | 88.38 | 93.27 | 94.40† |
Our | 2024 | 94.59† | 98.88† | 89.86† | 95.00 | 94.26 |
†, the highest performance value in the corresponding metric column.
Ablation experiments
We also conducted an ablation study on the DBU-Net model to evaluate the effectiveness of both the CoT module and the MFCA block. As shown in the second row of data in Table 2, integrating the CoT module into nnUnet results in significant improvements in both Dice and IoU metrics, with increases of 0.65% and 1.31%, respectively. This demonstrates the effectiveness of the CoT module in capturing both global and local feature information. As indicated in the third row of Table 2, the introduction of the MFCA module leads to further improvements in Dice and IoU metrics, with increases of 0.17% and 0.17%, respectively. This underscores the enhanced multi-scale learning capability of the MFCA, which facilitates more accurate segmentation of complex and detailed features. Figure 5 shows that the segmentation of bone edges is more precise and complete compared to nnUnet. The experimental results indicate that the enhanced nnUnet model achieves improvements of 0.82% and 1.48% in Dice and IoU, respectively, thereby validating the effectiveness of the proposed method.
Table 2
Backbone | CoT | MFCA | Dice (%) | Accuracy (%) | Intersection over Union (%) |
---|---|---|---|---|---|
nnUnet (16) | × | × | 93.77 | 98.79 | 88.38 |
nnUnet (16) | √ | × | 94.42 | 98.81 | 89.69 |
nnUnet (16) | √ | √ | 94.59† | 98.88† | 89.86† |
†, the highest performance value in the corresponding metric column. CoT, contextual Transformer; MFCA, multi-scale feature channel attention.

Experiment of generalization ability
In medical image segmentation tasks, evaluating the model’s generalization ability is crucial. In this study, we used additional CT images from CTSpine1K as an external test set to comprehensively assess the model’s adaptability and segmentation performance on diverse image data. To ensure consistency across model inputs, uniform preprocessing steps were applied to all test data. Figure 6 and Table 3 show that our method exhibits higher accuracy than nnUnet, effectively capturing the core of the target region while processing edge details with precision, resulting in clear, continuous, and complete edge segmentation. This outcome not only validates the model’s robust learning and generalization capabilities but also provides a solid foundation for its application in broader medical image analysis tasks.

Table 3
Method | Dice (%) | Accuracy (%) | Intersection over Union (%) | Precision (%) | Recall (%) |
---|---|---|---|---|---|
nnUnet (16) | 98.00 | 99.76† | 96.08 | 98.00† | 98.01 |
Our | 98.02† | 99.76† | 96.12† | 97.87 | 98.17† |
†, the highest performance value in the corresponding metric column.
Notably, during the experiment, we observed that the performance metrics of the external test set were significantly higher than those of the VerSe test set. Upon further analysis, we found that this phenomenon is primarily due to differences in the proportion of spine structures in the images and the completeness of label annotations. Specifically, the VerSe test set contains a relatively larger proportion of spinal structures, but its labeling is incomplete (e.g., some vertebrae are segmented but not labeled), resulting in lower Dice and other metrics. In contrast, the external test set contains a smaller proportion of spinal structures but has more complete labeling, leading to relatively higher performance metrics.
Discussion
Accurate segmentation of bones from CT images is a challenging task due to the diversity of skeletal structures. As shown in the comparative results in Figure 4, PSPNet and Deeplabv3+ employ pyramid pooling modules to obtain multi-scale contextual information, thereby enhancing the completeness of the segmented structures. However, detailed boundary information may be lost in this process, resulting in less clear and accurate boundaries in the segmentation results (see the 2nd and 3rd panels of Figure 4). Unet++ and Resunet segment the target structures more accurately by introducing new feature fusion schemes into Unet, which are more effective at extracting key features from images with complex textures, shapes, and boundaries. However, because boundary features are often weakened or over-smoothed during feature fusion, the boundaries between bones may lack clarity. nnUnet partially mitigates these issues, and avoids overfitting to specific datasets, by optimizing the pre-processing, training, and post-processing procedures, achieving good segmentation results; however, owing to Unet’s limited ability to integrate contextual information, structural omissions occur in some images. Swin-Unet, TransUNet, SegFormer, and Mask2former excel at capturing global features and contextual information, owing to their Transformer-based architectures; for instance, Swin-Unet adopts a Unet-like structure built entirely from Swin Transformer blocks. Nevertheless, these networks primarily focus on global feature learning, making it challenging for their attention mechanisms to fully capture small targets and leading to issues such as blurred and unclear skeletal boundaries. Additionally, several studies based on nnUnet have extended its capabilities by incorporating Transformer architectures: Guo et al. (24) for brain tumor segmentation, Qian et al. (25) for kidney tumor segmentation, Xie et al. (26) for brain tumor segmentation, and Zhang et al. (27) for corneal cell segmentation, all aiming to improve the segmentation accuracy of specific tissues or organs in medical images. However, because of the spine’s unique anatomy (its elongated shape, multiple segments, and complex connections with surrounding tissues) and because our study focuses on accurately delineating the boundaries of individual vertebrae and intervertebral discs, these methods may not be directly applicable. Notably, Transformer-based architectures typically require substantial computational resources, which may limit their clinical applicability. The method proposed in this study combines nnUnet with a convolutional module that emulates the self-attention mechanism of Transformers, balancing global and local feature extraction while adding only a small number of parameters. This approach not only preserves the overall structure effectively but also significantly improves the accuracy of bone boundary segmentation.
Conclusions
In this study, we propose DBU-Net, a deep learning segmentation network built upon nnUnet and the CoT module. The network integrates the nnUnet encoder with an optimized CoT module, significantly enhancing its feature representation capability. We also designed the MFCA module to comprehensively capture and efficiently integrate multi-scale feature information in medical images. We conducted a comprehensive evaluation on the VerSe 2019 and 2020 benchmark datasets from the Medical Image Computing and Computer Assisted Intervention (MICCAI) challenges, using metrics including the Dice coefficient, Acc, IoU, Pre, and Rec. The results demonstrate that DBU-Net achieves superior performance, validating its ability to segment spinal structures efficiently and accurately, and it exhibited greater accuracy in segmenting textures and boundaries than the other networks. Additionally, we performed ablation experiments to provide quantitative results and validate the effectiveness of the proposed modules (28). In future work, we will continue to explore more advanced attention mechanisms to further enhance the model’s performance in handling complex backgrounds and fine features, thereby promoting its clinical application.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD + AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/rc
Funding: This work was partially supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-2297/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study utilized two publicly available datasets. The datasets were made accessible to the public in legal compliance, and the data were de-identified before release, containing no individually identifiable information. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of Shengjing Hospital, China Medical University (No. 2024PS1106K). The requirement of informed consent was waived due to the retrospective nature of the study.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet 2018;392:1789-858. [Crossref] [PubMed]
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing; 2015:234-41.
- Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning where to look for the pancreas. arXiv:1804.03999, 2018.
- Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J. UNet++: A nested U-Net architecture for medical image segmentation. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing; 2018:3-11.
- Huang H, Lin L, Tong R, Hu H, Zhang Q, Lwamoto Y, Han X, Chen Y, Wu J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2020:1055-59.
- Alom MZ, Hasan M, Yakopcic C, Taha TM, Asari VK. Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation. arXiv:1802.06955, 2018.
- Zhang Z, Liu Q, Wang Y. Road Extraction by Deep Residual U-Net. IEEE Geosci Remote Sens Lett 2018;15:749-53.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. European conference on computer vision. Cham: Springer; 2022:205-18.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision. 2021:10012-22.
- Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv:2102.04306, 2021.
- Xie E, Wang W, Yu Z, Anandkumar A, Alvarez JM, Luo P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 2021;34:12077-90.
- Huang X, Deng Z, Li D, Yuan X. MISSFormer: An effective medical image segmentation transformer. arXiv:2109.07162, 2021.
- Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:1280-9.
- Cheng B, Schwing AG, Kirillov A. Per-pixel classification is not all you need for semantic segmentation. Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021;34:17864-75.
- Li Y, Yao T, Pan Y, Mei T. Contextual Transformer Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell 2023;45:1489-500. [Crossref] [PubMed]
- Isensee F, Petersen J, Klein A, Zimmerer D, Jaeger PF, Kohl S, Wasserthal J, Koehler G, Norajitra T, Wirkert S, Maier-Hein KH. nnU-Net: self-adapting framework for U-Net-based medical image segmentation. arXiv:1809.10486, 2018.
- Sekuboyina A, Husseini ME, Bayat A, Löffler M, Liebl H, Li H, et al. VerSe: A Vertebrae labelling and segmentation benchmark for multi-detector CT images. Med Image Anal 2021;73:102166. [Crossref] [PubMed]
- Löffler MT, Sekuboyina A, Jacob A, Grau AL, Scharr A, El Husseini M, Kallweit M, Zimmer C, Baum T, Kirschke JS. A Vertebral Segmentation Dataset with Fracture Grading. Radiol Artif Intell 2020;2:e190138. [Crossref] [PubMed]
- Kawathekar ID, Areeckal AS, Aparna V. A Novel Deep Learning Pipeline for Vertebra Labeling and Segmentation of Spinal Computed Tomography Images. IEEE Access 2024;12:15330-47.
- Deng Y, Wang C, Hui Y, Li Q, Li J, Luo S, Sun M, Quan Q, Yang S, Hao Y, Liu P, Xiao H, Zhao C, Wu X, Zhou SK. CTSpine1K: a large-scale dataset for spinal vertebrae segmentation in computed tomography. arXiv:2105.14711, 2021.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA. 2017:6230-9.
- Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European conference on computer vision (ECCV). 2018:801-18.
- Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019:3146-54.
- Guo S, Chen Q, Wang L, Wang L, Zhu Y. nnUnetFormer: an automatic method based on nnUnet and transformer for brain tumor segmentation with multimodal MR images. Phys Med Biol 2023;
- Qian L, Luo L, Zhong Y, Zhong D. A Hybrid Network Based on nnU-Net and Swin Transformer for Kidney Tumor Segmentation. International Challenge on Kidney and Kidney Tumor Segmentation. Cham: Springer Nature Switzerland; 2024:30-9.
- Xie Y, Zhou C, Mei J, Li X, Xie C, Zhou Y. Brain Tumor Segmentation Through Supervoxel Transformer. 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece; 2024:1-5.
- Zhang D, Zhang J, Li S, Dong Z, Zheng Q, Zhang J. U-NTCA: nnUNet and nested transformer with channel attention for corneal cell segmentation. Front Neurosci 2024;18:1363288. [Crossref] [PubMed]
- Zhou Z, He A, Wu Y, Yao R, Xie X, Li T. Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation. arXiv:2406.07952, 2024.