Original Article

SPW-TransUNet: three-dimensional computed tomography-cone beam computed tomography image registration with spatial perpendicular window Transformer

Rui Hu1, Shimeng Yang1, Jingjing Zhang1, Xiaokun Hu2, Teng Li1

1Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education/School of Artificial Intelligence, Anhui University, Hefei, China; 2Department of the Interventional Medical Center, the Affiliated Hospital of Qingdao University, Qingdao, China

Contributions: (I) Conception and design: R Hu; (II) Administrative support: T Li, J Zhang, X Hu; (III) Provision of study materials or patients: R Hu, S Yang; (IV) Collection and assembly of data: R Hu, S Yang; (V) Data analysis and interpretation: R Hu, X Hu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Xiaokun Hu, PhD. Department of the Interventional Medical Center, the Affiliated Hospital of Qingdao University, No. 1677 Wutaishan Road, Qingdao 266000, China. Email: huxiaokun770@163.com; Teng Li, PhD. Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education/School of Artificial Intelligence, Anhui University, No. 111, Jiulong Road, Hefei Economic and Technological Development Zone, Hefei 230601, China. Email: 13002@ahu.edu.cn.

Background: Current medical image registration methods based on Transformer still encounter challenges, including significant local intensity differences and limited computational efficiency when dealing with three-dimensional (3D) computed tomography (CT) and cone beam CT (CBCT) images. These limitations hinder the precise alignment necessary for effective diagnosis and treatment planning. Therefore, the aim of this study is to develop a novel method that overcomes these challenges by enhancing feature interaction and computational efficiency in 3D medical image registration.

Methods: This paper introduces a novel method that enhances feature interaction within the Transformer by computing attention within a resizable spatial perpendicular window (SPW). Additionally, it introduces a self-learning mapping control (SLMC) mechanism, which uses a mini convolutional neural network (CNN) to adaptively transform feature vectors into probability vectors. This approach is integrated into the UNet framework, resulting in the SPW-TransUNet. The effectiveness of the SPW-TransUNet is demonstrated through evaluations on two critical 3D medical imaging tasks: CT-CBCT registration and inter-CT registration. We utilized a range of evaluation metrics, including the Dice similarity coefficient (DICE), structural similarity index measure (SSIM), target registration error (TRE), and negative Jacobian percentage. The validation process involved comparative analysis against established baseline methods using statistical tests to ensure the robustness and reliability of our results.

Results: The proposed method demonstrated outstanding performance in the registration of 124 pairs of CT-CBCT lung images from 20 patients, achieving the lowest TRE of 2.16 mm and the lowest negative Jacobian percentage of 0.126%. It also recorded the highest SSIM and DICE scores of 86.87% and 88.28%, respectively. For the liver CT task involving 150 patients, the method achieved peak SSIM and DICE scores of 76.92% and 85.77%, respectively. Furthermore, ablation studies confirmed the effectiveness of the designed structural components.

Conclusions: The SPW-TransUNet offers significant improvements in feature interaction and computational efficiency for medical image registration, providing an effective reference solution for patient and target localization in image-guided radiation therapy.

Keywords: Spatial perpendicular window (SPW); Transformer; three-dimensional medical image (3D medical image); computed tomography-cone beam computed tomography registration (CT-CBCT registration)


Submitted Jun 06, 2024. Accepted for publication Oct 28, 2024. Published online Nov 29, 2024.

doi: 10.21037/qims-24-1138


Introduction

Medical image registration aims to find the deformation vector field and establish a spatial correspondence between two images (1,2). In cancer radiotherapy (3), radiation therapy technologists or physicians rely on computed tomography (CT) images to make treatment plans and verify patient positions in the treatment room using cone beam CT (CBCT) images. However, rigid registration alone, which only aligns the gross position, is not sufficient for accurate radiotherapy. Non-rigid registration is essential due to anatomical changes such as positional offsets, breathing, or organ motion during treatment; it can dynamically adjust to these movements, ensuring accurate target localization throughout the treatment process (4,5). Moreover, the ability to perform these adjustments in real time is critical, especially in the context of online adaptive radiotherapy, as it allows for immediate corrections in response to patient movement, enhancing both the efficacy and safety of the therapy. However, as shown in Figure 1, CBCT images suffer from artifacts, inadequate spatial resolution, and a low signal-to-noise ratio (SNR) due to a lack of projection data and low radiation doses (6,7). These factors cause significant local intensity differences in the corresponding anatomical structures between CT and CBCT images, challenging CT-CBCT image registration.

Figure 1 Sample images of CT and CBCT. The two images on the right are enlarged patches in red boxes on the left. CT, computed tomography; CBCT, cone beam computed tomography.

Traditional image registration algorithms usually perform iterative optimization by maximizing the similarity measure between fixed and moving images and penalizing deformations that do not conform to physical rules (8,9). For CT-CBCT images, some researchers propose to match them from local intensities of image-sliced cubes and slices (10). Other methods use global matching algorithms to achieve both intensity correction and intensity matching, such as DEMONS (11) and optical flow (12). However, in these methods, the computing process of iterative optimization to maximize the energy function is time-consuming. The integration of intensity correction and intensity matching further increases the computational complexity. As a result, traditional methods cannot fully meet the real-time demands required for three-dimensional (3D) medical image registration in clinical settings, impacting procedures such as surgical navigation and adaptive radiotherapy. These scenarios require immediate image processing to ensure precise patient treatment (5,13).

With the advancement of deep learning technology, UNet (14)-based models have been widely adopted in medical image processing and have shown promising results (15). For medical image registration, because obtaining medical labels from professionals is time-intensive, some methods opt to train the registration network without labels, instead utilizing similarity metrics between the fixed and deformed moving images to guide network training. Some methods adopt UNet to predict the deformation vector field directly from the fixed and moving images without the iterative optimization process (13,14) to meet real-time clinical needs. To address the intensity differences between CT and CBCT, some studies (15-17) further incorporate additional attention modules into the UNet registration network. These registration algorithms mostly rely on convolutional structures (18), which suffer from limited receptive fields that hinder their ability to capture long-distance spatial relationships between pixels (19). In cases where there is a significant difference between the images being registered, due to disease progression, image quality differences, or natural anatomical variations, these networks do not perform optimally (4).

The recently emergent Transformer-based networks can be a promising solution for medical image registration with sizeable local intensity differences and significant differences in anatomical structure shapes (20,21). Owing to its inherent advantage in receptive field, the Transformer provides enough interaction space to model the relationships between features. It has also been applied in CT-CBCT image registration and has proven superior to convolution-based networks (22,23).

The original Transformer-based method calculates self-attention across the entire input image, forming a large receptive field. However, its computational cost is prohibitively high for clinical applications (24). To reduce computational complexity, several Transformer-based networks have explored local self-attention mechanisms. As shown in Figure 2, in the Swin-Transformer (25), attention is limited to non-overlapping local windows, and a shifting window operation is introduced to enable communication between adjacent windows. However, this approach has limited efficiency in expanding the receptive field, requiring many blocks to achieve global self-attention. Other effective self-attention mechanisms include sequential axial self-attention (26) and dilated window self-attention (27). The former applies local windows in a horizontal and vertical sequence, while the latter expands local windows to achieve a global attention representation. Additionally, the orthogonal Transformer (28) computes self-attention by projecting the data onto an orthogonal space to reduce computation. This method uses dimensionality reduction to lessen the computational burden. However, some information is lost when projecting features onto a low-dimensional space. The original feature vector v undergoes a Householder transformation (R1 and R2 represent the reflection matrices), obtaining the transition vector ω and the dimensionality-reduced vector u. This loss of information during the transformation process is represented by the vector d in Figure 2.

Figure 2 Illustration of the feature interaction space of Transformer with different self-attention methods. (A) Global self-attention Transformer. (B) Local self-attention Transformer variants. (C) Feature interaction in 3D medical image registration. H denotes the number of attention heads. X, Y, and Z correspond to the length, width, and height dimensions, respectively, illustrating the volumetric expanse of the feature interaction zone. SPW, spatial perpendicular window; 3D, three-dimensional.

The Transformer-based networks mentioned above compute local self-attention to increase computational efficiency, but they inevitably sacrifice some information regarding visual details. Moreover, current Transformer-based models cannot dynamically adjust the mapping from feature vectors to probability vectors during training, limiting the model's ability to fully explore potential correlations between features. Furthermore, previous methods are primarily designed for two-dimensional (2D) image processing. Given the significant computational demands of 3D medical image registration, it is imperative to employ these techniques more efficiently.

This paper introduces a novel spatial perpendicular window (SPW)-Transformer for efficient 3D image representation, and further develops a model combining the SPW-Transformer with UNet, named SPW-TransUNet, specifically for image registration. In the attention computation of the SPW-Transformer, we divide the representation space of multi-head attention into three perpendicular groups and perform a self-attention operation on each group. The operating window size k of the three groups is adjustable and can gradually increase as the feature dimensionality increases, which better handles high-dimensional information. It is worth noting that these three groups of attention are computed in parallel. As shown in the box in Figure 2C, the blue boxes represent the feature interaction space volumes of the three mainstream local window self-attention methods, and the red box represents the feature interaction space volume of the proposed SPW-Transformer. The proposed design therefore expands the feature interaction space of the Transformer effectively, and the width of the operation window is adjustable to obtain a better feature interaction space during network training. To enhance the functionality of the SPW-Transformer, we implement a miniature convolutional neural network (CNN) designed to adaptively learn and adjust the otherwise fixed "temperature" parameter (29). This adjustment makes the mapping from feature vectors to probability vectors within the Transformer adaptable during training. In summary, the main contributions of this paper are as follows:

  • A novel Transformer architecture for 3D medical image registration named SPW-Transformer is designed to perform self-attention operations in parallel within SPW with adjustable sizes. It promotes efficient extension of feature interaction space, avoiding a significant increase in computational cost.
  • A novel unsupervised CT-CBCT image registration method is proposed, combining SPW-Transformer and UNet, which can handle the significant local intensity differences between the image pair and the significant shape changes. To control the mapping process of the SPW-Transformer, a miniature CNN model is adopted to dynamically adjust the mapping process from the feature vector to probability vector transformation.
  • We extensively evaluate the proposed method on two benchmark 3D medical image datasets for lung CT-CBCT and liver CT image registration. The proposed method shows state-of-the-art performance, and the SPW-Transformer is more efficient than previous Transformers.

Methods

Formulation and framework

The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). We formulate the image registration problem as Gθ(IF,IM)=ϕ, where IF is a fixed image and IM is a moving image, ϕ is the deformation field generated by the network. We aim to obtain the optimal parameter θ by minimizing the expected loss function L on the training set so that the two images register in space. The formula is as follows:

\hat{\theta} = \arg\min_{\theta}\, \mathbb{E}_{(I_F, I_M) \sim N}\left[ L\left(I_F, I_M, G_\theta(I_F, I_M)\right) \right]

Overall structure

The proposed unsupervised registration network comprises two stages, as illustrated in Figure 3. In the first stage, we use an affine registration network to rigidly register the overall position of the raw input images, providing the foundation for the subsequent non-rigid network. As depicted in Figure 4, we employ a multi-layer convolutional network to build an affine registration network that down-samples the input data and produces 12 transformation parameters, consisting of a 3×3 affine matrix α and a 3D displacement vector β (30). The number of channels per layer is marked in Figure 4. Subsequently, we utilize these transformation parameters to warp the input image I, and the entire process can be expressed as f(I)=αI+β.
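As a minimal illustration of this warping step, the predicted α and β can be packed into a 3×4 transformation matrix and applied to a volume with PyTorch's affine_grid/grid_sample; the function name, shapes, and sampling settings below are assumptions for the sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def affine_warp(volume, alpha, beta):
    """Warp a 3D volume with the 12 affine parameters from the first-stage network.

    volume: (N, 1, D, H, W); alpha: (N, 3, 3) affine matrix; beta: (N, 3) displacement.
    """
    theta = torch.cat([alpha, beta.unsqueeze(-1)], dim=-1)          # (N, 3, 4)
    grid = F.affine_grid(theta, volume.shape, align_corners=False)  # normalized sampling grid
    return F.grid_sample(volume, grid, mode="bilinear", align_corners=False)

# usage: an identity transform leaves the volume (approximately) unchanged
vol = torch.rand(1, 1, 16, 32, 32)
warped = affine_warp(vol, torch.eye(3).unsqueeze(0), torch.zeros(1, 3))
```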

Figure 3 Overall structure of the proposed method. In the first stage, the data is affine registered. The second stage is the proposed SPW-TransUNet for more refined deformable registration. SPW, spatial perpendicular window.
Figure 4 The overall structure of affine network used in the first stage registration.

After obtaining the warped image from the first stage, we input it into the second stage, SPW-TransUNet, a specialized model to tackle the complex deformable registration tasks. Finally, the loss function L is calculated. The back-propagation loss updates the gradient information of the model. The entire registration process is end-to-end. The trained model takes the fixed and moving images that need to be registered as input during inference, and then directly generates the deformation vector field as output.

SPW-TransUNet

The structure of SPW-TransUNet is illustrated in Figure 5 and consists primarily of the SPW-Transformer and convolutional blocks (Conv-Blocks). The Conv-Block incorporates convolution, the 3D funnel rectified linear unit (FReLU), and Group Normalization (Group Norm). To facilitate preliminary feature extraction from the original data, a scale-invariant convolutional layer with a kernel size of 3×3 is employed. The activation layer extends 2D-FReLU (31) to 3D space, enabling a nonlinear transformation of image features that models the spatial context of each voxel.

Figure 5 The overall structure of the SPW-TransUNet. F represents a fixed image; M represents a moving image; N represents the number of SPW-Transformer blocks contained in SPW-Transformer; and K represents the width of the SPW. MLOI, multi-level original information; Conv Block, convolutional block; SPW, spatial perpendicular window; DVF, deformation vector field.

In the normalization layer, for 3D medical registration task, the batch size during model training is often constrained by graphics processing unit (GPU) memory limitations. Consequently, the mean and variance calculated by standard batch normalization may not accurately reflect the data variations when using small batch sizes. To address this issue, Group Norm demonstrates superior performance when a small batch size is selected. The operations are repeated twice, and the resulting features are concatenated with the input features to form a Conv-Block (32).
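A minimal sketch of such a Conv-Block is given below, assuming the 3D-FReLU branch is a depth-wise 3×3×3 convolution followed by Group Norm; the exact layer sizes and the normalization inside the funnel branch are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class FReLU3D(nn.Module):
    """3D funnel activation: max(x, T(x)), with T a depth-wise 3x3x3 convolution."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, 3, padding=1, groups=channels)
        self.norm = nn.GroupNorm(groups, channels)

    def forward(self, x):
        return torch.max(x, self.norm(self.spatial(x)))

class ConvBlock(nn.Module):
    """Conv -> 3D-FReLU -> Group Norm, repeated twice; output concatenated with the input."""
    def __init__(self, channels, groups=8):
        super().__init__()
        layers = []
        for _ in range(2):
            layers += [nn.Conv3d(channels, channels, 3, padding=1),
                       FReLU3D(channels, groups),
                       nn.GroupNorm(groups, channels)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # channel-wise concatenation with the input

# usage
feat = torch.rand(1, 16, 32, 32, 32)
print(ConvBlock(16)(feat).shape)  # torch.Size([1, 32, 32, 32, 32])
```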

Through the Conv-Block, the overall low-dimensional information of the input image is preliminarily extracted (33). Then, the extracted features are sent to the SPW-Transformer, which models the dense nonlinear relationship between these features. Each layer of the SPW-Transformer uses different window widths to achieve information exchange between internal features in different ranges. SPW-Transformer contains the core of the SPW multi-head self-attention (SPW-MSA) and self-learning mapping control (SLMC). Also, to minimize the information loss, we pool the original information into multi-level original information (MLOI) and embed them into the corresponding layers separately. SPW-Transformer is the proposed model’s core component and will be detailed in the following.

SPW-Transformer

Due to the high computational cost of processing 3D medical images, designing effective self-attention mechanisms in the Transformer model is crucial. Some recent works compute attention locally with sequential mechanisms such as shift windows (34), dilated windows (27), axial stripes (26), and orthogonal windows (28). As introduced before, previous methods may neglect some details and hinder the performance of 3D image registration. A SPW-MSA mechanism is proposed here, which computes self-attention in parallel in spatial mutually perpendicular directions. This design can expand the self-attention area and achieve self-attention more effectively while controlling computational costs. The structure of the SPW-Transformer is shown in Figure 6, and the input is noted as Fe. SPW-Transformer comprises four components, which are multi-layer perceptron (MLP), layer normalization (LN), the SPW-MSA, and the residual structure. Given the input Fe, the calculation process of the SPW-Transformer can be expressed by the following equations:

F_l = \text{SPW-MSA}\left[\text{LN}(F_e)\right] + F_e

F_d = \text{MLP}\left[\text{LN}(F_l)\right] + F_l

Figure 6 Schematic diagram of SPW-Transformer block. The structure in the rightmost box has two scenarios, where the encoder uses Conv Block, and the decoder uses up-sampling. LN, layer normalization; SPW, spatial perpendicular window; MSA, multi-head self-attention; MLP, multi-layer perceptron; Conv Block, convolutional block.

where Fl and Fd are the outputs of the residual block. It is worth noting that the proposed SPW-TransUNet in the forward propagation order contains [2, 2, 6, 2, 2] SPW-Transformer blocks.
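The two equations above describe a standard pre-norm residual block. A minimal sketch follows; the MLP expansion ratio and activation are assumptions, and spw_msa stands in for the SPW-MSA module described next.

```python
import torch.nn as nn

class SPWTransformerBlock(nn.Module):
    """F_l = SPW-MSA(LN(F_e)) + F_e;  F_d = MLP(LN(F_l)) + F_l, on (B, tokens, C) inputs."""
    def __init__(self, dim, spw_msa, mlp_ratio=4):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), spw_msa
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, f_e):
        f_l = self.attn(self.norm1(f_e)) + f_e  # SPW-MSA branch with residual connection
        f_d = self.mlp(self.norm2(f_l)) + f_l   # MLP branch with residual connection
        return f_d
```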

SPW-MSA

SPW-MSA is the core part of SPW-Transformer. To balance the computational efficiency and receptive field range, the multi-head attention is divided into three groups, and each group performs self-attention in parallel in a window of one direction. These windows contain the most relevant positions of the feature centers, and the directions are perpendicular. Additionally, the window width parameter of the network is dynamically adjustable in training. As the network layer deepens and the feature dimension increases, the window width proportionally expands, allowing for a broader range of deep layer feature interactions.

Denote the input feature as I ∈ ℝ^{(s×w×h)×C}, which is first linearly projected into H heads. Within the heads, self-attention in the sagittal, frontal, and horizontal planes is performed separately (one plane per group of heads). The implementation process is shown in Figure 7. For self-attention in the sagittal plane, we divide I into non-overlapping windows [I_1, …, I_m, …, I_M] of equal width, each containing s×k×h tokens, where I_m ∈ ℝ^{(s×k×h)×C}, k is the window width, and M = w/k. We use W_Q ∈ ℝ^{C×d}, W_K ∈ ℝ^{C×d}, and W_V ∈ ℝ^{C×d} to denote the projection matrices of Q, K, and V. The output of the sagittal-plane self-attention of each head can be expressed as:

\text{SAtt}(I) = \left[\text{Att}\left(I_1 W_Q, I_1 W_K, I_1 W_V\right), \ldots, \text{Att}\left(I_m W_Q, I_m W_K, I_m W_V\right), \ldots, \text{Att}\left(I_M W_Q, I_M W_K, I_M W_V\right)\right]

Figure 7 A schematic of SPW-MSA performing self-attention. k represents the width of the feature interaction window. SPW, spatial perpendicular window; MSA, multi-head self-attention.

Similarly, the frontal- and horizontal-plane self-attention outputs FAtt(I) and HAtt(I) of each head can be obtained. As shown in Figure 7, the number of heads H is a multiple of 3, and each group of heads performs self-attention in the sagittal, frontal, or horizontal plane, respectively. Finally, the attention results obtained from the three cross-sections are concatenated to generate the output of all heads:

\text{ALLAtt}(I) = \left[\text{FAtt}(I) \oplus \text{HAtt}(I) \oplus \text{SAtt}(I)\right] W_O

where ⊕ denotes the concatenation operation, and W_O ∈ ℝ^{C×C}.
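To make the window partition concrete, the following is a simplified single-head sketch of attention within non-overlapping slabs of width k along one spatial axis; in the full SPW-MSA, the heads are split into three groups that run this operation in parallel over the three perpendicular axes before the outputs are concatenated and projected by W_O. The shapes, names, and the assumption that the spatial size is divisible by k are ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def axis_window_attention(x, wq, wk, wv, axis, k):
    """Self-attention inside non-overlapping windows of width k along one spatial axis.

    x: (B, C, D, H, W) features; wq/wk/wv: nn.Linear projections (kept at dimension C here);
    axis: 2, 3, or 4 (the axis that is cut into windows). Assumes x.shape[axis] % k == 0.
    """
    x = x.movedim(axis, -1)                        # put the windowed axis last
    B, C, A1, A2, L = x.shape
    m = L // k                                     # number of windows along this axis
    x = x.reshape(B, C, A1, A2, m, k)              # split the axis into m slabs of width k
    # flatten every slab (A1 x A2 x k positions) into one token sequence
    tokens = x.permute(0, 4, 2, 3, 5, 1).reshape(B * m, A1 * A2 * k, C)
    q, key, v = wq(tokens), wk(tokens), wv(tokens)
    attn = F.softmax(q @ key.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    out = (attn @ v).reshape(B, m, A1, A2, k, C).permute(0, 5, 2, 3, 1, 4)
    return out.reshape(B, C, A1, A2, L).movedim(-1, axis)

# usage: one "sagittal-style" group; the other two groups would use axis=2 and axis=3
B, C, D, H, W = 1, 8, 8, 8, 8
wq, wk, wv = (nn.Linear(C, C) for _ in range(3))
y = axis_window_attention(torch.rand(B, C, D, H, W), wq, wk, wv, axis=4, k=4)
```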

SLMC

Calculating each head's self-attention output ALLAtt(I) is a crucial step in the Transformer structure. In previous methods, the attention mapping is computed with the softmax function, which transforms a feature vector into a probability vector by exponentiating each element and normalizing these values by the sum of all exponentials. In addition, softmax has a fixed "temperature", i.e., a fixed denominator in the softmax activation, which makes the attention mapping process inflexible (34). However, the temperature parameter significantly influences the conversion of feature vectors into probability vectors: when the temperature parameter is too large, the difficulty of network convergence increases, while the network tends to fall into a local optimum when the temperature parameter is too small. To adaptively control the sharpness of feature matching in image registration, in our SPW-Transformer we adjust the attention mapping according to adaptive information from the image pair.

As shown in Figure 8, a miniature CNN is introduced that adaptively learns the temperature parameter λ. In the early stage of the registration iterations, the miniature CNN can adjust λ so that the images are registered for a preliminary approximate matching. Afterward, the features are finely matched through the λ-controlled attention mapping. This mapping method can approximate the probability distribution represented by the network output and increase the model's search space for feature correlations. Furthermore, to maximize the accuracy of the feature mapping, we use Gumbel-Softmax (35) instead of the normal softmax function: Gumbel-Softmax is used to sample a matching matrix, while the miniature CNN learns the appropriate temperature parameter from the features themselves, as shown in the following equation:

\text{Gumbel-Softmax}\left(QK^{T}\right) = \text{onehot}\left[\arg\max \text{Softmax}\left(\frac{QK^{T} + G}{\lambda}\right)\right]

Figure 8 The mapping control method of SPW-Transformer. CNN, convolutional neural network; DW-Conv, depth-wise convolution; SPW, spatial perpendicular window.

where G is a set of independent and identically distributed samples drawn from Gumbel(0, 1) noise. This mapping is expected to approximate the probability distribution represented by the network output and to enlarge the model's search space for feature correlations (34).

In the above equation, when λ is large, the mapping process becomes smooth and is suitable for precise registration; when λ is close to 0, the mapping becomes sharp and is suitable for coarse registration. In the Transformer's position encoding step, we use the depth-wise convolution (Dw-Conv) operator (36) to operate on the value matrix V, which reduces the computational cost of token position encoding. It effectively preserves token position information during the attention calculation by directly applying the most crucial local neighborhood position information to the linearly projected values.
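A minimal sketch of the SLMC idea is shown below: a tiny CNN head predicts λ from the features, Gumbel noise is added to the attention logits, and the softmax is taken at that temperature. The layer sizes are assumptions, and the soft (differentiable) relaxation is used here in place of the hard onehot(argmax) in the equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLMC(nn.Module):
    """Self-learning mapping control: predict the temperature lambda and apply Gumbel-Softmax."""
    def __init__(self, channels):
        super().__init__()
        self.temp_net = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, channels // 2), nn.ReLU(),
            nn.Linear(channels // 2, 1), nn.Softplus())      # keep lambda strictly positive

    def forward(self, feat, logits):
        # feat: (B, C, D, H, W) features used to predict lambda; logits: attention scores QK^T
        lam = self.temp_net(feat).view(-1, *([1] * (logits.dim() - 1))) + 1e-6
        gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)  # Gumbel(0,1) noise
        return F.softmax((logits + gumbel) / lam, dim=-1)    # soft Gumbel-Softmax attention map
```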

Loss function

The loss of the entire network consists of four parts. In the first stage, the similarity loss Lsim1 is used. In the second stage, the similarity loss Lsim2, the Jacobian regularization loss LJdet (30), and the smoothness constraint loss Lreg (37) are used. It should be noted that the similarity functions in the first and second stages both use the correlation coefficient (CC) (36). The fixed image is denoted as IF, and the moving image IM deformed by the deformation vector field is denoted as IS. Lsim1 and Lsim2 use CC to calculate the similarity between IF and IS, which is expressed by the following formula:

CC(I_F, I_S) = \sum_{p \in \Omega} \frac{\left\{\sum_{p_i}\left[I_F(p_i) - \hat{I}_F(p_i)\right]\left[I_S(p_i) - \hat{I}_S(p_i)\right]\right\}^{2}}{\sum_{p_i}\left[I_F(p_i) - \hat{I}_F(p_i)\right]^{2}\,\sum_{p_i}\left[I_S(p_i) - \hat{I}_S(p_i)\right]^{2}}

where Î_F(p_i) and Î_S(p_i) represent I_F(p_i) and I_S(p_i) with their local mean intensity subtracted, respectively. p_i iterates over the n^3 volume around point p, and Ω is the total volume; n=12 is used in this paper. A larger CC index indicates a higher average local similarity between the two images. The calculation of CC adopts a windowing operation similar to convolution.
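A sketch of this windowed CC, computed with a box-filter convolution in the spirit of the local cross-correlation used in (37), is given below; the window size and the small stabilizing constant are assumptions.

```python
import torch
import torch.nn.functional as F

def local_cc_loss(i_f, i_s, n=12):
    """Negative windowed correlation coefficient between fixed and warped images.

    i_f, i_s: (B, 1, D, H, W). Neighbourhood sums over each n^3 window are obtained
    with a box-filter convolution; minimizing the returned value maximizes CC.
    """
    kernel = torch.ones(1, 1, n, n, n, device=i_f.device)

    def box(x):
        return F.conv3d(x, kernel, padding=n // 2)

    f_sum, s_sum = box(i_f), box(i_s)
    f2_sum, s2_sum, fs_sum = box(i_f * i_f), box(i_s * i_s), box(i_f * i_s)
    win = n ** 3
    f_mean, s_mean = f_sum / win, s_sum / win

    cross = fs_sum - s_mean * f_sum - f_mean * s_sum + f_mean * s_mean * win
    f_var = f2_sum - 2 * f_mean * f_sum + f_mean * f_mean * win
    s_var = s2_sum - 2 * s_mean * s_sum + s_mean * s_mean * win
    cc = (cross * cross) / (f_var * s_var + 1e-5)
    return -cc.mean()   # negate: a larger CC means better local alignment
```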

It is worth noting that the second stage of registration is non-rigid. Lreg and LJdet ensure the deformation vector field conforms to the natural physical deformation in the second stage. LJdet penalizes local collapse in the deformation vector field. LJdet is expressed as:

L_{Jdet} = \frac{1}{V} \sum_{p \in \Omega} \sigma\left(-\left|J_\phi(p)\right|\right)

where V represents the total number of elements in |J_φ|, σ represents an activation function, for which the rectified linear unit (ReLU) is selected here, and |J_φ(p)| represents the determinant of the Jacobian matrix of the deformation field φ at position p. We further employ L_reg to enhance the smoothness of the deformation vector field:

L_{reg} = \sum_{p \in \Omega} \left\|\nabla \phi(p)\right\|^{2}

The total loss function L used in the network can be written as,

L = L_{sim1} + L_{sim2} + \lambda_1 L_{Jdet} + \lambda_2 L_{reg}

where λ1=400 and λ2=0.8. All the parameters are tuned by grid search. We design experiments in the “Experimental” section to verify the impact of parameters on the results.
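For concreteness, a minimal PyTorch sketch of the two deformation regularizers and the weighted total loss follows, assuming the deformation field φ is a dense displacement tensor of shape (B, 3, D, H, W) and using finite differences for the spatial derivatives; this is a sketch of the formulas above, not the authors' code.

```python
import torch
import torch.nn.functional as F

def jacobian_determinant(phi):
    """Determinant of the Jacobian of the mapping x + phi(x); phi: (B, 3, D, H, W)."""
    grads = [torch.gradient(phi, dim=d)[0] for d in (2, 3, 4)]       # d(phi)/dx, dy, dz
    J = torch.stack(grads, dim=-1).permute(0, 2, 3, 4, 1, 5)         # (B, D, H, W, 3, 3)
    J = J + torch.eye(3, device=phi.device)                          # add the identity part
    return torch.linalg.det(J)                                       # (B, D, H, W)

def loss_jdet(phi):
    """Penalize folded voxels (negative Jacobian determinant) with a ReLU activation."""
    return F.relu(-jacobian_determinant(phi)).mean()

def loss_reg(phi):
    """Smoothness constraint: squared spatial gradients of the deformation field."""
    dx, dy, dz = (torch.gradient(phi, dim=d)[0] for d in (2, 3, 4))
    return (dx.pow(2) + dy.pow(2) + dz.pow(2)).mean()

def total_loss(l_sim1, l_sim2, phi, lam1=400.0, lam2=0.8):
    """L = L_sim1 + L_sim2 + lambda1 * L_Jdet + lambda2 * L_reg."""
    return l_sim1 + l_sim2 + lam1 * loss_jdet(phi) + lam2 * loss_reg(phi)
```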

Experimental

Dataset and preprocessing

This paper uses two public benchmark datasets containing lung and liver medical images to verify the effectiveness of the proposed unsupervised medical image registration method. For the CT-CBCT registration task, the four-dimensional (4D)-Lung dataset (38) is adopted. The dataset consists of images from 20 patients with locally advanced non-small cell lung cancer, including seven women and 13 men, all treated at the same institution. The 4D fan-beam CT (FBCT) images are acquired using a 16-slice helical CT scanner, and the 4D-CBCT images are obtained with a commercial CBCT scanner. CT and CBCT images both include 10 breathing phases, and the number of paired CT and CBCT images for all patients is 124 pairs. In addition, given the limited number of patients, we divide the data into a 2:1:1 ratio to ensure the method's applicability: in four-fold cross-validation, ten patients are used for training, five for validation, and five for testing. Each CT and CBCT scan in the 4D-Lung dataset consists of approximately 50 slices, with each slice having a thickness of 3 mm. For isotropy, the images and anatomical labels are resampled to 1 mm using trilinear interpolation and nearest neighbor interpolation, respectively. To maintain consistency of the network input, the image size is center cropped to 128×256×256 for training, with each input containing 128 slices.

Morphological differences in lung images from the same patient are often minimal and may not fully test a registration algorithm's performance against significant morphological variations. Therefore, to better evaluate our algorithm, we use liver images from different patients, which present greater morphological disparities. The publicly available Liver Tumor Segmentation Challenge (LITS) (39), Medical Segmentation Decathlon (MSD) (40), and Segmentation of the Liver Competition (SLIVER) (41) datasets are used together as the liver datasets in the CT image registration task. The MSD contains 513 liver-related scans with slice thickness between 0.8 and 8 mm. The LITS contains 130 scans with slice thickness between 0.7 and 5 mm. The SLIVER contains 20 scans, with voxel spacing varying from 0.55 to 0.8 mm in the x/y direction and slice spacing varying from 1 to 3 mm. All data have liver segmentation labels. To maintain consistency of the network input, the data are first resampled to a 1 mm voxel size using the same sampling method as for the lung data, and the image is then center cropped to 128×128×128 according to the liver anatomical structure label for training. Subsequently, each CT image is randomly paired with other CT images in the dataset. We use MSD for training, SLIVER for validation, and LITS for testing.
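The resampling and cropping steps can be illustrated with SciPy (trilinear interpolation for images, nearest neighbour for labels). The function name and padding behaviour are assumptions, and in the liver task the crop is guided by the liver label rather than the volume centre as simplified here.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_and_center_crop(volume, spacing_mm, target_shape=(128, 128, 128), is_label=False):
    """Resample a volume to 1 mm isotropic voxels, then centre-crop to the network input size.

    volume: numpy array (D, H, W); spacing_mm: original voxel size along each axis in mm.
    """
    factors = [s / 1.0 for s in spacing_mm]                       # scale each axis to 1 mm voxels
    resampled = zoom(volume, factors, order=0 if is_label else 1)  # order 1: trilinear, 0: nearest
    starts = [max((d - t) // 2, 0) for d, t in zip(resampled.shape, target_shape)]
    crop = tuple(slice(s, s + t) for s, t in zip(starts, target_shape))
    return resampled[crop]                                        # assumes the volume is large enough
```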

Evaluation indicators

Following previous research (30,42), we use four metrics to evaluate the image registration effect quantitatively:

  • Structural similarity index measure (SSIM): to reflect the overall effect of registration, we calculate the structural similarity between the fixed image and the moving image deformed by the deformation vector field.
  • Dice similarity coefficient (DICE) score: to calculate the percentage of overlap in the corresponding anatomical label as follows:

    \text{DICE}(F_i, M_i) = \frac{2\left|F_i \cap M_i\right|}{\left|F_i\right| + \left|M_i\right|}

    where Fi and Mi are the anatomical structures labeled corresponding to the fixed and moving images, respectively.
  • Target registration error (TRE): TRE is the average distance between warped landmarks and the fixed image landmarks in CT-CBCT registration.
  • Percentage of voxels with a negative Jacobian determinant (|Jϕ|≤0): we count the voxels with a negative Jacobian determinant in the deformation vector field and calculate their percentage of the total deformation vector field (a computational sketch of the DICE and |Jϕ|≤0 metrics follows this list).
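A minimal sketch of the DICE and negative-Jacobian metrics on numpy arrays (variable names are illustrative):

```python
import numpy as np

def dice(fixed_label, warped_label):
    """DICE = 2|F ∩ M| / (|F| + |M|) for binary anatomical labels."""
    intersection = np.logical_and(fixed_label, warped_label).sum()
    return 2.0 * intersection / (fixed_label.sum() + warped_label.sum() + 1e-8)

def negative_jacobian_percentage(jacobian_det):
    """Percentage of voxels whose Jacobian determinant of the deformation field is <= 0."""
    return 100.0 * np.mean(jacobian_det <= 0)
```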

Experimental setup

Two essential packages are used to deploy the model: Python 3.6 and PyTorch 1.6. The model is trained using the Adam optimizer, with the learning rate set to 0.0002 and the batch size to 2. All experiments are run on a cluster with NVIDIA Tesla V100 GPUs with 32 GB of GPU memory.

To rigorously assess the efficacy of our network, we conduct comparative analyses with two well-established traditional algorithms and three cutting-edge deep learning-based algorithms. We employ advanced normalization tools (ANTs) (42), renowned for its robust image normalization capabilities; deedsBCV (43), which excels in discrete voxel-based image registration; TransMorph (20), which utilizes a Transformer framework for deformable registration; Volume Tweening Network (VTN) (30), which incorporates volumetric transformations for enhanced spatial accuracy; and VoxelMorph (37), a deep CNN approach tailored for efficient and scalable application in clinical settings. The experimental results on the lung and liver datasets are shown in Tables 1,2, respectively.

Table 1

Comparison of registration performance between the proposed SPW-TransUNet and other mainstream algorithms on lung data

Methods TRE (mm)↓ SSIM (%)↑ DICE (%)↑ |Jϕ|≤0 (%)↓
ANTs (42) 2.55 76.63 79.33 0.152
deedsBCV (43) 3.95 75.44 77.15 0.215
TransMorph (20) 2.31 85.14 87.40 0.197
VoxelMorph (37) 2.29 84.81 85.91 0.337
VTN (30) 2.21 84.24 86.59 0.169
SPW-TransUNet 2.16 86.87 88.28 0.126

The bold numbers denote the best scores. A downward arrow indicates that a lower metric is better, and an upward arrow indicates that a higher metric is better. SPW, spatial perpendicular window; TRE, target registration error; SSIM, structural similarity index measure; DICE, Dice similarity coefficient; ANTs, advanced normalization tools; VTN, Volume Tweening Network.

Table 2

Comparison of registration performance between the proposed SPW-TransUNet and other mainstream algorithms on liver data

Methods SSIM (%)↑ DICE (%)↑ |Jϕ|≤0 (%)↓
ANTs (42) 70.56 81.29 0.217
deedsBCV (43) 72.33 84.07 0.265
TransMorph (20) 74.48 84.83 0.301
VoxelMorph (37) 72.01 83.62 0.315
VTN (30) 72.57 84.05 0.271
SPW-TransUNet 76.92 85.77 0.229

The bold numbers denote the best scores. A downward arrow indicates that a lower metric is better, and an upward arrow indicates that a higher metric is better. SPW, spatial perpendicular window; SSIM, structural similarity index measure; DICE, Dice similarity coefficient; ANTs, advanced normalization tools; VTN, Volume Tweening Network.

Ablation experiments are performed on the liver datasets to verify the effectiveness of the designed network components. Table 3 shows the six sets of experiments: (I) Swin-TransUNet: we replace the attention method in SPW-MSA with the shifted-window local self-attention used in the Swin-Transformer (25); (II) Cross-TransUNet: we replace the attention method in SPW-MSA with the dilated-window local self-attention used in CrossFormer (27); (III) Orthogonal-TransUNet: we replace the attention method in SPW-MSA with the attention of the Orthogonal-Transformer (28), computed alternately in the dimensionally reduced orthogonal space and in local windows; (IV) ViT-TransUNet: we replace the attention method of SPW-MSA with ViT's global self-attention (24); (V) without SLMC (w/o SLMC): we remove the SLMC structure and adopt a fixed temperature parameter instead of the learned λ; (VI) the complete structure, i.e., SPW-TransUNet. Finally, as shown in Table 4, the numbers of parameters of the proposed SPW-TransUNet and the other deep learning-based methods are compared. In addition, to evaluate the hyperparameter settings of the loss function, comparative experiments with different hyperparameter settings are added, as shown in Table 5. It can be seen that the selected hyperparameters obtain the optimal results.

Table 3

Ablation experiment on liver data

Methods DICE (%)↑ |Jϕ|≤0 (%)↓
Swin-TransUNet (25) 84.62 0.304
Cross-TransUNet (27) 84.80 0.281
Orthogonal-TransUNet (28) 83.40 0.265
ViT-TransUNet (24) 84.37 0.276
w/o SLMC 85.26 0.277
Complete structure 85.77 0.229

The bold numbers denote the best scores. A downward arrow indicates that a lower metric is better, and an upward arrow indicates that a higher metric is better. DICE, Dice similarity coefficient; w/o, without; SLMC, self-learning mapping control.

Table 4

The amounts of parameters of the various methods

Methods Params (M)
TransMorph (20) 44.72
VTN (30) 27.63
Swin-T (25) 29.70
Cross-T (27) 34.53
Orthogonal-T (28) 24.61
Ours 21.59

The bold numbers denote the best scores. VTN, Volume Tweening Network; T, Transformer.

Table 5

Effects of different loss hyperparameters on liver registration results

λ1 \ λ2 0.1 0.2 0.4 0.6 0.8 1
100 83.21 84.33 84.11 84.20 85.38 85.08
200 84.05 84.31 84.56 84.14 85.54 85.19
400 84.73 85.11 84.51 85.64 85.77 85.50
600 84.85 84.97 84.89 85.40 85.08 85.27
800 84.33 85.38 85.13 85.62 84.91 85.03
1,000 83.28 83.67 84.17 84.51 85.04 84.22

The bold numbers denote the best scores. The values in the table represent the DICE score percentage of the registration results. DICE, Dice similarity coefficient.


Results

In intra-patient lung CT-CBCT image registration, the quantitative evaluation of the experimental results is shown in Table 1. The proposed SPW-TransUNet obtains the highest DICE score and SSIM index and the lowest TRE. Our method also achieves the lowest negative Jacobian percentage among the compared methods (Table 1), and it is superior to the traditional iterative algorithms in the other indicators. It is worth noting that even when a traditional iterative algorithm is close to our method in quantitative indicators, it is challenging for it to meet real-time clinical requirements due to its need for iterative optimization. Meanwhile, our method achieved a DICE score 0.88% higher than the Transformer-based TransMorph, which demonstrates that our SPW-Transformer architecture is superior to previous methods. In Figure 9, we show sample data pairs with significant local intensity differences; the labels generated by our method are close to the labels of the fixed images, and the overall registration effect is better than that of the comparative methods.

Figure 9 Lung experimental results sample. (c) and (d) are the lung segmentation labels of the (a) and (b) images, respectively. The columns on the right side of the original images and labels show examples from the comparative algorithms, including the images and lung labels warped by the deformation vector field. ANTs, advanced normalization tools; VTN, Volume Tweening Network; SPW, spatial perpendicular window.

The quantitative results for liver CT image registration are shown in Table 2. Unlike lung registration, the liver registration pairs comprise images from different patients; the target anatomical structures exhibit large shape differences, which requires a higher capacity of the network to handle large displacements. In this scenario, our algorithm is still ahead of the other algorithms: in Table 2, SPW-TransUNet obtains the highest SSIM index and DICE score. Compared with the other methods, the proposed method improves the SSIM score more than the DICE score, showing that SPW-TransUNet registers the overall anatomical structure of the image pair well. We conducted comprehensive statistical analyses using paired two-tailed t-tests to evaluate the significance of the differences in performance metrics between our proposed SPW-TransUNet and the best-performing baseline method, TransMorph. The analyses were performed using R statistical software (44).

For the lung dataset, using four-fold cross-validation, we calculated the mean performance metrics for each fold. Our method showed statistically significant improvements over the best-performing baseline (TransMorph) in DICE (t[3]=3.70, P=0.034) and SSIM (t[3]=4.12, P=0.026), together with a reduction in TRE (t[3]=−2.20, P=0.113). The negative t value for TRE indicates a reduction in error, which is desirable; although the P value for TRE is above the conventional 0.05 threshold, the trend suggests an improvement. Similarly, for the liver dataset, our method demonstrated statistically significant improvements in DICE (t[149]=2.45, P=0.015) and SSIM (t[149]=3.89, P=0.0002). These P values are below the 0.05 significance threshold, confirming the enhanced effectiveness and reliability of our approach for both medical image registration tasks.
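The statistical comparison above was run in R (44); an equivalent paired two-tailed t-test can be sketched in Python as follows (the numbers are placeholders, not the study's data):

```python
from scipy import stats

# per-fold (lung) or per-case (liver) scores of the two methods being compared
ours = [88.1, 88.5, 88.3, 88.2]        # placeholder values
baseline = [87.3, 87.6, 87.4, 87.3]    # placeholder values

t_stat, p_value = stats.ttest_rel(ours, baseline)  # paired two-tailed t-test
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```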

It should be noted that the average time for our method to complete an inference on the liver data is 0.427 seconds, whereas TransMorph, the algorithm whose performance is closest to ours, takes 2.619 seconds. The difference between our algorithm and the best-performing algorithm in the Jacobian folding percentage is less than 0.5%, which is within an acceptable range. The results of the various algorithms are shown in Figure 10. The liver label of the warped image generated by our method is more similar to that of the fixed image, and the edges of the labels are smoother.

Figure 10 Liver experimental results sample. (c) and (d) are the liver segmentation labels of the (a) and (b) images, respectively. The columns on the right side of the original images and labels show examples from the comparative algorithms, including the images and liver labels warped by the deformation vector field. ANTs, advanced normalization tools; VTN, Volume Tweening Network; SPW, spatial perpendicular window.

The results of the ablation experiments are shown in Table 3. Regarding the self-attention method, we conducted experiments with representative global and local self-attention methods, respectively. The proposed approach achieves the best DICE score and the lowest Jacobian folding percentage, demonstrating the accuracy advantage of the SPW-Transformer. In addition, when the SLMC is removed, i.e., a fixed temperature parameter is used, the experimental results show that the fixed temperature parameter cannot adaptively adjust the mapping process. This limits the model's adaptability to different features and eventually leads to a decline in registration accuracy.

Table 4 shows the parameter counts of our model and the other deep learning methods in Tables 2,3. Under our experimental conditions, the proposed model has fewer parameters than the other deep learning-based models. In addition, the proposed model can be easily adapted to different tasks by adjusting its hyperparameters, such as the number of attention heads in each Transformer block, the number of blocks, and the window width.


Discussion

This paper proposes an unsupervised medical image registration network called SPW-TransUNet, which integrates a SPW-Transformer into the UNet framework (14). Compared with other mainstream medical image registration algorithms, our method achieves optimal registration results on datasets involving lung CT-CBCT image registration and liver CT image registration tasks. In SPW-TransUNet, instead of directly using down-sampling, we utilize modified Conv-Blocks to handle low-dimensional features effectively. This strategy reduces the loss of feature information and enhances registration accuracy. The essence of the Transformer lies in its self-attention mechanism for computing connections between features, which is appropriate for registration tasks that require modeling dense nonlinear relationships between moving and fixed images. Based on this, the SPW-Transformer is designed to handle long-range dependencies in high-dimensional features by computing self-attention within SPWs.

The SPW-Transformer expands the receptive field by computing self-attention within SPWs, allowing for better modeling of long-range dependencies without incurring prohibitive computational costs. More importantly, the size of the SPW can be adjusted for different data types, enabling the network to adapt to various medical imaging scenarios. Additionally, introducing a miniature CNN for adaptive computation of the temperature parameter in the attention mapping process, referred to as SLMC, enhances the flexibility and robustness of feature matching.

Despite the promising results, there are areas for improvement in our current research. One limitation is the types of registration data currently available; the effectiveness of registration on modalities and organs other than the lungs and liver remains to be explored. Unsupervised medical image registration relies heavily on the assumption that intensity similarities correspond to anatomical correspondences. This assumption may not hold in cases of pathology (e.g., tumors, lesions) where the anatomy is altered. Consequently, the registration effect may not be ideal for images with significant intensity differences or without anatomical labels. This challenge could be better addressed by exploring new similarity functions or incorporating semi-supervised or weakly supervised approaches that include anatomical priors or segmentation labels to enhance performance in such scenarios (45,46).

Additionally, although our method is more efficient than some Transformer-based models, the computational demands are still significant, especially for higher-resolution images or when scaling to whole-body scans. Future work could focus on optimizing the network architecture for better resource utilization, potentially through model compression techniques or more efficient algorithm designs (47). When GPU memory is plentiful, transitioning to a cascading architecture could theoretically improve performance by utilizing multiple layers of deformable registration models to achieve more significant registration accuracy (30).

Furthermore, Figures 9,10 illustrate that the performance on edge information registration does not surpass the overall registration effect as measured by the DICE score. This result may be because the network places a higher emphasis on high-intensity anatomical structures during registration, potentially neglecting finer details at the edges. To address this phenomenon, we intend to investigate edge-preserving techniques or edge-aware loss functions and validate our approach on different imaging modalities in future work.


Conclusions

This paper proposes SPW-TransUNet, an unsupervised medical image registration network with a SPW-Transformer. This method enlarges the network's receptive field by computing self-attention within SPWs in parallel without significantly increasing the computational cost, enhancing the network's ability to model long-range positional relationships between features even in the presence of significant local intensity differences. Additionally, a novel SLMC is designed to adaptively adjust the attention map through a miniature CNN that controls the self-attention mapping process. The proposed method shows significant advantages over existing mainstream methods and provides a practical reference solution for automatic CT and CBCT image registration in conventional clinics.


Acknowledgments

Funding: This work was supported by the National Key Research and Development Program of China (No. 2019YFE120100), the National Natural Science Foundation (NSF) of China (No. 11975312), and the 2020 Major Scientific Research Problems and Medical Technology Program of China Medical Education Association (No. 2020KTZ003).


Footnote

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1138/coif). All authors report that this work was supported by the National Key Research and Development Program of China (No. 2019YFE120100), the National Natural Science Foundation (NSF) of China (No. 11975312), and the 2020 Major Scientific Research Problems and Medical Technology Program of China Medical Education Association (No. 2020KTZ003). The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Hill DL, Batchelor PG, Holden M, Hawkes DJ. Medical image registration. Phys Med Biol 2001;46:R1-45. [Crossref] [PubMed]
  2. Xiao H, Xue X, Zhu M, Jiang X, Xia Q, Chen K, Li H, Long L, Peng K. Deep learning-based lung image registration: A review. Comput Biol Med 2023;165:107434. [Crossref] [PubMed]
  3. Stock M, Pasler M, Birkfellner W, Homolka P, Poetter R, Georg D. Image quality and stability of image-guided radiotherapy (IGRT) devices: A comparative study. Radiother Oncol 2009;93:1-7. [Crossref] [PubMed]
  4. Teng X, Chen Y, Zhang Y, Ren L. Respiratory deformation registration in 4D-CT/cone beam CT using deep learning. Quant Imaging Med Surg 2021;11:737-48. [Crossref] [PubMed]
  5. Duan L, Ni X, Liu Q, Gong L, Yuan G, Li M, Yang X, Fu T, Zheng J. Unsupervised learning for deformable registration of thoracic CT and cone-beam CT based on multiscale features matching with spatially adaptive weighting. Med Phys 2020;47:5632-47. [Crossref] [PubMed]
  6. Schulze R, Heil U, Gross D, Bruellmann DD, Dranischnikow E, Schwanecke U, Schoemer E. Artefacts in CBCT: a review. Dentomaxillofac Radiol 2011;40:265-73. [Crossref] [PubMed]
  7. Kang SR, Shin W, Yang S, Kim JE, Huh KH, Lee SS, Heo MS, Yi WJ. Structure-preserving quality improvement of cone beam CT images using contrastive learning. Comput Biol Med 2023;158:106803. [Crossref] [PubMed]
  8. Marstal K, Berendsen F, Staring M, Klein S. SimpleElastix: A user-friendly, multi-lingual library for medical image registration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016:134-42.
  9. Oliveira FP, Tavares JM. Medical image registration: a review. Comput Methods Biomech Biomed Engin 2014;17:73-93. [Crossref] [PubMed]
  10. Park S, Plishker W, Quon H, Wong J, Shekhar R, Lee J. Deformable registration of CT and cone-beam CT with local intensity matching. Phys Med Biol 2017;62:927-47. [Crossref] [PubMed]
  11. Nithiananthan S, Schafer S, Uneri A, Mirota DJ, Stayman JW, Zbijewski W, Brock KK, Daly MJ, Chan H, Irish JC, Siewerdsen JH. Demons deformable registration of CT and cone-beam CT using an iterative intensity matching approach. Med Phys 2011;38:1785-98. [Crossref] [PubMed]
  12. Hermann S, Werner R. High accuracy optical flow for 3D medical image registration using the census cost function. In: Image and Video Technology: 6th Pacific-Rim Symposium, PSIVT 2013, Guanajuato, Mexico, October 28-November 1, 2013. Proceedings 6. Springer Berlin Heidelberg; 2014:23-35.
  13. Liang X, Morgan H, Nguyen D, Jiang S. Deep learning based CT-to-CBCT deformable image registration for autosegmentation in head and neck adaptive radiation therapy. arXiv:2102.00590. 2021. Available online: https://arxiv.org/abs/2102.00590
  14. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing; 2015:234-41.
  15. Wang H, Ni D, Wang Y. Recursive Deformable Pyramid Network for Unsupervised Medical Image Registration. IEEE Trans Med Imaging 2024;43:2229-40. [Crossref] [PubMed]
  16. Hu R, Yan H, Nian F, Mao R, Li T. Unsupervised computed tomography and cone-beam computed tomography image registration using a dual attention network. Quant Imaging Med Surg 2022;12:3705-16. [Crossref] [PubMed]
  17. Sang Y, Ruan D. 4D-CBCT registration with a FBCT-derived plug-and-play feasibility regularizer. In: Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part IV 24. Springer International Publishing; 2021:108-17.
  18. He Y, Li T, Ge R, Yang J, Kong Y, Zhu J, Shu H, Yang G, Li S. Few-Shot Learning for Deformable Medical Image Registration With Perception-Correspondence Decoupling and Reverse Teaching. IEEE J Biomed Health Inform 2022;26:1177-87. [Crossref] [PubMed]
  19. Luo W, Li Y, Urtasun R, Zemel R. Understanding the effective receptive field in deep convolutional neural networks. Adv Neural Inf Process Syst 2016;29:4898-906.
  20. Chen J, Frey EC, He Y, Segars WP, Li Y, Du Y. TransMorph: Transformer for unsupervised medical image registration. Med Image Anal 2022;82:102615. [Crossref] [PubMed]
  21. Liu Z, Lv Q, Yang Z, Li Y, Lee CH, Shen L. Recent progress in transformer-based medical image analysis. Comput Biol Med 2023;164:107268. [Crossref] [PubMed]
  22. Chen J, He Y, Frey EC, Li Y, Du Y. Vit-v-net: Vision transformer for unsupervised volumetric medical image registration. arXiv:2104.06468. 2021. Available online: https://arxiv.org/abs/2104.06468
  23. Mok TCW, Chung A. Affine medical image registration with coarse-to-fine vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:20835-44.
  24. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. 2020. Available online: https://arxiv.org/abs/2010.11929
  25. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:10012-22.
  26. Ho J, Kalchbrenner N, Weissenborn D, Salimans T. Axial attention in multidimensional transformers. arXiv:1912.12180. 2019. Available online: https://arxiv.org/abs/1912.12180
  27. Liang X, Yang E, Deng C, Yang Y. CrossFormer: Cross-modal Representation Learning via Heterogeneous Graph Transformer. ACM Trans Multimedia Comput Commun Appl 2024; [Crossref]
  28. Huang H, Zhou X, He R. Orthogonal transformer: An efficient vision transformer backbone with token orthogonalization. Adv Neural Inf Process Syst 2022;35:14596-607.
  29. Wang F, Liu H. Understanding the behaviour of contrastive loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021:2495-504.
  30. Zhao S, Lau T, Luo J, Chang EI, Xu Y. Unsupervised 3D End-to-End Medical Image Registration With Volume Tweening Network. IEEE J Biomed Health Inform 2020;24:1394-404. [Crossref] [PubMed]
  31. Ma N, Zhang X, Sun J. Funnel activation for visual recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer International Publishing; 2020:351-68.
  32. Xiao T, Singh M, Mintun E, Darrell T, Dollár P, Girshick R. Early convolutions help transformers see better. Adv Neural Inf Process Syst 2021;34:30392-400.
  33. Ramachandran P, Parmar N, Vaswani A, et al. Stand-alone self-attention in vision models. Adv Neural Inf Process Syst 2019. Available online: https://proceedings.neurips.cc/paper/2019/hash/3416a75f4cea9109507cacd8e2f2aefc-Abstract.html
  34. He YL, Zhang XL, Ao W, Huang JZ. Determining the optimal temperature parameter for Softmax function in reinforcement learning. Appl Soft Comput 2018;70:80-5.
  35. Ebrahimi M, Cheong H, Jayaraman PK, Javid F. Optimal design of frame structures with mixed categorical and continuous design variables using the Gumbel-Softmax method. Struct Multidiscipl Optim 2024;67:31.
  36. Xin Y, Chen Y, Ji S, Han K, Xie X. On-the-Fly Guidance Training for Medical Image Registration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland; 2024:694-705.
  37. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans Med Imaging 2019; Epub ahead of print. [Crossref]
  38. Hugo GD, Weiss E, Sleeman WC, Balik S, Keall PJ, Lu J, et al. Data from 4D lung imaging of NSCLC patients (Version 2). The Cancer Imaging Archive; 2016. Available online: https://doi.org/10.7937/K9/TCIA.2016.ELN8YGLE
  39. Bilic P, Christ P, Li HB, Vorontsov E, Ben-Cohen A, Kaissis G, et al. The Liver Tumor Segmentation Benchmark (LiTS). Med Image Anal 2023;84:102680. [Crossref] [PubMed]
  40. Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, et al. The Medical Segmentation Decathlon. Nat Commun 2022;13:4128. [Crossref] [PubMed]
  41. Heimann T, van Ginneken B, Styner MA, Arzhaeva Y, Aurich V, Bauer C, et al. Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE Trans Med Imaging 2009;28:1251-65. [Crossref] [PubMed]
  42. Avants BB, Tustison N, Song G. Advanced normalization tools (ANTS). Insight J 2009;2:1-35.
  43. Heinrich MP, Maier O, Handels H. Multi-modal multi-atlas segmentation using discrete optimisation and self-adaptive risk. In: Medical Image Computing and Computer-Assisted Intervention-MICCAI 2013. Springer; 2013:315-22.
  44. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2021.
  45. Gao X, Zhong W, Wang R, Heimann AF, Tannast M, Zheng G. MAIRNet: weakly supervised anatomy-aware multimodal articulated image registration network. Int J Comput Assist Radiol Surg 2024;19:507-17. [Crossref] [PubMed]
  46. Zhou H, Chen H, Yu B, Pang S. Expert Syst Appl 2024;237:121379.
  47. Cheng H, Zhang M, Shi JQ. A Survey on Deep Neural Network Pruning: Taxonomy, Comparison, Analysis, and Recommendations. IEEE Trans Pattern Anal Mach Intell 2024;46:10558-78. [Crossref] [PubMed]
Cite this article as: Hu R, Yang S, Zhang J, Hu X, Li T. SPW-TransUNet: three-dimensional computed tomography-cone beam computed tomography image registration with spatial perpendicular window Transformer. Quant Imaging Med Surg 2024;14(12):9506-9521. doi: 10.21037/qims-24-1138
