St-RegSeg: an unsupervised registration-based framework for multimodal magnetic resonance imaging stroke lesion segmentation
Introduction
Stroke is one of the leading causes of death and disability worldwide (1). Ischemic stroke, which accounts for 75–90% of stroke cases, is caused by various cerebrovascular diseases that obstruct the cerebral blood supply, leading to ischemic and hypoxic necrosis of local brain tissue and the rapid onset of corresponding neurological deficits. When magnetic resonance imaging (MRI) is used to assess stroke lesions, the three most important modalities are diffusion-weighted imaging (DWI), the apparent diffusion coefficient (ADC), and fluid-attenuated inversion recovery (FLAIR). DWI, which generates images by monitoring the free diffusion of water molecules within tissue, is the gold standard MRI modality for evaluating ischemic stroke. ADC is a parameter map derived from DWI data that can quantify the extent of brain tissue damage. FLAIR suppresses the cerebrospinal fluid signal, enhancing the contrast between gray and white matter. These three modalities complement one another and, when combined, provide more comprehensive information, allowing lesion boundaries to be delineated more accurately.
Although MRI encompasses multiple modalities, segmentation labels are often annotated in only a single modality. Furthermore, differences in spatial alignment and semantic content between modalities mean that labels annotated in one modality may not be directly applicable in another. In medical imaging, annotating segmentation labels is exceptionally challenging and time-consuming, requiring substantial human and material resources. To overcome the accuracy limitations of single-modality segmentation, multimodal segmentation networks have emerged as a promising solution.
Currently, multimodal image segmentation primarily relies on two frameworks. The first uses multimodal fusion technology to integrate information from multiple modalities before decoding; the characteristic of this type of network is the inclusion of a feature fusion module (2). The second framework first employs a registration algorithm to align images from different modalities, and then feeds the registered images, together with the fixed images, into the segmentation model through early fusion. Both frameworks hold significant research value; this paper focuses on the second.
Traditional registration algorithms involve an iterative optimization process that estimates a smooth mapping between points in one image and corresponding points in another: an image similarity measure is computed, and an appropriate optimization method iteratively refines the transformation until the registered images achieve maximum similarity (3). However, each iteration incurs a high computational cost, so these methods consume a significant amount of time in practical applications and face challenges in balancing registration accuracy against runtime.
To address these issues, recent research has explored deep learning (DL) methods for image registration. Many of these techniques construct predictive deformable spatial transformations (STs), that is, they predict the pixel correspondence between image pairs from image patches (4). More recent work has proposed directly predicting STs from complete input image pairs (5). Although these methods have shown precise registration results, they fall within the category of supervised methods: they require registration labels for model training, and obtaining such labels is highly challenging.
Unsupervised registration aligns images based on their inherent features, without labels or prior information. It primarily employs spatial transformer networks, using the generated registration fields to deform the moving images. The training of unsupervised registration networks relies on an image similarity loss and a smoothness loss on the registration field (6). de Vos et al. introduced the first unsupervised registration network, DIRNet (7), which consists of three components: a regressor, a spatial transformer network, and a resampler (8). Among recent unsupervised learning models, VoxelMorph (9) has shown superior performance. It combines convolutional neural networks (CNNs) with an ST, leveraging a probabilistic generative model for representation learning with CNNs and image reconstruction with the ST, achieving high computational efficiency without the need for extensive annotated data (10). In recent years, vision transformers (ViTs) (11) have emerged as powerful models in visual recognition, rivaling CNNs across various computer vision tasks. Based on ViT, Chen et al. introduced a new unsupervised registration network, ViT-V-Net (12). To handle large images more effectively, the Swin transformer adopts a window-partition strategy to divide images and establishes multi-level cross-layer connections between different partitions (13). To overcome the limitations of CNN architectures in modeling long-distance spatial relationships in images, Chen et al. developed TransMorph (14), a hybrid Transformer-ConvNet framework based on Swin transformer modules, specifically for image registration. For effective handling of semantic correspondences and deformable registration, Zhu and Lu proposed Swin-VoxelMorph (15), a symmetric unsupervised network based on Swin-Unet (16) that simultaneously estimates forward and backward transformations while minimizing image dissimilarity. Other previously proposed unsupervised registration networks include SYMNet (17) and CycleMorph (18).
Cascaded registration networks are primarily composed of two parts: a coarse registration network and a fine registration network. Each registration sub-network is responsible for aligning the fixed image with the moving image. Subsequently, the predicted registration field is used to deform the moving image, and the deformed image is passed to the next cascaded sub-network, ultimately yielding the registration result.
Cascaded registration networks have been explored in multiple studies (19-21). Zhao et al. proposed a coarse-to-fine recursive cascading approach for registration (19), in which each sub-network takes the deformed image and the original fixed image as inputs, so that the parameters of all sub-networks are updated through backpropagation. The coarse-to-fine cascading approach proposed in (20) cascades different sub-networks, transitioning the image from a global to a local perspective. In contrast, (21) introduced the concept of recursive cascading, taking both the deformed image and the original fixed image as inputs for the next cascade, thus differentiating it from previous studies.
To maximize the utilization of multimodal information during the segmentation process, a feasible solution is to employ registration-based segmentation. Ma et al. proposed the registration-guided DL (RgDL) segmentation framework (22). The RgDL segmentation model primarily consists of two steps: registration-based contour propagation and DL-based segmentation. In the former, contours are generated and propagated by registering computed tomography (CT) with cone beam CT (CBCT); in the latter, the propagated contours are used as inputs to guide the DL model toward accurate segmentation. Cabezas et al. reviewed atlas-based segmentation for brain MRI images (23), in which atlas labels are propagated onto the target image after the atlas template is registered to the target image. To improve lesion segmentation accuracy on the ISLES'22 dataset, Wu et al. proposed W-Net (24), which uses the ANTs registration algorithm for alignment and then fuses the registered images as input to the segmentation network. Zhu et al. introduced NeurReg (25), a multi-task network capable of simultaneously addressing registration and segmentation tasks.
In terms of ImageNet top-1 accuracy, Common Objects in Context (COCO) object detection, and ADE20K semantic segmentation, ConvNeXt outperforms Swin transformers (13). Enhancements and applications based on the ConvNeXt architecture have become prevalent in computer vision, with examples including HoVerNet (26), LACN (27), Nextformer (28), InceptionNeXt (29), and others.
In this study, drawing inspiration from the ConvNeXt module, we slightly modified the ConvNeXt module and named it the ConvNeXt-R module to adapt it for unsupervised registration tasks. The ConvNeXt-R module is used in an encoder-decoder structure to generate a deformable field. Based on the 3D-ConvNeXt-R encoder-decoder structure, the unsupervised registration network ConvNXMorph was further proposed. To further enhance the precision of the registration model, an attention gate (AG) mechanism and a cascaded registration network were introduced. Finally, the proposed cascaded ConvNXMorph + nnUNet-v2 structure was named St-RegSeg.
The primary contributions of this paper are as follows:
- This paper proposes a registration network with two fixed images and one moving image as input, thereby incorporating multimodal image information in an unsupervised registration network and enhancing registration accuracy.
- This paper proposes an unsupervised cascaded registration network named ConvNXMorph, which leverages the cascading concept and designs the ConvNeXt-R module as the backbone architecture.
- This paper proposes an unsupervised registration-based multimodal MRI stroke segmentation algorithm, named St-RegSeg, which can be employed for both the registration and segmentation of multimodal MRIs.
Methods
Datasets
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The ISLES 2022 dataset originates from the ISLES’22 Challenge (30). It consists of three-dimensional (3D) multimodal MRI data obtained from three centers, specifically designed for evaluating automatic segmentation methods for acute and subacute ischemic stroke lesions. The MRI modalities include DWI, ADC, and FLAIR sequences. The dataset contains a total of 250 cases, each of which has undergone skull stripping and expert annotation under the DWI sequence. Notably, the ADC sequence is derived from the DWI sequence and is essentially the same modality, ensuring that there is no displacement bias between ADC and DWI images, negating the need for registration between them. Due to differences in scanning parameters, geometric discrepancies exist between the FLAIR and DWI sequences during the acquisition process, leading to label shifts when loading corresponding labels. For example, in Figure 1, the stroke lesion labels in the FLAIR sequence are mistakenly marked in the ventricles.
St-RegSeg
In this study, a multimodal MRI stroke lesion segmentation framework based on unsupervised registration, named St-RegSeg, is proposed. It integrates cascaded ConvNXMorph and nnUNet-v2, as illustrated in Figure 2.
ConvNXMorph
Suppose $I_{f1}$ and $I_{f2}$ denote the two fixed images (the ADC and DWI volumes), $I_m$ denotes the moving image (the FLAIR volume), and $\phi$ denotes the deformation field predicted by the registration network. The goal of registration is to find $\phi$ such that the warped image $I_m \circ \phi$ is spatially aligned with the fixed images.
In this study, drawing on the ConvNeXt module, we made slight modifications to better adapt it to the registration task on this dataset. As shown in Figure 4, the modified module is named ConvNeXt-R. Specifically, the layer normalization (Layer Norm) within ConvNeXt is replaced with batch normalization (Batch Norm), an additional Batch Norm is inserted after the final convolution, and a Gaussian error linear unit (GELU) activation function is applied after the residual connection, enhancing the module's capacity to handle the characteristics of the dataset.
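For concreteness, a minimal PyTorch sketch of such a block is given below. The 7×7×7 depthwise kernel and the 4× channel expansion follow the original ConvNeXt design and are assumptions here, since the paper does not list these hyperparameters.

```python
import torch.nn as nn

class ConvNeXtR(nn.Module):
    """Sketch of the ConvNeXt-R block described above: Layer Norm replaced
    by Batch Norm, an extra Batch Norm after the final convolution, and a
    GELU applied after the residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm1 = nn.BatchNorm3d(dim)        # replaces Layer Norm
        self.pwconv1 = nn.Conv3d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv3d(4 * dim, dim, kernel_size=1)
        self.norm2 = nn.BatchNorm3d(dim)        # Batch Norm after the final conv

    def forward(self, x):
        y = self.dwconv(x)
        y = self.norm1(y)
        y = self.act(self.pwconv1(y))
        y = self.norm2(self.pwconv2(y))
        return self.act(x + y)                  # GELU after the residual connection
```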
As illustrated in Figure 5, the ConvNeXt-R encoder-decoder architecture takes an input of dimensions 144×224×224×3, formed by the early fusion (channel-wise concatenation) of the two fixed images $I_{f1}$ and $I_{f2}$ and the moving image $I_m$.
AG
To effectively capture both low-level and high-level semantic features of MRI, integrating low-level semantic information during the upsampling process is crucial. However, the direct skip-connection strategy employed by ConvNeXt-R is overly simplistic and falls short of this objective. Inspired by the gating mechanism found in long short-term memory (LSTM) networks, this study introduces a lightweight attention mechanism, termed the AG, as depicted in Figure 6. After the upsampling step, to integrate feature information from different levels more effectively, the high-level semantic information $g$ is used as a gating signal to reweight the low-level features $x$ arriving through the skip connection:

$$\alpha = \sigma\big(\psi(\mathrm{ReLU}(W_g g + W_x x))\big), \qquad \hat{x} = \alpha \cdot x$$

where $W_g$ and $W_x$ are learnable 1×1×1 convolutions, $\psi$ projects the joint features to a single-channel attention map, $\sigma$ is the sigmoid function, and $\hat{x}$ is the gated low-level feature map passed to the decoder.
This approach effectively integrates low- and high-level semantic information. It enhances the robustness of the registration and filters out noise in the low-level semantic features, allowing the network to automatically focus on and align the relevant features between two or more images during registration.
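As an illustration, a lightweight AG of this kind can be sketched as follows, using the additive attention-gate formulation above; the channel sizes are placeholders, and the exact layer configuration in St-RegSeg may differ.

```python
import torch.nn as nn

class AttentionGate(nn.Module):
    """Gates low-level skip features x with high-level features g
    (assumed to have been upsampled to the same spatial size)."""

    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv3d(g_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv3d(x_ch, inter_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.psi = nn.Sequential(nn.Conv3d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, g, x):
        # Attention map in (0, 1), shape (B, 1, D, H, W).
        alpha = self.psi(self.relu(self.w_g(g) + self.w_x(x)))
        return x * alpha  # irrelevant low-level responses are suppressed
```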
nnUNet-v2
In this study, the universal segmentation model nnUNet-v2 (31) was employed owing to its strong generalizability and stable segmentation results. Within nnUNet-v2, each encoder and decoder stage comprises two convolutions with a kernel size of 3×3×3, each followed by group normalization and a LeakyReLU activation function. nnUNet-v2 has achieved state-of-the-art (SOTA) performance in six recognized segmentation challenges.
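A sketch of one such stage is shown below; the group count for the group normalization is an assumption, as the paper does not state it.

```python
import torch.nn as nn

def nnunet_stage(in_ch, out_ch, groups=8):
    """One encoder/decoder stage as described: two 3x3x3 convolutions,
    each followed by group normalization and LeakyReLU.
    out_ch must be divisible by `groups`."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.LeakyReLU(inplace=True),
    )
```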
Loss function
In the coarse registration stage, the loss function is as follows:

$$\mathcal{L}_{coarse} = \mathcal{L}_{NCC}\big(I_f, I_m \circ \phi\big) + \lambda\, \mathcal{L}_{smooth}(\phi)$$

In the fine registration stage, the loss function is as follows:

$$\mathcal{L}_{fine} = \mathcal{L}_{MIND}\big(I_f, I_m \circ \phi\big) + \lambda\, \mathcal{L}_{smooth}(\phi)$$

where $I_f$ denotes the fixed image, $I_m \circ \phi$ denotes the moving image warped by the predicted registration field $\phi$, $\mathcal{L}_{smooth}$ is the deformation field regularization term defined below, and $\lambda$ is its weighting coefficient.
Normalized cross-correlation (NCC) loss function
NCC is a similarity measurement method primarily utilized in monomodal registration models. Compared to other methods, it offers advantages in accuracy and reliability. The mathematical expression for NCC is as follows:

$$NCC(I_f, I_w) = \frac{\sum_{p \in \Omega}\big(I_f(p)-\bar{I}_f\big)\big(I_w(p)-\bar{I}_w\big)}{\sqrt{\sum_{p \in \Omega}\big(I_f(p)-\bar{I}_f\big)^2}\,\sqrt{\sum_{p \in \Omega}\big(I_w(p)-\bar{I}_w\big)^2}}$$

where $I_f$ is the fixed image, $I_w = I_m \circ \phi$ is the warped moving image, $\Omega$ is the image domain, and $\bar{I}_f$ and $\bar{I}_w$ are the mean intensities of the two images.

The larger the NCC value, the more similar the two images. The NCC is negated in the loss function to convert it into a quantity that can be minimized:

$$\mathcal{L}_{NCC}(I_f, I_w) = -NCC(I_f, I_w)$$
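A direct PyTorch rendering of this loss might look like the following sketch, computing a single global NCC per volume as in the formula above:

```python
import torch

def ncc_loss(fixed, warped, eps=1e-8):
    """Negated global NCC between two (B, 1, D, H, W) volumes, so that
    maximizing similarity corresponds to minimizing the loss."""
    f = fixed.flatten(1)
    w = warped.flatten(1)
    f = f - f.mean(dim=1, keepdim=True)   # zero-mean the intensities
    w = w - w.mean(dim=1, keepdim=True)
    ncc = (f * w).sum(dim=1) / (f.norm(dim=1) * w.norm(dim=1) + eps)
    return -ncc.mean()
```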
Modality independent neighborhood descriptor loss function
The modality independent neighborhood descriptor (MIND) is a similarity measurement method specifically designed for multimodal medical image registration. It quantifies similarity by utilizing local patterns around each voxel as features, comparing the patch centered at a voxel with patches located at specific distances. The fundamental assumption of the MIND similarity function is that the local patterns around a voxel should be similar even across different imaging modalities. First, define the patch distance:

$$D_P(I, x, x+r) = \sum_{p \in P}\big(I(x+p) - I(x+r+p)\big)^2$$

where $I$ is the image, $x$ is the voxel position, $r$ is a distance (offset) vector, and $P$ is the set of offsets within a local patch.

MIND is modeled as a Gaussian function of this patch distance:

$$MIND(I, x, r) = \frac{1}{n}\exp\left(-\frac{D_P(I, x, x+r)}{V(I, x)}\right), \qquad r \in R$$

where $r$ is the distance vector, $R$ is the search region of offsets, $V(I, x)$ is an estimate of the local variance (e.g., the mean patch distance to the immediate neighbors of $x$), and $n$ is a normalization constant.

To construct an image registration similarity loss function based on MIND, it is necessary to compute the mean absolute difference of the MIND descriptors of the two images to be registered. Therefore, it is defined as follows:

$$\mathcal{L}_{MIND}(I_f, I_w) = \frac{1}{|\Omega||R|}\sum_{x \in \Omega}\sum_{r \in R}\big|MIND(I_f, x, r) - MIND(I_w, x, r)\big|$$

where $\Omega$ is the image domain, $|R|$ is the number of offsets in the search region, $I_f$ is the fixed image, and $I_w$ is the warped moving image.
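The following is a simplified sketch of a MIND-based loss; the six-neighbor search region $R$ and the patch radius are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def mind_descriptor(img, radius=2, eps=1e-8):
    """Simplified MIND: patch-wise SSDs to the six face neighbors, turned
    into a normalized Gaussian-weighted descriptor. img: (B, 1, D, H, W)."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    k = 2 * radius + 1
    dists = []
    for dz, dy, dx in offsets:
        shifted = torch.roll(img, shifts=(dz, dy, dx), dims=(2, 3, 4))
        # Patch-wise SSD D_P via box filtering (average pooling keeps the size).
        dists.append(F.avg_pool3d((img - shifted) ** 2, k, stride=1, padding=radius))
    d = torch.cat(dists, dim=1)                        # (B, 6, D, H, W)
    v = d.mean(dim=1, keepdim=True) + eps              # local variance estimate V
    mind = torch.exp(-d / v)
    return mind / (mind.max(dim=1, keepdim=True).values + eps)

def mind_loss(fixed, warped):
    """Mean absolute difference of the MIND descriptors, as defined above."""
    return (mind_descriptor(fixed) - mind_descriptor(warped)).abs().mean()
```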
Deformation field regularization
Optimizing solely based on the similarity metric may result in the registration field $\phi$ being discontinuous or physically implausible. To encourage a smooth deformation, a diffusion regularizer is imposed on the spatial gradients of the displacement field $u$, where $\phi = Id + u$:

$$\mathcal{L}_{smooth}(\phi) = \sum_{p \in \Omega}\|\nabla u(p)\|^2$$

where $p$ represents the voxel position, $u(p)$ is the displacement at $p$, and $\Omega$ is the image domain.
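In code, this diffusion regularizer reduces to squared finite differences of the displacement field along each axis, e.g.:

```python
def smoothness_loss(flow):
    """Mean squared forward differences of a (B, 3, D, H, W) displacement field."""
    dz = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()
```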
Cascaded ConvNXMorph
The cascaded registration network primarily consists of two parts: the coarse registration network and the fine registration network. Each registration sub-network is responsible for aligning the fixed image with the moving image. Subsequently, the moving image is warped using the predicted registration field and passed to the next cascaded sub-network, ultimately producing the registration result. Figure 3 intuitively demonstrates the cascaded network structure.
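The following PyTorch sketch illustrates this flow; `coarse_net` and `fine_net` stand in for the two ConvNXMorph sub-networks, and the warping function is a generic spatial transformer rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a (B, C, D, H, W) volume with a (B, 3, D, H, W) displacement
    field given in voxels (a spatial transformer)."""
    B, _, D, H, W = moving.shape
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((zz, yy, xx)).float().to(moving.device)  # identity grid
    coords = grid.unsqueeze(0) + flow                           # displaced coordinates
    sizes = torch.tensor([D, H, W], device=moving.device).view(1, 3, 1, 1, 1)
    coords = 2.0 * coords / (sizes - 1) - 1.0                   # normalize to [-1, 1]
    coords = coords.permute(0, 2, 3, 4, 1).flip(-1)             # (B, D, H, W, 3), xyz order
    return F.grid_sample(moving, coords, align_corners=True)

def cascaded_register(coarse_net, fine_net, fixed, moving):
    """Coarse-to-fine cascade: warp with the coarse field, then refine."""
    flow_c = coarse_net(fixed, moving)
    warped_c = warp(moving, flow_c)
    flow_f = fine_net(fixed, warped_c)
    return warp(warped_c, flow_f), flow_c, flow_f
```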
In the coarse registration stage, ConvNXMorph with an NCC loss function (32) was employed. The objective of coarse registration is to achieve approximate alignment. The NCC loss function is well suited to unsupervised monomodal medical image registration: it measures the similarity between images, possesses translational invariance (the correlation coefficient is unchanged if the images are shifted together), and considers the entire image, capturing global structural similarity rather than merely local features. This makes the NCC loss particularly effective during coarse registration.
In the fine registration phase, ConvNXMorph with the MIND loss function (33) was used. Fine registration builds on the coarse registration result, focusing on local alignment to achieve higher accuracy. The MIND loss function is suitable for multimodal image registration: it measures similarity using the differences between patches centered at each voxel and patches located at specific distances, across modalities. This makes it better able to capture local details and sensitive to local image transformations, rendering the MIND loss very effective for fine registration.
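Putting the pieces together, the two stage-specific objectives can be sketched as below, reusing the `ncc_loss`, `mind_loss`, and `smoothness_loss` sketches from earlier; the regularization weight `lam` is an assumed hyperparameter.

```python
def coarse_objective(fixed, warped, flow, lam=1.0):
    # Coarse stage: NCC similarity plus deformation-field smoothness.
    return ncc_loss(fixed, warped) + lam * smoothness_loss(flow)

def fine_objective(fixed, warped, flow, lam=1.0):
    # Fine stage: MIND similarity plus deformation-field smoothness.
    return mind_loss(fixed, warped) + lam * smoothness_loss(flow)
```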
Preprocess and implementation details
All images were resized to 144×224×224. In the unsupervised registration model, an 8:2 random split was employed, with 200 cases used for training and 50 cases for validation. The data division for the segmentation network was identical to that of the registration network. To ensure robustness, the results of both the registration and segmentation networks were subjected to five-fold cross-validation.
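For reference, a resize of this kind can be performed with trilinear interpolation; the interpolation mode here is an assumption, as the paper does not specify it.

```python
import torch.nn.functional as F

def resize_volume(vol, size=(144, 224, 224)):
    """Resize a (B, C, D, H, W) MRI volume to the target grid."""
    return F.interpolate(vol, size=size, mode="trilinear", align_corners=False)
```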
All models used in this study were implemented using the PyTorch framework (34). The training and inference processes were conducted on an NVIDIA 3090Ti GPU (NVIDIA, Santa Clara, CA, USA). Initially, the
Baseline methods
ConvNXMorph was compared with six other registration models: two traditional registration algorithms and four DL-based registration algorithms. The traditional algorithms were advanced normalization tools (ANTs) [symmetric normalization (SyN)] and NiftyReg. The DL-based methods were VoxelMorph (10), SYMNet (17), ViT-V-Net (12), and TransMorph (14). To ensure consistency and fairness in the evaluation, the proposed ConvNXMorph and the comparison models VoxelMorph, SYMNet, ViT-V-Net, and TransMorph were configured with identical hyperparameters and dataset divisions. Additionally, the same nnUNet-v2 segmentation model was applied after registration for all methods.
Evaluation metrics
During the registration process, seven evaluation metrics were used. Firstly, the dice similarity coefficient (DSC) was utilized to calculate the overlap between the true stroke annotation and the annotated stroke lesion after registration. Secondly, the average inference time (in seconds) for two fixed images and one warped moving image was measured. Thirdly, the mean squared error (MSE) was used as a similarity metric. Fourthly, the NCC between the two fixed images and the warped moving image was calculated. Fifthly, the number of voxels with a non-positive Jacobian determinant ($\lvert J_\phi \rvert \le 0$) was counted. Sixthly, the percentage of such voxels among all voxels was computed. Seventhly, the mutual information (MI) between the fixed and warped images was measured.

The shorter the inference time, the faster the registration; the smaller the MSE and the number and percentage of $\lvert J_\phi \rvert \le 0$ voxels, the better the registration; the larger the DSC, NCC, and MI, the better the registration.
MSE is used to assess the pixel-level difference between the fixed images and the warped moving image. The definition of MSE is as follows:

$$MSE = \frac{1}{MNP}\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{P}\big(I_f(i,j,k) - I_w(i,j,k)\big)^2$$

where $M$ represents the image width, $N$ represents the image height, and $P$ represents the image length.
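In code this is a one-line reduction, e.g.:

```python
import torch

def mse_metric(fixed, warped):
    """Mean squared voxel-wise error over all M x N x P voxels."""
    return torch.mean((fixed - warped) ** 2)
```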
MI is a fundamental concept in information theory, intuitively measuring the statistical dependence between the intensity distributions of the fixed images and the warped moving image. The MI between two images $A$ and $B$ is defined as:

$$MI(A, B) = \sum_{a}\sum_{b} p(a, b)\,\log\frac{p(a, b)}{p(a)\,p(b)}$$

where $p(a, b)$ is the joint intensity distribution of $A$ and $B$, and $p(a)$ and $p(b)$ are the corresponding marginal distributions.
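A histogram-based estimate of MI can be sketched as follows; the bin count is an assumption.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI between two volumes from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(a)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(b)
    nz = pxy > 0                                 # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```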
The evaluation of the deformation field's regularity employed the criterion $\lvert J_\phi(p) \rvert \le 0$, where $J_\phi(p) = \nabla\phi(p)$ is the Jacobian matrix of the deformation field at voxel $p$. Voxels with a non-positive Jacobian determinant indicate local folding, i.e., locations where the deformation is not diffeomorphic; both their number and their percentage of all voxels are reported.
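A NumPy sketch of this check, assuming the convention $\phi = Id + u$ for a displacement field $u$:

```python
import numpy as np

def folded_voxels(flow):
    """Count voxels with a non-positive Jacobian determinant.
    flow: (3, D, H, W) displacement field (unit: voxels)."""
    grads = [np.gradient(flow[i]) for i in range(3)]      # du_i / d(z, y, x)
    J = np.stack([np.stack(g) for g in grads])            # (3, 3, D, H, W)
    J = J + np.eye(3)[:, :, None, None, None]             # Jacobian of id + u
    det = np.linalg.det(np.moveaxis(J, (0, 1), (-2, -1)))
    folded = det <= 0
    return int(folded.sum()), float(folded.mean() * 100)  # count and percentage
```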
In the task of MRI segmentation, the average DSC is widely adopted as the primary quantitative evaluation metric. Additionally, we employed three other metrics to assess the effectiveness of all segmentation methods, namely intersection over union (IoU), 95% Hausdorff distance (HD95), and sensitivity. The definitions of these metrics are as follows:

$$DSC = \frac{2TP}{2TP + FP + FN}, \qquad IoU = \frac{TP}{TP + FP + FN}, \qquad Sensitivity = \frac{TP}{TP + FN}$$

where TP, FP, and FN represent the number of true-positive, false-positive, and false-negative pixels, respectively. HD95 is the 95th percentile of the distances between the boundary points of the predicted segmentation and those of the ground truth, which suppresses the influence of outlier boundary points.
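The overlap metrics follow directly from these counts (HD95 is usually computed separately from boundary distance transforms). A minimal sketch:

```python
import numpy as np

def overlap_metrics(pred, gt):
    """DSC, IoU, and sensitivity from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    sensitivity = tp / (tp + fn)
    return dsc, iou, sensitivity
```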
Statistical analysis
The distribution of data was assessed using the Kolmogorov-Smirnov test. For variables with a normal distribution, comparisons were made using the paired t-test. For variables not normally distributed, the Wilcoxon signed-rank test was used. All P values were from two-tailed tests, with statistical significance defined as a P value less than 0.05.
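A SciPy sketch of this procedure, applying the normality check to the paired differences (one reasonable reading of the protocol above):

```python
from scipy import stats
import numpy as np

def compare_paired(scores_a, scores_b, alpha=0.05):
    """Kolmogorov-Smirnov normality check on the paired differences, then a
    paired t-test if normal, otherwise a Wilcoxon signed-rank test.
    Returns the two-tailed P value."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    _, p_norm = stats.kstest(stats.zscore(diffs), "norm")
    if p_norm > alpha:
        _, p = stats.ttest_rel(scores_a, scores_b)
    else:
        _, p = stats.wilcoxon(scores_a, scores_b)
    return p  # significant if p < 0.05
```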
Results
Comparison of registration performance
This study conducted a quantitative evaluation of the registration models using seven registration metrics. As shown in Table 1, the inference time of the non-cascaded ConvNXMorph was the shortest, at 0.814 seconds, lower than that of the other unsupervised models, namely VoxelMorph, SYMNet, ViT-V-Net, and TransMorph. Notably, compared to the traditional registration methods, the inference time was greatly reduced: the non-cascaded ConvNXMorph required approximately 1/80 of the time of ANTs (SyN) and 1/76 of that of NiftyReg. This substantial reduction in registration inference time is crucial for clinical applications. Moreover, the cascaded ConvNXMorph improved the DSC by 25.31% compared to not using a registration algorithm and by 3.68% compared to the second-best TransMorph algorithm. Additionally, MSE is a robust registration metric, and both the cascaded and non-cascaded ConvNXMorph outperformed the existing registration models in terms of MSE, with the cascaded ConvNXMorph reaching (1.795±0.483)×10⁵.
Table 1
Method | Time (s) ↓ | DSC (%) ↑ | MSE (×10⁵) ↓ |
---|---|---|---|
– | – | 51.12±0.92 | 2.592±0.732 |
ANTs (SyN) | 65.912 | 67.71±1.03 | 2.172±0.594 |
NiftyReg | 62.143 | 67.24±0.84 | 2.189±0.643 |
VoxelMorph (2019, TMI) | 0.882 | 69.17±0.91 | 2.215±0.354 |
SYMNet (2020, CVPR) | 0.827 | 70.17±0.84 | 2.032±0.511 |
Vit-V-Net (2021, arXiv) | 1.043 | 72.14±1.21 | 2.047±0.954 |
TransMorph (2022, MIA) | 3.425 | 72.75±1.34 | 1.955±0.832 |
Non-c ConvNXMorph | 0.814 | 75.77±0.093 | 1.862±0.543 |
Cascaded ConvNXMorph | 1.611 | 76.43±1.15 | 1.795±0.483 |
–, no method used, meaning the time taken cannot be calculated. ↑, indicates that the higher the value, the better the registration effect; ↓, indicates that the lower the value, the better the registration effect. SD, standard deviation; DSC, Dice similarity coefficient; MSE, mean squared error; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
As depicted in Table 2, the non-cascaded ConvNXMorph achieved a significant improvement in the NCC metric, and the cascaded ConvNXMorph improved it further, reaching 1.879±0.051. The cascaded ConvNXMorph also achieved the best MI on the ISLES'22 dataset, at 3.693±0.772. Moreover, the cascaded ConvNXMorph reached the SOTA in the number and percentage of voxels with a non-positive Jacobian determinant ($\lvert J_\phi \rvert \le 0$), at 1,014±507 voxels (0.014%±0.007%), indicating the most regular deformation fields among all compared methods. Table 3 presents the P values for the comparisons between the cascaded ConvNXMorph and each baseline.
Table 2
Method | NCC ↑ | MI ↑ | $\lvert J_\phi \rvert \le 0$ (count) ↓ | $\lvert J_\phi \rvert \le 0$ (%) ↓ |
---|---|---|---|---|
– | 1.213±0.049 | 2.913±1.043 | 0 | 0 |
ANTs (SyN) | 1.619±0.030 | 3.154±0.987 | 1,098±324 | 0.015±0.004 |
NiftyReg | 1.632±0.045 | 3.214±1.127 | 4,752±1,432 | 0.066±0.020 |
VoxelMorph (2019, TMI) | 1.578±0.078 | 3.324±0.614 | 3,837±943 | 0.053±0.013 |
SYMNet (2020, CVPR) | 1.594±0.032 | 3.274±1.124 | 1,732±656 | 0.024±0.009 |
Vit-V-Net (2021, arXiv) | 1.613±0.079 | 3.389±0.917 | 5,132±3,243 | 0.071±0.045 |
TransMorph (2022, MIA) | 1.732±0.054 | 3.478±0.542 | 1,943±787 | 0.027±0.011 |
Non-c ConvNXMorph | 1.833±0.047 | 3.587±0.843 | 1,593±579 | 0.022±0.008 |
Cascaded ConvNXMorph | 1.879±0.051 | 3.693±0.772 | 1,014±507 | 0.014±0.007 |
–, no method used. ↑, indicates that the higher the value, the better the registration effect; ↓, indicates that the lower the value, the better the registration effect. SD, standard deviation; NCC, normalized cross-correlation; MI, mutual information; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
Table 3
Method | DSC | MSE | NCC | MI | $\lvert J_\phi \rvert \le 0$ |
---|---|---|---|---|---|
ANTs (SyN) | 1e−5 | 0.01 | 4e−5 | 0.02 | 0.04 |
NiftyReg | 2e−4 | 0.01 | 3e−4 | 0.03 | 4e−3 |
VoxelMorph | 2e−4 | 0.01 | 2e−4 | 0.02 | 3e−3 |
SYMNet | 1e−4 | 0.02 | 7e−6 | 0.03 | 0.04 |
Vit-V-Net | 8e−3 | 0.03 | 3e−3 | 0.03 | 0.02 |
TransMorph | 5e−3 | 0.04 | 0.01 | 0.02 | 0.04 |
P values (cascaded ConvNXMorph > baseline) are calculated by paired t-test, and all P values are significant. DSC, Dice similarity coefficient; MSE, mean squared error; NCC, normalized cross-correlation; MI, mutual information; ANTs, advanced normalization tools; SyN, symmetric normalization.
Finally, Table 4 summarizes the parameters and floating point operations (FLOPs) for all registration methods. Compared to TransMorph, the cascaded ConvNXMorph utilized only 10.01% of its parameters, and the FLOPs were reduced by 8.35%.
Table 4
Models | Params (M) | FLOPs (G) |
---|---|---|
VoxelMorph | 0.27 | 319.52 |
SYMNet | 1.10 | 339.76 |
Vit-V-Net | 31.33 | 351.30 |
TransMorph | 46.75 | 602.24 |
Non-c ConvNXMorph | 2.34 | 275.98 |
Cascaded ConvNXMorph | 4.68 | 551.96 |
Params, parameters; FLOPs, floating point operations; M, million; G, gigaflops; Non-c, non-cascaded.
Comparison of segmentation performance
Table 5 lists the nnUNet-v2 segmentation results based on different registration algorithms. The first two rows represent the DSC without using the FLAIR modality and without registration of the FLAIR modality, respectively. Subsequently, the remaining rows correspond to the use of ANTs (SyN), NiftyReg, VoxelMorph, SYMNet, Vit-V-Net, TransMorph, non-cascaded ConvNXMorph, and cascaded ConvNXMorph, respectively.
Table 5
Method | DSC (%) ↑ | IoU (%) ↑ | HD95 (mm) ↓ | Sensitivity (%) ↑ |
---|---|---|---|---|
Without FLAIR | 78.70±0.07 | 72.90±0.07 | 3.41±0.01 | 74.23±0.05 |
FLAIR non-registered | 78.54±0.09 | 72.78±0.02 | 3.37±0.03 | 74.15±0.03 |
ANTs (SyN) | 79.30±0.04 | 72.23±0.06 | 3.26±0.05 | 75.11±0.04 |
NiftyReg | 79.23±0.08 | 72.20±0.05 | 3.25±0.05 | 75.02±0.07 |
VoxelMorph (2019, TMI) | 79.47±0.02 | 72.97±0.03 | 3.20±0.06 | 75.35±0.02 |
SYMNet (2020, CVPR) | 79.53±0.05 | 73.17±0.04 | 3.14±0.04 | 75.98±0.05 |
Vit-V-Net (2021, arXiv) | 79.59±0.03 | 73.56±0.07 | 3.10±0.03 | 76.21±0.04 |
TransMorph (2022, MIA) | 79.64±0.04 | 74.17±0.08 | 3.05±0.02 | 77.11±0.06 |
Non-c ConvNXMorph | 79.95±0.06 | 74.98±0.03 | 3.03±0.07 | 78.12±0.03 |
Cascaded ConvNXMorph | 80.14±0.05 | 75.42±0.05 | 3.01±0.03 | 78.31±0.04 |
The best results are highlighted in bold on a gray background. ↑, indicates that the higher the value, the better the segmentation effect; ↓, indicates that the lower the value, the better the segmentation effect. SD, standard deviation; DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; FLAIR, fluid-attenuated inversion recovery; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
When only the DWI and ADC modalities were considered, the DSC was 78.70%. The introduction of the unregistered FLAIR modality did not improve the segmentation results; on the contrary, the DSC decreased by 0.16%. This observation suggests that feeding unaligned multimodal images into the segmentation network not only fails to enhance segmentation performance but also causes the network to learn incorrect information due to the anatomical misalignment between modalities, thereby degrading segmentation performance.
Meanwhile, compared to the unregistered method, the DSC of the ANTS (SyN), NiftyReg, VoxelMorph, SYMNet, Vit-V-Net, TransMorph, non-cascaded ConvNXMorph, and cascaded ConvNXMorph registration algorithms increased by 0.76%, 0.69%, 0.93%, 0.99%, 1.05%, 1.10%, 1.41%, and 1.60%, respectively. Compared to the sub-optimal TransMorph model, the cascaded ConvNXMorph demonstrated improvements in DSC, IoU, and sensitivity by 0.50%, 1.25%, and 1.2%, respectively, with a decrease in HD95 by 0.04 mm.
Table 6 shows the P values for these tests, and all statistical test results were very significant.
Table 6
Method | DSC | IoU | HD95 | Sensitivity |
---|---|---|---|---|
Without FLAIR | 1e−6 | 7e−7 | 2e−6 | 1e−8 |
FLAIR non-registered | 4e−6 | 1e−7 | 1e−5 | 4e−9 |
ANTs (SyN) | 7e−7 | 3e−7 | 6e−4 | 2e−9 |
NiftyReg | 2e−5 | 6e−8 | 1e−3 | 1e−7 |
VoxelMorph | 5e−6 | 2e−7 | 2e−3 | 1e−8 |
SYMNet | 1e−6 | 2e−7 | 5e−4 | 3e−8 |
Vit-V-Net | 2e−6 | 2e−6 | 6e−3 | 2e−8 |
TransMorph | 6e−6 | 2e−5 | 0.02 | 2e−6 |
P values (cascaded ConvNXMorph > baseline) are computed through paired t-tests, and all exhibit statistical significance. DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; FLAIR, fluid-attenuated inversion recovery; ANTs, advanced normalization tools; SyN, symmetric normalization.
To better illustrate the effectiveness of the St-RegSeg algorithm, we compared our method with five SOTA single-modality segmentation methods, including one general CNN-based method (nnUnet) and four stroke lesion segmentation algorithms [HCSNet (37), SEAN (38), FRPNet (39), SrSNet (40)]. As shown in Table 7, St-RegSeg achieved the best performance across all four metrics and significantly outperformed the other segmentation algorithms. Compared to HCSNet (2023, JBHI), St-RegSeg improved the DSC, IoU, and sensitivity by 31.41%, 30.25%, and 36.16%, respectively, and reduced the HD95 by 3.92 mm. In comparison with the second-best network, SrSNet (2024, ESWA), St-RegSeg improved the DSC, IoU, and sensitivity by 1.10%, 1.11%, and 1.58%, respectively, and decreased the HD95 by 0.19 mm, demonstrating its superior segmentation performance. Table 8 shows the P values for these tests; all results were statistically significant.
Table 7
Method | DSC (%) ↑ | IoU (%) ↑ | HD95 (mm) ↓ | Sensitivity (%) ↑ |
---|---|---|---|---|
HCSNet (2023, JBHI) | 48.73±0.12 | 45.17±0.13 | 6.93±0.07 | 42.15±0.14 |
FRPNet (2024, CIBM) | 61.77±0.17 | 58.77±0.06 | 5.65±0.09 | 59.86±0.13 |
SEAN (2021, MICCAI) | 66.14±0.09 | 63.15±0.07 | 4.94±0.11 | 63.98±0.16 |
nnUnet (2021, Nature Methods) | 78.70±0.07 | 72.90±0.07 | 3.41±0.01 | 74.23±0.05 |
SrSNet (2024, ESWA) | 79.04±0.11 | 74.31±0.12 | 3.20±0.09 | 76.73±0.07 |
St-RegSeg | 80.14±0.05 | 75.42±0.05 | 3.01±0.03 | 78.31±0.04 |
↑, indicates that the higher the value, the better the segmentation effect; ↓, indicates that the lower the value, the better the segmentation effect. SD, standard deviation; DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; JBHI, IEEE Journal of Biomedical and Health Informatics; CIBM, Computers in Biology and Medicine; MICCAI, Medical Image Computing and Computer Assisted Intervention Society; ESWA, Expert Systems With Applications.
Table 8
Method | DSC | IoU | HD95 | Sensitivity |
---|---|---|---|---|
HCSNet | 7e−11 | 2e−10 | 1e−8 | 7e−11 |
FRPNet | 2e−9 | 5e−11 | 3e−7 | 2e−10 |
SEAN | 1e−10 | 1e−9 | 5e−6 | 1e−9 |
nnUnet | 5e−6 | 4e−6 | 5e−6 | 2e−8 |
SrSNet | 6e−5 | 5e−6 | 0.01 | 3e−7 |
P values (St-RegSeg > baseline) are calculated by paired t-test, and all P values are significant. DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance.
Ablation study
Table 9 presents the ablation study conducted on the ISLES’22 dataset. In the ablation studies, we validated the effectiveness of the ConvNeXt-R module, triple-channel input, cascaded registration strategy, and AG mechanism. Using TransMorph as a baseline, components were gradually replaced and added to analyze the contribution of each component, and statistical analyses were conducted to verify the presence of significant differences. The baseline DSC was 79.64 (±0.04).
Table 9
Method | If1 (ADC) | If2 (DWI) | Coarse-Reg (NCC/MIND) | Fine-Reg (NCC/MIND) | AG | Backbone | DSC (%), mean ± SD |
---|---|---|---|---|---|---|---|
TransMorph | × | √ | NCC | × | × | Swin Transformer | 79.64±0.04 |
ConvNXMorph | × | √ | NCC | × | × | ConvNeXt-R | 79.81±0.05 |
ConvNXMorph | × | √ | MIND | × | × | ConvNeXt-R | 79.47±0.06 |
ConvNXMorph | × | √ | NCC | × | √ | ConvNeXt-R | 79.88±0.03 |
ConvNXMorph | √ | √ | NCC | × | √ | ConvNeXt-R | 79.95±0.06 |
Cascaded ConvNXMorph | √ | √ | NCC | NCC | √ | ConvNeXt-R | 79.96±0.07 |
Cascaded ConvNXMorph | √ | √ | NCC | MIND | √ | ConvNeXt-R | 80.14±0.05 |
ADC, apparent diffusion coefficient; DWI, diffusion-weighted imaging; NCC, normalized cross-correlation; MIND, Modality Independent Neighborhood Descriptor; AG, attention gate; DSC, Dice similarity coefficient; SD, standard deviation; If1, the first fixed image (ADC); If2, the second fixed image (DWI).
Initially, the introduction of the ConvNeXt-R module resulted in a 0.17% increase in DSC compared to the baseline (P=0.008), indicating that the incorporation of the ConvNeXt-R module is beneficial for the St-RegSeg framework. Subsequently, the MIND loss function was introduced, leading to a 0.34% decrease in DSC compared to the second ablation study. This decrease is speculated to be due to the MIND loss function’s focus on pixel-level information matching, which may not consider the overall image information as effectively as the NCC loss function.
In the fourth ablation study, the introduction of the AG mechanism resulted in a 0.07% increase in DSC compared to the second ablation study (P=0.034), validating the effectiveness of the AG mechanism. In the fifth ablation study, the introduction of the triple-channel input led to a further 0.07% increase in DSC compared to the fourth ablation study (P=0.001), confirming the effectiveness of the triple-channel mechanism.
In the sixth ablation study, a cascaded registration network strategy was employed, and the use of the NCC loss function in the fine registration phase did not result in significant improvement (P=0.542). The seventh ablation study used the MIND loss function in the fine registration phase, which led to a 0.19% increase in DSC compared to the fifth ablation experiment (P=0.003). The reason for this outcome is speculated to be that the model trained with the NCC loss function during the coarse registration phase had already reached its optimal state, and hence, further optimization using the NCC loss function did not significantly enhance the results. However, employing the MIND loss function allowed for further optimization of the segmentation effect.
Discussion
Qualitative analysis in the registration phase
The stroke lesion segmentation labels for ISLES'22 were annotated under the DWI and ADC modalities. To visualize the results, the labels were loaded onto the registered FLAIR modality. The qualitative results of the MRI registration are shown in Figure 7. Compared to the original FLAIR annotations, the ANTs (SyN), NiftyReg, VoxelMorph, SYMNet, ViT-V-Net, and TransMorph algorithms all improved the alignment between the lesion areas and the annotations. Among these, the coarse-registration-only ConvNXMorph appeared visually superior to the other six registration methods, accurately deforming the moving image, as evidenced by the boundaries of the stroke lesion. The cascaded ConvNXMorph further enhanced the deformation precision: its registration results were almost consistent with the fixed image in terms of the stroke lesion structure, surpassing the baseline methods.
Qualitative analysis in the segmentation phase
This study selected nnUNet-v2 as the segmentation model. Even without the use of the FLAIR modality, the network demonstrates good segmentation performance. To further enhance segmentation precision, an approach from the registration perspective was adopted, aligning different modalities before they are input into the segmentation model. This improvement ensures that the lesion areas contain precise information, thereby facilitating accurate segmentation.
Figure 8 displays examples of stroke lesion segmentation by non-cascaded ConvNXMorph, cascaded ConvNXMorph, and other unsupervised registration models. Compared to other methods, St-RegSeg has several advantages. Firstly, it can more precisely segment and locate small, medium, and large stroke lesions (as seen in Case 1, Case 3, and Case 2, respectively). Additionally, St-RegSeg also shows fine lesion segmentation edges (as illustrated in Case 3). Moreover, even in situations with multiple stroke lesions, St-RegSeg successfully captures more lesions and presents results closer to the annotations (as shown in Case 1 and Case 4). In summary, the St-RegSeg framework demonstrates strong capabilities in handling multiple stroke lesions and accurately segmenting lesion edges.
The contribution of DWI and ADC modalities to segmentation accuracy
In Figure 9, the value of
In this study, the DSC for
In this study, the segmentation accuracy was highest when
Conclusions
This study proposes the St-RegSeg framework, which integrates the unsupervised registration model ConvNXMorph with nnUNet-v2 for multimodal image registration and segmentation. The ConvNXMorph model introduces an AG mechanism, triple-channel input, a cascaded registration network, and uses the MIND multimodal loss function in the fine registration phase to achieve optimal registration and segmentation performance. ConvNXMorph also improves inference speed, significantly accelerating the process of multimodal medical image registration. Although the St-RegSeg framework is highly effective for multimodal medical image registration and segmentation, there is still room for further enhancement of segmentation outcomes. Specifically, within the St-RegSeg framework, nnUNet-v2 can be replaced with more advanced segmentation models such as nnFormer (41), STU-Net (42), or MedNeXt (43), thereby further improving the segmentation performance of this framework.
Acknowledgments
Funding: This work was supported in part by
Footnote
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-725/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Chen L, Bentley P, Rueckert D. Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks. Neuroimage Clin 2017;15:633-43. [Crossref] [PubMed]
- Zhu Z, He X, Qi G, Li Y, Cong B, Liu Y. Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI. Inf Fusion 2023;91:376-87.
- Beg MF, Miller MI, Trouvé A, Younes L. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int J Comput Vis 2005;61:139-57.
- Sokooti H, de Vos B, Berendsen F, Lelieveldt BPF, Išgum I, Staring M. Nonrigid Image Registration Using Multi-scale 3D Convolutional Neural Networks. Medical Image Computing and Computer Assisted Intervention − MICCAI 2017;1:232-9.
- Eppenhof K, Lafarge MW, Moeskops P, Veta M, Pluim J. Deformable image registration using convolutional neural networks. Medical Imaging 2018. doi: 10.1117/12.2292443.
- Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial Transformer Networks. Adv Neural Inf Process Syst 2015;28:2017-25.
- de Vos BD, Berendsen FF, Viergever MA, Staring M, Išgum I. End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA ML-CDS 2017 2017. Lecture Notes in Computer Science, Springer, Cham; 2017:10553.
- Sheikhjafari A, Punithakumar K, Ray N. Unsupervised Deformable Image Registration with Fully Connected Generative Neural Network. Midl; 2018:1-9.
- Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans Med Imaging 2019; Epub ahead of print. [Crossref]
- Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. An Unsupervised Learning Model for Deformable Medical Image Registration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018:9252-60.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020. arXiv:2010.11929.
- Chen J, He Y, Frey EC, Li Y, Du Y. ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration. arXiv 2021. arXiv:2104.06468.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021). IEEE, New York, 2021:9992-10002.
- Chen J, Frey EC, He Y, Segars WP, Li Y, Du Y. TransMorph: Transformer for unsupervised medical image registration. Med Image Anal 2022;82:102615. [Crossref] [PubMed]
- Zhu Y, Lu S. Swin-VoxelMorph: A Symmetric Unsupervised Learning Model for Deformable Medical Image Registration Using Swin Transformer. Springer Nature Switzerland; 2022:78-87.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision – ECCV 2022 Workshops. Lecture Notes in Computer Science; ECCV 2022; vol 13803.
- Mok TCW, Chung ACS. Fast symmetric diffeomorphic image registration with convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020:4644-53.
- Kim B, Kim DH, Park SH, Kim J, Lee JG, Ye JC. CycleMorph: Cycle consistent unsupervised deformable image registration. Med Image Anal 2021;71:102036. [Crossref] [PubMed]
- Zhao S, Lau T, Luo J, Chang EI, Xu Y. Unsupervised 3D End-to-End Medical Image Registration With Volume Tweening Network. IEEE J Biomed Health Inform 2020;24:1394-404. [Crossref] [PubMed]
- Cheng Z, Guo K, Wu C, Shen J, Qu L. U-Net cascaded with dilated convolution for medical image registration. Chinese Automation Congress (CAC) (Hangzhou: IEEE); 2019:3647-51.
- Zhao S, Dong Y, Chang E, Xu Y. Recursive cascaded networks for unsupervised medical image registration. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019:10600-10.
- Ma L, Chi W, Morgan HE, Lin MH, Chen M, Sher D, Moon D, Vo DT, Avkshtol V, Lu W, Gu X. Registration-guided deep learning image segmentation for cone beam CT-based online adaptive radiotherapy. Med Phys 2022;49:5304-16. [Crossref] [PubMed]
- Cabezas M, Oliver A, Lladó X, Freixenet J, Cuadra MB. A review of atlas-based segmentation for magnetic resonance brain images. Comput Methods Programs Biomed 2011;104:e158-77. [Crossref] [PubMed]
- Wu Z, Zhang X, Li F, Wang S, Huang L, Li J. W-Net: A boundary-enhanced segmentation network for stroke lesions. Expert Syst Appl 2023. doi: 10.1016/j.eswa.2023.120637.
- Zhu W, Myronenko A, Xu Z, Li W, Roth H, Huang Y, Milletari F, Xu D. NeurReg: Neural registration and its application to image segmentation. IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV); 2020:3606-15.
- Li J, Wang C, Huang B, Zhou Z. ConvNeXt-backbone HoVerNet for Nuclei Segmentation and Classification. arXiv 2022. arXiv:2202.13560.
- Fan S, Liang W, Ding D, Yu H. LACN: A lightweight attention-guided ConvNeXt network for low-light image enhancement. Eng Appl Artif Intell 2023;117:105632.
- Jiang Y, Yu J, Yang W, Zhang B, Wang Y. Nextformer: a convnext augmented conformer for end-to-end speech recognition. arXiv 2022. arXiv:2206.14747.
- Yu W, Zhou P, Yan S, Wang X. InceptionNeXt: When Inception Meets ConvNeXt. Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition 2024, Seattle; 2024 June 17-21.
- Hernandez Petzsche MR, de la Rosa E, Hanning U, Wiest R, Valenzuela W, Reyes M, et al. ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Sci Data 2022;9:762. [Crossref] [PubMed]
- Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
- Avants BB, Epstein CL, Grossman M, Gee JC. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med Image Anal 2008;12:26-41. [Crossref] [PubMed]
- Heinrich MP, Jenkinson M, Bhushan M, Matin T, Gleeson FV, Brady SM, Schnabel JA. MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Med Image Anal 2012;16:1423-35. [Crossref] [PubMed]
- Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019. arXiv:1912.01703.
- Pluim JP, Maintz JB, Viergever MA. Mutual-information-based registration of medical images: a survey. IEEE Trans Med Imaging 2003;22:986-1004. [Crossref] [PubMed]
- Ashburner J. A fast diffeomorphic image registration algorithm. Neuroimage 2007;38:95-113. [Crossref] [PubMed]
- Liu L, Chang J, Liu Z, Zhang P, Xu X, Shang H. Hybrid Contextual Semantic Network for Accurate Segmentation and Detection of Small-Size Stroke Lesions From MRI. IEEE J Biomed Health Inform 2023;27:4062-73. [Crossref] [PubMed]
- Liang K, Han K, Li X, Cheng X, Li Y, Wang Y, Yu Y. Symmetry-Enhanced Attention Network for Acute Ischemic Infarct Segmentation with Non-contrast CT Images. Springer International Publishing. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021:432-41.
- Wu Z, Zhang X, Li F, Wang S, Li J. A feature-enhanced network for stroke lesion segmentation from brain MRI images. Comput Biol Med 2024;174:108326. [Crossref] [PubMed]
- Li T, An X, Di Y, Gui C, Yan Y, Liu S, Ming D. SrSNet: Accurate segmentation of stroke lesions by a two-stage segmentation framework with asymmetry information. Expert Syst Appl 2024;254:124329.
- Zhou HY, Guo J, Zhang Y, Yu L, Wang L, Yu Y. nnFormer: Interleaved Transformer for Volumetric Segmentation. arXiv 2021. arXiv:2109.03201.
- Huang Z, Wang H, Deng Z, Ye J, Su Y, Sun H, He J, Gu Y, Gu L, Zhang S, Qiao Y. STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training. arXiv 2023. arXiv:2304.06716.
- Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jaeger PF, Maier-Hein K. MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation. arXiv 2023. arXiv:2303.09975.