St-RegSeg: an unsupervised registration-based framework for multimodal magnetic resonance imaging stroke lesion segmentation
Introduction
Stroke is one of the leading causes of death and disability worldwide (1). Ischemic stroke, which accounts for 75–90% of stroke cases, is caused by various cerebrovascular diseases that obstruct the cerebral blood supply, leading to ischemic and hypoxic necrosis of local brain tissue and the rapid onset of corresponding neurological deficits. When magnetic resonance imaging (MRI) is used to assess stroke lesions, the three most important modalities are diffusion-weighted imaging (DWI), the apparent diffusion coefficient (ADC), and fluid-attenuated inversion recovery (FLAIR). DWI, which generates images by monitoring the free diffusion of water molecules within tissue, is the gold standard MRI modality for evaluating ischemic stroke. ADC is a parameter map derived from DWI data that can quantify the extent of brain tissue damage. FLAIR suppresses the cerebrospinal fluid signal, enhancing the contrast between gray and white matter. These three modalities complement one another and, when combined, provide more comprehensive information, allowing lesion boundaries to be delineated more accurately.
Although MRI encompasses multiple modalities, segmentation labels are often annotated in only a single modality. Furthermore, differences in spatial alignment and semantic content between modalities mean that labels annotated in one modality may not be directly applicable in another. In medical imaging, annotating segmentation labels is exceptionally challenging and time-consuming, requiring substantial human and material resources. To overcome the accuracy limitations of single-modality segmentation, multimodal segmentation networks have emerged as a promising solution.
Currently, multimodal image segmentation primarily relies on two frameworks. The first uses multimodal fusion technology to integrate information from multiple modalities before decoding; the characteristic of this type of network is the inclusion of a feature fusion module (2). The second framework first employs a registration algorithm to align images from different modalities, and then feeds the registered images, together with the fixed images, into the segmentation model through early fusion. Both frameworks hold significant research value; this paper focuses on the second.
Traditional registration algorithms involve an iterative optimization process that estimates a smooth mapping between points in one image and corresponding points in another: an image similarity measure is computed, and an appropriate optimization method iteratively refines the transformation until the registered images achieve maximum similarity (3). However, each iteration incurs a high computational cost, so these methods consume a significant amount of time in practical applications and face challenges in balancing registration accuracy against runtime.
To address these issues, recent research has explored deep learning (DL) methods for image registration. Many of these techniques construct predictive deformable spatial transformations (STs), that is, they predict the pixel correspondence between image pairs from image patches (4). More recent work has proposed directly predicting STs from complete input image pairs (5). Although these methods have shown precise registration results, they fall within the category of supervised methods: they require registration labels for model training, and obtaining such labels is highly challenging.
Unsupervised registration aligns images based on their inherent features, without labels or prior information. It primarily employs spatial transformer networks, using the generated registration fields to deform the moving images. The training of unsupervised registration networks relies on an image similarity loss and a smoothness loss on the registration field (6). de Vos et al. introduced the first unsupervised registration network, DIRNet (7), which consists of three components: a regressor, a spatial transformer network, and a resampler (8). Among recent unsupervised learning models, VoxelMorph (9) has shown superior performance. It combines convolutional neural networks (CNNs) with an ST, leveraging a probabilistic generative model for representation learning with CNNs and image reconstruction with the ST, achieving high computational efficiency without the need for extensive annotated data (10). In recent years, vision transformers (ViTs) (11) have emerged as powerful models in visual recognition, rivaling CNNs across various computer vision tasks. Based on ViT, Chen et al. introduced a new unsupervised registration network, ViT-V-Net (12). To handle large images more effectively, the Swin transformer adopts a window-partition strategy to divide images and establishes multi-level cross-layer connections between different partitions (13). To overcome the limitations of CNN architectures in modeling long-distance spatial relationships in images, Chen et al. developed TransMorph (14), a hybrid Transformer-ConvNet framework based on Swin transformer modules, specifically for image registration. For effective handling of semantic correspondences and deformable registration, Zhu and Lu proposed Swin-VoxelMorph (15), a symmetric unsupervised network based on Swin-Unet (16) that simultaneously estimates forward and backward transformations while minimizing image dissimilarity. Other previously proposed unsupervised registration networks include SYMNet (17) and CycleMorph (18).
Cascaded registration networks are primarily composed of two parts: a coarse registration network and a fine registration network. Each registration sub-network is responsible for aligning the fixed image with the moving image. Subsequently, the predicted registration field is used to deform the moving image, and the deformed image is passed to the next cascaded sub-network, ultimately yielding the registration result.
Cascaded registration networks have been explored in multiple studies (19-21). Zhao et al. proposed a coarse-to-fine recursive cascading approach for registration (19), in which each sub-network takes the deformed image and the original fixed image as inputs, so that the parameters of all sub-networks are updated through backpropagation. The coarse-to-fine cascading approach proposed in (20) cascades different sub-networks, transitioning the image from a global to a local perspective. In contrast, (21) introduced the concept of recursive cascading, taking both the deformed image and the original fixed image as inputs for the next cascade, thus differentiating it from previous studies.
To maximize the utilization of multimodal information during the segmentation process, a feasible solution is to employ registration-based segmentation. Ma et al. proposed the registration-guided DL (RgDL) segmentation framework (22). The RgDL segmentation model primarily consists of two steps: registration-based contour propagation and DL-based segmentation. In the former, contours are generated and propagated by registering computed tomography (CT) with cone beam CT (CBCT); in the latter, the propagated contours are used as inputs to guide the DL model toward accurate segmentation. Cabezas et al. reviewed atlas-based segmentation for brain MRI images (23), in which atlas labels are propagated onto the target image after the atlas template is registered to the target image. To improve lesion segmentation accuracy on the ISLES'22 dataset, Wu et al. proposed W-Net (24), which uses the ANTs registration algorithm for alignment and then fuses the registered images as input to the segmentation network. Zhu et al. introduced NeurReg (25), a multi-task network capable of simultaneously addressing registration and segmentation tasks.
In terms of ImageNet top-1 accuracy, Common Objects in Context (COCO) object detection, and ADE20K semantic segmentation, ConvNeXt outperforms Swin transformers (13). Enhancements and applications based on the ConvNeXt architecture have become prevalent in computer vision, with examples including HoVerNet (26), LACN (27), Nextformer (28), InceptionNeXt (29), and others.
In this study, drawing inspiration from the ConvNeXt module, we slightly modified the ConvNeXt module and named it the ConvNeXt-R module to adapt it for unsupervised registration tasks. The ConvNeXt-R module is used in an encoder-decoder structure to generate a deformable field. Based on the 3D-ConvNeXt-R encoder-decoder structure, the unsupervised registration network ConvNXMorph was further proposed. To further enhance the precision of the registration model, an attention gate (AG) mechanism and a cascaded registration network were introduced. Finally, the proposed cascaded ConvNXMorph + nnUNet-v2 structure was named St-RegSeg.
The primary contributions of this paper are as follows:
- This paper proposes a registration network with two fixed images and one moving image as input, thereby incorporating multimodal image information in an unsupervised registration network and enhancing registration accuracy.
- This paper proposes an unsupervised cascaded registration network named ConvNXMorph, which leverages the cascading concept and designs the ConvNeXt-R module as the backbone architecture.
- This paper proposes an unsupervised registration-based multimodal MRI stroke segmentation algorithm, named St-RegSeg, which can be employed for both the registration and segmentation of multimodal MRIs.
Methods
Datasets
The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The ISLES 2022 dataset originates from the ISLES’22 Challenge (30). It consists of three-dimensional (3D) multimodal MRI data obtained from three centers, specifically designed for evaluating automatic segmentation methods for acute and subacute ischemic stroke lesions. The MRI modalities include DWI, ADC, and FLAIR sequences. The dataset contains a total of 250 cases, each of which has undergone skull stripping and expert annotation under the DWI sequence. Notably, the ADC sequence is derived from the DWI sequence and is essentially the same modality, ensuring that there is no displacement bias between ADC and DWI images, negating the need for registration between them. Due to differences in scanning parameters, geometric discrepancies exist between the FLAIR and DWI sequences during the acquisition process, leading to label shifts when loading corresponding labels. For example, in Figure 1, the stroke lesion labels in the FLAIR sequence are mistakenly marked in the ventricles.
St-RegSeg
In this study, a multimodal MRI stroke lesion segmentation framework based on unsupervised registration, named St-RegSeg, is proposed. It integrates cascaded ConvNXMorph and nnUNet-v2, as illustrated in Figure 2.
ConvNXMorph
Suppose $I_{f1}$ and $I_{f2}$ denote the two fixed images (the ADC and DWI volumes), $I_m$ denotes the moving image (the FLAIR volume), and $\phi$ denotes the deformation field predicted by the registration network. The goal of registration is to find $\phi$ such that the warped image $I_m \circ \phi$ is spatially aligned with the fixed images.
In this study, drawing on the ConvNeXt module, we made slight modifications to better adapt it to the registration task on this dataset. As shown in Figure 4, the modified module is named ConvNeXt-R. Specifically, the layer normalization (Layer Norm) within ConvNeXt is replaced with batch normalization (Batch Norm), an additional Batch Norm is inserted after the final convolution, and a Gaussian error linear unit (GELU) activation function is applied after the residual connection, enhancing the module's capacity to handle the characteristics of the dataset.
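For concreteness, a minimal PyTorch sketch of such a block is given below. The 7×7×7 depthwise kernel and the 4× channel expansion follow the original ConvNeXt design and are assumptions here, since the paper does not list these hyperparameters.

```python
import torch.nn as nn

class ConvNeXtR(nn.Module):
    """Sketch of the ConvNeXt-R block described above: Layer Norm replaced
    by Batch Norm, an extra Batch Norm after the final convolution, and a
    GELU applied after the residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm1 = nn.BatchNorm3d(dim)        # replaces Layer Norm
        self.pwconv1 = nn.Conv3d(dim, 4 * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv3d(4 * dim, dim, kernel_size=1)
        self.norm2 = nn.BatchNorm3d(dim)        # Batch Norm after the final conv

    def forward(self, x):
        y = self.dwconv(x)
        y = self.norm1(y)
        y = self.act(self.pwconv1(y))
        y = self.norm2(self.pwconv2(y))
        return self.act(x + y)                  # GELU after the residual connection
```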
As illustrated in Figure 5, the ConvNeXt-R encoder-decoder architecture takes an input of dimensions 144×224×224×3, formed by the early fusion (channel-wise concatenation) of the two fixed images $I_{f1}$ and $I_{f2}$ and the moving image $I_m$.
AG
To effectively capture both low-level and high-level semantic features of MRI, integrating low-level semantic information during the upsampling process is crucial. However, the direct skip-connection strategy employed by ConvNeXt-R is overly simplistic and falls short of this objective. Inspired by the gating mechanism found in long short-term memory (LSTM) networks, this study introduces a lightweight attention mechanism, termed the AG, as depicted in Figure 6. After the upsampling step, to integrate feature information from different levels more effectively, the high-level semantic information $g$ is used as a gating signal to reweight the low-level features $x$ arriving through the skip connection:

$$\alpha = \sigma\big(\psi(\mathrm{ReLU}(W_g g + W_x x))\big), \qquad \hat{x} = \alpha \cdot x$$

where $W_g$ and $W_x$ are learnable 1×1×1 convolutions, $\psi$ projects the joint features to a single-channel attention map, $\sigma$ is the sigmoid function, and $\hat{x}$ is the gated low-level feature map passed to the decoder.
This approach effectively integrates low- and high-level semantic information. It enhances the robustness of the registration and filters out noise in the low-level semantic features, allowing the network to automatically focus on and align the relevant features between two or more images during registration.
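As an illustration, a lightweight AG of this kind can be sketched as follows, using the additive attention-gate formulation above; the channel sizes are placeholders, and the exact layer configuration in St-RegSeg may differ.

```python
import torch.nn as nn

class AttentionGate(nn.Module):
    """Gates low-level skip features x with high-level features g
    (assumed to have been upsampled to the same spatial size)."""

    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv3d(g_ch, inter_ch, kernel_size=1)
        self.w_x = nn.Conv3d(x_ch, inter_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.psi = nn.Sequential(nn.Conv3d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, g, x):
        # Attention map in (0, 1), shape (B, 1, D, H, W).
        alpha = self.psi(self.relu(self.w_g(g) + self.w_x(x)))
        return x * alpha  # irrelevant low-level responses are suppressed
```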
nnUNet-v2
In this study, the universal segmentation model nnUNet-v2 (31) was employed owing to its strong generalizability and stable segmentation results. Within nnUNet-v2, each encoder and decoder stage comprises two convolutions with a kernel size of 3×3×3, each followed by group normalization and a LeakyReLU activation function. nnUNet-v2 has achieved state-of-the-art (SOTA) performance in six recognized segmentation challenges.
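A sketch of one such stage is shown below; the group count for the group normalization is an assumption, as the paper does not state it.

```python
import torch.nn as nn

def nnunet_stage(in_ch, out_ch, groups=8):
    """One encoder/decoder stage as described: two 3x3x3 convolutions,
    each followed by group normalization and LeakyReLU.
    out_ch must be divisible by `groups`."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.GroupNorm(groups, out_ch),
        nn.LeakyReLU(inplace=True),
    )
```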
Loss function
In the coarse registration stage, the loss function is as follows:

$$\mathcal{L}_{coarse} = \mathcal{L}_{NCC}\big(I_f, I_m \circ \phi\big) + \lambda\, \mathcal{L}_{smooth}(\phi)$$

In the fine registration stage, the loss function is as follows:

$$\mathcal{L}_{fine} = \mathcal{L}_{MIND}\big(I_f, I_m \circ \phi\big) + \lambda\, \mathcal{L}_{smooth}(\phi)$$

where $I_f$ denotes the fixed image, $I_m \circ \phi$ denotes the moving image warped by the predicted registration field $\phi$, $\mathcal{L}_{smooth}$ is the deformation field regularization term defined below, and $\lambda$ is its weighting coefficient.
Normalized cross-correlation (NCC) loss function
NCC is a similarity measurement method primarily utilized in monomodal registration models. Compared to other methods, it offers advantages in accuracy and reliability. The mathematical expression for NCC is as follows:

$$NCC(I_f, I_w) = \frac{\sum_{p \in \Omega}\big(I_f(p)-\bar{I}_f\big)\big(I_w(p)-\bar{I}_w\big)}{\sqrt{\sum_{p \in \Omega}\big(I_f(p)-\bar{I}_f\big)^2}\,\sqrt{\sum_{p \in \Omega}\big(I_w(p)-\bar{I}_w\big)^2}}$$

where $I_f$ is the fixed image, $I_w = I_m \circ \phi$ is the warped moving image, $\Omega$ is the image domain, and $\bar{I}_f$ and $\bar{I}_w$ are the mean intensities of the two images.

The larger the NCC value, the more similar the two images. The NCC is negated in the loss function to convert it into a quantity that can be minimized:

$$\mathcal{L}_{NCC}(I_f, I_w) = -NCC(I_f, I_w)$$
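A direct PyTorch rendering of this loss might look like the following sketch, computing a single global NCC per volume as in the formula above:

```python
import torch

def ncc_loss(fixed, warped, eps=1e-8):
    """Negated global NCC between two (B, 1, D, H, W) volumes, so that
    maximizing similarity corresponds to minimizing the loss."""
    f = fixed.flatten(1)
    w = warped.flatten(1)
    f = f - f.mean(dim=1, keepdim=True)   # zero-mean the intensities
    w = w - w.mean(dim=1, keepdim=True)
    ncc = (f * w).sum(dim=1) / (f.norm(dim=1) * w.norm(dim=1) + eps)
    return -ncc.mean()
```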
Modality independent neighborhood descriptor loss function
The modality independent neighborhood descriptor (MIND) is a similarity measurement method specifically designed for multimodal medical image registration. It quantifies similarity by utilizing local patterns around each voxel as features, comparing the patch centered at a voxel with patches located at specific distances. The fundamental assumption of the MIND similarity function is that the local patterns around a voxel should be similar even across different imaging modalities. First, define the patch distance:

$$D_P(I, x, x+r) = \sum_{p \in P}\big(I(x+p) - I(x+r+p)\big)^2$$

where $I$ is the image, $x$ is the voxel position, $r$ is a distance (offset) vector, and $P$ is the set of offsets within a local patch.

MIND is modeled as a Gaussian function of this patch distance:

$$MIND(I, x, r) = \frac{1}{n}\exp\left(-\frac{D_P(I, x, x+r)}{V(I, x)}\right), \qquad r \in R$$

where $r$ is the distance vector, $R$ is the search region of offsets, $V(I, x)$ is an estimate of the local variance (e.g., the mean patch distance to the immediate neighbors of $x$), and $n$ is a normalization constant.

To construct an image registration similarity loss function based on MIND, it is necessary to compute the mean absolute difference of the MIND descriptors of the two images to be registered. Therefore, it is defined as follows:

$$\mathcal{L}_{MIND}(I_f, I_w) = \frac{1}{|\Omega||R|}\sum_{x \in \Omega}\sum_{r \in R}\big|MIND(I_f, x, r) - MIND(I_w, x, r)\big|$$

where $\Omega$ is the image domain, $|R|$ is the number of offsets in the search region, $I_f$ is the fixed image, and $I_w$ is the warped moving image.
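The following is a simplified sketch of a MIND-based loss; the six-neighbor search region $R$ and the patch radius are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn.functional as F

def mind_descriptor(img, radius=2, eps=1e-8):
    """Simplified MIND: patch-wise SSDs to the six face neighbors, turned
    into a normalized Gaussian-weighted descriptor. img: (B, 1, D, H, W)."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    k = 2 * radius + 1
    dists = []
    for dz, dy, dx in offsets:
        shifted = torch.roll(img, shifts=(dz, dy, dx), dims=(2, 3, 4))
        # Patch-wise SSD D_P via box filtering (average pooling keeps the size).
        dists.append(F.avg_pool3d((img - shifted) ** 2, k, stride=1, padding=radius))
    d = torch.cat(dists, dim=1)                        # (B, 6, D, H, W)
    v = d.mean(dim=1, keepdim=True) + eps              # local variance estimate V
    mind = torch.exp(-d / v)
    return mind / (mind.max(dim=1, keepdim=True).values + eps)

def mind_loss(fixed, warped):
    """Mean absolute difference of the MIND descriptors, as defined above."""
    return (mind_descriptor(fixed) - mind_descriptor(warped)).abs().mean()
```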
Deformation field regularization
Optimizing solely based on the similarity metric may result in the registration field $\phi$ being discontinuous or physically implausible. To encourage a smooth deformation, a diffusion regularizer is imposed on the spatial gradients of the displacement field $u$, where $\phi = Id + u$:

$$\mathcal{L}_{smooth}(\phi) = \sum_{p \in \Omega}\|\nabla u(p)\|^2$$

where $p$ represents the voxel position, $u(p)$ is the displacement at $p$, and $\Omega$ is the image domain.
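In code, this diffusion regularizer reduces to squared finite differences of the displacement field along each axis, e.g.:

```python
def smoothness_loss(flow):
    """Mean squared forward differences of a (B, 3, D, H, W) displacement field."""
    dz = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()
```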
Cascaded ConvNXMorph
The cascaded registration network primarily consists of two parts: the coarse registration network and the fine registration network. Each registration sub-network is responsible for aligning the fixed image with the moving image. Subsequently, the moving image is warped using the predicted registration field and passed to the next cascaded sub-network, ultimately producing the registration result. Figure 3 intuitively demonstrates the cascaded network structure.
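The following PyTorch sketch illustrates this flow; `coarse_net` and `fine_net` stand in for the two ConvNXMorph sub-networks, and the warping function is a generic spatial transformer rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a (B, C, D, H, W) volume with a (B, 3, D, H, W) displacement
    field given in voxels (a spatial transformer)."""
    B, _, D, H, W = moving.shape
    zz, yy, xx = torch.meshgrid(
        torch.arange(D), torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((zz, yy, xx)).float().to(moving.device)  # identity grid
    coords = grid.unsqueeze(0) + flow                           # displaced coordinates
    sizes = torch.tensor([D, H, W], device=moving.device).view(1, 3, 1, 1, 1)
    coords = 2.0 * coords / (sizes - 1) - 1.0                   # normalize to [-1, 1]
    coords = coords.permute(0, 2, 3, 4, 1).flip(-1)             # (B, D, H, W, 3), xyz order
    return F.grid_sample(moving, coords, align_corners=True)

def cascaded_register(coarse_net, fine_net, fixed, moving):
    """Coarse-to-fine cascade: warp with the coarse field, then refine."""
    flow_c = coarse_net(fixed, moving)
    warped_c = warp(moving, flow_c)
    flow_f = fine_net(fixed, warped_c)
    return warp(warped_c, flow_f), flow_c, flow_f
```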
In the coarse registration stage, ConvNXMorph with an NCC loss function (32) was employed. The objective of coarse registration is to achieve approximate alignment. The NCC loss function is well suited to unsupervised monomodal medical image registration: it measures the similarity between images, possesses translational invariance (the correlation coefficient is unchanged if the images are shifted together), and considers the entire image, capturing global structural similarity rather than merely local features. This makes the NCC loss particularly effective during coarse registration.
In the fine registration phase, ConvNXMorph with the MIND loss function (33) was used. Fine registration builds on the coarse registration result, focusing on local alignment to achieve higher accuracy. The MIND loss function is suitable for multimodal image registration: it measures similarity using the differences between patches centered at each voxel and patches located at specific distances, across modalities. This makes it better able to capture local details and sensitive to local image transformations, rendering the MIND loss very effective for fine registration.
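Putting the pieces together, the two stage-specific objectives can be sketched as below, reusing the `ncc_loss`, `mind_loss`, and `smoothness_loss` sketches from earlier; the regularization weight `lam` is an assumed hyperparameter.

```python
def coarse_objective(fixed, warped, flow, lam=1.0):
    # Coarse stage: NCC similarity plus deformation-field smoothness.
    return ncc_loss(fixed, warped) + lam * smoothness_loss(flow)

def fine_objective(fixed, warped, flow, lam=1.0):
    # Fine stage: MIND similarity plus deformation-field smoothness.
    return mind_loss(fixed, warped) + lam * smoothness_loss(flow)
```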
Preprocess and implementation details
All images were resized to 144×224×224. In the unsupervised registration model, an 8:2 random split was employed, with 200 cases used for training and 50 cases for validation. The data division for the segmentation network was identical to that of the registration network. To ensure robustness, the results of both the registration and segmentation networks were subjected to five-fold cross-validation.
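For reference, a resize of this kind can be performed with trilinear interpolation; the interpolation mode here is an assumption, as the paper does not specify it.

```python
import torch.nn.functional as F

def resize_volume(vol, size=(144, 224, 224)):
    """Resize a (B, C, D, H, W) MRI volume to the target grid."""
    return F.interpolate(vol, size=size, mode="trilinear", align_corners=False)
```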
All models used in this study were implemented using the PyTorch framework (34). The training and inference processes were conducted on an NVIDIA 3090Ti GPU (NVIDIA, Santa Clara, CA, USA). Initially, the
Baseline methods
ConvNXMorph was compared with six other registration models: two traditional registration algorithms and four DL-based registration algorithms. The traditional algorithms were advanced normalization tools (ANTs) [symmetric normalization (SyN)] and NiftyReg. The DL-based methods were VoxelMorph (10), SYMNet (17), ViT-V-Net (12), and TransMorph (14). To ensure consistency and fairness in the evaluation, the proposed ConvNXMorph and the comparison models VoxelMorph, SYMNet, ViT-V-Net, and TransMorph were configured with identical hyperparameters and dataset divisions. Additionally, the same nnUNet-v2 segmentation model was applied after registration for all methods.
Evaluation metrics
During the registration process, seven evaluation metrics were used. Firstly, the dice similarity coefficient (DSC) was utilized to calculate the overlap between the true stroke annotation and the annotated stroke lesion after registration. Secondly, the average inference time (in seconds) for two fixed images and one warped moving image was measured. Thirdly, the mean squared error (MSE) was used as a similarity metric. Fourthly, the NCC between the two fixed images and the warped moving image was calculated. Fifthly, the number of voxels with a non-positive Jacobian determinant ($\lvert J_\phi \rvert \le 0$) was counted. Sixthly, the percentage of such voxels among all voxels was computed. Seventhly, the mutual information (MI) between the fixed and warped images was measured.

The shorter the inference time, the faster the registration; the smaller the MSE and the number and percentage of $\lvert J_\phi \rvert \le 0$ voxels, the better the registration; the larger the DSC, NCC, and MI, the better the registration.
MSE is used to assess the pixel-level difference between the fixed images and the warped moving image. The definition of MSE is as follows:

$$MSE = \frac{1}{MNP}\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{k=1}^{P}\big(I_f(i,j,k) - I_w(i,j,k)\big)^2$$

where $M$ represents the image width, $N$ represents the image height, and $P$ represents the image length.
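In code this is a one-line reduction, e.g.:

```python
import torch

def mse_metric(fixed, warped):
    """Mean squared voxel-wise error over all M x N x P voxels."""
    return torch.mean((fixed - warped) ** 2)
```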
MI is a fundamental concept in information theory, intuitively measuring the statistical dependence between the intensity distributions of the fixed images and the warped moving image. The MI between two images $A$ and $B$ is defined as:

$$MI(A, B) = \sum_{a}\sum_{b} p(a, b)\,\log\frac{p(a, b)}{p(a)\,p(b)}$$

where $p(a, b)$ is the joint intensity distribution of $A$ and $B$, and $p(a)$ and $p(b)$ are the corresponding marginal distributions.
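A histogram-based estimate of MI can be sketched as follows; the bin count is an assumption.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """MI between two volumes from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(a)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(b)
    nz = pxy > 0                                 # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```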
The evaluation of the deformation field's regularity employed the criterion $\lvert J_\phi(p) \rvert \le 0$, where $J_\phi(p) = \nabla\phi(p)$ is the Jacobian matrix of the deformation field at voxel $p$. Voxels with a non-positive Jacobian determinant indicate local folding, i.e., locations where the deformation is not diffeomorphic; both their number and their percentage of all voxels are reported.
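A NumPy sketch of this check, assuming the convention $\phi = Id + u$ for a displacement field $u$:

```python
import numpy as np

def folded_voxels(flow):
    """Count voxels with a non-positive Jacobian determinant.
    flow: (3, D, H, W) displacement field (unit: voxels)."""
    grads = [np.gradient(flow[i]) for i in range(3)]      # du_i / d(z, y, x)
    J = np.stack([np.stack(g) for g in grads])            # (3, 3, D, H, W)
    J = J + np.eye(3)[:, :, None, None, None]             # Jacobian of id + u
    det = np.linalg.det(np.moveaxis(J, (0, 1), (-2, -1)))
    folded = det <= 0
    return int(folded.sum()), float(folded.mean() * 100)  # count and percentage
```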
In the task of MRI segmentation, the average DSC is widely adopted as the primary quantitative evaluation metric. Additionally, we employed three other metrics to assess the effectiveness of all segmentation methods, namely intersection over union (IoU), 95% Hausdorff distance (HD95), and sensitivity. The definitions of these metrics are as follows:

$$DSC = \frac{2TP}{2TP + FP + FN}, \qquad IoU = \frac{TP}{TP + FP + FN}, \qquad Sensitivity = \frac{TP}{TP + FN}$$

where TP, FP, and FN represent the number of true-positive, false-positive, and false-negative pixels, respectively. HD95 is the 95th percentile of the distances between the boundary points of the predicted segmentation and those of the ground truth, which suppresses the influence of outlier boundary points.
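The overlap metrics follow directly from these counts (HD95 is usually computed separately from boundary distance transforms). A minimal sketch:

```python
import numpy as np

def overlap_metrics(pred, gt):
    """DSC, IoU, and sensitivity from binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    sensitivity = tp / (tp + fn)
    return dsc, iou, sensitivity
```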
Statistical analysis
The distribution of data was assessed using the Kolmogorov-Smirnov test. For variables with a normal distribution, comparisons were made using the paired t-test. For variables not normally distributed, the Wilcoxon signed-rank test was used. All P values were from two-tailed tests, with statistical significance defined as a P value less than 0.05.
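A SciPy sketch of this procedure, applying the normality check to the paired differences (one reasonable reading of the protocol above):

```python
from scipy import stats
import numpy as np

def compare_paired(scores_a, scores_b, alpha=0.05):
    """Kolmogorov-Smirnov normality check on the paired differences, then a
    paired t-test if normal, otherwise a Wilcoxon signed-rank test.
    Returns the two-tailed P value."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    _, p_norm = stats.kstest(stats.zscore(diffs), "norm")
    if p_norm > alpha:
        _, p = stats.ttest_rel(scores_a, scores_b)
    else:
        _, p = stats.wilcoxon(scores_a, scores_b)
    return p  # significant if p < 0.05
```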
Results
Comparison of registration performance
This study conducted a quantitative evaluation of the registration models using seven registration metrics. As shown in Table 1, the inference time of the non-cascaded ConvNXMorph was the shortest, at 0.814 seconds, lower than that of the other unsupervised models, namely VoxelMorph, SYMNet, ViT-V-Net, and TransMorph. Notably, compared to the traditional registration methods, the inference time was greatly reduced: the non-cascaded ConvNXMorph required approximately 1/80 of the time of ANTs (SyN) and 1/76 of that of NiftyReg. This substantial reduction in registration inference time is crucial for clinical applications. Moreover, the cascaded ConvNXMorph improved the DSC by 25.31% compared to not using a registration algorithm and by 3.68% compared to the second-best TransMorph algorithm. Additionally, MSE is a robust registration metric, and both the cascaded and non-cascaded ConvNXMorph outperformed the existing registration models in terms of MSE, with the cascaded ConvNXMorph reaching (1.795±0.483)×10⁵.
Table 1
Method | Time (s) ↓ | DSC (%) ↑ | MSE (×10⁵) ↓ |
---|---|---|---|
– | – | 51.12±0.92 | 2.592±0.732 |
ANTs (SyN) | 65.912 | 67.71±1.03 | 2.172±0.594 |
NiftyReg | 62.143 | 67.24±0.84 | 2.189±0.643 |
VoxelMorph (2019, TMI) | 0.882 | 69.17±0.91 | 2.215±0.354 |
SYMNet (2020, CVPR) | 0.827 | 70.17±0.84 | 2.032±0.511 |
Vit-V-Net (2021, arXiv) | 1.043 | 72.14±1.21 | 2.047±0.954 |
TransMorph (2022, MIA) | 3.425 | 72.75±1.34 | 1.955±0.832 |
Non-c ConvNXMorph | 0.814 | 75.77±0.093 | 1.862±0.543 |
Cascaded ConvNXMorph | 1.611 | 76.43±1.15 | 1.795±0.483 |
–, no method used, meaning the time taken cannot be calculated. ↑, indicates that the higher the value, the better the registration effect; ↓, indicates that the lower the value, the better the registration effect. SD, standard deviation; DSC, Dice similarity coefficient; MSE, mean squared error; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
As depicted in Table 2, the non-cascaded ConvNXMorph achieved a significant improvement in the NCC metric, and the cascaded ConvNXMorph improved it further, reaching 1.879±0.051. The cascaded ConvNXMorph also achieved the best MI on the ISLES'22 dataset, at 3.693±0.772. Moreover, the cascaded ConvNXMorph reached the SOTA in the number and percentage of voxels with a non-positive Jacobian determinant ($\lvert J_\phi \rvert \le 0$), at 1,014±507 voxels (0.014%±0.007%), indicating the most regular deformation fields among all compared methods. Table 3 presents the P values for the comparisons between the cascaded ConvNXMorph and each baseline.
Table 2
Method | NCC ↑ | MI ↑ | $\lvert J_\phi \rvert \le 0$ (count) ↓ | $\lvert J_\phi \rvert \le 0$ (%) ↓ |
---|---|---|---|---|
– | 1.213±0.049 | 2.913±1.043 | 0 | 0 |
ANTs (SyN) | 1.619±0.030 | 3.154±0.987 | 1,098±324 | 0.015±0.004 |
NiftyReg | 1.632±0.045 | 3.214±1.127 | 4,752±1,432 | 0.066±0.020 |
VoxelMorph (2019, TMI) | 1.578±0.078 | 3.324±0.614 | 3,837±943 | 0.053±0.013 |
SYMNet (2020, CVPR) | 1.594±0.032 | 3.274±1.124 | 1,732±656 | 0.024±0.009 |
Vit-V-Net (2021, arXiv) | 1.613±0.079 | 3.389±0.917 | 5,132±3,243 | 0.071±0.045 |
TransMorph (2022, MIA) | 1.732±0.054 | 3.478±0.542 | 1,943±787 | 0.027±0.011 |
Non-c ConvNXMorph | 1.833±0.047 | 3.587±0.843 | 1,593±579 | 0.022±0.008 |
Cascaded ConvNXMorph | 1.879±0.051 | 3.693±0.772 | 1,014±507 | 0.014±0.007 |
–, no method used. ↑, indicates that the higher the value, the better the registration effect; ↓, indicates that the lower the value, the better the registration effect. SD, standard deviation; NCC, normalized cross-correlation; MI, mutual information; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
Table 3
Method | DSC | MSE | NCC | MI | $\lvert J_\phi \rvert \le 0$ |
---|---|---|---|---|---|
ANTs (SyN) | 1e−5 | 0.01 | 4e−5 | 0.02 | 0.04 |
NiftyReg | 2e−4 | 0.01 | 3e−4 | 0.03 | 4e−3 |
VoxelMorph | 2e−4 | 0.01 | 2e−4 | 0.02 | 3e−3 |
SYMNet | 1e−4 | 0.02 | 7e−6 | 0.03 | 0.04 |
Vit-V-Net | 8e−3 | 0.03 | 3e−3 | 0.03 | 0.02 |
TransMorph | 5e−3 | 0.04 | 0.01 | 0.02 | 0.04 |
P values (cascaded ConvNXMorph > baseline) are calculated by paired t-test, and all P values are significant. DSC, Dice similarity coefficient; MSE, mean squared error; NCC, normalized cross-correlation; MI, mutual information; ANTs, advanced normalization tools; SyN, symmetric normalization.
Finally, Table 4 summarizes the parameters and floating point operations (FLOPs) for all registration methods. Compared to TransMorph, the cascaded ConvNXMorph utilized only 10.01% of its parameters, and the FLOPs were reduced by 8.35%.
Table 4
Models | Params (M) | FLOPs (G) |
---|---|---|
VoxelMorph | 0.27 | 319.52 |
SYMNet | 1.10 | 339.76 |
Vit-V-Net | 31.33 | 351.30 |
TransMorph | 46.75 | 602.24 |
Non-c ConvNXMorph | 2.34 | 275.98 |
Cascaded ConvNXMorph | 4.68 | 551.96 |
Params, parameters; FLOPs, floating point operations; M, million; G, gigaflops; Non-c, non-cascaded.
Comparison of segmentation performance
Table 5 lists the nnUNet-v2 segmentation results based on different registration algorithms. The first two rows represent the DSC without using the FLAIR modality and without registration of the FLAIR modality, respectively. Subsequently, the remaining rows correspond to the use of ANTs (SyN), NiftyReg, VoxelMorph, SYMNet, Vit-V-Net, TransMorph, non-cascaded ConvNXMorph, and cascaded ConvNXMorph, respectively.
Table 5
Method | DSC (%) ↑ | IoU (%) ↑ | HD95 (mm) ↓ | Sensitivity (%) ↑ |
---|---|---|---|---|
Without FLAIR | 78.70±0.07 | 72.90±0.07 | 3.41±0.01 | 74.23±0.05 |
FLAIR non-registered | 78.54±0.09 | 72.78±0.02 | 3.37±0.03 | 74.15±0.03 |
ANTs (SyN) | 79.30±0.04 | 72.23±0.06 | 3.26±0.05 | 75.11±0.04 |
NiftyReg | 79.23±0.08 | 72.20±0.05 | 3.25±0.05 | 75.02±0.07 |
VoxelMorph (2019, TMI) | 79.47±0.02 | 72.97±0.03 | 3.20±0.06 | 75.35±0.02 |
SYMNet (2020, CVPR) | 79.53±0.05 | 73.17±0.04 | 3.14±0.04 | 75.98±0.05 |
Vit-V-Net (2021, arXiv) | 79.59±0.03 | 73.56±0.07 | 3.10±0.03 | 76.21±0.04 |
TransMorph (2022, MIA) | 79.64±0.04 | 74.17±0.08 | 3.05±0.02 | 77.11±0.06 |
Non-c ConvNXMorph | 79.95±0.06 | 74.98±0.03 | 3.03±0.07 | 78.12±0.03 |
Cascaded ConvNXMorph | 80.14±0.05 | 75.42±0.05 | 3.01±0.03 | 78.31±0.04 |
The best results are highlighted in bold on a gray background. ↑, indicates that the higher the value, the better the segmentation effect; ↓, indicates that the lower the value, the better the segmentation effect. SD, standard deviation; DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; FLAIR, fluid-attenuated inversion recovery; ANTs, advanced normalization tools; SyN, symmetric normalization; TMI, IEEE Transactions on Medical Imaging; CVPR, Computer Vision and Pattern Recognition; MIA, Medical Image Analysis; Non-c, non-cascaded.
When only the DWI and ADC modalities were considered, the DSC was 78.70%. The introduction of the unregistered FLAIR modality did not improve the segmentation results; on the contrary, the DSC decreased by 0.16%. This observation suggests that feeding unaligned multimodal images into the segmentation network not only fails to enhance segmentation performance but also causes the network to learn incorrect information due to the anatomical misalignment between modalities, thereby degrading segmentation performance.
Meanwhile, compared to the unregistered method, the DSC of the ANTS (SyN), NiftyReg, VoxelMorph, SYMNet, Vit-V-Net, TransMorph, non-cascaded ConvNXMorph, and cascaded ConvNXMorph registration algorithms increased by 0.76%, 0.69%, 0.93%, 0.99%, 1.05%, 1.10%, 1.41%, and 1.60%, respectively. Compared to the sub-optimal TransMorph model, the cascaded ConvNXMorph demonstrated improvements in DSC, IoU, and sensitivity by 0.50%, 1.25%, and 1.2%, respectively, with a decrease in HD95 by 0.04 mm.
Table 6 shows the P values for these tests, and all statistical test results were very significant.
Table 6
Method | DSC | IoU | HD95 | Sensitivity |
---|---|---|---|---|
Without FLAIR | 1e−6 | 7e−7 | 2e−6 | 1e−8 |
FLAIR non-registered | 4e−6 | 1e−7 | 1e−5 | 4e−9 |
ANTs (SyN) | 7e−7 | 3e−7 | 6e−4 | 2e−9 |
NiftyReg | 2e−5 | 6e−8 | 1e−3 | 1e−7 |
VoxelMorph | 5e−6 | 2e−7 | 2e−3 | 1e−8 |
SYMNet | 1e−6 | 2e−7 | 5e−4 | 3e−8 |
Vit-V-Net | 2e−6 | 2e−6 | 6e−3 | 2e−8 |
TransMorph | 6e−6 | 2e−5 | 0.02 | 2e−6 |
P values (cascaded ConvNXMorph > baseline) are computed through paired t-tests, and all exhibit statistical significance. DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; FLAIR, fluid-attenuated inversion recovery; ANTs, advanced normalization tools; SyN, symmetric normalization.
To better illustrate the effectiveness of the St-RegSeg algorithm, we compared our method with five SOTA single-modality segmentation methods, including one general CNN-based method (nnUnet) and four stroke lesion segmentation algorithms [HCSNet (37), SEAN (38), FRPNet (39), SrSNet (40)]. As shown in Table 7, St-RegSeg achieved the best performance across all four metrics and significantly outperformed the other segmentation algorithms. Compared to HCSNet (2023, JBHI), St-RegSeg improved the DSC, IoU, and sensitivity by 31.41%, 30.25%, and 36.16%, respectively, and reduced the HD95 by 3.92 mm. In comparison with the second-best network, SrSNet (2024, ESWA), St-RegSeg improved the DSC, IoU, and sensitivity by 1.10%, 1.11%, and 1.58%, respectively, and decreased the HD95 by 0.19 mm, demonstrating its superior segmentation performance. Table 8 shows the P values for these tests; all results were statistically significant.
Table 7
Method | DSC (%) ↑ | IoU (%) ↑ | HD95 (mm) ↓ | Sensitivity (%) ↑ |
---|---|---|---|---|
HCSNet (2023, JBHI) | 48.73±0.12 | 45.17±0.13 | 6.93±0.07 | 42.15±0.14 |
FRPNet (2024, CIBM) | 61.77±0.17 | 58.77±0.06 | 5.65±0.09 | 59.86±0.13 |
SEAN (2021, MICCAI) | 66.14±0.09 | 63.15±0.07 | 4.94±0.11 | 63.98±0.16 |
nnUnet (2021, Nature Methods) | 78.70±0.07 | 72.90±0.07 | 3.41±0.01 | 74.23±0.05 |
SrSNet (2024, ESWA) | 79.04±0.11 | 74.31±0.12 | 3.20±0.09 | 76.73±0.07 |
St-RegSeg | 80.14±0.05 | 75.42±0.05 | 3.01±0.03 | 78.31±0.04 |
↑, indicates that the higher the value, the better the segmentation effect; ↓, indicates that the lower the value, the better the segmentation effect. SD, standard deviation; DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance; JBHI, IEEE Journal of Biomedical and Health Informatics; CIBM, Computers in Biology and Medicine; MICCAI, Medical Image Computing and Computer Assisted Intervention Society; ESWA, Expert Systems With Applications.
Table 8
Method | DSC | IoU | HD95 | Sensitivity |
---|---|---|---|---|
HCSNet | 7e−11 | 2e−10 | 1e−8 | 7e−11 |
FRPNet | 2e−9 | 5e−11 | 3e−7 | 2e−10 |
SEAN | 1e−10 | 1e−9 | 5e−6 | 1e−9 |
nnUnet | 5e−6 | 4e−6 | 5e−6 | 2e−8 |
SrSNet | 6e−5 | 5e−6 | 0.01 | 3e−7 |
P values (St-RegSeg > baseline) are calculated by paired t-test, and all P values are significant. DSC, Dice similarity coefficient; IoU, Intersection over Union; HD95, 95% Hausdorff distance.
Ablation study
Table 9 presents the ablation study conducted on the ISLES’22 dataset. In the ablation studies, we validated the effectiveness of the ConvNeXt-R module, triple-channel input, cascaded registration strategy, and AG mechanism. Using TransMorph as a baseline, components were gradually replaced and added to analyze the contribution of each component, and statistical analyses were conducted to verify the presence of significant differences. The baseline DSC was 79.64 (±0.04).
Table 9
Method | If1 (ADC) | If2 (DWI) | Coarse-Reg (NCC/MIND) | Fine-Reg (NCC/MIND) | AG | Backbone | DSC (%), mean ± SD |
---|---|---|---|---|---|---|---|
TransMorph | × | √ | NCC | × | × | Swin Transformer | 79.64±0.04 |
ConvNXMorph | × | √ | NCC | × | × | ConvNeXt-R | 79.81±0.05 |
ConvNXMorph | × | √ | MIND | × | × | ConvNeXt-R | 79.47±0.06 |
ConvNXMorph | × | √ | NCC | × | √ | ConvNeXt-R | 79.88±0.03 |
ConvNXMorph | √ | √ | NCC | × | √ | ConvNeXt-R | 79.95±0.06 |
Cascaded ConvNXMorph | √ | √ | NCC | NCC | √ | ConvNeXt-R | 79.96±0.07 |
Cascaded ConvNXMorph | √ | √ | NCC | MIND | √ | ConvNeXt-R | 80.14±0.05 |
ADC, apparent diffusion coefficient; DWI, diffusion-weighted imaging; NCC, normalized cross-correlation; MIND, Modality Independent Neighborhood Descriptor; AG, attention gate; DSC, Dice similarity coefficient; SD, standard deviation; If1, the first fixed image (ADC); If2, the second fixed image (DWI).
Initially, the introduction of the ConvNeXt-R module resulted in a 0.17% increase in DSC compared to the baseline (P=0.008), indicating that the incorporation of the ConvNeXt-R module is beneficial for the St-RegSeg framework. Subsequently, the MIND loss function was introduced, leading to a 0.34% decrease in DSC compared to the second ablation study. This decrease is speculated to be due to the MIND loss function’s focus on pixel-level information matching, which may not consider the overall image information as effectively as the NCC loss function.
In the fourth ablation study, the introduction of the AG mechanism resulted in a 0.07% increase in DSC compared to the second ablation study (P=0.034), validating the effectiveness of the AG mechanism. In the fifth ablation study, the introduction of the triple-channel input led to a further 0.07% increase in DSC compared to the fourth ablation study (P=0.001), confirming the effectiveness of the triple-channel mechanism.
In the sixth ablation study, a cascaded registration network strategy was employed, and the use of the NCC loss function in the fine registration phase did not result in significant improvement (P=0.542). The seventh ablation study used the MIND loss function in the fine registration phase, which led to a 0.19% increase in DSC compared to the fifth ablation experiment (P=0.003). The reason for this outcome is speculated to be that the model trained with the NCC loss function during the coarse registration phase had already reached its optimal state, and hence, further optimization using the NCC loss function did not significantly enhance the results. However, employing the MIND loss function allowed for further optimization of the segmentation effect.
Discussion
Qualitative analysis in the registration phase
The stroke lesion segmentation labels for ISLES'22 were annotated under the DWI and ADC modalities. To visualize the results, the labels were loaded onto the registered FLAIR modality. The qualitative results of the MRI registration are shown in Figure 7. Compared to the original FLAIR annotations, the ANTs (SyN), NiftyReg, VoxelMorph, SYMNet, ViT-V-Net, and TransMorph algorithms all improved the alignment between the lesion areas and the annotations. Among these, the coarse-registration-only ConvNXMorph appeared visually superior to the other six registration methods, accurately deforming the moving image, as evidenced by the boundaries of the stroke lesion. The cascaded ConvNXMorph further enhanced the deformation precision: its registration results were almost consistent with the fixed image in terms of the stroke lesion structure, surpassing the baseline methods.
Qualitative analysis in the segmentation phase
This study selected nnUNet-v2 as the segmentation model. Even without the use of the FLAIR modality, the network demonstrates good segmentation performance. To further enhance segmentation precision, an approach from the registration perspective was adopted, aligning different modalities before they are input into the segmentation model. This improvement ensures that the lesion areas contain precise information, thereby facilitating accurate segmentation.
Figure 8 displays examples of stroke lesion segmentation by non-cascaded ConvNXMorph, cascaded ConvNXMorph, and other unsupervised registration models. Compared to other methods, St-RegSeg has several advantages. Firstly, it can more precisely segment and locate small, medium, and large stroke lesions (as seen in Case 1, Case 3, and Case 2, respectively). Additionally, St-RegSeg also shows fine lesion segmentation edges (as illustrated in Case 3). Moreover, even in situations with multiple stroke lesions, St-RegSeg successfully captures more lesions and presents results closer to the annotations (as shown in Case 1 and Case 4). In summary, the St-RegSeg framework demonstrates strong capabilities in handling multiple stroke lesions and accurately segmenting lesion edges.
The contribution of DWI and ADC modalities to segmentation accuracy
In Figure 9, the value of
In this study, the DSC for
In this study, the segmentation accuracy was highest when
Conclusions
This study proposes the St-RegSeg framework, which integrates the unsupervised registration model ConvNXMorph with nnUNet-v2 for multimodal image registration and segmentation. The ConvNXMorph model introduces an AG mechanism, triple-channel input, a cascaded registration network, and uses the MIND multimodal loss function in the fine registration phase to achieve optimal registration and segmentation performance. ConvNXMorph also improves inference speed, significantly accelerating the process of multimodal medical image registration. Although the St-RegSeg framework is highly effective for multimodal medical image registration and segmentation, there is still room for further enhancement of segmentation outcomes. Specifically, within the St-RegSeg framework, nnUNet-v2 can be replaced with more advanced segmentation models such as nnFormer (41), STU-Net (42), or MedNeXt (43), thereby further improving the segmentation performance of this framework.
Acknowledgments
Funding: This work was supported in part by
Footnote
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-725/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Chen L, Bentley P, Rueckert D. Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks. Neuroimage Clin 2017;15:633-43. [Crossref] [PubMed]
- Zhu Z, He X, Qi G, Li Y, Cong B, Liu Y. Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI. Inf Fusion 2023;91:376-87.
- Beg MF, Miller MI, Trouvé A, Younes L. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int J Comput Vis 2005;61:139-57.
- Sokooti H, de Vos B, Berendsen F, Lelieveldt BPF, Išgum I, Staring M. Nonrigid Image Registration Using Multi-scale 3D Convolutional Neural Networks. Medical Image Computing and Computer Assisted Intervention − MICCAI 2017;1:232-9.
- Eppenhof K, Lafarge MW, Moeskops P, Veta M, Pluim J. Deformable image registration using convolutional neural networks. Medical Imaging 2018. doi: 10.1117/12.2292443.
- Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial Transformer Networks. Adv Neural Inf Process Syst 2015;28:2017-25.
- de Vos BD, Berendsen FF, Viergever MA, Staring M, Išgum I. End-to-end unsupervised deformable image registration with a convolutional neural network. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA ML-CDS 2017 2017. Lecture Notes in Computer Science, Springer, Cham; 2017:10553.
- Sheikhjafari A, Punithakumar K, Ray N. Unsupervised Deformable Image Registration with Fully Connected Generative Neural Network. Midl; 2018:1-9.
- Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans Med Imaging 2019; Epub ahead of print. [Crossref]
- Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV. An Unsupervised Learning Model for Deformable Medical Image Registration. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2018:9252-60.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020. arXiv:2010.11929.
- Chen J, He Y, Frey EC, Li Y, Du Y. ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration. arXiv 2021. arXiv:2104.06468.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021). IEEE, New York, 2021:9992-10002.
- Chen J, Frey EC, He Y, Segars WP, Li Y, Du Y. TransMorph: Transformer for unsupervised medical image registration. Med Image Anal 2022;82:102615. [Crossref] [PubMed]
- Zhu Y, Lu S. Swin-VoxelMorph: A Symmetric Unsupervised Learning Model for Deformable Medical Image Registration Using Swin Transformer. Springer Nature Switzerland; 2022:78-87.
- Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In: Karlinsky L, Michaeli T, Nishino K, editors. Computer Vision – ECCV 2022 Workshops. Lecture Notes in Computer Science; ECCV 2022; vol 13803.
- Mok TCW, Chung ACS. Fast symmetric diffeomorphic image registration with convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020:4644-53.
- Kim B, Kim DH, Park SH, Kim J, Lee JG, Ye JC. CycleMorph: Cycle consistent unsupervised deformable image registration. Med Image Anal 2021;71:102036. [Crossref] [PubMed]
- Zhao S, Lau T, Luo J, Chang EI, Xu Y. Unsupervised 3D End-to-End Medical Image Registration With Volume Tweening Network. IEEE J Biomed Health Inform 2020;24:1394-404. [Crossref] [PubMed]
- Cheng Z, Guo K, Wu C, Shen J, Qu L. U-Net cascaded with dilated convolution for medical image registration. Chinese Automation Congress (CAC) (Hangzhou: IEEE); 2019:3647-51.
- Zhao S, Dong Y, Chang E, Xu Y. Recursive cascaded networks for unsupervised medical image registration. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019:10600-10.
- Ma L, Chi W, Morgan HE, Lin MH, Chen M, Sher D, Moon D, Vo DT, Avkshtol V, Lu W, Gu X. Registration-guided deep learning image segmentation for cone beam CT-based online adaptive radiotherapy. Med Phys 2022;49:5304-16. [Crossref] [PubMed]
- Cabezas M, Oliver A, Lladó X, Freixenet J, Cuadra MB. A review of atlas-based segmentation for magnetic resonance brain images. Comput Methods Programs Biomed 2011;104:e158-77. [Crossref] [PubMed]
- Wu Z, Zhang X, Li F, Wang S, Huang L, Li J. W-Net: A boundary-enhanced segmentation network for stroke lesions. Expert Syst Appl 2023. doi: 10.1016/j.eswa.2023.120637.
- Zhu W, Myronenko A, Xu Z, Li W, Roth H, Huang Y, Milletari F, Xu D. NeurReg: Neural registration and its application to image segmentation. IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV); 2020:3606-15.
- Li J, Wang C, Huang B, Zhou Z. ConvNeXt-backbone HoVerNet for Nuclei Segmentation and Classification. arXiv 2022. arXiv:2202.13560.
- Fan S, Liang W, Ding D, Yu H. LACN: A lightweight attention-guided ConvNeXt network for low-light image enhancement. Eng Appl Artif Intell 2023;117:105632.
- Jiang Y, Yu J, Yang W, Zhang B, Wang Y. Nextformer: a convnext augmented conformer for end-to-end speech recognition. arXiv 2022. arXiv:2206.14747.
- Yu W, Zhou P, Yan S, Wang X. InceptionNeXt: When Inception Meets ConvNeXt. Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition 2024, Seattle; 2024 June 17-21.
- Hernandez Petzsche MR, de la Rosa E, Hanning U, Wiest R, Valenzuela W, Reyes M, et al. ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Sci Data 2022;9:762. [Crossref] [PubMed]
- Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
- Avants BB, Epstein CL, Grossman M, Gee JC. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med Image Anal 2008;12:26-41. [Crossref] [PubMed]
- Heinrich MP, Jenkinson M, Bhushan M, Matin T, Gleeson FV, Brady SM, Schnabel JA. MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Med Image Anal 2012;16:1423-35. [Crossref] [PubMed]
- Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019. arXiv:1912.01703.
- Pluim JP, Maintz JB, Viergever MA. Mutual-information-based registration of medical images: a survey. IEEE Trans Med Imaging 2003;22:986-1004. [Crossref] [PubMed]
- Ashburner J. A fast diffeomorphic image registration algorithm. Neuroimage 2007;38:95-113. [Crossref] [PubMed]
- Liu L, Chang J, Liu Z, Zhang P, Xu X, Shang H. Hybrid Contextual Semantic Network for Accurate Segmentation and Detection of Small-Size Stroke Lesions From MRI. IEEE J Biomed Health Inform 2023;27:4062-73. [Crossref] [PubMed]
- Liang K, Han K, Li X, Cheng X, Li Y, Wang Y, Yu Y. Symmetry-Enhanced Attention Network for Acute Ischemic Infarct Segmentation with Non-contrast CT Images. Springer International Publishing. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021:432-41.
- Wu Z, Zhang X, Li F, Wang S, Li J. A feature-enhanced network for stroke lesion segmentation from brain MRI images. Comput Biol Med 2024;174:108326. [Crossref] [PubMed]
- Li T, An X, Di Y, Gui C, Yan Y, Liu S, Ming D. SrSNet: Accurate segmentation of stroke lesions by a two-stage segmentation framework with asymmetry information. Expert Syst Appl 2024;254:124329.
- Zhou HY, Guo J, Zhang Y, Yu L, Wang L, Yu Y. nnFormer: Interleaved Transformer for Volumetric Segmentation. arXiv 2021. arXiv:2109.03201.
- Huang Z, Wang H, Deng Z, Ye J, Su Y, Sun H, He J, Gu Y, Gu L, Zhang S, Qiao Y. STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training. arXiv 2023. arXiv:2304.06716.
- Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jaeger PF, Maier-Hein K. MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation. arXiv 2023. arXiv:2303.09975.