Multimodal magnetic resonance imaging synthesis via cross-modal feature fusion and hierarchical feature extraction
Introduction
Magnetic resonance imaging (MRI) excites hydrogen protons in the body using a powerful magnetic field and radiofrequency (RF) pulses, detecting the electromagnetic signals emitted by these excited nuclei to generate detailed images of internal structures. The non-ionizing MRI technique provides high spatial resolution and superior soft-tissue contrast (1), making it widely used for imaging areas such as the brain, spine, joints, and internal organs. It produces images with multiple contrasts, such as T1-weighted, T2-weighted, and fluid attenuated inversion recovery (FLAIR), each produced using distinct pulse sequences that emphasize particular tissue characteristics. T1-weighted images clearly depict anatomical features, T2-weighted images emphasize fluids and pathological changes, and FLAIR images highlight lesions by suppressing the signal from fluids. This multi-contrast imaging allows lesions to be assessed comprehensively from different perspectives, offering richer information for clinical diagnosis.
In clinical practice, multimodal MRI acquisition faces several challenges, including long scanning times, motion-induced artifacts, variations in acquisition protocols across institutions, and high resource consumption. These issues not only increase the burden on patients but may also affect image quality and the consistency of cross-institutional studies. As a result, multimodal MRI synthesis has become increasingly necessary. Using synthesis techniques, missing or artifact-laden modalities can be generated from existing ones, reducing scanning time, mitigating the effects of motion artifacts, and addressing inter-institutional variability in protocols. Additionally, synthesized multimodal images give doctors more detailed information, which enhances clinical judgment and improves diagnostic precision.
Traditional multimodal image synthesis methods typically rely on physically or statistically based image reconstruction techniques. These methods achieve image transformation by incorporating mapping relationships between modalities, informed by the physical principles of imaging. Alignment-based methods synthesize images by applying geometric transformations, assuming that different modalities share the same underlying anatomical structure (2). These methods compute the spatial deviations between the input images of the source and target domains and apply these deviations to generate synthesized images. Intensity-based transformation methods (3), on the other hand, move away from the geometric constraints of alignment approaches and focus on intensity mapping between images, addressing inconsistencies during MRI acquisition. These techniques use linear or nonlinear mappings to discover the connection between input and output images. In linear mapping, patches of the input image are represented as a sparse linear combination of patches from both the source and target domains. Nonlinear mappings, often implemented through neural networks, are more prevalent as they capture complex relationships between images (4). Additionally, cross-modal joint dictionaries can be constructed, using regularization to maintain the geometric structure of images. Although this method preserves spatial consistency and anatomical structures, traditional alignment-based synthesis approaches have shown accuracy limitations in practice. Jog et al. (5) introduced a random forest regression-based method for MRI synthesis, significantly improving the preservation of structural detail by learning complex mapping relationships between modalities. This method has demonstrated strong performance in generating synthetic MRI images, enhancing diagnostic accuracy. Nguyen et al. (6) proposed a location-sensitive deep network for cross-domain medical image synthesis, focusing on spatial information to enhance the quality of synthesized images. Their results showed significant improvements in transforming images between different modalities, boosting both accuracy and image quality.
While these traditional and early learning-based methods laid the foundation for multimodal MRI synthesis, they still suffer from limited accuracy and generalization, motivating the rapid shift toward deep learning approaches. Deep learning techniques have shown remarkable results in multimodal image synthesis (7) in recent years, effectively modeling complex nonlinear relationships and generating high-quality synthetic images. Generative adversarial networks (GANs), introduced by Goodfellow et al. (8), produce realistic images across a variety of fields through adversarial training (9) between a discriminator and a generator (10), pioneering a new direction in generative modeling. By adding conditional inputs (11), conditional GANs (cGANs) extend the conventional GAN architecture and enable the generation of specific kinds of samples. By incorporating conditional information into the generation process, cGANs greatly enhance the controllability and diversity of generated images, making them suitable for tasks such as image generation, translation, and completion. Wasserstein GAN (12) improves GAN training stability and generation quality by introducing the Wasserstein distance, optimizing a continuous cost function, and addressing common issues such as mode collapse and instability, leading to better convergence and performance. Wang et al. (13) introduced a perceptual adversarial network that combines perceptual loss with adversarial loss to enhance the visual quality and detail retention of generated images; it has been applied to a number of image-to-image translation tasks, maximizing perceptual similarity in image generation. InjectionGAN (14) investigates a unified many-to-many mapping technique for a range of image translation problems.
By introducing a novel network structure, it efficiently performs inter-domain conversions, enhancing the flexibility and adaptability of image translation. MedGAN (15) greatly increases the precision and efficacy of medical image analysis by utilizing the adversarial training process of GANs to accomplish high-quality image synthesis between various modalities. Hi-Net (16), a multimodal MRI hybrid fusion network, exploits correlations between different modalities, employing various fusion strategies to capture and hierarchically integrate unimodal features. Yang et al. (17) employed cGANs to transform a given MRI modality into a target modality, using a framework that addresses complex brain structures by integrating low-level features and high-level representations across modalities. Transformers have been applied to a number of medical imaging tasks, including image segmentation (18), image registration (19), and image synthesis (20), leveraging their strengths in global feature modeling. Zhang et al. (21) developed a pyramid transformer network (PTNet) to synthesize infant MRI images, utilizing Transformer layers and multi-scale pyramid representations to enhance the stability of adversarial training. T2Net (22) performs joint MRI reconstruction and super-resolution, producing motion-artifact-free, high-quality, super-resolved images even from highly undersampled and degraded MRI data. Residual vision transformers (ResViT) (23) introduce a central bottleneck of aggregated residual transformer (ART) blocks, significantly enhancing the synthesis quality of multi-contrast MRI and of computed tomography (CT) images derived from MRI. Diffusion models (24) have also emerged as a potent generative framework, demonstrating significant potential across various image synthesis tasks (25), including applications in medical imaging.
In this paper, we propose a novel multimodal MRI image synthesis framework that combines a transformer-based global contextual modeling branch and a ResOctaveBlock-based multi-frequency local feature extraction branch in a unified structure, termed the Transformer-ResOctaveBlock (TROB). This hybrid architecture enables comprehensive feature representations by capturing both global dependencies and fine-grained local details. In addition, we introduce an attention-based feature fusion module to facilitate cross-modal feature interaction, ensuring effective alignment and integration of multimodal information. Extensive experiments on datasets demonstrate the superior performance of our method in terms of structural consistency and fine-detail preservation.
Methods
This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Network architecture
The network synthesizes a missing target MRI modality from the available input modalities rather than inpainting within a single modality, and it is composed of three main modules: the feature extraction module, the feature fusion module, and the image generation module, which is illustrated in Figure 1. The network receives MRI multimodal images as input, aiming to synthesize high-quality missing images. Initially, the input images undergo feature extraction at various scales and levels via the feature extraction network. Subsequently, the feature fusion network integrates these extracted features, optimizing the utilization of multimodal information. Ultimately, the image generation network employs the integrated features to produce the final target image.
Feature extraction module
The feature extraction module is designed to progressively capture multi-scale features from the input multimodal MRI images, enabling the modeling of global contextual information as well as multi-frequency local details. This module consists of three downsampling stages, each comprising two consecutive convolutional structures.
The first convolutional structure employs a standard 3×3 convolution operation, combined with batch normalization and a rectified linear unit (ReLU) activation function, to extract local spatial features and suppress noise. Given an input feature map, the output of this block can be formally represented, but since the operation is standard, we omit detailed equations here for brevity.
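The omitted block can be sketched as follows; this is a minimal PyTorch rendering of the stated 3×3 convolution, batch normalization, and ReLU sequence, with channel counts left as parameters:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Sketch of the first convolutional structure described above:
    a standard 3x3 convolution followed by batch normalization and a
    ReLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```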
The second convolutional structure incorporates the TROB, which serves as the core of our feature extraction design. TROB integrates global contextual modeling and multi-frequency local feature extraction by combining two parallel branches: a Transformer branch and a ResOctaveBlock branch (Figure 2).
Transformer branch
The input feature map is first converted into a sequence through patch embedding, enabling global modeling. Multi-head self-attention and a feed-forward network are then applied to capture long-range dependencies across the entire image. The output is reshaped back into spatial dimensions, providing globally consistent features that preserve anatomical structures.
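The branch above can be sketched in PyTorch as follows; the patch size, embedding dimension, head count, and layer depth are illustrative assumptions rather than the paper's actual settings:

```python
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    """Sketch of the Transformer branch: patch embedding, multi-head
    self-attention with a feed-forward network, then a reshape back to
    a spatial feature map of the original resolution."""

    def __init__(self, channels=64, patch_size=4, num_heads=4):
        super().__init__()
        dim = channels * patch_size * patch_size
        # Patch embedding implemented as a strided convolution.
        self.embed = nn.Conv2d(channels, dim, kernel_size=patch_size,
                               stride=patch_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
        # Fold patch tokens back to the original spatial resolution.
        self.unembed = nn.ConvTranspose2d(dim, channels,
                                          kernel_size=patch_size,
                                          stride=patch_size)

    def forward(self, x):
        tokens = self.embed(x)                   # (B, D, H/p, W/p)
        b, d, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)  # (B, HW/p^2, D)
        seq = self.encoder(seq)                  # global self-attention
        tokens = seq.transpose(1, 2).reshape(b, d, h, w)
        return self.unembed(tokens)              # (B, C, H, W)
```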
ResOctaveBlock branch
This branch is based on multi-frequency convolution. The input feature map is decomposed into high- and low-frequency components, which are processed independently through dedicated convolutional operations. To enhance detail sensitivity, a frequency-doubling convolution is incorporated, expanding the receptive field while amplifying boundary and texture patterns such as lesion edges. The low-frequency map is upsampled and fused with the high-frequency counterpart through weighted addition, followed by a residual connection to improve stability.
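A minimal sketch of this multi-frequency branch is given below, assuming an even channel split between the high- and low-frequency paths and omitting the frequency-doubling convolution; layer sizes and the fusion weight are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResOctaveBlock(nn.Module):
    """Sketch of the ResOctaveBlock branch: the input is split into
    high- and low-frequency components (the low-frequency path runs at
    half resolution), each is convolved separately, the low-frequency
    map is upsampled and fused with the high-frequency map by weighted
    addition, and a residual connection stabilizes the block."""

    def __init__(self, channels=64, alpha=0.5):
        super().__init__()
        self.c_low = int(channels * alpha)
        self.c_high = channels - self.c_low
        self.conv_high = nn.Conv2d(self.c_high, channels, 3, padding=1)
        self.conv_low = nn.Conv2d(self.c_low, channels, 3, padding=1)

    def forward(self, x):
        x_high = x[:, :self.c_high]
        # Low-frequency component: downsample to half resolution.
        x_low = F.avg_pool2d(x[:, self.c_high:], 2)
        y_high = self.conv_high(x_high)
        y_low = self.conv_low(x_low)
        # Upsample the low-frequency map and fuse by weighted addition.
        y_low = F.interpolate(y_low, size=y_high.shape[-2:], mode='nearest')
        return x + 0.5 * (y_high + y_low)   # residual connection
```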
Finally, the outputs from the two branches are concatenated along the channel dimension and passed through a 1×1 convolution to fuse global and local information into a unified representation. By embedding the TROB in each downsampling stage, the feature extraction module progressively captures not only multi-scale local features but also global context and fine-grained multi-frequency details, resulting in a comprehensive and robust feature representation for subsequent multimodal synthesis.
Moreover, the reconstruction loss function in the module is:

L_rec = Σ_i ||X̂_i - X_i||_1 [1]

where X̂_i denotes the reconstructed image of the input modality X_i. The L1 loss is employed to quantify the variation between the reconstructed and original images. The loss ensures the completeness and interpretability of modal features by requiring the network to extract sufficient information from each modality to reconstruct its original image, while enhancing multimodal fusion effectiveness to prevent information loss or overfitting.
Feature fusion module
The feature fusion module is crucial for multimodal image synthesis, effectively integrating features from different modalities to generate high-quality output images by utilizing complementary information. The proposed feature fusion module aims to maximize the retention of feature information from each modality and enhance the accuracy and detail of the final synthesized image through a well-designed fusion strategy.
Current multimodal fusion techniques mostly operate at two levels. Early fusion directly stacks raw data from the various modalities into a single input for the deep network, maximizing the use of complementary information but potentially introducing noise. Late fusion, by contrast, may not fully exploit the low-level information shared between modalities. An attention fusion block (AFB) is designed to flexibly weight the different inputs in order to improve overall performance, learn from the different modalities, and better capture inter-modal correlations. As shown in Figure 1, each AFB receives feature representations from the first pooling layer of each feature extraction network. The feature representations from the second layer are then combined with the output of this AFB and fed into the next AFB. This process results in three AFB modules within the fusion network.
In the AFB module, the attention weights α_i are computed for each modality's feature map F_i as in Eq. [2]:

α_i = softmax(W_a F_i) [2]

where W_a stands for the attention weights' learnable parameters and the softmax function ensures that the weights sum to 1 across modalities.

Subsequently, the fused feature map is obtained through the weighted summation of each modality:

F_fusion = Σ_i α_i ⊙ F_i [3]

where ⊙ denotes element-wise multiplication.
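The attention weighting and weighted summation described above can be sketched as follows; the 1×1 scoring convolutions standing in for the learned parameters W_a are an illustrative assumption:

```python
import torch
import torch.nn as nn

class AttentionFusionBlock(nn.Module):
    """Sketch of the AFB: a learned projection scores each modality's
    feature map, softmax normalizes the scores across modalities, and
    the fused map is the weighted sum of the modality features."""

    def __init__(self, channels=64, num_modalities=2):
        super().__init__()
        # One 1x1 scoring convolution per modality (stand-in for W_a).
        self.score = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1)
             for _ in range(num_modalities)])

    def forward(self, features):
        # features: list of (B, C, H, W) maps, one per modality.
        scores = torch.stack(
            [s(f) for s, f in zip(self.score, features)], dim=0)
        weights = torch.softmax(scores, dim=0)  # sums to 1 over modalities
        fused = sum(w * f for w, f in zip(weights, features))
        return fused
```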
Image generation module
The fused features F_fusion are converted into the final missing images by an image generation network based on a classical GAN model. The GAN is made up of a discriminator and a generator. The generator G is tasked with creating an image that resembles the target modality. Consequently, the objective function of the generator is:

L_G = E[log(1 - D(Ŷ))] + E[||Y - Ŷ||_1] [4]

where Ŷ = G(F_fusion) represents the generated image and Y denotes the real image.

Furthermore, the objective function of the discriminator D can be stated as:

L_D = -E[log D(Y)] - E[log(1 - D(Ŷ))] [5]

The overall objective function is defined as:

L_total = L_G + L_D + L_rec [6]
In this study, the three loss terms are equally weighted, following common practice in related works. Preliminary tests with manually adjusted weights showed no consistent improvements, while equal weighting led to stable convergence.
A two-dimensional (2D) image matching the generator's output in size is supplied to the discriminator. The discriminator's architecture comprises four convolutional layers, each followed by a LeakyReLU activation. A stride of two is applied to each of the four convolutional layers.
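The discriminator described above might be sketched as follows; the channel widths and the final 1×1 prediction layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of the discriminator: four stride-2 convolutional layers,
    each followed by LeakyReLU, then a 1x1 layer producing a real/fake
    score map."""

    def __init__(self, in_channels=1, base=64):
        super().__init__()
        layers = []
        ch_in = in_channels
        for ch_out in (base, base * 2, base * 4, base * 8):
            layers += [nn.Conv2d(ch_in, ch_out, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch_in = ch_out
        layers += [nn.Conv2d(ch_in, 1, kernel_size=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

With a 192×192 input, the four stride-2 layers reduce the spatial size to 12×12.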
Datasets
The 2018 Multimodal Brain Tumor Segmentation Challenge (BraTS2018) dataset was used to verify the efficacy of the proposed model (26). The dataset includes MRI scans of 285 patients from 19 institutions, covering both low-grade glioma (LGG) and glioblastoma (GBM). Each patient's scan contains co-registered MRI volumes of 240×240×155 voxels in four modalities: T1, T1c (contrast-enhanced), T2, and FLAIR. The T1, T2, and FLAIR images from these datasets are used in the experiments, and 2D slices are used to confirm the efficacy and stability of the proposed synthesis technique. The specific datasets are organized as follows.
Dataset of artifact-free images
T1, T2, and FLAIR images from the BraTS2018 dataset were used for the synthesis experiments on the clear dataset. Each 2D slice (240×240) is cropped to its core region of 192×192 to fit the input size of the network. To fully assess the generalization capacity of the proposed approach, the 285 subjects are randomly split into a training set of 228 subjects (80%) and a testing set of 57 subjects (20%). A fixed random seed was used to define this split once, and the identical partition was applied consistently across all methods and ablation studies, ensuring fair comparison and reproducibility without any resampling or re-splitting across experiments. The clear dataset consists of standard MRIs acquired under optimal conditions, free from artifacts, during both training and testing.
Dataset of simulated motion artifacts
To evaluate the robustness of the network against images containing motion artifacts, this study introduces a portion of MRI with simulated motion artifacts into the training set. Specifically, 10% of T1 images in the training data (from the 228 subjects) were simulated to contain motion artifacts. The simulation process was performed entirely in K-space to emulate translational motion during acquisition. First, artifact-free MRIs were transformed into K-space using the Fourier transform. Then, random row-wise phase shifts were applied along the phase-encoding direction to mimic patient displacement during scanning. Finally, the modified K-space data were converted back into the image domain using the inverse Fourier transform, yielding motion-corrupted images that closely resemble artifacts observed in practice.
According to MRI imaging principles, when an object moves during scan time, spatial misregistration occurs. Motion along the phase-encoding direction is particularly prone to introducing artifacts that appear as ghosting or repeated structures in the reconstructed images. By simulating random translational motion in K-space, the generated images accurately reproduce these effects, enabling the model to learn from both artifact-free and artifact-contaminated data. This strategy improves robustness for real-world clinical scenarios where patient motion is inevitable.
During MRI imaging, the signal in K-space can be expressed mathematically as follows:

S(k_x, k_y) = ∫∫ ρ(x, y) e^{-i2π(k_x x + k_y y)} dx dy [7]

where S(k_x, k_y) represents the signal in K-space, ρ(x, y) indicates the spatial distribution of the object, and k_x and k_y are the coordinates in the frequency and phase encoding directions, respectively.

When the imaging object is displaced during the scanning process, a corresponding phase offset is generated. Assuming the object undergoes translational motion in the phase encoding direction with a displacement of Δy, the corresponding phase encoding signal becomes:

S'(k_x, k_y) = S(k_x, k_y) e^{-i2πk_y Δy} [8]

Specifically, if the target is displaced by Δy along the phase encoding direction while scanning, a phase shift e^{-i2πk_y Δy} will be introduced in the K-space signal of the corresponding row (27). By applying a random offset to each row of data in K-space, the full K-space data of one slice containing motion artifacts can be obtained. The image reconstructed from the motion-corrupted data using the 2D inverse fast Fourier transform (FFT) will show motion artifacts.
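The simulation procedure above can be sketched with NumPy as follows; the maximum shift magnitude and the fraction of corrupted phase-encoding rows are illustrative assumptions:

```python
import numpy as np

def add_motion_artifact(image, max_shift=5.0, corrupted_frac=0.3, seed=0):
    """Sketch of the K-space motion simulation: transform the
    artifact-free 2D image to K-space, apply random per-row phase
    shifts exp(-i*2*pi*k_y*dy) along the phase-encoding direction to
    mimic translational motion, and reconstruct with the inverse FFT."""
    rng = np.random.default_rng(seed)
    kspace = np.fft.fftshift(np.fft.fft2(image))
    n_rows = kspace.shape[0]
    ky = np.fft.fftshift(np.fft.fftfreq(n_rows))  # phase-encoding coords
    rows = rng.choice(n_rows, int(corrupted_frac * n_rows), replace=False)
    for r in rows:
        dy = rng.uniform(-max_shift, max_shift)   # displacement in pixels
        kspace[r, :] *= np.exp(-2j * np.pi * ky[r] * dy)
    corrupted = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))
    return corrupted
```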
Experiments
Implementation details
All networks are trained with the Adam optimizer. The proposed model undergoes training for 100 epochs, starting with an initial learning rate of 0.0002 for the first 50 epochs, which is then linearly reduced to 0 over the remaining epochs. The implementation is carried out using PyTorch.
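The learning-rate schedule described above (a constant rate for the first 50 epochs, then linear decay to zero) might be implemented with a `LambdaLR` scheduler as sketched below:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def make_scheduler(optimizer, total_epochs=100, decay_start=50):
    """Constant learning rate for the first `decay_start` epochs,
    then linear decay to zero over the remaining epochs."""
    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0
        return 1.0 - (epoch - decay_start) / (total_epochs - decay_start)
    return LambdaLR(optimizer, lr_lambda=lr_lambda)
```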
Comparison methods and evaluation metrics
This study evaluates the proposed synthesis approach against the state-of-the-art cross-modal synthesis techniques Pix2pix (28), MM-Syns (29), Hi-Net (16), and ResViT (23) in order to confirm its efficacy. These techniques are summarized as follows. (I) Pix2pix learns the mapping relationship between paired samples to provide high-quality synthesis from the input to the target image. (II) MM-Syns combines several input modalities and learns a shared modality-invariant latent space, allowing it to handle missing data and improve synthesis accuracy. (III) Hi-Net proposes a hybrid fusion approach that combines multiple input modalities to generate an image. (IV) ResViT likewise fuses multiple input modalities, using a hybrid convolutional-transformer architecture built on aggregated residual transformer blocks. To ensure fair comparison, all baseline methods were implemented using their official codes and the default parameter settings reported in the original publications. All models were trained under the same dataset partition and hardware environment. The proposed framework was trained under identical conditions, with only minor adjustments to the learning rate and weight decay for stable convergence. Extensive hyperparameter tuning was intentionally avoided in order to maintain fairness across methods. For Pix2pix, this study adopts the 70×70 PatchGAN discriminator and the paired translation setup of the original formulation.
To quantitatively assess the performance of the proposed method, this study employs three standard evaluation metrics: peak signal-to-noise ratio (PSNR), Structural Similarity Index (SSIM), and Normalized Root Mean Square Error (NRMSE). These evaluation metrics are consistent with standard practice in BraTS benchmarking and are widely adopted in recent multimodal MRI synthesis studies. PSNR gauges the quality of the image’s reconstruction by comparing the synthesized and original images at the pixel level. The difference between the real and synthesized images decreases as the PSNR value increases. SSIM compares the brightness, contrast, and structural information of two images to determine how comparable they are and to estimate the perceptual quality of an image. Higher structural similarity between the two images is indicated by SSIM values nearer 1, which range from 0 to 1. NRMSE standardizes the root mean square error to quantify the discrepancy between the synthesized image and the real image; a lower NRMSE indicates a higher quality of the synthesized image.
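PSNR and NRMSE can be computed as sketched below (SSIM is typically taken from an existing implementation such as scikit-image's `structural_similarity`); note that the range-based NRMSE normalization used here is one common convention among several:

```python
import numpy as np

def psnr(ref, pred, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((ref - pred) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

def nrmse(ref, pred):
    """Root mean square error normalized by the reference intensity
    range; lower is better."""
    rmse = np.sqrt(np.mean((ref - pred) ** 2))
    return rmse / (ref.max() - ref.min())
```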
Results
FLAIR synthesis
The BraTS dataset is used to assess the effectiveness of synthesizing FLAIR images from T1 and T2, and Table 1 summarizes the quantitative comparison of the proposed approach with other state-of-the-art techniques. The proposed approach outperforms competing methods in all three metrics (PSNR, SSIM, and NRMSE), as shown in Table 1. These results demonstrate that our method provides considerable benefits in pixel-level image quality, successfully preserving structural elements and texture information.
Table 1
| Methods | PSNR ↑ | SSIM ↑ | NRMSE ↓ |
|---|---|---|---|
| Pix2pix (T2→FLAIR) | 27.93±1.22 | 0.81±0.12 | 0.21±0.54 |
| MM-Syns (T1 + T2→FLAIR) | 26.74±0.78 | 0.86±0.06 | 0.17±0.25 |
| Hi-Net (T1 + T2→FLAIR) | 27.75±2.56 | 0.82±0.08 | 0.19±0.17 |
| ResViT (T1 + T2→FLAIR) | 29.17±1.25 | 0.87±0.09 | 0.17±0.36 |
| Ours (T1 + T2→FLAIR)† | 32.61±1.31 | 0.90±0.07 | 0.15±0.35 |
Data are presented as mean ± standard deviation. †, results of the proposed method. ↑, indicates higher is better; ↓, indicates lower is better. Italicized values indicate the best performance among all the compared methods. FLAIR, fluid attenuated inversion recovery; NRMSE, normalized root mean square error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; T1, T1-weighted; T2, T2-weighted.
Figure 3 illustrates a qualitative comparison of the synthesized MRI images generated by Pix2pix, MM-Syns, Hi-Net, ResViT, our proposed method (ours), and the ground truth images. For qualitative comparison (Figure 3), representative cases were randomly selected from the test set across different patients, illustrating typical synthesis results. As shown, our method consistently achieves superior performance in terms of structural consistency and pathological detail preservation. In particular, compared with other methods, our results exhibit more accurate and complete lesion depiction (indicated by red arrows), closely matching the anatomical structures of the ground truth images. Notably, the images synthesized by our approach preserve fine-grained details and clear boundaries of the lesions, avoiding the over-smoothing and structural distortions observed in the results of Pix2pix and MM-Syns. While Hi-Net and ResViT produce more realistic appearances, their outputs still suffer from local blurriness and reduced contrast in pathological regions. In contrast, our method not only captures the global anatomical structures but also effectively restores the subtle local variations, providing more reliable and diagnostically valuable synthetic images. These results demonstrate the advantage of our proposed framework in integrating global contextual information and multi-frequency local details, leading to improved structural fidelity and clearer lesion delineation in synthesized multimodal MRI images.
Figure 4 shows the absolute error maps between the outputs of the different image generation methods and the real images. The closer the gray value is to 0, the smaller the error. Our method yields the smallest error, with no conspicuous differences from the real images.
To evaluate the robustness of the proposed method under artifact interference in multimodal MRI images, the following comparative experiments were conducted: artifact interference was introduced into T1 images, and its impact on the quality of synthesized FLAIR images was observed and compared with results from clear T1 inputs. Table 2 presents the quantitative results for clear and artifact inputs. The findings indicate that even with artifacts in the T1 input, the synthesized FLAIR images exhibit only a slight degradation compared to clear inputs, with PSNR decreasing from 32.61 to 30.02 (−2.59 dB) and SSIM decreasing from 0.90 to 0.88 (−0.02), while NRMSE increased from 0.15 to 0.18. Despite these changes, the overall quality remains high, demonstrating the network’s ability to effectively handle artifact interference.
Table 2
| Input type | PSNR ↑ | SSIM ↑ | NRMSE ↓ |
|---|---|---|---|
| Clear T1 input | 32.61±1.31 | 0.90±0.07 | 0.15±0.35 |
| Artifact T1 input | 30.02±1.68 | 0.88±0.09 | 0.18±0.45 |
Data are presented as mean ± standard deviation. ↑, indicates higher is better; ↓, indicates lower is better. FLAIR, fluid attenuated inversion recovery; NRMSE, normalized root mean square error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; T1, T1-weighted.
Figure 5 illustrates a qualitative comparison of FLAIR images generated from clear and artifact inputs. Despite the presence of significant motion artifacts in the input T1 images, the generated FLAIR images maintain the global structural consistency and preserve critical pathological features, closely resembling the target FLAIR images. These results confirm that the proposed method effectively utilizes multimodal information to compensate for single-modality artifact interference, exhibiting strong robustness.
Additional tasks
To further evaluate the generalizability and effectiveness of our proposed framework, we conducted experiments on two additional multimodal synthesis tasks: (I) synthesizing T2 images from T1 and FLAIR images, and (II) synthesizing T1 images from T2 and FLAIR images. Figures 6,7 present the qualitative comparison results of these two tasks, with red arrows highlighting lesion regions of interest. For the T1 and FLAIR to T2 synthesis task, our method effectively reconstructs fine-grained anatomical structures and lesion areas, with high consistency with the ground truth T2 images. The synthesized images exhibit clear lesion boundaries and preserve the overall structural integrity of brain tissues, demonstrating the robustness of our framework in learning cross-modality mappings. Similarly, in the synthesis of T1 images from T2 and FLAIR inputs, our method generates synthetic images that faithfully capture both global structures and subtle pathological variations. The results highlight the versatility of the proposed approach across different modality combinations, suggesting its potential for various clinical scenarios.
To complement the qualitative comparisons, we also conducted a quantitative evaluation on the two additional synthesis tasks. Tables 3,4 summarize the PSNR, SSIM, and NRMSE metrics for the synthesis of T2 images from T1 and FLAIR inputs, as well as for the synthesis of T1 images from T2 and FLAIR inputs. As shown in the tables, our proposed framework consistently outperforms the other methods in both tasks, achieving higher PSNR and SSIM scores and lower NRMSE values. These results further confirm the superior performance and generalizability of our approach across different modality combinations.
Table 3
| Methods | PSNR ↑ | SSIM ↑ | NRMSE ↓ |
|---|---|---|---|
| Pix2pix (T1→T2) | 27.52±1.12 | 0.82±0.13 | 0.21±0.48 |
| MM-Syns (T1 + FLAIR→T2) | 26.51±0.85 | 0.85±0.07 | 0.20±0.33 |
| Hi-Net (T1 + FLAIR→T2) | 27.62±2.40 | 0.82±0.11 | 0.19±0.18 |
| ResViT (T1 + FLAIR→T2) | 29.65±1.35 | 0.86±0.09 | 0.18±0.34 |
| Ours (T1 + FLAIR→T2)† | 31.46±1.25 | 0.89±0.06 | 0.16±0.33 |
Data are presented as mean ± standard deviation. †, results of the proposed method. ↑, indicates higher is better; ↓, indicates lower is better. Italicized values indicate the best performance among all the compared methods. BraTS2018, 2018 Multimodal Brain Tumor Segmentation Challenge; FLAIR, fluid attenuated inversion recovery; NRMSE, normalized root mean square error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; T1, T1-weighted; T2, T2-weighted.
Table 4
| Methods | PSNR ↑ | SSIM ↑ | NRMSE ↓ |
|---|---|---|---|
| Pix2pix (FLAIR→T1) | 28.20±1.15 | 0.85±0.10 | 0.19±0.56 |
| MM-Syns (T2 + FLAIR→T1) | 27.33±0.74 | 0.87±0.06 | 0.16±0.24 |
| Hi-Net (T2 + FLAIR→T1) | 28.13±2.51 | 0.84±0.07 | 0.18±0.16 |
| ResViT (T2 + FLAIR→T1) | 29.52±1.28 | 0.88±0.08 | 0.16±0.28 |
| Ours (T2 + FLAIR→T1)† | 30.58±1.28 | 0.89±0.06 | 0.15±0.32 |
Data are presented as mean ± standard deviation. †, results of the proposed method. ↑, indicates higher is better; ↓, indicates lower is better. Italicized values indicate the best performance among all the compared methods. BraTS2018, 2018 Multimodal Brain Tumor Segmentation Challenge; FLAIR, fluid attenuated inversion recovery; NRMSE, normalized root mean square error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; T1, T1-weighted; T2, T2-weighted.
Ablation study
To evaluate the contribution of each component in our proposed framework, we conducted an ablation study by progressively adding key modules and analyzing their effects on the generated images. Figure 8 shows the visual comparison results of different ablation settings, where the enlarged regions highlight the differences in structural fidelity and detail preservation. The quantitative evaluation results for these ablation settings are summarized in Table 5, further validating the contribution of each module to the overall performance.
Table 5
| Methods | PSNR ↑ | SSIM ↑ | NRMSE ↓ |
|---|---|---|---|
| Only convolution | 23.78±2.51 | 0.82±0.15 | 0.24±0.42 |
| Only the ResOctave module | 26.32±1.79 | 0.86±0.19 | 0.22±0.35 |
| Only the Transformer module | 28.48±1.65 | 0.86±0.09 | 0.19±0.23 |
| Complete structure† | 32.61±1.31 | 0.90±0.07 | 0.15±0.35 |
Data are presented as mean ± standard deviation. †, results of the proposed method. ↑, indicates higher is better; ↓, indicates lower is better. Italicized values indicate the best performance among all the compared methods. FLAIR, fluid attenuated inversion recovery; NRMSE, normalized root mean square error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; T1, T1-weighted; T2, T2-weighted.
The baseline model, which excludes the TROB, produces synthetic images that lack clear anatomical details and exhibit blurry lesion boundaries. When the TROB is integrated into the feature extraction module, there is a significant improvement in the global structural consistency and local contrast of the lesions. Furthermore, by incorporating the fusion module that leverages multi-modal information exchange, the generated images better capture cross-modal correlations, leading to improved lesion depiction and reduced artifacts. As shown in the enlarged regions, our final model, with both the TROB and the fusion module, consistently achieves the most accurate and fine-grained lesion representations, closely resembling the ground truth. The clear boundaries and rich local textures in the final outputs underscore the effectiveness of integrating global contextual modeling and multi-frequency local feature enhancement.
These results demonstrate the critical role of each module in our framework. In particular, the TROB module enhances both global context understanding and multi-frequency local details, while the fusion module ensures efficient cross-modal feature interaction. Together, they enable the generation of high-quality and diagnostically reliable synthetic MRI images.
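The multi-frequency decomposition underlying the ResOctave branch can be made concrete with a small numpy sketch. The actual module uses learned octave convolutions; the functions below (`octave_split`, `octave_merge` are illustrative names) only show the core idea of separating a feature map into a half-resolution low-frequency band and a full-resolution high-frequency residual.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling (downsample by a factor of 2)
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # Nearest-neighbour upsampling by a factor of 2
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def octave_split(x):
    """Split a feature map into a low-frequency component (half resolution)
    and a high-frequency residual, as in octave-style convolutions."""
    low = avg_pool2(x)
    high = x - upsample2(low)
    return high, low

def octave_merge(high, low):
    # Reassemble the full-resolution map from the two frequency bands
    return high + upsample2(low)
```

Because the high band is defined as the residual of the upsampled low band, `octave_merge(*octave_split(x))` reconstructs `x` exactly; in the learned module, each band is processed (and exchanged with the other) before merging.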
Discussion
In this work, we proposed a robust multimodal MRI synthesis framework that leverages the TROB for parallel global and local feature extraction and integrates an attention-based feature fusion module. The experimental results demonstrate the effectiveness of our approach in synthesizing high-quality missing-modality images, consistently outperforming existing methods in both quantitative metrics and qualitative comparisons.
The ablation study confirms that each module in our framework plays a crucial role in achieving superior performance. Specifically, the inclusion of the TROB module significantly improves global contextual understanding and multi-frequency local detail preservation, while the attention-based feature fusion module ensures effective cross-modal feature alignment and integration. Furthermore, the experiments involving simulated motion artifacts highlight the robustness of our model in challenging clinical scenarios, where input images may be degraded.
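The text does not detail the artifact-simulation procedure; a common approach for rigid translational motion (the mechanism addressed in reference 27) is to corrupt a subset of k-space phase-encode lines with the linear phase ramps that an in-plane shift would induce, via the Fourier shift theorem. A minimal sketch under that assumption, with illustrative parameters (`max_shift` in pixels, `corrupt_frac` the fraction of corrupted lines):

```python
import numpy as np

def add_translational_motion_artifact(img, max_shift=3.0, corrupt_frac=0.3, seed=0):
    """Simulate ghosting from rigid in-plane translation: a random subset of
    phase-encode lines in k-space acquires the linear phase ramp that a
    shifted object would produce (Fourier shift theorem)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    k = np.fft.fftshift(np.fft.fft2(img))
    rows = rng.choice(h, size=int(corrupt_frac * h), replace=False)
    fy_all = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    for r in rows:
        # Each corrupted line sees the object at a different random position
        dx, dy = rng.uniform(-max_shift, max_shift, size=2)
        k[r, :] *= np.exp(-2j * np.pi * (fy_all[r] * dy + fx * dx))
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k)))
```

With `corrupt_frac=0` the image passes through unchanged, which gives a convenient control condition when constructing degraded/clean training pairs.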
Our additional experiments synthesizing T2 and T1 images demonstrate the versatility of the proposed framework, indicating that the method is not limited to a specific synthesis task but can adapt to different modality combinations. Although BraTS2018 provides sufficient data for benchmarking, reliance on a single dataset remains a limitation. Future work will extend validation to newer BraTS editions and external datasets to further assess generalization and clinical relevance.
However, despite these promising results, several limitations should be noted. First, the dataset used in this study, although widely adopted, may not fully capture the diversity of real-world clinical data, potentially limiting generalization. Second, the current model focuses on 2D slices, and further exploration into three-dimensional volume synthesis could enhance spatial coherence in generated images. Finally, integrating clinical evaluation metrics beyond PSNR, SSIM, and NRMSE could provide a more comprehensive assessment of clinical relevance.
Conclusions
In this study, we presented a novel generative deep learning framework for multimodal MRI synthesis, combining a transformer-based global contextual modeling branch and a ResOctaveBlock-based multi-frequency local feature extraction branch within a unified TROB module. An attention-based feature fusion module further enhances cross-modal interactions, leading to more accurate and diagnostically relevant synthetic images.
Extensive experiments on the BraTS2018 dataset demonstrate that our approach outperforms state-of-the-art methods in terms of both structural fidelity and fine-detail preservation, even under simulated motion artifacts. Additional synthesis tasks of T1 and T2 images highlight the adaptability and generalizability of the proposed framework across different modality combinations.
Future work will focus on extending this approach to three-dimensional volume synthesis and validating its performance in real-world clinical workflows. We also plan to investigate the integration of clinical usability assessments to ensure that the generated images meet diagnostic standards.
Acknowledgments
None.
Footnote
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1265/coif). H.G. serves as a consultant for Midea Group (Shanghai) Co. Ltd. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Huang J, Zhang S, Metaxas D. Efficient MR image reconstruction for compressed MR imaging. Med Image Anal 2011;15:670-9. [Crossref] [PubMed]
- Tseng KL, Lin YL, Hsu W, Huang Y. Joint sequence learning and cross-modality convolution for 3D biomedical segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA; 2017:6393-400.
- Shin HC, Tenenholtz NA, Rogers JK, Schwarz CG, Senjem ML, Gunter JL, Andriole KP, Michalski M. Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In: Gooya A, Goksel O, Oguz I, Burgos N, editors. Simulation and Synthesis in Medical Imaging (SASHIMI 2018). Lecture Notes in Computer Science. Cham: Springer; 2018:1-11.
- Liu S, Thung KH, Lin W, Shen D, Yap PT. Hierarchical Nonlocal Residual Networks for Image Quality Assessment of Pediatric Diffusion MRI With Limited and Noisy Annotations. IEEE Trans Med Imaging 2020;39:3691-702. [Crossref] [PubMed]
- Jog A, Carass A, Roy S, Pham DL, Prince JL. Random forest regression for magnetic resonance image synthesis. Med Image Anal 2017;35:475-88. [Crossref] [PubMed]
- Nguyen HV, Zhou K, Vemulapalli R. Cross-domain synthesis of medical images using efficient location-sensitive deep network. In: Navab N, Hornegger J, Wells WM, Frangi AF, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science. Cham: Springer; 2015:677-84.
- Yu Y, Gong Z, Zhong P, Shan J. Unsupervised representation learning with deep convolutional neural network for remote sensing images. In: Zhang Y, Bai X, Zhao F, editors. Image and Graphics. ICIG 2017. Lecture Notes in Computer Science. Cham: Springer; 2017:97-108.
- Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in Neural Information Processing Systems. NeurIPS 2014. Montreal, Canada; 2014:2672-80.
- Iqbal T, Ali H. Generative Adversarial Network for Medical Images (MI-GAN). J Med Syst 2018;42:231. [Crossref] [PubMed]
- Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, Shi W. Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA; 2017:4681-90.
- Mirza M, Osindero S. Conditional generative adversarial nets. arXiv:1411.1784 [Preprint]. 2014. Available online: https://doi.org/10.48550/arXiv.1411.1784
- Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning. ICML 2017. Proceedings of Machine Learning Research; 2017:214-23.
- Wang C, Xu C, Wang C, Tao D. Perceptual Adversarial Networks for Image-to-Image Transformation. IEEE Trans Image Process 2018;27:4066-79. [Crossref] [PubMed]
- Xu W, Keshmiri S, Wang G. Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognit 2019;93:570-80.
- Armanious K, Jiang C, Fischer M, Küstner T, Hepp T, Nikolaou K, Gatidis S, Yang B. MedGAN: Medical image translation using GANs. Comput Med Imaging Graph 2020;79:101684. [Crossref] [PubMed]
- Zhou T, Fu H, Chen G, Shen J, Shao L. Hi-Net: Hybrid-Fusion Network for Multi-Modal MR Image Synthesis. IEEE Trans Med Imaging 2020;39:2772-81. [Crossref] [PubMed]
- Yang Q, Li N, Zhao Z, Fan X, Chang EI, Xu Y. MRI Cross-Modality Image-to-Image Translation. Sci Rep 2020;10:3753. [Crossref] [PubMed]
- Valanarasu JM, Oza P, Hacihaliloglu I, Patel VM. Medical transformer: gated axial-attention for medical image segmentation. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science. Cham: Springer; 2021:36-46.
- Chen J, He Y, Frey EC, Li Y, Du Y. ViT-V-Net: vision transformer for unsupervised volumetric medical image registration. arXiv:2104.06468 [Preprint]. 2021. Available online: https://doi.org/10.48550/arXiv.2104.06468
- Kamran SA, Hossain KF, Tavakkoli A, Zuckerbrod SL, Baker SA. VTGAN: semi-supervised retinal image synthesis and disease prediction using vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2021:3235-45.
- Zhang X, He X, Guo J, Ettehadi N, Aw N, Semanek D, Wang Y. PTNet: a high-resolution infant MRI synthesizer based on transformer. arXiv:2105.13993 [Preprint]. 2021. Available online: https://doi.org/10.48550/arXiv.2105.13993
- Feng CM, Yan Y, Fu H, Chen L, Xu Y. Task transformer network for joint MRI reconstruction and super-resolution. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science. Cham: Springer; 2021:307-17.
- Dalmaz O, Yurt M, Cukur T. ResViT: Residual Vision Transformers for Multimodal Medical Image Synthesis. IEEE Trans Med Imaging 2022;41:2598-614. [Crossref] [PubMed]
- Dorent R, Haouchine N, Kogl F, Joutard S, Juvekar P, Torio E, Golby A, Ourselin S, Frisken S, Vercauteren T, Kapur T, Wells WM 3rd. Unified Brain MR-Ultrasound Synthesis using Multi-Modal Hierarchical Representations. Med Image Comput Comput Assist Interv 2023;2023:448-58. [Crossref] [PubMed]
- Oh HJ, Jeong WK. DiffMix: diffusion model-based data synthesis for nuclei segmentation and classification in imbalanced pathology image datasets. In: de Bruijne M, Cattin PC, Cotin S, Padoy N, Speidel S, Zheng Y, Essert C, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2023. Lecture Notes in Computer Science. Cham: Springer; 2023:337-45.
- Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging 2015;34:1993-2024. [Crossref] [PubMed]
- Medley M, Yan H, Rosenfeld D. An improved algorithm for 2-D translational motion artifact correction. IEEE Trans Med Imaging 1991;10:548-53. [Crossref] [PubMed]
- Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). Honolulu, HI, USA; 2017:1125-34.
- Chartsias A, Joyce T, Giuffrida MV, Tsaftaris SA. Multimodal MR Synthesis via Modality-Invariant Latent Representation. IEEE Trans Med Imaging 2018;37:803-14. [Crossref] [PubMed]

