CSAFusion: a combined convolutional neural network (CNN) and Swin Transformer network for multi-modal medical image fusion
Introduction
Background
Synthesizing medical images from diverse modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and single-photon emission computed tomography (SPECT) combines their complementary strengths to enhance diagnostic accuracy. This strategy has demonstrated considerable potential in a range of clinical applications, including tumor detection (1), joint disease assessment (2), and endoscopic image fusion (3).
CT is particularly effective for differentiating tissues with varying densities, such as soft tissue and bone, and excels at depicting high-density structures including bone, calcifications, and acute hemorrhage (4). Although contrast agents can improve visualization of soft tissues, CT remains less effective than MRI for delineating soft-tissue structures such as cerebral white and gray matter. MRI reconstructs three-dimensional images via advanced computational methods and provides fine discrimination of subtle soft-tissue differences, including muscle, fat, white matter, and gray matter (5). By adjusting acquisition parameters such as T1, T2, and diffusion-weighted imaging, MRI can selectively emphasize different tissue characteristics. SPECT focuses on functional imaging, providing insight into the metabolic and functional status of organs and tissues such as the heart, brain, and skeleton (6). In contrast to CT and MRI, which primarily depict anatomy, SPECT emphasizes functional information and offers unique advantages in identifying functional abnormalities, including tumor metabolism and myocardial ischemia.
The functional information provided by SPECT and the anatomical detail provided by MRI can thus be synergistically fused, allowing clinicians to visualize metabolic activity and anatomical structures within a single composite image. However, substantial differences in contrast and grayscale distribution exist between SPECT and MRI. SPECT images typically exhibit lower spatial resolution and higher noise, which can induce information loss or blur, whereas MRI lacks direct visualization of functional or metabolic activity.
Similarly, fusing T1-weighted MRI, which highlights anatomical structure, with T2-weighted MRI, which provides high lesion-to-background contrast, enables more precise lesion localization and comprehensive evaluation. However, T1 and T2 sequences share partially redundant information, and mitigating redundancy during fusion remains challenging. Effective fusion must simultaneously preserve fine structural details and retain critical lesion information.
In addition, combining CT-derived bone information with MRI-derived soft-tissue and lesion detail permits simultaneous visualization of osseous and soft-tissue structures in a single image, which is particularly beneficial for complex pathologies. Yet, CT and MRI differ markedly in grayscale range, spatial resolution, and contrast: CT emphasizes high-density structures such as bone, whereas MRI accentuates low-density soft tissues. Balancing contributions from both modalities and avoiding dominance of one modality therefore represents another key challenge in fusion.
As a preprocessing step, we apply feature-/keypoint-based registration (7-9) to align multimodal images prior to fusion. Robust landmarks are detected and matched across modalities, outliers are removed, and an initial rigid/affine transformation is estimated. When necessary, a lightweight deformable refinement is added. This pipeline reduces ghosting, enforces structural correspondence, and improves the subsequent fusion quality.
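For illustration, the following is a minimal sketch of the feature-based initialization described above, using OpenCV's ORB detector, brute-force matching, and RANSAC-based affine estimation; the detector choice, match filtering, and thresholds are illustrative assumptions rather than the exact pipeline used in this work.

```python
import cv2
import numpy as np

def register_affine(fixed: np.ndarray, moving: np.ndarray) -> np.ndarray:
    """Estimate an affine transform from matched keypoints and warp `moving`.

    Sketch of the initialization stage only; a deformable refinement
    (e.g., SyN) would follow when necessary.
    """
    orb = cv2.ORB_create(nfeatures=2000)            # detect robust landmarks
    kp1, des1 = orb.detectAndCompute(fixed, None)
    kp2, des2 = orb.detectAndCompute(moving, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC discards outlier correspondences while fitting the affine model
    A, _inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    return cv2.warpAffine(moving, A, (fixed.shape[1], fixed.shape[0]))
```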
Before the widespread adoption of deep learning, multi-modal image fusion predominantly relied on traditional approaches such as pixel-based methods (10), transform-domain strategies (11), sparse representation (12), and hybrid schemes (13). These traditional methods (14) are attractive due to their conceptual simplicity, ease of implementation, and suitability for specific tasks. However, they face clear limitations, including suboptimal detail preservation, information loss, poor robustness to noise, and lack of semantic understanding. With advances in machine learning, deep learning-based methods have gradually superseded traditional approaches. Deep models provide powerful feature extraction, improved semantic representation, and more effective cross-modal information utilization. They eliminate the need for hand-crafted fusion rules, require training only for specific tasks, and offer greater flexibility for integrating multi-modal data.
More recently, deep learning-based fusion has become dominant by learning feature representations directly from data and avoiding complex hand-engineered decomposition and synthesis pipelines, thereby largely replacing traditional methods (14). Nonetheless, existing deep learning-based fusion approaches still encounter three important challenges: (I) imbalance between local [convolutional neural network (CNN)] and global (Transformer) feature extraction. Conventional CNN-based methods excel at capturing local detail, while Transformer-based methods, although capable of modeling long-range dependencies, can struggle to emphasize locally salient regions. (II) Information loss in multi-scale fusion. In multi-scale architectures, serial fusion across multiple branches can lead to progressive loss of fine texture details, and differences in contrast, dynamic range, and noise statistics between modalities further complicate fusion. (III) Suboptimal cross-modal feature weighting. Because different modalities contribute information of varying importance across regions, the fusion process requires adaptive, content-aware weighting of features from each modality.
To address the intrinsic difficulties of medical image fusion and the limitations of both traditional and deep learning-based paradigms, we propose a multi-modal fusion technique that combines adaptive convolution (AC) with the Swin Transformer (15,16). The network adopts a U-Net framework (17), enhancing feature extraction via a multi-branch architecture and progressively enlarging the receptive field to accommodate anatomical structures and multi-scale texture patterns of varying sizes. Within this architecture, we introduce AC and Swin Transformer modules. AC dynamically generates convolutional kernel weights conditioned on the input, enabling real-time modulation of filters and providing lightweight access to global contextual information. The Swin Transformer captures long-range dependencies through window multi-head self-attention (W-MSA) combined with shifted windows (SW-MSA) to establish cross-window connections; together with hierarchical patch merging, which progressively increases the receptive field, it effectively models global context and interdependencies among local features while maintaining computational efficiency. These two components are complementary and jointly enhance both local and global feature extraction.
In addition, we incorporate a multi-attention mechanism, including spatial attention (18) and squeeze-and-excitation (SE) channel attention (19), to account for varying feature importance across modalities and emphasize more informative channels and regions. We further design a dual fusion architecture comprising an encoder fusion module and a decoder fusion module. The decoder fusion module follows conventional fusion practice, integrating upsampled decoder features with skip-connected encoder features to restore spatial detail and reconstruct high-quality fused images. The encoder fusion module focuses on same-level feature fusion, integrating complementary information from parallel branches—such as edges and tissue structures—while reducing information loss through parallel multi-branch processing. The fusion layers adaptively combine features recalibrated by channel attention, thereby mitigating information attenuation in deeper layers.
Our main contributions can be summarized as follows:
- A novel architecture optimized for multi-modal medical image fusion. We propose a U-Net-derived framework with AC in the encoder and Swin Transformer modules in the decoder, enabling effective multi-resolution feature extraction and a balanced integration of local and global information.
- A multi-attention strategy for modality-aware feature enhancement. We introduce both spatial attention and SE-based channel attention to perform dynamic feature weighting and amplify clinically critical representations while suppressing redundant or noisy responses.
- A dual fusion mechanism for complementary information integration. We design encoder- and decoder-level fusion modules that adaptively combine complementary cross-modal features at multiple stages, improving structural preservation, reducing information loss, and yielding fused images better suited to clinical interpretation.
Related work
Traditional medical image fusion method
Multi-modal medical synthesis aims to consolidate data from heterogeneous imaging sources into a single representation, thereby combining the strengths of different imaging techniques. Traditional fusion methods primarily rely on mathematical transformations and classical image processing, achieving fusion by extracting and combining frequency-domain features—such as edges, textures, and intensity—from multiple modalities. As a representative family of approaches, the wavelet transform (WT) (20-23) characterizes local image features via multi-scale decomposition, making it well suited for fusing structural and textural information in medical images. For example, Basheer et al. (24) proposed a fusion method based on Discrete Wavelet Transform (DWT) and Stationary Wavelet Transform (SWT), incorporating Lucy-Richardson preprocessing to enhance image quality and optimizing approximate fusion coefficients through convolution. This strategy substantially improves the sharpness and informational richness of the fused images. However, DWT tends to lose fine details during decomposition and reconstruction and requires manual selection of convolution kernels. Although SWT alleviates some of these issues, it substantially increases computational complexity.
Beyond transform- and gradient-based strategies, traditional fusion pipelines typically presume accurate cross-modal alignment. In practice, marker-based or landmark-based registration—using external fiducial markers or anatomical/keypoint landmarks (e.g., vessel bifurcations, cortical sulci, or bony landmarks)—is often performed before fusion to enforce spatial correspondence, thereby reducing ghosting, edge duplication, and inconsistencies in pixel-, region-, and frequency-domain fusion. Nonetheless, its effectiveness depends on reliable marker placement and robust landmark detection. Manual or semi-automatic workflows introduce operator dependence and workflow burden, and rigid/affine alignment may be insufficient in the presence of organ deformation, necessitating deformable registration models that further increase computational cost.
Among gradient-based fusion methods, edge features extracted by operators such as Sobel filters are frequently employed to enhance fusion results. Li et al. (25) proposed a framework that combines Joint Bilateral Filtering (JBF) with Local Gradient Energy (LGE): input images are decomposed into structural and “energy” layers using JBF; structural layers are fused using tensor representations and neighborhood energy derived from LGE, whereas energy layers are fused using an ℓ1-maximization strategy; the final fused image is then obtained by recombining these layers. Although this approach performs well in edge preservation, it remains constrained by the intrinsic limitations of traditional methods. With the evolution of deep learning, such conventional techniques have been progressively supplanted by data-driven fusion strategies, which offer clear advantages in feature self-learning, cross-modal correlation modeling, and computational efficiency.
DL methods
Over the past few years, rapid advances in deep learning for computer vision have driven a shift from conventional techniques toward deep learning-based strategies in multi-modal image fusion. A pioneering study (26) introduced a medical image fusion approach based on CNNs, using a weight-sharing CNN architecture to generate weight maps that effectively combine pixel activity information from different modalities. Hou et al. (27) further proposed an unsupervised end-to-end deep learning framework that reconstructs fused images from learned fusion features via convolutional operations. However, purely deep learning-based methods still face challenges related to interpretability and stability. As a result, hybrid approaches that combine traditional techniques with deep learning have become an active research focus. For instance, Liu et al. (28) developed the multi-scale pyramid residual weighting network (LYWNet), which fuses high-frequency details and low-frequency structures through hierarchical residual blocks. Zhou et al. (29) proposed an edge-enhanced hybrid dilated residual attention network (EH-DRAN) that effectively extracts multi-scale, fine-grained features by combining dilated convolutions with residual attention modules and introducing gradient operators to strengthen edge-detail learning. A dual-discriminator conditional generative adversarial network (GAN) (30) was designed to generate realistic fused outputs with improved texture fidelity via adversarial generator-discriminator training. Cheng et al. (31) adopted a self-evolving training strategy with memory modules, using intermediate fusion results as pseudo supervision in unsupervised settings. Although CNN- and GAN-based methods obviate the need for hand-crafted fusion rules, they remain limited: CNNs emphasize local features and struggle to model long-range dependencies, whereas GANs are prone to artifacts and noise due to training instability and high complexity. Consequently, how to effectively integrate global and local information within deep learning frameworks, while simultaneously improving robustness and efficiency, remains a key research problem.
The success of Transformer architectures in natural language processing has spurred their adoption in image fusion networks. Transformer-based architectures are increasingly used for multi-modal fusion because they can jointly model local and global dependencies across modalities. Liu et al. (32) designed a hybrid CNN-Transformer network that captures both short-range and long-range relationships, thereby improving cross-modal feature extraction. Zhou et al. (33) proposed an unsupervised image fusion method built on densely connected high-resolution CNNs and hybrid Transformers, in which convolutional features are fed into fine-grained attention modules to generate global representations. Tang et al. (34) developed a multi-resolution adaptive Transformer framework that combines ACs and Transformer blocks for localized and global processing, respectively, and extracts complementary information at different scales through a multi-scale architecture.
Despite these advances, Transformer-based fusion methods still suffer from high computational complexity, especially for high-resolution images, since standard self-attention scales quadratically with the number of pixels. The Swin Transformer has been introduced to mitigate this problem by using a shifted window mechanism to substantially reduce computational cost. Ma et al. (35) proposed the SwinFusion framework, which leverages Swin Transformer-based long-range modeling for cross-domain learning and employs self-attention for intra-domain fusion and cross-attention for inter-domain fusion, thereby fully exploiting both local detail and global context, particularly long-range dependencies.
In parallel, auxiliary attention mechanisms have been widely incorporated into fusion networks. Di et al. (36) designed a model that combines shallow multi-scale convolutions with deep DenseNet features, enhanced by a refined efficient channel attention with spatial branch (ECA-S) module for joint channel-spatial attention. Cui et al. (37) introduced a detail-amplified channel attention mechanism to differentially emphasize luminance and gradient maps. Xie et al. (38) proposed a two-stage, end-to-end, unsupervised framework for multi-modal medical image fusion that couples a Swin Transformer, responsible for capturing long-range context, with a multi-scale CNN that refines local detail. A residual Swin-convolution fusion (RSCF) block performs learnable multi-scale feature fusion, and an adaptive weighting (AW) module embedded in the loss function dynamically determines how much information from each modality should be preserved. Nevertheless, such attention- and Transformer-based paradigms can still be vulnerable to noise, which may degrade fusion quality and limit their robustness in challenging clinical scenarios.
Methods
Framework overview
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. As illustrated in Figure 1, the CSAFusion framework is built on a U-Net backbone and incorporates three key innovations: ACs that enable flexible kernel adaptation, Swin Transformer modules that capture long-range dependencies, and a dual-fusion design operating in both the encoder and decoder stages. Initially, the two source images are concatenated and fed into the network. The encoder performs progressive downsampling to extract increasingly high-level features while reducing the spatial resolution of the feature maps, ultimately producing deeply abstract representations. The decoder then restores spatial resolution through upsampling, integrating features from the encoder to generate the final fused output.
Given that CT and MRI are single-channel grayscale images, whereas SPECT is an RGB color image, the proposed framework automatically identifies the channel type of each input. For RGB inputs such as SPECT, an RGB-YUV color conversion is applied to address channel mismatches (39). Specifically, the SPECT image is converted from the RGB space to the YUV space, separating luminance information (Y channel) from chrominance information (U and V channels) so that the two types of information can be processed independently during fusion. The transformed image is then channel-wise concatenated with the corresponding MRI image and fed into the network.
After processing and fusing the Y, U, and V channels, the resulting YUV image is converted back to RGB space to produce the final visualized fused image.
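The color-space handling can be sketched as follows; the BT.601 conversion matrix below is a common choice and stands in for whichever coefficients the implementation actually uses, and the commented workflow names (`network`, `split`, `merge`) are illustrative.

```python
import numpy as np

# BT.601 RGB <-> YUV matrices (an assumption for illustration)
RGB2YUV = np.array([[ 0.299,  0.587,  0.114],
                    [-0.147, -0.289,  0.436],
                    [ 0.615, -0.515, -0.100]])
YUV2RGB = np.linalg.inv(RGB2YUV)

def rgb_to_yuv(img: np.ndarray) -> np.ndarray:
    """img: H x W x 3 float array in [0, 1]; returns YUV channels."""
    return img @ RGB2YUV.T

def yuv_to_rgb(img: np.ndarray) -> np.ndarray:
    return np.clip(img @ YUV2RGB.T, 0.0, 1.0)

# Illustrative workflow: fuse the SPECT luminance with the MRI image,
# then reattach chrominance and restore color:
#   y, u, v   = split(rgb_to_yuv(spect))
#   fused_y   = network(concat(y, mri))
#   fused_rgb = yuv_to_rgb(merge(fused_y, u, v))
```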
As shown in Figure 2, the network architecture is primarily divided into three layers. In the first layer, the concatenated source images are input into the Inc module, which consists of two convolutional encoder (CE) modules, each composed of AC, batch normalization (BN), and a rectified linear unit (ReLU).
Encoder
AC
At the network input, as shown in Figure 3, the Inc module applies a shallow stack of standard convolutions to efficiently capture low-level cues such as edges and fine textures. Its output is then passed to the first encoder stage, where we introduce a single CE built on our AC. AC preserves the standard 3×3 kernel size but generates a sample-dependent gating vector that reweights the kernel parameters, enabling per-image, spatially structured filtering; the 1×1 layers remain fixed to control complexity. The gate is produced by two lightweight branches that condense the input features into input-aware and output-aware kernel codes, which are subsequently fused so that the modulation jointly reflects how each input channel should respond and how each output channel should aggregate information. For efficiency, AC is implemented via unfolding and batched matrix multiplication, making it a plug-and-play replacement for conventional 3×3 convolutions. Restricting AC to the first encoder level concentrates adaptivity where low-level details are richest and complements the global modeling used in deeper stages, yielding sharper structures and improved cross-modal consistency with modest overhead. Compared with a standard 3×3 convolution, AC introduces only a small constant-factor cost [from gating and unfold/general matrix multiply (GEMM)] and is used solely in the initial encoder layer; thus, the increase in end-to-end inference time is limited and remains within an acceptable range. The CE module is built around the AC; a simplified sketch of this design is given below.
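As a rough illustration, the sketch below modulates a shared 3×3 kernel with sample-dependent input-aware and output-aware gates and evaluates the convolution via unfold plus a batched matrix multiply; the exact gating branches in our implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv3x3(nn.Module):
    """Sketch of AC: a sample-dependent gate reweights a shared 3x3 kernel.

    Assumption: the kernel codes come from global average pooling followed
    by small linear layers; the paper's gating branches may differ.
    """
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in * 9) * 0.02)
        self.gate_in = nn.Linear(c_in, c_in)    # input-aware kernel code
        self.gate_out = nn.Linear(c_in, c_out)  # output-aware kernel code

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ctx = x.mean(dim=(2, 3))                        # B x C_in global context
        g_in = torch.sigmoid(self.gate_in(ctx))         # B x C_in
        g_out = torch.sigmoid(self.gate_out(ctx))       # B x C_out
        # Fuse the two codes into a per-sample kernel modulation
        w_mod = (self.weight.unsqueeze(0)
                 * g_out.unsqueeze(2)
                 * g_in.repeat_interleave(9, dim=1).unsqueeze(1))
        cols = F.unfold(x, kernel_size=3, padding=1)    # B x (C_in*9) x (H*W)
        out = torch.bmm(w_mod, cols)                    # batched GEMM
        return out.view(b, -1, h, w)
```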
At the second encoder stage, the network employs two parallel CE modules, whose outputs are fed into the encoder fusion module, where features from both branches are aggregated. At the third encoder stage, three parallel CE modules are used in a similar manner, and their outputs are likewise integrated by the encoder fusion module.
The CE modules progressively extract higher-level features while reducing the spatial resolution of the feature maps via downsampling, and they also provide skip-connection features that are passed to the decoder.
For clarity, Figure 4 visualizes the intermediate multi-scale feature maps and attention maps obtained during training.
Decoder
The upsampling process in the decoder restores the features extracted by the encoder to the original image resolution. The decoder fusion module then integrates the skip-connected features from the encoder with the features produced by the current decoder stage. As shown in Figure 5, the decoder module (ST) is implemented using a Swin Transformer, whose consecutive blocks follow the standard formulation:

$\hat{z}^{l} = \text{W-MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \quad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$

$\hat{z}^{l+1} = \text{SW-MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \quad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}$

where LN denotes layer normalization, MLP a two-layer feed-forward network, and W-MSA/SW-MSA window-based and shifted-window multi-head self-attention, respectively.
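To make the windowing concrete, the following minimal sketch shows the window partition and the cyclic half-window shift on which W-MSA and SW-MSA operate; standard multi-head attention is then applied within each window.

```python
import torch

def window_partition(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split B x H x W x C features into (num_windows*B) x ws*ws x C tokens.

    H and W are assumed divisible by the window size ws.
    """
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

def cyclic_shift(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Half-window cyclic shift used by SW-MSA to create
    cross-window connections (reverse with positive shifts)."""
    return torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
```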
Dual fusion architecture
Fusion module in encoder
AC differs from standard convolution in that it adaptively modulates the convolution kernels, adjusting their weights in real time based on the input features, thereby providing lightweight global information capture. Because different branches can learn different types of features, we employ varying numbers of branches at different network depths. Serial (cascaded) branches may suffer from progressive attenuation of feature details due to information bottlenecks; therefore, we adopt a parallel-branch design, feeding the outputs of all branches simultaneously into the fusion module. As illustrated in Figure 6, the fusion module is designed to adaptively integrate the features extracted by the branches at each stage, thereby more effectively combining complementary information. This module facilitates effective integration of features from different modalities and fuses their complementary components, helping to preserve the completeness of information in the subsequent fusion results. The fusion (FU) module adaptively weights the outputs of the parallel branches and combines them into a single representation, as sketched below.
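A minimal sketch of this adaptive branch weighting follows, assuming a shared 1×1 scoring convolution and a softmax competition across branches; the module's exact formulation may differ.

```python
import torch
import torch.nn as nn

class EncoderFusion(nn.Module):
    """Sketch of the FU module: softmax-weighted blending of parallel
    branch outputs. Illustrates the adaptive-weighting idea rather than
    reproducing the paper's exact formulation.
    """
    def __init__(self, channels: int):
        super().__init__()
        # Shared scoring conv produces a per-pixel relevance map per branch
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, branches):          # list of B x C x H x W tensors
        scores = torch.stack([self.score(f) for f in branches], dim=0)
        weights = torch.softmax(scores, dim=0)   # branches compete per pixel
        feats = torch.stack(branches, dim=0)
        return (weights * feats).sum(dim=0)      # B x C x H x W
```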
Fusion module in decoder
As shown in Figure 2, after the encoder performs initial feature extraction and hierarchical fusion, the decoder upsamples and reconstructs the fused image. Each decoder stage incorporates a Swin Transformer module to address the limitation of conventional convolutions in modeling long-range dependencies. While AC captures such dependencies in a lightweight manner, its combination with CNNs and Swin Transformer achieves a better balance between local detail and global context, with Swin Transformer maintaining computational efficiency via its window-based mechanism.
The feature integration module primarily merges the features produced by the Swin Transformer with the skip-connected features from the encoder. This design facilitates smooth information flow, preserves fine-grained low-level details, and helps mitigate gradient vanishing. Each decoder unit further includes a channel attention mechanism to enhance informative channels while suppressing less relevant ones, enabling dynamic feature refinement. In addition, a dual attention scheme—comprising both channel and spatial attention—improves the separation of foreground and background regions, thereby enhancing structural clarity.
Within the overall network, this final output module ensures the quality and effectiveness of the fused features and plays a pivotal role in the performance of the entire framework.
Multiple attention mechanisms
Channel attention mechanism
After the decoder fusion module combines the skip-connected features from the encoder with the features generated at the current decoding stage, the result is passed through an SE channel attention mechanism, as illustrated in Figure 7. The SE module can be written as:

$s = \mathrm{Sig}\Big(\mathrm{Linear}_2\big(\mathrm{ReLU}\big(\mathrm{Linear}_1(\mathrm{GAP}(F))\big)\big)\Big), \quad F' = s \odot F$

Here, GAP denotes global average pooling over each channel; Linear represents a fully connected layer (linear transformation), used for dimensionality reduction and expansion to learn the importance of each channel; ReLU is the rectified linear unit activation function, which introduces nonlinearity; and Sig is the sigmoid function, which normalizes the weights to the range (0, 1) so that the channel weights can be used for feature enhancement. Because multi-modal medical image fusion must preserve critical information, the channel attention mechanism adaptively learns the relative importance of different channels, suppresses features from less relevant channels, and strengthens the representations of the most informative ones.
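A compact PyTorch sketch of this SE block follows; the reduction ratio of 16 is a common default and an assumption here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention as described above."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # expansion
            nn.Sigmoid(),                                # channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                     # reweight channels
```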
Spatial attention mechanism
The spatial attention module (CM) can be expressed as:

$F' = F \otimes \mathrm{Sig}\big(\mathrm{Conv}(F)\big)$

Here, Conv represents the convolution operation, used to capture spatial information and generate the spatial attention map, and Sig denotes the sigmoid normalization of that map. As shown in Figure 8, $F'$ is the output feature of the spatial attention module, which adjusts the feature strength at different positions through weighting. The spatial attention mechanism effectively preserves the structural information of the source image by separating the background and foreground, addressing the issue of target-region blurring.
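The sketch below illustrates this mechanism with a CBAM-style spatial attention, where channel-wise average and max maps are convolved into a single weight map; the pooling scheme and kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention sketch: a convolution over pooled channel
    statistics yields a per-pixel weight map in (0, 1)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)     # channel-max map
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * m                           # emphasize foreground regions
```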
Loss function
This study introduces a composite loss framework tailored to multi-modal medical image synthesis. The goal of multi-modal fusion is to generate an integrated image that faithfully reflects the distinctive attributes of each modality, which requires preserving critical source information, maintaining structural coherence, enhancing boundary and textural details, and balancing local and global representations. To this end, the proposed loss function combines three complementary components—structural similarity index (SSIM) (40), regional mutual information (RMI) (41), and contrast loss (42)—to jointly improve both the objective quality and visual appearance of the fused images. By enforcing these multi-objective constraints, the network is guided toward a more comprehensive optimization of image structure, content, and perceptual quality.
The SSIM loss primarily aims to preserve structural information. It evaluates similarity by computing local means, variances, and covariance, thereby effectively constraining the structural integrity of the fused image. Within each local window, luminance, contrast, and structural terms are jointly regulated so that the fused result remains consistent with the source images in terms of structural layout, such as edge orientation and texture organization.
Within each local window, SSIM is computed as

$\mathrm{SSIM}(x,y) = \dfrac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$

Here, $\mu_x$ and $\mu_y$ are the local means of image x and image y, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the local variances of image x and image y, respectively; $\sigma_{xy}$ is the covariance between image x and image y; and $C_1$ and $C_2$ are small constants that prevent division by zero. The SSIM loss is defined as

$\mathcal{L}_{\mathrm{SSIM}} = 1 - \tfrac{1}{2}\big[\mathrm{SSIM}(F, I_1) + \mathrm{SSIM}(F, I_2)\big]$

where F is the fused image and $I_1$, $I_2$ are the two source images.
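A minimal sketch of the windowed SSIM loss follows, computing local statistics with average pooling; the window size and stability constants are common defaults rather than confirmed settings, and the commented composite combination uses placeholder names.

```python
import torch
import torch.nn.functional as F

def ssim_loss(f, x, win: int = 11, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2):
    """1 - mean local SSIM between fused image f and source x (B x C x H x W)."""
    p = win // 2
    mu_f = F.avg_pool2d(f, win, stride=1, padding=p)
    mu_x = F.avg_pool2d(x, win, stride=1, padding=p)
    var_f = F.avg_pool2d(f * f, win, 1, p) - mu_f ** 2
    var_x = F.avg_pool2d(x * x, win, 1, p) - mu_x ** 2
    cov = F.avg_pool2d(f * x, win, 1, p) - mu_f * mu_x
    ssim = ((2 * mu_f * mu_x + c1) * (2 * cov + c2)) / (
        (mu_f ** 2 + mu_x ** 2 + c1) * (var_f + var_x + c2))
    return 1.0 - ssim.mean()

# Composite objective (RMI and contrast terms as placeholders):
#   loss = w1 * 0.5 * (ssim_loss(fused, src1) + ssim_loss(fused, src2)) \
#        + w2 * rmi_loss(fused, src1, src2) + w3 * contrast_loss(fused)
```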
The RMI loss focuses on the local statistical properties of the image by computing mutual information within neighborhood regions. By explicitly considering spatial relationships between pixels, it provides a region-level measure of similarity. Concretely, we calculate mutual information over local patches to quantify the shared statistical information between the two source modalities (e.g., CT/MRI, T1/T2) and the fused image.
Here, M is a matrix, typically a structure tensor, that describes the local geometric structure of the image within a specific region; ε is a small positive constant; and I is the identity matrix with the same dimensions as M. Computing the determinant of M + εI rather than of M alone prevents singular matrices and ensures numerical stability, especially when the determinant of M approaches zero.
The contrast loss is designed to enhance both local and global aspects of the image, balancing fine-scale and large-scale features to improve overall clarity. It explicitly strengthens local and global contrast and detail, emphasizing edge sharpness, texture variation, and the conspicuity of lesions against the background. At the same time, it suppresses over-smoothing and loss of detail, so that the fused image appears sharper and more structurally well-defined for clinical interpretation.
Here, $L_{local}$ represents the local texture variation, describing texture changes within a specific region of the image. $\Omega_i$ denotes the local neighborhood of pixel i, typically a fixed-size window of adjacent pixels used to analyze local texture. $X_j$ is the grayscale or feature value of the j-th pixel in $\Omega_i$, and $\mu_i$ is the mean value over $\Omega_i$. $\omega$ is a regularization constant used to avoid division by zero and numerical instability during computation.
Using SSIM alone is not robust across modalities: CT-MRI, MRI-SPECT, and T1-T2 exhibit nonlinear relationships, so genuine modality-specific differences may be incorrectly penalized as “dissimilarity”. Relying solely on RMI provides no explicit constraint on edge and texture preservation, which can lead to local structural degradation. Using only contrast loss ignores structural and cross-modal consistency, making the overall visual appearance more susceptible to imbalance.
By jointly enforcing structural similarity, RMI, and contrast constraints, the proposed loss achieves a more comprehensive optimization of fused image quality.
Statistical analysis
The distribution of each metric was first assessed using the Shapiro-Wilk test. When the normality assumption was satisfied, overall differences among fusion methods were evaluated with a one-way repeated-measures analysis of variance (ANOVA). If the ANOVA result was significant, pairwise post hoc comparisons between CSAFusion and each competing method were performed with Bonferroni correction for multiple testing. When normality was violated, the non-parametric Friedman test was applied, followed by Wilcoxon signed-rank tests with Bonferroni adjustment. All statistical tests were two-sided, and a P value <0.05 was considered to indicate statistical significance.
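For reproducibility, the decision logic can be sketched with SciPy as follows; note that SciPy provides no direct repeated-measures ANOVA, so `f_oneway` appears as a simplified stand-in, and the helper itself is illustrative.

```python
import numpy as np
from scipy import stats

def compare_methods(scores: np.ndarray):
    """scores: methods x cases array of one metric, paired by case;
    row 0 is CSAFusion. Returns the omnibus P value and
    Bonferroni-corrected pairwise P values against row 0."""
    normal = all(stats.shapiro(s).pvalue > 0.05 for s in scores)
    if normal:
        # Simplified stand-in for one-way repeated-measures ANOVA
        omnibus = stats.f_oneway(*scores).pvalue
        pair = lambda a, b: stats.ttest_rel(a, b).pvalue
    else:
        omnibus = stats.friedmanchisquare(*scores).pvalue
        pair = lambda a, b: stats.wilcoxon(a, b).pvalue
    k = len(scores) - 1                     # number of pairwise comparisons
    pvals = [min(1.0, pair(scores[0], s) * k) for s in scores[1:]]
    return omnibus, pvals
```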
Results
Datasets and training details
This study used the publicly available IXI dataset (https://brain-development.org/ixi-dataset/) and the Harvard Whole Brain Atlas (https://www.med.harvard.edu/aanlib/). The IXI dataset comprises nearly 600 multi-sequence MRI examinations [T1, T2, proton density (PD), magnetic resonance angiography (MRA), diffusion weighted imaging (DWI), etc.] from healthy volunteers, acquired on 1.5 T and 3 T scanners at three London hospitals. Because the T1- and T2-weighted images differ in size and are not directly aligned, pre-registration was required. For this purpose, we employed Advanced Normalization Tools (ANTs), a C++ library specifically designed for medical image analysis and widely regarded as a state-of-the-art toolkit for registration and segmentation; it is particularly effective for aligning images acquired under different scanning protocols or across subjects. To ensure high-quality cross-sequence alignment between T1 and T2, we performed pre-registration using 'antsRegistration' with a Rigid → Affine → SyN pipeline, followed by visual quality control and quantitative screening. We calculated SSIM and normalized cross-correlation between the registered T1 and T2 images as consistency measures, applied empirical thresholds to flag low-quality registrations, and then re-registered or excluded samples that did not meet the criteria. From the Harvard Whole Brain Atlas, we obtained 113 paired MRI-T2 and CT scans and 98 paired SPECT-MRI scans; these pairs were already well aligned and did not require additional registration. All registered and pre-aligned image pairs were then uniformly resized to 256×256 pixels. For the CT-MRI, SPECT-MRI, and MRI T1-MRI T2 datasets, we randomly split the paired samples into training, validation, and test sets using a 70%/15%/15% ratio. Training samples were further cropped into 120×120 patches, standardized, and used to train the proposed framework.
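A condensed sketch of the T1-T2 pre-registration and screening, using the ANTsPy bindings rather than the raw 'antsRegistration' call; the file names and screening threshold below are illustrative.

```python
import ants
from skimage.metrics import structural_similarity

# Hypothetical file names for one subject
t1 = ants.image_read("subject_T1.nii.gz")
t2 = ants.image_read("subject_T2.nii.gz")

# 'SyNRA' runs rigid -> affine -> SyN in sequence
reg = ants.registration(fixed=t1, moving=t2, type_of_transform="SyNRA")
t2_aligned = reg["warpedmovout"]

# Quantitative screening: flag low-quality registrations for re-work
a, b = t1.numpy(), t2_aligned.numpy()
score = structural_similarity(a, b, data_range=float(a.max() - a.min()))
needs_review = score < 0.5        # empirical threshold (assumption)
```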
All experiments were conducted using the PyTorch framework on an NVIDIA A100 GPU with 40 GB of memory. The software environment consisted of CUDA 11.7, Python 3.10, and PyTorch 1.9.1. During training, the batch size was set to 8, the number of epochs to 200, and the initial learning rate to 1×10−5. We adopted the ReduceLROnPlateau strategy for learning rate scheduling, which is well suited to scenarios with substantial fluctuations in validation loss and provides flexible, adaptive adjustment. The weights for the composite loss function were set to [1, 1, 1, 2.5, 1, 1].
Comparative experiments and evaluation indicators
Contemporary benchmark methods for multi-modal medical image fusion were used for quantitative comparison: JBF with LGE (INS), Multi-scale Adaptive Transformer (MATR), MUFusion, MRSCFusion, EMFusion, and SwinFusion. INS represents a traditional approach that uses JBF to decompose images into structural and "energy" layers and an LGE operator to enhance edge preservation and intensity integration. In contrast, the remaining methods are deep learning-based. MATR integrates ACs with an adaptive Transformer to jointly fuse functional (metabolic) and structural (anatomical) information. MUFusion incorporates memory units and uses intermediate fusion results as additional supervisory signals during training to achieve self-evolving learning (31). MRSCFusion couples a Swin Transformer for long-range context with a multi-scale CNN for local detail, performing learnable multi-scale fusion through residual Swin-convolution blocks (38). EMFusion simultaneously leverages shallow constraints (e.g., image saliency and information richness as intuitive cues) and deep constraints derived from pre-trained encoders to objectively extract and preserve modality-specific information (36). SwinFusion focuses on an attention-guided cross-domain module designed to thoroughly integrate complementary information from different imaging modalities while capturing long-range interdependencies (35).
To evaluate the quality of the fused images, we adopted seven widely used metrics: SSIM, spatial frequency (SF) (43), average gradient (AG) (44), edge preservation index (EPI) (45), feature mutual information (FMI) (46), visual information fidelity (VIF) (47), and the nonlinear correlation information entropy (QNCIE) (48). SSIM measures the similarity between the fused image and the source images in terms of structural integrity, luminance, and contrast:

$\mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha}\,[c(x,y)]^{\beta}\,[s(x,y)]^{\gamma}$

Here, $l(x,y)$ represents luminance comparison, $c(x,y)$ represents contrast comparison (quantified by the standard deviation of pixel intensities), and $s(x,y)$ represents structure comparison.
SF reflects the overall activity level of the image:

$\mathrm{SF} = \sqrt{\mathrm{RF}^{2} + \mathrm{CF}^{2}}, \quad \mathrm{RF} = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=2}^{N}\big[H(i,j)-H(i,j-1)\big]^{2}}, \quad \mathrm{CF} = \sqrt{\frac{1}{MN}\sum_{i=2}^{M}\sum_{j=1}^{N}\big[H(i,j)-H(i-1,j)\big]^{2}}$

Here, H represents the fused image, M and N denote the height and width of the image, and RF and CF are the row and column frequencies, respectively.
The AG reflects the clarity of the image:

$\mathrm{AG} = \frac{1}{(M-1)(N-1)}\sum_{i=1}^{M-1}\sum_{j=1}^{N-1}\sqrt{\frac{\big[H(i+1,j)-H(i,j)\big]^{2}+\big[H(i,j+1)-H(i,j)\big]^{2}}{2}}$
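Both statistics are straightforward to compute from pixel differences; a short reference implementation of SF and AG under the formulas above follows.

```python
import numpy as np

def spatial_frequency(h: np.ndarray) -> float:
    """SF = sqrt(RF^2 + CF^2) from row/column intensity differences."""
    rf = np.sqrt(np.mean(np.diff(h, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(h, axis=0) ** 2))   # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

def average_gradient(h: np.ndarray) -> float:
    """AG: mean magnitude of the local intensity gradient (sharpness)."""
    gx = np.diff(h, axis=1)[:-1, :]                  # crop to common shape
    gy = np.diff(h, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```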
The EPI quantifies how well the fused image retains boundary information from both source images, with a particular focus on edge strength and orientation fidelity.
The nonlinear correlation information entropy (QNCIE) gauges how much informational content from the source inputs is preserved in the fused image, leveraging entropy over nonlinear correlations:

$Q_{\mathrm{NCIE}} = 1 + \sum_{i=1}^{b} \frac{\lambda_i}{b}\log_{b}\frac{\lambda_i}{b}$

where $\lambda_i$ are the eigenvalues of the nonlinear correlation matrix computed over the source and fused images, and b is the number of images considered.
FMI evaluates how well the fused output preserves informative features (e.g., gradients, DCT/wavelet coefficients) from both source images. Let $\varphi(\cdot)$ be a feature extractor (such as gradient magnitude or transform coefficients); then

$\mathrm{FMI} = I\big(\varphi(A);\varphi(F)\big) + I\big(\varphi(B);\varphi(F)\big)$

where A and B are the source images, F is the fused image, and $I(\cdot\,;\cdot)$ denotes the (normalized) mutual information between feature maps.
VIF is a full-reference perceptual index grounded in natural-scene statistics; it computes the ratio of information that the fused image preserves about a reference (source) relative to the information available in the reference itself, aggregated across subbands:

$\mathrm{VIF} = \dfrac{\sum_{k} I\big(C^{k}; F^{k} \mid s^{k}\big)}{\sum_{k} I\big(C^{k}; E^{k} \mid s^{k}\big)}$

where, in subband k, C denotes the natural-scene reference signal, E its perception through the visual-system model, F the corresponding perception of the fused image, and s the scene parameters.
Because clinical deployment requires low-latency inference, we evaluated CSAFusion with a batch size of 1 on typical 2D inputs using a single modern GPU. The hybrid design—ACs in the encoder and windowed Swin attention in the decoder—keeps the model compact and ensures that computation scales approximately with image area, making per-case inference practical in routine settings. Under the above hardware and resolution configuration, the end-to-end runtime for the main experiments in this study is approximately 5 seconds, though this may vary with different hardware and image resolutions.
Discussion
Comparison results and analysis
Figure 9 presents the input images from four representative SPECT-MRI pairs together with the fused outputs generated by the competing algorithms. Visual inspection shows that all five alternative methods exhibit notable limitations compared with the proposed approach. The INS-based fusion strategy appears particularly effective at highlighting SPECT-derived functional information; however, its heavy use of high-contrast red hues often introduces anatomical distortions. Overemphasis on functional activity obscures structural detail, leading to poorly defined ventricular contours, indistinct cortical gyri, and blurred gray-white matter boundaries. The MATR-based method clearly fails to preserve certain white matter regions in the MRI, as it retains large dark regions from SPECT, resulting in structurally incomplete fusion. MUFusion better maintains internal MRI structures, but the fused images may contain artifacts in some regions, with insufficiently sharp edges and occasional missing details. SwinFusion suffers from mild color distortion and information loss in edge transition areas, and the depiction of fine anatomical structures such as the brainstem and basal ganglia is degraded.
MRSCFusion also presents several shortcomings: its color mapping, biased toward highly saturated purple/red tones, overemphasizes functional signals and blurs the gray-white matter interface and ventricular boundaries; ring-like halo or banding artifacts arise along the cortical surface with unnatural edge transitions; fine anatomical textures (e.g., thalamus, basal ganglia, brainstem) are overly smoothed, with localized artifacts and missing details in certain regions; and the overall dynamic range is compressed, causing saturation of hotspots and masking low-contrast structural information. In contrast to these deep learning-based methods, the traditional baseline method can inherit structural information from MRI and functional information from SPECT reasonably well, but its results resemble a simple superposition of the two modalities rather than a truly integrated fusion. Functional details from SPECT may overwhelm MRI structures in some cases, limiting complementary fusion performance for complex multimodal pairs such as MRI and SPECT. By comparison, the proposed CSAFusion method yields more natural color transitions, richer hierarchical detail, and sharper boundaries in deep structures (e.g., thalamus), and is visually more suitable for clinical interpretation. Quantitatively, CSAFusion consistently outperforms all comparison methods across the three fusion tasks. For each modality pair, it achieves the highest or near-highest values on all seven image quality metrics, and its advantages over competing methods are statistically significant (all P<0.05).
Table 1 summarizes the results for seven widely used quantitative metrics: SSIM, SF, AG, EPI, FMI, VIF, and the QNCIE. SSIM is a perceptual index that jointly evaluates luminance preservation, contrast maintenance, and structural consistency. SF reflects the richness of details and textures, but should not be artificially inflated by noise. AG measures the rate of intensity change, providing an indicator of image sharpness. EPI evaluates how well edge information from the source images is preserved in the fused image, emphasizing high-frequency structures such as edges and contours. FMI quantifies the amount of shared feature-level information (e.g., textures and salient structures) between the fused image and the source images, thereby indicating how effectively complementary features are integrated. VIF measures the extent to which visual information from a reference is preserved in the fused result, based on natural scene statistics and a human visual system model; higher values indicate better fidelity and lower perceptual distortion. QNCIE characterizes the overall enhancement quality and effective information content of the fused image, capturing both contrast improvement and information richness.
Table 1
| Metric | CSAFusion | MATR | MUFusion | SwinFusion | EMFusion | INS | MRSCFusion |
|---|---|---|---|---|---|---|---|
| SSIM_SPECT | 0.8882 | 0.5471 | 0.4960 | 0.7374 | 0.8539 | 0.9058 | 0.8586 |
| SSIM_MRI | 0.8066 | 0.4726 | 0.5774 | 0.7331 | 0.5305 | 0.4317 | 0.4847 |
| SF | 29.8220 | 21.7758 | 45.9531 | 30.2826 | 15.1542 | 15.8542 | 13.6463 |
| AG | 9.3850 | 7.6665 | 11.6490 | 10.0829 | 6.1013 | 6.5552 | 5.2398 |
| EPI_SPECT | 0.8815 | 0.5583 | 0.5740 | 0.7551 | 0.8891 | 0.8415 | 0.8748 |
| EPI_MRI | 0.9002 | 0.8871 | 0.4602 | 0.7908 | 0.3023 | 0.3225 | 0.3115 |
| QNCIE | 0.9304 | 0.8082 | 0.9093 | 0.9058 | 0.9043 | 0.9081 | 0.8909 |
| FMI | 0.8805 | 0.8700 | 0.7361 | 0.8590 | 0.7563 | 0.8468 | 0.8758 |
| VIF | 0.6302 | 0.3993 | 0.2953 | 0.3464 | 0.4546 | 0.4073 | 0.5450 |
AG, average gradient; EPI, edge preservation index; FMI, feature mutual information; JBF, Joint Bilateral Filtering; INS, JBF with LGE; LGE, Local Gradient Energy; MATR, Multi-scale Adaptive Transformer; MRI, magnetic resonance imaging; QNCIE, nonlinear correlation information entropy; SF, spatial frequency; SPECT, single-photon emission computed tomography; SSIM, structural similarity index; VIF, visual information fidelity.
Figure 10 shows a comparison of fusion performance between the proposed method and five alternative approaches on 10 SPECT-MRI validation pairs. The quantitative metrics summarized in Table 1 are the averages over these 10 cases. From Table 1 and Figure 10, it is evident that CSAFusion achieves the highest overall similarity to the two source images in terms of SSIM. INS exhibits higher preservation of SPECT content but lower preservation of MRI structure, likely because the strong SPECT colors dominate and partially obscure MRI anatomical information.
With respect to SF, the proposed model also maintains rich textural detail. Although MUFusion and EMFusion yield similarly high SF values, visual inspection shows that artifacts in MUFusion and noise in EMFusion contribute to artificially increased SF. A similar pattern is observed for AG, whereas the images produced by CSAFusion do not present obvious noise or artifacts. In terms of EPI, the outputs of competing models fail to fully preserve boundary information from the source images, whereas CSAFusion retains edge and contour details more comprehensively.
On the two complementary-information metrics, CSAFusion likewise performs best. For FMI (higher is better), CSAFusion achieves 0.8805, slightly outperforming MRSCFusion (0.8758) and MATR (0.8700), followed by SwinFusion (0.8590) and INS (0.8468), with EMFusion (0.7563) and MUFusion (0.7361) trailing behind. For VIF, CSAFusion attains the highest value of 0.6302, indicating the strongest perceptual fidelity to the source images. The next best is MRSCFusion (0.5450), followed by EMFusion (0.4546), INS (0.4073), MATR (0.3993), SwinFusion (0.3464), and MUFusion (0.2953). These findings suggest that CSAFusion not only integrates complementary features more effectively (as reflected by higher FMI) but also preserves visual information with less perceptual distortion (as reflected by higher VIF), which is consistent with the qualitative observations. For QNCIE, CSAFusion again delivers the best performance, confirming its empirical superiority in overall fusion quality. In MRI-SPECT fusion specifically, CSAFusion achieves significantly higher SSIM, EPI, and QNCIE than all other methods (all P<0.05), indicating better preservation of both anatomical boundaries and functional information. The statistically significant improvement in QNCIE further demonstrates that CSAFusion more effectively enhances non-contrast functional imaging than existing fusion strategies.
Taken together, the superiority of CSAFusion can be attributed to how each architectural component addresses known failure modes in multi-modal fusion. The dual fusion architecture performs feature integration twice—laterally in the encoder at matching scales and again in the decoder via skip connections—reducing the information bottlenecks inherent to purely serial pipelines and directly mitigating boundary erosion, which is consistent with its stronger EPI and more balanced SSIM with respect to both source images. ACs generate content-aware kernels that tailor feature extraction to modality-specific characteristics (e.g., noisy functional patterns in SPECT versus fine anatomical detail in MRI), improving cross-modal weighting and preventing one modality (typically SPECT chroma) from overwhelming the other, in line with the higher similarity to both inputs. Channel (SE) and spatial attention modules recalibrate salient channels and foreground regions, suppressing background clutter and preventing SPECT’s vivid colors from overshadowing MRI structures; this yields clearer deep-structure boundaries and more faithful tissue textures.
Based on these analyses, CSAFusion demonstrably outperforms current state-of-the-art and benchmark methods in multi-modal medical image fusion.
Ablation experiment
Dual fusion architecture module
To evaluate the effectiveness of the encoder fusion module, we conducted an ablation study using Fusion Strategy 1. Specifically, as shown in Figure 11, the feature extraction blocks within each encoder layer were connected in series, the fusion module was removed, and the resulting features were passed directly to the decoder via skip connections.
Attention mechanism module
To assess the effectiveness of the channel attention modules used in each encoder CE module and decoder DE module, as well as the spatial attention module applied before the final output to separate foreground from background, we performed an ablation study using Fusion Strategy 2. Specifically, as shown in Figure 11, the channel attention modules following the CE and DE blocks were removed, and the spatial attention module in the final stage was also omitted.
AC and Swin Transformer module
To evaluate the effectiveness of AC in the encoder and the Swin Transformer module in the decoder, we conducted an ablation study using Fusion Strategy 3. In this setting, all AC layers were replaced with standard convolutions, and the Swin Transformer modules were substituted with standard Transformer blocks.
Table 2 reports the quantitative results for three ablated variants of the network: (I) a model without the dual-fusion architecture; (II) a model without the attention mechanisms; and (III) a model in which ACs and Swin Transformer modules are replaced by standard convolutions and standard Transformer blocks, respectively. As shown, the complete network exhibits a clearly superior ability to fuse cross-modal features. The ablation results indicate that removing the dual-fusion architecture reduces SSIM by 78%, while removing the attention mechanisms reduces SSIM by 77% and leads to an excessively high AG. Replacing AC with standard convolution and Swin Transformer with a standard Transformer results in an 11% decrease in QNCIE.
Table 2
| Metric | CSAFusion | Ablation 1 | Ablation 2 | Ablation 3 |
|---|---|---|---|---|
| SSIM_SPECT | 0.8882 | 0.1036 | 0.1129 | 0.2922 |
| SSIM_MRI | 0.8066 | 0.2505 | 0.0174 | 0.1169 |
| SF | 29.8220 | 38.1506 | 13.1647 | 23.6752 |
| AG | 9.3850 | 22.6969 | 65.5597 | 9.3590 |
| EPI_SPECT | 0.8815 | 0.8484 | 0.7329 | 0.5880 |
| EPI_MRI | 0.9002 | 0.8712 | 0.3033 | 0.1452 |
| QNCIE | 0.9304 | 0.8695 | 0.8630 | 0.8275 |
| FMI | 0.8805 | 0.6612 | 0.4552 | 0.4004 |
| VIF | 0.6302 | 0.2325 | 0.1651 | 0.1311 |
AG, average gradient; EPI, edge preservation index; FMI, feature mutual information; MRI, magnetic resonance imaging; QNCIE, nonlinear correlation information entropy; SF, spatial frequency; SPECT, single-photon emission computed tomography; SSIM, structural similarity index; VIF, visual information fidelity.
Beyond these primary metrics, the complementary-information and perceptual-fidelity indicators follow the same pattern. For FMI, CSAFusion achieves 0.8805, whereas Ablations 1, 2, and 3 obtain 0.6612 (≈−25%), 0.4552 (≈−48%), and 0.4004 (≈−55%), respectively. For VIF, CSAFusion reaches 0.6302, compared with 0.2325 (≈−63%), 0.1651 (≈−74%), and 0.1311 (≈−79%) for the three ablated variants. These findings indicate that the dual-fusion architecture and attention mechanisms are crucial for effectively integrating complementary cross-modal features (higher FMI) while suppressing noise and artifacts to preserve perceptual fidelity (higher VIF). Relative to the model without the dual-fusion architecture, the full network adaptively preserves more structural details; compared with the model without attention, it better emphasizes salient cues while nearly eliminating noise; and in contrast to the version using standard convolutions and a standard Transformer, the proposed design more effectively exploits cross-modal complementarities to produce higher-quality fusion results.
Analysis of parameters in loss functions
In the loss function, we assign distinct weights to each component to balance the structural similarity, RMI, and contrast terms. In this experiment, we assess the influence of individual components via ablation by varying a single weight while keeping the others fixed. As shown in Figure 12, adjusting the weight coefficients illustrates how the similarity loss impacts the fusion results. Based on these experiments, we ultimately select [1, 1, 1, 2.5, 1, 1] as the optimal weighting configuration.
CT and MRI image fusion
To assess the generalizability of the proposed fusion technique, we further applied it to additional modality combinations. Because CT excels at depicting high-density anatomical structures and MRI is superior for visualizing soft-tissue details, we fused paired CT-MRI scans and compared our method against the same six state-of-the-art approaches described earlier. For this evaluation, we used ten CT-MRI image pairs obtained from established medical image repositories.
Figure 13 shows the fused outputs produced by the five benchmark methods. Although these techniques generate reasonably sharp fused images, each exhibits distinct limitations when compared with CSAFusion. The MATR-based method delineates MRI soft tissue fairly well but often underrepresents high-density components such as bony structures from CT, causing them to appear attenuated. EMFusion balances MRI structural detail with CT density information, yet in many cases dense CT regions overwhelm subtle MRI features, which could risk missing clinically important findings. MUFusion- and INS-based approaches both introduce interference in fine details and show noticeable degradation along anatomical boundaries. SwinFusion provides comparatively better results but can still produce artifacts in certain high-density CT regions. Overall, CSAFusion more effectively integrates complementary information from CT and MRI, yielding cleaner, more diagnostically useful fused images.
Figure 14 compares the quantitative fusion metrics of the proposed method with those of the five competing approaches on 10 CT-MRI test pairs. The numerical results in Table 3 are reported as averages over these 10 test cases. From Table 3 and Figure 14, it is evident that CSAFusion achieves the highest values in SSIM, EPI, and QNCIE. Because CT and MRI images inherently exhibit relatively smooth grayscale transitions—unlike SPECT images, which contain more distinct color patterns—most regions in CT and MRI show gradual intensity changes without sharp variations. As a result, the corresponding gradients are generally small, making SF and AG less informative in this setting; therefore, these two metrics are not reported here. Both the visual results and the objective evaluations consistently indicate that CSAFusion outperforms the other fusion methods. On the complementary-information and perceptual-fidelity metrics, CSAFusion also ranks first (FMI =0.8794, VIF =0.3353), reflecting effective cross-modal feature integration and high perceptual fidelity. EMFusion performs second best on these two metrics (FMI =0.8707, VIF =0.3343), while SwinFusion and INS occupy a mid-range position (FMI =0.8553/0.8298; VIF =0.3007/0.3097). For reference, MRSCFusion achieves a moderate FMI (0.8113) but a relatively low VIF (0.2836), suggesting slightly softened high-contrast CT boundaries and incomplete integration of CT edges with MRI soft-tissue textures, which is consistent with the qualitative observations. In CT-MRI fusion, CSAFusion provides significantly better structural preservation and edge retention than the comparison methods, as evidenced by its higher SSIM and EPI values (P<0.05). These findings indicate that CSAFusion is more effective at maintaining bony structures and overall anatomical integrity in the fused images.
Table 3
| Metric | CSAFusion | MATR | MUFusion | SwinFusion | EMFusion | INS | MRSCFusion |
|---|---|---|---|---|---|---|---|
| SSIM_CT | 0.8539 | 0.6744 | 0.4598 | 0.6635 | 0.8152 | 0.6860 | 0.7174 |
| SSIM_MRI | 0.8301 | 0.8129 | 0.6461 | 0.7171 | 0.8001 | 0.7262 | 0.7867 |
| EPI_CT | 0.8324 | 0.4604 | 0.4922 | 0.7783 | 0.7252 | 0.6110 | 0.6899 |
| EPI_MRI | 0.6937 | 0.7294 | 0.4261 | 0.6134 | 0.6241 | 0.5576 | 0.6985 |
| QNCIE | 0.9384 | 0.8979 | 0.8847 | 0.9077 | 0.9027 | 0.8903 | 0.9275 |
| FMI | 0.8794 | 0.8084 | 0.7823 | 0.8553 | 0.8707 | 0.8298 | 0.8113 |
| VIF | 0.3353 | 0.2996 | 0.2582 | 0.3007 | 0.3343 | 0.3097 | 0.2836 |
CT, computed tomography; EPI, edge preservation index; FMI, feature mutual information; JBF, Joint Bilateral Filtering; INS, JBF with LGE; LGE, Local Gradient Energy; MATR, Multi-scale Adaptive Transformer; MRI, magnetic resonance imaging; QNCIE, nonlinear correlation information entropy; SSIM, structural similarity index; VIF, visual information fidelity.
MRI-T1 and MRI-T2 image fusion
To further demonstrate the adaptability of the proposed fusion approach across different modalities, we also evaluated it on MRI scans with varying contrast properties, specifically T1- and T2-weighted sequences. Combining the anatomical detail provided by T1 with the high lesion contrast characteristic of T2 can help clinicians localize lesions more accurately and perform more comprehensive case analyses. Accordingly, 10 paired T1-T2 scans obtained from established medical repositories were used as validation cases.
Figure 15 presents the fusion results produced by CSAFusion alongside those of five benchmark methods. Although all approaches generate reasonably clear fused images, several limitations are apparent. The MATR-based method almost exclusively inherits information from the T2 modality, largely neglecting T1-derived features. MUFusion- and INS-based methods both exhibit noise and incomplete edge preservation. SwinFusion, in contrast, loses part of the T2 modality information. MRSCFusion produces relatively clean results but shows contrast compression and edge softening: the gray-white matter interface and cortical sulci are less distinct, and multiple T1-specific high-intensity structures are attenuated. These findings indicate that, compared with CSAFusion, the benchmark methods do not fully integrate the complementary information from T1 and T2. In contrast, CSAFusion better preserves fine anatomical detail and maintains stronger inter-modality consistency.
Figure 16 presents a comparison of fusion performance metrics for CSAFusion and five alternative methods across 10 MRI T1-T2 validation pairs. The quantitative results in Table 4 are reported as averages over these 10 cases. As shown in Table 4 and Figure 16, CSAFusion achieves the highest QNCIE value and superior EPI and SSIM scores with respect to both T1- and T2-weighted images, indicating an optimal balance in drawing informative details from both modalities and effectively exploiting their complementary characteristics. Overall, for T1/T2 fusion, CSAFusion markedly outperforms the competing methods, confirming its strong adaptability.
Table 4
| Metric | CSAFusion | MATR | MUFusion | SwinFusion | INS | MRSCFusion |
|---|---|---|---|---|---|---|
| SSIM_T1 | 0.7693 | 0.4513 | 0.4959 | 0.7347 | 0.7473 | 0.7574 |
| SSIM_T2 | 0.7698 | 0.4736 | 0.5774 | 0.7331 | 0.7693 | 0.7432 |
| EPI_T1 | 0.8736 | 0.5423 | 0.5744 | 0.8108 | 0.8609 | 0.6848 |
| EPI_T2 | 0.7793 | 0.6801 | 0.4403 | 0.7751 | 0.6975 | 0.6782 |
| QNCIE | 0.9485 | 0.9391 | 0.9093 | 0.9350 | 0.9346 | 0.9361 |
| FMI | 0.8014 | 0.6813 | 0.7713 | 0.7050 | 0.3902 | 0.7551 |
| VIF | 0.4218 | 0.2012 | 0.2428 | 0.3525 | 0.4050 | 0.4138 |
EPI, edge preservation index; FMI, feature mutual information; JBF, joint bilateral filter; INS, JBF with LGE; LGE, local gradient energy; MATR, Multi-scale Adaptive Transformer; MRI, magnetic resonance imaging; QNCIE, nonlinear correlation information entropy; SSIM, structural similarity index; VIF, visual information fidelity.
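For reference, EPI is commonly computed as the correlation between high-pass (edge) responses of the fused and source images. The sketch below shows one such formulation using a Laplacian filter; the exact filter and normalization in our evaluation pipeline may differ, so treat this as illustrative only.

```python
import numpy as np
from scipy.ndimage import laplace

def epi(fused, source):
    """One common EPI formulation: Pearson correlation between the
    Laplacian (high-pass) responses of the fused and source images."""
    hf = laplace(fused.astype(np.float64))
    hs = laplace(source.astype(np.float64))
    hf -= hf.mean()
    hs -= hs.mean()
    return float(np.sum(hf * hs) /
                 np.sqrt(np.sum(hf ** 2) * np.sum(hs ** 2)))
```

In this form, epi(fused, t1) corresponds to EPI_T1 in Table 4; values near 1 indicate that the source's edge structure survives fusion almost unchanged.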
On the complementary-information and perceptual-fidelity metrics, CSAFusion again ranks first (FMI =0.8014, VIF =0.4218), reflecting more effective cross-modal feature integration and higher perceptual fidelity. As the per-pair curves in Figure 16 show, CSAFusion maintains higher and more stable FMI and VIF scores across most image pairs. MRSCFusion and INS approach CSAFusion in VIF (0.4138 and 0.4050, respectively) but lag behind in EPI and SSIM. MUFusion ranks second in FMI (0.7713) yet exhibits a low VIF (0.2428), consistent with residual artifacts and edge loss. SwinFusion and MATR yield mid-to-low FMI values (0.7050 and 0.6813) and moderate-to-low VIF values (0.3525 and 0.2012), reflecting partial loss of complementary cues and reduced perceptual quality.
For MRI T1-T2 fusion, CSAFusion achieves significantly better performance on texture- and contrast-related metrics (AG, FMI, and VIF) than the competing methods (all P<0.05); a simplified sketch of the FMI computation follows. These results suggest that the proposed method more effectively preserves fine soft-tissue details and local contrast between T1- and T2-weighted images.
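For intuition, FMI measures how much feature-level (rather than raw-intensity) information the fused image shares with each source. Below is a simplified histogram-based sketch that uses gradient magnitude as the feature map; the published FMI additionally normalizes each mutual-information term by the marginal entropies, so its absolute values will differ from those in Tables 3 and 4.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """Histogram-based mutual information between two feature maps."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over columns
    py = pxy.sum(axis=0, keepdims=True)   # marginal over rows
    nz = pxy > 0                          # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

def fmi_gradient(src_a, src_b, fused, bins=64):
    """Simplified FMI: mean MI between gradient-magnitude features of
    each source image and of the fused image."""
    def grad_mag(img):
        gy, gx = np.gradient(img.astype(np.float64))
        return np.hypot(gx, gy)
    fa, fb, ff = grad_mag(src_a), grad_mag(src_b), grad_mag(fused)
    return 0.5 * (mutual_information(fa, ff, bins) +
                  mutual_information(fb, ff, bins))
```

Using gradient magnitude as the feature makes the score sensitive to structure rather than brightness, which is why FMI complements SSIM and EPI in the tables above.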
Conclusions
To mitigate the loss of fine details caused by hierarchical decomposition, this study proposes a novel multi-modal medical image fusion framework based on a U-Net architecture. Within this framework, we design a dual fusion mechanism that adaptively integrates complementary information across modalities. In addition, we construct a hybrid network that combines AC in the encoder with Swin Transformer modules in the decoder to strengthen both local and global feature extraction. Spatial attention and SE channel attention are further incorporated to steer the network toward higher-value feature representations. We also develop a composite loss function with three key components, namely SSIM, region mutual information (RMI), and contrast loss, which together substantially improve the overall quality and perceptual fidelity of the fused images (a schematic sketch of this composite objective is given below). Extensive experiments on publicly available medical image datasets demonstrate that the proposed method consistently outperforms both traditional and state-of-the-art fusion techniques in visual appearance and quantitative metrics, and generalizes well across diverse imaging scenarios. These findings suggest that the proposed approach has considerable potential clinical value and can be extended to other medical imaging tasks. Building on this work, future research may adapt the architecture to three-dimensional (3D) medical image fusion to more faithfully preserve volumetric structural detail.
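For concreteness, the sketch below shows how such a composite objective can be assembled in PyTorch. The SSIM term follows Wang et al.'s formulation simplified to a uniform window, the contrast term is one plausible stand-in (matching the fused image's global standard deviation to the higher-contrast source), and the RMI term (Zhao et al.) is omitted for brevity; the weights w_ssim and w_con are hypothetical, not the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def ssim_loss(fused, src, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
    """1 - mean SSIM, simplified to a uniform window.
    Inputs are grayscale tensors of shape (N, 1, H, W) in [0, 1]."""
    k = torch.ones(1, 1, win, win, device=fused.device) / win ** 2
    p = win // 2
    mu_f, mu_s = F.conv2d(fused, k, padding=p), F.conv2d(src, k, padding=p)
    var_f = F.conv2d(fused * fused, k, padding=p) - mu_f ** 2
    var_s = F.conv2d(src * src, k, padding=p) - mu_s ** 2
    cov = F.conv2d(fused * src, k, padding=p) - mu_f * mu_s
    ssim = ((2 * mu_f * mu_s + c1) * (2 * cov + c2)) / \
           ((mu_f ** 2 + mu_s ** 2 + c1) * (var_f + var_s + c2))
    return 1.0 - ssim.mean()

def contrast_loss(fused, src_a, src_b):
    """A simple stand-in for the contrast term: pull the fused image's
    global standard deviation toward the higher-contrast source."""
    target = torch.maximum(src_a.std(), src_b.std())
    return (fused.std() - target).abs()

def fusion_loss(fused, src_a, src_b, w_ssim=1.0, w_con=0.5):
    # Hypothetical weights; the RMI term is omitted in this sketch.
    l_ssim = 0.5 * (ssim_loss(fused, src_a) + ssim_loss(fused, src_b))
    return w_ssim * l_ssim + w_con * contrast_loss(fused, src_a, src_b)
```

Averaging the SSIM term over both sources encourages the network to stay structurally faithful to each modality rather than collapsing onto the dominant one.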
Acknowledgments
None.
Footnote
Funding: This work was supported by the National Natural Science Foundation of China (No. U21A20390) and the Education Department Project of Jilin Province (No. JJKH20240945KJ).
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1659/coif). All authors report that this study was supported by the National Natural Science Foundation of China (No. U21A20390) and Education Department Project of Jilin Province (No. JJKH20240945KJ). The authors have no other conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Li M, Kuang L, Xu S, Sha Z. Brain Tumor Detection Based on Multimodal Information Fusion and Convolutional Neural Network. IEEE Access 2019;7:180134-46.
- Khalid H, Hussain M, Al Ghamdi MA, Khalid T, Khalid K, Khan MA, Fatima K, Masood K, Almotiri SH, Farooq MS, Ahmed A. A Comparative Systematic Literature Review on Knee Bone Reports from MRI, X-rays and CT Scans Using Deep Learning and Machine Learning Methodologies. Diagnostics (Basel) 2020;10:518. [Crossref] [PubMed]
- Li L, Mazomenos E, Chandler JH, Obstein KL, Valdastri P, Stoyanov D, Vasconcelos F. Robust endoscopic image mosaicking via fusion of multimodal estimation. Med Image Anal 2023;84:102709. [Crossref] [PubMed]
- Azam MA, Khan KB, Salahuddin S, Rehman E, Khan SA, Khan MA, Kadry S, Gandomi AH. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput Biol Med 2022;144:105253. [Crossref] [PubMed]
- Wang YH, Li G, Ma RH, Zhao YP, Zhang H, Meng JH, Mu CC, Sun CK, Ma XC. Diagnostic efficacy of CBCT, MRI, and CBCT-MRI fused images in distinguishing articular disc calcification from loose body of temporomandibular joint. Clin Oral Investig 2021;25:1907-14. [Crossref] [PubMed]
- Gemmell HG, Staff RT. Single Photon Emission Computed Tomography (SPECT). In: Sharp PF, Gemmell HG, Murray AD. editors. Practical Nuclear Medicine. London: Springer; 2005:21-33.
- Heinrich MP, Jenkinson M, Bhushan M, Matin T, Gleeson FV, Brady SM, Schnabel JA. MIND: modality independent neighbourhood descriptor for multi-modal deformable registration. Med Image Anal 2012;16:1423-35. [Crossref] [PubMed]
- Lowe DG. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 2004;60:91-110.
- Bay H, Tuytelaars T, Van Gool L. SURF: speeded up robust features. In: Leonardis A, Bischof H, Pinz A. editors. Computer Vision – ECCV 2006. Berlin, Heidelberg: Springer; 2006:404-17.
- Mitianoudis N, Stathaki T. Pixel-based and region-based image fusion schemes using ICA bases. Information Fusion 2007;8:131-42.
- Hill P, Al-Mualla ME, Bull D. Perceptual Image Fusion Using Wavelets. IEEE Trans Image Process 2017;26:1076-88. [Crossref] [PubMed]
- Li S, Yin H, Fang L. Group-sparse representation with dictionary learning for medical image denoising and fusion. IEEE Trans Biomed Eng 2012;59:3450-9. [Crossref] [PubMed]
- Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Information Fusion 2015;24:147-64.
- Zhang H, Xu H, Tian X, Jiang J, Ma J. Image fusion meets deep learning: A survey and perspective. Information Fusion 2021;76:323-36.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal: IEEE; 2021:10012-22.
- Liu Z, Hu H, Lin Y, Yao Z, Xie Z, Wei Y, Ning J, Cao Y, Zhang Z, Dong L, Wei F, Guo B. Swin Transformer V2: Scaling Up Capacity and Resolution. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA, USA: IEEE; 2022:11999-12009.
- Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF. editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Cham: Springer; 2015:234-41.
- Woo S, Park J, Lee JY, Kweon IS. CBAM: Convolutional Block Attention Module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y. editors. Computer Vision – ECCV 2018. Cham: Springer; 2018:3-19.
- Wang J, Ren J, Li H, Sun Z, Luan Z, Yu Z, Liang C, Monfared YE, Xu H, Hua Q. DDGANSE: Dual-Discriminator GAN with a Squeeze-and-Excitation Module for Infrared and Visible Image Fusion. Photonics 2022;9:150.
- Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE; 2018:7132-41.
- Zhang D. Wavelet transform. In: Zhang D. editor. Fundamentals of Image Data Mining: Analysis, Features, Classification and Retrieval. Cham: Springer; 2019:35-44.
- Xu X, Wang Y, Chen S. Medical image fusion using discrete fractional wavelet transform. Biomedical Signal Processing and Control 2016;27:103-11.
- Do MN, Vetterli M. The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans Image Process 2005;14:2091-106. [Crossref] [PubMed]
- Basheer R, Jasim N. Multimodal Medical Image Fusion Enhancement Based on Wavelet Transform. Iraqi Journal of Science 2024;65:4576-87.
- Li X, Zhou F, Tan H, Zhang W, Zhao C. Multimodal medical image fusion based on joint bilateral filter and local gradient energy. Information Sciences 2021;569:302-25.
- Liu Y, Chen X, Cheng J, Peng H. A medical image fusion method based on convolutional neural networks. 2017 20th International Conference on Information Fusion (Fusion). Xi’an: IEEE; 2017:1-7.
- Hou R, Zhou D, Nie R, Liu D, Xiong L, Guo Y, Yu C. VIF-Net: An Unsupervised Framework for Infrared and Visible Image Fusion. IEEE Transactions on Computational Imaging 2020;6:640-51.
- Liu Y, Zhang S, Tang Y, Zhao X, He ZX. A multi-scale pyramid residual weight network for medical image fusion. Quant Imaging Med Surg 2025;15:1793-821. [Crossref] [PubMed]
- Zhou M, Zhang Y, Xu X, Wang J, Khalvati F. Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion. 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Lisbon: IEEE; 2024:4108-11.
- Ma J, Xu H, Jiang J, Mei X, Zhang XP. DDcGAN: A Dual-discriminator Conditional Generative Adversarial Network for Multi-resolution Image Fusion. IEEE Trans Image Process 2020; Epub ahead of print. [Crossref]
- Cheng C, Xu T, Wu XJ. MUFusion: A general unsupervised image fusion network based on memory unit. Information Fusion 2023;92:80-92.
- Liu Y, Zang Y, Zhou D, Cao J, Nie R, Hou R, Ding Z, Mei J. An Improved Hybrid Network With a Transformer Module for Medical Image Fusion. IEEE J Biomed Health Inform 2023;27:3489-500. [Crossref] [PubMed]
- Zhou Q, Ye S, Wen M, Huang Z, Ding M, Zhang X. Multi-modal medical image fusion based on densely-connected high-resolution CNN and hybrid transformer. Neural Comput & Applic 2022;34:21741-61.
- Tang W, He F, Liu Y, Duan Y. MATR: Multimodal Medical Image Fusion via Multiscale Adaptive Transformer. IEEE Trans Image Process 2022;31:5134-49. [Crossref] [PubMed]
- Ma J, Tang L, Fan F, Huang J, Mei X, Ma Y. SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer. IEEE/CAA Journal of Automatica Sinica 2022;9:1200-17.
- Di J, Guo W, Liu J, Ren L, Lian J. AMMNet: A multimodal medical image fusion method based on an attention mechanism and MobileNetV3. Biomedical Signal Processing and Control 2024;96:106561.
- Xie X, Zhang X, Ye S, Xiong D, Ouyang L, Yang B, Luo Y, Zheng W, Dai H. MRSCFusion: Joint Residual Swin Transformer and Multiscale CNN for Unsupervised Multimodal Medical Image Fusion. IEEE Transactions on Instrumentation and Measurement 2023;72:1-17.
- Cui Y, Du H, Mei W. Infrared and Visible Image Fusion Using Detail Enhanced Channel Attention Network. IEEE Access 2019;7:182185-97.
- Yang Y, Peng Y, Liu Z. A Fast Algorithm for YCbCr to RGB Conversion. IEEE Transactions on Consumer Electronics 2007;53:1490-3.
- Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004;13:600-12. [Crossref] [PubMed]
- Zhao S, Wang Y, Yang Z, Cai D. Region mutual information loss for semantic segmentation. Advances in Neural Information Processing Systems 2019;32:1-11.
- Oakley JP, Bu H. Correction of simple contrast loss in color images. IEEE Trans Image Process 2007;16:511-22. [Crossref] [PubMed]
- Tsumura N, Sanpei K, Haneishi H, Miyake Y. An evaluation of image quality by spatial frequency analysis in digital halftoning. In: IS&T Annual Conference. Springfield: Society for Imaging Science and Technology; 1996:312-16.
- Zhang X, Ye P, Xiao G. VIFB: A Visible and Infrared Image Fusion Benchmark. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, WA, USA: IEEE; 2020:468-78.
- Reeder SB, Atalar E, Bolster BD Jr, McVeigh ER. Quantification and reduction of ghosting artifacts in interleaved echo-planar imaging. Magn Reson Med 1997;38:429-39. [Crossref] [PubMed]
- Sun Y, Cao B, Zhu P, Hu Q. Detfusion: a detection-driven infrared and visible image fusion network. In: Proceedings of the 30th ACM International Conference on Multimedia. New York: Association for Computing Machinery; 2022:4003-11.
- Liu Z, Blasch E, Xue Z, Zhao J, Laganiere R, Wu W. Objective Assessment of Multiresolution Image Fusion Algorithms for Context Enhancement in Night Vision: A Comparative Study. IEEE Trans Pattern Anal Mach Intell 2012;34:94-109. [Crossref] [PubMed]
- Zhang B, Pan Z, Yao K, Dong X. SAR decompressed image reconstruction algorithm based on generative adversarial network. Journal of University of Chinese Academy of Sciences 2025;42:666-76.