A multi-scale pyramid residual weight network for medical image fusion
Introduction
Background
With the advancement of imaging technology, medical images are increasingly vital in modern clinical applications. Medical imaging technology enhances our understanding of human tissue structure, expanding its applications in diagnosis, surgical navigation, and beyond (1). There are two common types of medical images, as illustrated in Figure 1: anatomical images and functional images. Anatomical images provide high-resolution structural information; for instance, magnetic resonance imaging (MRI) offers soft tissue information, and computed tomography (CT) provides texture features (2,3). However, anatomical images do not provide insights into blood flow and metabolism. In contrast, functional images reflect organ blood flow and metabolism: positron emission tomography (PET) images reveal oxygen and glucose metabolism in tissues, whereas single-photon emission computed tomography (SPECT) images provide insights into blood flow within organs. Nevertheless, functional images are limited by lower resolution and cannot accurately depict organ lesions. Given the limitations of single-modality imaging, there is an increasing need for more comprehensive information. Merging images from different modalities into a single image through image fusion retains more complementary information from the two sources, reduces information redundancy, enhances information utilization, and benefits physicians’ visual perception and clinical diagnosis (4,5).

Due to the significant value of image fusion, numerous fusion algorithms have been proposed. These algorithms can be broadly categorized into traditional and recently emerged deep learning (DL) methods. In traditional methods, multiscale transforms are a classic approach widely applied in image fusion (6-8). Commonly used multiscale transform fusion algorithms include the pyramid transform (9), wavelet transform (10-12), non-subsampled contourlet transform (NSCT) (13-16), and non-subsampled shearlet transform (NSST) (17-20). Sparse representation is another commonly used traditional method. Unlike multiscale transforms, sparse representation learns to fuse multiple source images from an overcomplete dictionary, providing a more stable understanding of the source images that is less affected by poor inter-image registration (21,22). Saliency models are utilized to extract salient regions of an image, obtaining saliency features or weight maps for reconstructing the fused image (23).
With the rapid development of DL (24) in recent years, many image fusion algorithms based on DL have emerged. These works have provided new insights into medical image fusion with some success; however, some limitations remain. For example, convolutional neural networks (CNNs) require relatively few parameters during training and have a simple structure, and they can process images directly as input, bypassing the complexities of traditional feature extraction. However, high-frequency detail information is often drowned out by low-frequency contextual information. As a result, the texture details of the fused image mask the metabolic information that characterizes lesions in the functional image, reducing the prominence of lesions in the fused image and ultimately losing key information such as tumor boundary contours (25-29). In addition, current feature fusion methods achieve only shallow interaction between the features extracted by the network, and noise is introduced during the fusion process (25,26).
To solve the above problems, this paper proposes a new medical image fusion method, a multi-scale pyramid residual weight network (LYWNet). The contributions are as follows:
- A novel end-to-end three-part unsupervised medical image fusion CNN is proposed, which can extract and retain information at different depths of the network.
- To improve the performance of the network and retain more metabolic and texture information, modules such as the pyramid pooling module (PPM) are integrated into the architecture.
- The fusion network does not directly copy the brightness information of the functional image; instead, the brightness information is generated by the feature learning network.
- To achieve deep fusion of features from different modalities and reduce the noise introduced during fusion, a new feature fusion method is proposed.
Related work
In this section, we provide a concise overview of traditional medical image fusion methods, followed by an examination of fusion techniques based on DL.
Traditional medical image fusion method
Due to the critical importance and broad range of applications of multimodal medical image fusion, numerous effective traditional fusion methods have been proposed. These methods can be classified into five categories based on their underlying techniques: pyramid transform-based methods, wavelet-based methods, sparse representation-based methods, salient region-based methods, and others.
Pyramid transform-based methods (9,30) were the first multi-scale transform tools used for image fusion. These methods emulate the human visual perception mechanism by decomposing the input image into a series of multi-resolution images: high-resolution images capture detailed object features, whereas low-resolution images focus on the overall scene content. The sub-images are fused according to predefined fusion rules, yielding a fused image pyramid. Wavelet-based methods (10-12), which can analyze the direction of frequency subbands, make full use of the texture direction information of the image content. Through a multi-level wavelet transform, the source image is divided into low-frequency components capturing high-level features and multiple sets of directional high-frequency components at various scales, and fusion rules are then designed based on these distinct features. Methods based on sparse representation (18,19) align with the physiological characteristics of the human visual system, offering greater stability and interpretability for understanding source images. The fusion process of sparse representation-based image fusion is as follows: (I) each image is divided into overlapping blocks, with each block transformed into a vector; (II) the sparse representation of the source image blocks is obtained using a predefined or learned dictionary; (III) the sparse coefficients are merged according to a specific fusion strategy; (IV) the fused image is reconstructed from the fused sparse representation. Sparse representation methods can help reduce visual artifacts and improve robustness to registration errors. Edge-preserving filters have been widely used in salient region-based medical image fusion (23,31): firstly, the source images are decomposed by an edge-preserving filter; secondly, different fusion rules are applied to the decomposed layers at different scales; finally, the fused base and detail layers are combined to reconstruct the fused image.
These methods apply the same representation approach to images from different modalities to retain consistent features. However, in multi-modal images, the key information can differ significantly. For instance, anatomical images convey density structure through contrast, whereas functional images emphasize color representation. Consequently, using identical representation methods for multi-modal images is not suitable.
DL methods
In recent years, the emergence of DL has provided new ideas for multimodal medical image fusion. Liu et al. (32) introduced the CNN for multimodal medical image fusion, in which the CNN synthesizes the pixel activity information of two source images to generate a weight map and realizes the fusion process by means of an image pyramid. Xu et al. (33) introduced an end-to-end unsupervised network for medical image fusion, incorporating both surface and deep constraints to retain information: the surface constraints focus on saliency and abundance to evaluate the activity level of the source image, whereas in the deep constraints, uniqueness is measured through the neural network to preserve unique information. Balakrishnan et al. (34) employed a Siamese convolutional network to integrate the pixel motion information from multiple multimodal medical images, creating a weight map for fusion. Tang et al. (35) proposed an adaptive convolution method, incorporating a global complementary context adaptive modulation convolution kernel combined with an adaptive transformer to enhance global semantic extraction; the network adopts a multi-scale design to fully exploit useful multi-modal information at different scales. Fu et al. (25) proposed a multi-scale residual pyramid attention network (MSRPAN), which consists of a feature extractor, a fusion mechanism, and a reconstructor; the feature extractor comprises three MSRPAN blocks for extracting multi-scale features. Li et al. (26) proposed a multi-scale dual-branch residual attention (MSDRA) network, including a feature extraction module, a feature fusion module, and an image reconstruction module. The feature extraction module extracts image features through three MSDRA modules in series, and a feature L1-norm fusion strategy is proposed to fuse the features obtained from the input images. Lahoud et al. (36) proposed a real-time image fusion method that uses a pre-trained neural network to generate a single image containing multi-modal source features; the images are merged using a novel strategy based on the deep feature maps extracted by the CNN, which are used to compute fusion weights that guide the multi-modal fusion process. Zhao et al. (37) introduced a medical image fusion algorithm combining a deep convolutional generative adversarial network (GAN) with a dense block model to produce fusion images rich in information. Their network integrates an image generation module and a discriminator module built on dense blocks and an encoder-decoder, overcoming the shortcomings of manually designed activity-level measurement in traditional methods; the intermediate-layer information is processed through the dense blocks, thereby avoiding information loss. Song et al. (38) presented a multi-scale DenseNet for medical image fusion composed of an encoding layer, a fusion layer, and a decoding layer, in which filters of three sizes are used to extract features that are then fused by a fusion strategy.
Among GAN-based approaches, Liu et al. (39) proposed an unsupervised algorithm for high-quality medical image fusion composed of a lightweight image enhancement deep network and a GAN. The lightweight image enhancement network improves the quality of the fused image, making it more suitable for the human visual perception system, whereas the GAN further enhances texture details and edge information. Huang et al. (28) proposed a multi-generator, multi-discriminator conditional generative adversarial network (cGAN): the first cGAN generates fusion images that closely resemble real ones, and the second cGAN focuses on enhancing dense structural details and preserving functional information without distortion. Ma et al. (29) introduced a dual-discriminator conditional GAN, in which the generator produces fusion images resembling real ones and two discriminators are employed to calculate content loss and distinguish the structural differences between the fusion image and the two source images. Fu et al. (40) proposed a GAN with a dual-stream attention mechanism (DSAGAN) for anatomical and functional image fusion, leveraging a dual-stream architecture and multiscale convolutions to extract deep features. Xu et al. (41) presented a three-branch network architecture, the Proportional-Integral-Derivative Network (PIDNet), for semantic segmentation, establishing a link between CNNs and proportional-integral-derivative (PID) controllers. Umirzakova et al. (42) introduced a Deep Residual Feature Distillation Channel Attention Network to enhance the super-resolution of medical images; this approach aims to optimize both performance and efficiency, offering valuable insights for improving the efficiency of medical image processing methods.
In the studies mentioned above, although considerable attention was paid to fusing texture and functional information, the emphasis fell mainly on the texture details within the MRI images, with less attention given to the information in functional images. This has resulted in the dispersion of high-frequency semantic information within low-frequency contextual information, thereby reducing the discernibility of differences in functional information, which is unfavorable for diagnosing lesion locations. Hence, we propose an end-to-end unsupervised network with a three-part structure in this work. Apart from extracting deep-layered information, this network retains information at varying depths, ensuring that high-frequency semantic details are not dispersed by low-frequency contextual information. Compared to other CNN-based methods, our approach preserves more functional information while maintaining the integrity of texture details, thereby allowing better preservation and visualization of tumor contours and boundaries in the images.
Methods
In this section, we first formulate the problem through an analysis of the characteristics of multimodal images, and then introduce the architecture and loss function of the fusion network. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Overview
Two source images, I_1 and I_2, are used, each of height H and width W, with C denoting the number of channels. The objective is to establish a mapping between I_1 and I_2 that generates a fused image I_F of the same height and width. Multimodal image fusion typically involves three cases: (I) fusion of CT and MRI; (II) fusion of PET and MRI; (III) fusion of SPECT and MRI.
For PET/SPECT images, preserving functional information is crucial. This information is largely carried by chromaticity, so the PET/SPECT image is converted into the YCbCr color space, which separates intensity and chromaticity details across the Y, Cb, and Cr channels. During the fusion process, the Y (intensity) channel is combined with the MRI image, so that the resulting fused image preserves the intensity distribution of the Y channel. In CT images, the critical information is embedded in dense, high-intensity texture structures, whereas MRI images require careful preservation of intricate texture details. Consequently, the primary features of these distinct modalities are chromaticity and texture information. The dual objectives of the fusion process are therefore to preserve high-quality chromaticity, so that functional data are optimally reflected, and to maintain rich texture details, so that essential soft tissue information is retained. Through a carefully designed network structure, the Y channel is fed into the network for learning, whereas the Cb and Cr channels are preserved to retain the chromaticity information.
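As an illustration of this color-space handling, the following sketch fuses only the Y channel of a functional image with the MRI and reinstates the original chromaticity channels. It assumes an OpenCV-style conversion, and the fuse_y callable is a placeholder standing in for the trained fusion network:

```python
import numpy as np
import cv2  # OpenCV is used here only for color-space conversion

def fuse_functional_anatomical(func_rgb, mri_gray, fuse_y):
    """Hedged sketch: fuse the Y channel of a PET/SPECT image with a co-registered MRI.

    func_rgb: HxWx3 uint8 functional image (PET or SPECT)
    mri_gray: HxW   uint8 anatomical image (MRI)
    fuse_y  : callable(y, mri) -> fused Y channel; stands in for the trained network
    """
    ycrcb = cv2.cvtColor(func_rgb, cv2.COLOR_RGB2YCrCb)   # note: OpenCV orders channels Y, Cr, Cb
    y, cr, cb = cv2.split(ycrcb)
    fused_y = np.clip(fuse_y(y, mri_gray), 0, 255).astype(np.uint8)  # learned intensity channel
    fused = cv2.merge([fused_y, cr, cb])                  # chromaticity channels are kept as-is
    return cv2.cvtColor(fused, cv2.COLOR_YCrCb2RGB)
```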
The network architecture for medical image fusion is presented in Figure 2. The framework consists of three modules: a feature preprocessor, a feature extractor, and a feature fuser. Initially, image features are extracted through the feature input module. To capture a broader range of semantic information, convolution kernels of various sizes are employed to extract features, which are subsequently fused. The three feature extraction modules are then applied sequentially to the resulting features. To ensure that the output image contains both low- and high-dimensional information, a skip connection is utilized between the low-level and deep features. Finally, the fusion strategy in the image fusion module is used to fuse the 64-dimensional features of the input images, and the output is reconstructed by two convolutional layers and a final convolutional layer to obtain the fused image. By extracting deeper high-dimensional information while retaining low-level features, the fusion algorithm incorporates more comprehensive information, preserving both functional data and high-frequency details. The subsequent subsections provide a detailed explanation of the algorithm.

Feature preprocessor
The feature preprocessor consists of input images and the feature input module, as shown in Figure 3. The feature input module is composed of four convolutional layers with different kernel sizes. The input image passes through the first three convolutional layers, and the outputs of each layer are concatenated along the channel dimension. The final output is then processed through a subsequent convolutional layer. Each convolutional layer is followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function, which serve to normalize the input data and enhance the network’s nonlinearity.
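A minimal PyTorch sketch of such a feature input module is given below; the kernel sizes (3, 5, 7, then 1×1), the channel widths, and the parallel arrangement of the first three convolutions are assumptions rather than the exact configuration used in LYWNet:

```python
import torch
import torch.nn as nn

class FeatureInputModule(nn.Module):
    """Three convolutions with different kernel sizes, channel-wise concatenation,
    and a final convolution; every convolution is followed by BN and ReLU."""
    def __init__(self, in_ch=1, mid_ch=16, out_ch=64):
        super().__init__()
        def conv_bn_relu(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=k // 2),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.conv3 = conv_bn_relu(in_ch, mid_ch, 3)
        self.conv5 = conv_bn_relu(in_ch, mid_ch, 5)
        self.conv7 = conv_bn_relu(in_ch, mid_ch, 7)
        self.out = conv_bn_relu(3 * mid_ch, out_ch, 1)

    def forward(self, x):
        feats = torch.cat([self.conv3(x), self.conv5(x), self.conv7(x)], dim=1)
        return self.out(feats)

# Example: FeatureInputModule()(torch.randn(1, 1, 256, 256)).shape -> torch.Size([1, 64, 256, 256])
```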

Feature extractor
The feature extractor consists of three identical modules, as illustrated in Figure 4, which work collaboratively to extract features more comprehensively. To enhance feature diversity and preserve high-frequency information, as shown in Figure 5, this module incorporates three branches. The first branch utilizes pyramid transformation to extract features at various dimensions. Simultaneously, the second and third branches implement residual networks, which include residual connections to facilitate the flow of gradients and mitigate the vanishing gradient problem in deep networks. These residual networks, with distinct structures, ensure a more compact distribution of the extracted features by enabling the learning of complex patterns, even in deeper layers. Additionally, a micro-PPM is integrated into the second branch to enhance the network’s ability to capture global information. The outputs from all three branches are then fed into the boundary attention module, where boundary features are leveraged to guide the fusion of contextual information.

Micro-pyramid attention module
The pyramid attention module, introduced in (43), enhances the feature pyramid attention (25), thereby improving the performance of the feature pyramid. The multi-scale pyramid technique processes the image at different resolutions, enabling the model to effectively detect objects of varying sizes. In the feature pyramid attention, a mask module is incorporated into the traditional feature pyramid model to augment the multi-scale approach. This integration allows the model to prioritize relevant features, effectively combining the multi-scale and attention mechanisms to enhance performance. The pyramid attention mechanism is expressed in Eq. [1]:
In Eq. [1], the output feature is obtained from the original feature through a convolution operation combined with the three-level pyramid structure, and the result is added back to the original feature through a residual connection.
The input feature maps are pooled to obtain two feature maps at successively smaller scales. The original feature map and the two pooled feature maps are convolved with filters of three different sizes and upsampled, and the resulting pyramid features are added back to the original feature map through a residual connection. Although a large convolutional filter provides a large receptive field (the area of the input image a network layer can “see” in one pass, which is essential for capturing broad context and understanding large structures in the image), it brings a large amount of computation and many parameters (44). To address this, we adopt densely connected layers, which alleviate the vanishing gradient problem and enhance feature propagation while significantly reducing the parameters. Therefore, one large filter is replaced with two smaller convolution kernels, and the other is replaced with three smaller convolution kernels. This modification reduces computational complexity while maintaining the network’s ability to capture the necessary spatial information.
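The pyramid branch described above can be sketched as follows. The pooling factors, channel count, and additive (rather than gated) aggregation are assumptions, the stacked 3×3 kernels stand in for the larger 5×5/7×7 filters used in typical pyramid attention designs, and the mask/attention gating of the full module is omitted for brevity:

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_stack(ch, n):
    """n stacked 3x3 convolutions; two emulate a 5x5 receptive field, three emulate 7x7."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class MicroPyramidBranch(nn.Module):
    """Hedged sketch: pool the input to two smaller scales, convolve each scale with
    stacked 3x3 kernels, upsample, and add everything back to the input (residual)."""
    def __init__(self, ch=64):
        super().__init__()
        self.full = conv_stack(ch, 1)     # full resolution
        self.half = conv_stack(ch, 2)     # 1/2 resolution, two 3x3 (~5x5)
        self.quarter = conv_stack(ch, 3)  # 1/4 resolution, three 3x3 (~7x7)

    def forward(self, x):
        h, w = x.shape[-2:]
        up = lambda t: F.interpolate(t, size=(h, w), mode='bilinear', align_corners=False)
        y1 = self.full(x)
        y2 = up(self.half(F.avg_pool2d(x, 2)))
        y3 = up(self.quarter(F.avg_pool2d(x, 4)))
        return x + y1 + y2 + y3           # residual aggregation across scales
```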
Multi-basic block and multi-bottleneck
The bottleneck structure significantly reduces both computational cost and parameter count while preserving the expressive power of the network. The concept of the bottleneck effectively balances computational efficiency with feature representation, making the network more parameter-efficient and reducing overall computational complexity. As illustrated in Figure 5, multilayer bottleneck and basic block structures are employed to extract deeper levels of detailed information.
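The two residual building blocks can be sketched as standard ResNet-style blocks; the channel widths and the reduction ratio of the bottleneck are assumptions:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a residual (identity) connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand with a residual connection; the channel reduction
    keeps the 3x3 convolution cheap, which is the parameter-saving effect described above."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        mid = ch // reduction
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))
```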
Micro-PPM
The PPM (45,46) is introduced to aggregate contextual information from different regions, thereby enhancing the network’s ability to capture global information. This module applies pooling operations at multiple scales to the original feature map, generating several feature maps of varying sizes, which are then concatenated along the channel dimension. The resulting composite feature map effectively fuses information from multiple scales, balancing global semantic and local detail information. However, including too many pooling scales in each channel leads to excessive computational cost. To address this issue, the number of pyramid levels is reduced in the proposed model, thereby decreasing the computational burden and accelerating processing, as illustrated in Figure 6.
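A reduced pyramid pooling module in the PSPNet style might look as follows; the bin sizes (1, 2, 4) and the channel split are assumptions made to illustrate the reduced-scale design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroPPM(nn.Module):
    """Pool the feature map into a few global bins, project, upsample, concatenate with
    the original features, and fuse back to the input width."""
    def __init__(self, ch=64, bins=(1, 2, 4)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(ch, ch // len(bins), 1))
            for b in bins
        ])
        self.project = nn.Conv2d(ch + (ch // len(bins)) * len(bins), ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
                  for stage in self.stages]
        return self.project(torch.cat([x] + pooled, dim=1))  # global + local context fused
```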

Boundary attention-guided (BAG) fusion module
The BAG fusion module, as shown in Figure 7, is proposed to guide the fusion of detailed and contextual representations. Although context information is semantically accurate, significant spatial and geometric details can be lost in the process. The BAG module addresses this issue by directing the model to prioritize the boundary regions, allowing these areas to focus on detailed features, while other regions are filled with contextual information. This approach ensures that both fine-grained details and broader context are effectively integrated in the fused output. The feature vectors corresponding to the feature maps of the three branches serve as the detail, context, and boundary inputs of the BAG operation: when the boundary attention response is strong, the model depends more on the detailed features; otherwise, it relies on the contextual information.
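For concreteness, a plausible form of the BAG operation, modeled on the boundary-attention-guided fusion of PIDNet (41), is sketched below; the symbols P (detail features), I (context features), D (boundary features), and the sigmoid gate σ are assumptions rather than the exact formulation used here:

```latex
% Hedged BAG-style gating following PIDNet (41); \odot denotes element-wise multiplication.
F_{\mathrm{out}} = \sigma(D) \odot P + \bigl(1 - \sigma(D)\bigr) \odot I
```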
Feature fuser
In the experiments, a feature distillation strategy (FDS) is proposed to fuse the extracted features. The primary objective of FDS is to integrate features from multiple sources or models to enhance the overall feature representation. By distilling and merging the most relevant information from each feature source, FDS enables the model to learn richer and more comprehensive features, thereby improving performance on downstream tasks. This fusion process not only helps to retain critical information from each feature source but also provides a more robust and efficient feature representation, resulting in improved model accuracy and generalization. The formula is expressed as:
where the fused feature is obtained from the features extracted from source images 1 and 2 by the feature extraction module, and T is a parameter of the strategy. The quantitative results for SPECT-MRI, PET-MRI, and CT-MRI with T ranging from 1 to 30 are shown in Figures 8-10. In all three figures, the structural similarity (SSIM) (47) and peak signal-to-noise ratio (PSNR) values increase with T, whereas the remaining four metrics first increase and then level off.



Loss function
MRI and PET have different characteristics due to different imaging mechanisms. In order to retain more features and improve the quality of the fused image, the loss function should fully consider the retention of features and the characteristics of the image. In this paper, the loss function of our LYWNet is defined as:
where L_sim, L_cos, and L_edge denote the similarity loss, the cosine similarity loss, and the edge loss, respectively, and α and β are two trade-off parameters that balance the three terms.
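A plausible form of the overall objective, assuming the similarity term carries unit weight (the exact weighting arrangement of the three terms is an assumption here), is:

```latex
% Hedged sketch of the total loss.
L_{\mathrm{total}} = L_{\mathrm{sim}} + \alpha\, L_{\mathrm{cos}} + \beta\, L_{\mathrm{edge}}
```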
Similarity loss
The similarity loss constrains the images the network generates so that the fused image retains part of the information from the input images. The similarity loss function is defined as:
where ||·||_2 denotes the matrix 2-norm, I denotes an input image (I_1 or I_2), F_n denotes a fused image with n = 1, …, N, and N is the number of fused images.
Intensity loss
The inputs are two images of different modalities whose intensity distributions differ significantly at the vector level. The cosine similarity loss judges the similarity between two vectors, making the fused image generated by the network more similar to the source images at the vector level. Thus, we use the cosine similarity loss as the intensity loss; it is defined as:
where y represents the label, which belongs to {1, −1}, and the margin parameter represents the cosine similarity threshold.
Edge loss function
The edge loss function preserves the fused image’s edge details to improve the visual quality and perceptual fidelity of the fused structure. By improving visual consistency between the source images and smoothing transitions at image edges, discontinuities and perceived unnaturalness are reduced; the loss is defined as:
where Sobel(·) represents the edge features extracted from an image using the Sobel filter, and MSE denotes the mean squared error loss, which measures the difference between the two sets of edge features.
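To make the three terms concrete, the following minimal PyTorch sketch combines a 2-norm similarity term, a cosine-embedding term, and a Sobel-based edge term. Single-channel inputs of shape B×1×H×W, the averaging over the two source images, and the default weights are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed Sobel kernels for horizontal and vertical gradients.
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_edges(img):
    """Gradient magnitude of a single-channel image batch (B x 1 x H x W)."""
    gx = F.conv2d(img, SOBEL_X.to(img.device), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img.device), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def lywnet_loss(fused, src1, src2, alpha=1.0, beta=1.0):
    """Hedged sketch of the combined loss: similarity + cosine similarity + edge terms."""
    # Similarity loss: 2-norm between the fused image and each source image.
    l_sim = 0.5 * (torch.norm(fused - src1, p=2) + torch.norm(fused - src2, p=2))
    # Cosine similarity loss: pull the fused image toward both sources at the vector level.
    cos = nn.CosineEmbeddingLoss(margin=0.0)
    target = torch.ones(fused.size(0), device=fused.device)   # label y = 1 (similar pairs)
    flat = lambda t: t.flatten(1)
    l_cos = cos(flat(fused), flat(src1), target) + cos(flat(fused), flat(src2), target)
    # Edge loss: MSE between Sobel edge maps of the fused image and each source.
    l_edge = F.mse_loss(sobel_edges(fused), sobel_edges(src1)) + \
             F.mse_loss(sobel_edges(fused), sobel_edges(src2))
    return l_sim + alpha * l_cos + beta * l_edge
```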
Results
This section presents a comparison of the proposed LYWNet with several state-of-the-art methods for SPECT-MRI, PET-MRI, and CT-MRI image fusion. The experiments are conducted qualitatively and quantitatively using a publicly available dataset. Additionally, an ablation study, efficiency analysis, and parameter comparison are included.
Datasets and training details
In the experiments, a total of 329 pairs of SPECT-MRI, 318 pairs of PET-MRI, and 184 pairs of CT-MRI images were prepared for training, whereas 30 pairs each of SPECT-MRI, PET-MRI, and CT-MRI images were used for testing. All images were sourced from the Whole Brain Atlas (http://www.med.harvard.edu/AANLIB/home.html), created by Harvard Medical School. The MRI, SPECT, PET, and CT images were coregistered, with consistent pixel sizes across all modalities. The network is implemented in PyTorch (https://pytorch.org/) and trained on an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA). The learning rate is set to 0.0001, and the optimization algorithm is Adam (48). The network is trained for 1,000 epochs with a batch size of 16.
Comparison with other fusion methods
The performance of LYWNet is compared with 12 state-of-the-art methods, including CNN (32), enhanced medical image fusion network (EMFusion) (33), local extreme map guided multi-modal image fusion (LEGFF) (49), Laplacian redecomposition (LRD) (50), multiscale adaptive transformer (MATR) (35), multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network (MLCF-MLMG-PCNN) (51), multi-scale DenseNet (MSDNet) (38), multiscale double-branch residual attention (MSDRA) (26), NSCT (14), non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network (NSST-MSMG-PCNN) (19), zero-learning fast (34), NSST-parameter-adaptive pulse-coupled neural network (NSST-PAPCNN) (18), and DSAGAN (40). Among these methods, LEGFF, LRD, NSCT, MLCF-MLMG-PCNN, and NSST-PAPCNN are traditional techniques, whereas CNN, EMFusion, MATR, MSDNet, MSDRA, NSST-MSMG-PCNN, zero-learning fast, and DSAGAN are DL-based approaches.
For the quantitative comparison, six metrics are employed to evaluate the fusion performance: SSIM, PSNR, mutual information (MI) (52), Fowlkes-Mallows index (FMI) (53), adjusted Rand index (ARI), and correlation coefficient (CC) (54). SSIM assesses structural similarity, ensuring that critical details, such as soft tissue and bone structures, are preserved in the fused image, which is essential for precise lesion localization. PSNR evaluates image clarity by measuring noise levels, which is crucial for accurately visualizing subtle features. MI quantifies the shared information between different modalities (e.g., CT and MRI), ensuring that the fused image integrates unique characteristics from each modality for a comprehensive representation. FMI provides an indication of lesion boundaries, aiding accurate lesion localization, minimizing inadvertent damage to normal tissue, and enhancing treatment planning accuracy. ARI assesses the agreement between the clustering results of the image and the true labels. CC measures the correlation between the fused and original images, ensuring the preservation of intensity consistency across modalities. Among these metrics, SSIM extracts three key features from the image, namely brightness, contrast, and structure, and then computes the similarity between them. The range of SSIM is [0, 1]; the larger the SSIM, the smaller the structural loss and distortion. SSIM is defined as follows:
where μ_A and μ_B are the means of images A and B, respectively; σ_A^2 and σ_B^2 are the variances of images A and B, respectively; σ_AB is the covariance of images A and B; and C_1, C_2, and C_3 are small constants introduced to prevent the denominator from approaching zero.
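For reference, the standard three-component SSIM formulation (47), consistent with the quantities defined above (the exact constant values used in the experiments are not reproduced here), is:

```latex
\mathrm{SSIM}(A,B) =
  \frac{2\mu_A\mu_B + C_1}{\mu_A^2 + \mu_B^2 + C_1} \cdot
  \frac{2\sigma_A\sigma_B + C_2}{\sigma_A^2 + \sigma_B^2 + C_2} \cdot
  \frac{\sigma_{AB} + C_3}{\sigma_A\sigma_B + C_3}
```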
PSNR is the ratio of peak power to noise power in the fused image. A higher PSNR value indicates a smaller difference in quality between the two images.
where MAX_F is the maximum pixel value in image F.
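The standard PSNR definition consistent with this description is given below, with MSE denoting the mean squared error between the fused image and the reference image:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_F^{2}}{\mathrm{MSE}}\right)
```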
MI is used to measure the degree of information overlap between the fused image and the original image and evaluate the fusion quality. A larger MI indicates a higher similarity between the fused image and the original image.
where p_{A,F} and p_{B,F} are the joint probability distributions of the fused image F with source images A and B, respectively, and p_A, p_B, and p_F are the marginal probability distributions of images A, B, and F, respectively.
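For reference, the standard mutual information between a source image A and the fused image F is shown below (52); combining the two source terms by summation is an assumption:

```latex
\mathrm{MI}_{A,F} = \sum_{a,f} p_{A,F}(a,f)\,\log_2\frac{p_{A,F}(a,f)}{p_A(a)\,p_F(f)},
\qquad
\mathrm{MI} = \mathrm{MI}_{A,F} + \mathrm{MI}_{B,F}
```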
ARI is a statistic that measures the similarity between two data clusters. The ARI is calculated by comparing real and fused images. A larger ARI indicates a higher similarity between the fused image and the original image.
where true positive (TP) stands for the number of pixels in the intersection of the object boundary generated by the algorithm and the standard boundary, that is, the number of object pixels correctly detected. False positive (FP) represents the number of pixels in the object boundary generated by the algorithm that do not intersect the standard boundary, that is, the number of object pixels detected by the algorithm incorrectly. False negative (FN) represents the number of pixels in the standard boundary that do not intersect the object boundary generated by the algorithm, that is, the number of object pixels that the algorithm fails to detect.
CC measures the linear correlation between the fused image and the source images. A higher CC indicates greater similarity between the fused image and the source images.
where COV(·) denotes the covariance of the two input images, and σ_A and σ_F denote the standard deviations of images A and F, respectively.
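The standard correlation coefficient between a source image A and the fused image F is given below; averaging over the two sources to obtain the reported CC is an assumption:

```latex
\mathrm{CC}(A,F) = \frac{\mathrm{COV}(A,F)}{\sigma_A\,\sigma_F}
```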
SPECT and MRI image fusion
The qualitative fusion results for typical SPECT and MRI image pairs are presented in Figure 11. The proposed method demonstrates three key advantages. First, in functional and anatomical images, the lesion location is more prominently reflected in functional images. Our approach mitigates the large amount of irregular mosaic information in SPECT images by incorporating the rich edge information from MRI, while preserving a substantial amount of functional information. Second, compared to other methods, the lesion location’s brightness and the color contrast with surrounding areas are enhanced, which aids clinicians in locating the lesion and identifying the nearby perfusion regions. For instance, in Figure 11G, MATR retains a significant amount of texture information, but the influence of high-frequency components disperses contextual details, resulting in suboptimal reflection of functional information. Finally, to assist physicians in detecting the lesion, the method preserves the texture information reflected in MRI, while safeguarding it from being obscured by the functional information in SPECT.

The quantitative results are presented in Table 1. The proposed method achieves optimal performance in terms of SSIM, PSNR, ARI, FMI, and CC. These results indicate that the method effectively preserves high-quality features from the source images with minimal distortion. The relatively lower MI scores can be attributed to the following factors: in order to enhance the functional information in SPECT images, the distribution of texture information in MRI was reduced, which resulted in a lower MI score. Given that brightness information is less pronounced in SPECT compared to the texture information in MRI, improving the retention of MRI texture information would lead to a higher MI score, whereas prioritizing SPECT brightness retention results in a lower MI score. Consequently, competing methods, which preserve more MRI texture information, tend to produce higher MI scores. MI is commonly used to measure the amount of shared information between fused images, and maximizing MI can sometimes lead to a reduction in texture fidelity. However, in many clinical scenarios, functional data plays a crucial role in detecting abnormalities that may not be apparent in structural imaging alone. For instance, in oncology, PET scans reveal metabolic activity, which can highlight malignant tumors through elevated glucose uptake, an aspect that may be less discernible in texture-based imaging techniques such as MRI. Similarly, in neurology, functional imaging provides valuable insights into metabolic changes within specific brain regions, aiding early diagnosis or tracking disease progression. In such cases, functional activity may be more informative than texture details. The proposed method outperforms others in terms of correlation metrics, such as feature similarity and SSIM. The superior SSIM, PSNR, ARI, FMI, and CC results demonstrate that the method successfully retains functional information while preserving important MRI texture features. Moreover, the method exhibits a low standard deviation in SSIM, ARI, and FMI, contributing to the stability of the generated fusion images in terms of both structure and feature preservation.
Table 1
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.5052±0.0617 | 15.7235±1.5331 | 0.7002±0.0591 | 0.2879±0.0716 | 0.4440±0.0659 | 0.6378±0.0441 |
EMFusion (33) | 0.5086±0.0626 | 16.0851±1.8244 | 0.7085±0.0674 | 0.6861±0.0603 | 0.7718±0.0454 | 0.6150±0.0417 |
LEGFF (49) | 0.5059±0.0630 | 15.6472±1.5488 | 0.6947±0.0673 | 0.6720±0.0604 | 0.7618±0.0479 | 0.6384±0.0416 |
LRD (50) | 0.5101±0.0611 | 15.5640±1.4369 | 0.7363±0.0679* | 0.6490±0.0661 | 0.7426±0.0508 | 0.6567±0.0415 |
MATR (35) | 0.4656±0.0544 | 16.1183±1.3801 | 0.6735±0.1132 | 0.4889±0.0857 | 0.6147±0.0706 | 0.6197±0.0456 |
MLCF-MLMG-PCNN (51) | 0.5381±0.0555 | 17.3068±1.1369 | 0.6700±0.0655 | 0.6706±0.0639 | 0.7596±0.0478 | 0.6474±0.0424 |
MSDNet (38) | 0.5329±0.0588 | 16.6933±1.4636 | 0.7132±0.0787 | 0.6662±0.0626 | 0.7561±0.0478 | 0.6313±0.0380 |
MSDRA (26) | 0.5096±0.0544 | 14.9558±1.0011 | 0.6973±0.0601 | 0.6810±0.0713 | 0.7688±0.0574 | 0.6486±0.0392 |
NSCT (14) | 0.5066±0.0629 | 15.7038±1.4344 | 0.7109±0.0717 | 0.6337±0.0602 | 0.7307±0.0454 | 0.6296±0.0412 |
NSST-MSMG-PCNN (19) | 0.5328±0.0568 | 17.2072±1.1414 | 0.6513±0.0629 | 0.5576±0.0695 | 0.6710±0.0551 | 0.6432±0.0434 |
NSST-PAPCNN (18) | 0.5068±0.0635 | 15.6586±1.4248 | 0.7181±0.0685 | 0.6349±0.0633 | 0.7316±0.0469 | 0.6377±0.0414 |
Zero-learning fast (34) | 0.5038±0.0588 | 15.9215±1.3025 | 0.7012±0.0774 | 0.6881±0.0600 | 0.7740±0.0451 | 0.6379±0.0406 |
DSAGAN (40) | 0.4896±0.0493 | 15.9315±0.9365 | 0.6899±0.0516 | 0.4828±0.0731 | 0.6099±0.0607 | 0.6514±0.0400 |
LYWNet | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643 | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439* |
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.
Table 2 presents the inference time of different neural network methods. Although LYWNet achieves high accuracy, its inference time suggests that optimization may be needed for real-time deployment in resource-limited clinical settings. Figure 12 shows the quantitative comparison results of our proposed method with 12 other methods on 30 SPECT-MRI test image pairs. From Figure 12 and Table 1, it can be seen that the proposed method produces the lowest standard deviation on SSIM, ARI, and FMI metrics. Specifically, the SSIM value of the proposed method is greater than that of other competitive methods in 30 sets of test data.
Table 2
Methods | Inference time (s) |
---|---|
CNN (32) | 9.9461 |
EMFusion (33) | 48.4052 |
MATR (35) | 210.8680 |
MSDNet (38) | 19.0274 |
MSDRA (26) | 19.0274
NSST-MSMG-PCNN (19) | 400.7225 |
Zero-learning fast (34) | 114.4541 |
DSAGAN (40) | 12.6487 |
LYWNet | 20.02578 |
The inference times of the different fusion methods on the 30 test image pairs are shown. MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; MATR, multiscale adaptive transformer; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

PET and MRI image fusion
The qualitative comparison results are presented in Figure 13. Given that this fusion task is similar to the fusion of SPECT and MRI images, the analysis of the qualitative results follows the same approach as in SPECT and MRI image fusion. On the one hand, the proposed method enhances the visibility of functional information; on the other hand, it effectively preserves critical texture detail from the MRI images.

The quantitative results are presented in Table 3. The proposed method achieves the highest values for SSIM, PSNR, ARI, and FMI, indicating superior fusion performance. Figure 14 illustrates the quantitative comparison between the proposed method and 12 other techniques on 30 PET-MRI test image pairs. As shown in both Figure 14 and Table 3, the proposed method demonstrates the lowest standard deviation in PSNR and MI. Specifically, the SSIM value of the proposed method surpasses that of other competing methods across all 30 test data sets.
Table 3
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.4813±0.0825 | 13.4574±2.1690 | 0.6937±0.1576 | 0.2980±0.1717 | 0.4359±0.1530 | 0.6125±0.0267 |
EMFusion (33) | 0.4839±0.0857 | 13.4739±2.1741 | 0.7133±0.1803 | 0.4446±0.2338 | 0.5666±0.1926 | 0.5960±0.0188 |
LEGFF (49) | 0.4701±0.0805 | 13.0700±2.1626 | 0.7172±0.1826 | 0.4135±0.2732 | 0.5322±0.2340 | 0.6016±0.0183 |
LRD (50) | 0.4770±0.0873 | 12.9515±2.4644 | 0.7293±0.1791* | 0.3637±0.2411 | 0.4894±0.2132 | 0.6205±0.0235 |
MATR (35) | 0.4358±0.0747 | 13.4686±1.9793 | 0.6770±0.1706 | 0.3987±0.2172 | 0.5285±0.1804 | 0.5795±0.0172 |
MLCF-MLMG-PCNN (51) | 0.4972±0.0743 | 14.1186±2.14779 | 0.7085±0.1725 | 0.4353±0.2424 | 0.5571±0.2006 | 0.6313±0.0326 |
MSDNet (38) | 0.4966±0.0875 | 13.5518±2.41166 | 0.7221±0.1842 | 0.4197±0.2390 | 0.5439±0.1995 | 0.6059±0.0249 |
MSDRA (26) | 0.4808±0.0676 | 13.0671±1.9244 | 0.7171±0.1554 | 0.4241±0.2829 | 0.5369±0.2504 | 0.6030±0.0184 |
NSCT (14) | 0.4755±0.0839 | 13.3549±2.1704 | 0.7075±0.1696 | 0.3819±0.2392 | 0.5066±0.2075 | 0.5961±0.0242 |
NSST-MSMG-PCNN (19) | 0.4962±0.0748 | 14.1408±2.1388 | 0.6784±0.1466 | 0.2813±0.1101 | 0.4262±0.1003 | 0.6323±0.0345* |
NSST-PAPCNN (18) | 0.4767±0.0850 | 13.2122±2.2915 | 0.7258±0.1776 | 0.4022±0.2564 | 0.5235±0.2204 | 0.6050±0.0218 |
Zero-learning fast (34) | 0.4771±0.0873 | 13.3424±2.2029 | 0.7144±0.1998 | 0.4730±0.2377 | 0.5930±0.1909 | 0.5926±0.0264 |
DSAGAN (40) | 0.4477±0.0588 | 14.1914±1.1313 | 0.6931±0.1431 | 0.4093±0.2908 | 0.5215±0.2565 | 0.6217±0.0196 |
LYWNet | 0.5195±0.0730* | 14.5324±1.7365* | 0.6914±0.1431 | 0.5237±0.1725* | 0.6484±0.0134* | 0.6291±0.0193
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

CT and MRI image fusion
The qualitative fusion results for typical CT and MRI image pairs are presented in Figure 15. In the fusion results of EMFusion and MLCF-MLMG-PCNN, the texture details in the MRI images are notably reduced when compared to other methods. Similarly, in the results of NSCT, NSST-PAPCNN, and Zero, the dense structures present in the CT images are diminished. In contrast, the proposed method effectively preserves both the texture details and dense structures, maintaining high intensity in the fused images.

The quantitative results across six metrics are presented in Table 4. Figure 16 provides a comparison between the proposed method and 12 other techniques, evaluated on 30 CT-MRI test image pairs. The proposed method achieves the highest values for SSIM and PSNR. As observed in the SPECT and MRI image fusion, a trade-off was made between the preservation of detail and the enhancement of functional information, leading to a lower performance in FMI, ARI, and CC for the fusion of MRI and CT images, particularly in terms of gray-scale details. However, compared with similar algorithms, the proposed method demonstrates significant improvements in the quality of the fused image’s texture information, while also substantially reducing the distortion in the fused image.
Table 4
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.5068±0.0471 | 11.8545±1.1291 | 0.5108±0.0689 | 0.4788±0.1541 | 0.6442±0.1221 | 0.5139±0.0782
EMFusion (33) | 0.5366±0.0413 | 12.6214±1.1422 | 0.5082±0.0748 | 0.6825±0.0448 | 0.8001±0.0275* | 0.5183±0.0774 |
LEGFF (49) | 0.5058±0.0507 | 11.7268±1.0360 | 0.5159±0.0761 | 0.6644±0.0480 | 0.7834±0.0321 | 0.5245±0.0783 |
LRD (50) | 0.5085±0.0467 | 11.4911±0.9681 | 0.5327±0.0704 | 0.6251±0.0632 | 0.7531±0.0460 | 0.5298±0.0765 |
MATR (35) | 0.4845±0.0382 | 12.5892±1.1390 | 0.5147±0.0977 | 0.4067±0.0627 | 0.5849±0.0597 | 0.5168±0.0794 |
MLCF-MLMG-PCNN (51) | 0.5299±0.0435 | 11.9269±1.2340 | 0.5425±0.0778* | 0.6351±0.0540 | 0.7610±0.0365 | 0.5210±0.0777 |
MSDNet (38) | 0.5236±0.0500 | 12.6454±0.9106 | 0.5421±0.0828 | 0.6122±0.0652 | 0.7433±0.0428 | 0.5294±0.0753 |
MSDRA (26) | 0.5091±0.0461 | 11.4519±0.6630 | 0.5025±0.0721 | 0.6884±0.0436* | 0.6884±0.0280 | 0.5183±0.0731 |
NSCT (14) | 0.5020±0.0470 | 11.7038±1.0661 | 0.5124±0.0662 | 0.5344±0.0801 | 0.6854±0.0586 | 0.5224±0.0784 |
NSST-MSMG-PCNN (19) | 0.5051±0.0471 | 12.0371±1.1796 | 0.5128±0.0706 | 0.4083±0.0536 | 0.5871±0.0372 | 0.5198±0.0785 |
NSST-PAPCNN (18) | 0.5002±0.0476 | 12.0233±0.8576 | 0.5307±0.0619 | 0.4433±0.0384 | 0.6152±0.0228 | 0.5320±0.0743* |
Zero-learning fast (34) | 0.4892±0.0445 | 12.7460±0.6583 | 0.4829±0.1184 | 0.6695±0.0446 | 0.7884±0.0291 | 0.4921±0.0700 |
DSAGAN (40) | 0.4687±0.0424 | 13.6992±0.698 | 0.4974±0.0492 | 0.4365±0.0726 | 0.6096±0.0517 | 0.5120±0.0697 |
LYWNet | 0.5376±0.0442* | 13.9202±0.7265* | 0.5343±0.0801 | 0.6342±0.0608 | 0.7589±0.0361 | 0.5270±0.0714 |
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

Ablation study
Selection of the number of LYW blocks
The performance of the feature extractor with one, two, and three blocks was tested on 30 pairs of SPECT-MRI, PET-MRI, and CT-MRI images. The fusion results for these tests are presented in Figure 17. Observation and comparison of the experimental results revealed that the image generated using one block exhibits higher brightness, whereas the image produced by two blocks shows lower brightness. These differences highlight the impact of the number of blocks in the feature extractor on the fusion process, specifically in terms of image brightness.

Tables 5-7 summarize the quantitative evaluation on SPECT-MRI, PET-MRI, and CT-MRI, respectively, obtained by applying one, two, and three LYW blocks. In Tables 5-7, the best results are marked with an asterisk. In Tables 5 and 6, five metrics (SSIM, PSNR, MI, ARI, and FMI) perform best for the SPECT-MRI and PET-MRI fusion images with three blocks. In Table 7, three metrics (MI, ARI, and FMI) perform best for the CT-MRI fusion images with three blocks. This indicates that the fusion image produced using a network of three blocks retains more gradient information and reduces distortion more effectively. Therefore, the use of three LYW blocks is optimal for SPECT-MRI, PET-MRI, and CT-MRI image fusion.
Table 5
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.5059±0.0507 | 11.7825±0.5691 | 0.6615±0.0662 | 0.5989±0.0573 | 0.7019±0.0474 | 0.6848±0.0416* |
2 blocks | 0.5482±0.0510 | 17.2316±1.0077 | 0.6781±0.0697 | 0.5806±0.0544 | 0.6904±0.0430 | 0.6656±0.0435 |
3 blocks | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643* | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYW block, multi-scale pyramid residual weight block.
Table 6
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.4740±0.0623 | 12.1067±1.4643 | 0.6680±0.1759 | 0.3711±0.2665 | 0.4883±0.2402 | 0.6479±0.0278* |
2 blocks | 0.5040±0.0686 | 14.2148±2.3485 | 0.6839±0.1511 | 0.4443±0.1922 | 0.5736±0.1530 | 0.6348±0.0229 |
3 blocks | 0.5195±0.0730* | 14.5234±1.7365* | 0.6914±0.1431* | 0.5237±0.1725* | 0.6484±0.1343* | 0.6291±0.0193 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; LYW block, multi-scale pyramid residual weight block.
Table 7
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.4617±0.0431 | 10.8655±0.6578 | 0.5116±0.0601 | 0.5235±0.0668 | 0.6778±0.0474 | 0.5227±0.0723 |
2 blocks | 0.5435±0.0402* | 14.3888±0.8219* | 0.5106±0.0759 | 0.5548±0.0612 | 0.7010±0.0427 | 0.5295±0.0758* |
3 blocks | 0.5376±0.0442 | 13.9202±0.7265 | 0.5343±0.0801* | 0.6342±0.0608* | 0.7589±0.0361* | 0.5270±0.0801 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; LYW block, multi-scale pyramid residual weight block.
Selection of the fusion strategy
The feature fusion module is utilized to combine the extracted features. To evaluate the effectiveness of the proposed FDS method, a comparison is conducted with five commonly used methods at T=30. This comparison aims to demonstrate the superior performance of the FDS method in the fusion task.
Addition strategy
The addition strategy is one of the most commonly used feature fusion methods. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively.
Average strategy
The average strategy is another commonly used feature fusion method. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively.
Feature energy ratio strategy (FER) (25)
FER is another commonly used feature fusion method. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; the fusion weights are determined by the relative energy of the two feature maps.
Feature L1-norm strategy (FL1N) (26)
FL1N is another commonly used feature fusion method. Its formula is denoted as Eq. [17], where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; the fusion weights are computed from the L1-norm of the feature maps.
Average L1-norm weight strategy (AL1NW)
AL1NW is another commonly used feature fusion method. Its formula also follows the form of Eq. [17], where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; here the L1-norm weights are averaged before being applied.
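For reference, the addition and average strategies take the standard forms below, and a typical L1-norm weighting of the kind used by FL1N/AL1NW-style rules is also sketched; the exact weighting used in the compared strategies may differ:

```latex
% Addition and average strategies; \Phi_1, \Phi_2 are the extracted features, \Phi_F the fused feature.
\Phi_F^{\mathrm{add}} = \Phi_1 + \Phi_2, \qquad
\Phi_F^{\mathrm{avg}} = \tfrac{1}{2}\bigl(\Phi_1 + \Phi_2\bigr)

% A typical L1-norm weighting (sketch): each feature is weighted by its relative activity.
w_i(x,y) = \frac{\lVert \Phi_i(x,y) \rVert_1}{\sum_{j=1}^{2} \lVert \Phi_j(x,y) \rVert_1},
\qquad
\Phi_F^{L_1}(x,y) = \sum_{i=1}^{2} w_i(x,y)\,\Phi_i(x,y)
```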
The six fusion strategies are presented in Figure 18, where it can be observed that the addition strategy results in higher brightness, whereas the FL1N strategy introduces more noise. In the fusion images of SPECT-MRI and PET-MRI, the proposed method demonstrates a better ability to highlight functional information, thereby aiding the detection and diagnosis of lesions. The quantitative results, as shown in Tables 8-10, indicate that the proposed method achieves optimal performance in SSIM, PSNR, ARI, and FMI for SPECT-MRI, in SSIM and FMI for PET-MRI, and in ARI and FMI for CT-MRI. However, the FDS approach does not perform as favorably on some of the remaining metrics. This is because FDS prioritizes the enhancement of specific feature characteristics, such as texture or structural clarity, over the broader optimization of quantitative metrics such as PSNR or SSIM. Consequently, FDS may yield perceptual quality improvements that are not fully captured by traditional metrics, particularly when these metrics emphasize pixel-level fidelity rather than perceptual or contextual quality.

Table 8
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.5095±0.0572 | 13.4644±0.6480 | 0.7553±0.0587* | 0.4989±0.1121 | 0.6251±0.0930 | 0.6883±0.0419* |
Average | 0.5453±0.0509 | 17.1652±1.0138 | 0.6612±0.0630 | 0.6871±0.0535 | 0.7732±0.0388 | 0.6799±0.0418 |
FER | 0.5350±0.05450 | 16.7410±1.2321 | 0.7021±0.0721 | 0.6870±0.0547 | 0.7729±0.0547 | 0.6297±0.0425 |
FL1N | 0.4718±0.0499 | 15.8987±0.7707 | 0.6709±0.0529 | 0.6522±0.0665 | 0.7466±0.0489 | 0.6069±0.0380 |
AL1NW | 0.5576±0.0518 | 17.1294±1.0367 | 0.6620±0.0640 | 0.6871±0.0535 | 0.7732±0.0389 | 0.6777±0.0423 |
FDS | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643 | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; Addition, addition strategy; Average, average strategy; FER, feature energy ratio strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Table 9
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.4889±0.0756 | 12.8918±1.4781 | 0.7500±0.1549* | 0.3820±0.1351 | 0.5219±0.1132 | 0.6626±0.0241* |
Average | 0.5153±0.0692 | 14.0559±2.3653 | 0.6744±0.1500 | 0.5274±0.1767* | 0.6413±0.1276 | 0.6580±0.0197 |
FER | 0.4997±0.0742 | 14.0409±1.7081 | 0.6871±0.1477 | 0.4877±0.1585 | 0.6175±0.1250 | 0.5948±0.0189 |
FL1N | 0.4333±0.0609 | 13.0068±1.5452 | 0.6653±0.1301 | 0.4891±0.1587 | 0.6182±0.1235 | 0.5553±0.287
AL1NW | 0.5116±0.0697 | 15.0075±2.3423* | 0.6731±0.1482 | 0.5236±0.1742 | 0.6482±0.1357 | 0.6544±0.0170 |
FDS | 0.5195±0.0730* | 14.5234±1.7365 | 0.6914±0.1431 | 0.5237±0.1725 | 0.6484±0.1343* | 0.6291±0.0193 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; Addition, addition strategy; Average, average strategy; FER, feature energy ratio strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Table 10
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.5018±0.0474 | 11.6724±0.8403 | 0.5461±0.0840 | 0.5343±0.0751 | 0.6851±0.0469 | 0.5489±0.0746* |
Average | 0.5442±0.0420 | 14.6443±0.7937* | 0.5125±0.0856 | 0.6301±0.0604 | 0.7401±0.0362 | 0.5450±0.07750 |
FER | 0.5382±0.0435 | 13.4851±0.7923 | 0.5405±0.0801 | 0.6121±0.0666 | 0.7429±0.0395 | 0.5257±0.0735 |
FL1N | 0.4498±0.0350 | 13.0860±0.5644 | 0.4299±0.0444 | 0.5784±0.0490 | 0.7205±0.0332 | 0.4841±0.0640 |
AL1NW | 0.5476±0.0407* | 14.4763±0.7526 | 0.6326±0.0918* | 0.6326±0.0613 | 0.7576±0.0366 | 0.5437±0.0747 |
FDS | 0.5376±0.0442 | 13.9202±0.7265 | 0.5343±0.0801 | 0.6342±0.0608* | 0.7589±0.0361* | 0.5270±0.0714 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; FER, feature energy ratio strategy; Addition, addition strategy; Average, average strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Conclusions
A novel fusion algorithm based on a multi-scale pyramid residual weight network (LYWNet) is proposed for medical image fusion. Initially, the original multi-modal images are input into a feature preprocessor to extract semantic information. The extracted features are then passed through a feature extractor to capture both low- and high-dimensional information. A new fusion strategy, the FDS, is introduced to combine the extracted features, which are subsequently processed by a reconstructor to generate the fused image. After training, the fused image can be obtained directly from the original input images without additional parameter adjustments or settings. Extensive experimental comparisons and objective evaluation metrics demonstrate that the proposed algorithm outperforms existing methods in terms of visual quality and most performance metrics.
Acknowledgments
None.
Footnote
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-851/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Shen D, Wu G, Suk HI. Deep Learning in Medical Image Analysis. Annu Rev Biomed Eng 2017;19:221-48. [Crossref] [PubMed]
- Adelsmayr G, Janisch M, Müller H, Holzinger A, Talakic E, Janek E, Streit S, Fuchsjäger M, Schöllnast H. Three dimensional computed tomography texture analysis of pulmonary lesions: Does radiomics allow differentiation between carcinoma, neuroendocrine tumor and organizing pneumonia? Eur J Radiol 2023;165:110931. [Crossref] [PubMed]
- Cohen JG, Reymond E, Medici M, Lederlin M, Lantuejoul S, Laurent F, Toffart AC, Moreau-Gaudry A, Jankowski A, Ferretti GR. CT-texture analysis of subsolid nodules for differentiating invasive from in-situ and minimally invasive lung adenocarcinoma subtypes. Diagn Interv Imaging 2018;99:291-9. [Crossref] [PubMed]
- Liu Y, Chen X, Wang Z, Wang ZJ, Ward RK, Wang X. Deep learning for pixel-level image fusion: Recent advances and future prospects. Inf Fusion 2018;42:158-73.
- James AP, Dasarathy BV. Medical image fusion: A survey of the state of the art. Inf Fusion 2014;19:4-19.
- Wang Z, Cui Z, Zhu Y. Multi-modal medical image fusion by Laplacian pyramid and adaptive sparse representation. Comput Biol Med 2020;123:103823. [Crossref] [PubMed]
- Nencini F, Garzelli A, Baronti S, Alparone L. Remote sensing image fusion using the curvelet transform. Inf Fusion 2007;8:143-56.
- Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion 2015;24:147-64.
- Du J, Li W, Xiao B, Nawaz Q. Union Laplacian pyramid with multiple features for medical image fusion. Neurocomputing 2016;194:326-39.
- Singh R, Khare A. Fusion of multimodal medical images using Daubechies complex wavelet transform – A multiresolution approach. Inf Fusion 2014;19:49-60.
- Xu X, Wang Y, Chen S. Medical image fusion using discrete fractional wavelet transform. Biomed Signal Process Control 2016;27:103-11.
- Ganasala P, Prasad AD. Medical image fusion based on laws of texture energy measures in stationary wavelet transform domain. Int J Imag Syst Tech 2020;30:544-57.
- Jose J, Gautam N, Tiwari M, Tiwari T, Suresh A, Sundararaj V, Mr R. An image quality enhancement scheme employing adolescent identity search algorithm in the NSST domain for multimodal medical image fusion. Biomed Signal Process Control 2021;66:102480.
- Alseelawi N, Hazim HT, Salim ALRikabi HT. A Novel Method of Multimodal Medical Image Fusion Based on Hybrid Approach of NSCT and DTCWT. International Journal of Online and Biomedical Engineering 2022;18:114-33.
- Tawfik N, Elnemr HA, Fakhr M, Dessouky MI, El-Samie FEA. Multimodal Medical Image Fusion Using Stacked Auto-encoder in NSCT Domain. J Digit Imaging 2022;35:1308-25. [Crossref] [PubMed]
- Zhu Z, Zheng M, Qi G, Wang D, Xiang Y. A phase congruency and local Laplacian energy based multi-modality medical image fusion method in NSCT domain. IEEE Access 2019;7:20811-24.
- Diwakar M, Singh P, Shankar A. Multi-modal medical image fusion framework using co-occurrence filter and local extrema in NSST domain. Biomed Signal Process Control 2021;68:102788.
- Guo P, Xie G, Li R, Hu H. Multimodal medical image fusion with convolution sparse representation and mutual information correlation in NSST domain. Complex Intell Syst 2023;9:317-28.
- Tan W, Tiwari P, Pandey HM, Moreira C, Jaiswal AK. Multimodal medical image fusion algorithm in the era of big data. Neural Comput Appl 2020; [Crossref]
- Yin M, Liu X, Liu Y, Chen X. Medical Image Fusion With Parameter-Adaptive Pulse Coupled Neural Network in Nonsubsampled Shearlet Transform Domain. IEEE Trans Instrum Meas 2019;68:49-64.
- Zong J, Qiu T. Medical image fusion based on sparse representation of classified image patches. Biomed Signal Process Control 2017;34:195-205.
- Yang B, Yang C, Huang G. Efficient image fusion with approximate sparse representation. Int J Wavelets Multiresolut Inf Process 2016;14:1650024.
- Zhang X, Ma Y, Fan F, Zhang Y, Huang J. Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. J Opt Soc Am A Opt Image Sci Vis 2017;34:1400-10. [Crossref] [PubMed]
- Ma J, Zhang H, Yi P, Wang Z. SCSCN: A Separated Channel-Spatial Convolution Net With Attention for Single-View Reconstruction. IEEE Trans Ind Electron 2020;67:8649-58.
- Fu J, Li W, Du J, Huang Y. A multiscale residual pyramid attention network for medical image fusion. Biomed Signal Process Control 2021;66:102488.
- Li W, Peng X, Fu J, Wang G, Huang Y, Chao F. A multiscale double-branch residual attention network for anatomical-functional medical image fusion. Comput Biol Med 2022;141:105005. [Crossref] [PubMed]
- Li B, Hwang JN, Liu Z, Li C, Wang Z. PET and MRI image fusion based on a dense convolutional network with dual attention. Comput Biol Med 2022;151:106339. [Crossref] [PubMed]
- Huang J, Le Z, Ma Y, Fan F, Zhang H, Yang L. MGMDcGAN: medical image fusion using multi-generator multi-discriminator conditional generative adversarial network. IEEE Access 2020;8:55145-57.
- Ma J, Xu H, Jiang J, Mei X, Zhang XP. DDcGAN: A Dual-discriminator Conditional Generative Adversarial Network for Multi-resolution Image Fusion. IEEE Trans Image Process 2020; Epub ahead of print. [Crossref]
- Chen J, Li X, Luo L, Mei X, Ma J. Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. Inform Sciences 2020;508:64-78.
- Hu J, Li S. The multiscale directional bilateral filter and its application to multisensor image fusion. Inf Fusion 2012;13:196-206.
- Liu Y, Chen X, Cheng J, Peng H. A medical image fusion method based on convolutional neural networks. IEEE International Conference on Information Fusion 2017. doi: 10.23919/ICIF.2017.8009769.
- Xu H, Ma J. EMFusion: An unsupervised enhanced medical image fusion network. Inf Fusion 2021;76:177-86.
- Balakrishnan R, Priya R. Multimodal Medical Image Fusion based on Deep Learning Neural Network for Clinical Treatment Analysis. International Journal of ChemTech Research 2018;11:160-76.
- Tang W, He F, Liu Y, Duan Y. MATR: Multimodal Medical Image Fusion via Multiscale Adaptive Transformer. IEEE Trans Image Process 2022;31:5134-49. [Crossref] [PubMed]
- Lahoud F, Süsstrunk S. Zero-learning fast medical image fusion. IEEE International Conference on Information Fusion 2019. doi: 10.23919/FUSION43075.2019.9011178.
- Zhao C, Wang T, Lei B. Medical image fusion method based on dense block and deep convolutional generative adversarial network. Neural Comput Appl 2021;33:6595-610.
- Song X, Wu XJ, Li H. MSDNet for medical image fusion. International Conference on Image and Graphics 2019;278-288. doi: 10.1007/978-3-030-34110-7_24.
- Liu Y, Li Z, Feng J, Gu Y. An Unsupervised GAN-based Quality-enhanced Medical Image Fusion Network. IEEE Conference on Telecommunications, Optics and Computer Science 2022:429-432. doi: 10.1109/TOCS56154.2022.10016141.
- Fu J, Li W, Du J, Xu L. DSAGAN: A generative adversarial network based on dual-stream attention mechanism for anatomical and functional image fusion. Inform Sciences 2021;576:484-506.
- Xu J, Xiong Z, Bhattacharyya SP. PIDNet: A real-time semantic segmentation network inspired by PID controllers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023;19529-39.
- Umirzakova S, Mardieva S, Muksimova S, Ahmad S, Whangbo T. Enhancing the Super-Resolution of Medical Images: Introducing the Deep Residual Feature Distillation Channel Attention Network for Optimized Performance and Efficiency. Bioengineering (Basel) 2023;10:1332. [Crossref] [PubMed]
- He J, Deng Z, Zhou L, Wang Y, Qiao Y. Adaptive Pyramid Context Network for Semantic Segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019;7511-20.
- Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2017;4700-4708. doi: 10.1109/CVPR.2017.243.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2017;2881-2890. doi: 10.1109/CVPR.2017.660.
- Pan H, Hong Y, Sun W, Jia Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans Intell Transport Syst 2023;3:3448-60. [Crossref]
- Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004;13:600-12. [Crossref] [PubMed]
- Zhang Z. Improved Adam Optimizer for Deep Neural Networks. IEEE/ACM 26th International Symposium on Quality of Service (IWQoS) 2018;1-2.
- Zhang Y, Xiang W, Zhang S, Shen J, Wei R, Bai X, Zhang L, Zhang Q. Local extreme map guided multi-modal brain image fusion. Front Neurosci 2022;16:1055451. [Crossref] [PubMed]
- Li X, Guo X, Han P, Wang X, Li H, Luo T. Laplacian Redecomposition for Multimodal Medical Image Fusion. IEEE Trans Instrum Meas 2020;69:6880-90.
- Tan W, Thitøn W, Xiang P, Zhou H. Multi-modal brain image fusion based on multi-level edge-preserving filtering. Biomed Signal Process Control. 2021;64:102280.
- Qu G, Zhang D, Yan P. Information measure for performance of image fusion. Electron Lett 2002;38:313-5.
- Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc 1983;78:553-69.
- Deshmukh M, Bhosale U. Image fusion and image quality assessment of fused images. International Journal of Image Processing 2010;4:484.