A multi-scale pyramid residual weight network for medical image fusion
Introduction
Background
With the advancement of imaging technology, medical images are increasingly vital in modern clinical applications. Medical imaging technology enhances our understanding of human tissue structure, expanding its applications in diagnosis, surgical navigation, and beyond (1). There are two common types of medical images, as illustrated in Figure 1: anatomical images and functional images. Anatomical images provide high-resolution structural information; for instance, magnetic resonance imaging (MRI) offers soft tissue information, and computed tomography (CT) provides texture features (2,3). However, anatomical images do not provide insights into blood flow and metabolism. In contrast, functional images reflect organ blood flow and metabolism: positron emission tomography (PET) images reveal oxygen and glucose metabolism in tissues, whereas single-photon emission computed tomography (SPECT) images provide insights into blood flow within organs. Nevertheless, functional images are limited by lower resolution and cannot accurately depict organ lesions. Given the limitations of single-modality imaging, there is an increasing need for more comprehensive information. Merging images from different modalities into a single image through image fusion retains more complementary information from the two sources, reduces information redundancy, enhances information utilization, and benefits physicians’ visual perception and clinical diagnosis (4,5).

Due to the significant value of image fusion, numerous fusion algorithms have been proposed. These algorithms can be broadly categorized into traditional and recently emerged deep learning (DL) methods. In traditional methods, multiscale transforms are a classic approach widely applied in image fusion (6-8). Commonly used multiscale transform fusion algorithms include the pyramid transform (9), wavelet transform (10-12), non-subsampled contourlet transform (NSCT) (13-16), and non-subsampled shearlet transform (NSST) (17-20). Sparse representation is another commonly used traditional method. Unlike multiscale transforms, sparse representation learns to fuse multiple source images from an overcomplete dictionary, providing a more stable understanding of the source images that is less affected by poor inter-image registration (21,22). Saliency models are utilized to extract salient regions of an image, obtaining saliency features or weight maps for reconstructing the fused image (23).
With the rapid development of DL (24) in recent years, many image fusion algorithms based on DL have emerged. These works have provided new insights into medical image fusion with some success; however, some limitations remain. For example, convolutional neural networks (CNNs) require relatively few parameters during training and have a simple structure, and they can process images directly as input, bypassing the complexities of traditional feature extraction. However, high-frequency detail information is often drowned out by low-frequency contextual information. As a result, the texture details of the fused image mask the metabolic information that characterizes lesions in the functional image, reducing the prominence of lesions in the fused image and ultimately losing key information such as tumor boundary contours (25-29). In addition, current feature fusion methods achieve only shallow interaction between the features extracted by the network, and noise is introduced during the fusion process (25,26).
To solve the above problems, this paper proposes a new medical image fusion method, a multi-scale pyramid residual weight network (LYWNet). The contributions are as follows:
- A novel end-to-end three-part unsupervised medical image fusion CNN is proposed, which can extract and retain information at different depths of the network.
- To improve the performance of the network and retain more metabolic and texture information, modules such as the pyramid pooling module (PPM) are integrated into the architecture.
- The fusion network does not directly copy the brightness information of the functional image; instead, the brightness information is generated by the feature learning network.
- To achieve deep fusion of features from different modalities and reduce the noise introduced during fusion, a new feature fusion method is proposed.
Related work
In this section, we provide a concise overview of traditional medical image fusion methods, followed by an examination of fusion techniques based on DL.
Traditional medical image fusion method
Due to the critical importance and broad range of applications of multimodal medical image fusion, numerous effective traditional fusion methods have been proposed. These methods can be classified into five categories based on their underlying techniques: pyramid transform-based methods, wavelet-based methods, sparse representation-based methods, salient region-based methods, and others.
Pyramid transform-based methods (9,30) were the first multi-scale transform tools used for image fusion. These methods emulate the human visual perception mechanism by decomposing the input image into a series of multi-resolution images: high-resolution images capture detailed object features, whereas low-resolution images focus on the overall scene content. The sub-images are fused according to predefined fusion rules, yielding a fused image pyramid. Wavelet-based methods (10-12), which can analyze the direction of frequency subbands, make full use of the texture direction information of the image content. Through a multi-level wavelet transform, the source image is divided into low-frequency components capturing high-level features and multiple sets of directional high-frequency components at various scales, and fusion rules are then designed based on these distinct features. Methods based on sparse representation (18,19) align with the physiological characteristics of the human visual system, offering greater stability and interpretability for understanding source images. The fusion process of sparse representation-based image fusion is as follows: (I) each image is divided into overlapping blocks, with each block transformed into a vector; (II) the sparse representation of the source image blocks is obtained using a predefined or learned dictionary; (III) the sparse coefficients are merged according to a specific fusion strategy; (IV) the fused image is reconstructed from the fused sparse representation. Sparse representation methods can help reduce visual artifacts and improve robustness to registration errors. Edge-preserving filters have been widely used in salient region-based medical image fusion (23,31): firstly, the source images are decomposed by an edge-preserving filter; secondly, different fusion rules are applied to the decomposed layers at different scales; finally, the fused base and detail layers are combined to reconstruct the fused image.
These methods apply the same representation approach to images from different modalities to retain consistent features. However, in multi-modal images, the key information can differ significantly. For instance, anatomical images convey density structure through contrast, whereas functional images emphasize color representation. Consequently, using identical representation methods for multi-modal images is not suitable.
DL methods
In recent years, the emergence of DL has provided new ideas for multimodal medical image fusion. Liu et al. (32) introduced the CNN for multimodal medical image fusion, in which the CNN synthesizes the pixel activity information of two source images to generate a weight map and realizes the fusion process by means of an image pyramid. Xu et al. (33) introduced an end-to-end unsupervised network for medical image fusion, incorporating both surface and deep constraints to retain information: the surface constraints focus on saliency and abundance to evaluate the activity level of the source image, whereas in the deep constraints, uniqueness is measured through the neural network to preserve unique information. Balakrishnan et al. (34) employed a Siamese convolutional network to integrate the pixel motion information from multiple multimodal medical images, creating a weight map for fusion. Tang et al. (35) proposed an adaptive convolution method, incorporating a global complementary context adaptive modulation convolution kernel combined with an adaptive transformer to enhance global semantic extraction; the network adopts a multi-scale design to fully exploit useful multi-modal information at different scales. Fu et al. (25) proposed a multi-scale residual pyramid attention network (MSRPAN), which consists of a feature extractor, a fusion mechanism, and a reconstructor; the feature extractor comprises three MSRPAN blocks for extracting multi-scale features. Li et al. (26) proposed a multi-scale dual-branch residual attention (MSDRA) network, including a feature extraction module, a feature fusion module, and an image reconstruction module. The feature extraction module extracts image features through three MSDRA modules in series, and a feature L1-norm fusion strategy is proposed to fuse the features obtained from the input images. Lahoud et al. (36) proposed a real-time image fusion method that uses a pre-trained neural network to generate a single image containing multi-modal source features; the images are merged using a novel strategy based on the deep feature maps extracted by the CNN, which are used to compute fusion weights that guide the multi-modal fusion process. Zhao et al. (37) introduced a medical image fusion algorithm combining a deep convolutional generative adversarial network (GAN) with a dense block model to produce fusion images rich in information. Their network integrates an image generation module and a discriminator module built on dense blocks and an encoder-decoder, overcoming the shortcomings of manually designed activity-level measurement in traditional methods; the intermediate-layer information is processed through the dense blocks, thereby avoiding information loss. Song et al. (38) presented a multi-scale DenseNet for medical image fusion composed of an encoding layer, a fusion layer, and a decoding layer, in which filters of three sizes are used to extract features that are then fused by a fusion strategy.
Among GAN-based approaches, Liu et al. (39) proposed an unsupervised algorithm for high-quality medical image fusion composed of a lightweight image enhancement deep network and a GAN. The lightweight image enhancement network improves the quality of the fused image, making it more suitable for the human visual perception system, whereas the GAN further enhances texture details and edge information. Huang et al. (28) proposed a multi-generator, multi-discriminator conditional generative adversarial network (cGAN): the first cGAN generates fusion images that closely resemble real ones, and the second cGAN focuses on enhancing dense structural details and preserving functional information without distortion. Ma et al. (29) introduced a dual-discriminator conditional GAN, in which the generator produces fusion images resembling real ones and two discriminators are employed to calculate content loss and distinguish the structural differences between the fusion image and the two source images. Fu et al. (40) proposed a GAN with a dual-stream attention mechanism (DSAGAN) for anatomical and functional image fusion, leveraging a dual-stream architecture and multiscale convolutions to extract deep features. Xu et al. (41) presented a three-branch network architecture, the Proportional-Integral-Derivative Network (PIDNet), for semantic segmentation, establishing a link between CNNs and proportional-integral-derivative (PID) controllers. Umirzakova et al. (42) introduced a Deep Residual Feature Distillation Channel Attention Network to enhance the super-resolution of medical images; this approach aims to optimize both performance and efficiency, offering valuable insights for improving the efficiency of medical image processing methods.
In the studies mentioned above, although considerable attention was paid to fusing texture and functional information, the emphasis fell mainly on the texture details within the MRI images, with less attention given to the information in functional images. This has resulted in the dispersion of high-frequency semantic information within low-frequency contextual information, thereby reducing the discernibility of differences in functional information, which is unfavorable for diagnosing lesion locations. Hence, we propose an end-to-end unsupervised network with a three-part structure in this work. Apart from extracting deep-layered information, this network retains information at varying depths, ensuring that high-frequency semantic details are not dispersed by low-frequency contextual information. Compared to other CNN-based methods, our approach preserves more functional information while maintaining the integrity of texture details, thereby allowing better preservation and visualization of tumor contours and boundaries in the images.
Methods
In this section, we first formulate the problem through an analysis of the characteristics of multimodal images, and then introduce the architecture and loss function of the fusion network. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Overview
Two source images, I_1 and I_2, are used, each of height H and width W, with C denoting the number of channels. The objective is to establish a mapping between I_1 and I_2 that generates a fused image I_F of the same height and width. Multimodal image fusion typically involves three cases: (I) fusion of CT and MRI; (II) fusion of PET and MRI; (III) fusion of SPECT and MRI.
For PET/SPECT images, preserving functional information is crucial. This information is largely carried by chromaticity, so the PET/SPECT image is converted into the YCbCr color space, which separates intensity and chromaticity details across the Y, Cb, and Cr channels. During the fusion process, the Y (intensity) channel is combined with the MRI image, so that the resulting fused image preserves the intensity distribution of the Y channel. In CT images, the critical information is embedded in dense, high-intensity texture structures, whereas MRI images require careful preservation of intricate texture details. Consequently, the primary features of these distinct modalities are chromaticity and texture information. The dual objectives of the fusion process are therefore to preserve high-quality chromaticity, so that functional data are optimally reflected, and to maintain rich texture details, so that essential soft tissue information is retained. Through a carefully designed network structure, the Y channel is fed into the network for learning, whereas the Cb and Cr channels are preserved to retain the chromaticity information.
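As an illustration of this color-space handling, the following sketch fuses only the Y channel of a functional image with the MRI and reinstates the original chromaticity channels. It assumes an OpenCV-style conversion, and the fuse_y callable is a placeholder standing in for the trained fusion network:

```python
import numpy as np
import cv2  # OpenCV is used here only for color-space conversion

def fuse_functional_anatomical(func_rgb, mri_gray, fuse_y):
    """Hedged sketch: fuse the Y channel of a PET/SPECT image with a co-registered MRI.

    func_rgb: HxWx3 uint8 functional image (PET or SPECT)
    mri_gray: HxW   uint8 anatomical image (MRI)
    fuse_y  : callable(y, mri) -> fused Y channel; stands in for the trained network
    """
    ycrcb = cv2.cvtColor(func_rgb, cv2.COLOR_RGB2YCrCb)   # note: OpenCV orders channels Y, Cr, Cb
    y, cr, cb = cv2.split(ycrcb)
    fused_y = np.clip(fuse_y(y, mri_gray), 0, 255).astype(np.uint8)  # learned intensity channel
    fused = cv2.merge([fused_y, cr, cb])                  # chromaticity channels are kept as-is
    return cv2.cvtColor(fused, cv2.COLOR_YCrCb2RGB)
```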
The network architecture for medical image fusion is presented in Figure 2. The framework consists of three modules: a feature preprocessor, a feature extractor, and a feature fuser. Initially, image features are extracted through the feature input module. To capture a broader range of semantic information, convolution kernels of various sizes are employed to extract features, which are subsequently fused. The three feature extraction modules are then applied sequentially to the resulting features. To ensure that the output image contains both low- and high-dimensional information, a skip connection is utilized between the low-level and deep features. Finally, the fusion strategy in the image fusion module is used to fuse the 64-dimensional features of the input images, and the output is reconstructed by two convolutional layers and a final convolutional layer to obtain the fused image. By extracting deeper high-dimensional information while retaining low-level features, the fusion algorithm incorporates more comprehensive information, preserving both functional data and high-frequency details. The subsequent subsections provide a detailed explanation of the algorithm.

Feature preprocessor
The feature preprocessor consists of input images and the feature input module, as shown in Figure 3. The feature input module is composed of four convolutional layers with different kernel sizes. The input image passes through the first three convolutional layers, and the outputs of each layer are concatenated along the channel dimension. The final output is then processed through a subsequent convolutional layer. Each convolutional layer is followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function, which serve to normalize the input data and enhance the network’s nonlinearity.
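A minimal PyTorch sketch of such a feature input module is given below; the kernel sizes (3, 5, 7, then 1×1), the channel widths, and the parallel arrangement of the first three convolutions are assumptions rather than the exact configuration used in LYWNet:

```python
import torch
import torch.nn as nn

class FeatureInputModule(nn.Module):
    """Three convolutions with different kernel sizes, channel-wise concatenation,
    and a final convolution; every convolution is followed by BN and ReLU."""
    def __init__(self, in_ch=1, mid_ch=16, out_ch=64):
        super().__init__()
        def conv_bn_relu(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=k // 2),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.conv3 = conv_bn_relu(in_ch, mid_ch, 3)
        self.conv5 = conv_bn_relu(in_ch, mid_ch, 5)
        self.conv7 = conv_bn_relu(in_ch, mid_ch, 7)
        self.out = conv_bn_relu(3 * mid_ch, out_ch, 1)

    def forward(self, x):
        feats = torch.cat([self.conv3(x), self.conv5(x), self.conv7(x)], dim=1)
        return self.out(feats)

# Example: FeatureInputModule()(torch.randn(1, 1, 256, 256)).shape -> torch.Size([1, 64, 256, 256])
```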

Feature extractor
The feature extractor consists of three identical modules, as illustrated in Figure 4, which work collaboratively to extract features more comprehensively. To enhance feature diversity and preserve high-frequency information, as shown in Figure 5, this module incorporates three branches. The first branch utilizes pyramid transformation to extract features at various dimensions. Simultaneously, the second and third branches implement residual networks, which include residual connections to facilitate the flow of gradients and mitigate the vanishing gradient problem in deep networks. These residual networks, with distinct structures, ensure a more compact distribution of the extracted features by enabling the learning of complex patterns, even in deeper layers. Additionally, a micro-PPM is integrated into the second branch to enhance the network’s ability to capture global information. The outputs from all three branches are then fed into the boundary attention module, where boundary features are leveraged to guide the fusion of contextual information.

Micro-pyramid attention module
The pyramid attention module, introduced in (43), enhances the feature pyramid attention (25), thereby improving the performance of the feature pyramid. The multi-scale pyramid technique processes the image at different resolutions, enabling the model to effectively detect objects of varying sizes. In the feature pyramid attention, a mask module is incorporated into the traditional feature pyramid model to augment the multi-scale approach. This integration allows the model to prioritize relevant features, effectively combining the multi-scale and attention mechanisms to enhance performance. The pyramid attention mechanism is expressed in Eq. [1]:
In Eq. [1], the output feature is obtained from the original feature through a convolution operation combined with the three-level pyramid structure, and the result is added back to the original feature through a residual connection.
The input feature maps are pooled to obtain two feature maps at successively smaller scales. The original feature map and the two pooled feature maps are convolved with filters of three different sizes and upsampled, and the resulting pyramid features are added back to the original feature map through a residual connection. Although a large convolutional filter provides a large receptive field (the area of the input image a network layer can “see” in one pass, which is essential for capturing broad context and understanding large structures in the image), it brings a large amount of computation and many parameters (44). To address this, we adopt densely connected layers, which alleviate the vanishing gradient problem and enhance feature propagation while significantly reducing the parameters. Therefore, one large filter is replaced with two smaller convolution kernels, and the other is replaced with three smaller convolution kernels. This modification reduces computational complexity while maintaining the network’s ability to capture the necessary spatial information.
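The pyramid branch described above can be sketched as follows. The pooling factors, channel count, and additive (rather than gated) aggregation are assumptions, the stacked 3×3 kernels stand in for the larger 5×5/7×7 filters used in typical pyramid attention designs, and the mask/attention gating of the full module is omitted for brevity:

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_stack(ch, n):
    """n stacked 3x3 convolutions; two emulate a 5x5 receptive field, three emulate 7x7."""
    layers = []
    for _ in range(n):
        layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class MicroPyramidBranch(nn.Module):
    """Hedged sketch: pool the input to two smaller scales, convolve each scale with
    stacked 3x3 kernels, upsample, and add everything back to the input (residual)."""
    def __init__(self, ch=64):
        super().__init__()
        self.full = conv_stack(ch, 1)     # full resolution
        self.half = conv_stack(ch, 2)     # 1/2 resolution, two 3x3 (~5x5)
        self.quarter = conv_stack(ch, 3)  # 1/4 resolution, three 3x3 (~7x7)

    def forward(self, x):
        h, w = x.shape[-2:]
        up = lambda t: F.interpolate(t, size=(h, w), mode='bilinear', align_corners=False)
        y1 = self.full(x)
        y2 = up(self.half(F.avg_pool2d(x, 2)))
        y3 = up(self.quarter(F.avg_pool2d(x, 4)))
        return x + y1 + y2 + y3           # residual aggregation across scales
```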
Multi-basic block and multi-bottleneck
The bottleneck structure significantly reduces both computational cost and parameter count while preserving the expressive power of the network. The concept of the bottleneck effectively balances computational efficiency with feature representation, making the network more parameter-efficient and reducing overall computational complexity. As illustrated in Figure 5, multilayer bottleneck and basic block structures are employed to extract deeper levels of detailed information.
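The two residual building blocks can be sketched as standard ResNet-style blocks; the channel widths and the reduction ratio of the bottleneck are assumptions:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a residual (identity) connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand with a residual connection; the channel reduction
    keeps the 3x3 convolution cheap, which is the parameter-saving effect described above."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        mid = ch // reduction
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))
```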
Micro-PPM
The PPM (45,46) is introduced to aggregate contextual information from different regions, thereby enhancing the network’s ability to capture global information. This module applies pooling operations at multiple scales to the original feature map, generating several feature maps of varying sizes, which are then concatenated along the channel dimension. The resulting composite feature map effectively fuses information from multiple scales, balancing global semantic and local detail information. However, including too many pooling scales in each channel leads to excessive computational cost. To address this issue, the number of pyramid levels is reduced in the proposed model, thereby decreasing the computational burden and accelerating processing, as illustrated in Figure 6.
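A reduced pyramid pooling module in the PSPNet style might look as follows; the bin sizes (1, 2, 4) and the channel split are assumptions made to illustrate the reduced-scale design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroPPM(nn.Module):
    """Pool the feature map into a few global bins, project, upsample, concatenate with
    the original features, and fuse back to the input width."""
    def __init__(self, ch=64, bins=(1, 2, 4)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(ch, ch // len(bins), 1))
            for b in bins
        ])
        self.project = nn.Conv2d(ch + (ch // len(bins)) * len(bins), ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
                  for stage in self.stages]
        return self.project(torch.cat([x] + pooled, dim=1))  # global + local context fused
```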

Boundary attention-guided (BAG) fusion module
The BAG fusion module, as shown in Figure 7, is proposed to guide the fusion of detailed and contextual representations. Although context information is semantically accurate, significant spatial and geometric details can be lost in the process. The BAG module addresses this issue by directing the model to prioritize the boundary regions, allowing these areas to focus on detailed features, while other regions are filled with contextual information. This approach ensures that both fine-grained details and broader context are effectively integrated in the fused output. The feature vectors corresponding to the feature maps of the three branches serve as the detail, context, and boundary inputs of the BAG operation: when the boundary attention response is strong, the model depends more on the detailed features; otherwise, it relies on the contextual information.
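For concreteness, a plausible form of the BAG operation, modeled on the boundary-attention-guided fusion of PIDNet (41), is sketched below; the symbols P (detail features), I (context features), D (boundary features), and the sigmoid gate σ are assumptions rather than the exact formulation used here:

```latex
% Hedged BAG-style gating following PIDNet (41); \odot denotes element-wise multiplication.
F_{\mathrm{out}} = \sigma(D) \odot P + \bigl(1 - \sigma(D)\bigr) \odot I
```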
Feature fuser
In the experiments, a feature distillation strategy (FDS) is proposed to fuse the extracted features. The primary objective of FDS is to integrate features from multiple sources or models to enhance the overall feature representation. By distilling and merging the most relevant information from each feature source, FDS enables the model to learn richer and more comprehensive features, thereby improving performance on downstream tasks. This fusion process not only helps to retain critical information from each feature source but also provides a more robust and efficient feature representation, resulting in improved model accuracy and generalization. The formula is expressed as:
where the fused feature is obtained from the features extracted from source images 1 and 2 by the feature extraction module, and T is a parameter of the strategy. The quantitative results for SPECT-MRI, PET-MRI, and CT-MRI with T ranging from 1 to 30 are shown in Figures 8-10. In all three figures, the structural similarity (SSIM) (47) and peak signal-to-noise ratio (PSNR) values increase with T, whereas the remaining four metrics first increase and then level off.



Loss function
MRI and PET have different characteristics due to different imaging mechanisms. In order to retain more features and improve the quality of the fused image, the loss function should fully consider the retention of features and the characteristics of the image. In this paper, the loss function of our LYWNet is defined as:
where L_sim, L_cos, and L_edge denote the similarity loss, the cosine similarity loss, and the edge loss, respectively, and α and β are two trade-off parameters that balance the three terms.
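A plausible form of the overall objective, assuming the similarity term carries unit weight (the exact weighting arrangement of the three terms is an assumption here), is:

```latex
% Hedged sketch of the total loss.
L_{\mathrm{total}} = L_{\mathrm{sim}} + \alpha\, L_{\mathrm{cos}} + \beta\, L_{\mathrm{edge}}
```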
Similarity loss
The similarity loss constrains the images the network generates so that the fused image retains part of the information from the input images. The similarity loss function is defined as:
where ||·||_2 denotes the matrix 2-norm, I denotes an input image (I_1 or I_2), F_n denotes a fused image with n = 1, …, N, and N is the number of fused images.
Intensity loss
The inputs are two images of different modalities whose intensity distributions differ significantly at the vector level. The cosine similarity loss judges the similarity between two vectors, making the fused image generated by the network more similar to the source images at the vector level. Thus, we use the cosine similarity loss as the intensity loss; it is defined as:
where y represents the label, which belongs to {1, −1}, and the margin parameter represents the cosine similarity threshold.
Edge loss function
The edge loss function preserves the fused image’s edge details to improve the visual quality and perceptual fidelity of the fused structure. By improving visual consistency between the source images and smoothing transitions at image edges, discontinuities and perceived unnaturalness are reduced; the loss is defined as:
where Sobel(·) represents the edge features extracted from an image using the Sobel filter, and MSE denotes the mean squared error loss, which measures the difference between the two sets of edge features.
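To make the three terms concrete, the following minimal PyTorch sketch combines a 2-norm similarity term, a cosine-embedding term, and a Sobel-based edge term. Single-channel inputs of shape B×1×H×W, the averaging over the two source images, and the default weights are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed Sobel kernels for horizontal and vertical gradients.
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_edges(img):
    """Gradient magnitude of a single-channel image batch (B x 1 x H x W)."""
    gx = F.conv2d(img, SOBEL_X.to(img.device), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img.device), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def lywnet_loss(fused, src1, src2, alpha=1.0, beta=1.0):
    """Hedged sketch of the combined loss: similarity + cosine similarity + edge terms."""
    # Similarity loss: 2-norm between the fused image and each source image.
    l_sim = 0.5 * (torch.norm(fused - src1, p=2) + torch.norm(fused - src2, p=2))
    # Cosine similarity loss: pull the fused image toward both sources at the vector level.
    cos = nn.CosineEmbeddingLoss(margin=0.0)
    target = torch.ones(fused.size(0), device=fused.device)   # label y = 1 (similar pairs)
    flat = lambda t: t.flatten(1)
    l_cos = cos(flat(fused), flat(src1), target) + cos(flat(fused), flat(src2), target)
    # Edge loss: MSE between Sobel edge maps of the fused image and each source.
    l_edge = F.mse_loss(sobel_edges(fused), sobel_edges(src1)) + \
             F.mse_loss(sobel_edges(fused), sobel_edges(src2))
    return l_sim + alpha * l_cos + beta * l_edge
```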
Results
This section presents a comparison of the proposed LYWNet with several state-of-the-art methods for SPECT-MRI, PET-MRI, and CT-MRI image fusion. The experiments are conducted qualitatively and quantitatively using a publicly available dataset. Additionally, an ablation study, efficiency analysis, and parameter comparison are included.
Datasets and training details
In the experiments, a total of 329 pairs of SPECT-MRI, 318 pairs of PET-MRI, and 184 pairs of CT-MRI images were prepared for training, whereas 30 pairs each of SPECT-MRI, PET-MRI, and CT-MRI images were used for testing. All images were sourced from the Whole Brain Atlas (http://www.med.harvard.edu/AANLIB/home.html), created by Harvard Medical School. The MRI, SPECT, PET, and CT images were coregistered, with consistent pixel sizes across all modalities. The network is implemented in PyTorch (https://pytorch.org/) and trained on an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA). The learning rate is set to 0.0001, and the optimization algorithm is Adam (48). The network is trained for 1,000 epochs with a batch size of 16.
Comparison with other fusion methods
The performance of LYWNet is compared with 12 state-of-the-art methods, including CNN (32), enhanced medical image fusion network (EMFusion) (33), local extreme map guided multi-modal image fusion (LEGFF) (49), Laplacian redecomposition (LRD) (50), multiscale adaptive transformer (MATR) (35), multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network (MLCF-MLMG-PCNN) (51), multi-scale DenseNet (MSDNet) (38), multiscale double-branch residual attention (MSDRA) (26), NSCT (14), non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network (NSST-MSMG-PCNN) (19), zero-learning fast (34), NSST-parameter-adaptive pulse-coupled neural network (NSST-PAPCNN) (18), and DSAGAN (40). Among these methods, LEGFF, LRD, NSCT, MLCF-MLMG-PCNN, and NSST-PAPCNN are traditional techniques, whereas CNN, EMFusion, MATR, MSDNet, MSDRA, NSST-MSMG-PCNN, zero-learning fast, and DSAGAN are DL-based approaches.
For the quantitative comparison, six metrics are employed to evaluate the fusion performance: SSIM, PSNR, mutual information (MI) (52), Fowlkes-Mallows index (FMI) (53), adjusted Rand index (ARI), and correlation coefficient (CC) (54). SSIM assesses structural similarity, ensuring that critical details, such as soft tissue and bone structures, are preserved in the fused image, which is essential for precise lesion localization. PSNR evaluates image clarity by measuring noise levels, which is crucial for accurately visualizing subtle features. MI quantifies the shared information between different modalities (e.g., CT and MRI), ensuring that the fused image integrates unique characteristics from each modality for a comprehensive representation. FMI provides an indication of lesion boundaries, aiding accurate lesion localization, minimizing inadvertent damage to normal tissue, and enhancing treatment planning accuracy. ARI assesses the agreement between the clustering results of the image and the true labels. CC measures the correlation between the fused and original images, ensuring the preservation of intensity consistency across modalities. Among these metrics, SSIM extracts three key features from the image, namely brightness, contrast, and structure, and then computes the similarity between them. The range of SSIM is [0, 1]; the larger the SSIM, the smaller the structural loss and distortion. SSIM is defined as follows:
where μ_A and μ_B are the means of images A and B, respectively; σ_A^2 and σ_B^2 are the variances of images A and B, respectively; σ_AB is the covariance of images A and B; and C_1, C_2, and C_3 are small constants introduced to prevent the denominator from approaching zero.
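For reference, the standard three-component SSIM formulation (47), consistent with the quantities defined above (the exact constant values used in the experiments are not reproduced here), is:

```latex
\mathrm{SSIM}(A,B) =
  \frac{2\mu_A\mu_B + C_1}{\mu_A^2 + \mu_B^2 + C_1} \cdot
  \frac{2\sigma_A\sigma_B + C_2}{\sigma_A^2 + \sigma_B^2 + C_2} \cdot
  \frac{\sigma_{AB} + C_3}{\sigma_A\sigma_B + C_3}
```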
PSNR is the ratio of peak power to noise power in the fused image. A higher PSNR value indicates a smaller difference in quality between the two images.
where MAX_F is the maximum pixel value in image F.
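The standard PSNR definition consistent with this description is given below, with MSE denoting the mean squared error between the fused image and the reference image:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_F^{2}}{\mathrm{MSE}}\right)
```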
MI is used to measure the degree of information overlap between the fused image and the original image and evaluate the fusion quality. A larger MI indicates a higher similarity between the fused image and the original image.
where p_{A,F} and p_{B,F} are the joint probability distributions of the fused image F with source images A and B, respectively, and p_A, p_B, and p_F are the marginal probability distributions of images A, B, and F, respectively.
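For reference, the standard mutual information between a source image A and the fused image F is shown below (52); combining the two source terms by summation is an assumption:

```latex
\mathrm{MI}_{A,F} = \sum_{a,f} p_{A,F}(a,f)\,\log_2\frac{p_{A,F}(a,f)}{p_A(a)\,p_F(f)},
\qquad
\mathrm{MI} = \mathrm{MI}_{A,F} + \mathrm{MI}_{B,F}
```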
ARI is a statistic that measures the similarity between two data clusters. The ARI is calculated by comparing real and fused images. A larger ARI indicates a higher similarity between the fused image and the original image.
where true positive (TP) stands for the number of pixels in the intersection of the object boundary generated by the algorithm and the standard boundary, that is, the number of object pixels correctly detected. False positive (FP) represents the number of pixels in the object boundary generated by the algorithm that do not intersect the standard boundary, that is, the number of object pixels detected by the algorithm incorrectly. False negative (FN) represents the number of pixels in the standard boundary that do not intersect the object boundary generated by the algorithm, that is, the number of object pixels that the algorithm fails to detect.
CC measures the linear correlation between the fused image and the source images. A higher CC indicates greater similarity between the fused image and the source images.
where COV(·) denotes the covariance of the two input images, and σ_A and σ_F denote the standard deviations of images A and F, respectively.
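The standard correlation coefficient between a source image A and the fused image F is given below; averaging over the two sources to obtain the reported CC is an assumption:

```latex
\mathrm{CC}(A,F) = \frac{\mathrm{COV}(A,F)}{\sigma_A\,\sigma_F}
```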
SPECT and MRI image fusion
The qualitative fusion results for typical SPECT and MRI image pairs are presented in Figure 11. The proposed method demonstrates three key advantages. First, in functional and anatomical images, the lesion location is more prominently reflected in functional images. Our approach mitigates the large amount of irregular mosaic information in SPECT images by incorporating the rich edge information from MRI, while preserving a substantial amount of functional information. Second, compared to other methods, the lesion location’s brightness and the color contrast with surrounding areas are enhanced, which aids clinicians in locating the lesion and identifying the nearby perfusion regions. For instance, in Figure 11G, MATR retains a significant amount of texture information, but the influence of high-frequency components disperses contextual details, resulting in suboptimal reflection of functional information. Finally, to assist physicians in detecting the lesion, the method preserves the texture information reflected in MRI, while safeguarding it from being obscured by the functional information in SPECT.

The quantitative results are presented in Table 1. The proposed method achieves optimal performance in terms of SSIM, PSNR, ARI, FMI, and CC. These results indicate that the method effectively preserves high-quality features from the source images with minimal distortion. The relatively lower MI scores can be attributed to the following factors: in order to enhance the functional information in SPECT images, the distribution of texture information in MRI was reduced, which resulted in a lower MI score. Given that brightness information is less pronounced in SPECT compared to the texture information in MRI, improving the retention of MRI texture information would lead to a higher MI score, whereas prioritizing SPECT brightness retention results in a lower MI score. Consequently, competing methods, which preserve more MRI texture information, tend to produce higher MI scores. MI is commonly used to measure the amount of shared information between fused images, and maximizing MI can sometimes lead to a reduction in texture fidelity. However, in many clinical scenarios, functional data plays a crucial role in detecting abnormalities that may not be apparent in structural imaging alone. For instance, in oncology, PET scans reveal metabolic activity, which can highlight malignant tumors through elevated glucose uptake, an aspect that may be less discernible in texture-based imaging techniques such as MRI. Similarly, in neurology, functional imaging provides valuable insights into metabolic changes within specific brain regions, aiding early diagnosis or tracking disease progression. In such cases, functional activity may be more informative than texture details. The proposed method outperforms others in terms of correlation metrics, such as feature similarity and SSIM. The superior SSIM, PSNR, ARI, FMI, and CC results demonstrate that the method successfully retains functional information while preserving important MRI texture features. Moreover, the method exhibits a low standard deviation in SSIM, ARI, and FMI, contributing to the stability of the generated fusion images in terms of both structure and feature preservation.
Table 1
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.5052±0.0617 | 15.7235±1.5331 | 0.7002±0.0591 | 0.2879±0.0716 | 0.4440±0.0659 | 0.6378±0.0441 |
EMFusion (33) | 0.5086±0.0626 | 16.0851±1.8244 | 0.7085±0.0674 | 0.6861±0.0603 | 0.7718±0.0454 | 0.6150±0.0417 |
LEGFF (49) | 0.5059±0.0630 | 15.6472±1.5488 | 0.6947±0.0673 | 0.6720±0.0604 | 0.7618±0.0479 | 0.6384±0.0416 |
LRD (50) | 0.5101±0.0611 | 15.5640±1.4369 | 0.7363±0.0679* | 0.6490±0.0661 | 0.7426±0.0508 | 0.6567±0.0415 |
MATR (35) | 0.4656±0.0544 | 16.1183±1.3801 | 0.6735±0.1132 | 0.4889±0.0857 | 0.6147±0.0706 | 0.6197±0.0456 |
MLCF-MLMG-PCNN (51) | 0.5381±0.0555 | 17.3068±1.1369 | 0.6700±0.0655 | 0.6706±0.0639 | 0.7596±0.0478 | 0.6474±0.0424 |
MSDNet (38) | 0.5329±0.0588 | 16.6933±1.4636 | 0.7132±0.0787 | 0.6662±0.0626 | 0.7561±0.0478 | 0.6313±0.0380 |
MSDRA (26) | 0.5096±0.0544 | 14.9558±1.0011 | 0.6973±0.0601 | 0.6810±0.0713 | 0.7688±0.0574 | 0.6486±0.0392 |
NSCT (14) | 0.5066±0.0629 | 15.7038±1.4344 | 0.7109±0.0717 | 0.6337±0.0602 | 0.7307±0.0454 | 0.6296±0.0412 |
NSST-MSMG-PCNN (19) | 0.5328±0.0568 | 17.2072±1.1414 | 0.6513±0.0629 | 0.5576±0.0695 | 0.6710±0.0551 | 0.6432±0.0434 |
NSST-PAPCNN (18) | 0.5068±0.0635 | 15.6586±1.4248 | 0.7181±0.0685 | 0.6349±0.0633 | 0.7316±0.0469 | 0.6377±0.0414 |
Zero-learning fast (34) | 0.5038±0.0588 | 15.9215±1.3025 | 0.7012±0.0774 | 0.6881±0.0600 | 0.7740±0.0451 | 0.6379±0.0406 |
DSAGAN (40) | 0.4896±0.0493 | 15.9315±0.9365 | 0.6899±0.0516 | 0.4828±0.0731 | 0.6099±0.0607 | 0.6514±0.0400 |
LYWNet | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643 | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439* |
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.
Table 2 presents the inference time of different neural network methods. Although LYWNet achieves high accuracy, its inference time suggests that optimization may be needed for real-time deployment in resource-limited clinical settings. Figure 12 shows the quantitative comparison results of our proposed method with 12 other methods on 30 SPECT-MRI test image pairs. From Figure 12 and Table 1, it can be seen that the proposed method produces the lowest standard deviation on SSIM, ARI, and FMI metrics. Specifically, the SSIM value of the proposed method is greater than that of other competitive methods in 30 sets of test data.
Table 2
Methods | Inference time (s) |
---|---|
CNN (32) | 9.9461 |
EMFusion (33) | 48.4052 |
MATR (35) | 210.8680 |
MSDNet (38) | 19.0274 |
MSDRA (26) | 19.0274
NSST-MSMG-PCNN (19) | 400.7225 |
Zero-learning fast (34) | 114.4541 |
DSAGAN (40) | 12.6487 |
LYWNet | 20.02578 |
The inference times of the different fusion methods on the 30 test image pairs are shown. MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; MATR, multiscale adaptive transformer; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

PET and MRI image fusion
The qualitative comparison results are presented in Figure 13. Given that this fusion task is similar to the fusion of SPECT and MRI images, the analysis of the qualitative results follows the same approach as in SPECT and MRI image fusion. On the one hand, the proposed method enhances the visibility of functional information; on the other hand, it effectively preserves critical texture detail from the MRI images.

The quantitative results are presented in Table 3. The proposed method achieves the highest values for SSIM, PSNR, ARI, and FMI, indicating superior fusion performance. Figure 14 illustrates the quantitative comparison between the proposed method and 12 other techniques on 30 PET-MRI test image pairs. As shown in both Figure 14 and Table 3, the proposed method demonstrates the lowest standard deviation in PSNR and MI. Specifically, the SSIM value of the proposed method surpasses that of other competing methods across all 30 test data sets.
Table 3
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.4813±0.0825 | 13.4574±2.1690 | 0.6937±0.1576 | 0.2980±0.1717 | 0.4359±0.1530 | 0.6125±0.0267 |
EMFusion (33) | 0.4839±0.0857 | 13.4739±2.1741 | 0.7133±0.1803 | 0.4446±0.2338 | 0.5666±0.1926 | 0.5960±0.0188 |
LEGFF (49) | 0.4701±0.0805 | 13.0700±2.1626 | 0.7172±0.1826 | 0.4135±0.2732 | 0.5322±0.2340 | 0.6016±0.0183 |
LRD (50) | 0.4770±0.0873 | 12.9515±2.4644 | 0.7293±0.1791* | 0.3637±0.2411 | 0.4894±0.2132 | 0.6205±0.0235 |
MATR (35) | 0.4358±0.0747 | 13.4686±1.9793 | 0.6770±0.1706 | 0.3987±0.2172 | 0.5285±0.1804 | 0.5795±0.0172 |
MLCF-MLMG-PCNN (51) | 0.4972±0.0743 | 14.1186±2.14779 | 0.7085±0.1725 | 0.4353±0.2424 | 0.5571±0.2006 | 0.6313±0.0326 |
MSDNet (38) | 0.4966±0.0875 | 13.5518±2.41166 | 0.7221±0.1842 | 0.4197±0.2390 | 0.5439±0.1995 | 0.6059±0.0249 |
MSDRA (26) | 0.4808±0.0676 | 13.0671±1.9244 | 0.7171±0.1554 | 0.4241±0.2829 | 0.5369±0.2504 | 0.6030±0.0184 |
NSCT (14) | 0.4755±0.0839 | 13.3549±2.1704 | 0.7075±0.1696 | 0.3819±0.2392 | 0.5066±0.2075 | 0.5961±0.0242 |
NSST-MSMG-PCNN (19) | 0.4962±0.0748 | 14.1408±2.1388 | 0.6784±0.1466 | 0.2813±0.1101 | 0.4262±0.1003 | 0.6323±0.0345* |
NSST-PAPCNN (18) | 0.4767±0.0850 | 13.2122±2.2915 | 0.7258±0.1776 | 0.4022±0.2564 | 0.5235±0.2204 | 0.6050±0.0218 |
Zero-learning fast (34) | 0.4771±0.0873 | 13.3424±2.2029 | 0.7144±0.1998 | 0.4730±0.2377 | 0.5930±0.1909 | 0.5926±0.0264 |
DSAGAN (40) | 0.4477±0.0588 | 14.1914±1.1313 | 0.6931±0.1431 | 0.4093±0.2908 | 0.5215±0.2565 | 0.6217±0.0196 |
LYWNet | 0.5195±0.0730* | 14.5324±1.7365* | 0.6914±0.1431 | 0.5237±0.1725* | 0.6484±0.0134* | 0.6291±0.0193
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

CT and MRI image fusion
The qualitative fusion results for typical CT and MRI image pairs are presented in Figure 15. In the fusion results of EMFusion and MLCF-MLMG-PCNN, the texture details in the MRI images are notably reduced when compared to other methods. Similarly, in the results of NSCT, NSST-PAPCNN, and Zero, the dense structures present in the CT images are diminished. In contrast, the proposed method effectively preserves both the texture details and dense structures, maintaining high intensity in the fused images.

The quantitative results across six metrics are presented in Table 4. Figure 16 provides a comparison between the proposed method and 12 other techniques, evaluated on 30 CT-MRI test image pairs. The proposed method achieves the highest values for SSIM and PSNR. As observed in the SPECT and MRI image fusion, a trade-off was made between the preservation of detail and the enhancement of functional information, leading to a lower performance in FMI, ARI, and CC for the fusion of MRI and CT images, particularly in terms of gray-scale details. However, compared with similar algorithms, the proposed method demonstrates significant improvements in the quality of the fused image’s texture information, while also substantially reducing the distortion in the fused image.
Table 4
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
CNN (32) | 0.5068±0.0471 | 11.8545±1.1291 | 0.5108±0.0689 | 0.4788±0.1541 | 0.6442±0.1221 | 0.5139±0.0782
EMFusion (33) | 0.5366±0.0413 | 12.6214±1.1422 | 0.5082±0.0748 | 0.6825±0.0448 | 0.8001±0.0275* | 0.5183±0.0774 |
LEGFF (49) | 0.5058±0.0507 | 11.7268±1.0360 | 0.5159±0.0761 | 0.6644±0.0480 | 0.7834±0.0321 | 0.5245±0.0783 |
LRD (50) | 0.5085±0.0467 | 11.4911±0.9681 | 0.5327±0.0704 | 0.6251±0.0632 | 0.7531±0.0460 | 0.5298±0.0765 |
MATR (35) | 0.4845±0.0382 | 12.5892±1.1390 | 0.5147±0.0977 | 0.4067±0.0627 | 0.5849±0.0597 | 0.5168±0.0794 |
MLCF-MLMG-PCNN (51) | 0.5299±0.0435 | 11.9269±1.2340 | 0.5425±0.0778* | 0.6351±0.0540 | 0.7610±0.0365 | 0.5210±0.0777 |
MSDNet (38) | 0.5236±0.0500 | 12.6454±0.9106 | 0.5421±0.0828 | 0.6122±0.0652 | 0.7433±0.0428 | 0.5294±0.0753 |
MSDRA (26) | 0.5091±0.0461 | 11.4519±0.6630 | 0.5025±0.0721 | 0.6884±0.0436* | 0.6884±0.0280 | 0.5183±0.0731 |
NSCT (14) | 0.5020±0.0470 | 11.7038±1.0661 | 0.5124±0.0662 | 0.5344±0.0801 | 0.6854±0.0586 | 0.5224±0.0784 |
NSST-MSMG-PCNN (19) | 0.5051±0.0471 | 12.0371±1.1796 | 0.5128±0.0706 | 0.4083±0.0536 | 0.5871±0.0372 | 0.5198±0.0785 |
NSST-PAPCNN (18) | 0.5002±0.0476 | 12.0233±0.8576 | 0.5307±0.0619 | 0.4433±0.0384 | 0.6152±0.0228 | 0.5320±0.0743* |
Zero-learning fast (34) | 0.4892±0.0445 | 12.7460±0.6583 | 0.4829±0.1184 | 0.6695±0.0446 | 0.7884±0.0291 | 0.4921±0.0700 |
DSAGAN (40) | 0.4687±0.0424 | 13.6992±0.698 | 0.4974±0.0492 | 0.4365±0.0726 | 0.6096±0.0517 | 0.5120±0.0697 |
LYWNet | 0.5376±0.0442* | 13.9202±0.7265* | 0.5343±0.0801 | 0.6342±0.0608 | 0.7589±0.0361 | 0.5270±0.0714 |
On the 30 test image pairs, the quantitative results obtained by different fusion methods on six metrics are shown (mean ± standard deviation; *, optimal). SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; LYWNet, multi-scale pyramid residual weight network; CNN, convolutional neural network; EMFusion, enhanced medical image fusion network; LEGFF, local extreme map guided multi-modal image fusion; LRD, Laplacian redecomposition; MATR, multiscale adaptive transformer; MLCF-MLMG-PCNN, multi-level edge-preserving filtering-multi-level morphological gradient-pulse-coupled neural network; MSDNet, multi-scale DenseNet; MSDRA, multiscale double-branch residual attention; NSCT, non-subsampled contourlet transform; NSST-MSMG-PCNN, non-subsampled shearlet transform-multi-scale morphological gradient-pulse-coupled neural network; NSST-PAPCNN, non-subsampled shearlet transform-parameter-adaptive pulse-coupled neural network; DSAGAN, generative adversarial network based on dual-stream attention mechanism.

Ablation study
Selection of the number of LYW blocks
The performance of the feature extractor with one, two, and three blocks was tested on 30 pairs of SPECT-MRI, PET-MRI, and CT-MRI images. The fusion results for these tests are presented in Figure 17. Observation and comparison of the experimental results revealed that the image generated using one block exhibits higher brightness, whereas the image produced by two blocks shows lower brightness. These differences highlight the impact of the number of blocks in the feature extractor on the fusion process, specifically in terms of image brightness.

Tables 5-7 summarize the quantitative evaluation on SPECT-MRI, PET-MRI, and CT-MRI, respectively, obtained by applying one, two, and three LYW blocks. In Tables 5-7, the best results are marked with an asterisk. In Tables 5 and 6, five metrics (SSIM, PSNR, MI, ARI, and FMI) perform best for the SPECT-MRI and PET-MRI fusion images with three blocks. In Table 7, three metrics (MI, ARI, and FMI) perform best for the CT-MRI fusion images with three blocks. This indicates that the fusion image produced using a network of three blocks retains more gradient information and reduces distortion more effectively. Therefore, the use of three LYW blocks is optimal for SPECT-MRI, PET-MRI, and CT-MRI image fusion.
Table 5
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.5059±0.0507 | 11.7825±0.5691 | 0.6615±0.0662 | 0.5989±0.0573 | 0.7019±0.0474 | 0.6848±0.0416* |
2 blocks | 0.5482±0.0510 | 17.2316±1.0077 | 0.6781±0.0697 | 0.5806±0.0544 | 0.6904±0.0430 | 0.6656±0.0435 |
3 blocks | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643* | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; LYW block, multi-scale pyramid residual weight block.
Table 6
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.4740±0.0623 | 12.1067±1.4643 | 0.6680±0.1759 | 0.3711±0.2665 | 0.4883±0.2402 | 0.6479±0.0278* |
2 blocks | 0.5040±0.0686 | 14.2148±2.3485 | 0.6839±0.1511 | 0.4443±0.1922 | 0.5736±0.1530 | 0.6348±0.0229 |
3 blocks | 0.5195±0.0730* | 14.5234±1.7365* | 0.6914±0.1431* | 0.5237±0.1725* | 0.6484±0.1343* | 0.6291±0.0193 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; LYW block, multi-scale pyramid residual weight block.
Table 7
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
1 block | 0.4617±0.0431 | 10.8655±0.6578 | 0.5116±0.0601 | 0.5235±0.0668 | 0.6778±0.0474 | 0.5227±0.0723 |
2 blocks | 0.5435±0.0402* | 14.3888±0.8219* | 0.5106±0.0759 | 0.5548±0.0612 | 0.7010±0.0427 | 0.5295±0.0758* |
3 blocks | 0.5376±0.0442 | 13.9202±0.7265 | 0.5343±0.0801* | 0.6342±0.0608* | 0.7589±0.0361* | 0.5270±0.0801 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; LYW block, multi-scale pyramid residual weight block.
Selection of the fusion strategy
The feature fusion module is utilized to combine the extracted features. To evaluate the effectiveness of the proposed FDS method, a comparison is conducted with five commonly used methods at T=30. This comparison aims to demonstrate the superior performance of the FDS method in the fusion task.
Addition strategy
The addition strategy is one of the most commonly used feature fusion methods. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively.
Average strategy
The average strategy is another commonly used feature fusion method. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively.
Feature energy ratio strategy (FER) (25)
FER is another commonly used feature fusion method. Its formula is denoted as follows:
where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; the fusion weights are determined by the relative energy of the two feature maps.
Feature L1-norm strategy (FL1N) (26)
FL1N is another commonly used feature fusion method. Its formula is denoted as Eq. [17], where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; the fusion weights are computed from the L1-norm of the feature maps.
Average L1-norm weight strategy (AL1NW)
AL1NW is another commonly used feature fusion method. Its formula also follows the form of Eq. [17], where Φ_F is the fused feature, and Φ_1 and Φ_2 are the features extracted from images 1 and 2, respectively; here the L1-norm weights are averaged before being applied.
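For reference, the addition and average strategies take the standard forms below, and a typical L1-norm weighting of the kind used by FL1N/AL1NW-style rules is also sketched; the exact weighting used in the compared strategies may differ:

```latex
% Addition and average strategies; \Phi_1, \Phi_2 are the extracted features, \Phi_F the fused feature.
\Phi_F^{\mathrm{add}} = \Phi_1 + \Phi_2, \qquad
\Phi_F^{\mathrm{avg}} = \tfrac{1}{2}\bigl(\Phi_1 + \Phi_2\bigr)

% A typical L1-norm weighting (sketch): each feature is weighted by its relative activity.
w_i(x,y) = \frac{\lVert \Phi_i(x,y) \rVert_1}{\sum_{j=1}^{2} \lVert \Phi_j(x,y) \rVert_1},
\qquad
\Phi_F^{L_1}(x,y) = \sum_{i=1}^{2} w_i(x,y)\,\Phi_i(x,y)
```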
The six fusion strategies are presented in Figure 18, where it can be observed that the addition strategy results in higher brightness, whereas the FL1N strategy introduces more noise. In the fusion images of SPECT-MRI and PET-MRI, the proposed method demonstrates a better ability to highlight functional information, thereby aiding the detection and diagnosis of lesions. The quantitative results, as shown in Tables 8-10, indicate that the proposed method achieves optimal performance in SSIM, PSNR, ARI, and FMI for SPECT-MRI, in SSIM and FMI for PET-MRI, and in ARI and FMI for CT-MRI. However, the FDS approach does not perform as favorably on some of the remaining metrics. This is because FDS prioritizes the enhancement of specific feature characteristics, such as texture or structural clarity, over the broader optimization of quantitative metrics such as PSNR or SSIM. Consequently, FDS may yield perceptual quality improvements that are not fully captured by traditional metrics, particularly when these metrics emphasize pixel-level fidelity rather than perceptual or contextual quality.

Table 8
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.5095±0.0572 | 13.4644±0.6480 | 0.7553±0.0587* | 0.4989±0.1121 | 0.6251±0.0930 | 0.6883±0.0419* |
Average | 0.5453±0.0509 | 17.1652±1.0138 | 0.6612±0.0630 | 0.6871±0.0535 | 0.7732±0.0388 | 0.6799±0.0418 |
FER | 0.5350±0.05450 | 16.7410±1.2321 | 0.7021±0.0721 | 0.6870±0.0547 | 0.7729±0.0547 | 0.6297±0.0425 |
FL1N | 0.4718±0.0499 | 15.8987±0.7707 | 0.6709±0.0529 | 0.6522±0.0665 | 0.7466±0.0489 | 0.6069±0.0380 |
AL1NW | 0.5576±0.0518 | 17.1294±1.0367 | 0.6620±0.0640 | 0.6871±0.0535 | 0.7732±0.0389 | 0.6777±0.0423 |
FDS | 0.5592±0.0536* | 17.3594±1.0211* | 0.7045±0.0643 | 0.6884±0.0534* | 0.7741±0.0388* | 0.6612±0.0439 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; SPECT, single-photon emission computed tomography; Addition, addition strategy; Average, average strategy; FER, feature energy ratio strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Table 9
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.4889±0.0756 | 12.8918±1.4781 | 0.7500±0.1549* | 0.3820±0.1351 | 0.5219±0.1132 | 0.6626±0.0241* |
Average | 0.5153±0.0692 | 14.0559±2.3653 | 0.6744±0.1500 | 0.5274±0.1767* | 0.6413±0.1276 | 0.6580±0.0197 |
FER | 0.4997±0.0742 | 14.0409±1.7081 | 0.6871±0.1477 | 0.4877±0.1585 | 0.6175±0.1250 | 0.5948±0.0189 |
FL1N | 0.4333±0.0609 | 13.0068±1.5452 | 0.6653±0.1301 | 0.4891±0.1587 | 0.6182±0.1235 | 0.5553±0.287
AL1NW | 0.5116±0.0697 | 15.0075±2.3423* | 0.6731±0.1482 | 0.5236±0.1742 | 0.6482±0.1357 | 0.6544±0.0170 |
FDS | 0.5195±0.0730* | 14.5234±1.7365 | 0.6914±0.1431 | 0.5237±0.1725 | 0.6484±0.1343* | 0.6291±0.0193 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; PET, positron emission tomography; Addition, addition strategy; Average, average strategy; FER, feature energy ratio strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Table 10
Method | Metrics | |||||
---|---|---|---|---|---|---|
SSIM | PSNR | MI | ARI | FMI | CC | |
Addition | 0.5018±0.0474 | 11.6724±0.8403 | 0.5461±0.0840 | 0.5343±0.0751 | 0.6851±0.0469 | 0.5489±0.0746* |
Average | 0.5442±0.0420 | 14.6443±0.7937* | 0.5125±0.0856 | 0.6301±0.0604 | 0.7401±0.0362 | 0.5450±0.07750 |
FER | 0.5382±0.0435 | 13.4851±0.7923 | 0.5405±0.0801 | 0.6121±0.0666 | 0.7429±0.0395 | 0.5257±0.0735 |
FL1N | 0.4498±0.0350 | 13.0860±0.5644 | 0.4299±0.0444 | 0.5784±0.0490 | 0.7205±0.0332 | 0.4841±0.0640 |
AL1NW | 0.5476±0.0407* | 14.4763±0.7526 | 0.6326±0.0918* | 0.6326±0.0613 | 0.7576±0.0366 | 0.5437±0.0747 |
FDS | 0.5376±0.0442 | 13.9202±0.7265 | 0.5343±0.0801 | 0.6342±0.0608* | 0.7589±0.0361* | 0.5270±0.0714 |
SSIM, structural similarity; PSNR, peak signal-to-noise ratio; MI, mutual information; FMI, Fowlkes-Mallows index; ARI, adjusted Rand index; CC, correlation coefficient; MRI, magnetic resonance imaging; CT, computed tomography; FER, feature energy ratio strategy; Addition, addition strategy; Average, average strategy; FL1N, feature L1-norm strategy; AL1NW, average L1-norm weight strategy; FDS, feature distillation strategy.
Conclusions
A novel fusion algorithm based on a multi-scale pyramid residual weight network (LYWNet) is proposed for medical image fusion. Initially, the original multi-modal images are input into a feature preprocessor to extract semantic information. The extracted features are then passed through a feature extractor to capture both low- and high-dimensional information. A new fusion strategy, the FDS, is introduced to combine the extracted features, which are subsequently processed by a reconstructor to generate the fused image. After training, the fused image can be obtained directly from the original input images without additional parameter adjustments or settings. Extensive experimental comparisons and objective evaluation metrics demonstrate that the proposed algorithm outperforms existing methods in terms of visual quality and most performance metrics.
Acknowledgments
None.
Footnote
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-851/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Shen D, Wu G, Suk HI. Deep Learning in Medical Image Analysis. Annu Rev Biomed Eng 2017;19:221-48. [Crossref] [PubMed]
- Adelsmayr G, Janisch M, Müller H, Holzinger A, Talakic E, Janek E, Streit S, Fuchsjäger M, Schöllnast H. Three dimensional computed tomography texture analysis of pulmonary lesions: Does radiomics allow differentiation between carcinoma, neuroendocrine tumor and organizing pneumonia? Eur J Radiol 2023;165:110931. [Crossref] [PubMed]
- Cohen JG, Reymond E, Medici M, Lederlin M, Lantuejoul S, Laurent F, Toffart AC, Moreau-Gaudry A, Jankowski A, Ferretti GR. CT-texture analysis of subsolid nodules for differentiating invasive from in-situ and minimally invasive lung adenocarcinoma subtypes. Diagn Interv Imaging 2018;99:291-9. [Crossref] [PubMed]
- Liu Y, Chen X, Wang Z, Wang ZJ, Ward RK, Wang X. Deep learning for pixel-level image fusion: Recent advances and future prospects. Inf Fusion 2018;42:158-73.
- James AP, Dasarathy BV. Medical image fusion: A survey of the state of the art. Inf Fusion 2014;19:4-19.
- Wang Z, Cui Z, Zhu Y. Multi-modal medical image fusion by Laplacian pyramid and adaptive sparse representation. Comput Biol Med 2020;123:103823. [Crossref] [PubMed]
- Nencini F, Garzelli A, Baronti S, Alparone L. Remote sensing image fusion using the curvelet transform. Inf Fusion 2007;8:143-56.
- Liu Y, Liu S, Wang Z. A general framework for image fusion based on multi-scale transform and sparse representation. Inf Fusion 2015;24:147-64.
- Du J, Li W, Xiao B, Nawaz Q. Union Laplacian pyramid with multiple features for medical image fusion. Neurocomputing 2016;194:326-39.
- Singh R, Khare A. Fusion of multimodal medical images using Daubechies complex wavelet transform – A multiresolution approach. Inf Fusion 2014;19:49-60.
- Xu X, Wang Y, Chen S. Medical image fusion using discrete fractional wavelet transform. Biomed Signal Process Control 2016;27:103-11.
- Ganasala P, Prasad AD. Medical image fusion based on laws of texture energy measures in stationary wavelet transform domain. Int J Imag Syst Tech 2020;30:544-57.
- Jose J, Gautam N, Tiwari M, Tiwari T, Suresh A, Sundararaj V, Mr R. An image quality enhancement scheme employing adolescent identity search algorithm in the NSST domain for multimodal medical image fusion. Biomed Signal Process Control 2021;66:102480.
- Alseelawi N, Hazim HT, Salim ALRikabi HT. A Novel Method of Multimodal Medical Image Fusion Based on Hybrid Approach of NSCT and DTCWT. International Journal of Online and Biomedical Engineering 2022;18:114-33.
- Tawfik N, Elnemr HA, Fakhr M, Dessouky MI, El-Samie FEA. Multimodal Medical Image Fusion Using Stacked Auto-encoder in NSCT Domain. J Digit Imaging 2022;35:1308-25. [Crossref] [PubMed]
- Zhu Z, Zheng M, Qi G, Wang D, Xiang Y. A phase congruency and local Laplacian energy based multi-modality medical image fusion method in NSCT domain. IEEE Access 2019;7:20811-24.
- Diwakar M, Singh P, Shankar A. Multi-modal medical image fusion framework using co-occurrence filter and local extrema in NSST domain. Biomed Signal Process Control 2021;68:102788.
- Guo P, Xie G, Li R, Hu H. Multimodal medical image fusion with convolution sparse representation and mutual information correlation in NSST domain. Complex Intell Syst 2023;9:317-28.
- Tan W, Tiwari P, Pandey HM, Moreira C, Jaiswal AK. Multimodal medical image fusion algorithm in the era of big data. Neural Comput Appl 2020; [Crossref]
- Yin M, Liu X, Liu Y, Chen X. Medical Image Fusion With Parameter-Adaptive Pulse Coupled Neural Network in Nonsubsampled Shearlet Transform Domain. IEEE Trans Instrum Meas 2019;68:49-64.
- Zong J, Qiu T. Medical image fusion based on sparse representation of classified image patches. Biomed Signal Process Control 2017;34:195-205.
- Yang B, Yang C, Huang G. Efficient image fusion with approximate sparse representation. Int J Wavelets Multiresolut Inf Process 2016;14:1650024.
- Zhang X, Ma Y, Fan F, Zhang Y, Huang J. Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. J Opt Soc Am A Opt Image Sci Vis 2017;34:1400-10. [Crossref] [PubMed]
- Ma J, Zhang H, Yi P, Wang Z. SCSCN: A Separated Channel-Spatial Convolution Net With Attention for Single-View Reconstruction. IEEE Trans Ind Electron 2020;67:8649-58.
- Fu J, Li W, Du J, Huang Y. A multiscale residual pyramid attention network for medical image fusion. Biomed Signal Process Control 2021;66:102488.
- Li W, Peng X, Fu J, Wang G, Huang Y, Chao F. A multiscale double-branch residual attention network for anatomical-functional medical image fusion. Comput Biol Med 2022;141:105005. [Crossref] [PubMed]
- Li B, Hwang JN, Liu Z, Li C, Wang Z. PET and MRI image fusion based on a dense convolutional network with dual attention. Comput Biol Med 2022;151:106339. [Crossref] [PubMed]
- Huang J, Le Z, Ma Y, Fan F, Zhang H, Yang L. MGMDcGAN: medical image fusion using multi-generator multi-discriminator conditional generative adversarial network. IEEE Access 2020;8:55145-57.
- Ma J, Xu H, Jiang J, Mei X, Zhang XP. DDcGAN: A Dual-discriminator Conditional Generative Adversarial Network for Multi-resolution Image Fusion. IEEE Trans Image Process 2020; Epub ahead of print. [Crossref]
- Chen J, Li X, Luo L, Mei X, Ma J. Infrared and visible image fusion based on target-enhanced multiscale transform decomposition. Inform Sciences 2020;508:64-78.
- Hu J, Li S. The multiscale directional bilateral filter and its application to multisensor image fusion. Inf Fusion 2012;13:196-206.
- Liu Y, Chen X, Cheng J, Peng H. A medical image fusion method based on convolutional neural networks. IEEE International Conference on Information Fusion 2017. doi: 10.23919/ICIF.2017.8009769.
- Xu H, Ma J. EMFusion: An unsupervised enhanced medical image fusion network. Inf Fusion 2021;76:177-86.
- Balakrishnan R, Priya R. Multimodal Medical Image Fusion based on Deep Learning Neural Network for Clinical Treatment Analysis. International Journal of ChemTech Research 2018;11:160-76.
- Tang W, He F, Liu Y, Duan Y. MATR: Multimodal Medical Image Fusion via Multiscale Adaptive Transformer. IEEE Trans Image Process 2022;31:5134-49. [Crossref] [PubMed]
- Lahoud F, Süsstrunk S. Zero-learning fast medical image fusion. IEEE International Conference on Information Fusion 2019. doi: 10.23919/FUSION43075.2019.9011178.
- Zhao C, Wang T, Lei B. Medical image fusion method based on dense block and deep convolutional generative adversarial network. Neural Comput Appl 2021;33:6595-610.
- Song X, Wu XJ, Li H. MSDNet for medical image fusion. International Conference on Image and Graphics 2019;278-288. doi: 10.1007/978-3-030-34110-7_24.
- Liu Y, Li Z, Feng J, Gu Y. An Unsupervised GAN-based Quality-enhanced Medical Image Fusion Network. IEEE Conference on Telecommunications, Optics and Computer Science 2022:429-432. doi: 10.1109/TOCS56154.2022.10016141.
- Fu J, Li W, Du J, Xu L. DSAGAN: A generative adversarial network based on dual-stream attention mechanism for anatomical and functional image fusion. Inform Sciences 2021;576:484-506.
- Xu J, Xiong Z, Bhattacharyya SP. PIDNet: A real-time semantic segmentation network inspired by PID controllers. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023;19529-39.
- Umirzakova S, Mardieva S, Muksimova S, Ahmad S, Whangbo T. Enhancing the Super-Resolution of Medical Images: Introducing the Deep Residual Feature Distillation Channel Attention Network for Optimized Performance and Efficiency. Bioengineering (Basel) 2023;10:1332. [Crossref] [PubMed]
- He J, Deng Z, Zhou L, Wang Y, Qiao Y. Adaptive Pyramid Context Network for Semantic Segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019;7511-20.
- Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2017;4700-4708. doi: 10.1109/CVPR.2017.243.
- Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2017;2881-2890. doi: 10.1109/CVPR.2017.660.
- Pan H, Hong Y, Sun W, Jia Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes. IEEE Trans Intell Transport Syst 2023;3:3448-60. [Crossref]
- Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 2004;13:600-12. [Crossref] [PubMed]
- Zhang Z. Improved Adam Optimizer for Deep Neural Networks. IEEE/ACM 26th International Symposium on Quality of Service (IWQoS) 2018;1-2.
- Zhang Y, Xiang W, Zhang S, Shen J, Wei R, Bai X, Zhang L, Zhang Q. Local extreme map guided multi-modal brain image fusion. Front Neurosci 2022;16:1055451. [Crossref] [PubMed]
- Li X, Guo X, Han P, Wang X, Li H, Luo T. Laplacian Redecomposition for Multimodal Medical Image Fusion. IEEE Trans Instrum Meas 2020;69:6880-90.
- Tan W, Thitøn W, Xiang P, Zhou H. Multi-modal brain image fusion based on multi-level edge-preserving filtering. Biomed Signal Process Control. 2021;64:102280.
- Qu G, Zhang D, Yan P. Information measure for performance of image fusion. Electron Lett 2002;38:313-5.
- Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc 1983;78:553-69.
- Deshmukh M, Bhosale U. Image fusion and image quality assessment of fused images. International Journal of Image Processing 2010;4:484.