Original Article

Automated diagnosis of pulmonary nodules in 3D PET/CT images using dual-path densely connected networks with cross-modal fusion

Fulu Liao1,2, Yongchun Cao1,2,3, Junfeng Mao4, Qiang Lin1,2,3, Zhengxing Man1,2,3, Zhengqi Cai1,2,3, Xiaodi Huang5

1School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou, China; 2Gansu Provincial Engineering Research Center of Multi-modal Artificial Intelligence, Northwest Minzu University, Lanzhou, China; 3Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, China; 4Department of Nuclear Medicine, The 940th Hospital of Joint Logistics Support Force of Chinese People’s Liberation Army, Lanzhou, China; 5School of Computing, Mathematics and Engineering, Charles Sturt University, Albury, Australia

Contributions: (I) Conception and design: F Liao, Y Cao; (II) Administrative support: Q Lin, Z Man, Z Cai; (III) Provision of study materials or patients: Y Cao, J Mao; (IV) Collection and assembly of data: F Liao, Y Cao, J Mao; (V) Data analysis and interpretation: F Liao, Y Cao, X Huang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Yongchun Cao, Master of Medicine. School of Mathematics and Computer Science, Northwest Minzu University, No. 1 Xibei Xincun, Chengguan District, Lanzhou 730030, China; Gansu Provincial Engineering Research Center of Multi-modal Artificial Intelligence, Northwest Minzu University, Lanzhou, China; Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, China. Email: cych33908@xbmu.edu.cn.

Background: Lung cancer, a highly lethal malignant disease, requires timely and accurate differentiation between benign and malignant pulmonary nodules (PNs) to enable early intervention and improved prognosis. Positron emission tomography/computed tomography (PET/CT) is a multimodal imaging technique that integrates metabolic information with anatomical details, playing a crucial role in tumor diagnosis. This study aimed to develop a multimodal fusion-based classification model for the automated diagnosis of PNs, ultimately supporting clinical decision-making.

Methods: We propose a novel multi-level cross-modal fusion classification framework, of which the core architecture comprises: (I) a dual-path densely connected network for hierarchically extracting modality-specific features; and (II) a multi-level cross-modal interaction mechanism to facilitate complementary feature fusion. This end-to-end framework performs a comprehensive diagnostic categorization of PNs, effectively distinguishing between benign and malignant cases, thereby enhancing the efficiency and accuracy of clinical decision-making.

Results: The proposed model was evaluated on a real-world clinical dataset. The experimental results demonstrate that it achieved an accuracy of 0.7778, a precision of 0.7590, a recall of 0.7968, and an F1 score of 0.7725.

Conclusions: The proposed model outperforms state-of-the-art baselines, validating the effectiveness of its feature extraction and multi-level cross-modal interaction strategy. These findings highlight the potential of the proposed model as a robust and reliable tool in clinical settings, capable of supporting intelligent, automated diagnosis of PNs.

Keywords: Positron emission tomography/computed tomography (PET/CT); pulmonary nodules (PNs); dense connectivity; cross-modal fusion; image classification


Submitted May 01, 2025. Accepted for publication Oct 31, 2025. Published online Dec 31, 2025.

doi: 10.21037/qims-2025-1037


Introduction

Lung cancer remains one of the deadliest malignancies globally (1), and early diagnosis combined with prompt intervention is critical to reducing mortality rates (2). However, lung cancer is typically asymptomatic in its early stages, with pulmonary nodules (PNs) being the most common clinical presentation (3). Therefore, accurate characterization of PNs is essential for the early detection of lung cancer (4,5). Owing to its complementary multimodal imaging capabilities, positron emission tomography/computed tomography (PET/CT) has emerged as an essential modality for enhancing diagnostic accuracy and staging in PN screening (6-9).

PET provides insights into lesion biology by assessing metabolic activity. However, it exhibits limited sensitivity for small nodules and can produce false-positive results due to overlapping metabolic profiles between benign and malignant lesions (10). In contrast, CT offers high-resolution visualization of morphological features (e.g., lobulation, spiculation, vacuolation). Nevertheless, its low soft-tissue contrast can compromise diagnostic sensitivity, and the presence of similar imaging features in both benign and malignant lesions (11) can reduce specificity. These limitations highlight the challenges of relying on single-modality imaging for PN assessment.

Convolutional neural networks (CNNs) (12,13), as a core deep learning (DL) technology, have achieved significant breakthroughs in computer vision due to their hierarchical feature extraction capability and end-to-end automation. In recent years, CNNs have made remarkable progress in medical image analysis across key tasks such as disease classification (14-16), anatomical structure segmentation (17-19), and object detection (20-22), demonstrating substantial clinical potential.

Medical image classification is a core task in computer-aided diagnosis, and its performance improvements rely largely on targeted architectural optimizations of DL networks (23,24). To enhance classification performance by improving the extraction of lesion features from specific image modalities, significant research has focused on designing specialized architectures. These include lightweight CNN variants (25,26), enhanced residual networks (27,28), attention-integrated mechanisms (29), and Transformer-based modules (30). Such innovations have driven significant advances in clinical applications, including brain disease, lung cancer screening, and coronavirus disease 2019 (COVID-19) identification.

Constrained by the inherent limitations of individual imaging modalities, the sole reliance on single-modality information can easily lead to an incomplete representation of lesion characteristics, which may hinder clinical decision-making (31). Multimodal medical image fusion has emerged as a promising solution, significantly improving diagnostic accuracy through cross-modal feature complementarity (32-35) and has become a key technology for overcoming the bottleneck of single-modality information analysis. Kim et al. (36) utilized an attention mechanism to fuse anatomical and functional information from magnetic resonance (MR)-CT images. Joo et al. (37) extracted features from contrast-enhanced T1-MR and T2-MR images using dual three-dimensional (3D) ResNet and integrated these with clinical data to enhance breast cancer prediction. Gao et al. (38) proposed the PathNet network to progressively integrate multiscale magnetic resonance imaging (MRI)-PET features through cross-modal hierarchical learning. He et al. (39) developed the hierarchical-order multimodal interaction fusion (HOMIF) network, which optimizes brain glioma classification via a hierarchical interaction strategy. Additionally, a 3D CNN based on PET/CT was introduced to mine tumor metabolism-morphology associations (40). Although existing studies have demonstrated the clinical value of multimodal fusion in tumor staging (36,37), brain disease classification (38,39), and PN identification (40), current strategies still fall short in capturing complex cross-modal correlations and fail to fully exploit the synergistic potential of multimodal features.

To advance the automated diagnosis of PNs, we propose a novel multilevel cross-modal feature fusion model for triple-class classification (non-nodule/benign nodules/malignant nodules). The architecture employs dual-path densely connected networks with identical structures to independently learn representative hierarchical features of different modalities from PET and CT images. Complementary enhancement of PET-CT bimodal features is achieved through a multilevel cross-modal interaction mechanism, enabling accurate classification based on the fused feature representations.

The main contributions of this paper can be summarized as follows:

  • We propose a novel three-class classification framework for automatic diagnosis of PNs.
  • We introduce a frequency-aware cross-modal feature fusion mechanism to enhance the integration of dual-modal features.
  • Extensive experiments on a real-world clinical dataset validate the effectiveness of our model, which outperforms existing state-of-the-art methods.

We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1037/rc).


Methods

Experimental data

The PET/CT imaging data were collected from January 2016 to June 2023 at The 940th Hospital of Joint Logistics Support Force of Chinese People’s Liberation Army. A total of 1,378 participants underwent PET/CT scans using the Biograph True Point 64 (Siemens Healthineers, Erlangen, Germany). The resolution of the CT and PET images was 512×512 pixels (with an x-y axis pixel spacing of 0.9766 mm × 0.9766 mm) and 168×168 pixels (with an x-y axis pixel spacing of 4.0728 mm × 4.0728 mm), respectively, with a uniform slice thickness of 3.0 mm along the z-axis for both modalities.

The inclusion criteria for this study were as follows: (I) patients aged 18 years or older; (II) patients who underwent biopsy and were followed up after surgery with pathological confirmation; (III) a nodule diameter between 10 and 30 mm; and (IV) no history of previous surgery or chemotherapy. As a result, 812 participants were included in the final study cohort.

Three radiologists from the collaborating institution (each with more than 10 years of clinical experience) independently annotated the PET and CT images for the 812 enrolled participants, forming the final experimental dataset. The dataset comprises three diagnostic categories: no pulmonary nodule (NPN), benign pulmonary nodule (BPN), and malignant pulmonary nodule (MPN), with case distributions of 129 NPN, 346 BPN, and 337 MPN, respectively. The comprehensive imaging statistics are summarized in Table 1.

Table 1

The statistics of the PET/CT images used in this study

Category NPN BPN MPN Total
PET images (n) 129 346 337 812
CT images (n) 129 346 337 812
Total 258 692 674 1,624

BPN, benign pulmonary nodule; MPN, malignant pulmonary nodule; NPN, no pulmonary nodule; PET/CT, positron emission tomography/computed tomography.

Data preprocessing

To improve data-model compatibility and enhance model performance, we applied the following standardized preprocessing pipeline to all 812 pairs of original Digital Imaging and Communications in Medicine (DICOM) images in this study.

CT-PET intensity calibration

To eliminate the impact of device parameters and individual patient variations, and to achieve spatial alignment of cross-modality data, standardized operations were performed on the raw PET and CT data. Specifically, CT values were converted to Hounsfield units (HU), and PET values were transformed to standardized uptake value (SUV).

Spatial alignment

By establishing voxel-level spatial correspondence across multimodal images, a foundation is laid for subsequent image analysis tasks. The process is as follows: first, the affine matrices of the raw PET and CT images are acquired; then, based on Eq. [1], the coordinates of both modalities are uniformly mapped to a standard reference coordinate system; next, the affine matrix from the standard reference coordinates to the target modality coordinates is computed; finally, the voxel coordinates of the image to be registered (either PET or CT) are transformed into the target coordinate system via matrix multiplication, and the gray values at resampled grid points are reconstructed using cubic spline interpolation.

x' = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ 0 & 0 & 0 & 1 \end{bmatrix} x

where x and x' are homogeneous coordinate vectors representing the position before and after the transformation, respectively.
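As a concrete illustration, the coordinate mapping and cubic spline resampling described above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the function name `resample_to_target` and the use of `scipy.ndimage.map_coordinates` are our own choices.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def resample_to_target(moving, affine_moving, affine_target, target_shape, order=3):
    """Resample `moving` onto the voxel grid of a target image.

    Each voxel coordinate of the target grid is mapped to the standard
    reference (world) space via `affine_target`, then into the moving
    image's voxel space via the inverse of `affine_moving` (Eq. [1]);
    intensities are reconstructed with cubic spline interpolation.
    """
    # full voxel grid of the target volume, in homogeneous coordinates
    idx = np.indices(target_shape).reshape(3, -1)
    vox_h = np.vstack([idx, np.ones((1, idx.shape[1]))])
    # target voxel -> world -> moving voxel
    world = affine_target @ vox_h
    mov_vox = np.linalg.inv(affine_moving) @ world
    coords = mov_vox[:3].reshape((3,) + tuple(target_shape))
    # cubic spline interpolation (order=3) at the resampled grid points
    return map_coordinates(moving, coords, order=order, mode='nearest')
```

With identity affines the mapping is the identity, so the resampled volume equals the input; in practice the two affines come from the PET and CT DICOM headers.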

Chest region extraction

To reduce the computational complexity of multimodal image analysis and eliminate interference from irrelevant structures, the chest region was extracted from whole-body scans. First, automated lung parenchyma segmentation was performed using a pre-trained 3D U-Net (41) on CT images, obtaining a segmentation mask and accurately locating the Z-axis boundary. Then, the CT lung segmentation mask was directly mapped to the PET domain. Finally, the chest region was extracted by symmetrically expanding along the sagittal/coronal planes to a 128×128×128 voxel volume, using the Z-axis geometric center of the mask from both modalities as the reference.

Modality-specific normalization

To eliminate the intensity distribution differences between modalities, enhance model robustness, and optimize multimodal feature fusion, the following preprocessing steps were applied: maximum-minimum normalization was performed on the CT images, and Z-score normalization was applied to the PET images. The calculations are detailed in Eqs. [2,3]:

HU_{norm} = \frac{HU - HU_{min}}{HU_{max} - HU_{min}}

SUV_{norm} = \frac{SUV - \mu_{SUV}}{\sigma_{SUV}}

where HU_{norm} is the result after CT normalization, and HU_{min} and HU_{max} represent the minimum and maximum HU values of the region of interest, respectively; SUV_{norm} represents the PET normalization result, and \mu_{SUV} and \sigma_{SUV} represent the mean and standard deviation of the SUV values, respectively.
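A minimal sketch of the two normalizations in Eqs. [2,3]; the HU clipping window [-1000, 400] is an illustrative choice and is not specified in the paper.

```python
import numpy as np

def normalize_ct(hu, hu_min=-1000.0, hu_max=400.0):
    """Min-max normalization of CT HU values (Eq. [2]).
    The clipping window is an assumed region-of-interest range."""
    hu = np.clip(hu, hu_min, hu_max)
    return (hu - hu_min) / (hu_max - hu_min)

def normalize_pet(suv, eps=1e-8):
    """Z-score normalization of PET SUV values (Eq. [3])."""
    return (suv - suv.mean()) / (suv.std() + eps)
```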

The proposed methods

We propose an automated PN classification framework that leverages multimodal PET/CT feature fusion. Figure 1 depicts this framework comprising three core components: feature extraction, cross-modal feature fusion, and feature classification. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Ethics Committee of the 940th Hospital of the Joint Logistics Support Force of the Chinese People’s Liberation Army (No. 2024KYLL339) and informed consent was provided by all the patients.

Figure 1 The framework of the proposed multimodal feature fusion classification model. The framework comprises three components: (I) feature extractor: a dual-branch densely connected network that independently processes input data from different modalities; (II) feature fusion module: the BRWF module integrates multi-modal features extracted from different network layers; (III) classifier: responsible for performing the classification task based on the fused features. BRWF, bidirectional residual wavelet-convolution fusion; CT, computed tomography; DDGR, dynamic deformation-aware and gradient-enhanced residual block; FC, fully connected; HGMC, heterogeneous grouped multi-scale convolution; PET, positron emission tomography.

The feature extraction subnetwork utilizes a dual-path densely connected network with identical structures to hierarchically extract multi-scale spatial features from PET/CT images along different orientations. The feature fusion subnetwork employs wavelet-based convolutional operations to effectively integrate cross-modal features from multiple network layers, thereby enhancing the representation of complementary information from both PET and CT modalities. Finally, the feature classification subnetwork processes the fused multimodal features through sequential pooling layers, fully connected (FC) layers, and a Softmax function to generate the final diagnostic classification among three categories: NPN, BPN, and MPN.

Multimodal feature extraction

The multimodal feature extraction is performed using dual-path densely connected networks with symmetric structures. Each network branch consists of an initial 7×7×7 convolutional layer, followed by 3×3×3 max pooling, three sequentially connected dense-heterogeneous grouped multi-scale convolution (HGMC) blocks integrated with dynamic deformation-aware and gradient-enhanced residual block (DDGR), and a standalone dense-HGMC block. Notably, cross-hierarchical dense connectivity is preserved between down-sampled shallow feature maps across cascaded modules; this architectural design enables the network to more flexibly integrate information from different abstraction levels, thus significantly improving the model’s representational capabilities.

Dense-HGMC block

Following the dense connectivity mechanism of DenseNet (42), we propose a dense-HGMC block to effectively extract spatial features and integrate multi-scale information. As illustrated in Figure 2, the dense-HGMC block fundamentally modifies the original Dense Block structure by replacing the conventional BN-ReLU-Conv sequence with a HGMC module operation. This innovation uses HGMC to extract multi-scale spatial features from the aggregated feature maps of the input layers to significantly strengthen the network’s feature representation capability.

Figure 2 The schematic diagram of the dense-HGMC block structure. The core component, HGMC, is specifically designed to effectively extract multi-scale features from the aggregated feature maps across layers. HGMC, heterogeneous grouped multi-scale convolution.

For the input feature map X_0, the feature map X_l at the l-th layer (1 \le l \le L) of the dense-HGMC block can be expressed as Eq. [4]:

X_l = \mathrm{DHG}_l([X_0, X_1, \ldots, X_{l-1}])

where [\cdot] denotes channel-wise concatenation, and \mathrm{DHG}_l(\cdot) represents the HGMC module operation. The output of the dense convolution block is the concatenation of the input feature maps with the output feature maps of each layer within the block, namely \mathrm{Concat}(X_0, X_1, \ldots, X_L).
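The dense connectivity of Eq. [4] can be sketched in PyTorch as follows. Note this is a simplified illustration: each DHG_l is stood in for by a plain 3×3×3 convolution, whereas the paper replaces it with the full HGMC module.

```python
import torch
import torch.nn as nn

class DenseBlock3D(nn.Module):
    """Dense connectivity of Eq. [4]: layer l receives the channel-wise
    concatenation of the block input and all previous layer outputs."""
    def __init__(self, in_ch, growth, n_layers):
        super().__init__()
        # layer l sees in_ch + l * growth input channels
        self.layers = nn.ModuleList(
            nn.Conv3d(in_ch + i * growth, growth, 3, padding=1)
            for i in range(n_layers))

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)  # Concat(X0, X1, ..., XL)
```

The output channel count is in_ch + n_layers × growth, matching the standard DenseNet growth-rate behavior.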

The HGMC module serves as a core component within the dense-HGMC block for hierarchical feature extraction, with its structural details depicted in Figure 3A. The processing pipeline begins with a 1×1×1 convolutional layer that performs channel dimension adjustment and preliminary feature integration. Subsequently, heterogeneously grouped convolutions with different sequential arrangements of convolution kernels are applied in parallel branches to capture complementary spatial-contextual features along different orientations. These dual-path outputs undergo element-wise additive fusion followed by 3×3×3 convolution-based cross-branch information consolidation, effectively compensating for potential global structural information loss inherent in grouped convolution operations. The concatenated feature map output from the grouped convolutions is then input into the multi-scale attention (MSA) module, as shown in Figure 3B, which extracts multi-scale features using convolution kernels of different sizes. Next, channel attention (SE weight) is employed via a weighting mechanism to automatically balance the contributions of multi-scale features, as illustrated in Figure 3C. Furthermore, residual connections are strategically incorporated within the HGMC module to stabilize gradient flow and mitigate the risk of vanishing gradients during deep network training.

Figure 3 The architecture of the HGMC module. (A) Overall view of the HGMC module. (B) Design of the internal MSA module. (C) The SE block within the MSA, which adaptively recalibrates multi-scale feature responses. GAP, global average pooling; HGMC, heterogeneous grouped multi-scale convolution; MSA, multi-scale attention; SE, squeeze-and-excitation.

The squeeze-and-excitation (SE) module enhances the key scale features by modeling inter-channel dependencies, encouraging the network to focus on the relationships between cross-scale features and thereby improving the collaborative representation of multi-scale features. If X denotes the input feature map, the output Y of the SE module can be formalized as Eq. [5]:

Y = \sigma(W_2 \, \delta(W_1 \, \mathrm{GAP}(X))) \otimes X

where GAP denotes global average pooling, W_1 and W_2 are the weights of the two FC layers, \otimes denotes channel-wise multiplication, and \delta and \sigma denote the rectified linear unit (ReLU) and Sigmoid activation functions, respectively.
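Eq. [5] maps directly onto a few lines of PyTorch. This is a generic SE block sketch (the reduction ratio of 4 is an assumption, not stated in the paper):

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation of Eq. [5]: GAP -> FC (W1) -> ReLU ->
    FC (W2) -> Sigmoid, followed by channel-wise rescaling of the input."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, x):                       # x: (B, C, D, H, W)
        w = x.mean(dim=(2, 3, 4))               # GAP over spatial dims
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))
        return x * w.view(*w.shape, 1, 1, 1)    # channel-wise product
```

Since the sigmoid weights lie in (0, 1), the block can only attenuate channels, never amplify them, which is what "recalibration" means here.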

DDGR block

Extending the transition layer in DenseNet, we integrate a DDGR block after the dense-HGMC block, with its architectural details depicted in Figure 4. This block serves dual purposes: (I) extracting shape characteristics and boundary features critical for PN diagnosis; and (II) downsampling feature maps to reduce spatial resolution while capturing higher-level abstraction. The DDGR block adopts a residual structure composed of a mainstream and a residual stream. The mainstream consists of a 3×3×3 depthwise separable convolution, a 3×3×3 deformable convolution, and a 1×1×1 regular convolution. By combining these three different convolutions, the model enhances its ability to model irregular shape features while reducing the number of parameters. The residual stream employs a gradient operator and a 1×1×1 regular convolution, explicitly extracting information such as edges and textures from the feature maps, thereby improving the model’s sensitivity to details. Finally, adding the outputs of the mainstream and residual stream via an element-wise addition forms a residual connection that facilitates information fusion and gradient flow. Average pooling efficiently compresses spatial information and refines semantics.

Figure 4 The structure of the dynamic deformation-aware and gradient-enhanced Residual block. It comprises a residual architecture incorporating deformable convolution and a gradient operator, along with an average pooling layer. This design effectively captures shape and edge features while simultaneously performing downsampling. DSConv, depthwise separable convolution.

We use the Sobel operator to extract gradient information from the feature map. The Sobel operator uses a fixed 3×3×3 convolution kernel, convolving the feature map in three spatial directions to obtain the corresponding gradient components. Let Gx(x,y,z), Gy(x,y,z) and Gz(x,y,z) represent the gradient components in the x, y, and z directions, respectively. The calculation of the gradient magnitude G(x,y,z) of the feature map is given by the following Eq. [6]:

G(x,y,z) = \sqrt{G_x^2(x,y,z) + G_y^2(x,y,z) + G_z^2(x,y,z)}
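The gradient-magnitude computation of Eq. [6] can be sketched with SciPy's Sobel filter on a NumPy volume. In the actual network this is applied per feature-map channel inside the residual stream; the sketch below operates on a single 3D array for illustration.

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_magnitude_3d(vol):
    """Gradient magnitude of Eq. [6]: Sobel responses along the three
    spatial axes, combined as sqrt(Gx^2 + Gy^2 + Gz^2)."""
    gx = sobel(vol, axis=0)
    gy = sobel(vol, axis=1)
    gz = sobel(vol, axis=2)
    return np.sqrt(gx ** 2 + gy ** 2 + gz ** 2)
```

A constant volume yields zero gradient everywhere, while any intensity ramp or edge produces a nonzero response, which is exactly the edge/texture sensitivity the residual stream exploits.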

Multimodal feature fusion

CT provides high-resolution structural information of nodule anatomical characteristics, including morphology and margin details, whereas PET captures functional profiles through metabolic activity quantification. In order to effectively fuse the complementary features of two different modalities to enhance the model’s ability to identify nodules, we developed a bidirectional residual wavelet-convolution fusion (BRWF) module, of which the structural details are presented in Figure 5.

Figure 5 An illustration of the BRWF module. This module decomposes and fuses features via the wavelet transform, establishes cross-band correlations through convolution, reconstructs spatial features via inverse transform, and enhances modality complementarity through bidirectional residual connections. BRWF, bidirectional residual wavelet-convolution fusion; CT, computed tomography; D, diagonal high-frequency; DWT, discrete wavelet transform; GELU, Gaussian error linear unit; H, horizontal high-frequency; IDWT, inverse discrete wavelet transform; L, low-frequency; PET, positron emission tomography; V, vertical high-frequency.

The BRWF module begins with a pixel-wise additive fusion of shallow PET/CT features to enable cross-modal interaction of shallow semantic information. Subsequently, discrete wavelet transform (DWT) decomposes these fused features into distinct frequency sub-bands, explicitly isolating high-frequency components (edge/texture details) from low-frequency counterparts (structural primitives). The features of the decomposed frequency sub-bands are further processed using a cascade convolution operation of Conv 1×1×1 → DWConv 3×3×3 → Gaussian error linear unit (GELU) → Conv 1×1×1, compressing channel dimensions while establishing cross-frequency spatial correlations, overcoming the limitations of traditional single-domain convolution in global structure modeling. The refined features undergo channel-wise decoupling via Chunk operations to reconstruct original frequency-band correspondences. Following the inverse discrete wavelet transform (IDWT), the spatial domain features are reconstructed and multiplied element-wise with the features before the DWT operation to form frequency domain attention weights, achieving adaptive enhancement of high-frequency details and low-frequency structures. Finally, residual connections integrate weighted fusion outputs with original PET/CT features, whereas bidirectional feature redistribution promotes synergistic enhancement between metabolic function and anatomical structure.

For the 3D feature map I \in \mathbb{R}^{N_x \times N_y \times N_z}, the DWT is implemented using separable wavelet bases, applying a one-dimensional (1D) wavelet transform to each spatial dimension d \in \{x, y, z\} in sequence. If L denotes a low-pass filter, H denotes a high-pass filter, and S_x, S_y, S_z \in \{L, H\} denote subband indices, then the calculation of the 8 subband coefficients C_{S_x S_y S_z} (processed in the order of the z, y, and x dimensions) can be expressed as Eq. [7]:

C_{S_x S_y S_z}[i,j,k] = \downarrow_x \left( f_{S_x} *_x \left( \downarrow_y \left( f_{S_y} *_y \left( \downarrow_z \left( f_{S_z} *_z I \right) \right) \right) \right) \right)

where i, j, and k represent the indices of the subbands after downsampling in the x, y, and z dimensions, respectively; *_d denotes convolution along dimension d, \downarrow_d denotes downsampling along dimension d, and f_{S_d} denotes the analysis filter: when S_d = L, f_{S_d} is the low-pass filter; otherwise, it is the high-pass filter. Through the above calculation, a set of 8 subbands C_{S_x S_y S_z} is obtained.

IDWT applies 1D inverse transforms to the x, y, and z dimensions in sequence, reconstructing spatial-domain features by superimposing the contributions of all 8 subbands. It is formally expressed as Eq. [8]:

I(n_x, n_y, n_z) = \sum_{S_x S_y S_z} \hat{f}_{S_z} *_z \left( \uparrow_z \left( \hat{f}_{S_y} *_y \left( \uparrow_y \left( \hat{f}_{S_x} *_x \left( \uparrow_x C_{S_x S_y S_z} \right) \right) \right) \right) \right)

where (n_x, n_y, n_z) denotes the position in the reconstructed volume along the x, y, and z dimensions, \uparrow_d denotes upsampling along dimension d, and \hat{f}_{S_d} denotes the reconstruction filter: when S_d = L, the low-pass filter is selected; otherwise, the high-pass filter is used.
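The separable 3D DWT/IDWT pair of Eqs. [7,8] can be sketched compactly in NumPy. The Haar basis is our choice for illustration (the paper does not name its wavelet basis), and the sketch assumes even-length axes; the key property shown is that the 8 subbands reconstruct the input exactly.

```python
import numpy as np
from itertools import product

H_LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar analysis low-pass (L)
H_HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # Haar analysis high-pass (H)

def _analyze(x, f, axis):
    """Filter along `axis`, then downsample by 2 (inner step of Eq. [7])."""
    x = np.moveaxis(x, axis, -1)
    out = f[0] * x[..., ::2] + f[1] * x[..., 1::2]
    return np.moveaxis(out, -1, axis)

def _synthesize(lo, hi, axis):
    """Upsample, filter, and superimpose two subbands (inner step of Eq. [8])."""
    lo = np.moveaxis(lo, axis, -1)
    hi = np.moveaxis(hi, axis, -1)
    out = np.empty(lo.shape[:-1] + (2 * lo.shape[-1],))
    out[..., ::2] = (lo + hi) / np.sqrt(2.0)
    out[..., 1::2] = (lo - hi) / np.sqrt(2.0)
    return np.moveaxis(out, -1, axis)

def dwt3(vol):
    """One-level 3D DWT (Eq. [7]): filter/downsample along z, y, then x,
    yielding the 8 subbands C_{SxSySz}, keyed 'LLL' ... 'HHH'."""
    bands = {}
    for sx, sy, sz in product('LH', repeat=3):
        out = vol
        for axis, s in ((2, sz), (1, sy), (0, sx)):
            out = _analyze(out, H_LO if s == 'L' else H_HI, axis)
        bands[sx + sy + sz] = out
    return bands

def idwt3(bands):
    """Inverse transform (Eq. [8]): merge subbands along x, y, then z."""
    for axis in (0, 1, 2):
        keys = {k[1:] for k in bands}
        bands = {k: _synthesize(bands['L' + k], bands['H' + k], axis)
                 for k in keys}
    return bands['']
```

Perfect reconstruction (`idwt3(dwt3(v)) == v`) is what allows the BRWF module to operate on the frequency sub-bands and still return valid spatial-domain features.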

Wavelet-decomposed frequency-domain features typically exhibit alternating positive and negative coefficients, which encode detail information such as image edges and textures. In this study, to prevent the loss of high-frequency information caused by negative-value truncation in ReLU activations [ReLU(x) = max(0, x)], we employ the GELU activation. Moreover, GELU's smooth activation profile makes cross-sub-band gradient propagation more stable, enabling efficient fusion and propagation of features across sub-bands. The mathematical expression of GELU is provided in Eq. [9].

\mathrm{GELU}(x) = x \, \Phi(x) \approx \frac{x}{2} \left( 1 + \tanh\left( \sqrt{\frac{2}{\pi}} \left( x + 0.044715 x^3 \right) \right) \right)

where Φ(x) is the cumulative distribution function of the Gaussian distribution.
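A one-line sketch of the tanh approximation in Eq. [9], illustrating the property motivating its use here: unlike ReLU, GELU keeps a small nonzero response for negative wavelet coefficients instead of truncating them to zero.

```python
import math

def gelu(x):
    """Tanh approximation of GELU (Eq. [9])."""
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```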

Feature classification

The feature classification subnetwork consists of three sequential components: GAP, an FC layer, and a Softmax activation function. The GAP operation compresses the 3D feature map obtained from the final BRWF module fusion into a compact vector of size 1×1×C, preserving channel-wise semantics while discarding spatial redundancies. Subsequently, the FC layer projects this vector into class-aligned logits through a linear transformation, completing the mapping from high-level semantic features to classification scores. Finally, the Softmax function normalizes these logits into probabilistic class predictions, with the maximal-probability category determining the diagnostic classification (NPN/BPN/MPN).
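The GAP → FC → Softmax pipeline above is a few lines of PyTorch; the channel count is whatever the final BRWF module emits (16 in the sketch is arbitrary):

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """GAP -> FC -> Softmax head producing NPN/BPN/MPN probabilities."""
    def __init__(self, channels, n_classes=3):
        super().__init__()
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, x):                 # x: (B, C, D, H, W)
        v = x.mean(dim=(2, 3, 4))         # GAP -> (B, C)
        return torch.softmax(self.fc(v), dim=1)  # class probabilities
```

The predicted class is then `probs.argmax(dim=1)`; in practice the Softmax is usually folded into the loss during training and applied explicitly only at inference.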

The cross-entropy loss function is employed to measure the discrepancy between the model’s predicted probability distribution and the annotated ground truth, guiding parameter optimization during training. The mathematical formulation of the cross-entropy loss is defined as Eq. [10].

\ell = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{C} y_{ij} \log x_{ij}

where xij represents the predicted probability, yij represents the true label, C represents the number of classes, and n represents the batch size.
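Eq. [10] reduces to a short NumPy expression; the small epsilon guards against log(0) and is a standard numerical-stability choice, not part of the paper.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    """Mean cross-entropy of Eq. [10] over a batch of n samples.
    probs, onehot: arrays of shape (n, C)."""
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))
```

A perfect prediction yields a loss near zero, while a uniform 3-class prediction yields log 3 ≈ 1.0986.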


Results

This section details the experimental configuration and presents the performance evaluation of the proposed model on a private PET/CT multimodal imaging dataset (see Table 1).

Experimental setup

We randomly split the dataset into 70% for training (570 image pairs) and 30% for testing (243 image pairs), ensuring a consistent evaluation protocol. Five-fold stratified cross-validation was performed on the training set for model selection and parameter tuning. The final results reported in this paper were obtained by evaluating the best-performing model on the held-out 30% test set. The model was trained for 200 epochs with early stopping to prevent overfitting, using the Adam optimizer with an initial learning rate of 1e−4 and cosine annealing decay. Detailed hyperparameter configurations are provided in Table 2.
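The split and cross-validation protocol can be sketched with scikit-learn. The paper states only a random 70/30 split; stratification of the hold-out split and the random seeds below are our assumptions for reproducibility of the sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# class counts from Table 1: 129 NPN, 346 BPN, 337 MPN
labels = np.array([0] * 129 + [1] * 346 + [2] * 337)
idx = np.arange(len(labels))

# 70/30 hold-out split (stratification and seed are assumptions)
train_idx, test_idx = train_test_split(
    idx, test_size=0.3, stratify=labels, random_state=42)

# five-fold stratified CV on the training portion for model selection
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(train_idx, labels[train_idx]))
```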

Table 2

Parameter settings of the proposed model

Parameter Setting Other
Learning rate 1e−4 Cosine annealing restarts
Epoch 200
Optimizer Adam
Batch size 8

The experiments were conducted using Python version 3.8 (Python Software Foundation, Wilmington, DE, USA), PyTorch 2.0.0, with Compute Unified Device Architecture (CUDA) version 11.8, and an NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA), running on Ubuntu 20.04 (Canonical, London, UK).

Evaluation metrics

Model performance was evaluated using four established evaluation metrics: accuracy, precision, recall, and F1 score, with their mathematical formulations formally defined in Eqs. [11-14].

\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}

\mathrm{Pre} = \frac{TP}{TP + FP}

\mathrm{Recall} = \frac{TP}{TP + FN}

\mathrm{F1\,score} = \frac{2 \times \mathrm{Pre} \times \mathrm{Recall}}{\mathrm{Pre} + \mathrm{Recall}}

where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively.
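For a three-class task, Eqs. [11-14] are computed per class and then averaged; scikit-learn makes this explicit. The paper does not state its averaging scheme, so macro averaging below is an assumption, as are the toy labels.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 1, 2, 2, 1, 0, 2, 1]  # toy NPN/BPN/MPN ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]  # toy predictions

acc = accuracy_score(y_true, y_pred)
# macro averaging of the per-class Eqs. [12-14] (assumed scheme)
pre = precision_score(y_true, y_pred, average='macro')
rec = recall_score(y_true, y_pred, average='macro')
f1 = f1_score(y_true, y_pred, average='macro')
```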

Experimental results

Performance results

To validate the classification efficacy of the proposed architecture, we compared the classification results of PET and CT single-modal, as well as the dual-modal feature fusion using three distinct fusion strategies. The evaluation metrics for the classification of the test samples are shown in Table 3.

Table 3

Experimental results of single-modal and different multi-modal fusion strategies

Modality Fusion strategy Accuracy Precision Recall F1 score
PET 0.7490 0.7304 0.7642 0.7450
CT 0.7429 0.7309 0.7472 0.7395
PET + CT Concat 0.7521 0.7330 0.7717 0.7464
PET + CT Add 0.7531 0.7445 0.7718 0.7585
PET + CT BRWF 0.7778 0.7590 0.7968 0.7725

BRWF, bidirectional residual wavelet-convolution fusion; CT, computed tomography; PET, positron emission tomography.

As shown in Table 3, all three multimodal fusion strategies outperformed the single-modal baseline across all evaluation metrics, confirming that integrating the bidirectional complementary information from PET and CT significantly enhances the model’s ability to identify lesions, thereby improving overall clinical diagnostic accuracy. Our proposed BRWF fusion strategy achieved the highest performance, with a particularly notable improvement in recall, which directly reduces the risk of missed diagnosis of malignant nodules (false negatives) and underscores its critical clinical value in developing reliable and robust medical imaging-assisted diagnostic systems. Although BRWF introduces a slight computational overhead, the substantial performance gain clearly justifies this cost.

Analysis of the proposed model’s prediction results through the confusion matrix (see Figure 6) indicates that the predominant classification errors occur between benign and malignant nodule categories.

Figure 6 Confusion matrix of classification results. BPN, benign pulmonary nodule; MPN, malignant pulmonary nodule; NPN, no pulmonary nodule.

Ablation experiment

The inter-dense-block connection (Inter-DBC) mechanism augments multi-scale feature fusion through cross-layer feature-reuse pathways, thereby enhancing fine-grained discrimination between benign and malignant nodules. Table 4 summarizes the classification performance of the proposed model with and without the Inter-DBC mechanism.

Table 4

The impact of Inter-DBC on model performance

Inter-DBC Accuracy Precision Recall F1 score
× 0.7654 0.7460 0.7928 0.7617
✓ 0.7778 0.7590 0.7968 0.7725

Inter-DBC, inter-dense block connections.

As shown in Table 4, the Inter-DBC module significantly improved the model’s accuracy, precision, and F1 score. These improvements indicate that Inter-DBC enhances both the overall discriminative ability and decision confidence in nodule classification. Specifically, the increase in precision reduces false positives and minimizes unnecessary biopsies, whereas the higher F1 score indicates a better balance between sensitivity and specificity. Collectively, these improvements strengthen the model’s reliability and clinical applicability for early lung cancer screening.
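The cross-layer feature-reuse idea behind Inter-DBC can be sketched in a few lines of numpy: each block receives the channel-wise concatenation of all earlier block outputs, not just the previous one. The block internals below are simplified stand-ins (a random projection with ReLU), not the actual dense-block layers of the paper.

```python
import numpy as np

def block(x, out_ch):
    # Stand-in for a dense block: random 1x1 projection + ReLU.
    w = np.random.default_rng(0).standard_normal((x.shape[0], out_ch))
    return np.maximum(x.T @ w, 0).T  # (out_ch, N) feature map

def forward_with_inter_dbc(x, widths=(8, 8, 8)):
    collected = [x]
    for w in widths:
        inp = np.concatenate(collected, axis=0)  # reuse all earlier outputs
        collected.append(block(inp, w))
    return np.concatenate(collected, axis=0)

feats = forward_with_inter_dbc(np.ones((4, 16)))  # 4 channels, 16 voxels
print(feats.shape)  # channels accumulate: 4 + 8 + 8 + 8
```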

The proposed framework integrates three core components: the HGMC and DDGR feature extraction modules, along with the BRWF fusion module. To systematically evaluate their synergistic contributions, we conducted ablation studies examining different module combinations. The evaluation metrics obtained under different module combination strategies are shown in Table 5. It should be noted that when a module is not used, the original structure of DenseNet in that module is maintained.

Table 5

The impact of different combination strategies on model performance

Case HGMC DDGR BRWF Accuracy Precision Recall F1 score
#1 × × × 0.7133 0.7151 0.7457 0.7237
#2 ✓ × × 0.7457 0.7246 0.7700 0.7371
#3 × ✓ × 0.7362 0.7245 0.7479 0.7344
#4 × × ✓ 0.7407 0.7162 0.7511 0.7278
#5 ✓ ✓ × 0.7521 0.7309 0.7717 0.7464
#6 ✓ × ✓ 0.7654 0.7486 0.7927 0.7633
#7 × ✓ ✓ 0.7572 0.7407 0.7862 0.7558
#8 ✓ ✓ ✓ 0.7778 0.7590 0.7968 0.7725

BRWF, bidirectional residual wavelet-convolution fusion; DDGR, dynamic deformation-aware and gradient-enhanced residual; HGMC, heterogeneous grouped multi-scale convolution module.

As shown in Table 5, adding any individual module led to noticeable performance improvements, with the HGMC module (Case #2) contributing the largest single-module gains (+3.24% accuracy, +2.43% recall). Module combinations exhibited synergistic enhancement effects, particularly the HGMC + BRWF combination (Case #6), which achieved remarkable improvements (+5.21% accuracy, +4.70% recall). Integrating all three modules (Case #8) achieved the best overall performance, with absolute gains over the baseline ranging from 0.0439 to 0.0645 across the evaluation metrics. These results validate the soundness of each module’s design and the effectiveness of their synergistic mechanisms, and demonstrate the consistent performance improvement achieved through multi-module integration, providing empirical support for the model’s reliability and applicability as a clinical decision support tool.

To investigate the impact of hierarchical feature fusion depth, we conducted an ablation experiment on the fusion positions of hierarchical features, evaluating the performance of the BRWF module when fusing PET/CT features after different depths of the DDGR layers. The experimental results are shown in Table 6. Fusion positions 1 to 3 correspond to the output layers of the 1st to 3rd DDGR blocks, and position 4 corresponds to the output layer of the last dense-HGMC block in the feature extraction subnetwork.

Table 6

The impact of different fusion positions on model performance

Case Fusion location (1 2 3 4) Accuracy Precision Recall F1 score
#1 × × × × 0.7521 0.7309 0.7717 0.7464
#2 × ✓ ✓ ✓ 0.7695 0.7512 0.8017 0.7667
#3 ✓ × ✓ ✓ 0.7712 0.7496 0.7912 0.7654
#4 ✓ ✓ × ✓ 0.7737 0.7529 0.7883 0.7679
#5 ✓ ✓ ✓ × 0.7751 0.7556 0.7937 0.7703
#6 ✓ ✓ ✓ ✓ 0.7778 0.7590 0.7968 0.7725

As shown in Table 6, all multi-layer fusion strategies outperform the non-fusion baseline, with four-position fusion achieving the best overall gains (+2.57% accuracy, +2.81% precision, +2.51% recall, +2.61% F1 score). This performance improvement demonstrates that our hierarchical cross-modal interaction effectively captures complementary features—from local texture in shallow layers to global semantics in deeper layers. Consequently, it enhances the recognition of subtle malignant patterns and increases diagnostic confidence in distinguishing benign from malignant nodules.
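The hierarchical fusion scheme ablated above can be sketched as follows: features are extracted in parallel PET and CT streams, and at selected depths a fused map is computed and re-injected into both streams. The `extract` and `fuse` functions below are illustrative placeholders, not the paper's actual DDGR or BRWF layers.

```python
import numpy as np

def extract(x):       # stand-in for one DDGR / dense-HGMC stage
    return np.tanh(x + 0.1)

def fuse(pet, ct):    # stand-in for BRWF: a simple average here
    return 0.5 * (pet + ct)

def dual_path(pet, ct, fusion_positions=(1, 2, 3, 4), depth=4):
    for d in range(1, depth + 1):
        pet, ct = extract(pet), extract(ct)
        if d in fusion_positions:              # the set ablated in Table 6
            shared = fuse(pet, ct)
            pet, ct = pet + shared, ct + shared  # bidirectional re-injection
    return np.concatenate([pet, ct])

out = dual_path(np.zeros(8), np.ones(8))
print(out.shape)  # (16,)
```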

Comparative analysis

We compared our framework with several representative backbone models, including ResNet (23), DenseNet (42), Inception (43), MobileViT (44), EfficientNet-V2 (45), Swin Transformer (46), ConvNeXt (47), and EfficientFormer (48). All models were trained on the same dataset with consistent preprocessing, training settings, and evaluation metrics. For a fair comparison, we replaced only the dual-path dense connection module with each benchmark backbone, keeping the fusion module (BRWF), classification layers, and loss function unchanged. This setup allowed us to isolate the effect of different feature extractors within a unified multimodal classification framework. The comparative results are presented in Figure 7 and Table 7.

Figure 7 Comparison of the proposed model with mainstream CNN classification models. The vertical axis represents the performance score for each evaluation metric, including accuracy (Acc), precision (Pre), recall, and F1 score. CNN, convolutional neural network.

Table 7

Comparison of model runtimes

Models Time (s/epoch) Params (MB)
ResNet 203.58 256.64
Inception 205.32 239.30
EfficientNet-V2 198.64 221.72
MobileViT 212.63 56.51
EfficientFormer 232.79 123.56
Swin Transformer 288.46 283.76
ConvNeXt 230.99 286.71
Ours 257.73 139.76

As seen in Figure 7, the proposed model demonstrates clear performance advantages, especially in accuracy and recall. This indicates that it effectively controls the risk of missed detection of PNs while maintaining overall classification accuracy, which is of great value for clinical decision support.

As shown in Table 7 and Figure 7, compared with the Swin Transformer model, the proposed method achieves markedly higher classification performance with fewer parameters and a shorter runtime, demonstrating both its effectiveness and efficiency in lung nodule classification. Although its runtime per epoch is slightly longer than that of some other models, the relatively small parameter count and substantial performance improvement make this modest computational overhead an acceptable trade-off. Overall, these advantages make the model well suited for practical clinical deployment.


Discussion

This section presents a concise discussion of key factors influencing performance, including the causes of prediction errors, the influence of the feature extraction subnetworks, and a comparison of multimodal fusion methods, as well as the study’s limitations and future research directions.

The causes of prediction errors

From the analysis of the confusion matrix shown in Figure 6, it is clear that the model’s classification errors mainly stem from the difficulty of discriminating benign from malignant nodules. Further analysis shows that this misclassification is closely related to the interclass overlap of nodule image features, as illustrated in Figure 8. Specifically, some benign nodules exhibited the lobulation and spiculation morphology typical of malignant nodules in CT images (Figure 8A). There was also bidirectional interference in PET/CT images. On the one hand, some benign nodules showed inflammation-induced hypermetabolism similar to that of malignant tumors in PET images, appearing as high-uptake “hot zones” (Figure 8B). On the other hand, some malignant nodules with low proliferative activity showed a markedly decreased SUVmax (Figure 8C). This bidirectional heterogeneity causes the imaging characteristics of benign and malignant nodules to overlap, which seriously challenges the feature-decoupling ability of the DL model and significantly increases the difficulty of constructing an effective classification decision boundary.

Figure 8 Cases of misclassification by the proposed classification model in the classification of benign and malignant pulmonary nodules. Benign nodules (A,B) were incorrectly predicted as malignant nodules; malignant nodules (C) were incorrectly predicted as benign nodules. The top row is the CT images, and the bottom row is the PET images. The nodules on the CT images and PET images are highlighted with orange and blue circles, respectively. Arrow points to a pulmonary nodule. CT, computed tomography; PET, positron emission tomography.

Although DL demonstrates substantial potential in medical image analysis, its classification efficacy remains constrained by its inherent dependence on data. Unlike general computer vision tasks that leverage large-scale annotated datasets, medical imaging faces fundamental data constraints arising from three key factors: patient sample scarcity, ethical acquisition limitations, and multi-institutional sharing barriers, particularly those imposed by privacy regulations. Together, these challenges restrict the effectiveness of traditional data augmentation approaches in enhancing the model’s generalization capability.

The impact of feature extraction mechanisms on model performance

The systematic ablation studies reported above demonstrated the critical roles of hierarchical feature extraction and multimodal fusion in optimizing classification performance. Building on these findings, this section examines how distinct convolutional architectures within the feature extraction subnetwork affect diagnostic efficacy through a comparative analysis of alternative convolutional module implementations.

In the classification network shown in Figure 1, we replaced the core feature extraction module, the dense-HGMC block, with a residual block, an Inception block, and a standard dense block for comparative experiments, keeping the rest of the network structure unchanged. The experimental results, shown in Figure 9, indicate that the standard dense block, with its dense connection mechanism for feature reuse, achieves overall performance second only to the dense-HGMC block. Building upon the dense connection mechanism of the dense block, our HGMC module improves performance through a dual-optimization strategy: it first captures multi-directional complementary features using a heterogeneous spatial convolution sequence, and then dynamically weights features via an MSA module, significantly enhancing the model’s ability to recognize lesions of various sizes. Experimental results demonstrate that introducing the HGMC module yields improvements across all four key performance metrics: accuracy (+2.06%), precision (+1.83%), recall (+1.06%), and F1 score (+1.67%). These comparative results confirm that an efficient feature extraction mechanism, particularly one with adaptive modeling capabilities for the diverse morphology of lesions in medical images, is key to improving the accuracy of benign and malignant PN classification.

Figure 9 The impact of feature extraction mechanisms on model performance. The vertical axis represents the performance score for each evaluation metric, including accuracy (Acc), precision (Pre), recall, and F1 score. HGMC, heterogeneous grouped multi-scale convolution.
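The grouped multi-scale idea behind HGMC can be illustrated compactly: channels are split into groups, each group passes through a convolution with a different receptive field, and an attention vector reweights the groups. In the sketch below, 1-D moving averages of different widths stand in for the heterogeneous 3-D kernels, and a softmax over group means stands in for the MSA weighting; all shapes and the gating form are assumptions for illustration.

```python
import numpy as np

def smooth(x, k):
    # Stand-in for a convolution with kernel size k (per-row moving average).
    kernel = np.ones(k) / k
    return np.apply_along_axis(lambda r: np.convolve(r, kernel, "same"), 1, x)

def hgmc(x, scales=(1, 3, 5, 7)):
    groups = np.array_split(x, len(scales), axis=0)          # channel grouping
    feats = [smooth(g, k) for g, k in zip(groups, scales)]   # multi-scale paths
    energy = np.array([f.mean() for f in feats])             # stand-in for MSA
    weights = np.exp(energy) / np.exp(energy).sum()          # softmax gating
    return np.concatenate([w * f for w, f in zip(weights, feats)], axis=0)

out = hgmc(np.random.default_rng(1).standard_normal((8, 32)))
print(out.shape)  # shape is preserved: (8, 32)
```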

Comparative analysis of multimodal fusion methods

In recent years, researchers have proposed various PET/CT fusion strategies. Li et al. (15) employed a dense fusion approach combining concatenation and weighted voting to integrate information at both the channel and pixel levels, whereas Shao et al. (40) achieved pixel-level fusion through direct summation. Although these spatial-domain methods enhance the complementarity between metabolic and structural features, the fusion process remains a static, linear operation. It lacks explicit cross-modal interaction, making it difficult to capture nonlinear dependencies and frequency-dependent features. Li et al. (49) addressed explicit feature interaction using cross-attention, but this approach has high computational complexity and remains confined to the spatial domain, neglecting frequency-domain priors.

To address these limitations, this paper proposes the BRWF fusion method, which enables cross-modal interaction in the spatial domain while incorporating DWT to extract frequency-domain information. Through cross-frequency convolution cascading and a bidirectional residual redistribution mechanism, it models synergistic relationships across different frequencies, allowing adaptive enhancement between morphological and metabolic features. This approach addresses the limitations of traditional spatial-domain and attention-based fusion, enhancing the discriminative power and interpretability of PET/CT lung nodule classification.
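The frequency-domain exchange at the heart of this idea can be illustrated with a one-level Haar DWT: each modality is decomposed into approximation (low-frequency, morphology-like) and detail (high-frequency) sub-bands, the streams exchange weighted residuals of each other's sub-bands, and the signals are reconstructed. The real BRWF module operates on 3-D feature maps with learned convolutions; this 1-D sketch, with an assumed exchange weight `alpha`, only illustrates the bidirectional residual mechanism.

```python
import numpy as np

def haar_dwt(x):
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation band
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail band
    return lo, hi

def haar_idwt(lo, hi):
    out = np.empty(lo.size * 2)
    out[0::2] = (lo + hi) / np.sqrt(2)
    out[1::2] = (lo - hi) / np.sqrt(2)
    return out

def brwf_like(pet, ct, alpha=0.5):
    pl, ph = haar_dwt(pet)
    cl, ch = haar_dwt(ct)
    # Bidirectional residual exchange: each stream absorbs a weighted
    # residual of the other modality's sub-bands, then reconstructs.
    pet_f = haar_idwt(pl + alpha * (cl - pl), ph + alpha * (ch - ph))
    ct_f = haar_idwt(cl + alpha * (pl - cl), ch + alpha * (ph - ch))
    return pet_f, ct_f

pet, ct = np.arange(8.0), np.ones(8)
pf, cf = brwf_like(pet, ct)
```

With `alpha=0.5` the exchange averages the two streams' sub-bands, so by linearity of the transform both fused signals equal the pointwise average of the inputs, a useful sanity check on the reconstruction.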

The limitations and future improvement directions

Although the proposed PET/CT fusion framework achieved competitive performance, several limitations remain. First, the cohort was from a single center and of moderate size, which may constrain external validity. In future work, we plan to construct a multi-center, heterogeneous PET/CT database via cross-institutional data integration to strengthen generalizability.

Second, the “black-box” nature of DL models limits interpretability and clinical trust. To address this, we will develop a deep-artificial hybrid feature-fusion framework that combines CNN high-level semantics with expert-annotated morphological descriptors (e.g., texture heterogeneity, edge regularity) through cross-modal fusion, thereby improving transparency and clinical alignment.

Third, although the BRWF module adaptively preserves frequency information, the feature extraction capacity could be further enhanced by incorporating classical priors and relational modeling to capture anatomical dependencies.


Conclusions

This study presents an automated framework for PN diagnosis based on multimodal feature interaction, significantly enhancing the discriminative performance of PET/CT imaging through an innovative dual-modal feature fusion mechanism. First, a cross-modal data preprocessing pipeline was developed to ensure precise alignment of multimodal images through intensity calibration, normalization, and 3D spatial registration. Next, a dual-path densely connected network architecture was designed to separately extract PET metabolic features and CT morphological features. Building on this, a cross-modal feature interaction module was introduced, employing a multi-level heterogeneous feature fusion strategy to integrate complementary metabolic and morphological information into discriminative feature representations. Experimental results on a clinical PET/CT dataset demonstrate that the proposed model outperforms mainstream comparative models in classification performance, achieving accuracy, precision, recall, and F1 score values of 0.7778, 0.7590, 0.7968, and 0.7725, respectively.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1037/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1037/dss

Funding: This work was supported in part by the National Natural Science Foundation of China (Nos. 62362058 and 61562075), the Key R&D Plan of Gansu Province (Nos. 24YFGA048 and 21YF5GA063), the Gansu Province’s Key Provincial Talents Program (No. 2023RCXM56), the Young and Middle-aged Talents Training Program of State Ethnic Affairs Commission, the Natural Science Foundation of Gansu Province (Nos. 20JR5RA511 and 22JR11RA236), the Fundamental Research Fund for the Central Universities (Nos. 31920240092, 31920240099 and 31920250034), and the Youth Ph.D. Foundation of Education Department of Gansu Province (No. 2021QB-063).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1037/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Ethics Committee of the 940th Hospital of the Joint Logistics Support Force of the Chinese People’s Liberation Army (No. 2024KYLL339) and informed consent was taken from all the patients.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 2021;71:209-49. [Crossref] [PubMed]
  2. Woodard GA, Jones KD, Jablons DM. Lung Cancer Staging and Prognosis. Cancer Treat Res 2016;170:47-75. [Crossref] [PubMed]
  3. Qian F, Yang W, Chen Q, Zhang X, Han B. Screening for early stage lung cancer and its correlation with lung nodule detection. J Thorac Dis 2018;10:S846-59. [Crossref] [PubMed]
  4. Ruparel M, Quaife SL, Navani N, Wardle J, Janes SM, Baldwin DR. Pulmonary nodules and CT screening: the past, present and future. Thorax 2016;71:367-75. [Crossref] [PubMed]
  5. Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, Fagerstrom RM, Gareen IF, Gatsonis C, Marcus PM, Sicks JD. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365:395-409. [Crossref] [PubMed]
  6. Kligerman S, Digumarthy S. Staging of non-small cell lung cancer using integrated PET/CT. AJR Am J Roentgenol 2009;193:1203-11. [Crossref] [PubMed]
  7. Davidson MR, Gazdar AF, Clarke BE. The pivotal role of pathology in the management of lung cancer. J Thorac Dis 2013;5:S463-78. [Crossref] [PubMed]
  8. Rami-Porta R, Crowley JJ, Goldstraw P. The revised TNM staging system for lung cancer. Ann Thorac Cardiovasc Surg 2009;15:4-9.
  9. Han Y, Ma Y, Wu Z, Zhang F, Zheng D, Liu X, Tao L, Liang Z, Yang Z, Li X, Huang J, Guo X. Histologic subtype classification of non-small cell lung cancer using PET/CT images. Eur J Nucl Med Mol Imaging 2021;48:350-60. [Crossref] [PubMed]
  10. Zhou T, Zhang X, Lu H, Li Q, Liu L, Zhou H. GMRE-iUnet: Isomorphic Unet fusion model for PET and CT lung tumor images. Comput Biol Med 2023;166:107514. [Crossref] [PubMed]
  11. Boellaard R, O’Doherty MJ, Weber WA, Mottaghy FM, Lonsdale MN, Stroobants SG, et al. FDG PET and PET/CT: EANM procedure guidelines for tumour PET imaging: version 1.0. Eur J Nucl Med Mol Imaging 2010;37:181-200. [Crossref] [PubMed]
  12. Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev 2020;53:5455-516.
  13. Jiang H, Diao Z, Shi T, Zhou Y, Wang F, Hu W, Zhu X, Luo S, Tong G, Yao YD. A review of deep learning-based multiple-lesion recognition from medical images: classification, detection and segmentation. Comput Biol Med 2023;157:106726. [Crossref] [PubMed]
  14. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115-8. [Crossref] [PubMed]
  15. Li T, Mao J, Yu J, Zhao Z, Chen M, Yao Z, Fang L, Hu B. Fully automated classification of pulmonary nodules in positron emission tomography-computed tomography imaging using a two-stage multimodal learning approach. Quant Imaging Med Surg 2024;14:5526-40. [Crossref] [PubMed]
  16. Ma B, Feng Y, Chen G, Li C, Xia Y. Federated adaptive reweighting for medical image classification. Pattern Recognition 2023;144:109880.
  17. Ma X, Lin Q, Zeng X, Cao Y, Man Z, Liu C, Huang X. Interactive segmentation for accurately isolating metastatic lesions from low-resolution, large-size bone scintigrams. Phys Med Biol 2025;
  18. Cao Y, Liu L, Chen X, Man Z, Lin Q, Zeng X, Huang X. Segmentation of lung cancer-caused metastatic lesions in bone scan images using self-defined model with deep supervision. Biomedical Signal Processing and Control 2023;79:104068.
  19. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
  20. Wu X, Zhang H, Sun J, Wang S, Zhang Y. YOLO-MSRF for lung nodule detection. Biomedical Signal Processing and Control 2024;94:106318.
  21. Bansal S, Singh M, Dubey RK, Panigrahi BK. Multi-objective Genetic Algorithm Based Deep Learning Model for Automated COVID-19 Detection Using Medical Image Data. J Med Biol Eng 2021;41:678-89. [Crossref] [PubMed]
  22. Liu Y, Li Y, Jiang M, Wang S, Ye S, Walsh S, Yang G. SOCR-YOLO: Small Objects Detection Algorithm in Medical Images. Int J Imaging Syst Technol 2024;34:e23130.
  23. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition 2016:770-8.
  24. Kumar R, Kumbharkar P, Vanam S, Sharma S. Medical images classification using deep learning: a survey. Multimedia Tools and Applications 2024;83:19683-728.
  25. Pereira S, Pinto A, Alves V, Silva CA. Brain Tumor Segmentation Using Convolutional Neural Networks in MRI Images. IEEE Trans Med Imaging 2016;35:1240-51. [Crossref] [PubMed]
  26. Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, Tse D, Etemadi M, Ye W, Corrado G, Naidich DP, Shetty S. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019;25:954-61. [Crossref] [PubMed]
  27. Shen W, Zhou M, Yang F, Yu D, Dong D, Yang C, Zang Y, Tian J. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition 2017;61:663-73.
  28. Zhang K, Liu X, Shen J, Li Z, Sang Y, Wu X, et al. Clinically Applicable AI System for Accurate Diagnosis, Quantitative Measurements, and Prognosis of COVID-19 Pneumonia Using Computed Tomography. Cell 2020;181:1423-1433.e11. [Crossref] [PubMed]
  29. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition 2018;7132-41.
  30. Lu D, Popuri K, Ding GW, Balachandar R, Beg MF. Multimodal and multiscale deep neural networks for the early diagnosis of Alzheimer’s disease using structural MR and FDG-PET images. Sci Rep 2018;8:5697. [Crossref] [PubMed]
  31. Li Y, Zhao J, Lv Z, Li J. Medical image fusion method by deep learning. International Journal of Cognitive Computing in Engineering 2021;2:21-9.
  32. Zhang T, Shi M. Multi-modal neuroimaging feature fusion for diagnosis of Alzheimer’s disease. J Neurosci Methods 2020;341:108795. [Crossref] [PubMed]
  33. Steyaert S, Pizurica M, Nagaraj D, Khandelwal P, Hernandez-Boussard T, Gentles AJ, Gevaert O. Multimodal data fusion for cancer biomarker discovery with deep learning. Nat Mach Intell 2023;5:351-62. [Crossref] [PubMed]
  34. Zhu L, Xu Y, Fu H, Xu X, Goh RSM, Liu Y. Partially Supervised Unpaired Multi-modal Learning for Label-Efficient Medical Image Segmentation. In: Xu X, Cui Z, Rekik I, Ouyang X, Sun K. editors. Machine Learning in Medical Imaging. MLMI 2024. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland; 2025. doi: 10.1007/978-3-031-73290-4_9.
  35. Castellano G, Esposito A, Lella E, Montanaro G, Vessio G. Automated detection of Alzheimer’s disease: a multi-modal approach with 3D MRI and amyloid PET. Sci Rep 2024;14:5210. [Crossref] [PubMed]
  36. Kim B, Lee GY, Park SH. Attention fusion network with self-supervised learning for staging of osteonecrosis of the femoral head (ONFH) using multiple MR protocols. Med Phys 2023;50:5528-40. [Crossref] [PubMed]
  37. Joo S, Ko ES, Kwon S, Jeon E, Jung H, Kim JY, Chung MJ, Im YH. Multimodal deep learning models for the prediction of pathologic response to neoadjuvant chemotherapy in breast cancer. Sci Rep 2021;11:18800. [Crossref] [PubMed]
  38. Gao X, Shi F, Shen D, Liu M. Task-Induced Pyramid and Attention GAN for Multimodal Brain Image Imputation and Classification in Alzheimer’s Disease. IEEE J Biomed Health Inform 2022;26:36-43. [Crossref] [PubMed]
  39. He M, Han K, Zhang Y, Chen W. Hierarchical-order multimodal interaction fusion network for grading gliomas. Phys Med Biol 2021;66:215016. [Crossref] [PubMed]
  40. Shao X, Niu R, Shao X, Gao J, Shi Y, Jiang Z, Wang Y. Application of dual-stream 3D convolutional neural network based on (18)F-FDG PET/CT in distinguishing benign and invasive adenocarcinoma in ground-glass lung nodules. EJNMMI Phys 2021;8:74. [Crossref] [PubMed]
  41. Hofmanninger J, Prayer F, Pan J, Röhrich S, Prosch H, Langs G. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur Radiol Exp 2020;4:50. [Crossref] [PubMed]
  42. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017:2261-9. doi: 10.1109/CVPR.2017.243.
  43. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016;2818-26. doi: 10.1109/CVPR.2016.308.
  44. Liu J, Chen H, Zhou W. Improved mobilevit: A more efficient light-weight convolution and vision transformer hybrid model. J. Phys.: Conf. Ser 2023;2562:012012.
  45. Tan M, Le Q. EfficientNetV2: Smaller models and faster training. Proceedings of the 38th International Conference on Machine Learning, PMLR 2021;139:10096-106.
  46. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada. 2021:9992-10002. doi: 10.1109/ICCV48922.2021.00986.
  47. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA. 2022:11966-76. doi: 10.1109/CVPR52688.2022.01167.
  48. Li Y, Yuan G, Wen Y, Hu J, Evangelidis G, Tulyakov S, Wang Y, Ren J. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems 2022;35:12934-49.
  49. Li K, Li T, Zhang L, Mao J, Shi X, Yao Z, Fang L, Hu B. MI2A: A Multimodal Information Interaction Architecture for Automated Diagnosis of Lung Nodules Using PET/CT Imaging. IEEE Sensors Journal 2025;25:28547-59.
Cite this article as: Liao F, Cao Y, Mao J, Lin Q, Man Z, Cai Z, Huang X. Automated diagnosis of pulmonary nodules in 3D PET/CT images using dual-path densely connected networks with cross-modal fusion. Quant Imaging Med Surg 2026;16(1):9. doi: 10.21037/qims-2025-1037