Downsampling attention fusion network (DAFNet): a You Only Look Once network for lung nodule detection

Linfang Li; Xuan Wen; Mengfei Li; Xinfa Wang; Xiangrong Feng; Xiangpeng Lv

doi:10.21037/qims-2026-1-0050

Original Article

Linfang Li¹, Xuan Wen², Mengfei Li³, Xinfa Wang², Xiangrong Feng¹, Xiangpeng Lv⁴

¹School of Information Engineering, Henan Institute of Science and Technology, Xinxiang, China; ²School of Computer Science and Technology, Henan Institute of Science and Technology, Xinxiang, China; ³Central Hospital of Jiaozuo Coal Group Co. Ltd., Jiaozuo, China; ⁴SICU of Civil Aviation General Hospital, Civil Aviation Clinical Medical College of Peking University, Beijing, China

Contributions: (I) Conception and design: L Li, X Feng; (II) Administrative support: X Lv, X Wang; (III) Provision of study materials or patients: M Li; (IV) Collection and assembly of data: M Li, X Wen; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Xiangpeng Lv, MM. SICU of Civil Aviation General Hospital, Civil Aviation Clinical Medical College of Peking University, No.76 Chaoyang Road, Chaoyang District, Beijing 100000, China. Email: lvxppku@163.com.

Background: Lung nodule detection in computed tomography (CT) imaging is crucial for the early diagnosis and treatment of lung diseases. However, accurate detection remains challenging due to the varying sizes, shapes, and densities of nodules, as well as interference from surrounding tissues. This study aimed to develop a novel deep learning framework to improve the accuracy and efficiency of lung nodule detection.

Methods: We developed the downsampling attention fusion network (DAFNet). Specifically, we incorporated a dual-branch downsampling module that employs a parallel learning strategy to effectively extract multiscale features while reducing computational complexity compared with traditional convolutional modules. Additionally, we designed a global attention module (GAM), which cascades the channel and spatial attention mechanisms to enhance feature representation across different dimensions. To evaluate DAFNet, experiments were conducted on a lung CT image dataset (LCTD) and the public dataset Lung Nodule Analysis 2016 (LUNA16). LCTD comprises 2,172 CT images from 1,060 individuals. On LUNA16, common object detection metrics such as mean average precision (mAP) and recall were adopted for evaluation, rather than the officially specified free-response receiver operating characteristic (FROC) evaluation protocol.

Results: On the augmented LCTD, DAFNet achieved 93.2% precision and 90.7% recall. Meanwhile, on LUNA16, DAFNet outperformed state-of-the-art methods in terms of mAP. Furthermore, DAFNet achieved GPU-accelerated near-real-time inference with a processing speed of 1.5 ms per image, and the model parameter size was only 2.5 M, enabling efficient and lightweight inference.

Conclusions: Our study examined a novel method for detecting lung nodules, offering new insights and technical references for relevant research fields. The proposed modules can be flexibly and seamlessly integrated into various detection frameworks as plug-and-play components, effectively enhancing the flexibility and scalability of lung nodule detection. As our DAFNet does not incorporate false-positive reduction or malignancy grading in its nodule detection, the clinical translation and practical application of the model remain to be further validated and optimized in future work.

Keywords: Lung nodule detection; YOLOv11; adaptive downsampling module (ADSM); global attention module (GAM)

Submitted Jan 08, 2026. Accepted for publication May 15, 2026. Published online Jun 10, 2026.

Introduction

Lung cancer remains a significant global health challenge (1). According to data from the World Health Organization, lung cancer accounts for over 1.76 million deaths annually worldwide (2). Lung nodules, a primary indicator of lung cancer, typically manifest as abnormal growths, with diameters less than 30 mm. Their small size and subtle presentation often make early detection difficult, leading to missed opportunities for timely diagnosis (3,4). Furthermore, lung nodules exhibit considerable morphological heterogeneity, uncertain distribution, a tendency to adhere to surrounding tissues, and a lack of specific distinguishing features, posing significant challenges in clinical diagnosis. Computed tomography (CT)-based noninvasive diagnostic methods are widely employed for lung nodule detection (5). However, the volumetric nature of CT imaging generates a massive amount of data, requiring clinicians to manually screen large datasets. This process is labor-intensive and susceptible to fatigue, inefficiencies, and diagnostic errors, including missed or incorrect diagnoses (6). Consequently, the development of automated detection methods is crucial for enhancing the accuracy and efficiency of lung nodule identification and ultimately improving clinical decision-making and patient outcomes.

To address these challenges, researchers have proposed various approaches for lung nodule detection, broadly categorized into traditional and deep learning-based methods. Traditional approaches for lung nodule detection primarily depend on the analysis of physical and visual features such as nodule morphology, density, texture, and spatial information. These typically involve techniques for feature extraction and target-based classification. Regarding feature extraction, the main methods include threshold segmentation, morphological operations, and principal component analysis (7,8). In contrast, target-based classification methods include clustering and support vector machines (SVMs) (9,10). For example, Wu and Wang (11) proposed a method of selecting regions within the gray-scale range of lung nodules through threshold segmentation and then identifying and locating suspected nodules based on shape characteristics. However, the regions of lung nodules selected by this method are not sufficiently accurate. Meanwhile, in Khan et al.’s approach (12), CT images are subjected to contrast enhancement, segmentation, and feature extraction, after which the constructed feature set is fed into an SVM classifier to eliminate lung nodules; however, the generalizability of this method is limited. Unfortunately, traditional detection methods are not only time-consuming but also offer insufficient representation ability when the extracted features are overly simple, and weak generalization and poor robustness when the extracted features are too specific.

In recent years, advanced feature extraction methods have demonstrated promise in the field of medical image analysis, and the research in this field has examined various approaches, such as attention mechanism-enhanced segmentation networks and deep learning-driven image denoising and enhancement (13-15). These studies have provided the foundation for further improvement of image representation capabilities. For the specific task of lung nodule detection, the currently available mainstream deep learning methods are typically divided into two-stage and one-stage methods. Two-stage methods generate the regions of interest, which are then subjected to target detection. Examples of these methods include region-based convolutional neural network (R-CNN) (16), faster region-based convolutional neural network (Faster R-CNN) (17), and mask region-based convolutional neural network (Mask R-CNN) (18), among others. Inspired by Faster R-CNN, Xu et al. (19) employed a multiscale training strategy and deformable convolution to enlarge the receptive field for target detection. In experiments on the Lung Nodule Analysis 2016 (LUNA16) dataset, the accuracy of this method was 90.7%, superior to the 76.4% achieved by Faster R-CNN. However, this method entails high computational complexity and considerable resource requirements. Su et al. (20) also employed Faster R-CNN for lung nodule detection, and, by modifying the structure and optimizing the parameters, achieved an accuracy of 91.2%. In general, two-stage detection methods are problematic with respect to real-time performance and candidate box localization, which limits their accuracy and robustness in the detection of lung nodules.

One-stage methods include single-shot detector (SSD) (21), RetinaNet (22), and the You Only Look Once (YOLO) series, among others. Khosravan and Bagci (23) proposed single-shot, single-scale lung nodule detection, which includes a unified network to construct a three-dimensional CNN detection path with a single feed-forward channel. However, it is difficult for the method to adequately extract the multimodal morphological features of lung nodules. Tang et al. (24) introduced a multiscale dual-branch attention module for feature extraction and used the dual-branch attention module for feature fusion. However, the complex network structure limited the robustness of the model and caused an expansion in the number of parameters. Inspired by YOLOv4, Mei et al. (25) integrated depthwise over-parameterized convolutional layers and convolutional attention modules to extract lung nodule features and improved the focal loss function to optimize model training. However, the model performance was hampered by lack of precision in the channel-pruning strategy. Inspired by YOLOv7, Wu et al. (26) proposed the YOLO-multiscale receptive field (YOLO-MSRF) method. They designed a small-object detection layer to focus on the detection of small targets of lung nodules and introduced an MSRF module for improving the learning of the fine-grained features of the surrounding nodules. In addition, a full-dimensional convolution module was proposed to weight the input data and thus enhance the nodule-sensitive features. However, this method exhibited overfitting problems, resulting in a decline in generalization performance. Ma and Wu (27) integrated a receptive field module in YOLOv8 to enhance the extraction of nodule features and introduced inner intersection over union (Inner-IoU) to optimize the positioning accuracy of the detector. However, this method demonstrated difficulties in detecting small-target nodules, and on the LUNA16 data, its accuracy was only 81.3%.

To address these issues, we proposed the downsampling attention fusion network (DAFNet) for lung nodule detection, which can provide a more favorable tradeoff between model complexity and performance. Specifically, the proposed DAFNet consists of two main components: the adaptive downsampling module (ADSM) and the global attention module (GAM). The ADSM employs a dual-branch parallel learning strategy, which facilitates the efficient extraction of multiscale features from lung CT images while significantly reducing the number of parameters. Meanwhile, the GAM cascades both channel and spatial attention modules, which highlight important features across critical channels and spatial locations, thereby enhancing the overall feature representation capacity of the images. In this study, we examined the influence of these downsampling and attention mechanisms on DAFNet’s performance in detecting lung nodules.

The principal contributions of this work are described below:

We constructed a lung CT image dataset (LCTD) comprising 2,172 high-quality CT images from 1,060 individuals. Specifically, the dataset includes 804 males (1,496 CT images) and 256 females (676 CT images), with ages ranging from 40 to 87 years. This dataset is expected to promote the advancement of lung nodule detection technologies and facilitate their clinical application.
We designed a practical dual-branch downsampling module to reduce redundant features. By employing a parallel learning strategy, this module can efficiently extract multilevel discriminative features. Our findings demonstrate that this strategy not only significantly improves detection accuracy but also makes the network more lightweight.
We developed a cascaded feature module by stacking effective channel and spatial attention modules. This approach enables the model to focus on relevant features while discarding nonessential information, thereby improving the performance of the extracted features.

We present this article in accordance with the STARD-AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0050/rc).

Methods

Study rationale

The currently available methods for lung nodule detection have clear limitations. Many methods are overly complex and fail to balance accuracy and efficiency. For instance, preprocessing-based methods segment lung parenchyma or extract regions of interest before detection, leading to increased manual intervention and low efficiency. The YOLOv11 detection method can automatically learn the features of lung nodules, but the fixed sampling convolution limits its ability to extract multiscale features. In addition, YOLOv11 lacks effective channel and spatial selection mechanisms and is constrained by redundant channel interference and context modeling bottlenecks. Integrating dynamic sampling with channel-spatial attention may effectively address these issues, as this can overcome the shortcomings of local receptive fields and feature selection for lung nodule detection in traditional convolution and generate feature representations with stronger context awareness.

It is worth noting that the complex lung background is not conducive to the extraction of low-frequency information such as nodule contours or high-frequency details such as edges and textures. To address these challenges, we designed the aforementioned ADSM. Traditional stride convolution tends to attenuate low-frequency signals, while learning weight-based dynamic pooling often selectively suppresses the fragile textures of small nodules and increases the computational burden. ADSM adopts a dual-branch strategy, concurrently using max pooling to capture high-frequency edge features and average pooling to retain low-frequency context. This design effectively compensates for the missed detection problem caused by boundary truncation in the detection of pleural-sided nodules, enhancing the complementarity of multiscale features. For low-contrast nodules such as ground-glass nodules, we developed GAM. By combining channel rearrangement with a shared-weight multilayer perceptron (MLP) architecture, GAM overcomes the issues of the squeeze-and-excitation network (SENet), whose global average pooling tends to smooth out weak signals, and of the efficient channel attention network (ECANet), which is limited by local interaction. This architecture models long-range dependencies across channels while maintaining spatial resolution, which facilitates the association of subtle density changes in ground-glass nodules with the global context and suppresses the interference of vascular artifacts. These components were integrated into the dynamic DAFNet to provide improved detection of lung nodules.

DAFNet architecture

A schematic of the DAFNet is provided in Figure 1. In the DAFNet, the input images initially are resized to a uniform dimension of 640×640×3, after which the backbone network conducts feature extraction. The innovative, lightweight ADSM, which replaces the convolution module, performs feature extraction and downsampling. This approach not only enhances the feature representation of nodules but also significantly reduces the number of parameters. Next, the multilevel output features {X1, X2, X3} are passed into the concatenation module to strengthen the cross-scale correlation of discriminative features and highlight the target regions. Within the neck network, the bidirectional feature pyramid network (BiFPN) structure enables feature fusion in both top-down and bottom-up directions. Subsequently, the GAM module is applied for weight redistribution, mitigating the attenuation of attention during feature fusion and transmission. The weight-reallocated features {Y1, Y2, Y3} are then fed into the head network. Furthermore, multiscale prediction layers {H1, H2, H3} are employed to detect targets at varying scales. To further improve the network’s training performance, the classification loss function is combined with the bounding box regression loss function, optimizing the overall loss function to ensure greater stability and efficiency during training.

Figure 1 The workflow of our proposed DAFNet. (A) ADSM. (B) GAM. BN, batch normalization; C2PSA, cross-stage partial spatial attention; C3k2, cross stage partial 3 with kernel size 2; Concat, concatenation; Conv, convolution; DAFNet, downsampling attention fusion network; MLP, multilayer perceptron; SiLU, sigmoid linear unit; SPPF, spatial pyramid pooling-fast.

To provide a comprehensive overview of the proposed DAFNet architecture, the details of its structure are provided in Table 1, which clearly outlines the feature map sizes for both input and output, as well as the number of parameters for each module.

Table 1

Details of the proposed DAFNet

Layer	Input size (pixels)	Output size (pixels)	Repetitions	Parameters
Input	3×640×640	3×640×640	1	0
Conv	3×640×640	16×320×320	1	464
ADSM + C3k2	16×320×320	256×20×20	4	601,232
SPPF	256×20×20	256×20×20	1	164,608
C2PSA	256×20×20	256×20×20	1	249,728
Upsample + Concat + C3k2	256×20×20	64×80×80	2	143,392
ADSM	64×80×80	64×40×40	1	10,368
Concat	[64×40×40, 128×40×40]	192×40×40	1	0
C3k2	192×40×40	128×40×40	1	86,720
GAM	128×40×40	128×40×40	1	410,240
ADSM	128×40×40	128×20×20	1	33,528
Concat	[128×20×20, 256×20×20]	384×20×20	1	0
C3k2	384×20×20	256×20×20	1	378,880
Detect	[64×80×80, 128×40×40, 256×20×20]		1	430,867
Total trainable parameters				2,510,027

Data are presented as number. ADSM, adaptive downsampling module; C2PSA, cross-stage partial spatial attention; C3k2, cross stage partial 3 with kernel size 2; Concat, concatenation; Conv, convolution; DAFNet, downsampling attention fusion network; GAM, global attention module; SPPF, spatial pyramid pooling-fast.
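The parameter column of Table 1 can be checked with a few lines of arithmetic. The sketch below sums the per-row counts, treating each figure as the total for its row (repetitions included), and also recomputes the first Conv row. The helper `conv_bn_params` is hypothetical and assumes a bias-free convolution followed by batch normalization with one scale and one shift per channel, which the text does not state explicitly.

```python
# Per-row parameter counts copied from Table 1.
TABLE1_PARAMS = {
    "Input": 0,
    "Conv": 464,
    "ADSM + C3k2 (x4)": 601_232,
    "SPPF": 164_608,
    "C2PSA": 249_728,
    "Upsample + Concat + C3k2 (x2)": 143_392,
    "ADSM (64 channels)": 10_368,
    "Concat (40x40)": 0,
    "C3k2 (192 -> 128)": 86_720,
    "GAM": 410_240,
    "ADSM (128 channels)": 33_528,
    "Concat (20x20)": 0,
    "C3k2 (384 -> 256)": 378_880,
    "Detect": 430_867,
}


def conv_bn_params(c_in, c_out, k):
    """Assumed layout: bias-free k x k convolution + BN scale and shift."""
    return c_in * c_out * k * k + 2 * c_out


total = sum(TABLE1_PARAMS.values())
stem = conv_bn_params(3, 16, 3)  # first Conv row: 3 -> 16 channels, 3x3 kernel

print(total)  # 2510027, the "Total trainable parameters" row
print(stem)   # 464, the Conv row
```

The total agrees with the 2.5 M parameter figure quoted in the abstract.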

Lightweight ADSM module

The ADSM module uses pooling operations instead of traditional convolution for downsampling. Meanwhile, the dual-branch design allows the final feature map to retain both the original feature information and any additional features obtained through different processing paths, thereby enhancing the feature representation capability.

A schematic of the ADSM module is provided in Figure 2. For the ADSM module, it is assumed that the input feature is $X \in \mathbb{R}^{W \times H \times C}$, where W, H, and C denote the width, height, and the number of channels, respectively. First, an average pooling operation on X is performed to reduce computational complexity while preserving essential feature information, which is expressed as follows:

$X_{avg} = {AvgPool}_{k = 2}^{s = 1} (X)$ [1]

where AvgPool() is the average pooling function, k is the kernel size, and s is the stride.

Figure 2 Flowchart of the ADSM module. ADSM, adaptive downsampling module; SiLU, sigmoid linear unit.

Second, the feature map $X_{avg} \in \mathbb{R}^{(W - 1) \times (H - 1) \times C}$, resulting from the average pooling operation, is split into $X_{avg1} \in \mathbb{R}^{(W - 1) \times (H - 1) \times \frac{C}{2}}$ and $X_{avg2} \in \mathbb{R}^{(W - 1) \times (H - 1) \times \frac{C}{2}}$ along the channel dimension. For the feature map $X_{avg1}$, a 3×3 convolution with a stride of 2 is applied. To ensure the stability and efficiency of network training, batch normalization is performed on the local features extracted by the convolution. Subsequently, the sigmoid-weighted linear unit (SiLU) activation function is applied to the batch-normalized feature map to introduce nonlinearity. The feature map obtained from branch one can be represented as follows:

$X_{out1} = SiLU (BN ({Conv}_{3} (X_{avg1})))$ [2]

where Conv₃() is a 3×3 convolution operation, and BN() is a batch normalization function. For the feature map $X_{avg2}$ , a max pooling operation is implemented to extract key features such as textures, shapes, and edges of the nodules while significantly reducing the size of the feature map. This operation can be expressed as follows:

$\hat{X}_{avg2} = {MaxPool}_{k = 3}^{s = 2} (X_{avg2})$ [3]

where MaxPool() is the max pooling function. Subsequently, the final output of branch two is obtained through a sequence of operations, including a 1×1 convolution, batch normalization, and the SiLU nonlinear activation function. This process can be expressed as follows:

$X_{out2} = SiLU (BN ({Conv}_{1} (\hat{X}_{avg2})))$ [4]

Finally, a concatenation operation is performed, where $X_{out1}$ and $X_{out2}$ are concatenated along the channel dimension to produce the output feature map. This operation enables the model to more effectively capture the diversity and complexity of the features, thereby significantly improving the generalization and robustness of the model. The output feature map of the ADSM module can be expressed as follows:

$X_{out} = Concat (X_{out1}, X_{out2})$ [5]

where Concat() represents the channel-wise concatenation operation.
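To make the shape bookkeeping in Eqs. [1] to [5] concrete, the following sketch traces the spatial size and a rough parameter count through the two branches. Several details are assumptions rather than statements from the text: a padding of 1 for the 3×3 convolution and for the max pooling, bias-free convolutions, two batch normalization parameters per channel, and a channel-preserving module in which each branch maps C/2 channels to C/2 channels. The function names are hypothetical.

```python
def out_size(n, k, s, p=0):
    """Spatial size after a k x k window with stride s and padding p."""
    return (n + 2 * p - k) // s + 1


def adsm_output_shape(c, h, w):
    """Trace (channels, height, width) through a channel-preserving ADSM."""
    # Eq. [1]: average pooling with k=2, s=1 and no padding -> (H-1) x (W-1).
    h_avg, w_avg = out_size(h, 2, 1), out_size(w, 2, 1)
    half = c // 2  # channel split into X_avg1 and X_avg2
    # Eq. [2]: 3x3 convolution, stride 2 (padding 1 is an assumption).
    conv_hw = (out_size(h_avg, 3, 2, 1), out_size(w_avg, 3, 2, 1))
    # Eq. [3]: max pooling, k=3, s=2 (padding 1 is an assumption);
    # the 1x1 convolution of Eq. [4] leaves the spatial size unchanged.
    pool_hw = (out_size(h_avg, 3, 2, 1), out_size(w_avg, 3, 2, 1))
    assert conv_hw == pool_hw, "branches must agree spatially before Concat"
    # Eq. [5]: channel-wise concatenation of the two halves.
    return (2 * half, conv_hw[0], conv_hw[1])


def adsm_params(c):
    """Rough count: bias-free convolutions plus BN scale and shift."""
    half = c // 2
    branch1 = half * half * 3 * 3 + 2 * half  # 3x3 conv + BN
    branch2 = half * half * 1 * 1 + 2 * half  # 1x1 conv + BN
    return branch1 + branch2


print(adsm_output_shape(64, 80, 80))  # (64, 40, 40)
print(adsm_params(64))                # 10368
```

Under these assumptions, the 64-channel case reproduces both the 64×80×80 to 64×40×40 mapping and the 10,368 parameters listed for that ADSM row in Table 1. The count has only been checked against that row; other rows may use widths that this simplified model does not capture.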

GAM

We designed an effective GAM by cascading the channel attention and the spatial attention modules, which can refine the morphological, textural, and spatial features of lung nodules. The synergistic effect of the dual-attention mechanism significantly enhances the discriminability of feature representation. The schematic of this design is provided in Figure 3.

Figure 3 Simplified flowchart of the GAM. (A) Channel attention structure. (B) Spatial attention structure. Conv, convolution; GAM, global attention module; MLP, multilayer perceptron.

The channel attention module fully leverages the cross-channel interaction information of the input data and establishes a dynamic channel weight allocation mechanism. This enables the network to focus more on important channel features while suppressing irrelevant ones. The structure of the channel attention submodule is illustrated in Figure 3A. First, the channel shuffle is introduced to enhance cross-channel information interaction. The shuffled feature map is then passed through an MLP with shared weights to enable learning of the nonlinear relationships between channels. Next, a reverse channel shuffle is applied to the learned channel weight vector to restore the original channel order. Finally, the sigmoid function is used to generate normalized channel weights, which are applied to the input feature map via channel-wise scalar multiplication. This process can be expressed as follows:

$F_{CAM} = Sigmoid ({CS}^{- 1} (MLP (CS (F)))) ⊙ F$ [6]

where $F$ is the input feature; CS() and CS⁻¹() are the channel shuffle and inverse channel shuffle operations, respectively; MLP() is the MLP; and $⊙$ is the element-wise multiplication operation.
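The role of CS() and CS⁻¹() in Eq. [6] can be illustrated on a list of channel indices. The sketch below uses the common group-interleaving form of channel shuffle (reshape into groups, transpose, flatten); the number of groups is an assumption, since the text does not specify it. The gating step treats each channel as a single scalar, which is a simplification of the full feature map, and the `mlp` argument is a hypothetical stand-in for the shared-weight MLP.

```python
import math


def channel_shuffle(channels, groups):
    """Interleave channels across groups: (g, n/g) -> transpose -> flatten."""
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by groups"
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def inverse_channel_shuffle(channels, groups):
    """Undo channel_shuffle by shuffling with the transposed grouping."""
    return channel_shuffle(channels, len(channels) // groups)


def channel_gate(features, mlp, groups):
    """Eq. [6] on per-channel scalars: Sigmoid(CS^-1(MLP(CS(F)))) * F."""
    logits = inverse_channel_shuffle(mlp(channel_shuffle(features, groups)), groups)
    return [f / (1.0 + math.exp(-z)) for f, z in zip(features, logits)]


order = list(range(6))
shuffled = channel_shuffle(order, groups=2)
restored = inverse_channel_shuffle(shuffled, groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
print(restored)  # [0, 1, 2, 3, 4, 5]

# Hypothetical "MLP" that returns zero logits: every weight becomes 0.5.
gated = channel_gate([2.0, 4.0, 6.0, 8.0], lambda v: [0.0] * len(v), groups=2)
print(gated)  # [1.0, 2.0, 3.0, 4.0]
```

Because the inverse shuffle restores the original order, each learned weight is multiplied with the channel it was computed for, even though the MLP saw the channels in interleaved order.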

The spatial attention module thoroughly explores the spatial information of the input data, dynamically adjusting the weight assigned to each spatial position, thereby enabling the network to focus on critical spatial regions. The structure of the spatial attention module is illustrated in Figure 3B. Two 7×7 convolution operations are performed to generate the spatial attention weight matrix. Subsequently, a sigmoid activation function is applied to obtain the weight for each spatial location. The calculation for this is as follows:

$F_{SAM} = Sigmoid ({Conv}_{7} ({Conv}_{7} (F_{CAM})))$ [7]

where Conv₇() is a 7×7 convolution operation. Finally, element-wise multiplication between the channel attention feature map and the spatial attention feature map is completed to derive the final attention feature map, which can be expressed as follows:

$F_{GAM} = F_{CAM} ⊙ F_{SAM}$ [8]
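As a rough guide to where the GAM parameters sit, the sketch below counts parameters for one possible configuration of Eqs. [6] to [8]. The reduction ratio of 4, the biased linear and convolution layers, and the batch normalization after each 7×7 convolution are all assumptions borrowed from common global-attention implementations; the text does not state them. All function names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Fully connected layer with bias."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, k, bias=True):
    """k x k convolution, optionally with bias."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def bn_params(c):
    """Batch normalization: one scale and one shift per channel."""
    return 2 * c


def gam_params(c, reduction=4):
    """Channel MLP (C -> C/r -> C) plus two 7x7 convolutions (C -> C/r -> C)."""
    hidden = c // reduction
    channel_mlp = linear_params(c, hidden) + linear_params(hidden, c)
    spatial = (conv_params(c, hidden, 7) + bn_params(hidden)
               + conv_params(hidden, c, 7) + bn_params(c))
    return channel_mlp, spatial


mlp_count, spatial_count = gam_params(128)
print(mlp_count)                  # 8352
print(spatial_count)              # 401888
print(mlp_count + spatial_count)  # 410240
```

With 128 channels, this assumed configuration gives 410,240 parameters, the figure listed for the GAM row in Table 1, and it shows that the two 7×7 convolutions of Eq. [7] account for almost all of them.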

Loss function

The DAFNet architecture consists of two optimization branches. The integration of classification loss and regression loss can help effectively guide the model’s training process and further optimize its parameters.

Classification loss function: given the uneven distribution of training samples, cross-entropy loss is applied for classification during the model training process. Additionally, introducing modulation factors to the original cross-entropy loss helps guide the model to allocate greater attention to samples that are more difficult to classify correctly. $L_{cls}$ is expressed as follows:

$L_{cls} = - \frac{1}{N} \sum_{m = 1}^{N} \sum_{n = 1}^{C} (y_{m, n} α {(1 - p_{m, n})}^{γ} \log (p_{m, n}) + (1 - y_{m, n}) (1 - α) p_{m, n}^{γ} \log (1 - p_{m, n}))$ [9]

where $N$ and $C$ are the number of samples and the number of classes, respectively; $y_{m,n}$ indicates whether the $m$th sample belongs to the $n$th class; and $p_{m,n}$ is the predicted probability that the $m$th sample belongs to the $n$th class. Here, α=0.25 is a balancing factor which adjusts the weights between positive and negative samples, and γ=2 is a focusing parameter which controls the attention to difficult samples.
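Eq. [9] can be written out directly for small inputs. The following is a minimal sketch on nested lists, with α=0.25 and γ=2 as stated above; the clipping constant `eps` is an added assumption to keep the logarithm finite, and the function name is hypothetical.

```python
import math


def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    """Eq. [9]: y[m][n] is 0 or 1, p[m][n] is a probability in (0, 1)."""
    n_samples = len(y)
    total = 0.0
    for y_row, p_row in zip(y, p):
        for y_mn, p_mn in zip(y_row, p_row):
            p_mn = min(max(p_mn, eps), 1.0 - eps)  # avoid log(0)
            positive = y_mn * alpha * (1.0 - p_mn) ** gamma * math.log(p_mn)
            negative = ((1 - y_mn) * (1.0 - alpha) * p_mn ** gamma
                        * math.log(1.0 - p_mn))
            total += positive + negative
    return -total / n_samples


easy = focal_loss([[1]], [[0.9]])  # confident, correct prediction
hard = focal_loss([[1]], [[0.1]])  # confident, wrong prediction
print(easy < hard)  # True
```

The modulation factor $(1 - p)^{γ}$ shrinks the contribution of the well-classified sample by several orders of magnitude relative to the misclassified one. Setting γ=0 and α=0.5 reduces the expression to half of the ordinary binary cross-entropy.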

Bounding box loss function: complete intersection over union (CIoU) is included to optimize object localization. The CIoU loss fully accounts for the overlap area between the predicted and ground truth boxes while also introducing a penalty term for the distance between the center points to minimize the deviation between them. Additionally, the CIoU loss evaluates the shape similarity between the two bounding boxes by analyzing the consistency of their aspect ratios. This multidimensional optimization strategy enables a more comprehensive measurement of the matching quality between the two bounding boxes. The CIoU loss is formulated as follows:

$L_{bbox} = 1 - IoU + \frac{d^{2}}{c^{2}} + α v$ [10]

Here, IoU is the intersection over union of the predicted and ground truth boxes, d is the distance between the centers of the two bounding boxes, c is the diagonal length of the smallest enclosing rectangle, v measures the consistency of the aspect ratios, and α is the weight coefficient (distinct from the balancing factor α in Eq. [9]).
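A worked instance of Eq. [10] is sketched below for axis-aligned boxes given as (x1, y1, x2, y2). The text does not spell out v and α; the sketch assumes the commonly used CIoU definitions, $v = \frac{4}{π^{2}} (\arctan \frac{w_{gt}}{h_{gt}} - \arctan \frac{w}{h})^{2}$ and $α = \frac{v}{(1 - IoU) + v}$, and it assumes boxes with positive width and height. The function name is hypothetical.

```python
import math


def ciou_loss(box_p, box_g):
    """Eq. [10] for a predicted box and a ground truth box (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Overlap area and IoU.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # Squared center distance d^2 and squared enclosing-box diagonal c^2.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 \
        + (max(py2, gy2) - min(py1, gy1)) ** 2
    # Aspect ratio consistency v and its weight alpha (assumed definitions).
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    denom = (1.0 - iou) + v
    alpha = v / denom if denom > 0 else 0.0
    return 1.0 - iou + d2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for a perfect match
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # about 1.444 for disjoint boxes
```

For disjoint boxes the IoU term saturates at 1, but the center-distance term still varies with how far apart the boxes are, which is what lets the regression loss guide non-overlapping predictions toward the target.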

Total loss function: the weighted sum of the classification loss and the bounding box loss (28) is used to optimize both the classification and localization tasks simultaneously while ensuring their efficient integration and minimizing potential conflicts. The total loss function can be expressed as follows:

$Loss = λ_{1} L_{cls} + λ_{2} L_{bbox}$ [11]

where λ₁ and λ₂ are equal to 1 and 2.5, respectively.

Experimental dataset

In our experiments, we used two datasets: the publicly available LUNA16 and our constructed LCTD. Data augmentation techniques were applied to the LCTD. The LUNA16 dataset includes 888 high-quality chest CT scans annotated under the consensus of at least three radiologists and serves as a benchmark for lung nodule detection. The two datasets are available online (https://github.com/lilinfang-hist/DAFNet).

For the LCTD, the study included a convenience sample of patients whose lung CT scans were already available in the hospital’s radiology archive. These scans had been acquired previously for clinical purposes and were retrospectively collected for use in the study. Participants were identified from these clinically indicated lung CT images, and all high-quality CT scans suitable for lung nodule detection were considered eligible. Images of the LCTD were acquired with 64-slice and 128-slice spiral CT scanners (Philips, Amsterdam, the Netherlands) from 2021 to 2024. The dataset consists of 2,172 high-resolution lung CT images from 1,060 individuals, including 804 males (1,496 CT images) and 256 females (676 CT images), with an age range of 40 to 87 years. All images were saved in JPEG format. Strict privacy protection measures were implemented: sensitive information, such as the hospital name, patient personal details (including name, age, and gender), and examination time (located in the upper right corner of the images), were systematically anonymized to ensure compliance with medical ethical standards. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and ethical approval was obtained from the Ethics Committee of Jiaozuo Coal Industry (Group) Co. Ltd. Central Hospital (approval No. KYXM202401002). The requirement for written informed consent was waived due to the retrospective nature of the analysis.

Notably, data annotation was performed manually with LabelMe software through a collaboration between three physicians from the hospital’s information and radiology departments. Nodule information was verified based on expert reports from the radiology department and image interpretations by thoracic surgeons. Figure 4 illustrates the lung nodule-labeling process; in each image, the original unlabeled images are shown on the left, while the labeled versions are on the right. To facilitate a clearer observation of image details, the upper half displays the image at its original scale, while the lower half shows a partially magnified view.

Figure 4 The labeling process of the lung nodules.

To augment the dataset, we employed three data augmentation techniques: image rotation, central region cropping, and the addition of Gaussian noise. The detailed image augmentation process is presented in Figure 5. These methods expanded the original dataset to 8,688 images. In Figure 5, the first column shows the original images, the second column shows the images that have been rotated to alter their orientation, the third shows the images subjected to central cropping to emphasize the focal area, and the fourth column shows images with added Gaussian noise to simulate potential interference during acquisition or transmission.

Figure 5 The image augmentation process.
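The three augmentations can be sketched on a small grayscale image stored as a list of rows. The rotation angle (90°), the crop fraction (0.5), and the noise standard deviation (10 gray levels) are illustrative assumptions, as the text does not report the settings used; the function names are hypothetical. In a detection setting the bounding box annotations would need the same geometric transforms, which this sketch omits.

```python
import random


def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def center_crop(img, frac=0.5):
    """Keep the central region covering `frac` of each side."""
    h, w = len(img), len(img[0])
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]


def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise and clip to the 0..255 range."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
            for row in img]


def augment(img):
    """Original plus three augmented variants, as in the four columns of Figure 5."""
    return [img, rotate90(img), center_crop(img), add_gaussian_noise(img)]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
variants = augment(image)
print(len(variants))        # 4
print(2172 * len(variants)) # 8688
```

Keeping the original and adding one variant per technique yields four images per source image, which is consistent with the expansion from 2,172 to 8,688 images.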

Implementation details

Experimental setting

In terms of experimental configuration, both our DAFNet and the comparator methods were evaluated in the same experimental environment and with the same parameter settings for the LCTD and LUNA16 datasets. All experiments were implemented in Python 3.9.18 (Python Software Foundation, Wilmington, DE, USA) with PyTorch 2.0.1+cu117, running on a GeForce RTX 4090 GPU (Nvidia, Santa Clara, CA, USA) hosted on a server with the Ubuntu 20.04.6 LTS operating system and 128 GB of memory. To meet input specifications, all samples for training, testing, and validation were resized to 640×640×3. The dataset was split into training, testing, and validation sets at an 8:1:1 ratio, and the batch size was 16. Additionally, DAFNet was optimized using stochastic gradient descent, with an initial learning rate of 0.01, a weight decay of 0.0005, and a total of 300 training epochs.
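An 8:1:1 split can be sketched as follows. The text does not say whether the split was made per image or per patient, nor how remainders were rounded; this sketch simply shuffles a flat list of image identifiers with a fixed seed and gives any remainder to the validation set. The function name, the seed, and the rounding rule are assumptions.

```python
import random


def split_8_1_1(items, seed=0):
    """Shuffle and split into training, testing, and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# Hypothetical identifiers for the 8,688 augmented LCTD images.
image_ids = [f"img_{i:05d}" for i in range(8688)]
train, test, val = split_8_1_1(image_ids)
print(len(train), len(test), len(val))  # 6950 868 870
```

When several images come from the same individual, or when augmented copies of one image exist, splitting by individual rather than by image is a common way to keep near-duplicates out of the evaluation subsets.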

Evaluation metrics

Precision, recall, F1 score, and mean average precision (mAP) are commonly used to quantitatively assess model performance. Precision is the proportion of predicted positive samples that are actually positive, while recall is the proportion of actual positive samples correctly identified by the model. The F1 score is the harmonic mean of these two metrics, addressing the limitations of relying on a single metric. Meanwhile, average precision (AP) summarizes precision across different recall values as the area under the precision-recall curve, and mAP is the mean of AP over classes, offering a more comprehensive evaluation of both classification and localization performance for various samples at different thresholds. In this study, a prespecified threshold of 0.5 was used to define positive results. In general, higher values of these quantitative indicators correspond to superior model performance in their respective aspects.
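The relationship between precision, recall, and the F1 score can be checked against the mean values reported in Table 2. The sketch below defines the three metrics from true positive (TP), false positive (FP), and false negative (FN) counts and recomputes F1 from the tabulated precision and recall. Computing F1 from mean precision and mean recall is an approximation, since the mean of per-run F1 scores need not equal it exactly; the function names are hypothetical.

```python
def precision(tp, fp):
    """Share of predicted positives that are true positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (precision %, recall %, reported F1 %) taken from Table 2.
TABLE2 = {
    "YOLOv8n": (91.0, 86.5, 88.7),
    "YOLOv9t": (82.3, 82.3, 82.3),
    "YOLOv10n": (87.8, 86.1, 86.9),
    "YOLOv11n": (89.3, 85.6, 87.4),
    "DAFNet": (93.2, 90.7, 91.9),
}

for name, (p, r, reported) in TABLE2.items():
    print(name, round(f1_score(p, r), 1), reported)
```

For every row, the recomputed value rounds to the reported F1 score.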

It should be explicitly noted that, for evaluation with the LUNA16 dataset, we followed a unified experimental protocol consistent for all comparator methods, using recall, mAP, and parameter count as core metrics, rather than the official LUNA16 free-response receiver operating characteristic (FROC) framework. Thus, the reported results are only comparable with studies adopting the same protocol and cannot be directly compared with LUNA16 studies using the standard FROC method.

Results

Comparison experiments

Comparison on the augmented LCTD dataset

To validate the detection performance of the proposed DAFNet, we conducted five independent repeated experiments based on the augmented LCTD dataset and compared it with other state-of-the-art YOLO methods. The results are presented in Table 2. DAFNet achieved 93.2%, 90.7%, and 96.3% in precision, recall, and mAP, respectively, representing improvements of 3.9, 5.1, and 4.4 percentage points compared to YOLOv11n. These enhancements effectively reduce the risk of missed diseased tissue while minimizing the number of false positives. Notably, although YOLOv9t has fewer parameters, its precision, recall, and mAP were lower by 10.9, 8.4, and 5.8 percentage points, respectively, compared to DAFNet, and its inference time (1.8 ms) was longer than that of DAFNet (1.5 ms). Despite having only 0.78 M more parameters than YOLOv9t, DAFNet demonstrated significantly improved performance. In terms of F1 score, DAFNet outperformed the other methods, demonstrating a 4.5-percentage-point increase over YOLOv11n. Furthermore, the Welch t-test was used to conduct pairwise significance tests on the precision, recall, mAP, and F1 score of DAFNet and each comparator model. The P values for all comparisons for the four indicators were less than 0.001, indicating that the performance improvement provided by DAFNet was statistically significant; moreover, the t-values corresponding to the F1 indicator were all negative, indicating that the average F1 value of DAFNet was higher than that of all the comparator models.

Table 2

Comparison with state-of-the-art YOLO methods on the augmented LCTD

Method	Params (M)	Inference (ms)	Precision (%)	Recall (%)	mAP (%)	F1 (%)	t value	P value
YOLOv8n (27)	2.68	1.7	91.0±0.42	86.5±0.51	91.8±0.39	88.7±0.45	−11.75	<0.001
YOLOv9t (29)	1.73	1.8	82.3±0.56	82.3±0.48	90.5±0.41	82.3±0.53	−32.04	<0.001
YOLOv10n (30)	2.69	1.6	87.8±0.49	86.1±0.53	90.4±0.43	86.9±0.48	−17.71	<0.001
YOLOv11n (31)	2.58	1.5	89.3±0.47	85.6±0.52	91.9±0.40	87.4±0.50	−15.56	<0.001
DAFNet	2.51	1.5	93.2±0.43	90.7±0.49	96.3±0.36	91.9±0.41	–	–

Data are presented as mean ± standard deviation unless otherwise stated. DAFNet, downsampling attention fusion network; F1, F1 score; LCTD, lung computed tomography image dataset; mAP, mean average precision; YOLO, You Only Look Once.
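The Welch t statistic used for the pairwise comparisons can be computed from summary statistics alone. The sketch below makes two assumptions that the text does not state explicitly: that the t values in Table 2 refer to the F1 score (comparator minus DAFNet), and that each mean and standard deviation is based on n = 5 values, one per independent run. The function name `welch_t` is illustrative.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# F1 mean and SD copied from Table 2; n = 5 is an assumption (five repeated runs).
N_RUNS = 5
DAFNET_F1 = (91.9, 0.41)
COMPARATOR_F1 = {
    "YOLOv8n": (88.7, 0.45),
    "YOLOv9t": (82.3, 0.53),
    "YOLOv10n": (86.9, 0.48),
    "YOLOv11n": (87.4, 0.50),
}

for name, (mean, sd) in COMPARATOR_F1.items():
    t, df = welch_t(mean, sd, N_RUNS, DAFNET_F1[0], DAFNET_F1[1], N_RUNS)
    print(f"{name}: t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the computed statistics come out close to the tabulated t values, and the negative sign reflects that each comparator's mean F1 is below that of DAFNet. Converting t and df into a P value requires the t distribution, which is omitted here to keep the sketch dependency-free.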

As can be seen from the results in Table 2, YOLOv8 achieved the highest precision and recall among the comparator methods. This suggests that YOLOv8 has good performance in lung nodule detection, which can be mainly attributed to the fact that its neck layer includes the path aggregation feature pyramid network, enhancing the multiscale feature fusion capability. YOLOv9 scored the lowest in the metrics of precision, recall, and F1 score. This is because YOLOv9 sacrifices feature extraction capabilities to improve parameter efficiency and thus has the fewest parameters among all methods. Although the performance of YOLOv10 was significantly better than that of YOLOv9, except for mAP, its precision, recall, and F1 were all below 90%. This reflects the insufficient modeling capability of YOLOv10 for lung nodule detection, which is due to the fact that the non-maximum-suppression-free design of YOLOv10 is more suitable for dense target scenarios, whereas lung nodules are usually sparsely distributed in CT images. YOLOv11 had slightly higher precision, mAP, and F1 compared to YOLOv10 because YOLOv11 enhances the cross-scale fusion of nodule features through channel expansion and enhanced feature pyramids in the backbone layer. Notably, YOLOv11 had a slightly lower recall than did YOLOv10. This may be due to the fact that the dynamic label allocation strategy of YOLOv11 is not sufficiently sensitive to micronodules, resulting in a portion of them not being effectively detected.

To assess the stability of our DAFNet, we further conducted fivefold cross-validation. The independent repeated experiments verified the repeatability of the model performance, while the fivefold cross-validation comprehensively evaluated the model’s robustness under different data distributions. The results in Table 3 show that the model’s precision, recall, mAP, and F1 score remained highly consistent across all folds, and the standard deviations of each indicator were small, demonstrating the stability of its performance.

Table 3

Fivefold cross-validation on the augmented LCTD

Fold	Precision (%)	Recall (%)	mAP (%)	F1 (%)
Fold 1	92.8	90.2	95.8	91.5
Fold 2	93.6	91.3	96.7	92.4
Fold 3	92.5	89.8	95.5	91.1
Fold 4	93.8	91.5	97.0	92.6
Fold 5	93.0	90.5	96.1	91.7
Mean ± SD	93.1±0.55	90.7±0.61	96.2±0.40	91.9±0.42

LCTD, lung computed tomography image dataset; mAP, mean average precision; SD, standard deviation.
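For readers unfamiliar with the procedure, the partitioning behind fivefold cross-validation can be sketched as below. This is a generic illustration, not the authors' pipeline: the generator name `k_fold_indices` is hypothetical, and the samples are assumed to have been shuffled beforehand so that contiguous folds are representative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical example with 12 samples.
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(12, k=5), start=1):
    print(f"Fold {fold}: {len(train_idx)} training, {len(val_idx)} validation samples")
```

Each sample serves as validation data exactly once, so the per-fold metrics together cover the whole dataset.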

Comparison on the LUNA16 dataset

To evaluate the generalization ability of our proposed DAFNet, we performed training and testing on the LUNA16 dataset. Under a unified experimental setup and parameter configuration, we conducted detailed experimental comparisons with mainstream deep learning methods based on the following metrics: recall, mAP, and the number of parameters.

The results on the LUNA16 dataset are presented in Table 4. DAFNet achieved 90.5% in recall and 95.9% in mAP, the highest mAP among all the compared methods. Although multiscale convolutional neural network (MCNN) surpassed DAFNet by 4.3 percentage points in recall, it requires an additional 23.2 M parameters. Notably, DAFNet stands out among all comparison methods as the model with the fewest parameters, requiring only 2.5 M. Overall, DAFNet strikes a favorable balance between performance and efficiency, with its lightweight structure effectively reducing both missed and false detections while maintaining computational efficiency.

Table 4

Comparison with state-of-the-art methods on the LUNA16 dataset

Method	Recall (%)	mAP (%)	Params (M)
Deformable-DETR (32)	82.0±0.65	83.9±0.58	40
YOLOv6 (33)	92.1±0.48	94.6±0.51	18.5
YOLOv7 (34)	93.5±0.46	94.0±0.41	36.5
YOLOv8 (35)	92.6±0.49	94.2±0.43	11.1
CircleNet (36)	87.5±0.60	86.7±0.55	16.2
DCA-YOLO (37)	89.3±0.52	86.3±0.60	7.1
SCPM (38)	89.2±0.54	90.0±0.46	12.8
MCNN (39)	94.8±0.56	92.1±0.52	25.7
CANN (40)	94.6±0.50	91.1±0.46	6.3
DAFNet	90.5±0.46	95.9±0.40	2.5

Data are presented as mean ± standard deviation unless otherwise stated. CANN, channel-wise attention neural network; DAFNet, downsampling attention fusion network; DCA-YOLO, dense connection and attention based You Only Look Once; Deformable-DETR, deformable detection transformer; LUNA16, Lung Nodule Analysis 2016; mAP, mean average precision; MCNN, multi-scale convolutional neural network; SCPM, sphere center-points matching.

From the results in Table 4, it can be discerned that deformable detection transformer (Deformable-DETR), CircleNet, dense connection and attention-based YOLO (DCA-YOLO), and the sphere representation-based center-points matching (SCPM) methods exhibited unsatisfactory recall and mAP scores. This suggests that these methods have limited nonlinear modeling capabilities for lung nodule detection. It is worth noting that MCNN achieved the highest recall score among all the methods compared. This is because it uses convolution kernels of different sizes in multiple channels for multiscale feature extraction targeted at nodules of varying sizes. However, the parallel multibranch structure design also leads to an increase in the number of parameters, with the number of parameters of MCNN reaching 25.7 M. In contrast, the channel-wise attention neural network (CANN) performed similarly to MCNN in terms of recall while having 19.4 M fewer parameters. This is mainly attributable to the channel attention mechanism in the CANN network, which can strengthen the feature channels that contribute significantly to nodule detection while suppressing redundant channels, thereby effectively improving parameter efficiency while focusing on key nodule features. Compared with traditional deep learning methods, YOLOv6, YOLOv7, and YOLOv8 demonstrated stronger nonlinear modeling capabilities. However, the number of parameters in YOLOv7 is significantly higher. This is because YOLOv7 adopts an extended efficient layer aggregation network, which increases the depth and complexity of the backbone layer, and its multiscale aggregation detection heads introduce more parameters. These improvements enhance the detection capability for tiny nodules. Consequently, YOLOv7 had the best recall among the YOLOv6, YOLOv7, and YOLOv8 methods. Although YOLOv6, YOLOv7, and YOLOv8 performed well in detection, the proposed DAFNet achieved satisfactory results in all three indicators of recall, mAP, and number of parameters, maintaining good detection performance while achieving the lowest number of model parameters.
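The trade-offs discussed above can be checked with simple arithmetic on the mean values in Table 4. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the standard deviations are ignored.

```python
# (recall %, mAP %, parameters in millions): mean values copied from Table 4.
LUNA16_RESULTS = {
    "Deformable-DETR": (82.0, 83.9, 40.0),
    "YOLOv6": (92.1, 94.6, 18.5),
    "YOLOv7": (93.5, 94.0, 36.5),
    "YOLOv8": (92.6, 94.2, 11.1),
    "CircleNet": (87.5, 86.7, 16.2),
    "DCA-YOLO": (89.3, 86.3, 7.1),
    "SCPM": (89.2, 90.0, 12.8),
    "MCNN": (94.8, 92.1, 25.7),
    "CANN": (94.6, 91.1, 6.3),
    "DAFNet": (90.5, 95.9, 2.5),
}


def gap_to_reference(results, reference="DAFNet"):
    """Difference of each method from the reference (method minus reference)."""
    ref = results[reference]
    return {
        name: tuple(round(v - r, 1) for v, r in zip(values, ref))
        for name, values in results.items()
        if name != reference
    }


gaps = gap_to_reference(LUNA16_RESULTS)
for name, (d_recall, d_map, d_params) in gaps.items():
    print(f"{name}: recall {d_recall:+.1f}, mAP {d_map:+.1f}, params {d_params:+.1f} M")
```

Positive recall gaps mark the methods that detect more nodules than DAFNet; in every such case the parameter gap is also positive, which is the trade-off the text describes.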

We compared the detection performance of DAFNet with that of the other methods on the augmented LCTD and the LUNA16 dataset (Tables 2,4, respectively). The mAP and recall of DAFNet were 0.4 and 0.2 percentage points lower on the LUNA16 dataset than on the augmented LCTD, respectively. However, given that there are differences in sample size, nodule type, distribution location, and nodule size between the two datasets, the performance difference appears to be within a reasonable range. It is particularly worth noting that DAFNet scored over 90% in both mAP and recall on the LUNA16 dataset. These results suggest that our proposed DAFNet has favorable generalization capabilities.

Performance analysis

Precision-recall (PR) curve

The PR curve provides an intuitive visualization of the relationship between precision and recall at different decision thresholds. As shown in Figure 6, DAFNet outperformed the other methods, maintaining higher precision at comparable recall levels. The area under the PR curve serves as a quantitative metric for evaluating the overall performance of the model. We observed that DAFNet had the largest area under the curve, further validating its performance advantage. We also generated PR plots, which appear in Figure 7. DAFNet achieved 93.2% precision and 90.7% recall. In contrast, the recall values of all the other models remained below 90%, indicating a tendency to misclassify nodules as normal tissues, which can potentially increase clinical risks to patients.

Figure 6 PR curves for the different YOLO methods. PR, precision-recall; YOLO, You Only Look Once.

Figure 7 PR plots for the different YOLO methods. PR, precision-recall; YOLO, You Only Look Once.
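How a PR curve and the area under it are obtained from ranked detections can be sketched as follows. This is a generic illustration using all-point interpolation, which is one common convention; the interpolation scheme actually used in the study is not stated. The function names and the example detections are hypothetical, and each detection is assumed to have already been labeled as a true or false positive.

```python
def pr_points(detections, num_ground_truth):
    """Return (recall, precision) pairs as the confidence threshold is lowered.

    detections: iterable of (confidence, is_true_positive) pairs.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    return points


def average_precision(points):
    """Area under the PR curve after making precision non-increasing in recall."""
    recalls = [0.0] + [r for r, _ in points] + [1.0]
    precisions = [0.0] + [p for _, p in points] + [0.0]
    # Precision envelope: each value becomes the maximum precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i] for i in range(1, len(recalls))
    )


# Hypothetical example: four detections, four ground-truth nodules.
example = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
print(average_precision(pr_points(example, num_ground_truth=4)))
```

A model whose curve lies above another's at every recall level necessarily has the larger area, which is why the area serves as a single-number summary of the curves.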

mAP curve

To systematically assess the performance of different YOLO methods, each training epoch was evaluated, and the results were recorded at 20-epoch intervals. The mAP change curve for each method is presented in Figure 8. The experimental data showed that during the first 100 epochs of training, DAFNet exhibited a pronounced upward trend in mAP, reaching 0.1465, 0.6459, 0.6698, 0.7565, and 0.8128 at epochs 20, 40, 60, 80, and 100, respectively. Notably, in the early stages of training, due to the use of a relatively conservative learning rate, DAFNet initially lagged behind the other methods. However, as training progressed, DAFNet demonstrated robust learning capabilities, ultimately surpassing the other methods. Although the configuration of a low learning rate may result in slower convergence during the early training phases, it contributes to greater stability in the model’s overall training process. Furthermore, DAFNet provides a more favorable tradeoff between model complexity and performance. In Figure 9, the size of the circles reflects the number of model parameters, through which it can be surmised that DAFNet attained state-of-the-art performance.

Figure 8 mAP curves for the different YOLO methods. mAP, mean average precision; YOLO, You Only Look Once.

Figure 9 Performance comparison on the augmented LCTD dataset. DAFNet, downsampling attention fusion network; LCTD, lung computed tomography image dataset; YOLO, You Only Look Once; mAP, mean average precision.
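The pace of improvement over the first 100 epochs can be made explicit by differencing the mAP values reported in the text. The sketch below only performs that arithmetic; the names `MAP_BY_EPOCH` and `interval_gains` are illustrative.

```python
# DAFNet mAP at 20-epoch intervals, as reported in the text.
MAP_BY_EPOCH = {20: 0.1465, 40: 0.6459, 60: 0.6698, 80: 0.7565, 100: 0.8128}


def interval_gains(curve):
    """mAP gain between consecutive recorded epochs."""
    epochs = sorted(curve)
    return {(a, b): round(curve[b] - curve[a], 4) for a, b in zip(epochs, epochs[1:])}


gains = interval_gains(MAP_BY_EPOCH)
for (start, end), gain in gains.items():
    print(f"epochs {start}-{end}: {gain:+.4f}")
```

Most of the gain occurs between epochs 20 and 40, after which the curve rises more gradually, consistent with the slow start described in the text.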

Loss curve

Based on the log files recorded during the training process, we plotted the classification loss and bounding box loss curves, as shown in Figure 10. Initially, the loss values for all methods were relatively high but decreased rapidly within the first 50 epochs. As training progressed, each method gradually entered the convergence stage, with the loss values continuously decreasing and eventually stabilizing. Among the compared methods, YOLOv10 exhibited the slowest loss reduction rate. Meanwhile, YOLOv8, YOLOv9, and YOLOv11 had a comparable performance, with a similar loss decline rate, ultimately reaching a stable convergence point. Notably, our DAFNet demonstrated the fastest loss reduction, minimal fluctuation, and competitive overall performance.

Figure 10 Loss curves for the different YOLO methods. (A) The classification loss curve. (B) The bounding box loss curve. YOLO, You Only Look Once.

Visualization

Detection results

We present the comparative detection results in Figure 11. Our DAFNet achieved competitive performance relative to the other methods, with favorable localization accuracy and sensitivity. Notably, the confidence scores for the majority of the detected targets exceeded 90%, suggesting that the identification of suspected lesion areas is generally reliable. These findings indicate that DAFNet may be valuable in assisting clinical decision-making.

Figure 11 Detection results for the different YOLO methods. Red arrows indicate the positions of target nodules. DAF, downsampling attention fusion; YOLO, You Only Look Once.

Heatmaps

We employed gradient-weighted class activation mapping (Grad-CAM) to perform feature response analysis on YOLOv11 and our DAFNet, with the visualization results being presented in Figure 12. The heatmaps indicate that the feature responses of the reference model YOLOv11 were primarily concentrated in the core region of the lesion. Although it achieved general localization, there was noticeable spatial dispersion. In contrast, DAFNet exhibited relatively precise focus-response characteristics, with distinct activations localized primarily at the lung nodules, helping to reduce computational redundancy.

Figure 12 Visualization via heatmaps. (A) Original images. (B) Heatmaps of the YOLOv11. (C) Heatmaps of the proposed DAFNet. DAFNet, downsampling attention fusion network; YOLO, You Only Look Once.
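The core computation behind a gradient-weighted class activation map can be sketched in a few lines. This follows the standard formulation (channel weights from globally averaged gradients, a weighted sum of feature maps, then a ReLU); the layer that was visualized and the tooling used in the study are not stated, so the function name, the toy inputs, and the max-normalization step are assumptions. Nested lists are used to keep the example dependency-free.

```python
def grad_cam(activations, gradients):
    """Compute a normalized class activation map.

    activations, gradients: lists of K feature maps, each an H x W nested list.
    """
    # Channel weight = global average of that channel's gradient map.
    weights = [
        sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for grad in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Scale to [0, 1] for display (assumed convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2x2 feature maps with positive and negative gradients.
toy_activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(toy_activations, toy_gradients))
```

Locations supported mainly by negatively weighted channels are zeroed by the ReLU, so only regions that push the detection score upward remain highlighted.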

Ablation analysis

Performance under different attention modules

Inspired by YOLOv11n, we sequentially integrated content-aware reassembly of features (CARAFE), convolutional block attention module (CBAM), mixed local channel attention module (MLCA), and our GAM into the 20th layer of the YOLOv11 network. Table 5 presents the experimental results, which show that incorporating these modules improved precision in all cases and mAP in most cases, whereas recall improved only with GAM. Specifically, the introduction of GAM resulted in performance improvements of 2.2 percentage points in precision, 1.3 in recall, and 1.5 in mAP compared with the original YOLOv11n. Notably, the addition of GAM increased recall to 86.9%, higher than that of all the other attention modules. These favorable results can be attributed to the unique design of the GAM module. GAM effectively establishes both local and global dependencies through a channel-space approach, assigning attention weights to enhance pixel-level focus. Furthermore, the cascaded architecture enables interdimensional interactions, thereby enhancing the model’s ability to detect diseased regions.

Table 5

Comparison of attention modules in YOLOv11n on the augmented LCTD

Attention	Params (M)	Inference (ms)	Precision (%)	Recall (%)	mAP (%)
YOLOv11n	2.58	1.5	89.3±0.47	85.6±0.52	91.9±0.40
+ CARAFE (41)	2.72	2.2	92.1±0.45	85.3±0.50	92.1±0.37
+ CBAM (42)	2.60	1.6	89.6±0.50	85.1±0.47	91.9±0.45
+ MLCA (43)	2.81	1.4	91.3±0.46	84.0±0.51	92.0±0.41
+ GAM (Ours)	3.22	1.5	91.5±0.44	86.9±0.50	93.4±0.44

Data are presented as mean ± standard deviation unless otherwise stated. CARAFE, content-aware reassembly of features; CBAM, convolutional block attention module; GAM, global attention module; LCTD, lung computed tomography image dataset; mAP, mean average precision; MLCA, mixed local channel attention.

Performance under different downsampling modules

To validate the effectiveness of the proposed ADSM, we conducted ablation studies using YOLOv11n as the baseline model. Specifically, we replaced the native k=3, s=2 convolutional layers adjacent to the C3k2 blocks in both the backbone and neck with mainstream downsampling modules, including pooling with learned weights (W-Pool), octave convolution (OctConv), depthwise separable convolution (DSConv), and our proposed ADSM. As shown in Table 6, the model integrated with ADSM achieved optimal performance across three core metrics: precision, recall, and mAP. Compared to the original YOLOv11n, ADSM not only significantly improved detection accuracy but also effectively realized model lightweighting: the parameter count decreased from 2.58 to 2.10 M, and the inference latency was reduced from 1.5 to 1.3 ms. These results demonstrate that ADSM can efficiently extract critical features of lung nodules while successfully balancing model lightweighting and GPU-accelerated near-real-time performance.

Table 6

Comparison of downsampling modules in YOLOv11n on the augmented LCTD

Method	Params (M)	Inference (ms)	Precision (%)	Recall (%)	mAP (%)
YOLOv11n	2.58	1.5	89.3±0.47	85.6±0.52	91.9±0.40
+ W-Pool	2.61	1.6	91.3±0.48	87.1±0.51	93.4±0.41
+ OctConv	2.42	1.7	91.0±0.50	86.7±0.54	93.2±0.43
+ DSConv	2.35	1.4	90.5±0.49	86.3±0.53	92.8±0.42
+ ADSM (proposed)	2.10	1.3	93.1±0.43	88.4±0.48	94.6±0.37

Data are presented as mean ± standard deviation unless otherwise stated. ADSM, adaptive downsampling module; DSConv, depthwise separable convolution; LCTD, lung computed tomography image dataset; mAP, mean average precision; OctConv, octave convolution; W-Pool, pooling with learned weights.
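To give a sense of why replacing a standard k=3, s=2 convolution can reduce the parameter count, the sketch below compares the weight counts of a standard convolution and a depthwise separable convolution using the usual formulas. It does not model ADSM, W-Pool, or OctConv. The channel widths (64 to 128), the padding of 1, and the function names are assumptions chosen for illustration.

```python
def conv_params(c_in, c_out, k=3, bias=False):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def dsconv_params(c_in, c_out, k=3, bias=False):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def conv_output_size(size, k=3, s=2, p=1):
    """Spatial output size of a convolution; padding p=1 is an assumption."""
    return (size + 2 * p - k) // s + 1


# Hypothetical layer widths; the input size of 640 is taken from the text.
print(conv_params(64, 128), dsconv_params(64, 128), conv_output_size(640))
```

With a stride of 2, each such layer halves the spatial resolution, so the choice of downsampling operator affects every subsequent stage of the backbone and neck.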

Contribution of the modules to model performance

We conducted ablation studies to evaluate the impact of each module on lung nodule detection. Specifically, we compared (I) YOLOv11n in combination with the lightweight downsampling module (+ADSM); (II) YOLOv11n in combination with the GAM (+GAM); and (III) the proposed DAFNet model.

The results of the ablation studies are shown in Table 7 and can be summarized as follows: (I) Adding the ADSM module to YOLOv11n resulted in a significant reduction in the number of parameters, and inference time decreased by 0.2 ms per image. The quantitative scores achieved were 93.1%, 88.4%, and 94.6% for precision, recall, and mAP, respectively, improvements of 3.8, 2.8, and 2.7 percentage points over the original YOLOv11n. Thus, ADSM potentially enhances DAFNet’s performance, primarily by accentuating the main contours and textural features of the nodules; (II) Adding the GAM module to YOLOv11n resulted in quantitative scores of 91.5%, 86.9%, and 93.4% for precision, recall, and mAP, respectively, representing increases of 2.2, 1.3, and 1.5 percentage points compared to the original YOLOv11n. This ablation study demonstrates that GAM can refine feature representations and improve the localization of nodules; (III) Our full DAFNet model achieved the highest quantitative scores of 93.2%, 90.7%, and 96.3% for precision, recall, and mAP, respectively, demonstrating favorable detection performance. Overall, the ablation study clearly shows that each module contributes positively to optimizing DAFNet, with the best results observed when both modules are included.

Table 7

Ablation experiments with different modules on the augmented LCTD

Method	Params (M)	Inference (ms)	Precision (%)	Recall (%)	mAP (%)
YOLOv11n	2.58	1.5	89.3±0.47	85.6±0.52	91.9±0.40
+ ADSM	2.10	1.3	93.1±0.43	88.4±0.48	94.6±0.37
+ GAM	3.22	1.5	91.5±0.44	86.9±0.50	93.4±0.44
DAFNet	2.51	1.5	93.2±0.43	90.7±0.49	96.3±0.36

Data are presented as mean ± standard deviation unless otherwise stated. ADSM, adaptive downsampling module; DAFNet, downsampling attention fusion network; GAM, global attention module; LCTD, lung computed tomography image dataset; mAP, mean average precision.
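The gains quoted for each configuration follow directly from the mean values in Table 7. The sketch below reproduces that arithmetic and also compares the combined gain with the sum of the two single-module gains; the names `ABLATION` and `gains_over_baseline` are illustrative, and the standard deviations are ignored.

```python
# (precision %, recall %, mAP %): mean values copied from Table 7.
ABLATION = {
    "YOLOv11n": (89.3, 85.6, 91.9),
    "+ ADSM": (93.1, 88.4, 94.6),
    "+ GAM": (91.5, 86.9, 93.4),
    "DAFNet": (93.2, 90.7, 96.3),
}


def gains_over_baseline(name, baseline="YOLOv11n"):
    """Percentage-point gain of a configuration over the baseline."""
    return tuple(round(v - b, 1) for v, b in zip(ABLATION[name], ABLATION[baseline]))


adsm = gains_over_baseline("+ ADSM")
gam = gains_over_baseline("+ GAM")
full = gains_over_baseline("DAFNet")
summed = tuple(round(a + g, 1) for a, g in zip(adsm, gam))

for label, values in [("ADSM", adsm), ("GAM", gam), ("ADSM + GAM sum", summed), ("DAFNet", full)]:
    print(label, values)
```

On these mean values, the full model's recall and mAP gains exceed the sum of the single-module gains, whereas its precision gain does not, so the benefit of combining the modules shows up mainly in recall and mAP.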

Discussion

We developed a novel model for lung nodule detection, which extends the YOLOv11n architecture through the additional integration of ADSM and GAM. The improvement in performance over the baseline can be mainly attributed to the synergy of these two modules. We first designed an efficient dual-branch parallel downsampling module, which can more effectively preserve the spatial details of low-level features and the semantic information of high-level features. This is particularly important for detecting nodules with blurred edges, low contrast, or small size. We further incorporated the channel-space collaborative GAM, which enhances information-rich channels and spatial regions to suppress the interference of irrelevant anatomical structures on nodule detection.

In terms of module selection, the ablation study demonstrated that the incorporation of ADSM enables the model to surpass the benchmark YOLOv11n in key performance indicators, including parameter count, inference speed, precision, recall, and mAP. Notably, GAM exhibited superior performance compared to the other attention mechanisms with respect to both recall and mAP. After a careful balance was achieved between model complexity and detection accuracy, GAM was selected as the optimal attention module. The proposed ADSM and GAM components were integrated into the YOLOv11n framework, and extensive experiments were conducted on both the augmented LCTD and the LUNA16 dataset to comprehensively evaluate the model’s performance in lung nodule detection. Comparison with state-of-the-art models indicated that the proposed approach can provide a comprehensively superior performance in lung nodule detection tasks.

It is worth noting that through the qualitative analysis of the detected erroneous samples, we found that micronodules with a diameter of less than 3 mm were prone to being missed due to their particularly subtle characteristics being easily masked by background noise. Meanwhile, lesions close to the pleural boundary were susceptible to positioning deviations or false-positive detections. These issues reflect the high complexity of lung nodules in terms of morphology, location, and the CT image background and constitute a clear direction for the subsequent optimization and improvement of the model.

Despite the promising results of our DAFNet in lung nodule detection, there are several limitations in our study that should be addressed. First, although consensus-based annotation by experienced physicians was applied to ensure labeling quality, formal quantitative interrater reliability metrics were not computed. Second, the self-constructed dataset has a relatively small sample size, and thus the model may not be generalizable to other populations. Third, the evaluation protocol on the LUNA16 dataset differs from the official FROC framework, limiting direct comparability with the results reported in other studies. Fourth, the proposed DAFNet only performs nodule detection and does not integrate false-positive reduction or malignancy grading, and its decision-making process lacks interpretability—factors which may hinder its clinical acceptance and practical deployment. Additionally, our study mainly focused on the structural innovation and performance validation of the detection model; thus, practical clinical deployment requires further validation with multicenter data, strict data privacy protection mechanisms, and comprehensive clinical evaluation (44,45).

For future work, we plan to address these limitations in several ways. First, we will quantify interannotator agreement to further improve the reliability of annotations. Second, we will expand the number of high-quality annotated samples and validate the model on a greater diversity of external datasets to improve generalizability. Third, we will conduct standard FROC analysis on the LUNA16 dataset to enable fair and direct comparisons with state-of-the-art methods. Finally, we will investigate interpretable deep learning techniques, including visualization tools and interpretable model designs, to improve transparency and promote clinical trust.

Conclusions

We propose a novel deep learning model, DAFNet, designed for the high-precision and efficient detection of lung nodules in CT images. By integrating the ADSM mechanism and the GAM module, the model enhances the representation capability of multiscale features and improves sensitivity to subtle nodule structures. Experimental results on the public LUNA16 dataset and a self-constructed clinical dataset demonstrate that DAFNet surpasses state-of-the-art detection models in terms of mAP while providing favorable generalization and robustness. Moreover, the model can achieve GPU-accelerated near-real-time inference speed with relatively low computational complexity.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the STARD-AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0050/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0050/dss

Funding: This work was partially supported by the Science and Technology Research Project of Henan Province (Nos. 242102211059 and 262102211038).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2026-1-0050/coif). All authors report that this work was partially supported by the Science and Technology Research Project of Henan Province (Nos. 242102211059 and 262102211038). M.L. is a current employee of Central Hospital of Jiaozuo Coal Group Co. Ltd. The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Ethical approval was obtained from the Ethics Committee of Jiaozuo Coal Industry (Group) Co. Ltd. Central Hospital (No. KYXM202401002). The requirement for written informed consent was waived due to the retrospective nature of this study.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Kikano GE, Fabien A, Schilz R. Evaluation of the Solitary Pulmonary Nodule. Am Fam Physician 2015;92:1084-91.
Fu X, Bi L, Kumar A, Fulham M, Kim J. An attention-enhanced cross-task network to analyse lung nodule attributes in CT images. Pattern Recognition 2022;126:108576.
Dadsetan S, Arefan D, Berg WA, Zuley ML, Sumkin JH, Wu S. Deep learning of longitudinal mammogram examinations for breast cancer risk prediction. Pattern Recognit 2022;132:108919.
Wang TW, Wang CK, Hong JS, Chao HS, Chen YM, Wu YT. Deep Learning in Thoracic Oncology: Meta-Analytical Insights into Lung Nodule Early-Detection Technologies. Cancers (Basel) 2025;17:621.
Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, Fagerstrom RM, Gareen IF, Gatsonis C, Marcus PM, Sicks JD. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365:395-409.
Kim J, Kim KH. Role of chest radiographs in early lung cancer detection. Transl Lung Cancer Res 2020;9:522-31.
Shakeel PM, Desa MI, Burhanuddin MA. Retracted article: Improved watershed histogram thresholding with probabilistic neural networks for lung cancer diagnosis for cbmir systems. Multimedia tools and applications 2019;79:17115-33.
Grossi F, Le V, Baudot P, Voyton C, Francis D, Munoz E, Huet B. 919p artificial intelligence supporting lung cancer screening: Computer aided diagnosis of lung lesions driven by morphological feature extraction. Ann Oncol 2022;33:S967.
Shakeel PM, Burhanuddin MA, Desa MI. Lung cancer detection from CT image using improved profuse clustering and deep learning instantaneously trained neural networks. Measurement 2019;145:702-12.
Kareem HF, AL-Husieny MS, Mohsen FY, Khalil EA, Hassan ZS. Evaluation of svm performance in the detection of lung cancer in marked CT scan dataset. Indonesian J Elec Eng & Comp Sci 2021;21:1731-8.
Wu S, Wang J. Pulmonary nodules 3D detection on serial CT scans. 2012 Third Global Congress on Intelligent Systems 2012:257-60.
Khan SA, Nazir M, Khan MA, Saba T, Javed K, Rehman A, Akram T, Awais M. Lungs nodule detection framework from computed tomography images using support vector machine. Microsc Res Tech 2019;82:1256-66.
Aung TMM, Khan AA. Enhanced U-Net with Attention Mechanisms for Improved Feature Representation in Lung Nodule Segmentation. Curr Med Imaging 2025;21:e15734056386382.
Cheltha JN, Sharma C, Prashar D, Khan AA, Kadry S. Enhanced human motion detection with hybrid RDA-WOA-based RNN and multiple hypothesis tracking for occlusion handling. Image and Vision Computing 2024;150:105234.
Zubair M, Md Rais HB, Ullah F, Al-Tashi Q, Faheem M, Ahmad Khan A. Enabling predication of the deep learning algorithms for low-dose CT scan image denoising models: A systematic literature review. IEEE Access 2024;12:79025-50.
Zamanidoost Y, Ould-Bachir T, Martel S. OMS-CNN: Optimized Multi-Scale CNN for Lung Nodule Detection Based on Faster R-CNN. IEEE J Biomed Health Inform 2025;29:2148-60.
Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell 2017;39:1137-49.
Xavier AI, Villavicencio C, Macrohon JJ, Jeng JH, Hsieh JG. Object detection via gradient-based Mask R-CNN using machine learning algorithms. Machines 2022;10:340.
Xu J, Ren H, Cai S, Zhang X. An improved faster R-CNN algorithm for assisted detection of lung nodules. Comput Biol Med 2023;153:106470.
Su Y, Li D, Chen X. Lung Nodule Detection based on Faster R-CNN Framework. Comput Methods Programs Biomed 2021;200:105866.
Wang K, Wang Y, Zhang S, Tian Y, Li D. SLMS-SSD: Improving the balance of semantic and spatial information in object detection. Expert Systems with Applications 2022;206:117682.
Chen L, Zhou Y, Xu S. ERetinaNet: An Efficient Neural Network Based on RetinaNet for Mammographic Breast Mass Detection. IEEE J Biomed Health Inform 2024;28:2866-78.
Khosravan N, Bagci U. S4ND: Single-shot single-scale lung nodule detection. Med Image Comput Comput Assist Interv 2018;794-802.
Tang C, Zhou F, Sun J, Zhang Y. Lung-YOLO: Multiscale feature fusion attention and cross-layer aggregation for lung nodule detection. Biomedical Signal Processing and Control 2024;99:106815.
Mei S, Jiang HQ, Ma L. YOLO-lung: A Practical Detector Based on Imporved YOLOv4 for Pulmonary Nodule Detection. 2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) 2021:1-6.
Wu X, Zhang H, Sun J, Wang S, Zhang Y. YOLO-MSRF for lung nodule detection. Biomedical Signal Processing and Control 2024;94:106318.
Ma Z, Wu F. Research on Pulmonary Nodule Detection Method Based on YOLO Algorithm. 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT); Nanjing, China; 2024:635-9.
Tang C, Zhou F, Sun J, Zhang Y. Circle-YOLO: an anchor-free lung nodule detection algorithm using bounding circle representation. Pattern Recognition 2024;161:111294.
Zhang Y, Zhou B, Zhao X, Song X. Enhanced object detection in low-visibility haze conditions with YOLOv9s. PLoS One 2025;20:e0317852.
Li Y, Yang W, Wang L, Tao X, Yin Y, Chen D. HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery. Drones 2024;8:713.
Gao P, Li H, Wang F. Enhanced YOLO11 for tiny object detection based on multi-scale information interaction and fusion in UAV aerial images. Journal of Computational Design and Engineering 2026;13:97-113.
Chen Y, Zhang C, Chen B, Huang Y, Sun Y, Wang C, Fu X, Dai Y, Qin F, Peng Y, Gao Y. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput Biol Med 2024;170:107917.
Zhang H, Huang SL, Kuruoglu EE. Classification and recognition of brain tumors from MRI images based on improved YOLOv6. Biomedical Signal Processing and Control 2025;113:109048.
Liu Y, Ao Y. Deformable attention mechanism-based YOLOv7 structure for lung nodule detection. The Journal of Supercomputing 2024;80:25450-69.
Wang X, Wu H, Wang L, Chen J, Li Y, He X, Chen T, Wang M, Guo L. Enhanced pulmonary nodule detection with U-Net, YOLOv8, and swin transformer. BMC Med Imaging 2025;25:247.
Zhang T, Han Z, Xu H, Zhang B, Ye Q. CircleNet: Reciprocating Feature Adaptation for Robust Pedestrian Detection. IEEE Transactions on Intelligent Transportation Systems 2020;21:4593-604.
Xue Y, Zhou J. Multiple pedestrian tracking under first-person perspective using deep neural network and social force optimization. Optik 2021;240:166981.
Luo X, Song T, Wang G, Chen J, Chen Y, Li K, Metaxas DN, Zhang S. SCPM-Net: An anchor-free 3D lung nodule detection network using sphere representation and center points matching. Med Image Anal 2022;75:102287.
Zhao D, Liu Y, Yin H, Wang Z. A novel multi-scale CNNs for false positive reduction in pulmonary nodule detection. Expert Systems with Applications 2022;207:117652.
Zhu X, Wang X, Shi Y, Ren S, Wang W. Channel-Wise Attention Mechanism in the 3D Convolutional Network for Lung Nodule Detection. Electronics 2022;11:1600.
Wang J, Chen K, Xu R, Liu Z, Loy CC, Lin D. CARAFE++: Unified Content-Aware ReAssembly of FEatures. IEEE Trans Pattern Anal Mach Intell 2022;44:4674-87.
Zhang Y, Feng W, Wu Z, Li W, Tao L, Liu X, Zhang F, Gao Y, Huang J, Guo X. Deep-Learning Model of ResNet Combined with CBAM for Malignant-Benign Pulmonary Nodules Classification on Computed Tomography Images. Medicina (Kaunas) 2023;59:1088.
Zhang W, Chen G, Zhuang P, Zhao W, Zhou L. CATNet: Cascaded attention transformer network for marine species image classification. Expert Systems with Applications 2024;256:124932.
Kujur A, Raza Z, Khan AA, Wechtaisong C. Data Complexity Based Evaluation of the Model Dependence of Brain MRI Images for Classification of Brain Tumor and Alzheimer’s Disease. IEEE Access 2022;10:112117-33.
Khan AA, Driss M, Boulila W, Sampedro GA, Abbas S, Wechtaisong C. Privacy Preserved and Decentralized Smartphone Recommendation System. IEEE Transactions on Consumer Electronics 2023;70:4617-24.

Cite this article as: Li L, Wen X, Li M, Wang X, Feng X, Lv X. Downsampling attention fusion network (DAFNet): a You Only Look Once network for lung nodule detection. Quant Imaging Med Surg 2026;16(7):532. doi: 10.21037/qims-2026-1-0050