X-Pruning: a dual-stream information fusion mammography diagnosis network based on pruned transformer and cross-attention mechanism

Xiaoyi Dai; Xin Pan; Shasha Zeng; Qi Long; Zebing Liao; Sisi Zou; Yexuan Xing; Zongxiu Yu; Yuqing Hu; Xiao Luo

doi:10.21037/qims-2025-aw-2486

Original Article

Xiaoyi Dai¹#, Xin Pan²#, Shasha Zeng³, Qi Long², Zebing Liao², Sisi Zou², Yexuan Xing², Zongxiu Yu², Yuqing Hu², Xiao Luo²

¹Medical Physics Graduate Program, Duke Kunshan University, Kunshan, China; ²Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, Shenzhen, China; ³First Clinical Medical College, Jinan University, Guangzhou, China

Contributions: (I) Conception and design: X Dai, X Pan; (II) Administrative support: X Luo; (III) Provision of study materials or patients: S Zeng, Q Long; (IV) Collection and assembly of data: Z Liao, S Zou; (V) Data analysis and interpretation: Y Xing, Z Yu, Y Hu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work as co-first authors.

Correspondence to: Xiao Luo, Master. Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, 3002 SunGangXi Road, Shenzhen 518035, China. Email: 124918294@qq.com.

Background: Breast cancer remains the most prevalent malignancy among women globally. While early detection through mammography is crucial for improving survival rates, automating this process poses significant computational challenges. Specifically, detecting small malignant lesions within high-resolution images and effectively integrating complementary information from standard craniocaudal (CC) and mediolateral oblique (MLO) views are difficult tasks. Therefore, the aim of this study is to develop a novel, highly efficient dual-view mammography diagnostic network, termed X-Pruning, designed to overcome these computational and integration challenges.

Methods: The proposed X-Pruning framework addresses multi-view mammography by combining pruned transformer blocks (PTBs) with cross-attention fusion. The model employs parallel PTBs to process standard CC and MLO views. By implementing a dynamic pruning strategy based on window importance within the transformer layers, the network focuses computational resources specifically on suspicious regions. This reduces the heavy burden of self-attention calculations while strictly preserving full-image scale information to enhance the detection of small lesions. Additionally, a novel cross-attention fusion module was developed and integrated into the network to facilitate interactive information exchange between the CC and MLO views, enabling comprehensive multi-view feature integration.

Results: The X-Pruning framework was evaluated on two widely recognized public datasets: Vindr-Mammo and Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). Experimental results demonstrated that the proposed model consistently outperformed several mainstream baseline architectures. Specifically, X-Pruning achieved area under the curve (AUC) scores of 0.812 on the Vindr-Mammo dataset and 0.792 on the CBIS-DDSM dataset. Furthermore, these performance improvements were achieved while significantly reducing overall computational demands compared to standard models.

Conclusions: The X-Pruning network successfully resolves the trade-off between high diagnostic accuracy and computational efficiency in automated mammogram analysis. By intelligently allocating computational resources to critical lesion areas and effectively fusing cross-view information, this framework demonstrates robust diagnostic capabilities. These advancements highlight the clinical potential of X-Pruning to serve as an efficient, automated tool for enhancing the early detection and diagnosis of breast cancer.

Keywords: Artificial intelligence (AI); breast cancer; deep learning; mammography; medical imaging

Submitted Nov 20, 2025. Accepted for publication Apr 29, 2026. Published online Jun 09, 2026.

Introduction

Breast cancer remains the most prevalent malignancy and the leading cause of cancer-related mortality among women worldwide, posing a severe public health burden (1,2). Early detection is paramount, as it significantly improves prognosis and survival rates (3). Mammography is currently the gold standard for population-based screening (4). However, manual interpretation is labor-intensive and error-prone, with sensitivity often compromised by dense breast tissue and subtle lesion characteristics (5). To mitigate these challenges, computer-aided detection (CAD) systems based on deep learning have emerged as powerful assistive tools (6-8).

Despite their potential, applying deep learning to high-resolution mammography presents unique computational and architectural hurdles. Standard detection models often struggle with the extreme resolution of mammograms [e.g., 4096×3032 pixels in the VinDr-Mammo dataset (9)], where downsampling leads to information loss and patch-based methods may lose global context. Moreover, accurate diagnosis relies on correlating information across the two standard views [craniocaudal (CC) and mediolateral oblique (MLO)], a process that comes naturally to radiologists but is challenging for standard convolutional neural networks (CNNs) (10,11).

Recent studies have demonstrated the versatility of advanced deep learning architectures, particularly Transformers, across various medical imaging tasks. For instance, recent works in 2025 have successfully applied novel Transformer-based or hybrid variants to histopathology, skin cancer, prostate grading, and calcification classification (12-16). While these methods show great promise in feature extraction, adapting the self-attention mechanism to the specific multi-view, high-resolution constraints of mammography—without incurring prohibitive computational costs—remains an open research gap.

Existing approaches like globally-aware multiple instance classifier (GMIC) (11) utilize varying resolutions but process views independently. While cross-attention mechanisms have been explored in medical image fusion and reconstruction (17,18), they are rarely optimized for the specific spatial correspondence required in multi-view mammography.

To overcome these limitations, we propose the Dual-Stream Pruned Transformer with Cross-Attention (X-Pruning), a novel architecture designed to combine adaptive computation with multi-view reasoning. Recent studies published in 2024 and 2025 have underscored the potential of advanced deep learning architectures in oncological imaging, validating the efficacy of artificial intelligence (AI)-driven systems in reducing false positives (12-15). Building on the efficiency of token pruning, our approach skips redundant tokens to focus computation on regions of interest. Furthermore, we address the challenge of view alignment by incorporating a cross-attention mechanism. While cross-attention has proven effective in other medical imaging tasks such as volumetric reconstruction and registration (16,17), our framework adapts it to facilitate interactive information exchange between paired mammographic views.

The primary contributions of this work are as follows:

We introduce a Pruning-based Transformer backbone that effectively filters out background noise from high-resolution mammograms, concentrating computational resources on potential lesion areas.
We propose a Dual-View Cross-Attention Fusion module that captures complex dependencies between CC and MLO views, surpassing the performance of standard feature concatenation.
We present an end-to-end framework, X-Pruning, which achieves superior diagnostic accuracy (ACC) and computational efficiency, validating its potential for clinical application.

We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2486/rc).

Methods

Overall framework

As illustrated in Figure 1, the overall architecture of the proposed early screening model for mammography can be formulated as follows.

Figure 1 Model architecture diagram. The patch embedding layer uses a 4×4 patch size, producing an initial feature map with 96 channels. CTB, cross-attention transformer block; FFN, feed-forward network; LCC, LMLO; MLP, multilayer perceptron; PTB, pruned transformer block; STN.

First, mammography images of the ipsilateral breast from the two views, denoted I_cc and I_mlo, are used as inputs to the model. The dual-view (DV) inputs are processed by independent streams within the network. In each stream, the first two stages consist of n pruned transformer blocks (PTBs) initialized with pre-trained weights, where n is a hyperparameter; the PTB is described in detail below. These two stages are frozen so that the network retains robust feature extraction when processing high-resolution views.

The subsequent stage is also initialized with pre-trained weights, but its parameters are not frozen and are updated through backpropagation.

This stage is immediately followed by a cross-attention transformer block (CTB), designed to extract and integrate fused feature representations through DV interactions, as detailed in the subsequent sections.

The fused features are then fed into a final PTB in each stream, yielding the final feature representation of each view.

The two final feature sets are then input into a Fusion module for final classification.
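The stage ordering described above can be summarized as a small configuration sketch. It is illustrative only: the `Stage` class, the stage names, and the `per_view` flag are hypothetical and do not come from the authors' code; only the ordering and the frozen/trainable status follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # hypothetical label, used only in this sketch
    kind: str        # "PTB", "CTB", or "FUSION"
    per_view: bool   # True if the stage runs separately on the CC and MLO streams
    trainable: bool  # False means the weights are frozen during training


def build_stage_plan():
    """Stage order as described in the text: two frozen PTB stages,
    one trainable PTB stage, the CTB, a final PTB, then the Fusion module."""
    return [
        Stage("stage1", "PTB", per_view=True, trainable=False),
        Stage("stage2", "PTB", per_view=True, trainable=False),
        Stage("stage3", "PTB", per_view=True, trainable=True),
        Stage("cross", "CTB", per_view=False, trainable=True),
        Stage("stage4", "PTB", per_view=True, trainable=True),
        Stage("fusion", "FUSION", per_view=False, trainable=True),
    ]


def trainable_stage_names(plan):
    """Names of the stages whose parameters receive gradient updates."""
    return [s.name for s in plan if s.trainable]
```

In a real implementation the `trainable` flag would typically be applied by disabling gradients for the parameters of the frozen stages.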

Network structure details

Pruning-based transformer block

PTB: PTB is based on the window-based design of Swin-Transformer.

Window scoring: the feature map is first partitioned into non-overlapping windows. To evaluate the informational value of each window, we compute the L2 norm of the activation vector for every constituent token. These token-level magnitudes are then averaged to yield a single importance score for the entire window. The L2 activation magnitude is cheap to compute, requires no additional learned parameters, and has proven more robust than more complex learning-based criteria.
Top-K selection and propagation: based on a preset pruning rate r, all windows are ranked by their importance scores, and the top K windows are selected. Standard window multi-head self-attention (W-MHSA) and the subsequent feed-forward network (FFN) are exclusively executed on these selected high-information windows. For the unselected (pruned) windows, a ‘replication’ strategy is employed. Instead of discarding them, their projection features from the preceding layer are directly copied to the corresponding spatial positions in the current layer.
Structural consistency and stability: this replication strategy perfectly preserves the spatial structure of the feature map, ensuring that the tensor shape remains strictly consistent for subsequent layers. Furthermore, it retains coarse contextual and boundary information without introducing any significant computational overhead. To ensure optimization stability, the pruning mask of the PTB is smoothly updated via a warm-up strategy (layer-by-layer or epoch-by-epoch) during the training phase, and is strictly fixed during the validation and inference phases.
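The three steps above (window scoring, top-K selection, and replication of pruned windows) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, K is assumed to equal the number of windows multiplied by (1 − r), and `transform` is a stand-in for W-MHSA followed by the FFN.

```python
import math


def window_scores(windows):
    """Mean L2 norm of the token vectors in each window.

    `windows` is a list of windows; each window is a list of tokens;
    each token is a list of floats (its activation vector).
    """
    scores = []
    for tokens in windows:
        norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
        scores.append(sum(norms) / len(norms))
    return scores


def select_top_k(scores, pruning_rate):
    """Indices of the windows kept after pruning a fraction `pruning_rate`.

    Assumption: K = round(N * (1 - r)), with at least one window kept.
    """
    k = max(1, round(len(scores) * (1.0 - pruning_rate)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def pruned_block(windows, pruning_rate, transform):
    """Apply `transform` to the kept windows only.

    Pruned windows are copied through unchanged, so the output has the
    same number of windows, in the same order, as the input.
    """
    keep = set(select_top_k(window_scores(windows), pruning_rate))
    return [transform(w) if i in keep else w for i, w in enumerate(windows)]
```

Because pruned windows are passed through rather than dropped, the output keeps the spatial layout of the input, which is the property the replication strategy is meant to guarantee.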

The inference time reported in this study represents the average latency for a single image, measured over n=5 trials on an NVIDIA RTX 3090 GPU with a batch size of 1 and an input size of 2688×896.
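A latency measurement of this kind can be sketched as follows. The helper name `mean_latency_ms` and the warm-up run are assumptions for illustration; the source specifies only that latency is averaged over five trials.

```python
import time


def mean_latency_ms(fn, n_trials=5, warmup=1):
    """Average wall-clock time of `fn()` in milliseconds over `n_trials` runs.

    `warmup` untimed calls are made first (an assumption of this sketch)
    so that one-off start-up costs do not inflate the average.
    """
    for _ in range(warmup):
        fn()
    elapsed = []
    for _ in range(n_trials):
        start = time.perf_counter()
        fn()
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed) / len(elapsed)
```

When timing GPU inference, the device must be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous; otherwise the measured time can be misleadingly short.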

As shown in Figure 2A, multiple iterations of feature pruning attention will create a hierarchy of importance, providing the network with more focused feature representations. Figure 2B,2C represent the variations of feature information in a multi-level PTB structure. Figure 2B focuses on the representation of hierarchical results, while Figure 2C emphasizes the overall changes in feature information.

Figure 2 Schematic diagram of the pruning attention principle. (A) Initial window-based importance ranking. (B) Hierarchical selection of the top-K windows. (C) Feature map reconstruction after pruning. Colored contours represent the accumulation of attention selections across layers. Importance increases from gray (unselected) to yellow (1 selection), blue (2 selections), and red (3 selections).

Cross-attention transformer block

The dual-stream architecture with cross-attention fusion lets the network relate the two views and thereby obtain more robust, complementary information. Different mammographic views from the same patient contain distinct and often complementary information, and the dual-stream design fuses them interactively. Rather than relying solely on end-stage fusion, cross-attention fusion in the intermediate layers both facilitates information exchange and allows the model to optimize on the fused features at an earlier stage.

The crossing block acts as a cross-view feature enhancement mechanism rather than a simple concatenation. In our design, features from the CC view serve as queries (Q), while features from the MLO view serve as keys (K) and values (V) (and vice-versa in a parallel stream). This allows the network to dynamically weight and retrieve relevant features from the complementary view based on the content of the current view. Even if the salient regions are spatially disjoint or different across views, the attention mechanism computes the similarity between a specific window in the CC view and all windows in the MLO view globally, assigning high attention weights to semantically related features. This process aligns spatially related information before final fusion, ensuring that findings are corroborated across projections rather than processed in isolation.
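The query/key/value roles described above can be illustrated with a minimal scaled dot-product cross-attention in plain Python. This sketch deliberately omits the learned Q/K/V projections, multiple heads, residual connections, and normalization that a transformer block would normally include; the function names are hypothetical, and each view is represented as a list of per-window feature vectors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one view and
    keys/values from the other view.

    Returns (outputs, weights); weights[i][j] is how strongly query
    window i attends to key window j, over all windows of the other view.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(logits)
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def bidirectional_cross_attention(cc, mlo):
    """Run both directions: CC queries MLO, and MLO queries CC."""
    cc_enhanced, _ = cross_attention(cc, mlo, mlo)
    mlo_enhanced, _ = cross_attention(mlo, cc, cc)
    return cc_enhanced, mlo_enhanced
```

Each query window is compared with every window of the other view, so a CC window can draw on a semantically similar MLO window regardless of where that window lies spatially.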

Model training details

In this study, we conducted a detailed timing analysis of the training and inference processes for our intelligent breast diagnosis model. The experimental setup is as follows: we utilized a single NVIDIA RTX 3090 GPU with CUDA version 12.1, a batch size of 4, and input image dimensions of 2688×896. We employed an 8:2 split for training and testing across both the VinDr-Mammo and CBIS-DDSM datasets. Notably, no data augmentation was utilized; images were solely resized and normalized. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

During model training, the first two PTB stages were frozen, while the remaining layers were fully trainable. The rationale for freezing the first two stages is twofold. First, early layers in deep networks typically learn generic low-level features (e.g., edges and textures) that are transferable across domains. Freezing these layers preserves the robust feature extraction capabilities acquired during large-scale pre-training (ImageNet). Second, given the relatively small size of mammography datasets compared with natural image datasets, freezing a portion of the parameters acts as a regularization technique, helping to prevent overfitting and stabilizing the optimization of the subsequent task-specific pruning and fusion modules. Regarding dataset organization, the 8:2 split was strictly performed at the patient level to prevent data leakage, ensuring that images from the same patient did not appear in both sets. Furthermore, the CC and MLO views were consistently paired for each breast side during input to leverage multi-view correlations. Finally, to address the inherent class imbalance in the dataset, we employed a weighted cross-entropy loss function during training.
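As an illustration of the weighted cross-entropy loss mentioned above, the following sketch computes class weights and the resulting loss in plain Python. The inverse-frequency weighting scheme is an assumption, since the source does not state how the class weights were chosen, and the function names are hypothetical.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Class weights proportional to 1 / class frequency (assumed scheme)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of the per-sample cross-entropy.

    logits: list of per-class score lists; labels: list of int class ids.
    The sum is normalized by the total weight of the samples, which is
    also how torch.nn.CrossEntropyLoss behaves with a `weight` tensor
    and the default 'mean' reduction.
    """
    total, weight_sum = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
        nll = log_sum_exp - z[y]          # negative log-likelihood of class y
        total += class_weights[y] * nll
        weight_sum += class_weights[y]
    return total / weight_sum
```

With three benign cases and one malignant case, for example, the malignant class receives three times the weight of the benign class, so errors on the minority class contribute more to the loss.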

Results

Datasets

VinDr-Mammo

The Vindr-Mammo (16) dataset is an extensive compilation of digital mammography images, meticulously annotated to aid in the enhancement of breast cancer detection and diagnosis using machine learning techniques. It comprises thousands of images from a wide range of populations, complete with detailed annotations that include lesion types, Breast Imaging-Reporting and Data System (BI-RADS) categories, and exact lesion locations. This dataset is crafted to facilitate the creation of reliable AI models by offering a diverse array of cases, encompassing both normal and abnormal findings, thereby improving the generalizability and precision of diagnostic algorithms.

CBIS-DDSM

The CBIS-DDSM (19) dataset is a prominent resource in breast cancer research. It contains 1,645 digitized mammographic images, each with thorough annotations that detail lesion regions of interest, lesion types such as calcifications and masses, breast density, and BI-RADS categories. Notably, the labels indicating whether lesions are benign or malignant are confirmed through pathological examination. Due to its comprehensive nature and balanced data distribution, this dataset serves as a standard benchmark for assessing the performance of AI models in mammography-related studies. Nonetheless, the dataset was collected in earlier years, which presents certain limitations in terms of both its quantity and quality.

Evaluation metrics

In breast cancer early screening models, several evaluation metrics are commonly used to assess the performance of the classification models. Here are the definitions and significance of each metric along with their respective formulas:

Area under the curve (AUC): serving as a comprehensive evaluation metric for benign-malignant classification, the AUC quantifies the two-dimensional area beneath the receiver operating characteristic (ROC) curve. The ROC graph is constructed by mapping sensitivity (true positive rate) along the vertical y-axis against 1 − specificity (false positive rate) on the horizontal x-axis. By integrating performance over all potential decision thresholds, the AUC reflects the model’s inherent discriminative capability. An optimal classifier yields an AUC of 1.0, whereas a value of 0.5 indicates predictive performance no better than random chance.

$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d(\mathrm{FPR})$ [1]
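In practice the integral in Eq. [1] is evaluated numerically from a finite set of predictions. The following sketch sweeps the decision threshold from high to low and integrates the ROC curve with the trapezoidal rule; the function name `roc_auc` is illustrative, and tied scores are handled by moving them across the threshold together.

```python
def roc_auc(labels, scores):
    """AUC by trapezoidal integration of the ROC curve.

    labels: 1 = malignant (positive), 0 = benign (negative).
    scores: predicted score for the positive class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")

    # Sort by descending score; each distinct score is one threshold step.
    order = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and order[j][0] == order[i][0]:
            tp += order[j][1]
            fp += 1 - order[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid area
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, this returns 0.75; a classifier that gives every case the same score yields 0.5, matching the random-chance interpretation.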

ACC: this metric represents the global correctness of the diagnostic model. It is calculated as the ratio of all correctly identified cases—encompassing both true malignant (true positives) and true benign (true negatives) predictions—to the entire evaluated cohort. Consequently, ACC provides a direct reflection of the overall probability that the network’s classification aligns with the actual ground truth.

$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$ [2]

where TP, TN, FP and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.

F1 score: the F1 score is the harmonic mean of precision and recall, where precision = TP/(TP + FP) and recall = TP/(TP + FN). It therefore takes both false positives and false negatives into account and is particularly useful when the class distribution is uneven.

$\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ [3]
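Eqs. [2] and [3] can be computed directly from the four confusion-matrix counts. The sketch below is illustrative (the function name and the convention of returning 0.0 for an undefined ratio are assumptions), with precision defined as TP/(TP + FP) and recall as TP/(TP + FN).

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "acc": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With hypothetical counts TP = 8, TN = 80, FP = 2, and FN = 10, the accuracy is 0.88 while the F1 score is about 0.571, which shows why F1 is more informative than accuracy when malignant cases are the minority.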

We analyze the performance of our early breast cancer screening model using classification metrics such as AUC to evaluate the final benign or malignant classification performance. These metrics collectively provide a comprehensive evaluation of the performance of breast cancer screening models, helping to understand their strengths and weaknesses in various aspects of classification.

Comparative experiment

Table 1 presents the performance of various models on the VinDr-Mammo and CBIS-DDSM datasets, comparing single-view (SV) and DV approaches. The reported metrics are AUC, ACC, F1 score, sensitivity, specificity, inference time (ms), and P value; parameter counts (in millions) are reported in Tables 2 and 3. To rigorously evaluate the generalization capability and robustness of our model, we employed a patient-level 5-fold cross-validation protocol on both the VinDr-Mammo and CBIS-DDSM datasets. Specifically, the patients were randomly divided into five non-overlapping subsets. In each fold, 80% of the patients were used for training and 20% for testing. The final reported metrics represent the average performance across all five folds.
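The patient-level 5-fold protocol can be sketched as follows. The function name, the seed, and the round-robin assignment of shuffled patients to folds are assumptions for illustration; the essential point is that folds are formed over patient identifiers rather than over individual images.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) image-index lists for each fold.

    `patient_ids[i]` is the patient that image i belongs to. Patients,
    not images, are assigned to folds, so no patient appears in both
    the training and test sets of the same fold.
    """
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    for f in range(n_folds):
        test = [i for i, pid in enumerate(patient_ids) if fold_of[pid] == f]
        train = [i for i, pid in enumerate(patient_ids) if fold_of[pid] != f]
        yield train, test
```

Splitting at the image level instead would allow the CC and MLO views of the same breast to fall on opposite sides of the split, which is the leakage the patient-level protocol is designed to avoid.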

Table 1

Performance of each model on two datasets

Method	VinDr-Mammo AUC (95% CI)	VinDr-Mammo ACC, mean ± SD	VinDr-Mammo F1, mean ± SD	VinDr-Mammo Sens., mean ± SD	VinDr-Mammo Spec., mean ± SD	CBIS-DDSM AUC (95% CI)	CBIS-DDSM ACC, mean ± SD	CBIS-DDSM F1, mean ± SD	CBIS-DDSM Sens., mean ± SD	CBIS-DDSM Spec., mean ± SD	Time (ms)	P value
SV-Res18 (20)	0.724 (0.710–0.738)	0.778±0.021	0.815±0.024	0.801±0.026	0.762±0.023	0.715 (0.699–0.731)	0.641±0.025	0.630±0.028	0.611±0.030	0.658±0.024	162.83	<0.001
SV-ViT	0.731 (0.716–0.746)	0.650±0.028	0.712±0.025	0.698±0.027	0.610±0.032	0.725 (0.709–0.741)	0.650±0.029	0.600±0.031	0.582±0.035	0.685±0.028	317.38	<0.001
SV-SparseViT (21)	0.749 (0.733–0.765)	0.811±0.019	0.835±0.021	0.820±0.023	0.798±0.020	0.736 (0.720–0.752)	0.698±0.022	0.684±0.025	0.668±0.026	0.715±0.021	248.20	<0.001
GMIC (11)	0.771 (0.756–0.786)	0.887±0.016	0.885±0.018	0.873±0.020	0.899±0.017	0.759 (0.743–0.775)	0.682±0.020	0.679±0.022	0.665±0.024	0.692±0.019	369.03	<0.01
DV-Res18	0.737 (0.722–0.752)	0.750±0.018	0.792±0.020	0.778±0.022	0.729±0.019	0.728 (0.712–0.744)	0.672±0.021	0.638±0.024	0.621±0.027	0.698±0.020	263.59	<0.001
DV-ViT	0.767 (0.751–0.783)	0.868±0.015	0.791±0.018	0.781±0.020	0.908±0.016	0.741 (0.725–0.757)	0.620±0.023	0.589±0.025	0.565±0.028	0.652±0.024	582.93	<0.001
DV-SparseViT	0.781 (0.764–0.798)	0.903±0.012	0.895±0.014	0.882±0.016	0.915±0.013	0.754 (0.738–0.770)	0.680±0.016	0.648±0.019	0.631±0.021	0.705±0.017	428.36	<0.01
DV-GMIC	0.777 (0.760–0.794)	0.844±0.014	0.857±0.016	0.842±0.017	0.846±0.015	0.765 (0.749–0.781)	0.701±0.017	0.708±0.018	0.694±0.020	0.707±0.016	593.72	<0.05
X-Pruning (ours)	0.812 (0.794–0.828)	0.907±0.009	0.898±0.010	0.890±0.012	0.919±0.008	0.790 (0.774–0.806)	0.824±0.011	0.819±0.012	0.808±0.014	0.835±0.010	403.52	–

Backbone of the network after “–”. ACC, accuracy; AUC, area under the curve; CBIS-DDSM, Curated Breast Imaging Subset of the Digital Database for Screening Mammography; CI, confidence interval; DV, dual-view; GMIC, globally-aware multiple instance classifier; SD, standard deviation; Sens., sensitivity; Spec., specificity; SV, single-view; X, the use of a crossing architecture.

On the VinDr-Mammo dataset, the proposed X-Pruning model achieves the highest AUC (0.812), ACC (0.907), and F1 score (0.898) among the compared models (Table 1). DV-SparseViT and DV-GMIC also perform well, with AUCs of 0.781 and 0.777, respectively, but they fall short of X-Pruning in ACC and F1 score. On the CBIS-DDSM dataset, X-Pruning again outperforms the other models, with an AUC of 0.790, ACC of 0.824, and F1 score of 0.819. DV-GMIC follows with an AUC of 0.765, but it does not match the ACC and F1 score of X-Pruning. The AUC of each model on the two public datasets is illustrated in Figure 3.

Figure 3 Performance of different models on two public datasets. AUC, area under the curve; CBIS-DDSM, Curated Breast Imaging Subset of the Digital Database for Screening Mammography; DV, dual-view; GMIC, globally-aware multiple instance classifier; SV, single-view; X, the use of a crossing architecture.

In terms of model complexity, the X-Pruning model has the highest number of parameters at 69.79 million, suggesting that its superior performance may be attributed to its complexity and capacity to learn from the data. In contrast, the SV Res18 model, with the fewest parameters (1.48 million), shows the lowest performance across both datasets.

Our analysis indicates that while pruning reduces inference latency for individual instances, the model remains computationally more demanding than lightweight CNNs such as DV-Res18, so ACC and speed must be weighed against each other. It is crucial to distinguish between model size (parameter count) and inference efficiency (latency). Although X-Pruning has a larger parameter count (69.79 million) due to the dual-stream architecture and cross-attention mechanisms, its inference latency is 403.52 ms (Table 1), which is faster than that of the unpruned DV baseline, DV-ViT (582.93 ms). This acceleration arises because the proposed pruning mechanism bypasses the computationally expensive self-attention operations for redundant background windows. Consequently, X-Pruning effectively decouples model size from inference speed, achieving high-performance diagnosis with clinically feasible latency.
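The relationship between the pruning rate and self-attention cost can be illustrated with a rough operation count. The sketch below uses the window self-attention complexity given in the Swin Transformer paper, 4hwC² + 2M²hwC for hw tokens, window size M, and C channels. The 4×4 patch size, 96 channels, and 2688×896 input come from the text; the window size of 7 and the pruning rate of 0.5 are assumptions, and the function names are hypothetical.

```python
def count_windows(height, width, patch_size, window_size):
    """Number of non-overlapping windows after patch embedding."""
    tokens_h, tokens_w = height // patch_size, width // patch_size
    return (tokens_h // window_size) * (tokens_w // window_size)


def window_attention_cost(num_windows, window_size, channels):
    """Operation-count estimate for window self-attention:
    4*h*w*C^2 + 2*M^2*h*w*C, with h*w = num_windows * M^2 tokens."""
    tokens = num_windows * window_size ** 2
    return 4 * tokens * channels ** 2 + 2 * window_size ** 2 * tokens * channels


def pruned_attention_cost(num_windows, window_size, channels, pruning_rate):
    """Same estimate when attention runs only on the kept windows.

    Assumption: the number of kept windows is round(N * (1 - r)).
    """
    kept = max(1, round(num_windows * (1.0 - pruning_rate)))
    return window_attention_cost(kept, window_size, channels)


# Example: 2688x896 input, 4x4 patches, assumed 7x7 windows -> 3072 windows.
n = count_windows(2688, 896, patch_size=4, window_size=7)
full = window_attention_cost(n, window_size=7, channels=96)
half = pruned_attention_cost(n, window_size=7, channels=96, pruning_rate=0.5)
```

Under these assumptions the attention cost scales linearly with the fraction of windows kept, while the parameter count is unchanged. The estimate ignores the scoring step, the cross-attention block, and memory traffic, so measured latency need not fall by the same factor.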

Overall, the analysis indicates that the X-Pruning model provides the best balance of performance across both datasets, albeit with a higher computational cost due to its larger parameter size.

Ablation study

We designed ablation experiments to evaluate the effectiveness of the Pruning module and the Crossing module, which are presented in Tables 2,3, respectively.

Table 2

Validation of pruning module effectiveness

Model	AUC	ACC	Parameters (million)
SV Vit	0.733	0.653	26.81
SV SparseVit	0.751	0.814	26.03
DV Vit	0.769	0.870	53.61
DV SparseVit	0.783	0.906	52.05

ACC, accuracy; AUC, area under the curve; DV, dual-view; SV, single-view.

Table 3

Validation of crossing module effectiveness

Model	AUC	ACC	Parameters (million)
DV Res18	0.739	0.753	6.13
X-Res18	0.773	0.852	23.86
DV SparseVit	0.783	0.906	52.05
X-Pruning (ours)	0.812	0.909	69.79

ACC, accuracy; AUC, area under the curve; DV, dual-view; X, the use of a crossing architecture.

As shown in Table 2, introducing SparseVit into the SV model increases the AUC from 0.733 to 0.751 and the ACC from 0.653 to 0.814, while reducing the number of parameters from 26.81 to 26.03 million. This indicates that SparseVit enhances model performance while slightly reducing the parameter count. In the DV model, SparseVit increases the AUC from 0.769 to 0.783 and the ACC from 0.870 to 0.906, with a reduction in parameters from 53.61 to 52.05 million. This further confirms the effectiveness of SparseVit.

According to Table 3, with the introduction of the Crossing module, AUC increases from 0.739 to 0.773 and ACC from 0.753 to 0.852, while parameters increase from 6.13 to 23.86 million. Despite the increase in parameters, the significant improvement in performance demonstrates the effectiveness of the Crossing module in information fusion.

Discussion

In this paper, we introduce X-Pruning, a dual-stream information fusion network based on pruned Transformers and cross-attention mechanisms, aimed at enhancing diagnostic performance in mammography images. Our study not only achieves significant technical advancements but also offers new perspectives for early breast cancer screening. Below is a detailed discussion of our work:

Firstly, our approach addresses the challenge of detecting small lesions in high-resolution mammography images by introducing PTBs. Traditional Transformers require substantial computational resources for high-resolution images, whereas PTBs skip self-attention for low-importance windows and focus on suspicious regions, thereby improving computational efficiency. This not only reduces resource consumption but also enhances the detection of small lesions, which is crucial for early breast cancer screening. It should be emphasized that X-Pruning’s computational efficiency gains stem primarily from window-level pruning within the Transformer module. This pruning reduces self-attention-related floating point operations (FLOPs), as reflected in the inference times in Table 1, rather than simply decreasing the model’s total number of weight parameters. The parameter counts in Tables 2 and 3 cover the entire network (including unpruned, frozen, and newly added fusion module parameters), so X-Pruning’s parameter count is not the smallest in some comparisons. To avoid misleading interpretations, we use “parameter efficiency” in this paper to mean “significantly reducing Transformer layer self-attention computations while maintaining or surpassing performance”, and we specify the baselines and calculation methods when comparing parameter counts. While we effectively decouple model size from inference speed (latency), we must explicitly acknowledge the computational cost associated with the model’s memory footprint. The relatively large parameter size of 69.79 million, driven by the dual-stream architecture and cross-attention modules, imposes higher requirements on GPU memory during deployment. This memory constraint is a limitation of the current X-Pruning framework, particularly for integration into edge devices or standard clinical workstations with limited hardware capabilities.

Secondly, our cross-attention fusion module facilitates information exchange and integration between dual views. The CC and MLO views in mammography provide complementary information, yet traditional methods often fall short in effective information fusion. Our cross-attention mechanism allows for interactive feature exchange and integration between the two views, enhancing the model’s lesion recognition capability. This information fusion strategy not only improves diagnostic ACC but also offers new insights for multi-view medical image analysis.

Thirdly, our experimental results demonstrate that X-Pruning performs well across multiple datasets, achieving the leading performance on both VinDr-Mammo and CBIS-DDSM. This performance advantage supports its suitability for practical clinical applications.

Moreover, our method exhibits strong generalization capabilities when applied to different datasets, which is particularly important in medical image analysis. Different datasets may have varying imaging conditions and patient populations, yet our model maintains stable performance across these differences, demonstrating robustness and adaptability. This provides potential for broader clinical application. While the numerical improvements in AUC may appear mathematically modest, they hold substantial practical significance in clinical mammography. Given the large-scale nature of population-based breast cancer screening, even a 2–3% increase in diagnostic performance translates to a significant absolute number of correctly identified early-stage malignancies and a reduction in false-positive biopsies. This enhances both patient outcomes and healthcare resource allocation.

Finally, our research opens several avenues for future work. Firstly, further optimization of PTBs and the cross-attention module could enhance the model’s efficiency and ACC. Secondly, exploring the application of our method to other types of medical images, such as computed tomography (CT) or magnetic resonance imaging (MRI), could validate its generalizability. Additionally, integrating other advanced deep learning techniques, such as self-supervised learning or transfer learning, might further improve the model’s performance and application scope. In summary, X-Pruning offers an effective solution for early breast cancer screening and lays the groundwork for future research.

Conclusions

X-Pruning represents a significant advancement in mammography diagnosis by addressing two critical challenges: the detection of small malignant lesions in high-resolution images and the effective fusion of complementary information from CC and MLO views. By integrating PTBs and a novel cross-attention fusion module, our dual-stream network enhances computational focus on suspicious regions and facilitates comprehensive feature integration across views. This approach not only improves diagnostic ACC but also reduces computational complexity, as evidenced by its superior performance over conventional multi-view CNNs on both datasets. The promising results on VinDr-Mammo and CBIS-DDSM underscore the model’s efficiency and robustness, paving the way for its potential application in clinical settings. X-Pruning not only contributes to the field of mammography analysis but also offers a framework that could be adapted for other medical imaging tasks, highlighting its versatility and impact in advancing early cancer detection methodologies.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2486/rc

Funding: This study was supported by Preoperative Localization and Navigation of Non-Mass Breast Cancer with DCE-MRI and Ultrasound Fusion Imaging (No. 2023yjlcyj019).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2486/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

Cite this article as: Dai X, Pan X, Zeng S, Long Q, Liao Z, Zou S, Xing Y, Yu Z, Hu Y, Luo X. X-Pruning: a dual-stream information fusion mammography diagnosis network based on pruned transformer and cross-attention mechanism. Quant Imaging Med Surg 2026;16(7):516. doi: 10.21037/qims-2025-aw-2486
