Deep learning in multi-modal breast cancer data fusion: a literature review
Introduction
Breast cancer is the leading cause of cancer death among women worldwide (1). In recent years, driven by the success of deep learning, research in breast cancer has increasingly explored deep learning methods for prediction (2). Traditionally, studies have relied on extracting information from single-modality data to support predictions in breast cancer. However, this approach has certain limitations: single-modality data often offer limited, one-sided information that cannot capture the full complexity and diversity of breast cancer. Consequently, many researchers have shifted their focus to the multi-modal field, combining multiple modalities to improve prediction accuracy. Multi-modal data offer richer information, which enhances the robustness of predictive models.
In the realm of multi-modal deep learning, the task of assisting clinicians in diagnosing breast cancer is typically categorized into four areas: breast cancer diagnosis, assessment of neoadjuvant systemic therapy (NST), breast cancer prognosis prediction, and tumor segmentation. Breast cancer diagnosis aims to identify abnormal masses in breast tissue and conduct pathological assessments to determine tumor type, grade, and lymph node metastasis status (3). NST evaluation primarily assesses whether pathological complete response (pCR) has been achieved (4). Prognosis prediction focuses on assessing the risk of postoperative recurrence and metastasis, as well as predicting patient survival (5). Tumor segmentation aims to accurately delineate tumor areas in breast images, supporting treatment planning and monitoring (6).
The fusion of multi-modal data for breast cancer prediction typically involves four stages (Figure 1): data preprocessing, feature extraction, multi-modal feature fusion, and model evaluation. In the data-preprocessing stage, various techniques are employed, including feature selection, traditional image-processing methods, and region of interest (ROI) extraction. With technological advancements, the integration of machine learning, deep learning, and conventional image processing has also been widely adopted in the field of medical imaging (7-9). During the feature extraction stage, it is essential to select an appropriate deep learning network, as this enables the extraction of informative representations and facilitates effective multi-modal fusion.
Among the four stages, multi-modal feature fusion plays a critical role in determining model performance. In recent years, various deep learning-based multi-modal fusion methods have been proposed in the medical domain, including feature-level fusion and decision-level fusion strategies (10,11). With the continued development of deep learning techniques such as convolutional neural networks (CNNs) and the growing need to effectively integrate multi-modal information, a greater number of flexible strategies such as hybrid fusion have also been introduced (12,13). Based on a comprehensive review and comparative analysis of recent studies on multi-modal breast cancer data integration, this work categorizes the existing approaches into three representative fusion strategies: feature-level fusion, decision-level fusion, and hybrid fusion. Figure 2 presents the primary technologies associated with each fusion strategy.
To evaluate model performance appropriately according to task objectives, different evaluation metrics are employed for prognosis prediction and lesion segmentation tasks. For prognosis prediction tasks, commonly used metrics include accuracy (ACC), area under the receiver operating characteristic curve (AUC), F1-score, precision, sensitivity, specificity, and the Matthews correlation coefficient (MCC); metrics such as the F1-score and MCC are particularly suitable for assessing models on imbalanced datasets, where accuracy alone can be misleading. For segmentation tasks, commonly used metrics include the Dice similarity coefficient (Dice), Hausdorff distance (HD), and average boundary distance (ABD), which measure region overlap and boundary accuracy.
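As a concrete illustration (not drawn from any of the reviewed studies), the following sketch shows how the prediction metrics and the Dice coefficient can be computed with scikit-learn and NumPy; the label, probability, and mask arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground-truth labels and predicted malignancy probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "ACC": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_prob),
    "F1-score": f1_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Sensitivity": recall_score(y_true, y_pred),  # true-positive rate
    "Specificity": tn / (tn + fp),                # true-negative rate
    "MCC": matthews_corrcoef(y_true, y_pred),
}

def dice(mask_a: np.ndarray, mask_b: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return (2.0 * intersection + eps) / (mask_a.sum() + mask_b.sum() + eps)
```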
Given the increasing complexity, diversity, and rapid evolution of fusion strategies over the past 5 years, this review aims to systematically evaluate their respective strengths and limitations, identify existing challenges, and propose potential directions for future research in this field. Existing reviews of multi-modal techniques in medicine often omit a comprehensive introduction to multi-modal fusion methods (14); they also lack a detailed discussion of how multi-modal technologies assist clinicians in diagnosis, as well as a thorough exploration of the commonly used types of multi-modal data (15). In contrast, the present review provides a structured overview of fusion strategies specifically applied to breast cancer research and makes unique contributions by providing a comprehensive framework for multi-modal breast cancer research, highlighting key tasks, datasets, and processing methods that have been previously underexplored. Furthermore, the paper describes the related datasets and offers links to open-source data, serving as a handy resource for researchers who wish to work on multi-modal fusion methods and their applications.
The structure of this review is as follows: the Methods section briefly describes the search strategy for the literature, including databases, time range, and language considerations. In the Discussion, four key aspects are examined: multi-modal data, specific applications of multi-modal methods in practical diagnosis, data-preprocessing techniques, and multi-modal fusion strategies. This section further provides a detailed analysis of the challenges and dilemmas faced by traditional methods in breast cancer diagnosis, lesion segmentation, treatment effect assessment, and prognosis prediction while highlighting the improvements achieved by multi-modal fusion technology in these areas. It additionally introduces data-preprocessing techniques used in the multi-modal field of breast cancer, addressing issues such as the curse of dimensionality in biological molecular data, high noise in image data, and the overfitting caused by the scarcity of breast cancer data. Based on a review of 50 articles from 2019 to 2025, existing multi-modal fusion strategies are categorized into feature-level fusion, decision-level fusion, and hybrid fusion, with detailed descriptions of each strategy. Finally, the Conclusion identifies the common challenges in the field and outlines the future prospects of multi-modal approaches in breast cancer research. We present this article in accordance with the Narrative Review reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2903/rc).
Methods
Search strategy
The study selection process is shown in Figure 3. We systematically searched four databases (PubMed, Web of Science, Cochrane Library, and Google Scholar) for both published and unpublished studies indexed from January 2019 to April 2025. It is worth noting that although Google Scholar may include some non-indexed sources, its broad coverage served as a useful supplement that minimized the risk of missing potentially relevant studies. A comprehensive search strategy was developed via well-structured keyword combinations and Boolean operators to identify all eligible studies as thoroughly as possible. The search terms included: “breast cancer”, “breast lesion”, “artificial intelligence”, “deep learning”, “convolutional neural network”, “multi-modal learning”, “classification”, “segmentation”, “risk assessment”, “survival analysis”, “multi-modal”, “multi-omics”, “data fusion”, and “data integration”.
Study selection
Studies were initially screened for eligibility if they focused on breast cancer and employed deep learning methods to integrate multi-modal data. To ensure transparency and reproducibility, Table 1 provides detailed information on the databases searched, keyword strategies, and the timeframe. Study selection was conducted independently by two reviewers (T.L. and S.S.) according to a standardized protocol. First, duplicate records were removed with a reference management tool. The two reviewers then independently screened the titles and abstracts of the remaining records to identify potentially eligible studies. Full texts of the selected articles were then reviewed to determine whether they met the inclusion criteria. In cases of disagreement, a third reviewer (Q.W.) was consulted to resolve discrepancies and reach a consensus.
Table 1
| Items | Specification |
|---|---|
| Date of search | The search time range was from May 1, 2019, to April 16, 2025 |
| Databases and other sources searched | PubMed, Web of Science, Cochrane Library, and Google Scholar |
| Search terms used | “Breast cancer”, “breast lesion”, “artificial intelligence”, “deep learning”, “convolutional neural network”, “multi-modal learning”, “classification”, “segmentation”, “risk assessment”, “survival analysis”, “multi-modal”, “multi-omics”, “data fusion”, “data integration” |
| Timeframe | First search date: April 26, 2024–June 8, 2024 |
| | Second search date: August 16, 2024–August 28, 2024 |
| | Literature selection: September 21, 2024–April 16, 2025 |
| Inclusion and exclusion criteria | The included studies focused on deep learning–based multi-modal fusion methods for breast cancer. The multi-modal data used in these studies were required to have clear medical relevance and clinical applicability. Studies were expected to reflect recent advances in the effective integration of heterogeneous information and demonstrate practical value in real-world diagnostic tasks, such as breast cancer classification, segmentation, risk prediction, or prognosis assessment. Studies were excluded if they focused on diseases other than breast cancer, did not apply deep learning methods, were not original articles, or were published in a language other than English |
| Selection process | The literature selection was conducted independently by T.L. and S.S. Differences were resolved through discussion |
To ensure the representativeness of the included studies, we further screened the full texts from three aspects: the types of multi-modal data used, the fusion strategies adopted, and their specific applications in clinical diagnosis. First, the data modalities used in the studies were required to have clear medical significance and clinical applicability. We prioritized commonly used and highly complementary modality combinations in breast cancer diagnosis and subtyping, such as breast imaging, pathological images, and molecular-level data. In the past 5 years, with the increasing availability of multi-modal data and the improved flexibility of deep learning frameworks in fusion mechanisms, research in the field of breast cancer multi-modal learning has shifted from using simple modality fusion to more effectively integrating heterogeneous information, with the aim to enhance the utility and complementarity of multi-modal data in diagnostic tasks. Therefore, the selected studies were required to reflect recent technological developments in fusion strategies within this field. Finally, only studies with clear practical value in real diagnostic scenarios, such as breast cancer classification, segmentation, risk prediction, and prognosis prediction, were included. These three criteria ensured that the selected studies represent broader trends in multi-modal breast cancer research.
The inclusion criteria were as follows: (I) a focus on breast cancer diagnosis; (II) at least one deep learning method used for diagnostic purposes; (III) use of multi-modal data (from two or more data types); (IV) a study based on human clinical, imaging, or molecular data; and (V) a model addressing diagnostic tasks such as breast cancer classification, segmentation, risk assessment, or prognosis prediction.
The exclusion criteria were as follows: (I) a focus on diseases other than breast cancer; (II) no deep learning methods involved, with a sole reliance on traditional machine learning or statistical approaches; (III) not original research, including reviews, editorials, commentaries, letters, case reports, conference abstracts, preprints, virtual simulation studies, or animal experiments; and (IV) non-English language publications.
Discussion
Publicly available datasets for breast cancer research
This section provides an overview of several publicly available datasets that are applicable to the field of breast cancer research. Figures 4,5 present the widely used breast cancer datasets, which fall into two main categories: image data and non-image data. Non-image data are further subdivided into biological molecular data and clinical data. Common types of biological molecular data include messenger RNA (mRNA), deoxyribonucleic acid (DNA), copy number variation (CNV), and proteogenomic data. Image data used in breast cancer research can be broadly categorized into radiological and pathological modalities. Common radiological modalities include magnetic resonance imaging (MRI), X-ray, mammography, and ultrasound imaging. Pathological image modalities primarily consist of hematoxylin and eosin (H&E)-stained images and immunohistochemical (IHC) images. Figures 6-8 show the data usage distribution in multi-modal breast cancer research, including the proportion of papers using each of the four data types and their combinations, as well as the usage proportions of various medical imaging technologies.
Biological molecular data provide researchers with key information such as tumor gene expression, gene mutations, protein function, and other relevant biomarkers. Clinical data provide researchers with the patient’s medical history, clinical features, and pathological characteristics. Medical imaging data, acquired with different imaging techniques, provide researchers with key information regarding tumor location, morphology, and related characteristics. The advantages and disadvantages of the multi-modal imaging methods commonly used in the field of breast cancer are compared in Table 2.
Table 2
| Medical modality (abbreviation) | Medical modality (full term) | Advantages | Disadvantages |
|---|---|---|---|
| CESM | Contrast-enhanced spectral mammography | CESM enhances the visual contrast of vascular structures and potential lesion areas within the breast tissue through the injection of contrast agents, allowing small hidden lesions to be revealed (16,17) | Iodine contrast agents used in CESM may cause allergic reactions and kidney damage (18) |
| | | CESM exhibits high sensitivity in detecting microcalcifications within the breast, offering higher specificity and reducing the rate of false-positive diagnoses (16,17) | |
| | | Imaging time is short. Additionally, because its technical principles are similar to those of traditional X-ray mammography, CESM has lower costs (19) | |
| DBT | Digital breast tomosynthesis | DBT constructs a stereoscopic three-dimensional image of the breast by acquiring multiple low-dose X-ray projections at different angles (20) | DBT acquires a series of 2D breast images through multiple low-dose exposures to reconstruct the 3D image, resulting in greater radiation exposure for patients (21) |
| | | Compared to traditional 2D mammography, DBT offers clearer details of dense breast tissue and enhances sensitivity (22) | |
| | | For patients with small lesions and high breast density, the problem of missed lesions due to tissue stacking can be overcome, allowing the physician to accurately determine the morphology, location, and other characteristics of the lesion (21) | |
| FFDM | Full-field digital mammography | Compared to film-based conventional mammography, FFDM has higher diagnostic accuracy in non-menopausal women under 50 years of age with high breast density (23) | FFDM does not perform well in dense breasts, where overlapping structures may lower its sensitivity. When a mass is surrounded by denser abnormal tissue, the margins of the lesion become blurred and may be missed (24) |
| MRI | Magnetic resonance imaging | Of all current imaging techniques, MRI has the highest sensitivity in the early diagnosis of breast cancer. Especially for patients with high breast density, it is able to clearly show the soft tissue structure of the breast (25) | Compared to mammography, MRI requires more time. It also has higher equipment costs and requires the use of contrast agents to enhance the images. Furthermore, MRI has lower specificity compared to mammography (25) |
| US | Ultrasound | US achieves excellent results in detecting and differentiating benign and malignant masses (26) | Ultrasound scanners require an operator to accurately locate lesions for detection and differentiation (27) |
| | | Compared with MRI, DM, and DBT, US is safer, less expensive, and more versatile (28) | |
| | | US has higher diagnostic accuracy in women under 35 years of age with high breast density (29) | |
| SWE | Shear wave elastography | SWE is an ultrasound imaging technique used to assess the hardness or elasticity of tissues. It has been proven effective in diagnosing breast cancer, as cancerous tissues are typically harder than are normal tissues and thus exhibit higher elastic modulus values on SWE images (30) | SWE is operator-dependent (30) |
| CEUS | Contrast-enhanced ultrasound | CEUS is easier to obtain compared to MRI and DBT, and it does not involve radiation exposure (30) | Contrast agents used during CEUS imaging can be harmful to patients (30) |
| | | CEUS can comprehensively assess the microvascular distribution and blood perfusion status of breast tumors (31) | |
DM, digital mammography; 2D, two-dimensional; 3D, three-dimensional.
The Cancer Genome Atlas (TCGA) database was established collaboratively and is maintained by the National Cancer Institute’s Center for Cancer Genomics and the National Human Genome Research Institute in the United States (32). TCGA is a comprehensive cancer genomics database that encompasses 86 different cancer research projects, including those related to cervical cancer, gastric adenocarcinoma, breast cancer, and acute myeloid leukemia. The database includes approximately 44,700 cases and provides a wealth of data types, such as genome-sequencing data, protein data, and transcriptome data, for scientific research purposes. This database is accessible at https://portal.gdc.cancer.gov/.
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database was jointly established by Canada and the United Kingdom and continues to be maintained by these nations (33). This database is dedicated to subclassifying breast tumors into numerous subtypes based on their molecular features. The research encompasses data from more than 2,000 patients with breast cancer, offering an extensive array of molecular information, including gene expression, CNVs, genetic mutations, and clinical data. This database is accessible at http://www.cbioportal.org/.
The Cancer Imaging Archive (TCIA) database was established and is maintained by the Institute for Computational Biology and Informatics (34). Encompassing 20 common cancer types, TCIA offers an extensive collection of over 9,000 cases, providing researchers with a wealth of medical images and clinical information. This database is accessible at https://www.cancerimagingarchive.net/.
The Breast Cancer Histopathological Database (BreakHis) was developed jointly by two institutions: the Laboratory of Vision, Robotics and Imaging at the Federal University of Paraná and the Laboratory of Research and Development in Pathology and Cytopathology, located in the state of Paraná, Brazil (35). BreakHis comprises 9,109 microscopic images of breast tumor tissue, sourced from 82 patients and captured at four magnification levels: 40×, 100×, 200×, and 400×. This database is accessible at https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/.
The Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (36). This collection is a carefully selected subset of data from the DDSM, screened by experienced breast radiologists. The chosen images have been decompressed and converted into Digital Imaging and Communications in Medicine (DICOM) format for ease of use. Additionally, the CBIS-DDSM includes updated ROI segmentations, bounding box information, and corresponding pathological diagnosis training data, providing researchers with a particularly rich and accurate information resource. This database is accessible at https://www.cancerimagingarchive.net/collection/cbis-ddsm/.
Deep learning-assisted diagnosis of breast cancer
With the advancement of deep learning technology, an increasing number of researchers are applying it in the medical field. Deep learning has demonstrated significant potential in the diagnosis of breast cancer, effectively assisting clinicians in identifying disease characteristics by integrating multi-modal medical data, thereby enhancing diagnostic accuracy and efficiency. However, at present, deep learning technology can only serve as a tool to aid in diagnoses, and the final professional judgment must still be made by clinicians. In the field of multi-modal breast cancer prediction, research has mainly focused on breast cancer diagnosis, NST evaluation, breast cancer prognosis prediction, and tumor segmentation. In this section, we examine the research focus in the field of breast cancer prediction and the challenges it faces.
Breast cancer is diagnosed in several steps. First, the clinician determines whether there is an abnormal lump in the breast through clinical examination and imaging tests using breast ultrasound or mammography, with the latter being more common. When masses, calcified spots, or other areas of abnormal density are detected, cancer is suspected, and further confirmation by biopsy is warranted (3). However, this approach can be limited by the subjectivity of image interpretation and by clinician experience. To improve the confirmation of imaging findings, many researchers have begun combining breast cancer data from different modalities. For example, combining contrast-enhanced ultrasound (CEUS) and B-mode ultrasound (B-US) has been demonstrated by Guo et al. (37) to improve sensitivity. Combining two or more modalities of data can provide richer perfusion information and clearer details of the tissue structure, helping physicians to comprehensively evaluate the lesion.
The second step of diagnosis is biopsy, which involves the collection of tissue samples via surgery, endoscopy, or needle biopsy. The samples are stained with H&E and examined microscopically to identify abnormal cells. However, the slice-freezing process may damage the tissue morphology. To overcome this challenge, researchers resort to using multi-modal microscopy imaging techniques. The ability of fused multi-modal microscopy imaging to significantly reduce the effect of the section freezing process on tissue morphology has been demonstrated by Wu et al. (38). Their study further showed that, through the fusion of multi-modal data, richer information on tissue morphology and collagen content in tissues can be obtained, and that this method improves the prediction accuracy as compared to single-modality prediction.
In the deep learning domain, tasks that assist clinicians in diagnosing breast cancer are mainly divided into two categories: classification and tumor-node-metastasis (TNM) grading. Figure 9 shows how classification and grading tasks are divided in breast cancer diagnosis. Observing H&E-stained tissue sections under a microscope can help clinicians determine the type of tumor. Tumor classification is based on histological characteristics and immunohistochemical properties. Histological types are divided into carcinoma in situ and invasive carcinoma based on the morphological structure of cancer cells. In noninvasive breast cancer (39), abnormal cells do not penetrate the basement membrane; rather, they are confined to the ductal system or lobules of the breast and do not invade the surrounding normal tissue (40). In contrast, invasive breast cancer invades the surrounding normal tissue, and abnormal cells can metastasize to other parts of the body through the blood and lymphatic systems. IHC testing can be used to determine estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 (HER2) status, through which molecular subtypes are classified into luminal A, luminal B, HER2-overexpressing, and triple-negative types (41). Aiming to improve the accuracy of pathological assessments, researchers have examined integrating data from multiple modalities. The ability of integrated multi-modal data to improve the accuracy of tumor classification has been confirmed by numerous experiments (40,42,43). For example, numerous studies have combined pathological images and clinical information: pathological images contain an abundance of high-dimensional details that are challenging to interpret, while clinical information represents a condensed summary of physicians’ accumulated expertise and diagnostic experience. In this way, automatic classification algorithms for breast cancer can be applied in clinical practice. Breast cancer grading is also a key aspect of diagnosis and treatment. TNM staging, an internationally accepted staging standard (44), categorizes breast cancer into multiple stages based on the size and depth of tumor invasion, lymph node status, and the presence of distant metastasis. TNM staging is of considerable significance for the treatment and prognosis of breast cancer, allowing physicians to develop suitable treatment plans based on the patient’s stage.
After breast cancer is diagnosed, NST is usually performed before surgical treatment. NST is widely used in the treatment of early or locally advanced breast cancer. By shrinking the tumor before surgery, NST allows breast-conserving surgery to remove less glandular tissue and improves the surgical success rate. NST can also help prevent the formation of new metastatic lesions. The primary criterion for determining whether NST is successful is whether pCR is achieved, and pCR is one of the important indicators used to evaluate the prognosis of patients with breast cancer. Aiming to accurately predict pCR, researchers have attempted to integrate multi-modal data. The view that fusing multi-modal data can improve the accuracy of NST evaluation has been supported by numerous studies (3,45,46). In Joo et al.’s work (4), clinical data and MRI were combined, and it was found that the multi-modal model outperformed the model using only clinical information.
After the patient undergoes surgery, modern artificial intelligence systems can be used to predict the patient’s breast cancer prognosis. The purpose of prognosis prediction is to assess the patient’s risk of recurrence and metastasis after surgery, as well as to predict the patient’s survival rate. Compared to a single modality, multiple modalities can significantly improve the accuracy of prognosis prediction, as confirmed by numerous studies (40,42,47-54). For example, Tong et al. (48) combined gene expression, DNA methylation, microRNA (miRNA) expression, and CNV for survival prediction. They found that the concordance index of the combined multi-modal model was 0.025 higher than that of the model that used only DNA methylation and miRNA expression.
Breast cancer tumor segmentation is a significant challenge in multi-modal diagnosis. Traditionally, the segmentation of 3D breast cancer tumors has relied on manual delineation performed by clinical experts. However, manual segmentation is time-consuming, prone to a high error rate, and subjective. To address these challenges, researchers have begun to investigate methods for automated breast cancer segmentation. Yet, in doing so, they have encountered a critical challenge: breast cancer tumors occupy a relatively small proportion of three-dimensional medical images, and small tumors in particular often have blurred boundary information that is difficult to segment accurately. To achieve better segmentation, researchers have attempted to use multi-modal data. Combining data from multiple modalities can reduce the influence of subjectivity in human segmentation, a benefit confirmed by Yang et al. (55). In their study, CT images of different modalities were combined to improve the accuracy and efficiency of the proposed method. Compared with manual delineation, automatic identification and segmentation of lesions in breast cancer images can reduce the workload of clinicians.
Data processing
Data processing is a crucial step in the breast cancer prediction process, providing a solid foundation for subsequent network training. However, the biological molecular data, clinical data, and medical imaging data used in breast cancer prediction present several challenges. First, biological molecular data face the issue of the “curse of dimensionality”, in which the number of features greatly exceeds the number of samples. Second, medical imaging data are difficult to acquire and prone to high levels of noise. In addition, clinical data contain numerous non-numerical features that do not conform to the numerical input format required by the models, making them difficult to train on directly. To address these issues, preprocessing is necessary before the data are fed into the network. Currently, commonly used data processing techniques in the multi-modal field of breast cancer include feature selection, image processing, ROI extraction, and one-hot encoding. The following sections introduce these commonly used data preprocessing techniques in detail.
Feature selection
In the multi-modal study of breast cancer, both biological molecular data and medical image data play important roles. However, biological molecular data often entail issues of high dimensionality and small sample size, meaning that the number of features is significantly greater than is the number of samples. Similarly, image data contain a large amount of redundant information and noise, and directly learning features from all pixels can negatively impact the model’s performance. Aiming to address these issues, researchers typically perform feature selection before network learning. Feature selection helps to mitigate the curse of dimensionality problem in non-image data, a solution that has been widely validated (40,42,43,47,48). Before feature selection, missing values in the dataset are often imputed by machine learning algorithms, such as weighted nearest neighbor algorithms. Feature selection algorithms are generally divided into three categories: filter methods, wrapper methods, and embedded methods (56). In the filter method, features are evaluated independently with statistical criteria such as variance thresholding or the chi-squared test. The wrapper method evaluates feature subsets based on the performance of a predictive model, typically via techniques such as recursive feature elimination or genetic algorithms (57). Embedded methods incorporate the feature selection process into the model training itself, using approaches such as regularization and random forests. Besides these traditional methods, other commonly used techniques include principal component analysis and minimum redundancy maximum relevance (mRMR), among others.
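The three families of feature selection methods can be illustrated with a brief scikit-learn sketch; the synthetic matrix X and labels y below are hypothetical stand-ins for high-dimensional molecular data, and the chosen estimators and feature counts are arbitrary examples rather than settings taken from the reviewed studies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Synthetic "curse of dimensionality" setting: far more features than samples
X, y = make_classification(n_samples=100, n_features=2000, n_informative=30,
                           random_state=0)
X = np.abs(X)  # the chi-squared filter requires non-negative feature values

# Filter method: score each feature independently with the chi-squared statistic
X_filter = SelectKBest(chi2, k=100).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a predictive model
X_wrapper = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=100, step=200).fit_transform(X, y)

# Embedded method: keep the features ranked highest by random forest importance
X_embedded = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold=-np.inf, max_features=100).fit_transform(X, y)
```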
Image preprocessing
In the field of breast cancer research, the high cost of data acquisition and the time-consuming labeling process often lead to insufficient data. Aiming to address these issues, researchers commonly use data augmentation techniques to expand the size of the dataset. Data augmentation applies various transformations, such as rotation, flipping, and scaling, to increase the number of samples and enhance the model’s generalization ability (58-60). In addition to data augmentation, deep learning methods have also been adopted to improve prediction accuracy. For example, generative adversarial networks use their generators to synthesize data similar to real medical images. Moreover, semisupervised learning has been introduced to address the challenge of difficult data labeling: by training on a small amount of labeled data, it enables the labeling of unlabeled data.
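A typical augmentation pipeline of the kind described above can be sketched with torchvision transforms; the specific operations and parameters shown here are illustrative assumptions and would need to be tuned to the imaging modality and task.

```python
import torchvision.transforms as T

# Illustrative on-the-fly augmentation pipeline for 2D breast imaging patches
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # mild scaling/cropping
    T.ToTensor(),
])
# Applied to each training image as the DataLoader samples it, so every epoch
# sees slightly different versions of the same underlying cases.
```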
Normalization techniques, such as min-max normalization and z-score standardization, are commonly used to scale features to a comparable range, thereby facilitating subsequent feature learning. H&E-stained tissue images and ultrasound images often exhibit significant differences due to preparation processes and individual variations. Similarly, gene expression, copy number alteration (CNA), and clinical data are susceptible to the same issues. Applying normalization techniques can adjust the grayscale values or numerical values of these data to a uniform range, aiding in subsequent analysis (50,52,61).
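For reference, min-max normalization and z-score standardization can be written compactly as follows; the small epsilon terms are added only to avoid division by zero.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale values (e.g., image grayscale levels) to the [0, 1] range."""
    x = x.astype(np.float64)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def z_score_standardize(x: np.ndarray) -> np.ndarray:
    """Center each column (e.g., a gene or clinical variable) to zero mean, unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
```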
During the imaging process, factors such as operational errors and improper control of radiation dose can introduce substantial noise into the image. This noise not only diminishes the image’s clarity but may also obscure key information, thereby significantly impacting the accuracy of the diagnosis. To mitigate this noise, the image is typically denoised (59,62). Several denoising methods exist, including filters, horizontal line detection, and the most-frequent-value-filtering-based anisotropic diffusion algorithm. These methods can effectively reduce the noise in the image, facilitating subsequent model training.
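As a minimal example of the filter-based denoising mentioned above, the snippet below applies median and Gaussian filters from SciPy to a placeholder image; the filter sizes are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

noisy = np.random.rand(256, 256)                     # placeholder noisy grayscale image
median_denoised = median_filter(noisy, size=3)       # suppresses impulse (salt-and-pepper) noise
gaussian_denoised = gaussian_filter(noisy, sigma=1)  # smooths Gaussian-like noise
```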
ROI extraction
In breast cancer imaging, several methods for ROI extraction are employed to address issues such as noise and artifacts, improving diagnostic accuracy. For instance, ROI extraction helps identify lesion areas in ultrasound, computed tomography (CT), MRI, and X-ray mammography, enabling focused analysis of morphology, hemodynamic characteristics, and suspicious areas such as masses or calcifications (59,62,63). Traditionally, ROI extraction has been performed manually, but this subjective approach is being replaced by automatic methods using target detection networks. These networks involve two stages: detection, which locates potential ROI areas; and extraction, which gathers details including tumor shape and location. Classic networks such as the faster region-based CNN (Faster R-CNN), which uses region proposal networks and ROI pooling layers, have enhanced ROI extraction and thus improved detection accuracy (64). The effectiveness of deep learning networks for automatic ROI extraction, which circumvents human subjectivity, has been demonstrated by Zhou et al. (65), with experiments showing improved prediction accuracy and reduced errors with the Faster R-CNN.
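The detection-then-extraction workflow can be sketched with the pretrained Faster R-CNN available in torchvision (version 0.13 or later is assumed); in practice the detector would be fine-tuned on annotated breast lesions, and the confidence threshold here is arbitrary.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster R-CNN as a stand-in detector; its head would normally be
# fine-tuned on annotated breast lesions before deployment.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 512, 512)        # placeholder for a preprocessed image
with torch.no_grad():
    output = model([image])[0]         # dict with "boxes", "labels", "scores"

# Keep confident detections and crop the corresponding regions of interest
keep = output["scores"] > 0.5          # illustrative confidence threshold
rois = []
for box in output["boxes"][keep].round().int():
    x1, y1, x2, y2 = box.tolist()
    rois.append(image[:, y1:y2, x1:x2])
```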
One-hot encoding
In breast cancer research, clinical data are a commonly used data type and not only include basic information such as the patient’s age and gender, but also key details such as the tumor’s condition and grade. However, clinical data often consist of non-numerical features and cannot be directly processed by the model. Aiming to use these data for effective learning, researchers often employ one-hot encoding for preprocessing, converting non-numerical features into numerical representations that the model can understand (4,66).
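A minimal example of this conversion, using hypothetical clinical fields, is shown below with pandas.

```python
import pandas as pd

# Hypothetical clinical records containing non-numerical (categorical) features
clinical = pd.DataFrame({
    "menopausal_status": ["pre", "post", "post"],
    "tumor_grade": ["G1", "G3", "G2"],
})

# One-hot encoding expands each category into a binary indicator column
# (e.g., "tumor_grade_G3"), yielding a purely numerical matrix for the model.
encoded = pd.get_dummies(clinical).astype(float)
```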
Overview of multi-modal networks
Through an in-depth analysis of 50 papers in the field of multi-modal techniques in breast cancer published from 2019 to 2025, we summarized the proposed fusion strategies and categorized them into three types. Table 3 presents representative papers for each fusion strategy, and the models proposed in these papers are compared according to four commonly used evaluation indicators.
Table 3
| Article | Fusion method | Year | Multi-modal combination | Task | Multi-modal fusion technique | AUC | ACC | Specificity | Sensitivity |
|---|---|---|---|---|---|---|---|---|---|
| (40) | Feature-level fusion | 2021 | Clinical profile/CNA profile/gene expression profile | Prognosis prediction for patients | Sigmoid-gated attention mechanism | 0.95 | 0.91 | – | 0.79 |
| (37) | Feature-level fusion | 2023 | CEUS/B-US | Classification of malignant, benign, and normal masses | Difference rectification response | 0.943 | 0.88 | 0.90 | 0.90 |
| (55) | Feature-level fusion | 2024 | MonoE/IoNW/Z effective | Lesion segmentation | Residual module | – | – | 0.94 | 0.84 |
| (67) | Decision-level fusion | 2021 | DBT/FFDM | Classification of malignant, benign, and normal masses | Weighted fusion | 0.96 | – | – | – |
| (31) | Decision-level fusion | 2022 | B-US/CEUS | Classification of malignant, benign, and normal masses | Voting method | 0.85 | 0.87 | 0.92 | 0.77 |
| (68) | Decision-level fusion | 2022 | Clinical profile/mpMRI (DCE-MRI volume, ADC volume, Dixon volume) | Prognosis prediction for patients | Average method | 0.75 | – | 0.57 | 0.90 |
| (69) | Hybrid fusion | 2022 | MRI (ADC/DMI) | Classification of malignant, benign, and normal masses | Attention module, classification consistency module | 0.85 | 0.86 | 0.83 | 0.88 |
| (61) | Hybrid fusion | 2023 | Clinical profile/CNA profile/gene expression profile | Classification of malignant, benign, and normal masses | Sigmoid-gated attention mechanism, feature concatenation | 0.95 | 0.91 | – | 0.80 |
ACC, accuracy; ADC, apparent diffusion coefficient; AUC, area under the receiver operating characteristic curve; B-US, B-mode ultrasound; CEUS, contrast-enhanced ultrasound; CNA, copy number alteration; DBT, digital breast tomosynthesis; DCE-MRI, dynamic contrast-enhanced magnetic resonance imaging; DMI, diffusion-weighted imaging; FFDM, full-field digital mammography; IoNW, iodine-no-water maps; MonoE, monoenergetic images; mpMRI, multiparametric magnetic resonance imaging; Z effective, effective atomic number.
In the earlier stages, the fusion methods proposed by researchers were relatively simple. The data of each modality were first input into separate neural networks to obtain per-modality prediction results, and these results were then fused to obtain the overall prediction. This method is known as “decision-level fusion”. Decision-level fusion is straightforward to train, retains the unique features of each modality, and improves the robustness of predictions. However, it fails to use the complementary features across modalities, which limits its predictive performance.
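A minimal sketch of decision-level fusion is given below; the two per-modality classifiers and the fusion weights are hypothetical placeholders, and weighted averaging is only one of several possible combination rules (voting is another).

```python
import torch
import torch.nn as nn

# Hypothetical per-modality classifiers, trained independently
image_model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 2))
clinical_model = nn.Sequential(nn.Linear(20, 2))

def decision_level_fusion(image_batch, clinical_batch, weights=(0.6, 0.4)):
    """Fuse per-modality predictions by a weighted average of class probabilities."""
    p_img = torch.softmax(image_model(image_batch), dim=1)
    p_cli = torch.softmax(clinical_model(clinical_batch), dim=1)
    fused = weights[0] * p_img + weights[1] * p_cli
    return fused.argmax(dim=1)  # final class decision per sample

preds = decision_level_fusion(torch.rand(8, 1, 224, 224), torch.rand(8, 20))
```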
A new fusion method has been proposed to address these issues. In this approach, data from different modalities first undergo feature extraction through independent networks. These features are then fused one or more times at the final stage of the feature extraction phase, prior to the decision stage. This method is known as “feature-level fusion”. Compared with decision-level fusion, feature-level fusion enables the model to learn from multi-modal data more comprehensively before making a decision, as confirmed by researchers such as Arya et al. (40,52,65). Although feature-level fusion allows richer interaction between modalities than fusion after the decision stage, it still falls short in capturing the shared and complementary information across modalities.
To further enhance the accuracy of breast cancer prediction, researchers have proposed a hybrid fusion strategy. This approach offers greater flexibility by allowing multiple appropriate methods to be selected for fusing features from different modalities, both during the feature extraction stage and the decision stage. Hybrid fusion can involve multiple fusions within the feature extraction stage alone or a combination of fusions across both stages, and the fusion strategy can be adjusted flexibly according to the data characteristics.
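One possible hybrid design is sketched below, assuming hypothetical feature dimensions: features are concatenated during extraction (feature-level fusion), and the resulting joint prediction is then averaged with a unimodal prediction (decision-level fusion).

```python
import torch
import torch.nn as nn

class HybridFusionNet(nn.Module):
    """Illustrative hybrid fusion: one feature-level concatenation plus a
    decision-level averaging of the joint and unimodal branches."""
    def __init__(self, img_dim=512, omics_dim=200, n_classes=2):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, 128), nn.ReLU())
        self.omics_encoder = nn.Sequential(nn.Linear(omics_dim, 128), nn.ReLU())
        self.joint_head = nn.Linear(256, n_classes)   # feature-level fusion branch
        self.omics_head = nn.Linear(128, n_classes)   # unimodal decision branch

    def forward(self, img_feat, omics_feat):
        h_img = self.img_encoder(img_feat)
        h_omics = self.omics_encoder(omics_feat)
        joint_logits = self.joint_head(torch.cat([h_img, h_omics], dim=1))
        omics_logits = self.omics_head(h_omics)
        # Decision-level step: average the probabilities of the two branches
        return 0.5 * (torch.softmax(joint_logits, dim=1)
                      + torch.softmax(omics_logits, dim=1))

probs = HybridFusionNet()(torch.rand(4, 512), torch.rand(4, 200))
```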
Given the scarcity and privatization of datasets in breast cancer multi-modal research, representative studies using the publicly available METABRIC and TCGA-BRCA datasets were selected to enable fair comparison under consistent conditions. As shown in Table 4, variations in fusion strategies and backbone networks result in noticeable differences in model performance. Compared with traditional feature concatenation, methods such as graph-based affinity fusion and attention-based feature weighting are more effective in aligning heterogeneous modalities. In addition, feature-level fusion and hybrid fusion demonstrate a stronger ability to capture complex cross-modal relationships than does decision-level fusion. These results indicate that selecting an appropriate fusion strategy based on data characteristics and task requirements is crucial for optimizing model performance. Figure 10 illustrates the framework of the three fusion models. The following sections describe these three fusion strategies in detail.
Table 4
| Article | Year | Fusion method | Multi-modal fusion technique | Dataset | AUC | ACC | F1-score | Sn | MCC | Pre |
|---|---|---|---|---|---|---|---|---|---|---|
| (40) | 2021 | Feature-level fusion | Sigmoid-gated attention mechanism | METABRIC | 0.95 | 0.91 | – | 0.79 | 0.76 | 0.84 |
| | | | | TCGA-BRCA | 0.93 | 0.93 | – | 0.87 | 0.81 | 0.84 |
| (52) | 2023 | Feature-level fusion | Adversarial representation alignment, feature concatenation | METABRIC | – | 0.93 | – | 0.82 | 0.83 | 0.90 |
| | | | | TCGA-BRCA | – | 0.91 | – | 0.84 | 0.76 | 0.88 |
| (42) | 2019 | Decision-level fusion | Score fusion | METABRIC | 0.84 | 0.79 | – | 0.2 | 0.35 | 0.87 |
| | | | | TCGA-BRCA | 0.94 | – | – | – | – | – |
| (53) | 2023 | Decision-level fusion | Choquet fuzzy | METABRIC | 0.83 | 0.63 | 0.59 | 0.75 | – | – |
| | | | | TCGA-BRCA | – | – | – | – | – | – |
| (48) | 2020 | Hybrid fusion | Feature concatenation, cross-modal latent reconstruction, latent averaging | METABRIC | 0.82 | 0.79 | – | 0.70 | – | – |
| | | | | TCGA-BRCA | 0.95 | 0.91 | – | 0.80 | – | 0.84 |
| (49) | 2021 | Hybrid fusion | Graph-based affinity fusion, attention-based feature weighting, feature concatenation | METABRIC | 0.94 | 0.89 | 0.92 | – | – | 0.90 |
| | | | | TCGA-BRCA | 0.93 | 0.92 | 0.95 | – | – | 0.92 |
ACC, accuracy; AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; METABRIC, Molecular Taxonomy of Breast Cancer International Consortium; Pre, precision; Sn, sensitivity; TCGA-BRCA, The Cancer Genome Atlas-Breast Invasive Carcinoma.
Feature-level fusion network
In the feature-level fusion strategy, breast cancer data from different modalities are first processed independently for feature extraction. The extracted features are then fused during the feature-extraction stage, prior to the decision stage, and finally used for prediction based on task requirements.
Earlier studies often conducted fusion only once, typically at the end of feature extraction, which limited cross-modal interaction. To address this, recently developed approaches have employed multilayer fusion strategies that allow iterative feature integration during extraction, enhancing the model’s robustness to modality differences and data imbalance.
Based on the number of fusions performed during feature extraction, feature-level fusion strategies are categorized into single-layer and multilayer approaches, which are outlined below.
Single-layer feature-level fusion
In single-layer feature-level fusion, breast cancer data from different modalities are first processed independently for feature extraction. These features are then fused at the final stage of extraction, before being passed to the decision layer for prediction. Figure 11 illustrates the typical architecture of this fusion strategy.
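The sketch below illustrates this architecture under assumed input dimensions: a ResNet-18 backbone encodes the image, a small fully connected encoder handles clinical features, and a single concatenation precedes the decision layer. It is a generic template rather than any specific reviewed model.

```python
import torch
import torch.nn as nn
import torchvision

class SingleLayerFeatureFusion(nn.Module):
    """Generic single-layer feature-level fusion: modality-specific encoders,
    one concatenation at the end of feature extraction, then a decision layer."""
    def __init__(self, clinical_dim=30, n_classes=2):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # expose 512-dimensional image features
        self.image_encoder = backbone
        self.clinical_encoder = nn.Sequential(nn.Linear(clinical_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(512 + 64, n_classes)

    def forward(self, image, clinical):
        f_img = self.image_encoder(image)         # (batch, 512)
        f_cli = self.clinical_encoder(clinical)   # (batch, 64)
        fused = torch.cat([f_img, f_cli], dim=1)  # the single fusion step
        return self.classifier(fused)

logits = SingleLayerFeatureFusion()(torch.rand(2, 3, 224, 224), torch.rand(2, 30))
```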
Early studies often combined image and non-image data to provide more comprehensive clinical information. Image data were typically processed with deep learning methods, while non-image data were handled using traditional machine learning techniques. This produced early single-layer fusion models in which the feature extraction pipelines for different modalities were highly heterogeneous. The section “Early single-layer fusion” discusses how these models processed and fused multi-modal information.
As the field progressed, researchers began applying deep learning techniques to non-image data as well, enabling more unified representation learning across modalities. Recent approaches have also explored the deeper integration of multiple types of non-image data or multiple imaging modalities. These efforts aim to capture richer information and improve predictive performance.
Depending on whether the same network architecture is used for all modalities, single-layer fusion models can be further classified into homogeneous and heterogeneous networks. The section “Late single-layer fusion” examines this classification in detail, focusing on the fusion of image and non-image data through the use of deep learning-based models.
Early single-layer feature-level fusion
Early single-layer fusion strategies typically process non-image data via traditional machine learning techniques, while image data are processed with deep learning models. The extracted features are then fused before being passed to the decision layer. This approach, as illustrated in Figure 12, represents a classical early-stage design in multi-modal fusion.
In this context, non-image data, particularly gene expression data, often exhibit high dimensionality and small sample size, whereas image features are usually lower in dimension but semantically rich. To address the heterogeneity between these modalities, dimensionality reduction or feature selection techniques are typically applied to the non-image data, while convolutional networks are optimized to extract visual features. As demonstrated by Yang et al. (50), this combined approach improves model robustness and mitigates the influence of irrelevant features in high-dimensional non-image data.
The ability of random forests to effectively select features has been confirmed by Yang et al. (50) and Yao et al. (51). Random forests use different contribution indicators to evaluate the importance of features. In their study, Yang et al. (50) used a residual network (ResNet) to extract features from pathological images. Additionally, the mean decrease Gini in random forests was used to select features from clinical data. Finally, the proposed multi-modal compact bilinear modular fusion feature was used to assess the risk of recurrence and metastasis in patients with HER2-positive breast cancer. Yao et al. (51) used the contrastive learning at multiple scales library to segment H&E-stained images to solve the noise problem caused by manually labeled images. Image features were learned by the ResNet with an embedded attention module. Meanwhile, clinical information features were filtered with the chi-squared test and out-of-model assessment scores. Sequencing data were evaluated with the Gini index to quantify their contribution to the random forest decision tree model. Finally, the features extracted from each modality were concatenated and passed into the fully connected layer for breast cancer prognosis prediction.
In addition to random forests, researchers have explored various other feature selection techniques. The effectiveness of least absolute shrinkage and selection operator (LASSO) Cox regression for feature selection has been confirmed by Miao et al. (70), who combined CT images and clinical data for multi-modal learning and thus achieved the successful prediction of bone metastasis in breast cancer using subcutaneous fat. CT images were learned with a ResNet network embedded with an attention mechanism. LASSO Cox regression was used to select features for the clinical data. Finally, the features from both modalities were concatenated and fed into a gradient boosting regression tree for classification. Wan et al. (71) combined multi-modal ultrasound radiomics and LASSO-processed pathological data to predict pCR. Similarly, Ye et al. (72) applied LASSO to select features from proteomic, transcriptomic, and clinical data and then integrated them with radiomic features for the same prediction. Nakach et al. (73) used mutual information gain to conduct feature screening on clinical data, gene expression data, and CNV data and then fused these features with those extracted from histopathological images using ResNet50, thereby achieving breast cancer subtype prediction.
Late single-layer feature-level fusion
Since the early single-layer feature-level fusion strategy does not use neural networks to learn from non-image data, it has certain limitations. Aiming to address these issues, many researchers have proposed new strategies in which machine learning is first used to select features, and the selected features are then passed to a neural network for deeper feature learning and optimization. The powerful ability of neural networks to learn hidden information in non-image data has been confirmed by Nikhilan and Arya et al. (43). This approach further improves the accuracy and reliability of breast cancer prediction. Figure 13 shows the processing of image data and non-image data in late single-layer feature-level fusion. Accurate breast cancer prediction is achieved through the use of image data and non-image data from different modalities. Atrey et al. (59) confirmed that prediction accuracy can be improved by learning richer visual features. Depending on the characteristics of the data, researchers can select networks with similar or different structures to learn the data of different modalities. This review classifies these approaches based on whether the neural networks processing each modality’s data share a similar structure: the late single-layer feature-level fusion strategy is divided into the homogeneous network fusion strategy and the heterogeneous network fusion strategy.
The homogeneous network fusion strategy first involves using neural networks with the same or similar structures to train the data of each modality separately. Finally, the extracted features are fused using various methods, such as attention mechanisms, feature concatenation, and graph convolutional networks. The fused features are then passed into the deep learning model or machine learning algorithm for prediction. In the homogeneous network strategy, all modalities share the same network parameters and structure, which simplifies the training process. However, this strategy entails several challenges, the most problematic being the differences in the distribution characteristics of each modality. Due to the varying feature representations and distribution characteristics between different modalities, homogeneous networks often struggle to adapt to the characteristics of all modalities during the learning process. These issues can lead to the loss of key information during feature extraction. To address these issues, homogeneous network designs, such as the use of multiscale modules, attention mechanisms, and other modifications, have been devised, which can optimize the performance of CNNs and enhance the adaptability of homogeneous networks to different modalities. Studies in this area have provided novel insights into the design of homogeneous networks (59,65,74). The following section provides an in-depth discussion of the relevant research findings and discusses their applications in homogeneous design networks.
Arya et al. (43) employed gene expression, CNA, and clinical data in a multi-modal learning approach to predict breast cancer prognosis. Features extracted via CNN were concatenated and then fed into various classifiers. Random forest, support vector machine (SVM), Naive Bayes, and logistic regression were used to evaluate the performance of different feature fusion strategies. Atrey et al. (59) proposed a hybrid model, CNN-long short-term memory, combining mammogram and ultrasound imaging to classify breast cancer lesions. However, the CNN structure used in these methods is relatively simple and lacks depth, which may not fully capture the deep and complex features in the data.
In recent years, a series of efficient and powerful network architectures have been proposed. Not only have classic models such as Visual Geometry Group Network (VGGNet) and Residual Network (ResNet) been widely adopted, but several innovative designs based on these networks have also been developed, such as Densely Connected Convolutional Network (DenseNet) and Squeeze-and-Excitation Network (SENet). The ability of these network designs to extract deeper information was demonstrated by Song et al. (75) and Atrey et al. (62), which has opened new avenues in the multi-modal management of breast cancer. Song et al. (75) used the ResNet network to extract features from two different modalities, low-energy (LE) images and dual-energy subtraction (DES) images in contrast-enhanced spectral mammography. The features were then concatenated and passed to the fully connected layer to classify benign and malignant breast cancer. Atrey et al. (62) employed ResNet to learn representations from mammography and ultrasound images independently. The fused fusion features were then passed to the SVM to complete the classification of benign and malignant breast cancer tumors. Verma et al. (45) proposed a multi-modal spatiotemporal deep learning framework. A network stacked with multi-field-of-view modules and convolutional layers was designed. This framework addresses the issue encountered in previous studies, which only considered the pre-NST MRI data and ignored structural and functional changes of the tumor on dynamic contrast-enhanced MRI during NST. Compared to the simple CNN model, this deeper network architecture can better capture the details of each modality and high-level features, significantly improving feature learning. Furtney et al. (76) used a relational graph convolutional network to integrate patient information that had been transformed into a graph model, along with image features extracted by EfficientNet, to achieve molecular subtype prediction.
For the methods mentioned above, clinicians must often first manually identify and segment the lesions in the images, which are then passed to the CNN for learning. However, manual identification and segmentation of lesions are inefficient, subjective, and variable. To solve this problem, researchers have turned their attention to deep learning. Misra et al. (74) confirmed that using CNNs for automatic lesion segmentation can improve efficiency and reduce subjectivity. They used a weighted multi-modal U-Net to achieve automatic segmentation of lesions. On this basis, VGG16 was used to classify tumors in the cropped brightness-mode and second-harmonic echo-mode ultrasound images.
The accuracy of breast cancer prediction depends mainly on two factors: subtle pathological changes in the image and key markers in the non-image data, such as gene expression. However, the CNN structures mentioned above often adopt an indiscriminate processing method that may overlook important information in the data. To overcome this, a number of researchers have applied an attention mechanism for multi-modal breast cancer prediction. The ability of the model augmented with an attention mechanism to focus on key information was confirmed by Arya et al. (40) and Zhou et al. (65). This approach helps to ignore secondary information such as background noise and non-tumor areas in case images, thereby improving the network’s performance. Arya et al. (40) proposed a sigmoid-gated attention CNN to learn and fuse genomic details, histopathology images, and clinical details. In this network, the stacked features are passed into a random forest classifier to obtain the breast cancer survival prediction results. Zhou et al. (65) used a network architecture composed of DenseNet, ResNet50, and SENet50 to learn ultrasound images of different modalities to determine the molecular subtypes of breast cancer.
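As a rough approximation of such gated attention (not the authors’ exact implementation), the module below learns a sigmoid gate in [0, 1] that re-weights each modality’s features before they are concatenated; all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SigmoidGatedAttention(nn.Module):
    """Gated attention block: a learned sigmoid gate re-weights each feature."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.transform = nn.Linear(in_dim, hidden_dim)
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())

    def forward(self, x):
        return torch.tanh(self.transform(x)) * self.gate(x)  # element-wise gating

# Gated per-modality features are concatenated before the final classifier
gene_att, img_att, cli_att = (SigmoidGatedAttention(400),
                              SigmoidGatedAttention(512),
                              SigmoidGatedAttention(25))
fused = torch.cat([gene_att(torch.rand(8, 400)),
                   img_att(torch.rand(8, 512)),
                   cli_att(torch.rand(8, 25))], dim=1)
```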
In the multi-modal examination of breast cancer, CNNs are not the only option. In recent years, generative adversarial networks (GANs) have been proposed, bringing new possibilities to the multi-modal field (77). The ability of the GAN network to effectively reduce the difference between modalities through adversarial learning was demonstrated by Du et al. (52). A GAN network primarily consists of two parts: the generator, which attempts to deceive the discriminator by generating as-realistic-as-possible data, and the discriminator, which aims to distinguish real data from the data generated by the generator as accurately as possible. These two networks compete with each other during training, and through continuous learning and adjustment, the gap between them is reduced. In the multi-modal field, the characteristics of GAN networks can be used to reduce the discrepancies between different modalities and better align their distributions. Du et al. (52) proposed a multi-modal adversarial network framework in which the adversarial network converts source modality data into the distribution representation of the target modality data, thereby reducing the difference between modalities and improving the accuracy of breast cancer prognosis prediction. Palmal et al. (78) employed a graph CNN to extract features from whole-slide tissue images and miRNA-sequencing expression data. In this approach, multi-modal features are subsequently integrated through feature concatenation, which is followed by the application of a calibrated random forest for classification.
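The adversarial alignment idea can be sketched as follows, under assumed feature dimensions: a generator maps source-modality features toward the target modality’s feature space while a discriminator tries to tell the two apart. This is a generic GAN training step, not the published framework.

```python
import torch
import torch.nn as nn

# Generator maps source-modality features (e.g., omics, dim 200) toward the
# target modality's feature space (dim 128); the discriminator judges realness.
generator = nn.Sequential(nn.Linear(200, 128), nn.ReLU(), nn.Linear(128, 128))
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

source = torch.rand(32, 200)   # placeholder source-modality features
target = torch.rand(32, 128)   # placeholder target-modality features

# Discriminator step: real target features vs. generated ("fake") features
d_opt.zero_grad()
fake = generator(source).detach()
d_loss = (bce(discriminator(target), torch.ones(32, 1))
          + bce(discriminator(fake), torch.zeros(32, 1)))
d_loss.backward()
d_opt.step()

# Generator step: fool the discriminator so the modality distributions align
g_opt.zero_grad()
g_loss = bce(discriminator(generator(source)), torch.ones(32, 1))
g_loss.backward()
g_opt.step()
```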
Compared to the homogeneous network fusion strategy, the heterogeneous network can fully account for the unique properties of different modal data, allowing for the flexible selection of the CNN best suited to each modality for efficient feature learning. The key to this strategy lies in the selection of the appropriate network. First, an adapted network is chosen to learn the features of different modalities. A suitable fusion method is then used to integrate these features. Finally, the decision layer makes the predictions. When processing multi-modal data, the heterogeneous network fusion strategy must address two core challenges: first, how to accurately select a CNN that is adapted to each modality; and second, how to efficiently optimize and adjust network parameters during training, given the differing network structures required for processing different modalities. A customized network structure, designed according to the characteristics of each modality, has been proposed to resolve these issues and demonstrated to be effective by Joo et al. (4) and Ding et al. (60), among others.
Joo et al. (4) proposed a three-dimensional ResNet-50 that extracts features from spin-lattice relaxation time and spin–spin relaxation time data and in which the fully connected layer is used to learn clinical data. Subsequently, the concatenated and fused features are passed to the fully connected layer for predicting the pCR of patients treated with neoadjuvant chemotherapy. Ding et al. (60) used multiscale pathological images to achieve cross-scale feature learning. Specifically, they used a TabNet encoder to generate a representation of clinicopathological parameters. With the features fused via a cross-modal attention mechanism, this approach greatly improved the accuracy of predicting lymph node metastasis.
In using the heterogeneous network fusion strategy, a number of researchers have employed deep neural networks (DNNs) and CNNs for prediction. It is possible to fully leverage the advantages of both DNNs and CNNs when they are combined, as confirmed by Liu et al. (79) and Jadoon et al. (80). DNNs excel at processing nonlinear data, making them particularly suitable for managing non-image data in the field of breast cancer. CNNs, on the other hand, can extract local features in images through convolution operations and thus are suited to processing breast cancer image data. Liu et al. (79) used gene expression and pathological image data to predict molecular subtypes. In their approach, after principal component analysis dimension reduction, gene expression data are passed into the DNN for learning. Meanwhile, a CNN built on VGG16 is used to learn pathological image features. Finally, the linearly weighted fusion features are passed into the fully connected layer to achieve breast cancer subtype classification. Similarly, Jadoon et al. (80) used CNNs to extract information from clinical data and gene expression data, with a DNN used to extract features from CNV data; the stacked features were then passed into a random forest to classify short-term and long-term survival.
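A hedged sketch of this DNN-plus-CNN pairing is shown below: a small DNN handles dimension-reduced gene-expression vectors, a VGG16 backbone handles pathology image patches, and the two branch outputs are combined by a learnable linear weighting before a fully connected classifier. The architecture details and sizes are illustrative assumptions, not the cited models.

```python
import torch
import torch.nn as nn
from torchvision import models


class HeterogeneousFusionNet(nn.Module):
    def __init__(self, gene_dim: int = 100, num_classes: int = 4):
        super().__init__()
        # DNN branch for non-image (e.g., PCA-reduced gene expression) data.
        self.dnn = nn.Sequential(nn.Linear(gene_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU())
        # CNN branch for image data, built on VGG16 convolutional features.
        vgg = models.vgg16(weights=None)
        self.cnn = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(512, 128), nn.ReLU())
        # Learnable scalar weight for linear weighted fusion of the two branches.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, genes: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        f_gene = self.dnn(genes)
        f_img = self.cnn(image)
        fused = self.alpha * f_gene + (1.0 - self.alpha) * f_img  # linear weighted fusion
        return self.classifier(fused)


logits = HeterogeneousFusionNet()(torch.randn(2, 100), torch.randn(2, 3, 224, 224))
```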
Before data fusion, the dimensionality of high-dimensional data is often reduced so that the modalities are more comparable; however, this tends to discard information contained in the high-dimensional data. Aiming to address this issue, Yan et al. (58) took the opposite route and used a denoising autoencoder to increase the dimensionality of low-dimensional electronic medical record (EMR) data. They used a phased training strategy to effectively integrate the bimodal information of pathological images and EMR data, achieving the classification of breast cancer tumors.
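The idea of lifting low-dimensional EMR data with a denoising autoencoder can be sketched as follows; the architecture and noise level are assumptions for illustration rather than Yan et al.'s exact network.

```python
import torch
import torch.nn as nn


class DenoisingAutoencoder(nn.Module):
    def __init__(self, in_dim: int = 30, hidden_dim: int = 256, noise_std: float = 0.1):
        super().__init__()
        self.noise_std = noise_std
        # Encoder maps the low-dimensional EMR vector to a richer hidden code.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, hidden_dim), nn.ReLU())
        # Decoder reconstructs the clean input from the hidden code.
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x: torch.Tensor):
        noisy = x + self.noise_std * torch.randn_like(x) if self.training else x
        code = self.encoder(noisy)      # higher-dimensional EMR representation
        recon = self.decoder(code)      # reconstruction of the clean input
        return code, recon


dae = DenoisingAutoencoder()
emr = torch.randn(16, 30)                  # low-dimensional EMR features (illustrative)
code, recon = dae(emr)
loss = nn.functional.mse_loss(recon, emr)  # denoising reconstruction objective
# `code` (16, 256) can then be fused with pathological-image features.
```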
Multilayer feature-level fusion
In multilayer feature-level fusion, multi-modal data are first fused at a lower level. The resulting feature representations are then progressively refined through multiple stages of fusion until final feature extraction is completed. As illustrated in Figure 14, each fusion stage builds upon the output of the previous one, enabling a gradual and more comprehensive integration of complementary information across modalities. Compared to single-layer fusion, this approach captures intermodal relationships more effectively and leads to improved prediction performance.
In ResNet, different convolutional layers can capture features of different scales. By fusing features at different levels within ResNet, one can combine information from a low level to a high level and from local to global scales. Liu et al. (81) used MRI and clinical data to predict the pCR. They used no-new U-Net (nnU-Net) to segment dynamic contrast-enhanced MRI, and histogram matching was applied to reduce intensity differences, effectively solving the problem of information interference. In the prediction stage, clinical data and MRI features extracted by ResNet were fused through five multiplication operations. Guo et al. (37) used Gaussian sampling to acquire CEUS and B-US modality data from breast cancer videos for diagnosis. The network architecture included a feature extraction branch and a fusion branch, with the latter implemented through four iterations. The difference rectification response technique was used to integrate the results from the previous stage of fusion with those processed by the CNN layers in the CEUS and B-US branches.
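A hedged sketch of multilayer, multiplication-based fusion is given below: clinical features are projected to each ResNet stage's channel width and multiplied into the intermediate feature maps, so fusion happens several times along the backbone rather than once at the end. The gating design, backbone, and dimensions are illustrative assumptions, not the cited implementations.

```python
import torch
import torch.nn as nn
from torchvision import models


class MultiStageFusionResNet(nn.Module):
    def __init__(self, clinical_dim: int = 10, num_classes: int = 2):
        super().__init__()
        r = models.resnet18(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        stage_channels = [64, 128, 256, 512]
        # One projection of the clinical vector per stage (channel-wise gate).
        self.clin_proj = nn.ModuleList([nn.Linear(clinical_dim, c) for c in stage_channels])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, image: torch.Tensor, clinical: torch.Tensor) -> torch.Tensor:
        x = self.stem(image)
        for stage, proj in zip(self.stages, self.clin_proj):
            x = stage(x)
            gate = torch.sigmoid(proj(clinical)).unsqueeze(-1).unsqueeze(-1)
            x = x * gate  # multiplicative fusion with clinical information at this stage
        return self.fc(self.pool(x).flatten(1))


logits = MultiStageFusionResNet()(torch.randn(2, 3, 224, 224), torch.randn(2, 10))
```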
U-Net extracts features from shallow to deep layers through multiple convolutions and downsampling steps. It uses a skip-connection mechanism to concatenate corresponding features in the encoder and decoder, combining information at different depths to achieve good results in segmentation tasks. By exploiting these multiscale feature-fusion and skip-connection characteristics, U-Net can also fuse features from different modalities during downsampling. Yang et al. (55) performed the fusion of Z-effective, iodine-no-water (IoNW), and low- (40 keV) and high- (100 keV) tube-voltage images generated by the Philips IQon Spectral CT scanner to achieve automatic segmentation of breast cancer lesions. Data from three different modalities were separately processed by three independent U-Net encoders, with feature fusion occurring after each downsampling step. The fusion module consisted of two residual blocks.
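A simplified sketch of this multi-encoder design is shown below; the cited work uses three encoders and residual fusion blocks, whereas this illustration uses two encoders and a 1×1-convolution fusion module, and the U-Net decoder is omitted for brevity.

```python
import torch
import torch.nn as nn


def conv_block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU())


class DualEncoderFusion(nn.Module):
    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        self.enc_a, self.enc_b, self.fuse = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        cin = 1
        for c in channels:
            self.enc_a.append(conv_block(cin, c))          # encoder for modality A
            self.enc_b.append(conv_block(cin, c))          # encoder for modality B
            self.fuse.append(nn.Conv2d(2 * c, c, 1))       # fusion after each downsampling step
            cin = c
        self.pool = nn.MaxPool2d(2)

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        fused_per_scale = []
        for enc_a, enc_b, fuse in zip(self.enc_a, self.enc_b, self.fuse):
            xa, xb = enc_a(xa), enc_b(xb)
            fused_per_scale.append(fuse(torch.cat([xa, xb], dim=1)))
            xa, xb = self.pool(xa), self.pool(xb)
        # `fused_per_scale` would feed a U-Net decoder via skip connections.
        return fused_per_scale


scales = DualEncoderFusion()(torch.randn(1, 1, 128, 128), torch.randn(1, 1, 128, 128))
```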
Multiscale features can also first be fused within the same modality, which reduces the adverse effects of directly fusing features across different modalities. This method not only improves the richness of the features but also strengthens the association between different modalities. Peng et al. (6) performed the fusion of different MRI modalities to achieve the segmentation of breast cancer tumors. In the feature extraction branch, high- and low-scale features are fused through an atrous spatial pyramid pooling module. In the fusion branch, the features of each modality are calibrated and complemented with one another through an hourglass module to optimize the segmentation results. This dual information interaction preserves the boundary details of the tumor, thereby improving segmentation accuracy.
Decision-level fusion
In decision-level fusion, the data of each modality are first sent to an independent network for deep learning and training, which generates the corresponding prediction results. These prediction results from different modalities are then fused to obtain the final prediction. Figure 15 shows the decision-level fusion network architecture. The key focus of decision-level fusion is to formulate a suitable fusion strategy, which must take into account the accuracy of the prediction results from the different modalities. In the early stages of research in this area, simple methods such as averaging and voting were commonly used for fusion. As the research progressed, the realization that the prediction results from each modality should carry different weights led to the development of methods such as weighted fusion. Sun et al. (42), among others, demonstrated that weighted fusion can achieve better prediction results than the averaging and voting methods.
The average method is a simple and direct fusion approach for breast cancer regression tasks. Rabinovici-Cohen et al. (68) used clinical data and multiple modalities of multiparametric MRI to predict the risk of recurrence in patients. Predictions for the multiple imaging modalities were produced by a fully connected layer, while a random forest was used to obtain the prediction score for the clinical data. Finally, the average method was used for fusion.
In the breast cancer classification task, voting can be used as a simple decision-level fusion strategy. In this approach, multiple classifiers are trained to process data of different modalities and generate category predictions. These results are then voted on, with the category receiving the most votes being selected as the overall classification output. Liu et al. (63) used ResNet50 to classify molecular subtypes from mammography and MRI, with the final classification determined by voting. Xu et al. (31) combined B-US and CEUS for multi-modal learning to achieve breast cancer tumor classification. A network composed of a gated Boltzmann machine and a restricted Boltzmann machine was used for feature learning. Finally, the voting method was applied to obtain the prediction result of breast cancer tumor classification.
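A minimal sketch of majority-vote decision-level fusion is given below, assuming each modality-specific classifier has already produced a predicted class label for every sample.

```python
import numpy as np


def majority_vote(per_modality_preds: np.ndarray) -> np.ndarray:
    """per_modality_preds: (n_modalities, n_samples) array of integer class labels."""
    n_classes = per_modality_preds.max() + 1
    # Count votes per class for each sample, then pick the most-voted class.
    votes = np.apply_along_axis(lambda col: np.bincount(col, minlength=n_classes),
                                0, per_modality_preds)
    return votes.argmax(axis=0)


preds = np.array([[0, 1, 1, 0],   # e.g., mammography branch
                  [1, 1, 0, 0],   # e.g., MRI branch
                  [1, 1, 1, 0]])  # e.g., ultrasound branch
print(majority_vote(preds))       # -> [1 1 1 0]
```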
However, there are limitations to completing multi-modal fusion through voting and averaging. Both methods assume that the prediction results of each modality carry equal weight. In reality, different modalities contribute differently and hold varying importance to the final prediction. To overcome this limitation, researchers have begun exploring alternative fusion methods. For example, compared to voting and averaging, weighted learning can improve the accuracy of breast cancer prediction, as confirmed by Sun et al. (42), Wu et al. (38), and Wang et al. (67), among others. Weights are automatically learned through machine learning algorithms, with different weights being assigned to the prediction results of the various modalities, thus enhancing overall prediction accuracy. Sun et al. (42) used a DNN to learn gene expression, CNA, and clinical data. Through continuous learning, the weight parameters of each modality were adjusted to achieve prognosis prediction. Wu et al. (38) combined three imaging methods for multi-modal learning: bright-field microscopy, fluorescence microscopy, and orthogonal polarization microscopy. The prediction results were obtained via ResNet, and weighted fusion was subsequently achieved through logistic regression. Wang et al. (67) combined digital breast tomosynthesis (DBT) and full-field digital mammography (FFDM) to classify breast cancer tumors. VGG-16 was used to process the ROI in the image, while a multilayer perceptron was used for histogram and texture features. Finally, an evaluation matrix was constructed for weighted fusion. Tang et al. (82) analyzed ultrasound and MR imaging using three CNNs and fused their results with weights learned through logistic regression. Yan et al. (83) used the Dirichlet distribution to dynamically weight semantic features from different modalities, successfully fusing breast ultrasound images from these diverse modalities. Liu et al. (84) obtained MRI prediction results through three branches: utilizing a Residual Multi-Modal Neural Network (ResMM-Net) to learn high-level features from images, employing LASSO for feature selection to aid in diagnosis, and relying on radiologists to conduct visual assessment of the MR images. Finally, logistic regression was used to integrate the results from these three branches.
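The weighted fusion idea can be sketched with a simple logistic-regression stacker: each modality branch outputs a probability, and the learned coefficients act as fusion weights. The probabilities below are synthetic placeholders rather than outputs of any cited model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                        # ground-truth labels
# Simulated per-modality malignancy probabilities (e.g., ultrasound and MRI branches).
p_us = np.clip(y * 0.7 + rng.normal(0.15, 0.2, 200), 0, 1)
p_mri = np.clip(y * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)

X = np.column_stack([p_us, p_mri])                      # stacked branch outputs
fusion = LogisticRegression().fit(X, y)                 # learns modality weights
print("fusion weights:", fusion.coef_)                  # larger weight = more trusted branch
fused_prob = fusion.predict_proba(X)[:, 1]              # final fused prediction
```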
In addition to the commonly used method of weighted learning, researchers have examined a variety of other fusion strategies. Oyelade et al. (85) used TwinCNN to extract features from histology and mammography images. In their approach, the optimization module eliminates non-discriminative features. During the decision stage, the prediction results of the two modalities are combined and passed to the optimization module for secondary optimization, leading to improved abnormality classification.
Traditional fusion methods often fail to fully account for deviation and support. Deviation measures the similarity between the predicted and actual classes, while support reflects the sparseness of these classes. Ignoring these two aspects may affect prediction accuracy. To address this, Arya et al. (53) used sigmoid-gated attention CNNs to extract features from gene expression, CNA, and clinical data. They used nonlinear functions to calculate the support and deviation of each prediction result and Choquet fuzzy integrals to integrate them. Predictions with high support and low deviation were selected as the final results, leading to greater prediction accuracy.
Hybrid fusion
Traditional fusion strategies, including feature-level and decision-level fusion, typically integrate multi-modal data at a single stage within the network. Although effective in certain contexts, their reliance on a fixed fusion point limits the model’s flexibility in capturing complex relationships between modalities.
To address this limitation, hybrid fusion strategies have been developed to allow multiple, stage-specific fusion operations. These approaches enable the integration of multi-modal features both within the feature extraction phase and across the decision-making process, adapting to the nature of the data and the task. Figures 16,17 illustrate two typical architectures: one performs repeated fusion during feature extraction, while the other incorporates fusion across both stages.
Hybrid fusion improves the model’s capacity to capture complementary and hierarchical information across modalities. Based on their underlying techniques, hybrid fusion strategies can be broadly categorized into three types: attention-based methods, cross-modal autoencoder-based methods, and graph-based methods.
Attention-based fusion method
In hybrid fusion networks, cross-modal attention mechanisms have been widely adopted to enhance the modeling of relationships between different modalities. These mechanisms help the model capture both shared and complementary information by dynamically assigning importance weights to features from different sources. Cross-modal attention can be implemented in several ways. One common approach involves computing the similarity between features from different modalities and using the resulting weights to refine the original features. Another approach concatenates features from multiple modalities and processes them using variants of attention modules, such as channel or spatial attention mechanisms. In addition, some models extract queries from one modality while deriving keys and values from another, allowing for a more targeted focus during feature integration.
The cross-attention mechanism based on similarity calculation can highlight the similar features between two modalities. Its ability to improve prediction accuracy has been demonstrated by Kayikci et al. (61) and Yang et al. (69), among others. In the fusion process, this method enhances the retained key information. Kayikci et al. (61) used a sigmoid-gated attention mechanism and feature concatenation to achieve two-stage fusion between clinical data, CNA, and gene expression data. Yang et al. (69) applied ResNet to extract features from the apparent diffusion coefficient (ADC) and diffusion-weighted imaging (DWI) modes of MRI, with the fusion completed through the cross-attention mechanism and the classification consistency module, successfully achieving breast cancer tumor classification.
The features of different modalities can also be concatenated and input into the attention mechanism to complete fusion. This strategy effectively learns the dependencies between different modalities and allows for more flexible feature fusion. Zhang et al. (86) combined diagnostic mammography and ultrasound images for multi-modal learning to determine the molecular subtype of breast cancer. Their model uses an intramodal self-attention mechanism and an intermodal attention mechanism to fuse features multiple times, with the intermodal attention mechanism comprising a self-attention mechanism, a channel attention mechanism, and a spatial attention mechanism.
Extracting the query from one modality and obtaining the key and value from another modality to complete the fusion can establish a mapping relationship between different modalities. This method is especially adept at learning the corresponding relationships between modalities, and its effectiveness has been confirmed in works such as those by Liu et al. (54) and Li et al. (46). Chen et al. (47) combined gene expression data and clinical data for multi-modal learning to achieve prognosis prediction of breast cancer, with the two modalities completing cross-modal fusion through a cross-modal attention mechanism and feature splicing. Liu et al. (54) used an intramodal self-attention mechanism, an intermodal cross-attention mechanism, and an adaptive fusion module for modal fusion, successfully fusing genomic data and pathological images. Li et al. (46) sequentially passed MRI and RNA-sequencing data through a cross-modal attention mechanism and a transformer encoder module to predict the response of breast cancer tumors to neoadjuvant chemotherapy; combining the attention mechanism with the transformer allows for the global capture of similar and complementary information between the two modalities. Mondol et al. (87) proposed a model that extracts features from histopathological images through a self-supervised learning model, employs a variational autoencoder and a self-attention mechanism for feature integration, and then uses a co-dual cross-attention mechanism to fuse genetic data. Finally, the fused features are combined with clinical data, and the survival data imbalance issue is addressed via a weighted Cox loss function, thereby achieving accurate prediction of survival risks for patients with estrogen receptor-positive breast cancer.
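A minimal sketch of this query/key-value form of cross-modal attention is shown below, using PyTorch's built-in multi-head attention; the token counts, dimensions, and pooling are illustrative assumptions rather than the cited architectures.

```python
import torch
import torch.nn as nn

dim, heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

image_tokens = torch.randn(2, 49, dim)   # e.g., MRI patch embeddings (query side)
omics_tokens = torch.randn(2, 20, dim)   # e.g., RNA-seq feature embeddings (key/value side)

# Each image token attends over the omics tokens; the output keeps the image-token
# layout but is re-weighted by cross-modal relevance.
fused_tokens, attn_weights = cross_attn(query=image_tokens, key=omics_tokens, value=omics_tokens)
fused_vector = fused_tokens.mean(dim=1)  # pooled representation for a downstream head
```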
Different forms of attention mechanisms can be combined to achieve multiple-feature fusion. Vo et al. (66) combined the learning of mammograms and clinical features for breast lesion risk assessment, with the fusion accomplished via a linear transformation-based attention mechanism and cross-attention mechanism. Yang et al. (88) proposed a triple-modal interaction mechanism fusion module to effectively integrate DWI and ADC images, achieving the classification of breast cancer tumors. The triple-modal interaction mechanism consists of a cross-attention mechanism, a channel interaction module, and a spatial attention mechanism.
Fusion method based on cross-modal autoencoders
The effectiveness of cross-modal autoencoders in narrowing the gap between different modalities through cross-modal decoders was demonstrated by Tong et al. (48). In the cross-modal autoencoder fusion strategy, an autoencoder first encodes each modality to obtain its latent feature representation. Next, a cross-modal decoder is used to reconstruct the original data and to infer the data of another modality from the hidden layer of one modality, fully capturing the complementary information between the modalities. Different fusion methods are then selected to integrate the hidden features of these modalities, and finally, the network is used for prediction. Fusion based on cross-modal autoencoders is robust and flexible, performing well in the fusion of multi-modal data.
Tong et al. (48) combined gene expression, DNA methylation, miRNA expression, and CNVs for multi-modal learning to analyze the survival of patients with breast cancer. Different modalities of biological molecular data were reconstructed through cross-modal autoencoders, and finally, the average of the different modal data was fed into the network for prediction.
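A hedged sketch of the cross-modal autoencoder idea is given below (not Tong et al.'s exact model): each modality has its own encoder, decoders reconstruct one modality from the other's latent code so that the latents capture complementary information, and the latent codes are then averaged for downstream prediction.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))


class CrossModalAutoencoder(nn.Module):
    def __init__(self, dim_a: int = 500, dim_b: int = 300, latent: int = 64):
        super().__init__()
        self.enc_a, self.enc_b = mlp(dim_a, latent), mlp(dim_b, latent)
        # Cross-modal decoders: reconstruct each modality from the other's latent code.
        self.dec_a_from_b, self.dec_b_from_a = mlp(latent, dim_a), mlp(latent, dim_b)

    def forward(self, xa: torch.Tensor, xb: torch.Tensor):
        za, zb = self.enc_a(xa), self.enc_b(xb)
        recon_a, recon_b = self.dec_a_from_b(zb), self.dec_b_from_a(za)
        fused = (za + zb) / 2          # simple averaging of latent codes for prediction
        return fused, recon_a, recon_b


model = CrossModalAutoencoder()
xa, xb = torch.randn(8, 500), torch.randn(8, 300)   # e.g., gene expression and methylation
fused, ra, rb = model(xa, xb)
loss = nn.functional.mse_loss(ra, xa) + nn.functional.mse_loss(rb, xb)  # reconstruction losses
```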
Fusion based on graph theory
In the framework of hybrid fusion, graph theory–based methods can serve as powerful tools when paired with other fusion methods. The core of graph theory-based multi-modal fusion methods is the building of graph structures that capture the intrinsic associations between different modalities. Various graph theory-based fusion methods exist, such as bipartite graphs and unimodal projections. These methods consider the data of each modality as nodes in the graph, with the weights of the edges representing the similarity between modalities. Graph theory-based fusion methods not only visualize the correlation between modalities through graphical representations but also use graph networks for feature learning. For example, advanced networks such as graph convolutional networks and graph attention networks can be used to learn the nonlinear and complex relationships within the graph structure, enabling effective fusion of information from different modalities.
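As a generic illustration of this idea (not the cited authors' architecture), the sketch below treats feature vectors as graph nodes, builds similarity-weighted edges, and applies one graph-convolution step so that information is mixed along the weighted edges.

```python
import torch
import torch.nn as nn


def gcn_layer(features: torch.Tensor, adj: torch.Tensor, weight: nn.Linear) -> torch.Tensor:
    # Symmetric normalization of the adjacency (with self-loops), then propagation.
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = torch.diag(a.sum(dim=1).rsqrt())
    a_norm = d_inv_sqrt @ a @ d_inv_sqrt
    return torch.relu(a_norm @ weight(features))


# Nodes: per-sample feature vectors (from different modalities) stacked as a small graph.
feats = torch.randn(6, 32)                       # 6 nodes, 32-dimensional features
sim = torch.relu(torch.cosine_similarity(feats.unsqueeze(1), feats.unsqueeze(0), dim=-1))
adj = sim * (1 - torch.eye(6))                   # similarity-weighted edges, self-loops added later
out = gcn_layer(feats, adj, nn.Linear(32, 16))   # fused node representations
```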
Guo et al. (49) combined gene expression, CNA, and clinical data for multi-modal learning to predict the prognosis of breast cancer. The integration process they proposed is carried out in two steps. First, a graph convolutional network fuses the gene expression and CNA features after they are learned by a shallow network. Second, bipartite graphs and unimodal projections are constructed to illustrate the associations between the two modalities.
Conclusions
Multi-modal medical image fusion techniques have been summarized in several reviews and commentaries (11,12,89-91). For example, Pei et al. (90) examined pretrained algorithms for multi-modal fusion and medical multi-modal fusion methods. They used bibliometric algorithms to quantitatively analyze relevant research and visualized research hotspots and future trends. Azam et al. (12) proposed a diverse classification of fusion techniques, such as frequency-domain fusion, spatial-domain fusion, and deep learning fusion. Duan et al. (11) summarized the fusion timeline and challenges of different modalities, including images, biomarkers, and sensor data. Cui et al. (91) discussed how data preprocessing and feature extraction methods affect fusion outcomes. Hermessi et al. (89) provided a detailed overview of the application of techniques such as multiscale geometric decomposition, machine learning, sparse representation, and deep learning in medical image fusion, offering guidance for future research.
Although these articles comprehensively discuss multi-modal medical fusion technologies, they do not provide a complete overview of multi-modal medical public datasets, preprocessing methods for different data types, and the application of multi-modal fusion technologies in actual diagnosis and treatment. Furthermore, some articles do not fully cover fusion methods for various types of data, such as imaging data, pathology data, biomarker data, and text data. Table 5 evaluates these articles based on five aspects: inclusion of public datasets, availability of links to public datasets, discussion of preprocessing methods for different data types, description of the application of multi-modal fusion technologies in diagnosis and treatment, and comprehensive summary of the fusion methods for various data types.
Table 5
In the field of deep learning with single-modality data, breast cancer prediction faces numerous challenges. Because single-modality data provide limited information, only one aspect of the disease can be captured. However, breast cancer is a highly heterogeneous disease, with lesion characteristics varying significantly between patients. Therefore, single-modality data may not fully and deeply reflect the complex characteristics of breast cancer, and accurate prediction cannot be guaranteed. Multi-modal deep learning has been developed to address these issues, constituting a new direction for breast cancer research. However, multi-modal prediction entails higher demands in data selection, data processing, and network architecture design. It requires researchers to obtain high-quality, well-matched multi-modal data and to possess a deep understanding of the characteristics of each modality and its association with breast cancer. In addition, multi-modal learning imposes higher requirements on model design, necessitating the creation of complex network architectures that can capture both similar and complementary information between different modalities, thereby fully exploiting the data from each modality. The purpose of this review is to summarize the latest advancements in this field and provide a comprehensive resource for future researchers.
Common issues
Selection of fusion strategy
This section discusses three fusion strategies, each with its respective advantages and disadvantages. Decision-level fusion maximizes the preservation of unique features from each modality. However, it involves several limitations, including insufficient interaction between modalities, vulnerability to inaccuracies caused by biases in a single modality, and poor interpretability. This method performs fusion only once at the decision-making stage, which limits its ability to fully learn the similarities and complementary information between different modalities. In the early stages of this field, many decision-level fusion methods relied on simple strategies, such as average fusion (68) and voting-based fusion (31,68), which further hindered cross-modal learning. Later, researchers realized that different modalities might contribute variably to the final decision and began exploring weighted fusion techniques. For example, Tang et al. (82) and Liu et al. (84), among others, used logistic regression for weighted fusion. However, if a specific modality produces biased results, assigning it a higher weight may negatively affect the final decision. Additionally, since fusion occurs at the decision stage, it is difficult to trace the impact of individual modality features on the final outcome, leading to poor interpretability, which may reduce trust from both clinicians and patients.
Compared to decision-level fusion, feature-level fusion combines features from different modalities during the feature extraction stage, forming a richer and more comprehensive feature representation that enables the model to make more accurate decisions. In contrast to fusing low-level features before the shallow layers, performing fusion after feature extraction helps avoid the situation in which poor-quality modality data or large differences in original feature scales directly degrade subsequent feature learning. Nevertheless, similar to decision-level fusion, single-layer feature-level fusion typically adopts a single fusion method during the fusion process and thus lacks the flexibility to apply differentiated fusion strategies for different modalities at various processing stages (43,45,59,62,75). This limitation is particularly evident in the fusion of imaging and non-imaging data, where significant modality differences make achieving effective alignment through a single fusion step difficult. In earlier studies, researchers such as Yang et al. (50) and Nakach et al. (73) used traditional machine learning methods, such as LASSO Cox regression and random forests, for feature learning on non-image data. Although these methods can achieve satisfactory results in certain tasks, compared to deep learning methods, they may lack sufficient expressive power when dealing with complex multi-modal data (51,52,70-73). One of the major challenges in feature-level fusion is therefore how to effectively integrate data with significant modality differences when fusion is performed only once.
Hybrid fusion integrates the advantages of both feature-level and decision-level fusion by enabling multiple stages of information interaction during both feature extraction and decision-making. Compared to single-stage fusion strategies, hybrid fusion can more effectively exploit the complementary information between modalities, especially when there are significant differences between them, leading to superior fusion performance. However, hybrid fusion is associated with issues such as high model complexity, optimization difficulties, and increased data requirements. Fusion methods based on attention mechanisms rely on large-scale interactions among high-dimensional features, with the generation and dynamic updating of cross-modal attention matrices significantly increasing computational resource consumption (46,47,54,88). Methods based on cross-modal autoencoders typically require the design of multibranch encoder structures, resulting in a large number of parameters and higher computational demands during both training and inference (48). Additionally, graph-based fusion methods, although capable of explicitly modeling intermodal relationships, require continuous updates of node features and edge weights during feature interactions, leading to considerable time and memory overhead (49). Overall, although hybrid fusion can more thoroughly capture the similarities and complementarities between different modalities, it inevitably faces practical challenges due to the resulting high computational complexity.
Data-related challenges
Data scarcity, excessive noise, and data imbalance are major challenges to the multi-modal prediction of breast cancer. Obtaining matched and sufficient multi-modal breast cancer data is particularly challenging, as it requires fine labeling by experienced clinicians, which is an expensive and time-consuming process. Moreover, data scarcity may lead to overfitting during training. To mitigate these issues, three main strategies are commonly employed. First, augmentation techniques such as rotation, flipping, and scaling can be used to expand the dataset, thereby increasing the number and diversity of training samples. Second, the complexity of the network model can be reduced by decreasing the number of layers or by applying regularization methods to mitigate the risk of overfitting. Third, semisupervised learning can be applied to leverage a small amount of labeled data and a large amount of unlabeled data for training, which can improve model performance in situations in which labeled data are scarce.
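As a brief illustration of the first strategy, the following sketch applies standard torchvision transforms for rotation, flipping, and scaling; the parameter values are illustrative only.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                     # small random rotations
    transforms.RandomHorizontalFlip(p=0.5),                    # left/right flip
    transforms.RandomVerticalFlip(p=0.5),                      # up/down flip
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # mild rescaling and cropping
    transforms.ToTensor(),
])
# `augment` would be applied on-the-fly to each training image, so every epoch
# sees a slightly different version of the same scarce samples.
```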
As multi-modal data integration becomes increasingly sophisticated, concerns regarding data privacy and security will also intensify. Federated learning has emerged as an effective strategy to mitigate data scarcity and protect sensitive information, as it allows decentralized model training across institutions without the need to exchange original data (92).
In addition, researchers can explore the use of image generation techniques to augment datasets, such as GANs (77), diffusion models (93), and variational autoencoders. GANs expand datasets through adversarial learning between a generator and a discriminator. Diffusion models enhance datasets by learning to gradually restore data from noise to its original form. Variational autoencoders introduce probabilistic latent-variable models and can effectively generate highly variable images from lower-dimensional latent spaces. These technologies are at the forefront of image generation research.
ROI extraction
The selection of the ROI is a crucial step in breast cancer prediction. The ROI typically corresponds to the lesion area in breast cancer images. The location, morphology, and other characteristics of the tumor can be determined from the ROI, which is essential for the accurate identification and classification of the tumor. Traditionally, the ROI is manually identified and marked by clinicians. However, this process is cumbersome and time-consuming. Moreover, if the selected ROI is inappropriate or does not accurately represent the diagnostic information of the entire image, it may introduce significant noise, thereby affecting the final diagnostic accuracy and reliability. Aiming to address these issues, researchers have explored using machine learning and deep learning methods to assist in the selection of the ROI. The use of automated tools can reduce human errors.
Future insights
Figure 18 shows the usage proportion and development trend of different fusion strategies including feature-level fusion, decision-level fusion, and hybrid fusion in published research papers on the multi-modal examination of breast cancer over the past 5 years, from 2019 to April 2024. As can be seen from the figure, feature-level fusion and decision-level fusion strategies have been favored by a large number of researchers due to their simplicity and efficiency. As research has progressed, hybrid fusion has gradually become a key focus due to its ability to better capture similarities and complementary features across different modalities.
In clinical practice, these fusion strategies provide novel ideas and methods for the diagnosis and treatment of breast cancer. Applying feature-level and decision-level fusion strategies can significantly improve data integration and facilitate the progress of personalized medicine. Hybrid fusion, meanwhile, shows greater potential in handling complex multi-modal data, enabling more accurate identification of disease features and improving diagnostic accuracy. Therefore, with the further development and optimization of these methods, future breast cancer diagnosis and treatment will increasingly rely on multi-modal data fusion technologies, providing healthcare professionals with greater support in decision-making.
In view of the problems faced by the multi-modal field mentioned above, we provide the following recommendations for future research directions:
- High-quality breast cancer datasets that integrate medical imaging, clinical records, and gene expression profiles should be constructed.
- The overfitting problem caused by data scarcity and difficulty in labeling should be solved. In the future, researchers should explore additional methods to expand datasets, such as GANs, diffusion models, and variational autoencoders to generate more realistic breast cancer data. Semisupervised methods can also be studied to improve generalization capabilities via a small amount of labeled and unlabeled data.
- Multi-modal networks should be designed to enhance feature extraction capabilities, including the use of advanced vision transformer architectures and the introduction of cutting-edge technologies such as attention mechanisms and multiscale approaches.
- Data fusion strategies should be optimized and fusion methods developed that can adaptively adjust the fusion weights of different modalities, fully capturing both the similarities and complementary information between modalities.
- Deep learning technology should be employed to automatically detect ROIs in images, assisting clinicians in quickly locating lesions, reducing human errors, and improving diagnostic efficiency and accuracy.
Acknowledgments
The authors would like to express their sincere gratitude to North China University of Technology, Shenzhen University of Advanced Technology, University of Macau, Tsinghua University, Chinese PLA General Hospital, The First People’s Hospital of Foshan, and Lakehead University for their strong support and collaboration. We deeply appreciate the collective efforts and valuable contributions from all team members during the development of this review.
Footnote
Reporting Checklist: The authors have completed the Narrative Review reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2903/rc
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2903/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Anderson T, Herrera D, Mireku F, Barner K, Kokkinakis A, Dao H, Webber A, Merida AD, Gallo T, Pierobon M. Geographical Variation in Social Determinants of Female Breast Cancer Mortality Across US Counties. JAMA Netw Open 2023;6:e2333618. [Crossref] [PubMed]
- Zhang Y, Sidibe D, Morel O, Meriaudeau F. Deep multimodal fusion for semantic image segmentation: A survey. Image and Vision Computing 2021;105: [Crossref]
- Milosevic M, Jankovic D, Milenkovic A, Stojanov D. Early diagnosis and detection of breast cancer. Technol Health Care 2018;26:729-59. [Crossref] [PubMed]
- Joo S, Ko ES, Kwon S, Jeon E, Jung H, Kim JY, Chung MJ, Im YH. Multimodal deep learning models for the prediction of pathologic response to neoadjuvant chemotherapy in breast cancer. Sci Rep 2021;11:18800. [Crossref] [PubMed]
- Russo J, Frederick J, Ownby HE, Fine G, Hussain M, Krickstein HI, Robbins TO, Rosenberg B. Predictors of recurrence and survival of patients with breast cancer. Am J Clin Pathol 1987;88:123-31. [Crossref] [PubMed]
- Peng C, Zhang Y, Zheng J, Li B, Shen J, Li M, Liu L, Qiu B, Chen DZ. IMIIN: An inter-modality information interaction network for 3D multi-modal breast tumor segmentation. Comput Med Imaging Graph 2022;95:102021. [Crossref] [PubMed]
- Naqvi RA, Haider A, Kim HS, Jeong D, Lee SW. Transformative Noise Reduction: Leveraging a Transformer-Based Deep Network for Medical Image Denoising. Mathematics 2024;12: [Crossref]
- Fakhouri HN, Alawadi S, Awaysheh FM, Alkhabbas F, Zraqou J. A cognitive deep learning approach for medical image processing. Sci Rep 2024;14:4539. [Crossref] [PubMed]
- Chen L, Wu S, Leung SCH. Interdisciplinary approaches to image processing for medical robotics. Front Med (Lausanne) 2025;12:1564678. [Crossref] [PubMed]
- Zhang YN, Xia KR, Li CY, Wei BL, Zhang B. Review of Breast Cancer Pathological Image Processing. Biomed Res Int 2021;2021:1994764. [Crossref] [PubMed]
- Duan J, Xiong J, Li Y, Ding W. Deep learning based multimodal biomedical data fusion: An overview and comparative review. Information Fusion 2024;112: [Crossref]
- Azam MA, Khan KB, Salahuddin S, Rehman E, Khan SA, Khan MA, Kadry S, Gandomi AH. A review on multimodal medical image fusion: Compendious analysis of medical modalities, multimodal databases, fusion techniques and quality metrics. Comput Biol Med 2022;144:105253. [Crossref] [PubMed]
- Gandhi A, Adhvaryu K, Poria S, Cambria E, Hussain A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion 2023;91:424-44.
- Stahlschmidt SR, Ulfenborg B, Synnergren J. Multimodal deep learning for biomedical data fusion: a review. Brief Bioinform 2022;23:bbab569. [Crossref] [PubMed]
- Khan SU, Khan MA, Azhar M, Khan F, Lee Y, Javed M. Multimodal medical image fusion towards future research: A review. Journal of King Saud University-Computer and Information Sciences 2023;35: [Crossref]
- Jochelson MS, Dershaw DD, Sung JS, Heerdt AS, Thornton C, Moskowitz CS, Ferrara J, Morris EA. Bilateral contrast-enhanced dual-energy digital mammography: feasibility and comparison with conventional digital mammography and MR imaging in women with known breast carcinoma. Radiology 2013;266:743-51. [Crossref] [PubMed]
- Curigliano G, Burstein HJ, Winer EP, Gnant M, Dubsky P, Loibl S, et al. De-escalating and escalating treatments for early-stage breast cancer: the St. Gallen International Expert Consensus Conference on the Primary Therapy of Early Breast Cancer 2017. Ann Oncol 2017;28:1700-12. [Crossref] [PubMed]
- Zanardo M, Cozzi A, Trimboli RM, Labaj O, Monti CB, Schiaffino S, Carbonaro LA, Sardanelli F. Technique, protocols and adverse reactions for contrast-enhanced spectral mammography (CESM): a systematic review. Insights Imaging 2019;10:76. [Crossref] [PubMed]
- Patel BK, Gray RJ, Pockaj BA. Potential Cost Savings of Contrast-Enhanced Digital Mammography. AJR Am J Roentgenol 2017;208:W231-7. [Crossref] [PubMed]
- Gilbert FJ, Tucker L, Gillan MG, Willsher P, Cooke J, Duncan KA, Michell MJ, Dobson HM, Lim YY, Suaris T, Astley SM, Morrish O, Young KC, Duffy SW. Accuracy of Digital Breast Tomosynthesis for Depicting Breast Cancer Subgroups in a UK Retrospective Reading Study (TOMMY Trial). Radiology 2015;277:697-706. [Crossref] [PubMed]
- Nguyen T, Levy G, Poncelet E, Le Thanh T, Prolongeau JF, Phalippou J, Massoni F, Laurent N. Overview of digital breast tomosynthesis: Clinical cases, benefits and disadvantages. Diagn Interv Imaging 2015;96:843-59. [Crossref] [PubMed]
- Poplack SP, Tosteson TD, Kogel CA, Nagy HM. Digital breast tomosynthesis: initial experience in 98 women with abnormal digital screening mammography. AJR Am J Roentgenol 2007;189:616-23. [Crossref] [PubMed]
- Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, Conant EF, Fajardo LL, Bassett L, D'Orsi C, Jong R, Rebner M; Digital Mammographic Imaging Screening Trial (DMIST) Investigators Group. Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 2005;353:1773-83. [Crossref] [PubMed]
- Youlden DR, Cramb SM, Dunn NA, Muller JM, Pyke CM, Baade PD. The descriptive epidemiology of female breast cancer: an international comparison of screening, incidence, survival and mortality. Cancer Epidemiol 2012;36:237-48. [Crossref] [PubMed]
- Vaughan CL. Novel imaging approaches to screen for breast cancer: Recent advances and future prospects. Med Eng Phys 2019;72:27-37. [Crossref] [PubMed]
- Jalalian A, Mashohor SB, Mahmud HR, Saripan MI, Ramli AR, Karasfi B. Computer-aided detection/diagnosis of breast cancer in mammography and ultrasound: a review. Clin Imaging 2013;37:420-6. [Crossref] [PubMed]
- Byra M, Sznajder T, Korzinek D, Piotrzkowska-Wroblewska H, Dobruch-Sobczak K, Nowicki A, Marasek K. Impact of Ultrasound Image Reconstruction Method on Breast Lesion Classification with Deep Learning. In: Pattern Recognition and Image Analysis: 9th Iberian Conference, IbPRIA 2019, Madrid, Spain, July 1–4, 2019, Proceedings, Part I [Internet]. Berlin, Heidelberg: Springer-Verlag; 2019:41-52. Available online: https://doi.org/10.1007/978-3-030-31332-6_4
- Debelee TG, Schwenker F, Ibenthal A, Yohannes D. Survey of deep learning in breast cancer image analysis. Evolving Systems 2020;11:143-63.
- Sudarshan VK, Mookiah MR, Acharya UR, Chandran V, Molinari F, Fujita H, Ng KH. Application of wavelet techniques for cancer diagnosis using ultrasound images: A Review. Comput Biol Med 2016;69:97-111. [Crossref] [PubMed]
- Huang R, Jiang L, Xu Y, Gong Y, Ran H, Wang Z, Sun Y. Comparative Diagnostic Accuracy of Contrast-Enhanced Ultrasound and Shear Wave Elastography in Differentiating Benign and Malignant Lesions: A Network Meta-Analysis. Front Oncol 2019;9:102. [Crossref] [PubMed]
- Xu Z, Wang Y, Chen M, Zhang Q. Multi-region radiomics for artificially intelligent diagnosis of breast cancer using multimodal ultrasound. Comput Biol Med 2022;149:105920. [Crossref] [PubMed]
- Tomczak K, Czerwińska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn) 2015;19:A68-77. [Crossref] [PubMed]
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 2012;486:346-52. [Crossref] [PubMed]
- Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, Moore S, Phillips S, Maffitt D, Pringle M, Tarbox L, Prior F. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging 2013;26:1045-57. [Crossref] [PubMed]
- Spanhol FA, Oliveira LS, Petitjean C, Heutte L. A Dataset for Breast Cancer Histopathological Image Classification. IEEE Trans Biomed Eng 2016;63:1455-62. [Crossref] [PubMed]
- Lee RS, Gimenez F, Hoogi A, Miyake KK, Gorovoy M, Rubin DL. A curated mammography data set for use in computer-aided detection and diagnosis research. Sci Data 2017;4:170177. [Crossref] [PubMed]
- Guo D, Lu C, Chen D, Yuan J, Duan Q, Xue Z, Liu S, Huang Y. A multimodal breast cancer diagnosis method based on Knowledge-Augmented Deep Learning. Biomed Signal Process Control 2024;90: [Crossref]
- Wu J, Xu Z, Shang L, Wang Z, Zhou S, Shang H, Wang H, Yin J. Multimodal microscopic imaging with deep learning for highly effective diagnosis of breast cancer. Optics and Lasers in Engineering 2023;168: [Crossref]
- Sharma GN, Dave R, Sanadya J, Sharma P, Sharma KK. Various types and management of breast cancer: an overview. J Adv Pharm Technol Res 2010;1:109-26.
- Arya N, Saha S. Multi-modal advanced deep learning architectures for breast cancer survival prediction. Knowledge-Based Systems 2021;221: [Crossref]
- Park S, Koo JS, Kim MS, Park HS, Lee JS, Lee JS, Kim SI, Park BW. Characteristics and outcomes according to molecular subtypes of breast cancer as classified by a panel of four biomarkers using immunohistochemistry. Breast 2012;21:50-7. [Crossref] [PubMed]
- Sun D, Wang M, Li A. A Multimodal Deep Neural Network for Human Breast Cancer Prognosis Prediction by Integrating Multi-Dimensional Data. IEEE/ACM Trans Comput Biol Bioinform 2019;16:841-50. [Crossref] [PubMed]
- Arya N, Saha S. Multi-Modal Classification for Human Breast Cancer Prognosis Prediction: Proposal of Deep-Learning Based Stacked Ensemble Model. IEEE/ACM Trans Comput Biol Bioinform 2022;19:1032-41. [Crossref] [PubMed]
- Yamakawa Y, Masaoka A, Hashimoto T, Niwa H, Mizuno T, Fujii Y, Nakahara K. A tentative tumor-node-metastasis classification of thymoma. Cancer 1991;68:1984-7. [Crossref] [PubMed]
- Verma M, Abdelrahman L, Collado-Mesa F, Abdel-Mottaleb M. Multimodal Spatiotemporal Deep Learning Framework to Predict Response of Breast Cancer to Neoadjuvant Systemic Therapy. Diagnostics (Basel) 2023.
- Li H, Zhao Y, Duan J, Gu J, Liu Z, Zhang H, Zhang Y, Li ZC. MRI and RNA-seq fusion for prediction of pathological response to neoadjuvant chemotherapy in breast cancer. Displays 2024;83: [Crossref]
- Chen H, Gao M, Zhang Y, Liang W, Zou X. Attention-Based Multi-NMF Deep Neural Network with Multimodality Data for Breast Cancer Prognosis Model. Biomed Res Int 2019;2019:9523719. [Crossref] [PubMed]
- Tong L, Mitchel J, Chatlin K, Wang MD. Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis. BMC Med Inform Decis Mak 2020;20:225. [Crossref] [PubMed]
- Guo W, Liang W, Deng Q, Zou X. A Multimodal Affinity Fusion Network for Predicting the Survival of Breast Cancer Patients. Front Genet 2021;12:709027. [Crossref] [PubMed]
- Yang J, Ju J, Guo L, Ji B, Shi S, Yang Z, Gao S, Yuan X, Tian G, Liang Y, Yuan P. Prediction of HER2-positive breast cancer recurrence and metastasis risk from histopathological images and clinical information via multimodal deep learning. Comput Struct Biotechnol J 2022;20:333-42. [Crossref] [PubMed]
- Yao Y, Lv Y, Tong L, Liang Y, Xi S, Ji B, Zhang G, Li L, Tian G, Tang M, Hu X, Li S, Yang J. ICSDA: a multi-modal deep learning model to predict breast cancer recurrence and metastasis risk by integrating pathological, clinical and gene expression data. Brief Bioinform 2022;23:bbac448. [Crossref] [PubMed]
- Du X, Zhao Y. Multimodal adversarial representation learning for breast cancer prognosis prediction. Comput Biol Med 2023;157:106765. [Crossref] [PubMed]
- Arya N, Saha S. Deviation-support based fuzzy ensemble of multi-modal deep learning classifiers for breast cancer prognosis prediction. Sci Rep 2023;13:21326. [Crossref] [PubMed]
- Liu H, Shi Y, Li A, Wang M. Multi-modal fusion network with intra- and inter-modality attention for prognosis prediction in breast cancer. Comput Biol Med 2024;168:107796. [Crossref] [PubMed]
- Yang A, Xu L, Qin N, Huang D, Liu Z, Shu J. MFU-Net: a deep multimodal fusion network for breast cancer segmentation with dual-layer spectral detector CT. Applied Intelligence 2024;54:3808-24.
- Stańczyk U, Jain LC. Feature selection for data and pattern recognition: An introduction. Feature Selection for Data and Pattern Recognition. Heidelberg: Springer; 2014:1-7.
- Mangai UG, Samanta S, Das S, Chowdhury PR. A Survey of Decision Fusion and Feature Fusion Strategies for Pattern Classification. IETE Technical Review 2010;27:293-307.
- Yan R, Zhang F, Rao X, Lv Z, Li J, Zhang L, Liang S, Li Y, Ren F, Zheng C, Liang J. Richer fusion network for breast cancer classification based on multimodal data. BMC Med Inform Decis Mak 2021;21:134. [Crossref] [PubMed]
- Atrey K, Singh BK, Bodhey NK, Pachori RB. Mammography and ultrasound based dual modality classification of breast cancer using a hybrid deep learning approach. Biomed Signal Process Control 2023;86: [Crossref]
- Ding Y, Yang F, Han M, Li C, Wang Y, Xu X, Zhao M, Zhao M, Yue M, Deng H, Yang H, Yao J, Liu Y. Multi-center study on predicting breast cancer lymph node status from core needle biopsy specimens using multi-modal and multi-instance deep learning. NPJ Breast Cancer 2023;9:58. [Crossref] [PubMed]
- Kayikci S, Khoshgoftaar TM. Breast cancer prediction using gated attentive multimodal deep learning. Journal of Big Data 2023;10: [Crossref]
- Atrey K, Singh BK, Bodhey NK. Integration of ultrasound and mammogram for multimodal classification of breast cancer using hybrid residual neural network and machine learning. Image and Vision Computing 2024;145: [Crossref]
- Liu M, Zhang S, Du Y, Zhang X, Wang D, Ren W, Sun J, Yang S, Zhang G. Identification of Luminal A breast cancer by using deep learning analysis based on multi-modal images. Front Oncol 2023;13:1243126. [Crossref] [PubMed]
- Girshick R. Fast R-CNN. IEEE International Conference on Computer Vision; 2015:1440-8.
- Zhou BY, Wang LF, Yin HH, Wu TF, Ren TT, Peng C, Li DX, Shi H, Sun LP, Zhao CK, Xu HX. Decoding the molecular subtypes of breast cancer seen on multimodal ultrasound images using an assembled convolutional neural network model: A prospective and multicentre study. EBioMedicine 2021;74:103684. [Crossref] [PubMed]
- Vo HQ, Yuan P, He T, Wong ST, Nguyen HV. Multimodal breast lesion classification using cross-attention deep networks. 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI); 2021:1-4.
- Wang L, He Q, Wang X, Song T, Li X, Zhang S, Qin G, Chen W, Zhou L, Zhen X. Multi-criterion decision making-based multi-channel hierarchical fusion of digital breast tomosynthesis and digital mammography for breast mass discrimination. Knowledge-Based Systems 2021;228: [Crossref]
- Rabinovici-Cohen S, Fernández XM, Grandal Rejo B, Hexter E, Hijano Cubelos O, Pajula J, Pölönen H, Reyal F, Rosen-Zvi M. Multimodal Prediction of Five-Year Breast Cancer Recurrence in Women Who Receive Neoadjuvant Chemotherapy. Cancers (Basel) 2022.
- Yang X, Xi X, Yang L, Xu C, Song Z, Nie X, Qiao L, Li C, Shi Q, Yin Y. Multi-modality relation attention network for breast tumor classification. Comput Biol Med 2022;150:106210. [Crossref] [PubMed]
- Miao S, Jia H, Huang W, Cheng K, Zhou W, Wang R. Subcutaneous fat predicts bone metastasis in breast cancer: A novel multimodality-based deep learning model. Cancer Biomark 2024;39:171-85. [Crossref] [PubMed]
- Wan CF, Jiang ZY, Wang YQ, Wang L, Fang H, Jin Y, Dong Q, Zhang XQ, Jiang LX. Radiomics of Multimodal Ultrasound for Early Prediction of Pathologic Complete Response to Neoadjuvant Chemotherapy in Breast Cancer. Acad Radiol 2025;32:1861-73. [Crossref] [PubMed]
- Ye Z, Yuan J, Hong D, Xu P, Liu W. Multimodal diagnostic models and subtype analysis for neoadjuvant therapy in breast cancer. Front Immunol 2025;16:1559200. [Crossref] [PubMed]
- Nakach FZ, Idri A, Tchokponhoue GAD. Multimodal random subspace for breast cancer molecular subtypes prediction by integrating multi-dimensional data. Multimedia Tools and Applications 2024;1-33.
- Misra S, Yoon C, Kim KJ, Managuli R, Barr RG, Baek J, Kim C. Deep learning-based multimodal fusion network for segmentation and classification of breast cancers using B-mode and elastography ultrasound images. Bioeng Transl Med 2023;8:e10480. [Crossref] [PubMed]
- Song J, Zheng Y, Zakir Ullah M, Wang J, Jiang Y, Xu C, Zou Z, Ding G. Multiview multimodal network for breast cancer diagnosis in contrast-enhanced spectral mammography images. Int J Comput Assist Radiol Surg 2021;16:979-88. [Crossref] [PubMed]
- Furtney I, Bradley R, Kabuka MR. Patient Graph Deep Learning to Predict Breast Cancer Molecular Subtype. IEEE/ACM Trans Comput Biol Bioinform 2023;20:3117-27. [Crossref] [PubMed]
- Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative Adversarial Networks. Communications of the ACM 2020;63:139-44.
- Palmal S, Arya N, Saha S, Tripathy S. Integrative prognostic modeling for breast cancer: Unveiling optimal multimodal combinations using graph convolutional networks and calibrated random forest. Applied Soft Computing 2024;154:111379.
- Liu T, Huang J, Liao T, Pu R, Liu S, Peng Y. A Hybrid Deep Learning Model for Predicting Molecular Subtypes of Human Breast Cancer Using Multimodal Data. IRBM 2022;43:62-74.
- Jadoon EK, Khan FG, Shah S, Khan A, ElAffendi M. Deep Learning-Based Multi-Modal Ensemble Classification Approach for Human Breast Cancer Prognosis. IEEE Access 2023;11:85760-9.
- Liu Y, Chen Z, Chen J, Shi Z, Fang G. Pathologic complete response prediction in breast cancer lesion segmentation and neoadjuvant therapy. Front Med (Lausanne) 2023;10:1188207. [Crossref] [PubMed]
- Tang X, Zhang H, Mao R, Zhang Y, Jiang X, Lin M, Xiong L, Chen H, Li L, Wang K, Zhou J. Preoperative Prediction of Axillary Lymph Node Metastasis in Patients With Breast Cancer Through Multimodal Deep Learning Based on Ultrasound and Magnetic Resonance Imaging Images. Acad Radiol 2025;32:1-11. [Crossref] [PubMed]
- Yan P, Gong W, Li M, Zhang J, Li X, Jiang Y, Luo H, Zhou H. TDF-Net: Trusted Dynamic Feature Fusion Network for breast cancer diagnosis using incomplete multimodal ultrasound. Information Fusion 2024;112: [Crossref]
- Liu W, Li L, Deng J, Li W. A comprehensive approach for evaluating lymphovascular invasion in invasive breast cancer: Leveraging multimodal MRI findings, radiomics, and deep learning analysis of intra- and peritumoral regions. Comput Med Imaging Graph 2024;116:102415. [Crossref] [PubMed]
- Oyelade ON, Irunokhai EA, Wang H. A twin convolutional neural network with hybrid binary optimizer for multimodal breast cancer digital image classification. Sci Rep 2024;14:692. [Crossref] [PubMed]
- Zhang T, Tan T, Han L, Appelman L, Veltman J, Wessels R, Duvivier KM, Loo C, Gao Y, Wang X, Horlings HM, Beets-Tan RGH, Mann RM. Predicting breast cancer types on and beyond molecular level in a multi-modal fashion. NPJ Breast Cancer 2023;9:16. [Crossref] [PubMed]
- Mondol RK, Millar EKA, Sowmya A, Meijering E. BioFusionNet: Deep Learning-Based Survival Risk Stratification in ER+ Breast Cancer Through Multifeature and Multimodal Data Fusion. IEEE Journal of Biomedical and Health Informatics 2024;28:5290-302. [Crossref] [PubMed]
- Yang X, Xi X, Wang K, Sun L, Meng L, Nie X, Qiao L, Yin Y. Triple-attention interaction network for breast tumor classification based on multi-modality images. Pattern Recognition 2023;139: [Crossref]
- Hermessi H, Mourali O, Zagrouba E. Multimodal medical image fusion review: Theoretical background and recent advances. Signal Processing 2021;183: [Crossref]
- Pei X, Zuo K, Li Y, Pang Z. A Review of the Application of Multi-modal Deep Learning in Medicine: Bibliometrics and Future Directions. International Journal of Computational Intelligence Systems 2023;16: [Crossref]
- Cui C, Yang H, Wang Y, Zhao S, Asad Z, Coburn LA, Wilson KT, Landman BA, Huo Y. Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: a review. Prog Biomed Eng (Bristol) 2023;5:10. [Crossref] [PubMed]
- Abbas SR, Abbas Z, Zahir A, Lee SW. Federated Learning in Smart Healthcare: A Comprehensive Review on Privacy, Security, and Predictive Analytics with IoT Integration. Healthcare (Basel) 2024;12:2587. [Crossref] [PubMed]
- Nichol A, Dhariwal P. Improved Denoising Diffusion Probabilistic Models. International Conference on Machine Learning (ICML); 2021.


