A two-stage multimodal learning framework for the automated diagnosis of obstructive coronary artery disease based on dynamic single-photon emission computed tomography
Introduction
Obstructive coronary artery disease (OCAD) remains one of the most prevalent and life-threatening cardiovascular diseases in the world (1). The primary cause of OCAD is the accumulation of atherosclerotic plaques in the coronary arteries, leading to reduced blood flow to the myocardium (2). This restriction in blood supply can manifest as angina, myocardial infarction, and even sudden cardiac death. In addition to acute complications, chronic OCAD can also result in heart failure, arrhythmias, and long-term disability, placing a heavy burden on both healthcare systems and patients’ quality of life. Therefore, timely diagnosis and treatment strategies for OCAD are crucial to mitigating the impact of this disease and improving patient prognosis.
Advances in imaging diagnostic technologies have transformed the treatment of OCAD, with technologies such as dynamic single-photon emission computed tomography (D-SPECT) playing a key role (3). D-SPECT is a cutting-edge, noninvasive nuclear medicine imaging technique that uses radionuclide myocardial perfusion imaging to visualize the blood flow to the heart and the functional status of the myocardium, offering higher sensitivity and specificity (4). Furthermore, quantitative gated single-photon emission computed tomography (QGS)-derived functional parameters provide direct evidence for the reliable diagnosis of OCAD (5).
Deep learning technologies provide powerful tools for the intelligent diagnosis of OCAD. These models can automatically extract complex features and learn potential pathological patterns from historical medical data, enabling high-precision diagnosis and prediction of OCAD (6,7). For instance, Sapra et al. (8) proposed a dual deep learning-based approach for diagnosing coronary artery disease and assessing its severity. Kusumoto et al. (9) developed a multibranch three-dimensional (3D) convolutional neural network for the automated diagnosis of myocardial perfusion imaging via single-photon emission computed tomography (SPECT). Furthermore, several recent studies have pioneered the integration of imaging and QGS-derived functional data for the assessment of OCAD. Notably, Otaki et al. (10) developed a deep learning model that integrated myocardial perfusion imaging from SPECT with electronic report data derived from QGS software for predicting coronary artery disease, demonstrating the feasibility and value of multimodal integration. Bock et al. (11) further advanced this field by showing that the integration of electrocardiography, QGS-derived data, and myocardial perfusion scans significantly improves diagnostic performance for coronary artery disease. Although these studies have confirmed the importance of multimodal learning, they did not examine the contribution of each modality during training.
In this study, we developed a two-stage multimodal learning framework for the automated diagnosis of OCAD using D-SPECT and QGS-derived functional data. First, in the region of interest (ROI) segmentation stage, we extracted the cardiac slices along different axial dimensions to define ROIs. We then constructed a multimodal learning network for feature extraction and fusion. Additionally, we designed a feature adaptation weighting mechanism (FAWM) to adaptively allocate the contribution of multimodal features during training. Finally, we validated the robustness of the proposed model on a set of real-world datasets.
The main contributions of this paper are as follows:
- We propose a two-stage multimodal learning framework for the automated diagnosis of OCAD;
- We design a FAWM to adaptively allocate the contribution of multimodal features during training;
- We validate the robustness of the proposed model on a set of real-world datasets, achieving competitive performance.
We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/rc).
Methods
Participants
We collected the data of 298 participants who attended Gansu Provincial Hospital between April 2023 and June 2024, including 114 patients with OCAD and 184 healthy controls (HCs). This study was approved and supported by the Ethics Committee of Gansu Provincial Hospital (approval No. 2025-012) and was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Informed consent was obtained from all individual participants. All participants underwent D-SPECT scans (Spectrum-Dynamics Medical, Caesarea, Israel). The exclusion criteria for this study were as follows: (I) a history of myocardial infarction or percutaneous coronary intervention (PCI)/coronary artery bypass grafting (CABG); (II) a history or risk of severe bradycardia; (III) dilated or hypertrophic cardiomyopathy; (IV) severe arrhythmia (atrial fibrillation, frequent ventricular premature beats, or paroxysmal tachycardia); (V) missing data; (VI) a lack of diagnosis by coronary angiography or computed tomography angiography (CTA).
All participants underwent 1-day rest and pharmacological stress myocardial perfusion imaging via cadmium zinc telluride (CZT)-based cardiac D-SPECT. All participants were instructed to discontinue the use of nitrate-based vasodilators, beta-blockers, and calcium antagonists; avoid caffeine-containing products within 24 hours; and fast on the day of the exam. The imaging agent and vasodilator used were 99mTc-sestamibi (99mTc-MIBI) and regadenoson, respectively. Additionally, we reconstructed three axial images (vertical long axis, horizontal long axis, and short axis) using QGS software (12) and obtained global left ventricular functional parameters under both stress and rest conditions, including end-diastolic volume (EDV), end-systolic volume (ESV), ejection fraction (EF), peak ejection rate (PER), peak filling rate (PFR), time to peak filling (TTPF), and mean filling rate over the first third of diastole (MFR/3). Importantly, all the aforementioned functional parameters in this study were consistently obtained from the same D-SPECT system through QGS analysis and not from external clinical records. The detailed demographic information is reported in Table 1.
Table 1
| Demographics | OCAD (n=114) | HC (n=184) |
|---|---|---|
| Male/female | 81/33 | 94/90 |
| Age (years) | 60.65±10.51 | 57.53±10.14 |
Data are presented as n or mean ± standard deviation. HC, healthy control; OCAD, obstructive coronary artery disease.
Overview of the methods
In this study, we developed a two-stage multimodal learning framework for automated diagnosis of OCAD (Figure 1), including an ROI segmentation stage and a multimodal learning stage. Specifically, the purpose of the ROI segmentation stage is to extract cardiac slices corresponding to the horizontal long axis, vertical long axis, and short axis from the image as an ROI (Figure 1A). The multimodal learning stage involves the extraction, fusion, and classification of multimodal features (images and records) (Figure 1B). Finally, the predicted results are output through a self-defined multilayer perceptron (MLP) structure.
ROI segmentation
To obtain cardiac slices in different orientations, a connected component detection method is used to calculate the boundaries for ROI segmentation, as shown in Figure 2. Specifically, the raw image is first binarized to distinguish the foreground from the background. Subsequently, a contour detection algorithm is applied to identify all independent connected components in the binary image. To mitigate interference from noise, such as small text elements that may generate minor connected domains, a fixed threshold screening method is employed to filter out connected domains with excessively small areas (set to <4,500 pixels). Finally, the boundaries of the identified connected regions are defined as the segmentation boundaries for the ROI.
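The pipeline above can be sketched in pure NumPy: binarize, label 4-connected foreground components, discard components smaller than the fixed 4,500-pixel screening threshold, and return the bounding boxes of the survivors. The binarization threshold, the choice of 4-connectivity, and the (row, col, height, width) output format are illustrative assumptions, not the exact implementation.

```python
import numpy as np
from collections import deque

def segment_roi(image, threshold=0, min_area=4500):
    """Return bounding boxes of large connected foreground components.

    Sketch of the ROI segmentation described above; `min_area` mirrors
    the fixed 4,500-pixel screening threshold used to drop noise such
    as small text elements.
    """
    binary = image > threshold
    visited = np.zeros_like(binary, dtype=bool)
    rows, cols = binary.shape
    boxes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if binary[r0, c0] and not visited[r0, c0]:
                # BFS flood fill over the 4-neighbourhood
                queue = deque([(r0, c0)])
                visited[r0, c0] = True
                pixels = []
                while queue:
                    r, c = queue.popleft()
                    pixels.append((r, c))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < rows and 0 <= nc < cols \
                                and binary[nr, nc] and not visited[nr, nc]:
                            visited[nr, nc] = True
                            queue.append((nr, nc))
                if len(pixels) >= min_area:  # filter small connected domains
                    rs = [p[0] for p in pixels]
                    cs = [p[1] for p in pixels]
                    boxes.append((min(rs), min(cs),
                                  max(rs) - min(rs) + 1,
                                  max(cs) - min(cs) + 1))
    return boxes
```

In practice, a library routine such as OpenCV's contour detection would replace the hand-rolled flood fill; the sketch only makes the filtering logic explicit.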
Multimodal feature extraction, fusion, and classification
Structure of multimodal feature encoder
We designed two feature extractors to encode features from both D-SPECT and QGS-derived functional data, as shown in Figure 1B. For D-SPECT imaging in different orientations, we designed a 3D DenseNet 121, inspired by the classic DenseNet 121 (13), for feature encoding. The 3D DenseNet 121 introduces an additional spatial dimension (i.e., the convolutional kernel size is 3×3×3) and consists of a backbone network with four dense blocks and transition layers. Specifically, DenseNet effectively alleviates the vanishing gradient problem by introducing dense connections and feature reuse mechanisms (14). For QGS-derived functional data, we developed a shallow fully connected neural network for encoding, composed of a series of stacked fully connected layers (FCs) with neuron configurations of 19, 128, and 64, respectively.
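The shallow encoder for the QGS-derived parameters can be sketched as follows. Treating 19 as the input dimension (the 19-128-64 configuration above) and using ReLU activations are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QGSEncoder(nn.Module):
    """Shallow fully connected encoder for QGS-derived functional data.

    Layer widths follow the 19 -> 128 -> 64 configuration described
    above; the ReLU activations are an assumption.
    """
    def __init__(self, in_features=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 64), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, 19) tensor of functional parameters
        return self.net(x)
```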
Feature adaptive weighting mechanism
Considering the varying contributions of different feature branches during model training (15), we propose a FAWM to automatically adjust the contribution of each input feature. As shown in Eqs. [1-2], the input features are assigned adaptive weights through the softmax function, and these weights are updated via backpropagation, ensuring adaptive control over the contribution of each modality during training. Notably, this mechanism helps the model learn the complementarity of features across modalities, thereby improving both its effectiveness and robustness.
where Y is the integrated feature, Yi is the i-th input feature (i = 1, 2, 3, 4), and σ is the softmax function.
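A minimal sketch of this mechanism is given below, assuming one learnable logit per feature branch; the text specifies only that the weights come from a softmax and are learned through backpropagation, so the parameterization here is an assumption.

```python
import torch
import torch.nn as nn

class FAWM(nn.Module):
    """Feature adaptive weighting mechanism (sketch).

    One learnable logit per branch is passed through a softmax, and
    each branch's features are scaled by the resulting weight before
    fusion; the weights therefore always sum to one and are updated
    by backpropagation.
    """
    def __init__(self, num_branches=4):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, features):
        # features: list of per-branch tensors, e.g. three image
        # orientations plus the QGS record branch
        weights = torch.softmax(self.logits, dim=0)
        return [w * f for w, f in zip(weights, features)]
```

The weighted branch features are then fused downstream (by concatenation in the proposed model).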
Multimodal feature classification
To achieve automated classification of multimodal features, we custom-designed an MLP for the automatic diagnosis of OCAD. Specifically, we concatenated the output features from multiple branches along the channel dimension and constructed a stacked three-layer FC structure as the feature classifier for OCAD diagnosis, with neuron configurations set to 256, 128, and 2, respectively.
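The concatenation-plus-MLP head can be sketched as below. The class name `DiagnosisHead` is hypothetical, and the ReLU activations are an assumption; the 256-128-2 layer widths follow the configuration above, with the first layer's input width determined by the concatenated feature size.

```python
import torch
import torch.nn as nn

class DiagnosisHead(nn.Module):
    """Concatenate branch features along channels, then classify with
    a stacked three-layer FC structure (256 -> 128 -> 2 neurons)."""
    def __init__(self, in_features):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2),  # two classes: OCAD vs. HC
        )

    def forward(self, branch_features):
        # branch_features: list of (batch, C_i) tensors from the encoders
        fused = torch.cat(branch_features, dim=1)
        return self.mlp(fused)
```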
Implementation details
In this study, a stratified fivefold cross-validation strategy was employed to evaluate model performance. Specifically, the dataset was divided into training and testing sets at a ratio of 4:1, with one fold of the training set reserved as the validation set for parameter tuning. Notably, the model was trained from scratch, with the batch size and number of training epochs set to 12 and 50, respectively. Furthermore, we employed an early stopping strategy to prevent overfitting. The model parameters were updated via the adaptive moment estimation (Adam) optimizer with a learning rate of 1e−5 and a weight decay coefficient of 1e−4. Additionally, the code was implemented in PyTorch 2.0 and run on an A100 GPU (Nvidia, Santa Clara, CA, USA). If readers are interested in this work, we are willing to respond to their requests and provide the core code.
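The stratified 4:1 splitting protocol can be sketched as follows (in practice a library routine such as scikit-learn's StratifiedKFold would typically be used; the round-robin dealing scheme here is one simple way to preserve class proportions):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Stratified k-fold index split (sketch of the 4:1 protocol).

    Shuffles the indices of each class and deals them round-robin
    into k folds, so every test fold preserves the OCAD/HC ratio.
    Returns a list of (train_indices, test_indices) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        splits.append((train, test))
    return splits
```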
Results
Evaluation metrics
We used five standard classification metrics to comprehensively evaluate the proposed model from different perspectives: accuracy, precision, recall, specificity, and F-1 score, as shown in Eqs. [3-7], respectively. We also performed paired t-tests to assess the statistical significance of performance differences. Furthermore, we determined the receiver operating characteristic (ROC) curves of the proposed model to visually demonstrate its effectiveness.
where TP and FP are the number of true positives and false positives, respectively; and TN and FN are the number of true negatives and false negatives, respectively.
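The five metrics follow their standard confusion-matrix definitions, which can be written compactly as:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics (cf. Eqs. [3-7]).

    Returns accuracy, precision, recall (sensitivity), specificity,
    and F-1 score as fractions in [0, 1].
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1
```

For example, counts of TP=14, FP=2, TN=35, FN=9 over a 60-sample test fold are consistent with the multimodal row of Table 2 (accuracy 81.67%, precision 87.50%, recall 60.87%, specificity 94.59%, F-1 71.79%).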
Results and analysis
We conducted a series of comparative experiments and ablation studies to evaluate the effectiveness of the proposed model.
Table 2 provides the performance comparison of different modality inputs in the proposed model. The experimental results indicated that the best performance was achieved when both D-SPECT imaging data and QGS-derived functional data were used as inputs, suggesting that the two modalities are complementary and that the proposed model effectively mines the intrinsic relationships between multimodal features. Notably, in the unimodal input experiments, the model using only imaging data outperformed the one relying solely on QGS-derived parameters, suggesting that medical images may contain more valuable information. This finding also provides guidance for future research directions. Figures 3,4 present the confusion matrix and ROC curve of the proposed model based on D-SPECT image data and QGS-derived data.
Table 2
| Modality | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F-1 (%) | P value† |
|---|---|---|---|---|---|---|
| D-SPECT only | 75.00 | 75.00 | 52.17 | 89.19 | 61.54 | 0.12 |
| QGS-derived data only | 70.00 | 60.00 | 65.22 | 72.97 | 62.50 | <0.05 |
| Multimodal (proposed) | 81.67 | 87.50 | 60.87 | 94.59 | 71.79 | – |
†, paired t-test. D-SPECT, dynamic single-photon emission computed tomography; QGS, quantitative gated single-photon emission computed tomography.
Table 3 presents the performance comparison of the proposed model with the current classical deep learning models [visual geometry group network (VGGNet) (16), residual net (ResNet) (17), Vision Transformer (18), and Swin Transformer (19)]. According to the experimental results, the proposed model achieved optimal performance, indicating that it can effectively exploit the complementarity between different modalities. Furthermore, the 3D DenseNet introduces dense connections and feature reuse, which not only alleviates the problem of gradient vanishing but also further enhances the interaction between low-level and high-level features, thereby improving the representational capacity of the extracted features.
Table 3
| Models | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F-1 (%) | P value† |
|---|---|---|---|---|---|---|
| VGGNet | 68.33 | 62.50 | 43.48 | 83.78 | 51.28 | <0.05 |
| ResNet | 76.67 | 90.91 | 43.48 | 97.30 | 58.82 | <0.05 |
| Vision Transformer | 56.67 | 36.36 | 17.39 | 81.08 | 23.53 | <0.05 |
| Swin Transformer | 53.33 | 14.29 | 4.35 | 83.78 | 6.67 | <0.05 |
| Ours | 81.67 | 87.50 | 60.87 | 94.59 | 71.79 | – |
†, paired t-test. ResNet, residual net; VGGNet, visual geometry group network.
Table 4 provides the performance comparison of different fusion strategies in the proposed model. The experimental results demonstrate that feature concatenation is an effective multimodal fusion strategy, achieving the best experimental performance. Of note, the feature concatenation operation preserves the representativeness of both the image features and the QGS-derived record features, which contributes to enhancing the generalizability of the fused features. This finding is consistent with previous research (20).
Table 4
| Fusion strategy | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F-1 (%) | P value† |
|---|---|---|---|---|---|---|
| Add | 78.57 | 78.57 | 47.83 | 91.89 | 59.46 | <0.05 |
| Voting | 73.33 | 76.92 | 43.48 | 91.89 | 55.56 | <0.05 |
| Concatenation (proposed) | 81.67 | 87.50 | 60.87 | 94.59 | 71.79 | – |
†, paired t-test.
Table 5 presents a comparison of the data augmentation in the proposed model. The experimental results indicate that simple geometric transformations (e.g., data translation and rotation) for sample augmentation did not achieve the desired performance. This may be attributed to the fact that geometric transformations merely replicate the original samples without enhancing feature diversity (21). This finding provides guidance for the implementation of data augmentation strategies.
Table 5
| Data augmentation | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F-1 (%) | P value† |
|---|---|---|---|---|---|---|
| √ | 66.67 | 80.00 | 17.39 | 97.30 | 28.57 | <0.05 |
| × | 81.67 | 87.50 | 60.87 | 94.59 | 71.79 | – |
†, paired t-test. ×: the operation was not performed. √: the operation was performed.
Table 6 reports the results of the ablation experiment for the FAWM in the proposed model. The findings demonstrate that the introduction of FAWM achieved the desired performance, indicating that FAWM can effectively allocate the contribution of multimodal features during training. It is well-known that multimodal learning is a dynamic training process that enables deep exploration of the complementarities between modalities at different training stages (22). FAWM enables adaptive regulation of multimodal feature weights during the learning process, which facilitates the proposed model in exploring the interactions among modalities.
Table 6
| FAWM | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%) | F-1 (%) | P value† |
|---|---|---|---|---|---|---|
| × | 76.67 | 71.43 | 65.22 | 83.78 | 68.18 | 0.06 |
| √ | 81.67 | 87.50 | 60.87 | 94.59 | 71.79 | – |
†, paired t-test. ×: the FAWM module was removed. √: the FAWM module was retained. FAWM, feature adaptive weighting mechanism.
Discussion
Interpretation of findings
This study designed a two-stage framework for the automated diagnosis of OCAD based on D-SPECT. Moreover, we comprehensively validated the performance of the proposed model through a series of comparative experiments and ablation studies.
Table 2 demonstrates that integrating D-SPECT and QGS-derived data enhances the diagnostic capability of the model (accuracy =81.67%). In single-modality experiments, we observed that D-SPECT achieved superior performance, suggesting that medical imaging data may contain more valuable information (23). It can thus be concluded that enhancing the feature representation capability of D-SPECT is likely to improve the diagnostic performance of the model. Furthermore, a key advantage of using QGS-derived parameters is that they can be obtained directly from the same D-SPECT system, ensuring data consistency and enabling a streamlined “one-stop” diagnostic workflow without reliance on additional clinical records or laboratory data.
Tables 3,4 indicate that the proposed model achieved state-of-the-art performance, confirming that the constructed 3D DenseNet and the multimodal feature concatenation strategy can effectively enable the mining and fusion of multimodal features. As seen in Table 5, the traditional geometric transformations failed to achieve the goal of data augmentation. This may be attributed to the fact that geometric transformations only replicate samples without enhancing the diversity of features. Nonetheless, data augmentation remains an option for addressing the small sample problem. Additionally, Table 6 provides the results of the ablation study for the FAWM mechanism, demonstrating that this weight adaptation mechanism can effectively and adaptively allocate the contribution of multimodal features during training, which is crucial in multimodal learning (24).
Limitations and future research directions
Although the proposed model achieved the desired performance, certain limitations should be noted: (I) the small sample size limited the validation of the model's generalizability; (II) the black-box nature of deep learning methods reduces trust in such approaches in clinical settings; and (III) additional multimodal information could be further integrated to enhance the robustness of the proposed model.
In the future, we plan to implement the following strategies to improve the performance of the model: (I) collect more clinical data to validate the model’s generalization ability and use generative adversarial networks for sample generation to enhance the diversity of generated features (25); (II) improve the interpretability of the model through gradient-weighted class activation mapping (26); and (III) incorporate multiple modalities and design more powerful feature encoders to enhance the diagnostic capability of the model.
Conclusions
This study developed a two-stage multimodal learning framework for the automated diagnosis of OCAD based on D-SPECT and QGS-derived functional parameters. Specifically, the first stage aims to segment cardiac slices in different orientations from the raw image as ROIs, and the second stage focuses on the extraction, fusion, and classification of multimodal features. Furthermore, we designed an FAWM to adaptively allocate the contribution of multimodal features during training. Finally, we validated the model’s robustness and generalization from various perspectives using a set of real-world datasets. The experimental results indicate that the proposed method achieves competitive performance and confirm that multimodal information fusion contributes to enhancing the model’s diagnostic capability.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/dss
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Gansu Provincial Hospital (No. 2025-012), and informed consent was taken from all individual participants.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Acampa W, Assante R, Mannarino T, Zampella E, D'Antonio A, Buongiorno P, Gaudieri V, Nappi C, Giordano A, Mainolfi CG, Petretta M, Cuocolo A. Low-dose dynamic myocardial perfusion imaging by CZT-SPECT in the identification of obstructive coronary artery disease. Eur J Nucl Med Mol Imaging 2020;47:1705-12. [Crossref] [PubMed]
- Marzilli M, Merz CN, Boden WE, Bonow RO, Capozza PG, Chilian WM, DeMaria AN, Guarini G, Huqi A, Morrone D, Patel MR, Weintraub WS. Obstructive coronary atherosclerosis and ischemic heart disease: an elusive link! J Am Coll Cardiol 2012;60:951-6. [Crossref] [PubMed]
- Zhang J, Xie J, Li M, Fang W, Hsu B. SPECT myocardial blood flow quantitation for the detection of angiographic stenoses with cardiac-dedicated CZT SPECT. J Nucl Cardiol 2023;30:2618-32. [Crossref] [PubMed]
- Erlandsson K, Kacperski K, van Gramberg D, Hutton BF. Performance evaluation of D-SPECT: a novel SPECT system for nuclear cardiology. Phys Med Biol 2009;54:2635-49. [Crossref] [PubMed]
- Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary-artery disease. N Engl J Med 1979;300:1350-8. [Crossref] [PubMed]
- Alizadehsani R, Abdar M, Roshanzamir M, Khosravi A, Kebria PM, Khozeimeh F, Nahavandi S, Sarrafzadegan N, Acharya UR. Machine learning-based coronary artery disease diagnosis: A comprehensive review. Comput Biol Med 2019;111:103346. [Crossref] [PubMed]
- Hampe N, Wolterink JM, van Velzen SGM, Leiner T, Išgum I. Machine Learning for Assessment of Coronary Artery Disease in Cardiac CT: A Survey. Front Cardiovasc Med 2019;6:172. [Crossref] [PubMed]
- Sapra V, Sapra L, Bhardwaj A, Bharany S, Saxena A, Karim FK, Ghorashi S, Mohamed AW. Integrated approach using deep neural network and CBR for detecting severity of coronary artery disease. Alexandria Engineering Journal 2023;68:709-20.
- Kusumoto D, Akiyama T, Hashimoto M, Iwabuchi Y, Katsuki T, Kimura M, Akiba Y, Sawada H, Inohara T, Yuasa S, Fukuda K, Jinzaki M, Ieda M. A deep learning-based automated diagnosis system for SPECT myocardial perfusion imaging. Sci Rep 2024;14:13583. [Crossref] [PubMed]
- Otaki Y, Singh A, Kavanagh P, Miller RJH, Parekh T, Tamarappoo BK, et al. Clinical Deployment of Explainable Artificial Intelligence of SPECT for Diagnosis of Coronary Artery Disease. JACC Cardiovasc Imaging 2022;15:1091-102. [Crossref] [PubMed]
- Bock C, Walter JE, Rieck B, Strebel I, Rumora K, Schaefer I, Zellweger MJ, Borgwardt K, Müller C. Enhancing the diagnosis of functionally relevant coronary artery disease with machine learning. Nat Commun 2024;15:5034. [Crossref] [PubMed]
- Xu Y, Hayes S, Ali I, Ruddy TD, Wells RG, Berman DS, Germano G, Slomka PJ. Automatic and visual reproducibility of perfusion and function measures for myocardial perfusion SPECT. J Nucl Cardiol 2010;17:1050-7. [Crossref] [PubMed]
- Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE; 2017.
- Adegun AA, Viriri S. FCN-Based DenseNet Framework for Automated Detection and Classification of Skin Lesions in Dermoscopy Images. IEEE Access 2020;8:150377-96.
- Kümmerer M, Wallis TSA, Gatys LA, Bethge M. Understanding Low- and High-Level Contributions to Fixation Prediction. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017.
- Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. 2015.
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE; 2016.
- Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations. 2021.
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal: IEEE; 2021.
- Huang SC, Pareek A, Seyyedi S, Banerjee I, Lungren MP. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit Med 2020;3:136. [Crossref] [PubMed]
- Li T, Mao J, Yu J, Zhao Z, Chen M, Yao Z, Fang L, Hu B. Fully automated classification of pulmonary nodules in positron emission tomography-computed tomography imaging using a two-stage multimodal learning approach. Quant Imaging Med Surg 2024;14:5526-40. [Crossref] [PubMed]
- Lv F, Chen X, Huang Y, Duan L, Lin G. Progressive Modality Reinforcement for Human Multimodal Emotion Recognition from Unaligned Multimodal Sequences. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE; 2021.
- Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88. [Crossref] [PubMed]
- Gong P, Liu J, Zhang X, Li X, Wei L, He H. Adaptive Multimodal Graph Integration Network for Multimodal Sentiment Analysis. IEEE Transactions on Audio, Speech and Language Processing 2025;33:23-36.
- Tran NT, Tran VH, Nguyen NB, Nguyen TK, Cheung NM. On Data Augmentation for GAN Training. IEEE Trans Image Process 2021;30:1882-97. [Crossref] [PubMed]
- Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017.

