A two-stage multimodal learning framework for the automated diagnosis of obstructive coronary artery disease based on dynamic single-photon emission computed tomography
Original Article


Rong Wang1,2, Haijun Wang2, Chuan Zhou3, Zhongwei Li4, Dekui Chen2, Yueqian Zhu2, Jing Liu2, Xingyu Wang2, Tongtong Li5, Ping Xie1,4

1The First School of Clinical Medicine, Lanzhou University, Lanzhou, China; 2Department of Nuclear Medicine, Gansu Provincial Hospital, Lanzhou, China; 3Department of Geriatric General Surgery, Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China, Chengdu, China; 4Department of Cardiology, Gansu Provincial Hospital, Lanzhou, China; 5School of Information Science and Engineering, Lanzhou University, Lanzhou, China

Contributions: (I) Conception and design: R Wang, T Li; (II) Administrative support: P Xie; (III) Provision of study materials or patients: R Wang, H Wang, C Zhou, Z Li, D Chen, Y Zhu, J Liu, X Wang; (IV) Collection and assembly of data: R Wang, H Wang, T Li; (V) Data analysis and interpretation: R Wang, T Li; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Ping Xie, MD, PhD. The First School of Clinical Medicine, Lanzhou University, No. 1 Donggang West Road, Lanzhou 730000, China; Department of Cardiology, Gansu Provincial Hospital, No. 204 Donggang West Road, Lanzhou 730000, China. Email: 1160023677@qq.com; Tongtong Li, CS, PhD. School of Information Science and Engineering, Lanzhou University, No. 222 South Tianshui Road, Lanzhou 730000, China. Email: ttli2022@lzu.edu.cn.

Background: Obstructive coronary artery disease (OCAD) is among the most life-threatening cardiovascular diseases in the world. Dynamic single-photon emission computed tomography (D-SPECT) offers a noninvasive technique for visualizing the perfusion of the heart and the functional state of the myocardium. However, computer-aided diagnostic models for OCAD mainly focus on analyzing medical images and neglect the potential benefits of integrating functional parameters derived from quantitative gated single-photon emission computed tomography (QGS) obtained within the same imaging system. The objective of this study was to develop a two-stage multimodal learning framework that integrates D-SPECT images and QGS-derived functional data to enhance diagnostic accuracy and support a one-stop, imaging-based diagnostic workflow for OCAD.

Methods: We developed a two-stage multimodal learning framework for automated OCAD diagnosis using both D-SPECT images and QGS-derived functional data. In stage I, cardiac slices along multiple axial dimensions were extracted as regions of interest (ROIs). In stage II, a multimodal learning network was constructed to extract, fuse, and classify features from both inputs. Furthermore, a feature adaptation weighting mechanism (FAWM) was introduced to adaptively allocate the contributions of different modalities during training. Finally, the performance of the proposed model was evaluated on a dataset of 298 D-SPECT scans.

Results: The multimodal method achieved an accuracy of 81.67%, outperforming models trained with single inputs (D-SPECT images only: 75.00%, P=0.12; QGS-derived data only: 70.00%, P<0.05; paired t-test).

Conclusions: The integration of D-SPECT imaging and QGS-derived functional parameters within a unified multimodal framework significantly improves diagnostic performance for OCAD. This approach demonstrates the feasibility of serving as a one-stop, image-based diagnostic workflow for clinical practice.

Keywords: Computer-aided diagnosis; obstructive coronary artery disease (OCAD); dynamic single-photon emission computed tomography (D-SPECT); deep learning; multimodal fusion


Submitted Mar 11, 2025. Accepted for publication Oct 24, 2025. Published online Dec 31, 2025.

doi: 10.21037/qims-2025-617


Introduction

Obstructive coronary artery disease (OCAD) remains one of the most prevalent and life-threatening cardiovascular diseases in the world (1). The primary cause of OCAD is the accumulation of atherosclerotic plaques in the coronary arteries, leading to reduced blood flow to the myocardium (2). This restriction in blood supply can manifest as angina, myocardial infarction, and even sudden cardiac death. In addition to acute complications, chronic OCAD can also result in heart failure, arrhythmias, and long-term disability, placing a heavy burden on both healthcare systems and patients’ quality of life. Therefore, timely diagnosis and treatment strategies for OCAD are crucial to mitigating the impact of this disease and improving patient prognosis.

Advances in imaging diagnostic technologies have transformed the treatment of OCAD, with technologies such as dynamic single-photon emission computed tomography (D-SPECT) playing a key role (3). D-SPECT is a cutting-edge, noninvasive nuclear medicine imaging technique that uses radionuclide myocardial perfusion imaging to visualize the blood flow to the heart and the functional status of the myocardium, offering higher sensitivity and specificity (4). Furthermore, quantitative gated single-photon emission computed tomography (QGS)-derived functional parameters provide direct evidence for the reliable diagnosis of OCAD (5).

Deep learning technologies provide powerful tools for the intelligent diagnosis of OCAD. These models can automatically extract complex features and learn potential pathological patterns from historical medical data, enabling high-precision diagnosis and prediction of OCAD (6,7). For instance, Sapra et al. (8) proposed a dual approach based on deep learning for diagnosing coronary artery disease and assessing its severity. Kusumoto et al. (9) developed a multibranch three-dimensional (3D) convolutional neural network for automating the diagnosis of myocardial perfusion imaging via single-photon emission computed tomography (SPECT). Furthermore, several recent studies have pioneered the integration of imaging and QGS-derived functional data for the assessment of OCAD. For instance, Otaki et al. (10) developed a deep learning model that integrated myocardial perfusion imaging from SPECT with electronic report data derived from QGS software for predicting coronary artery disease, demonstrating the feasibility and value of multimodal integration. Bock et al. (11) further advanced this field by showing that the integration of electrocardiography, QGS-derived data, and myocardial perfusion scans significantly improves diagnostic performance for coronary artery disease. Although these studies have confirmed the importance of multimodal learning, they did not examine the contribution of each modality during training.

In this study, we developed a two-stage multimodal learning framework for the automated diagnosis of OCAD using D-SPECT and QGS-derived functional data. First, in the region of interest (ROI) segmentation stage, we extracted the cardiac slices along different axial dimensions to define ROIs. We then constructed a multimodal learning network for feature extraction and fusion. Additionally, we designed a feature adaptation weighting mechanism (FAWM) to adaptively allocate the contribution of multimodal features during training. Finally, we validated the robustness of the proposed model on a set of real-world datasets.

The main contributions of this paper are as follows:

  • We propose a two-stage multimodal learning framework for the automated diagnosis of OCAD;
  • We designed a FAWM to adaptively allocate the contribution of multimodal features during training;
  • We validated the robustness of the proposed model on a set of real-world datasets, achieving competitive performance.

We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/rc).


Methods

Participants

We collected the data of 298 participants seen at Gansu Provincial Hospital between April 2023 and June 2024, including 114 patients with OCAD and 184 healthy controls (HCs). This study was approved and supported by the Ethics Committee of Gansu Provincial Hospital (approval No. 2025-012) and was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Informed consent was obtained from all individual participants. All participants underwent D-SPECT scans (Spectrum-Dynamics Medical, Caesarea, Israel). The exclusion criteria for this study were as follows: (I) a history of myocardial infarction or percutaneous coronary intervention (PCI)/coronary artery bypass grafting (CABG); (II) a history or risk of severe bradycardia; (III) dilated or hypertrophic cardiomyopathy; (IV) severe arrhythmia (atrial fibrillation, frequent ventricular premature beats, or paroxysmal tachycardia); (V) missing data; (VI) a lack of diagnosis by coronary angiography or computed tomography angiography (CTA).

All participants underwent 1-day rest and pharmacological stress myocardial perfusion imaging via cadmium zinc telluride (CZT)-based cardiac D-SPECT. All participants were instructed to discontinue the use of nitrate-based vasodilators, beta-blockers, and calcium antagonists; avoid caffeine-containing products within 24 hours; and fast on the day of the exam. The imaging agent and vasodilator used were 99mTc-sestamibi (99mTc-MIBI) and regadenoson, respectively. Additionally, we reconstructed three axial images (vertical long axis, horizontal long axis, and short axis) using QGS software (12) and obtained global left ventricular functional parameters under both stress and rest conditions, including end-diastolic volume (EDV), end-systolic volume (ESV), ejection fraction (EF), peak ejection rate (PER), peak filling rate (PFR), time to peak filling (TTPF), and mean filling rate over the first third of diastole (MFR/3). Importantly, all the aforementioned functional parameters in this study were consistently obtained from the same D-SPECT system through QGS analysis and not from external clinical records. The detailed demographic information is reported in Table 1.

Table 1

Demographic information of participants in this study

Demographics OCAD (n=114) HC (n=184)
Male/female 81/33 94/90
Age (years) 60.65±10.51 57.53±10.14

Data are presented as n or mean ± standard deviation. HC, healthy control; OCAD, obstructive coronary artery disease.

Overview of the methods

In this study, we developed a two-stage multimodal learning framework for automated diagnosis of OCAD (Figure 1), including an ROI segmentation stage and a multimodal learning stage. Specifically, the purpose of the ROI segmentation stage is to extract cardiac slices corresponding to the horizontal long axis, vertical long axis, and short axis from the image as an ROI (Figure 1A). The multimodal learning stage involves the extraction, fusion, and classification of multimodal features (images and records) (Figure 1B). Finally, the predicted results are output through a self-defined multilayer perceptron (MLP) structure.

Figure 1 Overview of the proposed two-stage framework for the automated diagnosis of obstructive coronary artery disease via D-SPECT. (A) Stage I: ROI segmentation. (B) Stage II: multimodal learning. 3D, three-dimensional; D-SPECT, dynamic single-photon emission computed tomography; EDV, end-diastolic volume; EF, ejection fraction; ESV, end-systolic volume; HC, healthy control; MFR/3, mean filling rate over the first third of diastole; OCAD, obstructive coronary artery disease; PER, peak ejection rate; PFR, peak filling rate; QGS, quantitative gated single-photon emission computed tomography; ROI, region of interest; TTPF, time to peak filling.

ROI segmentation

To obtain cardiac slices in different orientations, a connected component detection method is used to calculate the boundaries for ROI segmentation, as shown in Figure 2. Specifically, the raw image is first binarized to distinguish the foreground from the background. Subsequently, a contour detection algorithm is applied to identify all independent connected components in the binary image. To mitigate interference from noise, such as small text elements that may generate minor connected domains, a fixed threshold screening method is employed to filter out connected domains with excessively small areas (set to <4,500 pixels). Finally, the boundaries of the identified connected regions are defined as the segmentation boundaries for the ROI.
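The pipeline above (binarize, detect connected components, discard small components, take the remaining boundaries as ROI borders) can be sketched in pure Python. This is a minimal illustration, not the study's implementation: the flood-fill labeling, the toy grid, and the scaled-down `min_area` (standing in for the 4,500-pixel threshold) are all assumptions.

```python
from collections import deque

def find_rois(binary, min_area=2):
    """Label 4-connected components in a binary grid and return the
    bounding boxes (r0, c0, r1, c1) of components whose pixel area
    meets min_area; smaller components are treated as noise."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                # Flood-fill one connected component.
                queue, pixels = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(pixels) >= min_area:  # fixed-threshold screening
                    ys = [p[0] for p in pixels]
                    xs = [p[1] for p in pixels]
                    boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes

# Toy grid: one 4-pixel blob plus a single noise pixel.
grid = [[0, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 1]]
rois = find_rois(grid, min_area=2)  # only the blob survives screening
```

In practice a library routine such as OpenCV's contour detection would replace the hand-written flood fill, but the area-screening logic is the same.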

Figure 2 An example of the ROI segmentation process. The yellow box indicates the example region, and the white dashed line denotes the boundary along which the ROI is segmented. ROI, region of interest.

Multimodal feature extraction, fusion, and classification

Structure of multimodal feature encoder

We designed two feature extractors to encode features from the D-SPECT images and the QGS-derived functional data, as shown in Figure 1B. For D-SPECT imaging in different orientations, we designed a 3D DenseNet 121, inspired by the classic DenseNet 121 (13), for feature encoding. The 3D DenseNet 121 introduces an additional spatial dimension (i.e., the convolutional kernel size is 3×3×3) and consists of a backbone network with four dense blocks and transition layers. DenseNet effectively alleviates the vanishing gradient problem by introducing dense connections and feature reuse mechanisms (14). For the QGS-derived functional data, we developed a shallow fully connected network for encoding, composed of stacked fully connected layers (FCs) with 19, 128, and 64 neurons, respectively.
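The QGS-branch encoder described above can be sketched in PyTorch. The 19-128-64 layer widths follow the text; the choice of ReLU activations is an assumption, as the paper does not specify them.

```python
import torch
import torch.nn as nn

# Shallow encoder for the 19 QGS-derived functional parameters
# (stress/rest EDV, ESV, EF, PER, PFR, TTPF, MFR/3, etc.).
# The 19-128-64 widths follow the paper; ReLU is an assumption.
qgs_encoder = nn.Sequential(
    nn.Linear(19, 128),
    nn.ReLU(inplace=True),
    nn.Linear(128, 64),
    nn.ReLU(inplace=True),
)

x = torch.randn(12, 19)      # a batch of 12 parameter vectors
features = qgs_encoder(x)    # 64-dim feature embedding per case
```

The image branch would analogously map each axial D-SPECT volume through the 3D DenseNet 121 backbone to a fixed-length feature vector.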

Feature adaptive weighting mechanism

Considering the varying contributions of different feature branches during model training (15), we propose a FAWM to automatically adjust the contribution of each input feature. As shown in Eqs. [1-2], the input features are assigned adaptive weights through the softmax function, and these weights are refined through backpropagation, ensuring adaptive control over the contribution of each modality during training. Notably, this mechanism helps the model learn the complementarity of features across modalities, thereby improving both its effectiveness and robustness.

Y = Σ_{i=1}^{4} σ_i · Y_i [1]

σ_i = e^{Y_i} / Σ_{k=1}^{4} e^{Y_k} [2]

where Y is the integrated feature, Y_i is the input feature of the i-th branch, i ∈ [1, 4], and σ denotes the softmax function.
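Eqs. [1-2] amount to a softmax-weighted sum over the four branch features. The sketch below is one plausible reading in plain Python: it scores each branch by the mean of its feature vector (an assumption, since the paper does not specify how a vector-valued Y_i enters the softmax) and fuses the branches with the resulting weights.

```python
import math

def fawm(branch_features):
    """Feature adaptive weighting mechanism (Eqs. [1-2], sketched).
    Each branch is scored by the mean of its feature vector (an
    assumption); the scores pass through a softmax to give per-branch
    weights sigma_i, and the fused feature is Y = sum_i sigma_i * Y_i."""
    scores = [sum(f) / len(f) for f in branch_features]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]      # sigma_i, summing to 1
    dim = len(branch_features[0])
    fused = [sum(w * f[j] for w, f in zip(weights, branch_features))
             for j in range(dim)]
    return fused, weights

# Four toy 2-dim branch features standing in for the four branches.
fused, sigma = fawm([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [1.0, 3.0]])
```

Because the weights are produced by differentiable operations, gradients flow through them during backpropagation, which is what lets the contribution of each modality adapt over training.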

Multimodal feature classification

To achieve automated classification of multimodal features, we custom-designed an MLP for the automatic diagnosis of OCAD. Specifically, we concatenated the output features from multiple branches along the channel dimension and constructed a stacked three-layer FC structure as the feature classifier for diagnosing OCAD, with neuron configurations set to 256, 128, and 2, respectively.
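A PyTorch sketch of this classification head follows. The 256-128-2 layer widths are from the text; the concatenated input dimension (here four assumed 64-dim branches, giving 256) and the ReLU activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

in_dim = 4 * 64  # four 64-dim branch features, concatenated (assumed layout)

# Stacked three-layer FC classifier with 256, 128, and 2 neurons.
classifier = nn.Sequential(
    nn.Linear(in_dim, 256),
    nn.ReLU(inplace=True),
    nn.Linear(256, 128),
    nn.ReLU(inplace=True),
    nn.Linear(128, 2),   # two output classes: OCAD vs. HC
)

branches = [torch.randn(12, 64) for _ in range(4)]
fused = torch.cat(branches, dim=1)   # concatenation along channels
logits = classifier(fused)           # per-case class logits
```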

Implementation details

In this study, a stratified fivefold cross-validation strategy was employed to evaluate model performance. Specifically, the dataset was divided into training and testing sets at a ratio of 4:1, with one fold of the training set reserved as the validation set for parameter tuning. Notably, the model was trained from scratch, with the batch size and number of training epochs set to 12 and 50, respectively. Furthermore, we employed an early stopping strategy to prevent overfitting. The model parameters were updated via the adaptive moment estimation (Adam) optimizer with a learning rate of 1e−5 and a weight decay coefficient of 1e−4. The experimental code was implemented in PyTorch 2.0 and run on an A100 GPU (Nvidia, Santa Clara, CA, USA). The core code is available from the authors upon reasonable request.
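The stratified 4:1 splitting described above can be sketched in pure Python (in practice a library routine such as scikit-learn's StratifiedKFold would typically be used; this stand-in simply deals each class's indices round-robin across folds).

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions
    approximately preserved in every fold: a minimal stand-in for
    the stratified fivefold cross-validation described above."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal indices round-robin per class
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Labels matching the cohort sizes: 114 OCAD (1) and 184 HC (0).
labels = [1] * 114 + [0] * 184
splits = list(stratified_kfold(labels, k=5))
```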


Results

Evaluation metrics

We used five metrics common in image classification (accuracy, precision, recall, specificity, and F-1 score; Eqs. [3-7]) to comprehensively evaluate the proposed model from different perspectives, and the paired t-test was used for statistical comparison. Furthermore, we determined the receiver operating characteristic (ROC) curves of the proposed model to visually demonstrate its effectiveness.

Accuracy = (TP + TN) / (TP + FP + TN + FN) [3]

Precision = TP / (TP + FP) [4]

Recall = TP / (TP + FN) [5]

Specificity = TN / (TN + FP) [6]

F-1 = (2 × Precision × Recall) / (Precision + Recall) [7]

where TP and FP are the number of true positives and false positives, respectively; and TN and FN are the number of true negatives and false negatives, respectively.
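Eqs. [3-7] can be checked with a few lines of Python. The confusion-matrix counts below are chosen to be consistent with the proposed model's reported metrics (a 60-case test fold with 23 OCAD and 37 HC cases, which is the layout implied by the recall and specificity values in Table 2); they are an illustration rather than published data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five evaluation metrics of Eqs. [3-7] from
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# Counts consistent with the proposed model's Table 2 row.
acc, prec, rec, spec, f1 = classification_metrics(tp=14, fp=2, tn=35, fn=9)
# acc ≈ 0.8167, prec = 0.8750, rec ≈ 0.6087, spec ≈ 0.9459, f1 ≈ 0.7179
```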

Results and analysis

We conducted a series of comparative experiments and ablation studies to evaluate the effectiveness of the proposed model.

Table 2 provides the performance comparison of different modality inputs in the proposed model. The experimental results indicate that the best performance was achieved when both D-SPECT images and QGS-derived functional data were used as inputs, suggesting that the two modalities are complementary and that the proposed model effectively mines the intrinsic relationships between multimodal features. Notably, in the unimodal input experiments, models using only imaging data outperformed those relying solely on QGS-derived parameters, suggesting that medical images may contain more valuable information. This finding also provides guidance for future research directions. Figures 3,4 present the confusion matrix and ROC curves of the proposed model based on D-SPECT image data and QGS-derived data.

Table 2

Comparison of different modality inputs in the proposed model

Modality Accuracy (%) Precision (%) Recall (%) Specificity (%) F-1 (%) P value
D-SPECT only 75.00 75.00 52.17 89.19 61.54 0.12
QGS-derived data only 70.00 60.00 65.22 72.97 62.50 <0.05
Multimodal (proposed) 81.67 87.50 60.87 94.59 71.79

P values were derived from the paired t-test. D-SPECT, dynamic single-photon emission computed tomography; QGS, quantitative gated single-photon emission computed tomography.

Figure 3 Confusion matrix of the proposed model based on D-SPECT image data and QGS-derived data. Rows and columns indicate the predicted classes and the ground truth, respectively. The numbers within the matrix represent the number of patients in the corresponding classification results. The color intensity corresponds to the magnitude of patient counts, with darker colors indicating larger values, as shown by the color bar. D-SPECT, dynamic single-photon emission computed tomography; HC, healthy control; OCAD, obstructive coronary artery disease; QGS, quantitative gated single-photon emission computed tomography.
Figure 4 The ROC curve of the proposed model based on D-SPECT image data and QGS-derived data. The curves correspond to D-SPECT image data only (blue line), QGS-derived data only (orange line), and fusion inputs integrating both (red line). AUC, area under the curve; D-SPECT, dynamic single-photon emission computed tomography; QGS, quantitative gated single-photon emission computed tomography; ROC, receiver operating characteristic.

Table 3 presents the performance comparison of the proposed model with the current classical deep learning models [visual geometry group network (VGGNet) (16), residual net (ResNet) (17), Vision Transformer (18), and Swin Transformer (19)]. According to the experimental results, the proposed model achieved optimal performance, indicating that it can effectively exploit the complementarity between different modalities. Furthermore, the 3D DenseNet introduces dense connections and feature reuse, which not only alleviates the problem of gradient vanishing but also further enhances the interaction between low-level and high-level features, thereby improving the representational capacity of the extracted features.

Table 3

Comparison of the proposed model with the current classical deep learning models

Models Accuracy (%) Precision (%) Recall (%) Specificity (%) F-1 (%) P value
VGGNet 68.33 62.50 43.48 83.78 51.28 <0.05
ResNet 76.67 90.91 43.48 97.30 58.82 <0.05
Vision Transformer 56.67 36.36 17.39 81.08 23.53 <0.05
Swin Transformer 53.33 14.29 4.35 83.78 6.67 <0.05
Ours 81.67 87.50 60.87 94.59 71.79

P values were derived from the paired t-test. ResNet, residual net; VGGNet, visual geometry group network.

Table 4 provides the performance comparison of different fusion strategies in the proposed model. The experimental results demonstrate that feature concatenation is an effective multimodal fusion strategy, achieving the best experimental performance. Of note, the feature concatenation operation preserves the representativeness of both the image features and the QGS-derived features, which contributes to enhancing the generalizability of the fused features. This finding is consistent with previous research (20).

Table 4

Comparison of different fusion strategies in the proposed model

Fusion strategy Accuracy (%) Precision (%) Recall (%) Specificity (%) F-1 (%) P value
Add 78.57 78.57 47.83 91.89 59.46 <0.05
Voting 73.33 76.92 43.48 91.89 55.56 <0.05
Concatenation (proposed) 81.67 87.50 60.87 94.59 71.79

P values were derived from the paired t-test.
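The three fusion strategies compared in Table 4 can be sketched in PyTorch. The 64-dim branch features and the soft-voting formulation are assumptions for illustration; the paper does not detail the exact implementations of the baselines.

```python
import torch

# Branch features (64-dim is an assumed size) for a batch of 12 cases.
img_feat = torch.randn(12, 64)   # image-branch features
qgs_feat = torch.randn(12, 64)   # QGS-branch features

# "Add": element-wise sum; the fused feature keeps the branch dimension.
added = img_feat + qgs_feat

# "Concatenation" (proposed): both representations are preserved side by side.
fused = torch.cat([img_feat, qgs_feat], dim=1)

# "Voting": fuse decisions rather than features, e.g., soft voting over
# per-branch class probabilities (one common reading of the strategy).
logits_img = torch.randn(12, 2)
logits_qgs = torch.randn(12, 2)
votes = (logits_img.softmax(dim=1) + logits_qgs.softmax(dim=1)).argmax(dim=1)
```

Concatenation doubles the fused dimensionality but discards nothing, whereas addition and voting compress the branches before (or after) the classifier sees them, which is consistent with concatenation performing best here.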

Table 5 presents a comparison of the data augmentation in the proposed model. The experimental results indicate that simple geometric transformations (e.g., data translation and rotation) for sample augmentation did not achieve the desired performance. This may be attributed to the fact that geometric transformations merely replicate the original samples without enhancing feature diversity (21). This finding provides guidance for the implementation of data augmentation strategies.

Table 5

Comparison of data augmentation in the proposed model

Data augmentation Accuracy (%) Precision (%) Recall (%) Specificity (%) F-1 (%) P value
√ 66.67 80.00 17.39 97.30 28.57 <0.05
× 81.67 87.50 60.87 94.59 71.79

P values were derived from the paired t-test. √, the operation was performed; ×, the operation was not performed.

Table 6 reports the results of the ablation experiment for the FAWM in the proposed model. The findings demonstrate that the introduction of FAWM achieved the desired performance, indicating that FAWM can effectively allocate the contribution of multimodal features during training. It is well-known that multimodal learning is a dynamic training process that enables deep exploration of the complementarities between modalities at different training stages (22). FAWM enables adaptive regulation of multimodal feature weights during the learning process, which facilitates the proposed model in exploring the interactions among modalities.

Table 6

Ablation results for the FAWM in the proposed model

FAWM Accuracy (%) Precision (%) Recall (%) Specificity (%) F-1 (%) P value
× 76.67 71.43 65.22 83.78 68.18 0.06
√ 81.67 87.50 60.87 94.59 71.79

P values were derived from the paired t-test. ×, the FAWM module was removed; √, the FAWM module was retained. FAWM, feature adaptive weighting mechanism.


Discussion

Interpretation of findings

This study designed a two-stage framework for the automated diagnosis of OCAD based on D-SPECT. Moreover, we comprehensively validated the performance of the proposed model through a series of comparative experiments and ablation studies.

Table 2 demonstrates that integrating D-SPECT and QGS-derived data enhances the diagnostic capability of the model (accuracy =81.67%). In single-modality experiments, we observed that D-SPECT achieved superior performance, suggesting that medical imaging data may contain more valuable information (23). It can thus be concluded that enhancing the feature representation capability of D-SPECT is likely to improve the diagnostic performance of the model. Furthermore, a key advantage of using QGS-derived parameters is that they can be obtained directly from the same D-SPECT system, ensuring data consistency and enabling a streamlined “one-stop” diagnostic workflow without reliance on additional clinical records or laboratory data.

Tables 3,4 indicate that the proposed model achieved state-of-the-art performance, confirming that the constructed 3D DenseNet and the multimodal feature concatenation strategy can effectively enable the mining and fusion of multimodal features. As seen in Table 5, the traditional geometric transformations failed to achieve the goal of data augmentation. This may be attributed to the fact that geometric transformations only replicate samples without enhancing the diversity of features. Nonetheless, data augmentation remains an option for addressing the small sample problem. Additionally, Table 6 provides the results of the ablation study for the FAWM mechanism, demonstrating that this weight adaptation mechanism can effectively and adaptively allocate the contribution of multimodal features during training, which is crucial in multimodal learning (24).

Limitations and future research directions

Although the proposed model achieved the desired performance, certain limitations should be noted: (I) the small sample size limited the validation of the model’s generalization. (II) The black-box nature of deep learning methods reduces the trust in such approaches in clinical settings. (III) Additional multimodal information can be further integrated to enhance the robustness of the proposed model.

In the future, we plan to implement the following strategies to improve the performance of the model: (I) collect more clinical data to validate the model’s generalization ability and use generative adversarial networks for sample generation to enhance the diversity of generated features (25); (II) improve the interpretability of the model through gradient-weighted class activation mapping (26); and (III) incorporate multiple modalities and design more powerful feature encoders to enhance the diagnostic capability of the model.


Conclusions

This study developed a two-stage multimodal learning framework for the automated diagnosis of OCAD based on D-SPECT and QGS-derived functional parameters. Specifically, the first stage aims to segment cardiac slices in different orientations from the raw image as ROIs, and the second stage focuses on the extraction, fusion, and classification of multimodal features. Furthermore, we designed an FAWM to adaptively allocate the contribution of multimodal features during training. Finally, we validated the model’s robustness and generalization from various perspectives using a set of real-world datasets. The experimental results indicate that the proposed method achieves competitive performance and confirm that multimodal information fusion contributes to enhancing the model’s diagnostic capability.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/dss

Funding: This work was supported in part by the National Natural Science Foundation of China (No. 82460051); in part by the Gansu Natural Science Foundation (No. 23JRRA1315); in part by the Science Foundation of Gansu Provincial Hospital (No. 22GSSYD-13); and in part by the Department of Education of Gansu Province "Innovation Star" Project for Excellent Postgraduates (No. 2025 CXZX-050).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-617/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Gansu Provincial Hospital (No. 2025-012), and informed consent was taken from all individual participants.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Acampa W, Assante R, Mannarino T, Zampella E, D'Antonio A, Buongiorno P, Gaudieri V, Nappi C, Giordano A, Mainolfi CG, Petretta M, Cuocolo A. Low-dose dynamic myocardial perfusion imaging by CZT-SPECT in the identification of obstructive coronary artery disease. Eur J Nucl Med Mol Imaging 2020;47:1705-12. [Crossref] [PubMed]
  2. Marzilli M, Merz CN, Boden WE, Bonow RO, Capozza PG, Chilian WM, DeMaria AN, Guarini G, Huqi A, Morrone D, Patel MR, Weintraub WS. Obstructive coronary atherosclerosis and ischemic heart disease: an elusive link! J Am Coll Cardiol 2012;60:951-6. [Crossref] [PubMed]
  3. Zhang J, Xie J, Li M, Fang W, Hsu B. SPECT myocardial blood flow quantitation for the detection of angiographic stenoses with cardiac-dedicated CZT SPECT. J Nucl Cardiol 2023;30:2618-32. [Crossref] [PubMed]
  4. Erlandsson K, Kacperski K, van Gramberg D, Hutton BF. Performance evaluation of D-SPECT: a novel SPECT system for nuclear cardiology. Phys Med Biol 2009;54:2635-49. [Crossref] [PubMed]
  5. Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary-artery disease. N Engl J Med 1979;300:1350-8. [Crossref] [PubMed]
  6. Alizadehsani R, Abdar M, Roshanzamir M, Khosravi A, Kebria PM, Khozeimeh F, Nahavandi S, Sarrafzadegan N, Acharya UR. Machine learning-based coronary artery disease diagnosis: A comprehensive review. Comput Biol Med 2019;111:103346. [Crossref] [PubMed]
  7. Hampe N, Wolterink JM, van Velzen SGM, Leiner T, Išgum I. Machine Learning for Assessment of Coronary Artery Disease in Cardiac CT: A Survey. Front Cardiovasc Med 2019;6:172. [Crossref] [PubMed]
  8. Sapra V, Sapra L, Bhardwaj A, Bharany S, Saxena A, Karim FK, Ghorashi S, Mohamed AW. Integrated approach using deep neural network and CBR for detecting severity of coronary artery disease. Alexandria Engineering Journal 2023;68:709-20.
  9. Kusumoto D, Akiyama T, Hashimoto M, Iwabuchi Y, Katsuki T, Kimura M, Akiba Y, Sawada H, Inohara T, Yuasa S, Fukuda K, Jinzaki M, Ieda M. A deep learning-based automated diagnosis system for SPECT myocardial perfusion imaging. Sci Rep 2024;14:13583. [Crossref] [PubMed]
  10. Otaki Y, Singh A, Kavanagh P, Miller RJH, Parekh T, Tamarappoo BK, et al. Clinical Deployment of Explainable Artificial Intelligence of SPECT for Diagnosis of Coronary Artery Disease. JACC Cardiovasc Imaging 2022;15:1091-102. [Crossref] [PubMed]
  11. Bock C, Walter JE, Rieck B, Strebel I, Rumora K, Schaefer I, Zellweger MJ, Borgwardt K, Müller C. Enhancing the diagnosis of functionally relevant coronary artery disease with machine learning. Nat Commun 2024;15:5034. [Crossref] [PubMed]
  12. Xu Y, Hayes S, Ali I, Ruddy TD, Wells RG, Berman DS, Germano G, Slomka PJ. Automatic and visual reproducibility of perfusion and function measures for myocardial perfusion SPECT. J Nucl Cardiol 2010;17:1050-7. [Crossref] [PubMed]
  13. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA: IEEE; 2017.
  14. Adegun AA, Viriri S. FCN-Based DenseNet Framework for Automated Detection and Classification of Skin Lesions in Dermoscopy Images. IEEE Access 2020;8:150377-96.
  15. Kümmerer M, Wallis TSA, Gatys LA, Bethge M. Understanding Low- and High-Level Contributions to Fixation Prediction. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017.
  16. Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. 2015.
  17. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE; 2016.
  18. Dosovitskiy A. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations. 2021.
  19. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal: IEEE; 2021.
  20. Huang SC, Pareek A, Seyyedi S, Banerjee I, Lungren MP. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit Med 2020;3:136. [Crossref] [PubMed]
  21. Li T, Mao J, Yu J, Zhao Z, Chen M, Yao Z, Fang L, Hu B. Fully automated classification of pulmonary nodules in positron emission tomography-computed tomography imaging using a two-stage multimodal learning approach. Quant Imaging Med Surg 2024;14:5526-40. [Crossref] [PubMed]
  22. Lv F, Chen X, Huang Y, Duan L, Lin G. Progressive Modality Reinforcement for Human Multimodal Emotion Recognition from Unaligned Multimodal Sequences. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE; 2021.
  23. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88. [Crossref] [PubMed]
  24. Gong P, Liu J, Zhang X, Li X, Wei L, He H. Adaptive Multimodal Graph Integration Network for Multimodal Sentiment Analysis. IEEE Transactions on Audio, Speech and Language Processing 2025;33:23-36.
  25. Tran NT, Tran VH, Nguyen NB, Nguyen TK, Cheung NM. On Data Augmentation for GAN Training. IEEE Trans Image Process 2021;30:1882-97. [Crossref] [PubMed]
  26. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE; 2017.
Cite this article as: Wang R, Wang H, Zhou C, Li Z, Chen D, Zhu Y, Liu J, Wang X, Li T, Xie P. A two-stage multimodal learning framework for the automated diagnosis of obstructive coronary artery disease based on dynamic single-photon emission computed tomography. Quant Imaging Med Surg 2026;16(1):43. doi: 10.21037/qims-2025-617
