Commercial artificial intelligence–assisted performance and interpretation time of first-on-call radiology residents using computed tomography pulmonary angiography to detect pulmonary embolism: a multireader, multicenter study

Xinyu Song; Shuhao Wang; Jiaoyan Wang; Hongmin Shu; Ruipeng Zhang; Bicong Yan; Yang Qu; Yangtong Li; Linghuan Guo; Yanbo Chen; Dan Wang; Yuehua Li

doi:10.21037/qims-2025-1-2557

Original Article


Xinyu Song¹#, Shuhao Wang²#, Jiaoyan Wang³, Hongmin Shu⁴, Ruipeng Zhang¹, Bicong Yan¹, Yang Qu¹, Yangtong Li¹, Linghuan Guo¹, Yanbo Chen⁵, Dan Wang¹*, Yuehua Li¹*

¹Institute of Diagnostic and Interventional Radiology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China; ²Department of Radiology, Shanghai Guanghua Hospital of Integrative Medicine, Shanghai, China; ³Department of Radiology, Shuguang Hospital, Shanghai University of Traditional Chinese Medicine, Shanghai, China; ⁴Department of Radiology, The First Affiliated Hospital of Anhui Medical University, Hefei, China; ⁵Department of Research and Development, Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China

Contributions: (I) Conception and design: D Wang, Yuehua Li; (II) Administrative support: Yuehua Li; (III) Provision of study materials or patients: X Song, J Wang, H Shu; (IV) Collection and assembly of data: X Song, S Wang, J Wang, H Shu, B Yan, Y Qu, Yangtong Li, L Guo; (V) Data analysis and interpretation: X Song, S Wang, R Zhang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work as co-first authors.

*These authors contributed equally to this work.

Correspondence to: Prof. Dan Wang, MD, PhD; Prof. Yuehua Li, MD, PhD. Institute of Diagnostic and Interventional Radiology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, No. 600 Yishan Road, Xuhui District, Shanghai 200233, China. Email: joshuastonecn@hotmail.com; liyuehua0529@163.com.

Background: Acute pulmonary embolism (PE) is characterized by sudden onset, high mortality, and a propensity for misdiagnosis. Although artificial intelligence (AI)-based software for PE detection has been shown to accurately identify thrombi on computed tomography pulmonary angiography (CTPA), its impact on the performance of first-on-call radiologists of varying experience levels during initial interpretation remains unclear. This study aimed to examine the performance and time of first-on-call radiology residents’ interpretation of CTPA for acute PE with and without AI support.

Methods: This retrospective study included 196 consecutive emergency CTPA examinations (55 PE-positive and 243 clots) from three centers between June 2023 and August 2023. Six residents (1–6 years’ experience) independently interpreted all CTPA scans with and without National Medical Products Administration-approved AI assistance software that provided PE triage, clot detection, and quantification. The reference standard was established by two senior radiologists. Performance metrics (sensitivity, specificity, and Youden index) and interpretation time were compared. AI dependency was assessed by quantifying AI-induced decision changes.

Results: AI support improved patient-level sensitivity (0.83 to 0.96) and Youden index (0.80 to 0.91), with more pronounced gains at the clot level (sensitivity: 0.75 to 0.95; Youden index: 0.56 to 0.81). Junior residents (<5 years) showed greater improvement than did seniors (≥5 years). Mean interpretation time decreased by 10 seconds (18%) per case (P<0.001), with greater reductions observed among junior residents and for non-PE cases. AI significantly reduced the risk of missed clots [odds ratio (OR) 0.11; 95% confidence interval (CI): 0.08–0.15; P<0.001] and increased the incidence of false positives, but not significantly so (OR 1.46; 95% CI: 0.85–2.52; P=0.17). Junior residents exhibited higher AI dependency, yet the overall rates of incorrect AI guidance remained low (1.6% in clot recall and 2.8% in negative screening).

Conclusions: The AI system enhances diagnostic performance and efficiency of CTPA interpretation by first-on-call residents, substantially reducing missed PE and narrowing experience-based performance gaps.

Keywords: Pulmonary embolism (PE); computed tomography angiography (CTA); artificial intelligence (AI); on-call; radiology residents

Submitted Nov 27, 2025. Accepted for publication Mar 09, 2026. Published online Apr 14, 2026.


Introduction

Acute pulmonary embolism (PE), characterized by the occlusion of pulmonary arteries due to blood clots or other materials from elsewhere in the body, ranks as the third most common cardiovascular disease (1,2). Due to its nonspecific clinical presentation, PE is frequently misdiagnosed and carries a high mortality rate, and thus timely and accurate diagnosis is critical for physicians in instituting appropriate treatment strategies (2). Computed tomography pulmonary angiography (CTPA), which directly visualizes thrombus morphology under contrast enhancement, has emerged as the reference standard for diagnosing PE (3). Accurate interpretation of CTPA examinations requires substantial experience and time. However, the majority of CTPA examinations are negative, with reported positive yields ranging from 10% to 30% (4,5). Furthermore, as the volume of CTPA requests increases, the positive yield is likely to decrease further, meaning a substantial number of negative cases occupy the valuable time of emergency radiologists. Additionally, first-on-call radiology residents, who generate the bulk of preliminary reports in emergency radiology (6), are particularly prone to diagnostic errors due to their limited experience and fatigue (7,8), which poses unpredictable risks to patients with PE.

Deep learning models based on convolutional neural networks (CNNs) are increasingly being applied to the automated detection and assessment of PE (4,5,9-16). These models have demonstrated high sensitivity and specificity in both internal and external test sets. Several commercial artificial intelligence (AI) algorithms have received regulatory approval from national authorities for clinical use. Recent studies on radiologist-AI interaction have primarily focused on patient-level triage comparisons (12,17,18), but few have examined clot-level interactions. The severity of PE is currently assessed according to thrombus burden scoring systems (19,20). Thrombus burden has a direct impact on hemodynamics, and a higher burden is associated with a higher incidence of right heart failure, a critical complication (21-23). Accurate thrombus burden scoring is contingent upon the precise identification of thrombi by the radiologist. Furthermore, the potential impact of AI assistance on the thrombus detection performance of first-on-call radiology residents has not yet been extensively reported.

Therefore, the aims of this study were as follows: (I) to evaluate and compare the diagnostic performance of first-on-call radiology residents in interpreting CTPA with and without AI assistance at both the patient and clot levels; (II) to assess the impact of AI assistance on interpretation time; and (III) to evaluate the effect of AI algorithm outputs on diagnostic efficacy and dependency. We present this article in accordance with the STARD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2557/rc).

Methods

This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments and was approved by the Research Ethics Committee of Shanghai Sixth People’s Hospital (No. 2023-KY-124(K)). The requirement for informed consent was waived by the Research Ethics Committee due to the retrospective nature of the analysis. All participating hospitals were informed of and agreed to the study protocol. All data were anonymized prior to analysis.

Study dataset

In this multicenter study, consecutive CTPA imaging datasets were retrospectively collected from three participating centers. The contributing institutions included Shanghai Sixth People’s Hospital (center 1), Shanghai Shuguang Hospital (center 2), and The First Affiliated Hospital of Anhui Medical University (center 3). All consecutive participants who underwent CTPA in the emergency department for suspected PE between June 2023 and August 2023 were identified and enrolled through the picture archiving and communication system. The exclusion criteria were as follows: (I) CTPA images with motion artifacts; (II) CTPA images with poor opacification of the pulmonary arteries; and (III) the presence of metallic implants. Detailed enrollment information for each center is provided in Figure 1. All images were de-identified and encoded prior to analysis. The CTPA acquisition parameters for each center are provided in Table S1.

Figure 1 Data flow and experimental plan of the study. AI, artificial intelligence; CTPA, computed tomography pulmonary angiography.

Definition of the reference standard

De-identified CTPA images were exported to an internal annotation platform. Each patient’s images were independently annotated by two senior radiologists with 15 and 10 years of emergency reading experience, respectively. Annotations included both patient-level and clot-level assessments. Any disagreement was resolved by consensus reading with another senior radiologist with 20 years of experience.

Interpretation by readers

Six first-on-call radiology residents participated in this study. All readers were blinded to the original CT reports and reference standard. Their experience in interpreting CTPA ranged from 1 to 6 years (median 2 years). Based on experience level, four residents with less than 5 years of experience were classified as junior residents, and two residents with 5 or more years of experience were classified as senior residents. All interpretations were performed on the internal platform with diagnostic-quality monitors.

Each reader first performed an independent reading session and then an AI-assisted reading session. To minimize recall bias, the case order was randomized for each reading session. A minimum washout period of 8 weeks was enforced between the two sessions for each reader. For each case, readers were required to make a binary recall decision (positive or negative for PE). If recalled as positive, readers were required to annotate the identified suspicious regions. In the AI-assisted session, readers could add, withdraw, or change the AI-generated labels to determine their final annotations. The time spent by readers on each case was automatically recorded from when the case was opened until the reader proceeded to the next examination (Figure 1).

Study algorithm

The commercially available deep learning software uAI-PulmonaryEmbolism (United Imaging Intelligence, Shanghai, China; https://www.uii-ai.com/product/17.html) was used in this study. This software has received class III medical device approval from the National Medical Products Administration of China (No. 20243211866). It utilizes a deep CNN to detect and quantify PE on CTPA images.

The system uses a three-stage deep learning framework for automated PE detection on CTPA images, integrating vessel and embolus segmentation with subsequent false-positive reduction. In the first stage, a CNN performs pulmonary vessel segmentation to produce a dilated vessel mask and a vascular skeleton. The mask encompasses both pulmonary arteries and veins, whereas the skeleton specifically represents the arterial tree. These anatomical priors, combined with the original CTPA image, serve as inputs for the second stage, in which a dedicated CNN exploits this geometric and contextual information to delineate masks for pulmonary arteries and emboli. In the final stage, connected components are extracted from the embolus segmentation mask to form candidate embolic regions. Each candidate is evaluated by a CNN-based false-positive reduction module that assigns a classification probability, while the average probability of the top 50% of voxels within the probability map is also computed to refine confidence estimation. This hierarchical, anatomy-guided design is intended to enhance detection reliability and achieve high clinical accuracy in identifying embolic lesions. The algorithm was developed based on a substantial dataset of CTPA imaging data sourced from the RSNA Pulmonary Embolism CT Dataset (24) and various medical institutions in China. None of the data used in this study were part of the algorithm’s development dataset.
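The candidate-extraction step of the final stage can be illustrated with a minimal Python sketch. This is not the vendor's implementation: the 0.5 binarization threshold, the 6-connectivity rule, the dictionary-based volume representation, and the function names `extract_candidates` and `top_half_mean` are all assumptions made for illustration. The sketch groups embolus voxels into connected candidates and computes the mean probability of the highest-scoring 50% of voxels in each candidate.

```python
from collections import deque

# Hypothetical sketch of candidate extraction and top-50% scoring.
# Threshold, connectivity, and data layout are illustrative assumptions,
# not details of the commercial software.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def extract_candidates(prob_map, threshold=0.5):
    """Group voxels with probability >= threshold into 6-connected components.

    prob_map maps (z, y, x) voxel coordinates to an embolus probability.
    Returns a list of components, each a list of voxel coordinates.
    """
    mask = {voxel for voxel, p in prob_map.items() if p >= threshold}
    seen = set()
    components = []
    for seed in sorted(mask):
        if seed in seen:
            continue
        seen.add(seed)
        queue = deque([seed])
        component = []
        while queue:  # breadth-first flood fill from the seed voxel
            z, y, x = queue.popleft()
            component.append((z, y, x))
            for dz, dy, dx in NEIGHBORS:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in mask and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def top_half_mean(prob_map, component):
    """Mean probability of the highest-scoring 50% of voxels (at least one)."""
    probs = sorted((prob_map[voxel] for voxel in component), reverse=True)
    k = max(1, len(probs) // 2)
    return sum(probs[:k]) / k


# Toy probability map: one 4-voxel candidate, one isolated voxel,
# and one voxel below the threshold.
toy = {
    (0, 0, 0): 0.9, (0, 0, 1): 0.8, (0, 1, 1): 0.7, (1, 1, 1): 0.6,
    (5, 5, 5): 0.55, (3, 3, 3): 0.2,
}
candidates = extract_candidates(toy)
scores = [top_half_mean(toy, c) for c in candidates]
```

In a real pipeline the per-candidate score would be combined with the classifier probability from the false-positive reduction module; how the two are combined is not described in the text and is left out here.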

AI model dependency assessment

To evaluate the impact of the AI model on reader decisions, the following metrics were calculated for each reader under AI guidance: changed predictions (%), incorrectly induced (%), and correctly induced (%). The respective formulas for these metrics were as follows: changed predictions (%) = (number of AI-guided decision alterations/total number of cases) × 100%; incorrectly induced (%) = (number of incorrect AI-guided decision alterations/total number of cases) × 100%; correctly induced (%) = (number of correct AI-guided decision alterations/total number of cases) × 100%. This dependency assessment was performed at two levels: the positive clot level and the negative patient level.
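The three dependency metrics can be computed as in the following minimal Python sketch. It assumes that an "AI-guided decision alteration" is any item (clot or negative patient) for which a reader's decision differs between the unaided and AI-assisted sessions, and that an alteration is "correct" when the AI-assisted decision matches the reference standard. The function name and the toy decisions are hypothetical.

```python
def dependency_metrics(without_ai, with_ai, truth):
    """Return changed, correctly induced, and incorrectly induced rates (%).

    Each argument is a sequence of binary decisions (True = recalled as PE),
    one entry per item; truth holds the reference standard.
    Assumption: any change between sessions is attributed to AI guidance.
    """
    if not truth or not (len(without_ai) == len(with_ai) == len(truth)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(truth)
    changed = correct = incorrect = 0
    for before, after, reference in zip(without_ai, with_ai, truth):
        if before != after:
            changed += 1
            if after == reference:
                correct += 1
            else:
                incorrect += 1
    return {
        "changed_pct": 100.0 * changed / total,
        "correctly_induced_pct": 100.0 * correct / total,
        "incorrectly_induced_pct": 100.0 * incorrect / total,
    }


# Toy example (hypothetical decisions): ten positive clots for one reader.
truth = [True] * 10
without_ai = [True, False, False, False, True, True, True, True, False, True]
with_ai = [True, True, True, True, True, False, True, True, False, True]
result = dependency_metrics(without_ai, with_ai, truth)
```

Because all three rates share the same denominator, the changed-prediction rate equals the sum of the correctly and incorrectly induced rates.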

Data analysis

Data normality was assessed with the Shapiro-Wilk test. Since reader outputs were binary in this study, reader performance metrics [sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and Youden index] with and without AI assistance were calculated. Paired differences in sensitivity, specificity, and Youden index were analyzed via Wilcoxon signed-rank tests, with 95% confidence intervals (CIs) derived via bootstrap (10,000 resamples). P values were adjusted with the Holm method for multiple comparisons. A two-sided P value <0.05 was considered statistically significant. Reading time differences were evaluated with 20,000 bootstrap resamples. A two-tailed P value was obtained by doubling the smaller proportion of resampled mean differences above or below zero (25). Binomial generalized linear models (logistic regression) assessed the effects of AI assistance, reader experience, and their interaction on diagnostic error, with the results being reported as odds ratios (ORs) with 95% CIs. Model comparisons were conducted via the Akaike information criterion, likelihood ratio tests, and the McNemar test. All analyses were conducted with GraphPad Prism version 10.1.2 (Dotmatics, Boston, MA, USA) and Python version 3.12.4 (Python Software Foundation, Wilmington, DE, USA) with the pandas, NumPy, statsmodels, and SciPy packages.
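For reference, the binary performance metrics named above follow directly from a reader's confusion-matrix counts. The sketch below uses the standard definitions (Youden index = sensitivity + specificity − 1); the function name and the counts are hypothetical and are not taken from the study data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard binary diagnostic metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e., there is at least one
    reference-positive item, one reference-negative item, one positive call,
    and one negative call.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "youden": sensitivity + specificity - 1,
    }


# Hypothetical counts for a single reader (not study data).
metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
```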

The sample size was determined based on a paired comparison of clot-level sensitivities via the McNemar test for correlated proportions. We assumed that reader sensitivity would increase from 70% without AI assistance to 90% with AI, with a discordant proportion of 0.22 for clots missed without AI but detected with AI (P01) and 0.02 for clots detected without AI but missed with AI (P10). To achieve 90% statistical power at a two-sided significance level of 0.05, a minimum of 59 clots was required. Given an average of 1.5 clots per patient with PE, this would correspond to 40 patients. Accounting for an anticipated 10% dropout rate, we deemed that a total of 45 patients would be required. The sample size was computed with PASS 2023 version 23.0.2 software (NCSS, Kaysville, UT, USA).
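The arithmetic of this calculation can be checked with a short Python sketch. It uses a commonly cited normal-approximation formula for the McNemar test, n = [z(1−α/2)·√(P01+P10) + z(power)·√(P01+P10−(P01−P10)²)]² / (P01−P10)². This is an assumption about the method: the text does not state which formula PASS applied, although the approximation gives the same minimum of 59 clots under the stated inputs.

```python
from math import ceil, sqrt
from statistics import NormalDist


def mcnemar_sample_size(p01, p10, alpha=0.05, power=0.90):
    """Pairs needed for a two-sided McNemar test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_discordant = p01 + p10   # total discordant proportion
    delta = p01 - p10          # difference in paired proportions
    n = (z_alpha * sqrt(p_discordant)
         + z_beta * sqrt(p_discordant - delta ** 2)) ** 2 / delta ** 2
    return ceil(n)


clots = mcnemar_sample_size(p01=0.22, p10=0.02)  # minimum number of clots
patients = ceil(clots / 1.5)                     # 1.5 clots per PE patient
enrolled = ceil(patients / (1 - 0.10))           # allow for 10% dropout
print(clots, patients, enrolled)                 # 59 40 45
```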

Results

Characteristics of the study sample

During the study period, 196 consecutive patients (median age 68 years, IQR: 51–73 years) underwent CTPA at the participating institutions. Among them, 63, 59, and 74 patients were enrolled from centers 1, 2, and 3, respectively. The final dataset included 55 (28.1%) patients with PE, comprising 243 clot lesions in total (Table 1).

Table 1

Characteristics of the selected study group

Variable	Total (n=196)	Center 1 (n=63)	Center 2 (n=59)	Center 3 (n=74)
Age (years)	68 (51–73)	69 (55–79)	67 (56–77)	69 (54–76)
Male	91 (46.4)	27 (42.9)	29 (49.2)	35 (47.3)
Attenuation of pulmonary trunk (HU)	434 (359–613)	432 (367–591)	561 (334–680)	389 (347–461)
Patients with PE	55 (28.1)	15 (23.8)	11 (18.6)	29 (39.2)
Blood clot (n)	3 (1–6)	2 (1–5)	1 (1–2.5)	4 (1–9)
Volume (mm³)	391.0 (64.0–2,048.5)	331.7 (84.3–1,662.6)	298.3 (45.1–908.9)	803.4 (143.1–2,481.0)
Number of PE lesions	243	49	28	166
Central	81 (33.3)	16 (32.7)	8 (28.6)	57 (34.3)
Segmental/subsegmental	162 (66.7)	33 (67.3)	20 (71.4)	109 (65.7)
Clot volume (mm³)	104.3 (40.8–285.4)	89.0 (32.8–311.3)	70.7 (46.0–468.7)	107.5 (44.3–274.5)

Data are presented as median (interquartile range) or n (%), unless otherwise indicated. Center 1, Shanghai Sixth People’s Hospital; center 2, Shanghai Shuguang Hospital; center 3, The First Affiliated Hospital of Anhui Medical University. n, number of thrombi in patients with PE. HU, Hounsfield unit; PE, pulmonary embolism.

Changes in diagnostic performance at the patient level

With AI support, the mean reader sensitivity increased from 0.83 to 0.96, and the mean Youden index increased from 0.80 to 0.91 (Table S2), which was comparable to the performance of stand-alone AI (Figure S1). As shown in Figure 2, although all readers showed improved lesion detection with AI, a slight increase in the rate of false positives was observed (Figure 2A), which was reflected by minor decreases in mean specificity and PPV (by 0.01 and 0.02, respectively). Improvements in diagnostic performance with AI, as indicated by Youden index difference, were more pronounced among junior residents (0.15) than among senior residents (0.05) (Figure 3).

Figure 2 Diagnostic performance changes for PE at the (A) patient level and (B) clot level with and without AI support. Solid circles and solid triangles show diagnostic performance without and with AI support, respectively. Arrows point from the performance without AI support to that with AI support. Open circles and open triangles represent the mean across all radiologists. Reading performance was higher with AI support than without it. AI, artificial intelligence; PE, pulmonary embolism.

Figure 3 Mean diagnostic performance across radiologist subgroups. (A-C) Patient level. (D-F) Clot level. AI assistance improved overall diagnostic performance, with a more pronounced benefit observed in junior residents (<5 years of experience) than in senior residents (≥5 years of experience). AI, artificial intelligence; NPV, negative predictive value; PPV, positive predictive value; Se, sensitivity; Sp, specificity.

Changes in diagnostic performance at the clot level

AI assistance improved readers’ true-positive rate while reducing the false-positive rate (Table S3, Figure 2B). As summarized in Table 2, AI support increased the mean sensitivity from 0.75 to 0.95 and the mean Youden index from 0.56 to 0.81, approaching the performance of the stand-alone AI (Figure S1).

Table 2

Mean diagnostic performance for subgroups reading with and without AI support at the clot level

Subgroup	Sensitivity					Specificity					Youden index
Subgroup	Stand-alone AI	Without AI	With AI	Difference (95% CI)^†	P	Stand-alone AI	Without AI	With AI	Difference (95% CI)^†	P	Stand-alone AI	Without AI	With AI	Difference (95% CI)^†	P
All clots	0.94	0.75	0.95	0.20 (0.07–0.32)	0.018	0.91	0.80	0.85	0.06 (0.02–0.09)	0.018	0.86	0.56	0.81	0.25 (0.16–0.34)	0.011
Experience level
<5 years	–	0.68	0.95	0.27 (0.19–0.35)	0.005	–	0.81	0.86	0.06 (−0.01–0.12)	0.083	–	0.49	0.81	0.32 (0.27–0.38)	0.005
≥5 years	–	0.91	0.97	0.06 (−0.18–0.29)	0.354	–	0.79	0.85	0.06 (−0.15–0.26)	0.354	–	0.70	0.81	0.11 (0.11–0.12)	0.044
Clot location
Central	1.00	0.86	0.99	0.13 (0.05–0.21)	0.027	0.97	0.97	0.97	0.00 (0.00–0.01)	0.221	0.97	0.83	0.97	0.14 (0.08–0.19)	0.027
Segmental/subsegmental	0.92	0.71	0.94	0.23 (0.08–0.38)	0.023	0.94	0.82	0.88	0.06 (0.02–0.09)	0.023	0.86	0.53	0.81	0.29 (0.18–0.39)	0.017
Clot volume (mm³)^‡
≤ Q1	0.83	0.62	0.87	0.25 (0.11–0.40)	0.012	0.95	0.83	0.88	0.05 (0.02–0.09)	0.012	0.79	0.45	0.76	0.31 (0.20–0.41)	0.009
Q1 < V ≤ Q2	0.95	0.70	0.96	0.26 (0.07–0.45)	0.043	0.97	0.97	0.97	0.00 (−0.01–0.02)	0.663	0.92	0.67	0.93	0.26 (0.13–0.39)	0.043
Q2 < V ≤ Q3	1.00	0.84	1.00	0.16 (0.02–0.29)	0.083	1.00	0.99	0.99	0.00 (0.00–0.01)	0.097	1.00	0.83	0.99	0.16 (0.07–0.25)	0.083
> Q3	1.00	0.87	0.99	0.12 (0.03–0.22)	0.057	0.98	0.99	0.99	0.00 (0.00–0.00)	0.232	0.99	0.86	0.99	0.12 (0.06–0.19)	0.057
Center
Center 1	0.92	0.69	0.93	0.25 (0.15–0.34)	0.003	1.00	0.90	0.95	0.05 (−0.01–0.11)	0.086	0.92	0.59	0.89	0.30 (0.22–0.39)	0.004
Center 2	0.93	0.74	0.95	0.21 (0.06–0.37)	0.040	0.88	0.91	0.87	−0.04 (−0.09–0.01)	0.078	0.81	0.65	0.82	0.17 (0.09–0.25)	0.040
Center 3	0.95	0.78	0.96	0.18 (0.04–0.32)	0.026	0.87	0.64	0.74	0.10 (0.03–0.18)	0.026	0.83	0.42	0.71	0.28 (0.20–0.37)	0.007

^†, the differences were calculated with precise values before rounding. ^‡, Q1, Q2, and Q3 represent the 25th, 50th, and 75th percentiles of clot volume, respectively. For the Youden index difference, the 95% CI was estimated using the bootstrap method with 10,000 resamples. Statistical significance was assessed via Wilcoxon signed-rank tests. Due to multiple comparisons (sensitivity, specificity, and Youden index), P values were adjusted via the Holm method to control for the family-wise error rate. Center 1, Shanghai Sixth People’s Hospital; center 2, Shanghai Shuguang Hospital; center 3, The First Affiliated Hospital of Anhui Medical University. AI, artificial intelligence; CI, confidence interval.
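The Holm step-down adjustment mentioned in this footnote can be sketched in a few lines of Python. The function name and the raw P values below are hypothetical, since unadjusted P values are not reported in the text.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted P values in the original order.

    The i-th smallest raw P value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be no smaller than the preceding adjusted value.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw P values for sensitivity, specificity, and Youden index.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)  # approximately [0.03, 0.06, 0.06]
```

The monotonicity step explains why two metrics in the same row of a table can share an identical adjusted P value.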

Subgroup analysis (Table 2, Figure 3) showed that AI assistance improved diagnostic performance for both junior and senior residents, reducing the impact of the experience gap. For junior residents, there was an increase in sensitivity from 0.68 to 0.95 (P=0.005) and in specificity from 0.81 to 0.86 (P=0.083). For senior residents, there was an increase in sensitivity from 0.91 to 0.97 (P=0.354) and in specificity from 0.79 to 0.85 (P=0.354).

AI assistance significantly improved reader performance for both central clots (Youden index: 0.83 to 0.97; P=0.027) and segmental/subsegmental clots (Youden index: 0.53 to 0.81; P=0.017), with greater improvement observed for segmental/subsegmental clots (difference +0.29) than for central clots (difference +0.14). However, the improvement in senior readers’ performance for segmental/subsegmental clots was not statistically significant (P=0.195) (Figure 4).

Figure 4 Impact of reader experience on diagnostic accuracy (Youden index) for (A) central and (B) segmental/subsegmental clots with and without AI support. AI, artificial intelligence.

When analyzed by clot volume, AI not only improved detection across all volume subgroups but also reduced performance disparities among them. The improvement was most notable in the smallest quartile (Q1) of clot volume (Youden index difference: +0.31; 95% CI: 0.20–0.41; P=0.009). Similarly, AI improved the diagnostic performance across all centers, with statistically significant improvements among junior readers (P<0.05) (Figure 5).

Figure 5 Impact of reader experience on diagnostic accuracy (Youden index) for different centers with and without AI support. Center 1, Shanghai Sixth People’s Hospital; center 2, Shanghai Shuguang Hospital; center 3, The First Affiliated Hospital of Anhui Medical University. AI, artificial intelligence.

Interpretation time

The AI model significantly reduced interpretation time (Table 3, Figure 6). The mean reading time per case decreased by 10 seconds with AI support (P<0.001). Junior residents exhibited a greater reduction (12.6 sec, 22%) than did senior residents (4.6 sec, 9%). The mean interpretation time for PE cases was longer than that for non-PE cases. With AI assistance, the mean reading time decreased by 10% for PE cases and by 24% for non-PE cases (Table 3). In the analysis of reader experience, both junior and senior readers showed limited reduction in interpretation time for PE cases with AI assistance, whereas a statistically significant reduction was observed for non-PE cases (Figures S2,S3). The individual reader interpretation times are provided in Table S4.

Table 3

Mean interpretation time per screening examination for each radiologist reading with and without AI support

Variable	Without AI (sec)	With AI (sec)	Difference^†	P value
Mean interpretation time (sec)	56.0	46.0	10.0 (18%)	<0.001
Experience level
<5 years	57.7	45.1	12.6 (22%)	<0.001
≥5 years	52.4	47.8	4.6 (9%)	<0.001
PE
Positive	88.2	79.0	9.2 (10%)	<0.001
Negative	43.4	33.1	10.3 (24%)	<0.001

^†, absolute reduction in mean interpretation time (seconds) and the corresponding proportion of reduction (%) with AI support. We performed 20,000 bootstrap resamples with replacement of both examinations and readers. For each resample, the mean difference was calculated. A two-tailed P value was obtained by doubling the smaller proportion of resampled mean differences above or below zero. AI, artificial intelligence; PE, pulmonary embolism.
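The resampling procedure described in this footnote can be sketched in Python as follows. This is a minimal illustration using only the standard library; the data layout (one row of per-examination times per reader), the function name, the seed, and the toy timings are assumptions, not the study's code or data.

```python
import random


def bootstrap_time_difference(without_ai, with_ai, n_resamples=20000, seed=0):
    """Two-way bootstrap of the mean reading-time reduction.

    without_ai[r][e] and with_ai[r][e] hold the time (sec) of reader r on
    examination e. Readers and examinations are both resampled with
    replacement. Returns (observed mean difference, two-tailed P value).
    """
    rng = random.Random(seed)
    n_readers, n_exams = len(without_ai), len(without_ai[0])
    diffs = [[without_ai[r][e] - with_ai[r][e] for e in range(n_exams)]
             for r in range(n_readers)]
    observed = sum(map(sum, diffs)) / (n_readers * n_exams)
    above = below = 0
    for _ in range(n_resamples):
        readers = [rng.randrange(n_readers) for _ in range(n_readers)]
        exams = [rng.randrange(n_exams) for _ in range(n_exams)]
        mean_diff = sum(diffs[r][e] for r in readers for e in exams) / (
            n_readers * n_exams)
        if mean_diff > 0:
            above += 1
        elif mean_diff < 0:
            below += 1
    # Double the smaller tail proportion, capped at 1.
    p_value = min(1.0, 2 * min(above, below) / n_resamples)
    return observed, p_value


# Toy timings (hypothetical): 2 readers x 3 examinations.
without_ai = [[60, 50, 70], [55, 45, 65]]
with_ai = [[50, 45, 60], [50, 40, 60]]
observed, p_value = bootstrap_time_difference(without_ai, with_ai, n_resamples=2000)
```

When no resampled mean difference falls on the opposite side of zero, the computed value is 0 and is conventionally reported as below 1/n_resamples, which is consistent with P<0.001 for 20,000 resamples.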

Figure 6 Comparison of the interpretation times of readers with and without AI support. (A,D) Total cases. (B,E) PE-positive cases. (C,F) PE-negative cases. AI assistance significantly reduced the overall interpretation time, with a more pronounced reduction observed in PE-negative cases than in PE-positive cases. Statistical significance in violin plots was assessed via the nonparametric test for overall group comparisons. ns, P≥0.05; ****, P<0.0001. AI, artificial intelligence; PE, pulmonary embolism.

Evaluation of AI model dependency

In the recall of positive clot lesions (Table 4), AI assistance was associated with a marked reduction in the odds of a missed diagnosis (OR 0.11; 95% CI: 0.08–0.15; P<0.001). The miss rate decreased from 31.7% to 4.8% for junior residents and from 8.8% to 3.3% for senior residents. Greater reader experience was independently associated with significantly lower odds of a missed diagnosis (OR 0.21; 95% CI: 0.15–0.29; P<0.001). An interaction model indicated that the effect of AI assistance on the miss rate was significantly modified by reader experience (OR 3.20; 95% CI: 1.64–6.26; P=0.001). Junior residents demonstrated a higher rate of changed predictions due to AI results than did senior residents (30.6% vs. 7.6%). The average rate of incorrectly induced predictions was low (1.6%), with junior residents being slightly more likely to be incorrectly influenced by AI than senior residents (Table S5).

Table 4

Interaction effect between reader experience and AI assistance

Model^†	OR	95% CI	P value
False-negative risk in clot cases
AI assistance	0.11	0.08, 0.15	<0.001
Experience ≥5 years	0.21	0.15, 0.29	<0.001
AI assistance × experience ≥5 years	3.20	1.64, 6.26	0.001
False-positive risk in non-PE patients
AI assistance	1.46	0.85, 2.52	0.17
Experience ≥5 years	0.43	0.16, 1.13	0.09
AI assistance × experience ≥5 years	0.97	0.27, 3.48	0.96

^†, Binomial generalized linear models were used to assess the effects of AI assistance, reader experience, and their interaction on diagnostic error rates. AI, artificial intelligence; CI, confidence interval; OR, odds ratio; PE, pulmonary embolism.
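As a reading aid for the false-negative rows of Table 4, the following Python sketch recovers approximate odds ratios from the rounded clot-level miss rates reported in the text (31.7% to 4.8% for junior residents and 8.8% to 3.3% for senior residents). It assumes a logistic model with main effects plus their interaction and with junior residents reading without AI as the reference category, so that each coefficient maps to a ratio of cell odds. Because the inputs are rounded percentages, the results only approximate the reported values.

```python
def odds(rate):
    """Convert a proportion to odds."""
    return rate / (1 - rate)


# Clot-level miss rates reported in the text (rounded percentages).
miss = {
    ("junior", "no_ai"): 0.317, ("junior", "ai"): 0.048,
    ("senior", "no_ai"): 0.088, ("senior", "ai"): 0.033,
}

# Main effect of AI: odds ratio within the reference group (junior residents).
or_ai = odds(miss[("junior", "ai")]) / odds(miss[("junior", "no_ai")])

# Main effect of experience: senior vs. junior without AI.
or_experience = odds(miss[("senior", "no_ai")]) / odds(miss[("junior", "no_ai")])

# Effect of AI among senior residents, and the interaction as a ratio of ORs.
or_ai_senior = odds(miss[("senior", "ai")]) / odds(miss[("senior", "no_ai")])
or_interaction = or_ai_senior / or_ai

print(round(or_ai, 2), round(or_experience, 2),
      round(or_ai_senior, 2), round(or_interaction, 2))
```

Under these assumptions, the interaction OR above 1 means that the AI-related reduction in missed clots was smaller among senior residents (implied OR of about 0.11 × 3.20 ≈ 0.35) than among junior residents (OR 0.11), not that AI increased misses for senior residents.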

In the screening of negative cases (Table 4), AI assistance increased the risk of false positives, but not significantly so (OR 1.46; 95% CI: 0.85–2.52; P=0.17). The false-positive rate increased from 4.1% to 5.9% for junior residents and from 1.8% to 2.5% for senior residents. Greater experience reduced the false-positive rate, but not significantly so (OR 0.43; 95% CI: 0.16–1.13; P=0.09). The interaction model revealed no significant difference in the AI’s effect on false-positive rates between the experience groups (OR 0.97; 95% CI: 0.27–3.48; P=0.96). Junior residents altered their decisions based on AI output more often than did senior residents (5.7% vs. 1.4%). The average rate of incorrectly induced negative cases was also low (2.8%), with a slightly higher rate of incorrect guidance observed among junior residents (Table S6). Representative examples of changed predictions from AI support are presented in Figure 7.

Figure 7 Representative examples of changed predictions from AI support. (A) Clot in the right lower lobe subsegmental artery was correctly identified by AI, for which five readers were correctly guided by AI. (B) Clot in the right lower lobe subsegmental artery that was missed by AI, for which one junior reader was misled by AI. (C) Hilar lymph node was misidentified by AI, but none of the six readers were misled. (D) Beam hardening artifact from the superior vena cava that was misidentified by AI, for which one junior reader was misled by AI. Red marks indicate AI segmentation labels. Yellow arrows indicate thrombus locations. AI, artificial intelligence.

Discussion

In this multireader study of 196 consecutive CTPA examinations for acute PE, AI assistance improved the screening and clot-detection performance of first-on-call residents and reduced the mean interpretation time by 10 seconds per CTPA examination. Dependency analysis revealed a substantial rate of changed predictions influenced by AI results (22.9%) at the clot level and a low rate of incorrectly induced predictions (1.6%). These changes suggest the potential for AI assistance to enable first-on-call residents to identify and screen for thrombi more efficiently, without substantially increasing the risk of diagnostic errors due to AI reliance.

We employed uAI-PulmonaryEmbolism software, which was previously trained, tested, and validated on multicenter datasets. In contrast to previous commercial AI tools (4,5,9-14,17,18), its algorithm innovatively performs an initial segmentation of the entire pulmonary vascular tree prior to the precise segmentation of clots, which significantly reduces false-positive clot identifications. In an independent sample comprising 30 positive and 40 negative anonymized PE cases, the software demonstrated a sensitivity of 76.67% and a specificity of 95.00% for patient-level triage, with a clot-level detection sensitivity of 84.43% (26). Although the specificity aligns closely with the results encountered in our study, the higher sensitivity we observed may be attributable to the use of a newer software version, larger sample size, and lower disease prevalence in our cohort. Another study reported that this software increased radiologists’ sensitivity from 79.8% to 91.7% even in ultra-low-dose CTPA images (27), confirming the reliability of its diagnostic assistance.

Our results indicate that AI support can improve the triage performance of first-on-call radiologists, primarily reflected by sensitivity and NPV, reaching levels comparable to those of stand-alone AI. These improvements are clinically relevant, as they may aid in reducing the false negatives and diagnostic failures that may adversely affect patient outcomes. In AI-assisted triage, there was a tendency for a slight increase in the false-positive recalls, particularly among junior residents. However, the falsely recalled cases predominantly involved patients with very low clot burdens located in the segmental or subsegmental arteries. In the absence of intermediate- or high-risk features such as hemodynamic instability, such cases are usually not managed aggressively (2,28). This suggests that the potential clinical impact of these false positives is likely limited, although they may still prompt additional review or workup.

We found that AI support improved clot detection performance for all readers and could effectively reduce the performance gap between junior and senior residents. This effect was consistent across clot levels and datasets from multiple centers, which is clinically important given the known variability in reporting quality among residents of different experience levels (29,30). Subgroup analysis demonstrated that AI assistance enabled extremely high accuracy in the detection of central emboli and large-volume clots. In contrast, identifying segmental/subsegmental emboli and small-volume clots remains challenging, with both readers and stand-alone AI showing reduced performance for these clots as compared to larger, central clots. Central emboli and large-volume clots are more likely to cause hemodynamic compromise, account for the majority of PE-related deaths, and are independent predictors of adverse clinical outcomes (31-33). Consequently, attaining high sensitivity in their detection has considerable clinical significance. In the multicenter comparison, the AI model maintained high sensitivity across datasets with varying disease prevalences. It elevated residents’ performance to a level comparable to that of the AI alone, indicating that the robust generalizability of the model was unaffected by the incidence rate of different institutions. However, the mean specificity of readers at center 3 was significantly lower than that at the other centers; even with AI assistance, it did not approach the performance of stand-alone AI. In this study, false positives mostly occurred with the segmental/subsegmental pulmonary arteries, which are easily misclassified as PE due to suboptimal contrast opacification and artifacts (34). The thinner slice thickness at center 3 allowed visualization of more distal branches, leading to a higher risk of misinterpretation, which may partially explain the lower specificity there.

We further found that AI assistance reduced reading time. The reduction was more pronounced for negative cases than for positive cases, a finding consistent with other studies demonstrating that AI can substantially reduce reading times for negative or low-risk examinations (25,35). In contrast, positive or high-risk cases still required sufficient time for a comprehensive review. Given that clinical practice typically involves a higher volume of negative cases, AI decision support has the potential to reduce the overall reading time and improve workflow efficiency. However, Rothenberg et al. (17) reported no significant reduction in interpretation time with AI support for PE in clinical practice. Potential explanations for this discrepancy include fundamental differences in the AI systems evaluated. The model used by Rothenberg et al. was primarily designed for triage and lacked precise clot segmentation capabilities and synchronous display with source images. In contrast, our software provided precise segmentation labels and was integrated into a superior user interface. Furthermore, our study specifically evaluated the time spent on clot identification itself. These methodological differences likely account for the divergent findings.
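The argument that time savings on negative cases dominate the overall workflow gain can be expressed as a prevalence-weighted average. The sketch below is a minimal illustration under stated assumptions: the function name `mean_time_saved` and the per-case savings (in seconds) are hypothetical and are not results of this study.

```python
def mean_time_saved(prevalence: float, saved_pos: float, saved_neg: float) -> float:
    """Expected time saved per examination, weighted by PE prevalence."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    return prevalence * saved_pos + (1.0 - prevalence) * saved_neg


# Hypothetical per-case savings in seconds; NOT results of this study.
SAVED_POS = 10.0  # assumed saving on a PE-positive examination
SAVED_NEG = 50.0  # assumed saving on a PE-negative examination

low_prev = mean_time_saved(0.10, SAVED_POS, SAVED_NEG)
high_prev = mean_time_saved(0.50, SAVED_POS, SAVED_NEG)

print(f"Mean saving at 10% prevalence: {low_prev:.1f} s per examination")
print(f"Mean saving at 50% prevalence: {high_prev:.1f} s per examination")
```

Under these assumed values, the mean saving per examination is larger when PE prevalence is lower, because the larger per-case saving on negative examinations carries more weight.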

The primary goal of AI software implementation in the clinical management of PE is to enhance radiologists’ ability to detect abnormalities. However, its use can also lead to reliance on AI, which manifests as either correct or incorrect guidance (18,25). This necessitates secondary assessment of the AI output by radiologists, a process influenced by readers’ experience. An ideal AI model should maximize correct guidance while minimizing incorrect guidance. Our findings indicate that the impact of AI assistance on radiologists’ performance in diagnosing PE was heterogeneous, varying according to experience level and task type. For clot detection, AI significantly reduced the risk of missed diagnosis. The benefit was more pronounced for junior residents, suggesting that AI can effectively mitigate diagnostic blind spots resulting from limited experience. Notably, although the baseline miss rate of senior residents was already low, AI assistance further reduced it, demonstrating added value across all experience levels. Interaction analysis confirmed that reader experience significantly modified the effect of AI assistance, with junior residents being more likely to adopt AI suggestions. However, for the screening of negative cases, AI assistance provided no significant benefit and was associated with a higher false-positive rate. This phenomenon may stem from the AI model’s oversensitivity to subtle or atypical imaging findings. Interestingly, the overall rate of incorrect guidance induced by AI was low (1.6% for positive tasks and 2.8% for negative tasks). Furthermore, senior residents were less influenced by incorrect AI suggestions, reflecting their stronger independent judgment and more robust secondary assessment of AI results.

Our study has several limitations. First, the number of readers was relatively small, and all readers were recruited from a single center. Future confirmation with a larger, multicenter cohort of readers is needed. Second, although the AI-assisted reading session was conducted after an 8-week washout period, learning and implicit memory effects could not be completely avoided and might have partially confounded the observed improvements. All residents continued their routine clinical reading duties between the two sessions, which could have diluted their memory of the study CTPA cases. Future studies could consider a randomized crossover design at the reader and case levels to minimize recall bias. Third, we did not formally assess the influence of contrast opacification on the results. Although cases with suboptimal overall contrast opacification were excluded, regional attenuation in segmental and subsegmental arteries could still pose challenges, particularly for less-experienced readers. However, in Cheikh et al.’s study (12), AI maintained an advantage over radiologists even in cases with suboptimal contrast enhancement. Fourth, this study evaluated only a single, commercially available AI model already in use at our institution, and comparison with other commercially available PE AI models was not performed. As junior residents demonstrated a reliance on AI output, other models could potentially influence radiologists’ performance differently. Finally, this study was conducted in a controlled setting and might not have accounted for the factors relevant to real-world emergency on-call practice (e.g., access to clinical information, interruptions, competing tasks, and time pressure). Residents’ performance under routine clinical conditions remains to be confirmed in future prospective studies.

Conclusions

The AI model examined in our study improved the overall ability of first-on-call radiology residents to detect and screen for PE on CTPA while also reducing interpretation time and maintaining a high rate of correct guidance. These improvements have the potential to enhance both the safety and efficiency of PE diagnosis; however, these findings should be confirmed in future prospective studies conducted in real-world emergency settings.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2557/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2557/dss

Funding: This study was supported by Key R&D subproject of the Ministry of Science and Technology (No. 2023YFF1204804, to Yuehua Li), National Natural Science Foundation of China (No. 8225024, to Yuehua Li), and Shanghai Municipal Commission of Science and Technology Explorer Program (No. 23TS1400400, to Yuehua Li).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2557/coif). Y.C. is an employee of Shanghai United Imaging Intelligence Co., Ltd. Yuehua Li reports funding from the Key R&D subproject of the Ministry of Science and Technology (No. 2023YFF1204804), National Natural Science Foundation of China (No. 8225024), and Shanghai Municipal Commission of Science and Technology Explorer Program (No. 23TS1400400). The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Research Ethics Committee of Shanghai Sixth People’s Hospital (No. 2023-KY-124(K)). Owing to the retrospective design of the study, the overseeing Research Ethics Committee waived the requirement for informed consent. All participating hospitals were informed of and agreed to the study. All data were anonymized prior to analysis.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Freund Y, Cohen-Aubart F, Bloom B. Acute Pulmonary Embolism: A Review. JAMA 2022;328:1336-45.
2. Konstantinides SV, Meyer G, Becattini C, Bueno H, Geersing GJ, Harjola VP, et al. 2019 ESC Guidelines for the diagnosis and management of acute pulmonary embolism developed in collaboration with the European Respiratory Society (ERS). Eur Heart J 2020;41:543-603.
3. Di Nisio M, van Es N, Büller HR. Deep vein thrombosis and pulmonary embolism. Lancet 2016;388:3060-73.
4. Weikert T, Winkel DJ, Bremerich J, Stieltjes B, Parmar V, Sauter AW, Sommer G. Automated detection of pulmonary embolism in CT pulmonary angiograms using an AI-powered algorithm. Eur Radiol 2020;30:6545-53.
5. Huhtanen H, Nyman M, Mohsen T, Virkki A, Karlsson A, Hirvonen J. Automated detection of pulmonary embolism from CT-angiograms using deep learning. BMC Med Imaging 2022;22:43.
6. Szymanski KA, Hoang AT, Van Tassel D, Kang P, Pfeifer CM. On-Call Radiology Resident Preliminary Report Major Discrepancies: A Meta-analysis. Acad Radiol 2025;32:2342-56.
7. Patel AG, Pizzitola VJ, Johnson CD, Zhang N, Patel MD. Radiologists Make More Errors Interpreting Off-Hours Body CT Studies during Overnight Assignments as Compared with Daytime Assignments. Radiology 2020;297:374-9.
8. Ruutiainen AT, Durand DJ, Scanlon MH, Itri JN. Increased error rates in preliminary reports issued by radiology residents working more than 10 consecutive hours overnight. Acad Radiol 2013;20:305-11.
9. Ma X, Ferguson EC, Jiang X, Savitz SI, Shams S. A multitask deep learning approach for pulmonary embolism detection and identification. Sci Rep 2022;12:13087.
10. Liu W, Liu M, Guo X, Zhang P, Zhang L, Zhang R, Kang H, Zhai Z, Tao X, Wan J, Xie S. Evaluation of acute pulmonary embolism and clot burden on CTPA with deep learning. Eur Radiol 2020;30:3567-75.
11. Huang SC, Kothari T, Banerjee I, Chute C, Ball RL, Borus N, Huang A, Patel BN, Rajpurkar P, Irvin J, Dunnmon J, Bledsoe J, Shpanskaya K, Dhaliwal A, Zamanian R, Ng AY, Lungren MP. PENet-a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. NPJ Digit Med 2020;3:61.
12. Cheikh AB, Gorincour G, Nivet H, May J, Seux M, Calame P, Thomson V, Delabrousse E, Crombé A. How artificial intelligence improves radiological interpretation in suspected pulmonary embolism. Eur Radiol 2022;32:5831-42.
13. Topff L, Ranschaert ER, Bartels-Rutten A, Negoita A, Menezes R, Beets-Tan RGH, Visser JJ. Artificial Intelligence Tool for Detection and Worklist Prioritization Reduces Time to Diagnosis of Incidental Pulmonary Embolism at CT. Radiol Cardiothorac Imaging 2023;5:e220163.
14. Djahnine A, Lazarus C, Lederlin M, Mulé S, Wiemker R, Si-Mohamed S, et al. Detection and severity quantification of pulmonary embolism with 3D CT data using an automated deep learning-based artificial solution. Diagn Interv Imaging 2024;105:97-103.
15. Abdelhamid A, El-Ghamry A, Abdelhay EH, Abo-Zahhad MM, Moustafa HE. Improved pulmonary embolism detection in CT pulmonary angiogram scans with hybrid vision transformers and deep learning techniques. Sci Rep 2025;15:31443.
16. Song J, Chen A, Yu H, Song L. Improvement of artificial intelligence-based computed tomography pulmonary angiography in identifying acute pulmonary embolism. Quant Imaging Med Surg 2025;15:9729-37.
17. Rothenberg SA, Savage CH, Abou Elkassem A, Singh S, Abozeed M, Hamki O, Junck K, Tridandapani S, Li M, Li Y, Smith AD. Prospective Evaluation of AI Triage of Pulmonary Emboli on CT Pulmonary Angiograms. Radiology 2023;309:e230702.
18. Vallée A, Quint R, Laure Brun A, Mellot F, Grenier PA. A deep learning-based algorithm improves radiology residents' diagnoses of acute pulmonary embolism on CT pulmonary angiograms. Eur J Radiol 2024;171:111324.
19. Fink MA, Mayer VL, Schneider T, Seibold C, Stiefelhagen R, Kleesiek J, Weber TF, Kauczor HU. CT Angiography Clot Burden Score from Data Mining of Structured Reports for Pulmonary Embolism. Radiology 2022;302:175-84.
20. Qanadli SD, El Hajjam M, Vieillard-Baron A, Joseph T, Mesurolle B, Oliva VL, Barré O, Bruckert F, Dubourg O, Lacombe P. New CT index to quantify arterial obstruction in pulmonary embolism: comparison with angiographic index and echocardiography. AJR Am J Roentgenol 2001;176:1415-20.
21. Sen HS, Abakay Ö, Cetincakmak MG, Sezgi C, Yilmaz S, Demir M, Taylan M, Gümüs H. A single imaging modality in the diagnosis, severity, and prognosis of pulmonary embolism. Biomed Res Int 2014;2014:470295.
22. Bazeed MF, Saad A, Sultan A, Ghanem MA, Khalil DM. Prediction of pulmonary embolism outcome and severity by computed tomography. Acta Radiol 2010;51:271-6.
23. van der Meer RW, Pattynama PM, van Strijen MJ, van den Berg-Huijsmans AA, Hartmann IJ, Putter H, de Roos A, Huisman MV. Right ventricular dysfunction and pulmonary obstruction index at helical CT: prediction of clinical outcome during 3-month follow-up in patients with acute pulmonary embolism. Radiology 2005;235:798-803.
24. Colak E, Kitamura FC, Hobbs SB, Wu CC, Lungren MP, Prevedello LM, et al. The RSNA Pulmonary Embolism CT Dataset. Radiol Artif Intell 2021;3:e200254.
25. Gommers JJJ, Verboom SD, Duvivier KM, van Rooden CJ, van Raamt AF, Houwers JB, Naafs DB, Duijm LEM, Eckstein MP, Abbey CK, Broeders MJM, Sechopoulos I. Influence of AI Decision Support on Radiologists' Performance and Visual Search in Screening Mammography. Radiology 2025;316:e243688.
26. Qiao Y, Gao Y, Chen Y, Ye X, Yan C, Zeng M. Quantitative assessment and risk stratification of random acute pulmonary embolism cases using a deep learning model based on computed tomography pulmonary angiography images. Quant Imaging Med Surg 2025;15:1950-62.
27. Lu J, Shen L, Zhou C, Bi Z, Ye X, Zhao Z, Zeng M, Wang M. Image Quality Improvement and Artificial Intelligence Performance in Pulmonary Embolism Detection at Deep Learning Reconstruction-Based Ultra-low Radiation Dose CT Pulmonary Angiography. Acad Radiol 2025;32:7562-72.
28. Peiman S, Abbasi M, Allameh SF, Asadi Gharabaghi M, Abtahi H, Safavi E. Subsegmental pulmonary embolism: A narrative review. Thromb Res 2016;138:55-60.
29. Tamjeedi B, Correa J, Semionov A, Mesurolle B. Interobserver Agreement between On-Call Radiology Resident and General Radiologist Interpretations of CT Pulmonary Angiograms and CT Venograms. PLoS One 2015;10:e0126116.
30. Joshi R, Wu K, Kaicker J, Choudur H. Reliability of on-call radiology residents' interpretation of 64-slice CT pulmonary angiography for the detection of pulmonary embolism. Acta Radiol 2014;55:682-90.
31. Cantu-Martinez O, Martinez Manzano JM, Tito S, Prendergast A, Jarrett SA, Chiang B, Wattoo A, Azmaiparashvili Z, Lo KB, Benzaquen S, Eiger G. Clinical features and risk factors of adverse clinical outcomes in central pulmonary embolism using machine learning analysis. Respir Med 2023;215:107295.
32. Vedovati MC, Becattini C, Agnelli G, Kamphuisen PW, Masotti L, Pruszczyk P, Casazza F, Salvi A, Grifoni S, Carugati A, Konstantinides S, Schreuder M, Golebiowski M, Duranti M. Multidetector CT scan for acute pulmonary embolism: embolic burden and clinical outcome. Chest 2012;142:1417-24.
33. Hariharan P, Dudzinski DM, Rosovsky R, Haddad F, MacMahon P, Parry B, Chang Y, Kabrhel C. Relation Among Clot Burden, Right-Sided Heart Strain, and Adverse Events After Acute Pulmonary Embolism. Am J Cardiol 2016;118:1568-73.
34. Hutchinson BD, Navin P, Marom EM, Truong MT, Bruzzi JF. Overdiagnosis of Pulmonary Embolism by Pulmonary CT Angiography. AJR Am J Roentgenol 2015;205:271-7.
35. Shin HJ, Han K, Ryu L, Kim EK. The impact of artificial intelligence on the reading times of radiologists for chest radiographs. NPJ Digit Med 2023;6:82.

Cite this article as: Song X, Wang S, Wang J, Shu H, Zhang R, Yan B, Qu Y, Li Y, Guo L, Chen Y, Wang D, Li Y. Commercial artificial intelligence–assisted performance and interpretation time of first-on-call radiology residents using computed tomography pulmonary angiography to detect pulmonary embolism: a multireader, multicenter study. Quant Imaging Med Surg 2026;16(5):399. doi: 10.21037/qims-2025-1-2557
