Original Article


Value of the DeepSeek-R1 large language model in extracting structured data from magnetic resonance imaging reports of rectal cancer and assisting in tumor staging

Fan Xie, Li-Zhu Ouyang, Bao-Liang Guo, Xi-Yi Huang, Zi-Wei Liu, Lan-Ni Zhou, Jia-Ling Pan, Li-Wen Wang, Ming Chen, Yun-Jing Li, Qiong-Qi Lin, Xin-Jie Chen, Qiu-Gen Hu, Fu-Sheng Ouyang

Abstract

Background: Structured magnetic resonance imaging (MRI) reports improve rectal cancer diagnosis and treatment management, whereas free-text reports better describe complex MRI features, leading to divergent preferences and inconsistent acceptance among radiologists. This study aimed to evaluate the potential of DeepSeek-R1 to assist in the analysis of MRI reports for rectal cancer.

Methods: This retrospective study analyzed 465 MRI reports of rectal cancer. Sixty reports were used to facilitate structured information extraction, refine reporting cues, and complete preliminary screening of large language models (LLMs). LLMs from the same developers were compared [i.e., GPT-4 vs. GPT-4o, Wenxinyiyan-4 vs. Wenxinyiyan-4Turbo, and DeepSeek-R1-7B vs. DeepSeek-R1-32B vs. DeepSeek-R1-671B (DSR1-671B)], and the accuracy and average processing time of the LLMs were evaluated with paired-samples t-tests, one-way repeated-measures analysis of variance, McNemar tests, and Cochran Q tests. The top-performing model was selected for tumor-node (TN) staging determination. Both a five-point Likert scale and accuracy were employed as performance metrics for evaluating two prompting strategies (default knowledge and in-context knowledge) in TN staging determination, and any confabulations were documented. To assess reproducibility, the LLM analysis was independently repeated three times for all reports. Radiologists’ consensus interpretation of the report text served as the reference standard.
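As an illustration of the paired accuracy comparison named above, the following is a minimal sketch of an exact two-sided McNemar test for two models scored on the same reports. It is not the authors' analysis code: the function name `mcnemar_exact` and the toy correctness vectors are hypothetical, and the abstract does not state whether an exact or chi-square version of the test was used.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    a_correct / b_correct: equal-length sequences of 0/1 flags, one per
    report, marking whether model A / model B matched the reference.
    Returns (n10, n01, p_value), where n10 counts reports that only A got
    right and n01 counts reports that only B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same reports")

    # Only discordant pairs carry information about a difference in accuracy.
    n10 = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    n01 = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = n10 + n01
    if n == 0:
        return n10, n01, 1.0

    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    k = min(n10, n01)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n10, n01, min(1.0, 2 * tail)


# Hypothetical toy data: 10 reports, 1 = output matched the reference.
model_a = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
model_b = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(mcnemar_exact(model_a, model_b))
```

With more than two models scored on the same reports, the Cochran Q test plays the analogous role.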

Results: Preliminary screening of LLMs from the same developers retained five models: DSR1-671B, Wenxinyiyan-4, Wenxinyiyan-4Turbo, GPT-4, and GPT-4o. All five LLMs processed reports significantly faster than did radiologists (time per report: 22.4–84.6 vs. 132.1 s; P<0.01). DSR1-671B outperformed all other LLMs in the extraction of structured information and the generation of reporting cues, with an accuracy of 98.9% (P<0.01), and was therefore selected for TN staging reasoning. The in-context knowledge strategy yielded a higher median score for TN staging than did the default knowledge strategy (5.0 vs. 4.0; P<0.01) and a higher accuracy in T substaging (86.5% vs. 72.3%; P<0.01) and N substaging (95.0% vs. 77.2%; P<0.01). The paired-samples Wilcoxon test identified significant differences between the two prompting strategies in the number of events and the confabulation rate (P<0.01), although the medians were equal. On average, the in-context knowledge strategy provided a more robust basis for TN staging than did the default knowledge strategy (mean number of events: 3.3 vs. 2.8) and produced a lower confabulation rate (mean: 2.9% vs. 4.1%). DSR1-671B showed good reproducibility across all tasks; the lowest agreement was observed for N staging determination (Fleiss’ kappa value =0.932).
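The reproducibility figure above is a Fleiss' kappa computed over repeated runs. A minimal sketch of that statistic follows, treating each of the three runs as a "rater" and each report as an item. The function name `fleiss_kappa` and the toy N-stage labels are hypothetical and are not taken from the study data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated the same number of times.

    ratings: list of lists; ratings[i] holds the category labels assigned
    to item i (here, one label per repeated LLM run).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_items
    total = n_items * n_raters
    # Chance agreement from the overall category proportions.
    p_expected = sum((t / total) ** 2 for t in category_totals.values())
    if p_expected == 1.0:
        return 1.0  # a single category was used throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: 4 reports, 3 runs each.
runs = [
    ["N0", "N0", "N0"],
    ["N1", "N1", "N1"],
    ["N2", "N2", "N2"],
    ["N1", "N1", "N2"],
]
print(round(fleiss_kappa(runs), 3))
```

A single disagreement among four toy reports already pulls kappa well below 1, which gives some sense of how consistent the runs must be to reach a value such as 0.932.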

Conclusions: In an English-language environment with human supervision, DSR1-671B can perform structured information extraction from MRI reports of rectal cancer, generate reporting cues, and enable TN staging based on the improved reports.
