Value of the DeepSeek-R1 large language model in extracting structured data from magnetic resonance imaging reports of rectal cancer and assisting in tumor staging

Fan Xie; Li-Zhu Ouyang; Bao-Liang Guo; Xi-Yi Huang; Zi-Wei Liu; Lan-Ni Zhou; Jia-Ling Pan; Li-Wen Wang; Ming Chen; Yun-Jing Li; Qiong-Qi Lin; Xin-Jie Chen; Qiu-Gen Hu; Fu-Sheng Ouyang

doi:10.21037/qims-2025-aw-2441

Original Article

Fan Xie¹#, Li-Zhu Ouyang²#, Bao-Liang Guo¹#, Xi-Yi Huang³#, Zi-Wei Liu¹#, Lan-Ni Zhou¹, Jia-Ling Pan¹, Li-Wen Wang¹, Ming Chen¹, Yun-Jing Li¹, Qiong-Qi Lin¹, Xin-Jie Chen¹, Qiu-Gen Hu¹*, Fu-Sheng Ouyang¹*

¹Department of Radiology, The Eighth Affiliated Hospital of Southern Medical University (The First People’s Hospital of Shunde), Foshan, China; ²Department of Ultrasound, The Eighth Affiliated Hospital of Southern Medical University (The First People’s Hospital of Shunde), Foshan, China; ³Department of Clinical Laboratory, Lecong Hospital of Shunde, Foshan, China

Contributions: (I) Conception and design: F Xie, LZ Ouyang, BL Guo, XY Huang, ZW Liu; (II) Administrative support: FS Ouyang, QG Hu; (III) Provision of study materials or patients: ZW Liu, XJ Chen; (IV) Collection and assembly of data: F Xie, BL Guo, XY Huang, LN Zhou, YJ Li, QQ Lin; (V) Data analysis and interpretation: F Xie, LZ Ouyang, JL Pan, LW Wang, M Chen; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

^#These authors contributed equally to this work as co-first authors.

^*These authors contributed equally to this work.

Correspondence to: Fu-Sheng Ouyang, MD, PhD; Qiu-Gen Hu, MD, PhD. Department of Radiology, The Eighth Affiliated Hospital of Southern Medical University (The First People’s Hospital of Shunde), No. 1 Jiazi Road, Lunjiao, Shunde District, Foshan 528308, China. Email: ouyangfusheng@21cn.com; qiugenhu@126.com.

Background: Structured magnetic resonance imaging (MRI) reports improve rectal cancer diagnosis and treatment management, whereas free-text reports better describe complex MRI features, leading to divergent preferences and inconsistent acceptance among radiologists. This study aimed to evaluate the potential of DeepSeek-R1 to assist in the analysis of MRI reports for rectal cancer.

Methods: This retrospective study analyzed 465 MRI reports of rectal cancer. Sixty reports were used to facilitate structured information extraction, refine reporting reminders, and complete preliminary screening of large language models (LLMs). LLMs from the same developers were compared [i.e., GPT-4 vs. GPT-4o, Wenxinyiyan-4 vs. Wenxinyiyan-4Turbo, and DeepSeek-R1-7B vs. DeepSeek-R1-32B vs. DeepSeek-R1-671B (DSR1-671B)], and the accuracy and average processing time of the LLMs were evaluated with paired-samples t-tests, one-way repeated-measures analysis of variance, McNemar tests, and Cochran Q tests. The top-performing model was selected for tumor-node (TN) staging determination. Both a five-point Likert scale and accuracy were employed as performance metrics for evaluating two prompting strategies (default knowledge and in-context knowledge) in TN staging determination, and any confabulations were documented. To assess reproducibility, the LLM analysis was independently repeated three times for all reports. Radiologists’ consensus interpretation of the report text served as the reference standard.

Results: Five LLMs (DSR1-671B, Wenxinyiyan-4, Wenxinyiyan-4Turbo, GPT-4, and GPT-4o) were retained after the preliminary screening of models from the same developers. All five LLMs processed reports significantly faster than did radiologists (time per report: 22.4–84.6 vs. 132.1 s; P<0.01). DSR1-671B outperformed all other LLMs in the extraction of structured information and the generation of reporting cues, with an accuracy of 98.9% (P<0.01). Thus, DSR1-671B was selected for TN staging reasoning. The in-context knowledge strategy yielded a higher median score for TN stage than did the default knowledge strategy (5.0 vs. 4.0; P<0.01) and also yielded a higher accuracy in T substaging (86.5% vs. 72.3%; P<0.01) and N substaging (95.0% vs. 77.2%; P<0.01). Significant differences in the number of events and confabulation rates between the two prompting strategies were identified by the paired-samples Wilcoxon test (P<0.01), although their medians were equal. By mean values, the in-context knowledge strategy based its TN staging on more documented events than did the default knowledge strategy (mean number of events: 3.3 vs. 2.8) and produced a lower mean confabulation rate (2.9% vs. 4.1%). DSR1-671B showed good reproducibility across all tasks, and the lowest agreement was observed for N staging determination (Fleiss’ kappa value =0.932).

Conclusions: In an English-language environment with human supervision, DSR1-671B can perform structured information extraction from MRI reports of rectal cancer, generate reporting cues, and enable TN staging based on the improved reports.

Keywords: Rectal cancer; magnetic resonance imaging reports (MRI reports); large language models (LLMs); structured information extraction

Submitted Nov 14, 2025. Accepted for publication Apr 27, 2026. Published online Jun 13, 2026.

Introduction

Colorectal cancer is among the most common malignant tumors worldwide, and its incidence rates are increasing in Asian countries (1). Both the “DISTANCE” mnemonic (2) and its updated version, the “DISTANCED” mnemonic (3), highlight the essential role of structured reporting in the management of rectal cancer. The compulsory inclusion of “DISTANCED” information in structured magnetic resonance imaging (MRI) reports of rectal cancer has improved the consistency and efficiency of clinical diagnosis (3). However, El Khababi et al. (4) reported that interobserver agreement is low among radiologists in evaluating extramural vascular invasion (EMVI) and multicategory tumor-node (TN) staging within structured frameworks, exceeding 80% in only 28–55% of cases. This inconsistency may explain why some radiologists prefer to use free-text reports to clearly describe these complex features. MRI evaluation of rectal cancer is particularly challenging because, in addition to TN staging, it requires the assessment of the anterior peritoneal reflection in upper rectal cancer and the anal sphincter complex in lower rectal cancer (5). These assessments are challenging for clinicians who lack sufficient expertise (6) and may be performed incorrectly by junior radiologists.

Large language models (LLMs) are artificial intelligence systems designed to process and generate human-like text and interact with users through textual input and output (7). Since the introduction of OpenAI’s ChatGPT in November 2022, various LLMs have been widely applied for a variety of healthcare tasks, such as answering medical exam questions, supporting diagnostic processes, educating patients, and suggesting treatment options (8). Despite LLMs being susceptible to hallucination (9), a review of literature on their use in healthcare (10) revealed that only 18.3% of studies incorporated factuality as an evaluation metric. DeepSeek-R1 is an LLM developed by DeepSeek and is recognized for its cost-effectiveness and open-source availability (11). According to a technical report from DeepSeek (12), the model employs a reinforcement learning framework, which enables advanced reasoning patterns to emerge, substantially enhancing its reasoning performance. As of March 9, 2025, more than 300 hospitals in China have deployed DeepSeek locally. A critical remaining question, however, is whether these institutions have comprehensively assessed the model’s clinical safety—including issues such as hallucinations—and its overall utility (11). Although recent publications suggest that DeepSeek-R1 performs comparably to GPT-4o and GPT-o1 in clinical decision-making and diagnostic reasoning (13,14), few studies have specifically focused on DeepSeek-R1.

Therefore, this study aimed to evaluate and compare seven LLMs—GPT-4 (15), GPT-4o (16), Wenxinyiyan-4 (WXYY-4) (17), Wenxinyiyan-4Turbo (WXYY-4T) (18), DeepSeek-R1-7B (DSR1-7B) (19), DeepSeek-R1-32B (DSR1-32B) (20), and DeepSeek-R1-671B (DSR1-671B) (21)—in their ability to extract structured information, mine data from MRI reports of rectal cancer, generate cues for report improvement, and inform TN staging. We present this article in accordance with the STARD-AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2441/rc).

Methods

Ethical approval

The retrospective study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments, and was approved by the Institutional Ethics Committee of The Eighth Affiliated Hospital of Southern Medical University (approval No. KYLS20231054). The requirement for written informed consent was waived due to the retrospective nature of the study, and all patient data were anonymized prior to analysis. A detailed description of patient data processing is available in Appendix 1.

Study design

Free-text rectal MRI reports from contrast-enhanced scans performed between June 2018 and February 2025 were consecutively retrieved in a retrospective manner from the Picture Archiving and Communication System of The Eighth Affiliated Hospital of Southern Medical University. The exclusion criteria were as follows: (I) MRI findings suggestive of a diagnosis other than rectal cancer; (II) absence of pathological confirmation or a pathological diagnosis other than rectal cancer; and (III) examinations performed for posttreatment surveillance. For patients with multiple reports, only the earliest one was included.

All reports were translated into English via DeepL.com and then reviewed and revised by a radiologist proficient in English, with a focus on ensuring the accuracy and consistency of clinical terminology (e.g., standardized phrases for invasion and lymph node descriptions). LLMs were accessed through their browser versions, with English as the input and output language and markdown as the output format. For every report, the LLMs were used under the same environmental conditions, including equipment, bandwidth, and access time (Appendix 2). The output generated by the LLMs for each report was copied into a CSV table for summarization. The details of the data capture pipeline are provided in Appendix 3.

This study aimed to evaluate the ability of LLMs to assist radiologists with the generation of rectal cancer MRI reports, so as to improve report quality, achieve systematic management of report data, and facilitate cross-departmental communication. The overall study design is shown in Figure 1.

Figure 1 Flowchart of the study design. The upper red dashed box delineates the extraction of structured information and generation of reporting cues module, whereas the lower red dashed box demarcates the TN staging diagnosis assistance module. DSR1-32B, DeepSeek-R1-32B; DSR1-671B, DeepSeek-R1-671B; DSR1-7B, DeepSeek-R1-7B; LLM, large language model; TN, tumor-node; WXYY-4, Wenxinyiyan-4; WXYY-4T, Wenxinyiyan-4Turbo.

Structured information extraction and generation of reporting cues

For the phase of the study involving structured information extraction and generation of reporting cues, imaging descriptions from radiological reports were used. To maximize the accuracy of structured information extraction and the generation of reporting cues, 60 reports were randomly selected from the cohort for prompt learning. A comparative analysis was conducted with the 60 reports to assess the performance of LLMs from the same developers (i.e., GPT-4 vs. GPT-4o, WXYY-4 vs. WXYY-4T, and DSR1-7B vs. DSR1-32B vs. DSR1-671B). Following this, the representative LLM from each developer and optimized prompts were selected for validation of the remaining reports (n=405).

None of the included reports contained assessments of tumor deposits. Thus, the structured information extraction framework implemented in the LLMs specifically targeted “DISTANCE”-related parameters. A function for generating reporting cues was designed for the LLMs to highlight omissions in anatomical structure assessment, specifically, failure to evaluate the anterior peritoneal reflection in upper rectal cancer or the anal sphincter complex in lower rectal cancer. Additionally, the processing time required by each LLM to analyze a single report was recorded. The prompts used before and after prompt learning are provided in Appendix 4.

Two board-certified radiologists who had 3 and 14 years of diagnostic experience, respectively, and who were unaware of the clinical data independently performed manual data extraction and generated report reminders by referring to the imaging description sections of the original Chinese reports. Any discrepancies were resolved via consensus discussion. The results were translated into English via DeepL.com and then reviewed and revised by a radiologist proficient in English. The performance of the LLMs was evaluated for accuracy, defined as the proportion of correct cases to total cases, with these consensus results serving as the reference standard. For multifield features, including lateral lymph nodes, the anal sphincter complex, and adjacent organs or structures, partially correct labels were established (supplementary material 1). Confusion matrix analysis was implemented to assess the key binary features of circumferential resection margin (CRM), mesorectal fascia (MRF), and EMVI within the validation dataset (supplementary material 2).
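The accuracy definition and the confusion matrix analysis described above can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical and are not taken from the study data or from the supplementary materials; the sketch only shows how a proportion of correct cases and the sensitivity and specificity of a binary feature (such as EMVI) might be computed against a consensus reference.

```python
from typing import Sequence


def accuracy(predicted: Sequence[object], reference: Sequence[object]) -> float:
    """Proportion of cases in which the model output equals the reference label."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def sensitivity_specificity(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> tuple[float, float]:
    """Sensitivity and specificity of a binary feature from a 2x2 confusion matrix."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)              # model positive, reference positive
    tn = sum((not p) and (not r) for p, r in pairs)  # model negative, reference negative
    fp = sum(p and (not r) for p, r in pairs)        # model positive, reference negative
    fn = sum((not p) and r for p, r in pairs)        # model negative, reference positive
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy labels for a binary feature; not study data.
reference_emvi = [True, True, False, False, True]
model_emvi = [True, False, False, False, True]

toy_accuracy = accuracy(model_emvi, reference_emvi)
toy_sensitivity, toy_specificity = sensitivity_specificity(model_emvi, reference_emvi)
```

In this toy case, one positive reference case is missed, so accuracy and sensitivity fall below 1 while specificity remains 1. The handling of partially correct labels for multifield features is defined in supplementary material 1 and is not modeled here.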

TN staging support

A radiologist systematically reviewed the imaging descriptions from the reports and excluded those with incomplete or ambiguously documented T-stage or N-stage classifications (supplementary material 3). Only reports with clearly documented tumor infiltration depth (T stage) and number of suspicious lymph nodes (N stage) were ultimately included.

The top-performing LLM from the previous module was selected for this phase. Two prompt strategies, default knowledge and in-context knowledge (22), were designed (Appendix 5). For each report, we input the prompts and imaging descriptions into the LLM, which was instructed to output both the TN staging and its reasoning basis in markdown format. Clinical data were not provided to the LLMs. The runtime for processing each report was recorded.

Two radiologists who were unaware of the diagnostic impression, model outputs, and clinical data independently assessed TN staging based on the imaging descriptions of the included reports in accordance with the eighth edition of the colorectal cancer tumor-node-metastasis (TNM) staging criteria of the American Joint Committee on Cancer (AJCC). Any discrepancies were resolved through consensus discussion. With these consensus results serving as the reference standard, the performance of both prompt strategies in TN staging was evaluated according to a predefined five-point Likert scale and four endpoints: (I) T stage match, (II) T substage match, (III) N stage match, and (IV) N substage match (Appendix 6).
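The distinction between a stage match and a substage match can be sketched as follows. The actual scoring rules are those of Appendix 6, which is not reproduced here; this sketch assumes, purely for illustration, that each stage is written as a single label such as "T3b" or "N1a", and the function names are hypothetical.

```python
import re

# Assumed label format: category letter, main stage digit, optional substage letter.
_STAGE_PATTERN = re.compile(r"^(T|N)([0-4])([a-d]?)$")


def parse_stage(label: str) -> tuple[str, str, str]:
    """Split a label such as 'T3b' into ('T', '3', 'b')."""
    match = _STAGE_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized stage label: {label!r}")
    return match.group(1), match.group(2), match.group(3)


def stage_match(model: str, reference: str) -> bool:
    """True when the category and main stage agree (e.g., T3a vs. T3b)."""
    return parse_stage(model)[:2] == parse_stage(reference)[:2]


def substage_match(model: str, reference: str) -> bool:
    """True only when the full label, including the substage letter, agrees."""
    return parse_stage(model) == parse_stage(reference)
```

Under these assumptions, a model output of "T3a" against a reference of "T3b" counts as a T stage match but not as a T substage match, which is why substage accuracy cannot exceed stage accuracy for the same set of reports.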

Figure 2 provides an illustrative example of a patient’s MRI report, the corresponding LLM outputs under both prompt strategies, and the consensus evaluation by the two radiologists. Reports with TN staging scores below 5 points under the higher-performing prompt strategy were screened out, and their deficiencies were analyzed to optimize the prompt template. These low-scoring reports, along with the optimized templates, were then re-input into the LLM to verify improvement and potentially refine optimization strategies for these underperforming reports.

Figure 2 An MRI report of a 58-year-old patient with rectal cancer. (A) The TN staging output results of DSR1-671B and (B,C) the radiologists’ assessment. (B) The default knowledge prompt strategy. (C) The in-context knowledge prompt strategy. (B,C) The left table shows the output of DSR1-671B, and the right table shows the radiologists’ assessment. DSR1-671B, DeepSeek-R1-671B; DWI, diffusion-weighted imaging; EMVI, extramural vascular invasion; MRI, magnetic resonance imaging; T1WI, T1-weighted imaging; T2WI, T2-weighted imaging; TN, tumor-node.

According to the reference standard, a radiologist identified reports containing TN staging errors in the diagnostic impression item. These reports were used to evaluate the error correction performance of the LLMs. Following a predefined audit protocol (Appendix 7), two radiologists independently quantified documented events and confabulations based on the reasoning provided in the LLM-generated TN staging. Discrepancies between the two radiologists were resolved through consensus discussions.

Data mining of MRI reports for rectal cancer

Reports with undocumented or ambiguously described lesion localization were excluded. The included reports were categorized into six subgroups based on lesion location: upper rectum, middle rectum, lower rectum, middle-upper rectum, middle-lower rectum, and entire rectum. The optimal LLM paired with the optimal prompt strategy from the prior phases was selected to extract critical information. Sankey diagrams and chord diagrams were employed to visualize the intrinsic correlations among key information components, which were stratified by rectal lesion localization.

Statistical analysis

Statistical analyses were performed with SPSS version 26.0 (IBM Corp., Armonk, NY, USA) and R software version 4.2.1 (The R Foundation for Statistical Computing, Vienna, Austria). Sample size was determined based on available resources and feasibility, with reference to similar work (22). In the prompt-learning phase for the extraction of structured information and generation of reporting cues, processing time and accuracy were compared via paired-samples t-tests and McNemar tests, respectively. For the evaluation of representative LLMs from different developers, accuracy comparisons were analyzed via the Cochran Q test, while differences in processing time were assessed using one-way repeated-measures analysis of variance. Multiple comparison adjustments were performed with the Bonferroni correction method. In the TN-staging assistance module, the effectiveness of the two prompt strategies was compared via the Wilcoxon signed-rank test and McNemar test, while their processing time was compared via the paired-samples Wilcoxon test (supplementary material 4). To evaluate reproducibility, the LLM analysis was independently repeated three times for all reports from each of the two study modules. Given the mixed, nonstrict quantitative and categorical nature of the data from the module for the extraction of structured information and generation of reporting cues, the reproducibility of LLM outputs was assessed according to percent agreement. Meanwhile, the reproducibility of TN staging determination was evaluated via Fleiss’ kappa (supplementary material 5). For the percent agreement test, results were interpreted as follows: <70%, poor agreement; 70–79%, moderate agreement; 80–89%, good agreement; and 90–100%, excellent agreement. The kappa values of the Fleiss’ kappa consistency test were interpreted as follows: <0.20, poor agreement; 0.20–0.39, fair agreement; 0.40–0.59, moderate agreement; 0.60–0.79, good agreement; and 0.80–1.00, excellent agreement. Statistical significance was set at P<0.05.
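As an illustration of the paired accuracy comparisons described above, the McNemar test can be sketched in a few lines. The analyses in this study were run in SPSS and R; the text does not state whether the exact or the chi-square form of the test was used, so the exact binomial form shown here is an assumption, and the function names are hypothetical.

```python
from math import comb
from typing import Sequence


def discordant_counts(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> tuple[int, int]:
    """Count reports that only method A got right (b) and that only method B got right (c)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have equal length")
    b = sum(a and not other for a, other in zip(correct_a, correct_b))
    c = sum((not a) and other for a, other in zip(correct_a, correct_b))
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant counts.

    Under the null hypothesis, each discordant pair favors either method
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs carry information in this test: reports that both methods classified correctly, or both incorrectly, do not affect the P value.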

Results

Study cohort

A total of 465 reports were initially included. These reports were generated by 17 radiologists and consisted of two components: image descriptions and diagnostic impressions. In the module for structured information extraction and generation of reporting cues, 60 reports were randomly selected for prompt learning, resulting in 405 reports being ultimately included for analysis. In the module for TN staging assistance, 140 reports lacking clear descriptions of imaging features relevant to TN staging were excluded, with the remaining 325 reports retained for evaluation (median age 67.0 years, interquartile range: 58.5–75.5 years) (Table 1). In the data mining module, 53 reports with absent or unclear descriptions of tumor location were excluded, resulting in 412 reports ultimately being included. Figure 3 presents the flowchart of participant inclusion.

Table 1

Demographic and clinical characteristics of patients examined for TN staging

Characteristic	Value
Sex
Male	202 (62.1)
Female	123 (37.8)
Age (years)	67.0 [58.5, 75.5]
T stage in reports
1	2 (0.6)
2	40 (12.3)
3	202 (62.1)
3–4a	2 (0.6)
4	79 (24.3)
N stage in reports
0	49 (15.0)
1	119 (36.6)
2	156 (48.0)
3	1 (0.3)

Data are expressed as median [first and third quartile] or n (%). TN, tumor-node.

Figure 3 Flowchart of participant inclusion. MRI, magnetic resonance imaging; TN, tumor-node.

Structured information extraction and generation of reporting cues

In the comparative analysis of LLMs from the same developers, DSR1-671B outperformed both DSR1-7B and DSR1-32B in accuracy, achieving an accuracy of 95.4%, compared to 67.1% (P<0.01) and 86.4% (P<0.01), respectively (supplementary material 6). In contrast, no statistically significant differences were observed in accuracy between various versions of Wenxinyiyan (WXYY-4: 88.2%; WXYY-4T: 91.8%; P=0.12; supplementary material 6) or between the GPT models (GPT-4: 92.1%; GPT-4o: 94.3%; P=0.21; supplementary material 6).

The optimized prompt and five LLMs, including DSR1-671B, WXYY-4, WXYY-4T, GPT-4, and GPT-4o, were selected for the evaluation on 405 reports. The results of multiple comparisons are documented in supplementary material 7, and the overall performance evaluation is summarized in Table 2.

Table 2

Performance of large language models in extraction of structured information and generation of reporting cues

Feature	GPT-4	GPT-4o	WXYY-4	WXYY-4T	DSR1-671B	Radiologists	P value
Overall	95.8 (95.3–96.3)	97.5 (97.0–97.9)	93.0 (92.3–93.7)	93.6 (92.9–94.2)	98.9 (98.6–99.1)	Ref.	<0.001
Tumor location	98.0 (96.1–99.1)	98.3 (96.5–99.3)	96.0 (93.7–97.7)	98.8 (97.1–99.6)	99.0 (97.5–99.7)	Ref.	0.002
DIS	97.8 (95.8–99.0)	98.0 (96.1–99.1)	98.8 (97.1–99.6)	97.8 (95.8–99.0)	99.0 (97.5–99.7)	Ref.	0.24
Tumor size	97.8 (95.8–99.0)	98.5 (96.8–99.5)	97.0 (94.9–98.5)	97.3 (95.2–98.6)	99.3 (97.9–99.8)	Ref.	0.02
Circumferential extent	96.5 (94.3–98.1)	98.5 (96.8–99.5)	99.0 (97.5–99.7)	98.5 (96.8–99.5)	99.3 (97.9–99.8)	Ref.	0.018
Longitudinal extent	96.3 (94.0–97.9)	97.5 (95.5–98.8)	96.8 (94.6–98.3)	96.8 (94.6–98.3)	98.8 (97.1–99.6)	Ref.	0.001
Depth of invasion	95.1 (92.5–97.0)	96.3 (94.0–97.9)	94.8 (92.2–96.8)	95.8 (93.4–97.5)	98.0 (96.1–99.1)	Ref.	0.021
Anterior peritoneal reflection	98.3 (96.5–99.3)	99.3 (97.9–99.8)	98.8 (97.1–99.6)	98.3 (96.5–99.3)	100.0 (99.1–100.0)	Ref.	0.027
Lateral lymph nodes	91.1 (87.9–93.7)	94.3 (91.6–96.4)	74.6 (70.0–78.7)	88.4 (84.9–91.3)	97.8 (95.8–99.0)	Ref.	<0.001
Mesorectal lymph nodes	90.0 (87.6–93.5)	95.6 (93.1–97.3)	81.2 (77.1–84.9)	87.9 (84.3–90.9)	97.8 (95.8–99.0)	Ref.	<0.001
MRF or CRM	98.8 (97.1–99.6)	99.8 (98.6–100.0)	98.0 (96.1–99.1)	99.5 (98.2–99.9)	100.0 (99.1–100.0)	Ref.	0.005
EMVI	99.5 (98.2–99.9)	99.8 (98.6–100.0)	99.8 (98.6–100.0)	99.8 (98.6–100.0)	100.0 (99.1–100.0)	Ref.	0.736
Anal sphincter complex	92.8 (89.9–95.2)	96.8 (94.6–98.3)	92.1 (89.0–94.5)	90.6 (87.3–93.3)	98.5 (96.8–99.5)	Ref.	<0.001
Adjacent organs or structures	95.8 (93.4–97.5)	96.8 (94.6–98.3)	91.1 (87.9–93.7)	84.0 (80.0–87.4)	98.5 (96.8–99.5)	Ref.	<0.001
Reminder	92.8 (89.9–95.2)	95.1 (92.5–97.0)	84.4 (80.5–87.8)	77.3 (72.9–81.3)	98.5 (96.8–99.5)	Ref.	<0.001
Processing time^†	33.1 (32.8–33.3)	33.0 (32.8–33.2)	22.5 (22.3–22.7)	22.4 (22.1–22.6)	84.6 (84.3–85.0)	132.1 (128.9–135.3)	<0.001

Unless otherwise stated, the data are percentages, with 95% CIs in parentheses. ^†, data are the average processing time per report, measured in seconds. Accuracy comparisons were analyzed with the Cochran Q test, and processing time differences were assessed through one-way repeated-measures analysis of variance. CI, confidence interval; CRM, circumferential resection margin; DIS, distance from the distal tumor boundary to the anal verge; DSR1-671B, DeepSeek-R1-671B; EMVI, extramural vascular invasion; MRF, mesorectal fascia; Ref., reference; WXYY-4, Wenxinyiyan-4; WXYY-4T, Wenxinyiyan-4Turbo.

DSR1-671B demonstrated the highest overall accuracy (98.9%; P<0.01), with an accuracy exceeding 97% for each individual feature. In the reminder category, the difference in accuracy between DSR1-671B (98.5%) and GPT-4o (95.1%) was not statistically significant (P=0.74); however, DSR1-671B significantly outperformed the other three models: GPT-4 (92.8%; P=0.03), WXYY-4 (84.4%; P<0.01), and WXYY-4T (77.3%; P<0.01).

Additionally, with the optimized prompt, all five LLMs achieved accuracy rates above 98% for both MRF/CRM and EMVI features. Compared to manual extraction by radiologists, all LLMs required significantly less processing time per report (P<0.01), with DSR1-671B requiring the longest average processing time among the five LLMs (P<0.01).

According to the distribution of partially correct reports across multifield features (supplementary material 1), only lateral lymph nodes included a considerable number of partially correct reports; in contrast, for anal sphincter complex and adjacent organs or structures, there were very few or no such results. In the confusion matrix analysis of the key binary features (CRM or MRF and EMVI), DSR1-671B achieved 100% sensitivity and specificity for both features (supplementary material 2). Detailed information on confabulation for this research module is provided in supplementary material 8.

TN staging support

Interobserver agreement between the two radiologists for T staging and N staging assessment yielded Cohen kappa values of 0.955 and 0.973, respectively, indicating statistically significant consistency (P<0.01). The accuracy of TN staging from DSR1-671B, evaluated with a five-point Likert scale, varied between the two prompting strategies (Table 3). The in-context knowledge strategy yielded a higher median score compared to the default knowledge strategy (5.0 vs. 4.0; P<0.01). The in-context knowledge strategy also yielded a higher accuracy compared to the default knowledge strategy for T staging (86.5% vs. 72.3%; P<0.01), T substaging (86.5% vs. 72.3%; P<0.01), and N substaging (95.0% vs. 77.2%; P<0.01) (Figure 4). Significant differences in the number of events, number of confabulations, and confabulation rates between the two prompting strategies were identified by the paired-samples Wilcoxon test (P<0.01), although their medians were equal (Table 3). Accordingly, mean values were used to quantify these differences. When generating TN staging outputs, DSR1-671B coupled with the in-context knowledge strategy incorporated significantly more staging criteria than when coupled with the default knowledge strategy, with mean values of 3.3 and 2.8, respectively. Both strategies produced minimal confabulation, with a mean value of 0.1 for each. The default knowledge strategy produced a higher confabulation rate than did the in-context knowledge strategy, with mean rates of 4.1% and 2.9%, respectively. Detailed information on confabulation for this research module is documented in supplementary material 8. However, the default knowledge strategy showed significantly faster processing times than did the in-context knowledge strategy, with a median time per report of 15.0 and 22.0 s, respectively (P<0.01) (Table 3).

Table 3

Comparison of TN staging between the two prompt strategies

Variable	Default	In-context	Median difference (default vs. in-context)	Z value	P value
5-point Likert score	4.0 (2.0, 4.0)	5.0 (5.0, 5.0)	−1.0	−11.696	<0.01
Number of events	3.0 (2.0, 3.0)	3.0 (3.0, 4.0)	0.0	−8.653	<0.01
Number of confabulations	0.0 (0.0, 0.0)	0.0 (0.0, 0.0)	0.0	0.555	<0.01
Confabulation rate (%)	0.0 (0.0, 0.0)	0.0 (0.0, 0.0)	0.0	2.286	<0.01
Processing time (s)	15.0 (12.0, 18.0)	22.0 (16.0, 34.0)	−7.0	−10.592	<0.01

Data are presented as paired median (P25, P75) unless otherwise indicated. All comparisons were performed with the paired-samples Wilcoxon test. Default, default knowledge strategy; In-context, in-context knowledge strategy; P25, 25th percentile; P75, 75th percentile; TN, tumor-node.

Figure 4 Accuracy of TN staging between the two prompt strategies. Statistical analyses were performed via the McNemar test. Default, default knowledge strategy; In-context, in-context knowledge strategy; TN, tumor-node.

Among the 325 evaluated reports, 58 (17.8%) contained TN staging errors. For these erroneous cases, the in-context knowledge strategy showed superior corrective performance, successfully rectifying 54 reports (93.1%), while the default knowledge strategy rectified 28 reports (48.3%) (P<0.01).

After initial evaluation, 57 reports scored below 5 points under the in-context knowledge strategy. A radiologist subsequently analyzed the TN staging reasoning generated by DSR1-671B alongside the original imaging descriptions for these cases, identified limitations in the prompt strategy, and developed an optimized version (Appendix 5). Reevaluation conducted with the optimized prompt strategy demonstrated significant improvement in TN staging scores, with an increase in the median score from 2.0 to 5.0 (P<0.01).

Reproducibility of LLMs

In the module of structured information extraction and generation of reporting cues, the reproducibility rates of GPT-4, GPT-4o, WXYY-4, WXYY-4T and DSR1-671B reached 95.0%, 94.8%, 90.0%, 95.0% and 97.1%, respectively. The Fleiss’ kappa values for T staging and N staging were 0.960 and 0.932 (both P<0.01). Detailed data are summarized in supplementary material 5. The DSR1-671B model showed good reproducibility across all tasks, with the lowest agreement occurring in N staging reasoning (Fleiss’ kappa value =0.932).
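For readers unfamiliar with the statistic, Fleiss' kappa for repeated LLM runs can be sketched as below, together with the interpretation bands given in the Statistical analysis section. This is a minimal illustration rather than the analysis code used in the study; the function names and the toy ratings are hypothetical, with each row holding the stage labels returned for one report across three repeated runs.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for N subjects, each rated the same number of times."""
    n_subjects = len(ratings)
    if n_subjects == 0:
        raise ValueError("at least one subject is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    # Observed agreement: mean proportion of agreeing rating pairs per subject.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_subjects
    # Expected agreement from the overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


def interpret_kappa(kappa: float) -> str:
    """Agreement bands as listed in the Statistical analysis section."""
    if kappa < 0.20:
        return "poor"
    if kappa < 0.40:
        return "fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "good"
    return "excellent"


# Hypothetical N-stage outputs from three repeated runs on four reports.
toy_runs = [
    ["N1", "N1", "N1"],
    ["N2", "N2", "N2"],
    ["N0", "N0", "N1"],
    ["N2", "N2", "N2"],
]
toy_kappa = fleiss_kappa(toy_runs)
```

In the toy data, a single disagreement in one of four reports is enough to bring kappa well below 1, because the statistic corrects the observed agreement for the agreement expected from the category frequencies alone.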

Data mining of MRI reports

Analysis of 412 reports revealed the following distribution of rectal cancer locations: upper rectum, 24.2% (100/412); middle-upper rectum, 33.2% (137/412); middle rectum, 8.0% (33/412); middle-lower rectum, 15.7% (65/412); lower rectum, 15.5% (64/412); and entire rectum, 3.1% (13/412). In this module, the accuracy for tumor location, EMVI, and MRF or CRM in the included reports reached 100% [95% confidence interval (CI): 99.1–100.0%]. The accuracy of T staging, N staging, adjacent organ invasion, and lymph node metastasis was 98.1% (95% CI: 95.8–99.2%), 99.4% (95% CI: 97.6–99.9%), 99.0% (95% CI: 97.3–99.7%), and 99.3% (95% CI: 97.7–99.8%), respectively. All errors were corrected before module analysis, and all results from this module were descriptive and hypothesis-generating.

For this module, we first examined the correlation between lesion location and EMVI or CRM/MRF status. After 19 reports lacking EMVI documentation were excluded, 393 reports were analyzed for EMVI correlation. The EMVI-positive rate was highest in middle-upper rectal cancer (74.6%; 95% CI: 67.0–82.2%) and lowest in lower rectal cancer (42.4%; 95% CI: 29.4–55.4%) (Figure 5A). After 27 reports without CRM/MRF documentation were excluded, 385 reports were included in the CRM/MRF analysis. The CRM/MRF-positive rate was highest in entire rectal cancer (72.7%; 95% CI: 39.0–94.0%) and lowest in middle rectal cancer (29.0%; 95% CI: 12.1–46.0%) (Figure 5B). Cancers involving multiple rectal segments (middle-upper, middle-lower, and entire rectum) showed relatively high positivity rates for both CRM/MRF and EMVI.

Figure 5 Visualization of the involvement of EMVI and MRF/CRM by rectal cancer location. (A) Proportion of EMVI involvement: upper rectum, 64.3% (63/98); middle and upper rectum, 74.6% (97/130); middle rectum, 54.8% (17/31); middle and lower rectum, 65.6% (42/64); lower rectum, 42.4% (25/59); and entire rectum, 72.7% (8/11). (B) Proportion of MRF/CRM involvement: upper rectum, 51.1% (48/94); middle and upper rectum, 57.1% (72/126); middle rectum, 29.0% (9/31); middle and lower rectum, 65.1% (41/63); lower rectum, 43.3% (26/60); and entire rectum, 72.7% (8/11). CRM, circumferential resection margin; EMVI, extramural vascular invasion; MRF, mesorectal fascia; Neg, negative; Pos, positive.
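The positive rates reported for each location can be recomputed directly from the counts given in the Figure 5 caption. The short sketch below uses only those counts; the variable and function names are hypothetical, and the confidence intervals are not reproduced because the interval method is not stated in the text.

```python
# (positive, total) counts by lesion location, as listed in the Figure 5 caption.
emvi_counts = {
    "upper": (63, 98),
    "middle-upper": (97, 130),
    "middle": (17, 31),
    "middle-lower": (42, 64),
    "lower": (25, 59),
    "entire": (8, 11),
}
mrf_crm_counts = {
    "upper": (48, 94),
    "middle-upper": (72, 126),
    "middle": (9, 31),
    "middle-lower": (41, 63),
    "lower": (26, 60),
    "entire": (8, 11),
}


def positive_rates(counts: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Percentage of positive reports per location, rounded to one decimal place."""
    return {loc: round(100 * pos / total, 1) for loc, (pos, total) in counts.items()}


emvi_rates = positive_rates(emvi_counts)
mrf_crm_rates = positive_rates(mrf_crm_counts)

# Locations with the highest and lowest positive rates for each feature.
emvi_highest = max(emvi_rates, key=emvi_rates.get)
emvi_lowest = min(emvi_rates, key=emvi_rates.get)
mrf_crm_highest = max(mrf_crm_rates, key=mrf_crm_rates.get)
mrf_crm_lowest = min(mrf_crm_rates, key=mrf_crm_rates.get)

# Denominators should add up to the number of reports analyzed for each feature.
emvi_total = sum(total for _, total in emvi_counts.values())
mrf_crm_total = sum(total for _, total in mrf_crm_counts.values())
```

The denominators sum to the 393 and 385 reports analyzed for EMVI and CRM/MRF, respectively, and the highest and lowest rates fall in the same locations as described in the text.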

Second, we analyzed the correlation between lesion location and TN staging. After 98 reports with incomplete or ambiguously documented T-stage or N-stage classifications were excluded, 314 reports were analyzed for TN staging correlation. For entire rectal cancer, the highest proportion among T stages was T4 (57.1%; 95% CI: 22.5–87.1%), while for the other segments of rectal cancer, T3 accounted for the highest proportion (50.0–68.0%). For lower rectal cancer, the highest proportion among N stages was N1 (51.0%, 95% CI: 36.7–65.2%), while for the other segments of rectal cancer, N2 accounted for the highest proportion (39.4–85.7%) (Figure 6).

Figure 6 Visualization of the correlation between rectal cancer location and TN staging. (A) The correlation between lesion location and T stage. (B) The correlation between lesion location and N stage. Data are presented as % (n/N). TN, tumor-node.

Third, we determined the correlation between lesion location and adjacent organ invasion or lymph node metastasis, with all 412 reports being included. The proportion of lateral lymph node metastasis was relatively high in lower rectal cancer (39.1%; 95% CI: 27.8–51.6%) and middle-lower rectal cancer (38.5%; 95% CI: 27.3–51.0%). In rectal cancer involving multiple rectal segments, the proportions of muscle invasion (external anal sphincter and pelvic floor musculature) and anterior peritoneal reflection invasion were relatively high. The proportion of muscle invasion was 50.0% (95% CI: 34.9–65.1%) for middle-lower rectal cancer, 33.3% (95% CI: 7.5–70.1%) for entire rectal cancer, and 28.2% (95% CI: 16.4–43.5%) for lower rectal cancer. The proportions of anterior peritoneal reflection invasion were as follows: entire rectal cancer, 44.4% (95% CI: 16.9–74.9%); middle-upper rectal cancer, 29.0% (95% CI: 22.0–37.6%); and upper rectal cancer, 25.0% (95% CI: 17.7–34.1%) (Figure 7).

Figure 7 Chord diagram of the relationship between rectal cancer localization, adjacent organ invasion, and lymph node metastasis.

Discussion

DSR1-671B demonstrated superior overall accuracy (98.9%), significantly outperforming the other six LLMs, and achieved 98.5% accuracy in generating reporting cues. In TN staging inference, the in-context knowledge prompt strategy yielded higher accuracy than did the default knowledge strategy and invoked more TN staging criteria in its reasoning. Both prompt strategies resulted in low confabulation rates, with a median rate of 0 and a mean rate of less than 5%. The descriptive hypothesis-generating results of data mining analysis stratified by rectal cancer lesion location revealed that tumors involving multiple rectal segments exhibited higher positive rates for both CRM/MRF and EMVI. Additionally, cancers located in the lower and middle-lower rectum showed high rates of lateral lymph node metastasis.

Since the introduction of the Transformer for natural language processing (NLP) in 2017, LLMs have shown remarkable progress across a diversity of NLP tasks. However, additional training and fine-tuning remain crucial to ensuring their efficacy in domain-specific applications (23,24). Prompt learning has emerged as a novel NLP paradigm that, alongside traditional fine-tuning, constitutes a core strategy for adapting LLMs to targeted tasks. Unlike fine-tuning, which often requires extensive annotated datasets and computational resources, prompt learning achieves competitive performance via carefully designed textual prompts. As a resource-efficient approach with lower technical barriers, it offers distinct advantages in computationally constrained settings, with empirical studies confirming its ability to match or exceed fine-tuning in certain NLP tasks (25,26).

We observed inconsistencies in the structured extraction of CRM, MRF, and EMVI features, indicating that LLMs encounter challenges in distinguishing semantically similar concepts. Although semantics-guided learning methods for few-shot relation extraction, such as that proposed by Wu et al. (27), can improve discrimination between such objects, MRI reports typically document only one of either MRF or CRM. To address this, we combined both features into a single entity, MRF/CRM, via prompt learning. Subsequent validation across 405 MRI reports indicated an accuracy exceeding 98% for all three features across all evaluated LLMs.
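The authors performed this merge at the prompt level. Purely as an illustration of the resulting data structure, the same rule can be written as a post-processing step on an extracted record. The field names, the value strings, and the `"conflict"` flag used when both features are documented but disagree are all hypothetical and are not taken from the study.

```python
def merge_mrf_crm(extracted: dict) -> dict:
    """Collapse separate 'MRF' and 'CRM' fields into a single 'MRF/CRM' field.

    Hypothetical rule: use whichever feature is documented; if both are
    documented and disagree, flag the record for human review.
    """
    merged = {k: v for k, v in extracted.items() if k not in ("MRF", "CRM")}
    documented = [extracted[k] for k in ("MRF", "CRM") if extracted.get(k) is not None]
    if not documented:
        merged["MRF/CRM"] = None  # neither feature was documented in the report
    elif len(set(documented)) == 1:
        merged["MRF/CRM"] = documented[0]
    else:
        merged["MRF/CRM"] = "conflict"  # hypothetical flag for manual checking
    return merged


print(merge_mrf_crm({"EMVI": "positive", "CRM": "negative"}))
```

Treating the two features as one field removes the need for the model to decide which of the two near-synonymous labels a report intended.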

When the reminder task was initially framed as a four-step sequence in the prompt, all models exhibited low accuracy, with the top performer reaching only 80%. Chowdhery et al. (28) noted that LLMs often underperform in multistep tasks but found that chain-of-thought prompting can effectively mitigate this limitation and even surpass fine-tuning in certain scenarios. By decomposing the reminder task into three discrete subtasks through prompt modification, we observed substantial accuracy improvements across most LLMs, with the highest accuracy reaching 98.5% on the same 405 reports.
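The difference between a single multistep prompt and a decomposed prompt can be sketched as follows. This is a generic illustration, not the authors' prompts: `ask_llm` stands for any callable that sends text to a model, `fake_llm` is a stub so the example runs offline, and the step and subtask strings are placeholders.

```python
from typing import Callable, List


def run_single_prompt(report: str, steps: List[str], ask_llm: Callable[[str], str]) -> str:
    """Send every step to the model in one multistep prompt."""
    numbered = "\n".join(f"Step {i}: {s}" for i, s in enumerate(steps, 1))
    return ask_llm(f"{numbered}\n\nReport:\n{report}")


def run_decomposed(report: str, subtasks: List[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Send each subtask as its own prompt and collect the answers in order."""
    return [ask_llm(f"{task}\n\nReport:\n{report}") for task in subtasks]


def fake_llm(prompt: str) -> str:
    """Stub model (assumption): reports the prompt length instead of calling an API."""
    return f"received {len(prompt)} characters"


report_text = "<free-text MRI report>"
placeholder_subtasks = ["<subtask 1 instruction>", "<subtask 2 instruction>", "<subtask 3 instruction>"]
print(run_decomposed(report_text, placeholder_subtasks, fake_llm))
```

Each decomposed call carries one instruction and the full report, so an error in one subtask cannot propagate into the others within the same response.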

In TN staging determination, many low-scoring reports (below 5 points) involved lower rectal cancer assessments by junior radiologists. These reports frequently misdescribed tumor invasion of the rectal serosal layer despite the anatomical absence of a serosal layer in the lower rectum. After prompt optimization, a marked improvement was noted in these cases, as the median score rose from 2.0 to 5.0 (P<0.01). Thus, through systematic prompt learning, LLMs achieved near-optimal performance across multiple NLP tasks involving MRI reports of rectal cancer, aligning with findings from earlier studies (22,29,30).

Several studies have examined the application of traditional language models to the interpretation of MRI reports of rectal cancer; however, the potential of generative LLMs in this domain remains largely unexamined. Our study comprehensively evaluated the multitask performance of LLMs from multiple developers in applications related to rectal cancer MRI. Liu et al. (31) developed a rule-based pattern-matching NLP system to extract key imaging features from free-text rectal cancer MRI reports, achieving an accuracy of 93.2%, which is lower than the 98.9% observed in our study. More recently, Chizhikova et al. (32) constructed a robustly optimized BERT pretraining approach (RoBERTa)-based TNM staging system for MRI reports of rectal cancer; however, its accuracy for both T staging (84%) and N staging (89%) was lower than that of the in-context knowledge strategy in our study, which achieved 86.5% accuracy for T substaging and 95.0% accuracy for N substaging.

Although LLMs have good application potential in healthcare, concerns regarding accuracy, dataset bias, privacy, and ethical considerations have limited their widespread adoption in real-world clinical practice (7,33). The recent onsite deployment of DeepSeek across numerous medical institutions in China may represent a breakthrough in this regard, yet the scarcity of rigorous clinical validation studies has led to doubts concerning its practical utility (11). Emerging evidence suggests that DeepSeek is competitive with other LLMs in performing clinical tasks. Tordjman et al. (14) recently reported that DeepSeek-R1 outperformed GPT-4o in clinical reasoning, achieving higher mean scores on a five-point Likert scale (3.61 vs. 3.22; P=0.005), which is consistent with our findings. Sandmann et al. (13) translated 125 clinical cases from German into English for evaluation across multiple LLMs and found that DeepSeek-R1 and GPT-4o performed similarly (P=0.309), with both significantly surpassing Gemini 2 (P=5.73×10^–5). Given DSR1-671B’s exceptional performance in structured information extraction and generation of reporting cues, we decided to examine this model in terms of prompt learning in TN staging inference.

In our TN staging study, the in-context knowledge prompt strategy outperformed the default knowledge strategy, achieving higher median accuracy scores (5.0 vs. 4.0; P<0.01) and providing more comprehensive reasoning foundations (mean event counts: 3.3 vs. 2.8), a finding consistent with a previous study (22). The corrective capability of the in-context knowledge strategy was notably superior: it rectified 54 of 58 (93.1%) erroneous TN staging entries in diagnostic impressions, whereas the default knowledge strategy corrected 28 of 58 (48.3%) (P<0.01). Despite these advantages, the in-context knowledge strategy, like the default knowledge strategy, exhibited hallucination phenomena, with mean confabulation rates of 2.9% and 4.1%, respectively. This underscores the necessity of human supervision in real-world settings.
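The size of the gap in corrected entries (54 of 58 vs. 28 of 58) can be sanity-checked with a simple 2×2 calculation. The sketch below uses an unpaired Pearson chi-square test without continuity correction, which is an assumption and not necessarily the authors' test: both strategies were applied to the same 58 entries, so a paired test such as McNemar's would be more appropriate, but the discordant-pair counts it requires are not reported here. The function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for
    the table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: in-context vs. default strategy; columns: corrected vs. not corrected.
chi2, p = chi_square_2x2(54, 4, 28, 30)
print(f"chi-square = {chi2:.2f}, P = {p:.2g}")
```

Under these assumptions the P value falls well below 0.01, which is in line with the reported significance level.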

In the descriptive analysis that used data mining to generate hypotheses from rectal cancer MRI reports, we observed that tumors located in the lower and middle-lower rectum had a higher propensity for lateral lymph node metastasis. This finding aligns with previous studies (34,35), which identified lower rectal involvement as a significant risk factor for lateral lymph node metastasis.

Our study has several limitations. First, while the inclusion of free-text reports from 17 radiologists enhances the generalizability of our findings, the single-center, cross-sectional design means that further multicenter validation is needed. Second, our analysis relied exclusively on textual reports, which may introduce radiologists’ interpretation bias. The reference standard was radiologists’ consensus based on report texts, and review of the original MRI images, pathology, and multidisciplinary outcomes was lacking. Third, the study did not conduct a bilingual evaluation in both Chinese and English; instead, following previous similar studies (13), we translated the reports into English and used English as both the input and output language of the LLMs. Furthermore, one study demonstrated that the differential diagnostic performance of GPT-4o in a comprehensive corpus of rare-disease cases was largely consistent across Chinese and English contexts (36). Another study evaluating the performance of LLMs in health communication for patients with atherosclerosis showed that DeepSeek-R1 and GPT-4o exhibited no significant differences between Chinese and English in good response rates and accuracy (37). Fourth, despite our efforts to standardize the access environment, the use of browser-accessed LLMs without fixed model snapshots or controlled inference parameters constitutes a limitation that could affect reproducibility over time due to potential model updates. Finally, the data mining component remains exploratory, and all results from this module are descriptive and hypothesis-generating; future work will focus on correlating the changes in MRI features during neoadjuvant therapy with long-term survival outcomes in patients with rectal cancer.

Conclusions

In an English-language environment, the prompt learning-enhanced DSR1-671B demonstrated the ability to effectively extract structured information from MRI reports of rectal cancer, generate reporting cues, and apply TN staging reasoning based on the reports. Nevertheless, for clinical applications, human oversight remains essential for identifying and correcting errors.

Acknowledgments

None.

Footnote

Reporting Checklist: The authors have completed the STARD-AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2441/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2441/dss

Funding: This work was supported by the Guangdong Medical Science and Technology Research Fund (Nos. B2025410, A2023204, A2024582 and A2024338), the Clinical Research Start Plan of Shunde Hospital of Southern Medical University (Nos. CRSP2022005, SRSP2024031 and SRSP2021021), the Special Projects in Key Fields of Ordinary Universities in Guangdong Province (No. 2024ZDZX2029), the Characteristic Innovation Projects of Ordinary Universities in Guangdong Province (No. 2023KTSCX022), and the Guangdong Province Basic and Applied Basic Research Fund Project, Guangdong Foshan Project (No. 2024A1515140150).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2441/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This retrospective study was approved by the Institutional Ethics Committee of The Eighth Affiliated Hospital of Southern Medical University (approval No. KYLS20231054). The requirement for written informed consent was waived due to the retrospective nature of the study, and all patient data were anonymized prior to analysis.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

1. Paragomi P, Zhang Z, Abe SK, Islam MR, Rahman MS, Saito E, et al. Body Mass Index and Risk of Colorectal Cancer Incidence and Mortality in Asia. JAMA Netw Open 2024;7:e2429494.
2. Nougaret S, Reinhold C, Mikhael HW, Rouanet P, Bibeau F, Brown G. The use of MR imaging in treatment planning for patients with rectal carcinoma: have you checked the "DISTANCE"? Radiology 2013;268:330-44.
3. Nougaret S, Gormly K, Lambregts DMJ, Reinhold C, Goh V, Korngold E, Denost Q, Brown G. MRI of the Rectum: A Decade into DISTANCE, Moving to DISTANCED. Radiology 2025;314:e232838.
4. El Khababi N, Beets-Tan RG, Curvo-Semedo L, Tissier R, Nederend J, Lahaye MJ, Maas M, Beets GL, Lambregts DM. Pearls and pitfalls of structured staging and reporting of rectal cancer on MRI: an international multireader study. Br J Radiol 2023;96:20230091.
5. Fernandes MC, Gollub MJ, Brown G. The importance of MRI for rectal cancer evaluation. Surg Oncol 2022;43:101739.
6. Nougaret S, Jhaveri K, Kassam Z, Lall C, Kim DH. Rectal cancer MR staging: pearls and pitfalls at baseline examination. Abdom Radiol (NY) 2019;44:3536-48.
7. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review. Ann Intern Med 2024;177:210-20.
8. Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, et al. The application of large language models in medicine: A scoping review. iScience 2024;27:109713.
9. Park YJ, Pillai A, Deng J, Guo E, Gupta M, Paget M, Naugler C. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024;24:72.
10. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025;333:319-28.
11. Zeng D, Qin Y, Sheng B, Wong TY. DeepSeek's "Low-Cost" Adoption Across China's Hospital Systems: Too Fast, Too Soon? JAMA 2025;333:1866-9.
12. Guo D, Yang D, Zhang H, Song J, Wang P, Zhu Q, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 2025;645:633-8.
13. Sandmann S, Hegselmann S, Fujarski M, Bickmann L, Wild B, Eils R, Varghese J. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat Med 2025;31:2546-9.
14. Tordjman M, Liu Z, Yuce M, Fauveau V, Mei Y, Hadjadj J, et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat Med 2025;31:2550-5.
15. OpenAI, GPT-4 [Large language model]. [Accessed March 2-20, 2025]. Available online: https://chat.openai.com/
16. OpenAI, GPT-4o [Large language model]. [Accessed March 2-20, 2025]. Available online: https://chat.openai.com/
17. Baidu, Wenxinyiyan-4 [Large language model]. [Accessed March 2-20, 2025]. Available online: https://yiyan.baidu.com/
18. Baidu, Wenxinyiyan-4Turbo [Large language model]. [Accessed March 2-20, 2025]. Available online: https://yiyan.baidu.com/
19. DeepSeek, DeepSeek-R1-7B [Large language model]. Accessed via volcengine. [Accessed March 2-20, 2025]. Available online: https://console.volcengine.com/
20. DeepSeek, DeepSeek-R1-32B [Large language model]. Accessed via volcengine. [Accessed March 2-20, 2025]. Available online: https://console.volcengine.com/
21. DeepSeek, DeepSeek-R1-671B [Large language model]. Accessed via volcengine. [Accessed March 2-20, 2025]. Available online: https://console.volcengine.com/
22. Bhayana R, Nanda B, Dehkharghanian T, Deng Y, Bhambra N, Elias G, Datta D, Kambadakone A, Shwaartz CG, Moulton CA, Henault D, Gallinger S, Krishna S. Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer. Radiology 2024;311:e233117.
23. Zhang H, Shafiq MO. Survey of transformers and towards ensemble learning using transformers for natural language processing. J Big Data 2024;11:25.
24. Raza M, Jahangir Z, Riaz MB, Saeed MJ, Sattar MA. Industrial applications of large language models. Sci Rep 2025;15:13755.
25. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys 2023;55:1-35.
26. Taylor N, Zhang Y, Joyce DW, Gao Z, Kormilitzin A, Nevado-Holgado A. Clinical Prompt Learning With Frozen Language Models. IEEE Trans Neural Netw Learn Syst 2024;35:16453-63.
27. Wu H, He Y, Chen Y, Bai Y, Shi X. Improving few-shot relation extraction through semantics-guided learning. Neural Netw 2024;169:453-61.
28. Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, et al. PaLM: Scaling language modeling with pathways. J Mach Learn Res 2023;24:1-113.
29. Fink MA, Bischoff A, Fink CA, Moll M, Kroschke J, Dulz L, Heußel CP, Kauczor HU, Weber TF. Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer. Radiology 2023;308:e231362.
30. Lehnen NC, Dorn F, Wiest IC, Zimmermann H, Radbruch A, Kather JN, Paech D. Data Extraction from Free-Text Reports on Mechanical Thrombectomy in Acute Ischemic Stroke Using ChatGPT: A Retrospective Analysis. Radiology 2024;311:e232741.
31. Liu W, Cai L, Li Y. Application of natural language processing to post-structuring of rectal cancer MRI reports. Clin Radiol 2024;79:e204-10.
32. Chizhikova M, López-Úbeda P, Martín-Noguerol T, Díaz-Galiano MC, Ureña-López LA, Luna A, Martín-Valdivia MT. Automatic TNM staging of colorectal cancer radiology reports using pre-trained language models. Comput Methods Programs Biomed 2025;259:108515.
33. Gao Y, Baptista-Hon DT, Zhang K. The inevitable transformation of medicine and research by large language models: the possibilities and pitfalls. MEDCOMM-Future Medicine 2023;2:e49.
34. Zeng DX, Yang Z, Tan L, Ran MN, Liu ZL, Xiao JW. Risk factors for lateral pelvic lymph node metastasis in patients with lower rectal cancer: a systematic review and meta-analysis. Front Oncol 2023;13:1219608.
35. Zhang L, Shi F, Hu C, Zhang Z, Liu J, Liu R, She J, Tang J. Development and External Validation of a Preoperative Nomogram for Predicting Lateral Pelvic Lymph Node Metastasis in Patients With Advanced Lower Rectal Cancer. Front Oncol 2022;12:930942.
36. Chimirri L, Caufield JH, Bridges Y, Matentzoglu N, Gargano M, Cazalla M, et al. Consistent performance of large language models in rare disease diagnosis across ten languages and 4917 cases. EBioMedicine 2025;121:105957.
37. Li P, Xu Y, Liu X, Shen Z, Wang Y, Lv X, Lu Z, Wu H, Zhuang J, Chen Y. Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis. JMIR Med Inform 2026;14:e81422.

Cite this article as: Xie F, Ouyang LZ, Guo BL, Huang XY, Liu ZW, Zhou LN, Pan JL, Wang LW, Chen M, Li YJ, Lin QQ, Chen XJ, Hu QG, Ouyang FS. Value of the DeepSeek-R1 large language model in extracting structured data from magnetic resonance imaging reports of rectal cancer and assisting in tumor staging. Quant Imaging Med Surg 2026;16(7):550. doi: 10.21037/qims-2025-aw-2441
